Sample Weight Adjustments

The objective of this tutorial is to familiarize ourselves with SampleWeight, the samplics class for adjusting sample weights. In practice, it is necessary to adjust the base or design sample weights obtained directly from the random selection mechanism. These adjustments correct for nonresponse, reduce the effect of extreme/large weights, better align the sample with known auxiliary information, and more. Specifically, in this tutorial we will:

  • compute the design (base) weights,

  • adjust the weights for nonresponse,

  • poststratify the weights to known control totals,

  • calibrate the weights using the GREG approach, and

  • normalize the weights to sum to given constants.

To run the code in this notebook, we will use the datasets that were developed in the previous tutorial on sample selection.

import numpy as np
import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

Design (base) weight

The design weight is the inverse of the overall probability of selection, which is the product of the first-stage (PSU) and second-stage (SSU) selection probabilities.

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * \
    full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

full_sample.head(15)
cluster region psu_prob household ssu_prob inclusion_prob design_weight
0 7 North 0.187726 72 0.115385 0.021661 46.166667
1 7 North 0.187726 73 0.115385 0.021661 46.166667
2 7 North 0.187726 75 0.115385 0.021661 46.166667
3 7 North 0.187726 715 0.115385 0.021661 46.166667
4 7 North 0.187726 722 0.115385 0.021661 46.166667
5 7 North 0.187726 724 0.115385 0.021661 46.166667
6 7 North 0.187726 755 0.115385 0.021661 46.166667
7 7 North 0.187726 761 0.115385 0.021661 46.166667
8 7 North 0.187726 764 0.115385 0.021661 46.166667
9 7 North 0.187726 782 0.115385 0.021661 46.166667
10 7 North 0.187726 795 0.115385 0.021661 46.166667
11 7 North 0.187726 7111 0.115385 0.021661 46.166667
12 7 North 0.187726 7112 0.115385 0.021661 46.166667
13 7 North 0.187726 7117 0.115385 0.021661 46.166667
14 7 North 0.187726 7123 0.115385 0.021661 46.166667

To illustrate the handling of nonresponse, we first need to introduce some household nonresponse into our example. That is, we simulate a response status and store it in the variable response_status. The variable response_status has four possible values: ineligible, which indicates that the sampling unit is not eligible for the survey; respondent, which indicates that the sampling unit responded to the survey; non-respondent, which indicates that the sampling unit did not respond to the survey; and unknown, which means that we are not able to infer the status of the sampling unit, i.e. we do not know whether or not the sampling unit is eligible for the survey.

np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"], 
    size=full_sample.shape[0], 
    p=(0.10, 0.70, 0.15, 0.05)
)

full_sample[
    ["cluster", "region", "design_weight", "response_status"]
    ].head(15)
cluster region design_weight response_status
0 7 North 46.166667 ineligible
1 7 North 46.166667 respondent
2 7 North 46.166667 respondent
3 7 North 46.166667 respondent
4 7 North 46.166667 unknown
5 7 North 46.166667 respondent
6 7 North 46.166667 respondent
7 7 North 46.166667 ineligible
8 7 North 46.166667 respondent
9 7 North 46.166667 respondent
10 7 North 46.166667 respondent
11 7 North 46.166667 non-respondent
12 7 North 46.166667 respondent
13 7 North 46.166667 ineligible
14 7 North 46.166667 respondent

Nonresponse adjustment

In general, the sample weights are adjusted to redistribute the weights of the eligible units that did not sufficiently respond (unit-level nonresponse) to the sampling units that sufficiently responded to the survey. This adjustment is done within adjustment classes or domains. Note that the determination of the response categories (unit response, item response, ineligible, etc.) is outside the scope of this tutorial.

Similarly, the weights of the sampling units with unknown eligibility are redistributed to the rest of the sampling units. In general, ineligible sampling units receive part of the weights of the units with unknown eligibility, since ineligible units can themselves be part of the unknown pool.

The method adjust() has a boolean parameter unknown_to_inelig which controls how the sample weights of the units with unknown eligibility are redistributed. By default, adjust() redistributes the weights of the unknowns to the ineligibles as well (unknown_to_inelig=True). If we do not wish to redistribute the weights of the unknowns to the ineligibles, we simply set the flag to False (unknown_to_inelig=False).

In the snippet of code below, we adjust the weights within clusters, that is, we use the clusters as our adjustment classes. Note that we run the nonresponse adjustment twice: the first time with unknown_to_inelig=True (nr_weight) and the second time with the flag set to False (nr_weight2). With unknown_to_inelig=True, the ineligible units receive part of the weights of the unknowns; hence, the sample weights of the respondents are smaller than when the flag is False. With unknown_to_inelig=False, the ineligible units do not receive any weight from the unknowns; hence, their sample weights remain the same before and after adjustment. In a real survey, the statistician decides on the best nonresponse strategy based on the available information. We verify these adjustment factors by hand just after the output below.

status_mapping = {
    "in": "ineligible", 
    "rr": "respondent", 
    "nr": "non-respondent", 
    "uk": "unknown"
    }

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
    unknown_to_inelig=False,
)

full_sample[[
    "cluster", 
    "region", 
    "design_weight", 
    "response_status", 
    "nr_weight", 
    "nr_weight2"
    ]].drop_duplicates().head(15)
cluster region design_weight response_status nr_weight nr_weight2
0 7 North 46.166667 ineligible 49.464286 46.166667
1 7 North 46.166667 respondent 54.410714 55.400000
4 7 North 46.166667 unknown 0.000000 0.000000
11 7 North 46.166667 non-respondent 0.000000 0.000000
15 10 North 50.783333 non-respondent 0.000000 0.000000
16 10 North 50.783333 respondent 70.733929 71.096667
19 10 North 50.783333 ineligible 54.410714 50.783333
21 10 North 50.783333 unknown 0.000000 0.000000
30 16 South 62.149123 respondent 66.588346 66.588346
35 16 South 62.149123 non-respondent 0.000000 0.000000
45 24 South 58.940741 respondent 63.852469 63.852469
47 24 South 58.940741 non-respondent 0.000000 0.000000
55 24 South 58.940741 ineligible 58.940741 58.940741
60 29 South 65.702778 unknown 0.000000 0.000000
61 29 South 65.702778 respondent 101.081197 102.204321
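
To see what adjust() is doing, we can reproduce the adjustment factors for cluster 7 by hand. From the output above, the cluster has 15 households, all with design weight 46.166667: 3 ineligible, 10 respondents, 1 non-respondent and 1 unknown. The sketch below uses counts rather than weight sums, which is valid here only because all design weights are equal within the cluster.

w = 46.166667  # design weight shared by the 15 households of cluster 7
n_in, n_rr, n_nr, n_uk = 3, 10, 1, 1

# Step 1: spread the weight of the unknowns over in + rr + nr
# (unknown_to_inelig=True, the default)
f_uk = (n_in + n_rr + n_nr + n_uk) / (n_in + n_rr + n_nr)  # 15/14

# Step 2: spread the weight of the non-respondents over the respondents
f_nr = (n_rr + n_nr) / n_rr  # 11/10

print(w * f_uk)         # ineligible: ~49.464286
print(w * f_uk * f_nr)  # respondent: ~54.410714

# With unknown_to_inelig=False, the unknowns are spread over rr + nr only,
# so the ineligible weight stays at 46.166667
f_uk2 = (n_rr + n_nr + n_uk) / (n_rr + n_nr)  # 12/11
print(w * f_uk2 * f_nr)  # respondent: ~55.400000

Both sets of values match the nr_weight and nr_weight2 columns above.
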
Important

The default call of adjust() expects standard codes for the response status, that is, "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.

In the call above, if we omit the parameter resp_dict, the run fails with an assertion error. The current error message is: "The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter". For the call to run without resp_dict, the response status must take values only in the standard codes, i.e. ("in", "rr", "nr", "uk"). The variable passed to resp_status can contain any codes, but a mapping via resp_dict is necessary whenever the response variable is not constructed using the standard codes.

To further illustrate the mapping of response status, let’s assume that we have response_status2 which has the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown.

response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"] == "non-respondent"] = 200
response_status2[full_sample["response_status"] == "respondent"] = 300
response_status2[full_sample["response_status"] == "unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])
response_status ineligible non-respondent respondent unknown
row_0
100 16 0 0 0
200 0 23 0 0
300 0 0 106 0
999 0 0 0 5

To use response_status2, we need to map the values 100, 200, 300 and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below using the Python dictionary status_mapping2. Using status_mapping2 in the call to adjust() leads to the same adjustment as in the previous run, i.e. nr_weight and nr_weight3 contain the same adjusted weights.

status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

full_sample["nr_weight3"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=response_status2,
    resp_dict=status_mapping2,
)

full_sample[
    ["cluster", "region", "response_status", "nr_weight", "nr_weight3"]
    ].drop_duplicates().head()
cluster region response_status nr_weight nr_weight3
0 7 North ineligible 49.464286 49.464286
1 7 North respondent 54.410714 54.410714
4 7 North unknown 0.000000 0.000000
11 7 North non-respondent 0.000000 0.000000
15 10 North non-respondent 0.000000 0.000000

If the response status variable only takes the values "in", "nr", "rr" and "uk", then it is not necessary to provide the mapping dictionary, i.e. resp_dict can be omitted from the call to adjust().

response_status3 = np.repeat("in", full_sample["response_status"].shape[0])
response_status3[full_sample["response_status"] == "non-respondent"] = "nr"
response_status3[full_sample["response_status"] == "respondent"] = "rr"
response_status3[full_sample["response_status"] == "unknown"] = "uk"

full_sample["nr_weight4"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=response_status3,
)

full_sample[
    ["cluster", "region", "response_status", "nr_weight", "nr_weight4"]
    ].drop_duplicates().head()
cluster region response_status nr_weight nr_weight4
0 7 North ineligible 49.464286 49.464286
1 7 North respondent 54.410714 54.410714
4 7 North unknown 0.000000 0.000000
11 7 North non-respondent 0.000000 0.000000
15 10 North non-respondent 0.000000 0.000000
# Just dropping a couple of variables 
# not needed for the rest of the tutorial
full_sample.drop(
    columns=[
        "psu_prob", 
        "ssu_prob", 
        "inclusion_prob", 
        "nr_weight2", 
        "nr_weight3", 
        "nr_weight4"
        ], 
    inplace=True
)

Poststratification

Poststratification is useful to compensate for under-representation of the sample or to correct for nonsampling errors. The most common poststratification method consists of adjusting the sample weights, within adjustment classes (domains), so that they sum to known control values from reliable sources. Poststratification classes can be formed using variables beyond the ones involved in the sampling design. For example, socio-economic variables such as age group, gender, race and education are often used to form poststratification classes/cells.

Warning

Poststratifying to totals that are out of date, and thus likely inaccurate and/or unreliable, may not improve the estimates. Use with caution.

Let’s assume that we have a reliable external source e.g. a recent census that provides the number of households by region. The external source has the following control data: 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

We use the method poststratify() to ensure that the poststratified sample weights (ps_weight) sum to the known control totals by region. Note that the control totals are provided using the Python dictionary census_households.

census_households = {"East": 3700, "North": 1500, "South": 2800, "West": 6500}

full_sample["ps_weight"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], 
    control=census_households, 
    domain=full_sample["region"]
)

full_sample.head(15)
cluster region household design_weight response_status nr_weight ps_weight
0 7 North 72 46.166667 ineligible 49.464286 51.020408
1 7 North 73 46.166667 respondent 54.410714 56.122449
2 7 North 75 46.166667 respondent 54.410714 56.122449
3 7 North 715 46.166667 respondent 54.410714 56.122449
4 7 North 722 46.166667 unknown 0.000000 0.000000
5 7 North 724 46.166667 respondent 54.410714 56.122449
6 7 North 755 46.166667 respondent 54.410714 56.122449
7 7 North 761 46.166667 ineligible 49.464286 51.020408
8 7 North 764 46.166667 respondent 54.410714 56.122449
9 7 North 782 46.166667 respondent 54.410714 56.122449
10 7 North 795 46.166667 respondent 54.410714 56.122449
11 7 North 7111 46.166667 non-respondent 0.000000 0.000000
12 7 North 7112 46.166667 respondent 54.410714 56.122449
13 7 North 7117 46.166667 ineligible 49.464286 51.020408
14 7 North 7123 46.166667 respondent 54.410714 56.122449

The snippet of code below shows that the poststratified sample weights sum to the expected control totals, that is, 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

sum_of_weights = full_sample[
    ["region", "nr_weight", "ps_weight"]
    ].groupby("region").sum()
sum_of_weights.reset_index(inplace=True)
sum_of_weights.head()
region nr_weight ps_weight
0 East 3698.703391 3700.0
1 North 1454.250000 1500.0
2 South 2801.889620 2800.0
3 West 6485.783333 6500.0

The crosstable below shows that only one adjustment factor was calculated and applied per adjustment class (region); we verify these factors by hand right after.

full_sample["ps_adj_fct"] = \
    round(full_sample["ps_weight"] / full_sample["nr_weight"], 12)

pd.crosstab(full_sample["ps_adj_fct"], full_sample["region"])
region East North South West
ps_adj_fct
0.999326 0 0 38 0
1.000351 37 0 0 0
1.002192 0 0 0 23
1.031460 0 24 0 0
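
We can check these factors by hand: for each region, the poststratification factor is simply the control total divided by the sum of the input weights in that region.

# Poststratification factor = control total / sum of nonresponse-adjusted weights
nr_totals = full_sample.groupby("region")["nr_weight"].sum()
for region, control in census_households.items():
    print(region, round(control / nr_totals[region], 6))

For instance, 1500 / 1454.25 = 1.03146 for North, matching the crosstab above.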

In some surveys, there is interest in keeping the relative distribution of the strata aligned with a known distribution. For example, WHO EPI vaccination surveys often poststratify the sample weights to ensure that the relative sizes of the strata reflect official statistics, e.g. census data. In most cases, the strata are based on administrative divisions.

For example, assume that, according to census data, East contains 25% of the households, North 10%, South 20% and West 45%. We can poststratify using the snippet of code below.

known_ratios = {"East": 0.25, "North": 0.10, "South": 0.20, "West": 0.45}
full_sample["ps_weight2"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], 
    factor=known_ratios, 
    domain=full_sample["region"]
)

full_sample.head()
cluster region household design_weight response_status nr_weight ps_weight ps_adj_fct ps_weight2
0 7 North 72 46.166667 ineligible 49.464286 51.020408 1.03146 49.117777
1 7 North 73 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
2 7 North 75 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
3 7 North 715 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
4 7 North 722 46.166667 unknown 0.000000 0.000000 NaN 0.000000
sum_of_weights2 = full_sample[
    ["region", "nr_weight", "ps_weight2"]
    ].groupby("region").sum()
sum_of_weights2.reset_index(inplace=True)
sum_of_weights2["ratio"] = \
    sum_of_weights2["ps_weight2"] / sum(sum_of_weights2["ps_weight2"])
sum_of_weights2.head()
region nr_weight ps_weight2 ratio
0 East 3698.703391 3610.156586 0.25
1 North 1454.250000 1444.062634 0.10
2 South 2801.889620 2888.125269 0.20
3 West 6485.783333 6498.281855 0.45
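
Numerically, the factor-based weights sum, within each region, to the ratio times the overall sum of the input weights, which we can verify:

# Each ratio times the overall weight sum gives the implied control total
overall = full_sample["nr_weight"].sum()  # about 14440.63
for region, ratio in known_ratios.items():
    print(region, round(ratio * overall, 6))

For East, 0.25 × 14440.63 ≈ 3610.16, which is the sum of ps_weight2 shown above.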

Calibration

Calibration is a more general concept for adjusting sample weights to sum to known constants. In this tutorial, we consider the generalized regression (GREG) class of calibration. Assume that we have the survey estimator \(\hat{\mathbf{Y}} = \sum_{i \in s} w_i y_i\) and that known population totals \(\mathbf{X} = (\mathbf{X}_1, ..., \mathbf{X}_p)^T\) of the auxiliary variables are available. Working under the model \(Y_i = \mathbf{x}^T_i \boldsymbol{\beta} + \epsilon_i\), the GREG estimator of the population total is

\[\hat{\mathbf{Y}}_{GR} = \hat{\mathbf{Y}} + (\mathbf{X} - \hat{\mathbf{X}})^T\hat{\mathbf{B}}\]

where \(\hat{\mathbf{B}}\) is the weighted least squares estimate of \(\boldsymbol{\beta}\) and \(\hat{\mathbf{X}}\) is the survey estimate of \(\mathbf{X}\). The essence of the GREG approach is, under the regression model, to find adjusted weights \(w^{*}_i\) that are as close as possible to the \(w_i\), that is, to minimize the distance \(h(z) = \sum_{i \in s} \frac{c_i (w_i - z_i)^2}{w_i}\).
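
Before turning to calibrate(), here is a minimal numpy sketch of the solution to this minimization with \(c_i = 1\), for which the adjusted weights have the closed form \(w^{*}_i = w_i(1 + \mathbf{x}^T_i \hat{\boldsymbol{\lambda}})\), where \(\hat{\boldsymbol{\lambda}}\) solves \(\left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}^T_i\right) \hat{\boldsymbol{\lambda}} = \mathbf{X} - \hat{\mathbf{X}}\). The function greg_calibrate is our own illustration, not part of samplics, and samplics' calibrate() may differ in implementation details (e.g. its scale and bounded parameters).

import numpy as np

def greg_calibrate(weights, aux, totals):
    """Chi-square calibration with c_i = 1: minimize sum((w - z)**2 / w)
    subject to the constraint aux.T @ z = totals."""
    w = np.asarray(weights, dtype=float)
    x = np.asarray(aux, dtype=float)      # n x p matrix of auxiliary variables
    t = np.asarray(totals, dtype=float)   # p-vector of control totals
    x_hat = x.T @ w                       # survey estimate of the totals
    m = (x * w[:, None]).T @ x            # sum over i of w_i * x_i x_i^T
    lam = np.linalg.solve(m, t - x_hat)
    return w * (1.0 + x @ lam)            # adjusted weights w*_i

By construction, the adjusted weights reproduce the control totals exactly, since aux.T @ w* = x_hat + m @ lam = totals.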

Let us simulate three auxiliary variables, namely education, poverty and under_five (number of children under five in the household), and assume that we have the following control totals:

  • Total number of under five children: 6300 in the East, 4000 in the North, 6500 in the South and 14000 in the West.

  • Poverty (Yes: in poverty / No: not in poverty)

    Region   Poverty   Number of households
    East     No        2600
    East     Yes       1200
    North    No        1500
    North    Yes       200
    South    No        1800
    South    Yes       1100
    West     No        4500
    West     Yes       2200
  • Education (Low: less than secondary, Medium: secondary completed, and High: More than secondary)

    Region   Education   Number of households
    East     Low         2000
    East     Medium      1400
    East     High        350
    North    Low         550
    North    Medium      700
    North    High        250
    South    Low         1300
    South    Medium      1200
    South    High        350
    West     Low         2100
    West     Medium      4000
    West     High        500

np.random.seed(150)
full_sample["education"] = np.random.choice(
    ("Low", "Medium", "High"), 
    size=150, 
    p=(0.40, 0.50, 0.10)
    )
full_sample["poverty"] = np.random.choice((0, 1), size=150, p=(0.70, 0.30))
full_sample["under_five"] = np.random.choice(
    (0, 1, 2, 3, 4, 5), 
    size=150, 
    p=(0.05, 0.35, 0.25, 0.20, 0.10, 0.05)
    )

full_sample[[
    "cluster", 
    "region", 
    "household", 
    "nr_weight", 
    "education", 
    "poverty", 
    "under_five"
    ]].head()
cluster region household nr_weight education poverty under_five
0 7 North 72 49.464286 High 1 1
1 7 North 73 54.410714 Low 0 3
2 7 North 75 54.410714 Medium 0 2
3 7 North 715 54.410714 Medium 1 2
4 7 North 722 0.000000 Medium 0 2

We now calibrate the nonresponse-adjusted weights (nr_weight) to ensure that the estimated number of households in poverty is equal to 4,700 and the estimated total number of children under five is equal to 30,800. The control numbers 4,700 and 30,800 are obtained from the tables above.

The class SampleWeight() uses the method calibrate(samp_weight, aux_vars, control, domain, scale, bounded, modified) to adjust the weights using the GREG approach.

  • control holds the control values, which must be stored in a Python dictionary, i.e. totals = {"poverty": 4700, "under_five": 30800}. In this case, we have two numerical variables: poverty with values in {0, 1} and under_five with values in {0, 1, 2, 3, 4, 5}.

  • aux_vars is the matrix of covariates (auxiliary variables).

totals = {"poverty": 4700, "under_five": 30800}

full_sample["calib_weight"] = SampleWeight().calibrate(
    full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals
)

full_sample[["cluster", "region", "household", "nr_weight", "calib_weight"]].head(15)
cluster region household nr_weight calib_weight
0 7 North 72 49.464286 50.432441
1 7 North 73 54.410714 57.233887
2 7 North 75 54.410714 56.292829
3 7 North 715 54.410714 56.416743
4 7 North 722 0.000000 0.000000
5 7 North 724 54.410714 57.233887
6 7 North 755 54.410714 57.233887
7 7 North 761 49.464286 49.464286
8 7 North 764 54.410714 56.292829
9 7 North 782 54.410714 57.233887
10 7 North 795 54.410714 58.174944
11 7 North 7111 0.000000 0.000000
12 7 North 7112 54.410714 59.116002
13 7 North 7117 49.464286 50.319793
14 7 North 7123 54.410714 56.292829

We can confirm that the estimated totals for the auxiliary variables are equal to their control values.

poverty = full_sample["poverty"]
under_5 = full_sample["under_five"]
nr_weight = full_sample["nr_weight"]
calib_weight = full_sample["calib_weight"]

print(
    f"""\nTotal estimated number of poor households was 
    {sum(poverty*nr_weight):.2f} before adjustment and 
    {sum(poverty*calib_weight):.2f} after adjustment.\n"""
)
print(
    f"""Total estimated number of children under 5 was 
    {sum(under_5*nr_weight):.2f} before adjustment and 
    {sum(under_5*calib_weight):.2f} after adjustment.\n"""
)

Total estimated number of poor households was 
    4521.84 before adjustment and 
    4700.00 after adjustment.

Total estimated number of children under 5 was 
    29442.52 before adjustment and 
    30800.00 after adjustment.

If we want to control by domain, we can do so using the parameter domain of calibrate(). First, we need to update the Python dictionary holding the control values: those values now have to be provided for each domain. The dictionary becomes a nested dictionary whose top-level keys hold the domain values, i.e. East, North, South and West, and whose values are dictionaries mapping each auxiliary variable to its control value for that domain.

totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

full_sample["calib_weight_d"] = SampleWeight().calibrate(
    full_sample["nr_weight"], 
    full_sample[["poverty", "under_five"]], 
    totals_by_domain, full_sample["region"]
)

full_sample[[
    "cluster", 
    "region", 
    "household", 
    "nr_weight", 
    "calib_weight", 
    "calib_weight_d"
    ]].head(15)
cluster region household nr_weight calib_weight calib_weight_d
0 7 North 72 49.464286 50.432441 40.892864
1 7 North 73 54.410714 57.233887 61.852139
2 7 North 75 54.410714 56.292829 59.371664
3 7 North 715 54.410714 56.416743 47.462625
4 7 North 722 0.000000 0.000000 0.000000
5 7 North 724 54.410714 57.233887 61.852139
6 7 North 755 54.410714 57.233887 61.852139
7 7 North 761 49.464286 49.464286 49.464286
8 7 North 764 54.410714 56.292829 59.371664
9 7 North 782 54.410714 57.233887 61.852139
10 7 North 795 54.410714 58.174944 64.332614
11 7 North 7111 0.000000 0.000000 0.000000
12 7 North 7112 54.410714 59.116002 66.813089
13 7 North 7117 49.464286 50.319793 51.719263
14 7 North 7123 54.410714 56.292829 59.371664

Note that the GREG domain estimates above do not have the additive property; that is, the GREG domain estimates do not sum to the overall GREG estimate. To illustrate this, let's assume that we want to estimate the number of households.

print(f"\nThe number of households using the overall GREG is: \
    {sum(full_sample['calib_weight']):.2f} \n")
print(f"The number of households using the domain GREG is: \
    {sum(full_sample['calib_weight_d']):.2f} \n")

The number of households using the overall GREG is:     14960.15 

The number of households using the domain GREG is:     14959.01 
Note

If the additive flag is set to True, the sum of the domain estimates will be equal to the overall GREG estimate.

totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

calib_weight3 = SampleWeight().calibrate(
    full_sample["nr_weight"],
    full_sample[["poverty", "under_five"]],
    totals_by_domain,
    full_sample["region"],
    additive=True,
)
under_5 = np.array(full_sample["under_five"])
print(f"\nEach column can be used to estimate a domain: \
{np.sum(np.transpose(calib_weight3) * under_5, axis=1)}\n")

Each column can be used to estimate a domain: [ 6300.  4000.  6500. 14000.]
print(f"The number of households using the overall GREG is: \
{sum(full_sample['calib_weight']):.2f} \n")
The number of households using the overall GREG is: 14960.15 
print(f"The number of households using the domain GREG is: \
{sum(full_sample['calib_weight_d']):.2f} (additive=False)\n")
The number of households using the domain GREG is: 14959.01 (additive=False)
print(f"The number of households using the domain GREG is: \
{np.sum(np.transpose(calib_weight3)):.2f} (additive=True) \n")
The number of households using the domain GREG is: 14960.15 (additive=True) 

Normalization

DHS and MICS normalize the final sample weights to sum to the sample size. We can use the class method normalize() to ensure that the sample weights sum to some constant, either across the whole sample or by normalization domain, e.g. stratum.

Note

Normalization is mostly included here for completeness; it is seldom the case that sample weights are normalized in large-scale household surveys. One major downside of normalized weights is that the estimation of totals does not make sense with them.

full_sample["norm_weight"] = \
    SampleWeight().normalize(samp_weight=full_sample["nr_weight"])
full_sample[["cluster", "region", "nr_weight", "norm_weight"]].head(25)

print((full_sample.shape[0], full_sample["norm_weight"].sum()))
(150, 150.0)
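
The arithmetic behind this default call is a simple rescaling by n / sum(w), which we can check against norm_weight:

# normalize() with no control rescales the weights to sum to the sample size
n = full_sample.shape[0]
manual = full_sample["nr_weight"] * n / full_sample["nr_weight"].sum()
print(np.allclose(manual, full_sample["norm_weight"]))  # expected: True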

When normalize() is called with only the parameter samp_weight, the sample weights are normalized to sum to the length of the sample weight vector, as checked above. To normalize to a different constant, we provide it through the parameter control.

full_sample["norm_weight2"] = \
    SampleWeight().normalize(
        samp_weight=full_sample["nr_weight"], 
        control=300
        )

print(full_sample["norm_weight2"].sum())
300.0
full_sample["norm_weight3"] = SampleWeight().normalize(
    samp_weight=full_sample["nr_weight"], 
    domain=full_sample["region"]
    )

weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3"]]
region nr_weight norm_weight norm_weight3
0 East 3698.703391 38.419768 45.0
1 North 1454.250000 15.105820 30.0
2 South 2801.889620 29.104239 45.0
3 West 6485.783333 67.370173 30.0
We can also normalize the weights to arbitrary control values by domain:

norm_level = {"East": 10, "North": 20, "South": 30, "West": 50}

full_sample["norm_weight4"] = SampleWeight().normalize(
    samp_weight=full_sample["nr_weight"], 
    control=norm_level, 
    domain=full_sample["region"]
    )

weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3", "norm_weight4",]]
region nr_weight norm_weight norm_weight3 norm_weight4
0 East 3698.703391 38.419768 45.0 10.0
1 North 1454.250000 15.105820 30.0 20.0
2 South 2801.889620 29.104239 45.0 30.0
3 West 6485.783333 67.370173 30.0 50.0