Sample Weight Adjustments

The objective of this tutorial is to familiarize ourselves with SampleWeight, the samplics class for adjusting sample weights. In practice, it is necessary to adjust the base or design sample weights obtained directly from the random selection mechanism. These adjustments correct for nonresponse, reduce the effect of extreme/large weights, better align the sample with known auxiliary information, and more. Specifically, in this tutorial we will:

  • compute the design (base) weights,

  • adjust the weights for nonresponse,

  • poststratify the weights to known control totals,

  • calibrate the weights using the GREG approach, and

  • normalize the weights to sum to given constants.

To run the code in this notebook, we will use the datasets that were developed in the previous tutorial on sample selection.

import numpy as np
import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

Design (base) weight

The design weight is the inverse of the overall probability of selection, which is the product of the first-stage (PSU) and second-stage (SSU) selection probabilities.

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * \
    full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

full_sample.head(15)
cluster region psu_prob household ssu_prob inclusion_prob design_weight
0 7 North 0.187726 72 0.115385 0.021661 46.166667
1 7 North 0.187726 73 0.115385 0.021661 46.166667
2 7 North 0.187726 75 0.115385 0.021661 46.166667
3 7 North 0.187726 715 0.115385 0.021661 46.166667
4 7 North 0.187726 722 0.115385 0.021661 46.166667
5 7 North 0.187726 724 0.115385 0.021661 46.166667
6 7 North 0.187726 755 0.115385 0.021661 46.166667
7 7 North 0.187726 761 0.115385 0.021661 46.166667
8 7 North 0.187726 764 0.115385 0.021661 46.166667
9 7 North 0.187726 782 0.115385 0.021661 46.166667
10 7 North 0.187726 795 0.115385 0.021661 46.166667
11 7 North 0.187726 7111 0.115385 0.021661 46.166667
12 7 North 0.187726 7112 0.115385 0.021661 46.166667
13 7 North 0.187726 7117 0.115385 0.021661 46.166667
14 7 North 0.187726 7123 0.115385 0.021661 46.166667

To illustrate the handling of nonresponse, we first need to introduce some household nonresponse into our example. That is, we simulate a response status and store it in the variable response_status. The variable response_status has four possible values: ineligible, which indicates that the sampling unit is not eligible for the survey; respondent, which indicates that the sampling unit responded to the survey; non-respondent, which indicates that the sampling unit did not respond to the survey; and unknown, which means that we are not able to infer the status of the sampling unit, i.e. we do not know whether or not the sampling unit is eligible for the survey.

np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"], 
    size=full_sample.shape[0], 
    p=(0.10, 0.70, 0.15, 0.05)
)

full_sample[
    ["cluster", "region", "design_weight", "response_status"]
    ].head(15)
cluster region design_weight response_status
0 7 North 46.166667 ineligible
1 7 North 46.166667 respondent
2 7 North 46.166667 respondent
3 7 North 46.166667 respondent
4 7 North 46.166667 unknown
5 7 North 46.166667 respondent
6 7 North 46.166667 respondent
7 7 North 46.166667 ineligible
8 7 North 46.166667 respondent
9 7 North 46.166667 respondent
10 7 North 46.166667 respondent
11 7 North 46.166667 non-respondent
12 7 North 46.166667 respondent
13 7 North 46.166667 ineligible
14 7 North 46.166667 respondent

Nonresponse adjustment

In general, the sample weights are adjusted to redistribute the weights of the eligible units that did not sufficiently respond (unit-level nonresponse) to the sampling units that sufficiently responded to the survey. This adjustment is done within adjustment classes or domains. Note that the determination of the response categories (unit response, item response, ineligible, etc.) is outside the scope of this tutorial.

Similarly, the weights of the sampling units with unknown eligibility are redistributed to the rest of the sampling units. In general, ineligible sampling units receive part of the weights of the units with unknown eligibility, since ineligible units can themselves be part of the unknown pool.

The method adjust() has a boolean parameter unknown_to_inelig which controls how the sample weights of the units with unknown eligibility are redistributed. By default, adjust() redistributes the weights of the unknowns to the ineligibles as well (unknown_to_inelig=True). If we do not wish to redistribute the weights of the unknowns to the ineligibles, we simply set the flag to False (unknown_to_inelig=False).

In the snippet of code below, we adjust the weights within clusters, that is, we use the clusters as our adjustment classes. Note that we run the nonresponse adjustment twice: the first time with unknown_to_inelig=True (nr_weight) and the second time with the flag set to False (nr_weight2). With unknown_to_inelig=True, the ineligible units receive part of the weights of the unknowns; hence, the sample weights of the respondents are smaller than when the flag is False. With unknown_to_inelig=False, the ineligible units do not receive any weight from the unknowns; hence, their sample weights remain the same before and after adjustment. In a real survey, the statistician decides on the best nonresponse strategy based on the available information. We verify these adjustment factors by hand just after the output below.

status_mapping = {
    "in": "ineligible", 
    "rr": "respondent", 
    "nr": "non-respondent", 
    "uk": "unknown"
    }

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
    unknown_to_inelig=False,
)

full_sample[[
    "cluster", 
    "region", 
    "design_weight", 
    "response_status", 
    "nr_weight", 
    "nr_weight2"
    ]].drop_duplicates().head(15)
cluster region design_weight response_status nr_weight nr_weight2
0 7 North 46.166667 ineligible 49.464286 46.166667
1 7 North 46.166667 respondent 54.410714 55.400000
4 7 North 46.166667 unknown 0.000000 0.000000
11 7 North 46.166667 non-respondent 0.000000 0.000000
15 10 North 50.783333 non-respondent 0.000000 0.000000
16 10 North 50.783333 respondent 70.733929 71.096667
19 10 North 50.783333 ineligible 54.410714 50.783333
21 10 North 50.783333 unknown 0.000000 0.000000
30 16 South 62.149123 respondent 66.588346 66.588346
35 16 South 62.149123 non-respondent 0.000000 0.000000
45 24 South 58.940741 respondent 63.852469 63.852469
47 24 South 58.940741 non-respondent 0.000000 0.000000
55 24 South 58.940741 ineligible 58.940741 58.940741
60 29 South 65.702778 unknown 0.000000 0.000000
61 29 South 65.702778 respondent 101.081197 102.204321
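
To see what adjust() is doing, we can reproduce the adjustment factors for cluster 7 by hand. From the output above, the cluster has 15 households, all with design weight 46.166667: 3 ineligible, 10 respondents, 1 non-respondent and 1 unknown. The sketch below uses counts rather than weight sums, which is valid here only because all design weights are equal within the cluster.

w = 46.166667  # design weight shared by the 15 households of cluster 7
n_in, n_rr, n_nr, n_uk = 3, 10, 1, 1

# Step 1: spread the weight of the unknowns over in + rr + nr
# (unknown_to_inelig=True, the default)
f_uk = (n_in + n_rr + n_nr + n_uk) / (n_in + n_rr + n_nr)  # 15/14

# Step 2: spread the weight of the non-respondents over the respondents
f_nr = (n_rr + n_nr) / n_rr  # 11/10

print(w * f_uk)         # ineligible: ~49.464286
print(w * f_uk * f_nr)  # respondent: ~54.410714

# With unknown_to_inelig=False, the unknowns are spread over rr + nr only,
# so the ineligible weight stays at 46.166667
f_uk2 = (n_rr + n_nr + n_uk) / (n_rr + n_nr)  # 12/11
print(w * f_uk2 * f_nr)  # respondent: ~55.400000

Both sets of values match the nr_weight and nr_weight2 columns above.
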
Important

The default call of adjust() expects standard codes for the response status, that is, "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.

In the call above, if we omit the parameter resp_dict, the run fails with an assertion error. The current error message is: "The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter". For the call to run without resp_dict, the response status must take values only in the standard codes, i.e. ("in", "rr", "nr", "uk"). The variable passed to resp_status can contain any codes, but a mapping via resp_dict is necessary whenever the response variable is not constructed using the standard codes.

To further illustrate the mapping of response status, let’s assume that we have response_status2 which has the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown.

response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"] == "non-respondent"] = 200
response_status2[full_sample["response_status"] == "respondent"] = 300
response_status2[full_sample["response_status"] == "unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])
response_status ineligible non-respondent respondent unknown
row_0
100 16 0 0 0
200 0 23 0 0
300 0 0 106 0
999 0 0 0 5

To use response_status2, we need to map the values 100, 200, 300 and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below using the Python dictionary status_mapping2. Using status_mapping2 in the call to adjust() leads to the same adjustment as in the previous run, i.e. nr_weight and nr_weight3 contain the same adjusted weights.

status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

full_sample["nr_weight3"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=response_status2,
    resp_dict=status_mapping2,
)

full_sample[
    ["cluster", "region", "response_status", "nr_weight", "nr_weight3"]
    ].drop_duplicates().head()
cluster region response_status nr_weight nr_weight3
0 7 North ineligible 49.464286 49.464286
1 7 North respondent 54.410714 54.410714
4 7 North unknown 0.000000 0.000000
11 7 North non-respondent 0.000000 0.000000
15 10 North non-respondent 0.000000 0.000000

If the response status variable only takes the values "in", "nr", "rr" and "uk", then it is not necessary to provide the mapping dictionary, i.e. resp_dict can be omitted from the call to adjust().

response_status3 = np.repeat("in", full_sample["response_status"].shape[0])
response_status3[full_sample["response_status"] == "non-respondent"] = "nr"
response_status3[full_sample["response_status"] == "respondent"] = "rr"
response_status3[full_sample["response_status"] == "unknown"] = "uk"

full_sample["nr_weight4"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=response_status3,
)

full_sample[
    ["cluster", "region", "response_status", "nr_weight", "nr_weight4"]
    ].drop_duplicates().head()
cluster region response_status nr_weight nr_weight4
0 7 North ineligible 49.464286 49.464286
1 7 North respondent 54.410714 54.410714
4 7 North unknown 0.000000 0.000000
11 7 North non-respondent 0.000000 0.000000
15 10 North non-respondent 0.000000 0.000000
# Just dropping a couple of variables 
# not needed for the rest of the tutorial
full_sample.drop(
    columns=[
        "psu_prob", 
        "ssu_prob", 
        "inclusion_prob", 
        "nr_weight2", 
        "nr_weight3", 
        "nr_weight4"
        ], 
    inplace=True
)

Poststratification

Poststratification is useful to compensate for under-representation of the sample or to correct for nonsampling errors. The most common poststratification method consists of adjusting the sample weights, within adjustment classes (domains), so that they sum to known control values from reliable sources. Poststratification classes can be formed using variables beyond the ones involved in the sampling design. For example, socio-economic variables such as age group, gender, race and education are often used to form poststratification classes/cells.

Warning

Poststratifying to totals that are out of date, and thus likely inaccurate and/or unreliable, may not improve the estimates. Use with caution.

Let’s assume that we have a reliable external source e.g. a recent census that provides the number of households by region. The external source has the following control data: 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

We use the method poststratify() to ensure that the poststratified sample weights (ps_weight) sum to the known control totals by region. Note that the control totals are provided using the Python dictionary census_households.

census_households = {"East": 3700, "North": 1500, "South": 2800, "West": 6500}

full_sample["ps_weight"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], 
    control=census_households, 
    domain=full_sample["region"]
)

full_sample.head(15)
cluster region household design_weight response_status nr_weight ps_weight
0 7 North 72 46.166667 ineligible 49.464286 51.020408
1 7 North 73 46.166667 respondent 54.410714 56.122449
2 7 North 75 46.166667 respondent 54.410714 56.122449
3 7 North 715 46.166667 respondent 54.410714 56.122449
4 7 North 722 46.166667 unknown 0.000000 0.000000
5 7 North 724 46.166667 respondent 54.410714 56.122449
6 7 North 755 46.166667 respondent 54.410714 56.122449
7 7 North 761 46.166667 ineligible 49.464286 51.020408
8 7 North 764 46.166667 respondent 54.410714 56.122449
9 7 North 782 46.166667 respondent 54.410714 56.122449
10 7 North 795 46.166667 respondent 54.410714 56.122449
11 7 North 7111 46.166667 non-respondent 0.000000 0.000000
12 7 North 7112 46.166667 respondent 54.410714 56.122449
13 7 North 7117 46.166667 ineligible 49.464286 51.020408
14 7 North 7123 46.166667 respondent 54.410714 56.122449

The snippet of code below shows that the poststratified sample weights sum to the expected control totals, that is, 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

sum_of_weights = full_sample[
    ["region", "nr_weight", "ps_weight"]
    ].groupby("region").sum()
sum_of_weights.reset_index(inplace=True)
sum_of_weights.head()
region nr_weight ps_weight
0 East 3698.703391 3700.0
1 North 1454.250000 1500.0
2 South 2801.889620 2800.0
3 West 6485.783333 6500.0

The crosstable below shows that only one adjustment factor was calculated and applied per adjustment class (region); we verify these factors by hand right after.

full_sample["ps_adj_fct"] = \
    round(full_sample["ps_weight"] / full_sample["nr_weight"], 12)

pd.crosstab(full_sample["ps_adj_fct"], full_sample["region"])
region East North South West
ps_adj_fct
0.999326 0 0 38 0
1.000351 37 0 0 0
1.002192 0 0 0 23
1.031460 0 24 0 0
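
We can check these factors by hand: for each region, the poststratification factor is simply the control total divided by the sum of the input weights in that region.

# Poststratification factor = control total / sum of nonresponse-adjusted weights
nr_totals = full_sample.groupby("region")["nr_weight"].sum()
for region, control in census_households.items():
    print(region, round(control / nr_totals[region], 6))

For instance, 1500 / 1454.25 = 1.03146 for North, matching the crosstab above.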

In some surveys, there is interest in keeping the relative distribution of the strata aligned with a known distribution. For example, WHO EPI vaccination surveys often poststratify the sample weights to ensure that the relative sizes of the strata reflect official statistics, e.g. census data. In most cases, the strata are based on administrative divisions.

For example, assume that, according to census data, East contains 25% of the households, North 10%, South 20% and West 45%. We can poststratify using the snippet of code below.

known_ratios = {"East": 0.25, "North": 0.10, "South": 0.20, "West": 0.45}
full_sample["ps_weight2"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], 
    factor=known_ratios, 
    domain=full_sample["region"]
)

full_sample.head()
cluster region household design_weight response_status nr_weight ps_weight ps_adj_fct ps_weight2
0 7 North 72 46.166667 ineligible 49.464286 51.020408 1.03146 49.117777
1 7 North 73 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
2 7 North 75 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
3 7 North 715 46.166667 respondent 54.410714 56.122449 1.03146 54.029554
4 7 North 722 46.166667 unknown 0.000000 0.000000 NaN 0.000000
sum_of_weights2 = full_sample[
    ["region", "nr_weight", "ps_weight2"]
    ].groupby("region").sum()
sum_of_weights2.reset_index(inplace=True)
sum_of_weights2["ratio"] = \
    sum_of_weights2["ps_weight2"] / sum(sum_of_weights2["ps_weight2"])
sum_of_weights2.head()
region nr_weight ps_weight2 ratio
0 East 3698.703391 3610.156586 0.25
1 North 1454.250000 1444.062634 0.10
2 South 2801.889620 2888.125269 0.20
3 West 6485.783333 6498.281855 0.45
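
Numerically, the factor-based weights sum, within each region, to the ratio times the overall sum of the input weights, which we can verify:

# Each ratio times the overall weight sum gives the implied control total
overall = full_sample["nr_weight"].sum()  # about 14440.63
for region, ratio in known_ratios.items():
    print(region, round(ratio * overall, 6))

For East, 0.25 × 14440.63 ≈ 3610.16, which is the sum of ps_weight2 shown above.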

Calibration

Calibration is a more general concept for adjusting sample weights to sum to known constants. In this tutorial, we consider the generalized regression (GREG) class of calibration. Assume that we have the survey estimator \(\hat{\mathbf{Y}} = \sum_{i \in s} w_i y_i\) and that known population totals \(\mathbf{X} = (\mathbf{X}_1, ..., \mathbf{X}_p)^T\) of the auxiliary variables are available. Working under the model \(Y_i = \mathbf{x}^T_i \boldsymbol{\beta} + \epsilon_i\), the GREG estimator of the population total is

\[\hat{\mathbf{Y}}_{GR} = \hat{\mathbf{Y}} + (\mathbf{X} - \hat{\mathbf{X}})^T\hat{\mathbf{B}}\]

where \(\hat{\mathbf{B}}\) is the weighted least squares estimate of \(\boldsymbol{\beta}\) and \(\hat{\mathbf{X}}\) is the survey estimate of \(\mathbf{X}\). The essence of the GREG approach is, under the regression model, to find adjusted weights \(w^{*}_i\) that are as close as possible to the \(w_i\), that is, to minimize the distance \(h(z) = \sum_{i \in s} \frac{c_i (w_i - z_i)^2}{w_i}\).
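
Before turning to calibrate(), here is a minimal numpy sketch of the solution to this minimization with \(c_i = 1\), for which the adjusted weights have the closed form \(w^{*}_i = w_i(1 + \mathbf{x}^T_i \hat{\boldsymbol{\lambda}})\), where \(\hat{\boldsymbol{\lambda}}\) solves \(\left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}^T_i\right) \hat{\boldsymbol{\lambda}} = \mathbf{X} - \hat{\mathbf{X}}\). The function greg_calibrate is our own illustration, not part of samplics, and samplics' calibrate() may differ in implementation details (e.g. its scale and bounded parameters).

import numpy as np

def greg_calibrate(weights, aux, totals):
    """Chi-square calibration with c_i = 1: minimize sum((w - z)**2 / w)
    subject to the constraint aux.T @ z = totals."""
    w = np.asarray(weights, dtype=float)
    x = np.asarray(aux, dtype=float)      # n x p matrix of auxiliary variables
    t = np.asarray(totals, dtype=float)   # p-vector of control totals
    x_hat = x.T @ w                       # survey estimate of the totals
    m = (x * w[:, None]).T @ x            # sum over i of w_i * x_i x_i^T
    lam = np.linalg.solve(m, t - x_hat)
    return w * (1.0 + x @ lam)            # adjusted weights w*_i

By construction, the adjusted weights reproduce the control totals exactly, since aux.T @ w* = x_hat + m @ lam = totals.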

Let us simulate three auxiliary variables, namely education, poverty and under_five (number of children under five in the household), and assume that we have the following control totals:

  • Total number of under five children: 6300 in the East, 4000 in the North, 6500 in the South and 14000 in the West.

  • Poverty (Yes: in poverty / No: not in poverty)

    Region   Poverty   Number of households
    East     No        2600
    East     Yes       1200
    North    No        1500
    North    Yes       200
    South    No        1800
    South    Yes       1100
    West     No        4500
    West     Yes       2200
  • Education (Low: less than secondary, Medium: secondary completed, and High: More than secondary)

    Region   Education   Number of households
    East     Low         2000
    East     Medium      1400
    East     High        350
    North    Low         550
    North    Medium      700
    North    High        250
    South    Low         1300
    South    Medium      1200
    South    High        350
    West     Low         2100
    West     Medium      4000
    West     High        500

np.random.seed(150)
full_sample["education"] = np.random.choice(
    ("Low", "Medium", "High"), 
    size=150, 
    p=(0.40, 0.50, 0.10)
    )
full_sample["poverty"] = np.random.choice((0, 1), size=150, p=(0.70, 0.30))
full_sample["under_five"] = np.random.choice(
    (0, 1, 2, 3, 4, 5), 
    size=150, 
    p=(0.05, 0.35, 0.25, 0.20, 0.10, 0.05)
    )

full_sample[[
    "cluster", 
    "region", 
    "household", 
    "nr_weight", 
    "education", 
    "poverty", 
    "under_five"
    ]].head()
cluster region household nr_weight education poverty under_five
0 7 North 72 49.464286 High 1 1
1 7 North 73 54.410714 Low 0 3
2 7 North 75 54.410714 Medium 0 2
3 7 North 715 54.410714 Medium 1 2
4 7 North 722 0.000000 Medium 0 2

We now calibrate the nonresponse-adjusted weights (nr_weight) to ensure that the estimated number of households in poverty is equal to 4,700 and the estimated total number of children under five is equal to 30,800. The control numbers 4,700 and 30,800 are obtained from the tables above.

The class SampleWeight() uses the method calibrate(samp_weight, aux_vars, control, domain, scale, bounded, modified) to adjust the weights using the GREG approach.

  • control holds the control values, which must be stored in a Python dictionary, i.e. totals = {"poverty": 4700, "under_five": 30800}. In this case, we have two numerical variables: poverty with values in {0, 1} and under_five with values in {0, 1, 2, 3, 4, 5}.

  • aux_vars is the matrix of covariates (auxiliary variables).

totals = {"poverty": 4700, "under_five": 30800}

full_sample["calib_weight"] = SampleWeight().calibrate(
    full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals
)

full_sample[["cluster", "region", "household", "nr_weight", "calib_weight"]].head(15)
cluster region household nr_weight calib_weight
0 7 North 72 49.464286 50.432441
1 7 North 73 54.410714 57.233887
2 7 North 75 54.410714 56.292829
3 7 North 715 54.410714 56.416743
4 7 North 722 0.000000 0.000000
5 7 North 724 54.410714 57.233887
6 7 North 755 54.410714 57.233887
7 7 North 761 49.464286 49.464286
8 7 North 764 54.410714 56.292829
9 7 North 782 54.410714 57.233887
10 7 North 795 54.410714 58.174944
11 7 North 7111 0.000000 0.000000
12 7 North 7112 54.410714 59.116002
13 7 North 7117 49.464286 50.319793
14 7 North 7123 54.410714 56.292829

We can confirm that the estimated totals for the auxiliary variables are equal to their control values.

poverty = full_sample["poverty"]
under_5 = full_sample["under_five"]
nr_weight = full_sample["nr_weight"]
calib_weight = full_sample["calib_weight"]

print(
    f"""\nTotal estimated number of poor households was 
    {sum(poverty*nr_weight):.2f} before adjustment and 
    {sum(poverty*calib_weight):.2f} after adjustment.\n"""
)
print(
    f"""Total estimated number of children under 5 was 
    {sum(under_5*nr_weight):.2f} before adjustment and 
    {sum(under_5*calib_weight):.2f} after adjustment.\n"""
)

Total estimated number of poor households was 
    4521.84 before adjustment and 
    4700.00 after adjustment.

Total estimated number of children under 5 was 
    29442.52 before adjustment and 
    30800.00 after adjustment.

If we want to control by domain, we can do so using the parameter domain of calibrate(). First, we need to update the Python dictionary holding the control values: those values now have to be provided for each domain. The dictionary becomes a nested dictionary whose top-level keys hold the domain values, i.e. East, North, South and West, and whose values are dictionaries mapping each auxiliary variable to its control value for that domain.

totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

full_sample["calib_weight_d"] = SampleWeight().calibrate(
    full_sample["nr_weight"], 
    full_sample[["poverty", "under_five"]], 
    totals_by_domain, full_sample["region"]
)

full_sample[[
    "cluster", 
    "region", 
    "household", 
    "nr_weight", 
    "calib_weight", 
    "calib_weight_d"
    ]].head(15)
cluster region household nr_weight calib_weight calib_weight_d
0 7 North 72 49.464286 50.432441 40.892864
1 7 North 73 54.410714 57.233887 61.852139
2 7 North 75 54.410714 56.292829 59.371664
3 7 North 715 54.410714 56.416743 47.462625
4 7 North 722 0.000000 0.000000 0.000000
5 7 North 724 54.410714 57.233887 61.852139
6 7 North 755 54.410714 57.233887 61.852139
7 7 North 761 49.464286 49.464286 49.464286
8 7 North 764 54.410714 56.292829 59.371664
9 7 North 782 54.410714 57.233887 61.852139
10 7 North 795 54.410714 58.174944 64.332614
11 7 North 7111 0.000000 0.000000 0.000000
12 7 North 7112 54.410714 59.116002 66.813089
13 7 North 7117 49.464286 50.319793 51.719263
14 7 North 7123 54.410714 56.292829 59.371664

Note that the GREG domain estimates above do not have the additive property; that is, the GREG domain estimates do not sum to the overall GREG estimate. To illustrate this, let's assume that we want to estimate the number of households.

print(f"\nThe number of households using the overall GREG is: \
    {sum(full_sample['calib_weight']):.2f} \n")
print(f"The number of households using the domain GREG is: \
    {sum(full_sample['calib_weight_d']):.2f} \n")

The number of households using the overall GREG is:     14960.15 

The number of households using the domain GREG is:     14959.01 
Note

If the additive flag is set to True, the sum of the domain estimates will be equal to the overall GREG estimate.

totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

calib_weight3 = SampleWeight().calibrate(
    full_sample["nr_weight"],
    full_sample[["poverty", "under_five"]],
    totals_by_domain,
    full_sample["region"],
    additive=True,
)
under_5 = np.array(full_sample["under_five"])
print(f"\nEach column can be used to estimate a domain: \
{np.sum(np.transpose(calib_weight3) * under_5, axis=1)}\n")

Each column can be used to estimate a domain: [ 6300.  4000.  6500. 14000.]
print(f"The number of households using the overall GREG is: \
{sum(full_sample['calib_weight']):.2f} \n")
The number of households using the overall GREG is: 14960.15 
print(f"The number of households using the domain GREG is: \
{sum(full_sample['calib_weight_d']):.2f} (additive=False)\n")
The number of households using the domain GREG is: 14959.01 (additive=False)
print(f"The number of households using the domain GREG is: \
{np.sum(np.transpose(calib_weight3)):.2f} (additive=True) \n")
The number of households using the domain GREG is: 14960.15 (additive=True) 

Normalization

DHS and MICS normalize the final sample weights to sum to the sample size. We can use the class method normalize() to ensure that the sample weights sum to some constant, either across the whole sample or by normalization domain, e.g. stratum.

Note

Normalization is mostly included here for completeness; it is seldom the case that sample weights are normalized in large-scale household surveys. One major downside of normalized weights is that the estimation of totals does not make sense with them.

full_sample["norm_weight"] = \
    SampleWeight().normalize(samp_weight=full_sample["nr_weight"])
full_sample[["cluster", "region", "nr_weight", "norm_weight"]].head(25)

print((full_sample.shape[0], full_sample["norm_weight"].sum()))
(150, 150.0)
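
The arithmetic behind this default call is a simple rescaling by n / sum(w), which we can check against norm_weight:

# normalize() with no control rescales the weights to sum to the sample size
n = full_sample.shape[0]
manual = full_sample["nr_weight"] * n / full_sample["nr_weight"].sum()
print(np.allclose(manual, full_sample["norm_weight"]))  # expected: True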

When normalize() is called with only the parameter samp_weight, the sample weights are normalized to sum to the length of the sample weight vector, as checked above. To normalize to a different constant, we provide it through the parameter control.

full_sample["norm_weight2"] = \
    SampleWeight().normalize(
        samp_weight=full_sample["nr_weight"], 
        control=300
        )

print(full_sample["norm_weight2"].sum())
300.0
full_sample["norm_weight3"] = SampleWeight().normalize(
    samp_weight=full_sample["nr_weight"], 
    domain=full_sample["region"]
    )

weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3"]]
region nr_weight norm_weight norm_weight3
0 East 3698.703391 38.419768 45.0
1 North 1454.250000 15.105820 30.0
2 South 2801.889620 29.104239 45.0
3 West 6485.783333 67.370173 30.0
We can also normalize the weights to arbitrary control values by domain:

norm_level = {"East": 10, "North": 20, "South": 30, "West": 50}

full_sample["norm_weight4"] = SampleWeight().normalize(
    samp_weight=full_sample["nr_weight"], 
    control=norm_level, 
    domain=full_sample["region"]
    )

weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3", "norm_weight4",]]
region nr_weight norm_weight norm_weight3 norm_weight4
0 East 3698.703391 38.419768 45.0 10.0
1 North 1454.250000 15.105820 30.0 20.0
2 South 2801.889620 29.104239 45.0 30.0
3 West 6485.783333 67.370173 30.0 50.0