```python
import numpy as np
import pandas as pd
from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight
```
Sample Weight Adjustments
The objective of this tutorial is to familiarize ourselves with SampleWeight, the samplics class for adjusting sample weights. In practice, it is necessary to adjust the base or design sample weights obtained directly from the random selection mechanism. These adjustments are done to correct for nonresponse, reduce the effects of extreme/large weights, better align the sample with known auxiliary information, and more. Specifically, in this tutorial we will:
- learn how to use the method adjust() to redistribute sample weights to account for nonresponse and unknown eligibility,
- learn how to use the method poststratify() to ensure that the sample weights sum to known control totals or that the relative distribution of domains is forced to match known distributions,
- learn how to use the method calibrate() to adjust sample weights using auxiliary information under a regression model,
- learn how to use the method normalize() to ensure that the sample weights sum to known constants.
To run the code in this notebook, we will use the dataset that was developed in the previous tutorial on sample selection.
Design (base) weight
The design weight is the inverse of the overall probability of selection, which is the product of the first-stage and second-stage selection probabilities; that is, \(w_i = 1/\pi_i\) with \(\pi_i = \pi^{(1)}_i \times \pi^{(2)}_i\).
```python
# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

psu_sample
```

```python
# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

ssu_sample
```
```python
full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)
full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

full_sample.head(15)
```
 | cluster | region | psu_prob | household | ssu_prob | inclusion_prob | design_weight |
---|---|---|---|---|---|---|---|
0 | 7 | North | 0.187726 | 72 | 0.115385 | 0.021661 | 46.166667 |
1 | 7 | North | 0.187726 | 73 | 0.115385 | 0.021661 | 46.166667 |
2 | 7 | North | 0.187726 | 75 | 0.115385 | 0.021661 | 46.166667 |
3 | 7 | North | 0.187726 | 715 | 0.115385 | 0.021661 | 46.166667 |
4 | 7 | North | 0.187726 | 722 | 0.115385 | 0.021661 | 46.166667 |
5 | 7 | North | 0.187726 | 724 | 0.115385 | 0.021661 | 46.166667 |
6 | 7 | North | 0.187726 | 755 | 0.115385 | 0.021661 | 46.166667 |
7 | 7 | North | 0.187726 | 761 | 0.115385 | 0.021661 | 46.166667 |
8 | 7 | North | 0.187726 | 764 | 0.115385 | 0.021661 | 46.166667 |
9 | 7 | North | 0.187726 | 782 | 0.115385 | 0.021661 | 46.166667 |
10 | 7 | North | 0.187726 | 795 | 0.115385 | 0.021661 | 46.166667 |
11 | 7 | North | 0.187726 | 7111 | 0.115385 | 0.021661 | 46.166667 |
12 | 7 | North | 0.187726 | 7112 | 0.115385 | 0.021661 | 46.166667 |
13 | 7 | North | 0.187726 | 7117 | 0.115385 | 0.021661 | 46.166667 |
14 | 7 | North | 0.187726 | 7123 | 0.115385 | 0.021661 | 46.166667 |
For the purpose of illustrating how nonresponse is handled, we first need to introduce some household nonresponse into our example. That is, we simulate the nonresponse status and store it in the variable response_status, which has four possible values:
- ineligible, indicating that the sampling unit is not eligible for the survey,
- respondent, indicating that the sampling unit responded to the survey,
- non-respondent, indicating that the sampling unit did not respond to the survey, and
- unknown, indicating that we are not able to infer the status of the sampling unit, i.e. we do not know whether or not the sampling unit is eligible for the survey.
```python
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05)
)

full_sample[
    ["cluster", "region", "design_weight", "response_status"]
].head(15)
```
 | cluster | region | design_weight | response_status |
---|---|---|---|---|
0 | 7 | North | 46.166667 | ineligible |
1 | 7 | North | 46.166667 | respondent |
2 | 7 | North | 46.166667 | respondent |
3 | 7 | North | 46.166667 | respondent |
4 | 7 | North | 46.166667 | unknown |
5 | 7 | North | 46.166667 | respondent |
6 | 7 | North | 46.166667 | respondent |
7 | 7 | North | 46.166667 | ineligible |
8 | 7 | North | 46.166667 | respondent |
9 | 7 | North | 46.166667 | respondent |
10 | 7 | North | 46.166667 | respondent |
11 | 7 | North | 46.166667 | non-respondent |
12 | 7 | North | 46.166667 | respondent |
13 | 7 | North | 46.166667 | ineligible |
14 | 7 | North | 46.166667 | respondent |
Nonresponse adjustment
In general, the sample weights are adjusted to redistribute the weights of eligible units that did not sufficiently respond to the survey (unit-level nonresponse) to the sampling units that did sufficiently respond. This adjustment is done within adjustment classes or domains. Note that the determination of the response categories (unit response, item response, ineligible, etc.) is outside the scope of this tutorial.

The weights of the sampling units with unknown eligibility are likewise redistributed to the rest of the sampling units. In general, ineligible sampling units also receive weight from the sampling units with unknown eligibility, since ineligible units can be part of the unknown pool.

The method adjust() has a boolean parameter unknown_to_inelig that controls how the sample weights of the unknowns are redistributed. By default, adjust() redistributes part of the unknowns' weight to the ineligibles (unknown_to_inelig=True). If we do not wish to redistribute any of the unknowns' weight to the ineligibles, we simply set the flag to False (unknown_to_inelig=False).

In the snippet of code below, we adjust the weights within clusters, that is, we use the clusters as our adjustment classes. Note that we run the nonresponse adjustment twice: the first time with unknown_to_inelig=True (nr_weight) and the second time with the flag set to False (nr_weight2). With unknown_to_inelig=True, the ineligibles receive part of the sample weight of the unknowns; hence, the adjusted weights of the respondents are smaller than when the flag is False. With unknown_to_inelig=False, the ineligibles do not receive any weight from the unknowns; hence, the sample weights of the ineligible units are the same before and after adjustment. In a real survey, the statistician decides on the best nonresponse strategy based on the available information.
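To make the redistribution concrete, here is a rough numpy sketch of the two-step arithmetic for a single adjustment class under the scheme just described; it reproduces the cluster 7 numbers shown below, but it is a sketch, not samplics' internal implementation.

```python
import numpy as np

def nr_adjust_one_class(wgt, status, unknown_to_inelig=True):
    """Rough sketch of the two-step nonresponse adjustment for ONE class."""
    wgt = np.asarray(wgt, dtype=float)
    in_, rr, nr, uk = (status == s for s in ("in", "rr", "nr", "uk"))
    adj = wgt.copy()
    # Step 1: spread the weight of the unknowns over the receiving units
    receivers = (in_ | rr | nr) if unknown_to_inelig else (rr | nr)
    adj[receivers] *= (wgt[receivers].sum() + wgt[uk].sum()) / wgt[receivers].sum()
    adj[uk] = 0.0
    # Step 2: spread the weight of the non-respondents over the respondents
    adj[rr] *= (adj[rr].sum() + adj[nr].sum()) / adj[rr].sum()
    adj[nr] = 0.0
    return adj

# Cluster 7: 3 ineligibles, 10 respondents, 1 non-respondent, 1 unknown,
# each with design weight 46.166667
status = np.array(["in"] * 3 + ["rr"] * 10 + ["nr", "uk"])
print(nr_adjust_one_class(np.repeat(46.166667, 15), status))
# ineligibles -> ~49.46 and respondents -> ~54.41, matching nr_weight below
```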
```python
status_mapping = {
    "in": "ineligible",
    "rr": "respondent",
    "nr": "non-respondent",
    "uk": "unknown"
}

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
)

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"],
    adj_class=full_sample[["region", "cluster"]],
    resp_status=full_sample["response_status"],
    resp_dict=status_mapping,
    unknown_to_inelig=False,
)

full_sample[[
    "cluster",
    "region",
    "design_weight",
    "response_status",
    "nr_weight",
    "nr_weight2"
]].drop_duplicates().head(15)
```
 | cluster | region | design_weight | response_status | nr_weight | nr_weight2 |
---|---|---|---|---|---|---|
0 | 7 | North | 46.166667 | ineligible | 49.464286 | 46.166667 |
1 | 7 | North | 46.166667 | respondent | 54.410714 | 55.400000 |
4 | 7 | North | 46.166667 | unknown | 0.000000 | 0.000000 |
11 | 7 | North | 46.166667 | non-respondent | 0.000000 | 0.000000 |
15 | 10 | North | 50.783333 | non-respondent | 0.000000 | 0.000000 |
16 | 10 | North | 50.783333 | respondent | 70.733929 | 71.096667 |
19 | 10 | North | 50.783333 | ineligible | 54.410714 | 50.783333 |
21 | 10 | North | 50.783333 | unknown | 0.000000 | 0.000000 |
30 | 16 | South | 62.149123 | respondent | 66.588346 | 66.588346 |
35 | 16 | South | 62.149123 | non-respondent | 0.000000 | 0.000000 |
45 | 24 | South | 58.940741 | respondent | 63.852469 | 63.852469 |
47 | 24 | South | 58.940741 | non-respondent | 0.000000 | 0.000000 |
55 | 24 | South | 58.940741 | ineligible | 58.940741 | 58.940741 |
60 | 29 | South | 65.702778 | unknown | 0.000000 | 0.000000 |
61 | 29 | South | 65.702778 | respondent | 101.081197 | 102.204321 |
The default call of adjust() expects standard codes for the response status, that is, "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.
In the call above, if we omit the parameter resp_dict, the run fails with an assertion error. The current error message is the following: The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter. For the call to run without resp_dict, the response status must take values only in the standard codes, i.e. ("in", "rr", "nr", "uk"). The variable associated with resp_status can contain any codes, but a mapping is necessary when the response variable is not constructed using the standard codes.
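As a quick demonstration of that failure mode (a sketch; the message printed is samplics' own validation error quoted above):

```python
# Omitting resp_dict while resp_status uses non-standard codes fails
try:
    SampleWeight().adjust(
        samp_weight=full_sample["design_weight"],
        adj_class=full_sample[["region", "cluster"]],
        resp_status=full_sample["response_status"],  # "respondent", etc.
    )
except AssertionError as err:
    print(err)
```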
To further illustrate the mapping of response status, let’s assume that we have response_status2 which has the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown.
```python
response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"] == "non-respondent"] = 200
response_status2[full_sample["response_status"] == "respondent"] = 300
response_status2[full_sample["response_status"] == "unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])
```
response_status2 | ineligible | non-respondent | respondent | unknown |
---|---|---|---|---|
100 | 16 | 0 | 0 | 0 |
200 | 0 | 23 | 0 | 0 |
300 | 0 | 0 | 106 | 0 |
999 | 0 | 0 | 0 | 5 |
To use response_status2, we need to map the values 100, 200, 300 and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below using the Python dictionary status_mapping2. Using status_mapping2 in the call to adjust() leads to the same adjustment as in the previous run, i.e. nr_weight and nr_weight3 contain the same adjusted weights.
= {"in": 100, "nr": 200, "rr": 300, "uk": 999}
status_mapping2
"nr_weight3"] = SampleWeight().adjust(
full_sample[=full_sample["design_weight"],
samp_weight=full_sample[["region", "cluster"]],
adj_class=response_status2,
resp_status=status_mapping2,
resp_dict
)
full_sample["cluster", "region", "response_status", "nr_weight", "nr_weight3"]
[ ].drop_duplicates().head()
 | cluster | region | response_status | nr_weight | nr_weight3 |
---|---|---|---|---|---|
0 | 7 | North | ineligible | 49.464286 | 49.464286 |
1 | 7 | North | respondent | 54.410714 | 54.410714 |
4 | 7 | North | unknown | 0.000000 | 0.000000 |
11 | 7 | North | non-respondent | 0.000000 | 0.000000 |
15 | 10 | North | non-respondent | 0.000000 | 0.000000 |
If the response status variable only takes values in "in", "nr", "rr" and "uk", then it is not necessary to provide the mapping dictionary, i.e. resp_dict can be omitted from the call to adjust().
= np.repeat("in", full_sample["response_status"].shape[0])
response_status3 "response_status"] == "non-respondent"] = "nr"
response_status3[full_sample["response_status"] == "respondent"] = "rr"
response_status3[full_sample["response_status"] == "unknown"] = "uk"
response_status3[full_sample[
"nr_weight4"] = SampleWeight().adjust(
full_sample[=full_sample["design_weight"],
samp_weight=full_sample[["region", "cluster"]],
adj_class=response_status3,
resp_status
)
full_sample["cluster", "region", "response_status", "nr_weight", "nr_weight4"]
[ ].drop_duplicates().head()
 | cluster | region | response_status | nr_weight | nr_weight4 |
---|---|---|---|---|---|
0 | 7 | North | ineligible | 49.464286 | 49.464286 |
1 | 7 | North | respondent | 54.410714 | 54.410714 |
4 | 7 | North | unknown | 0.000000 | 0.000000 |
11 | 7 | North | non-respondent | 0.000000 | 0.000000 |
15 | 10 | North | non-respondent | 0.000000 | 0.000000 |
```python
# Drop a couple of variables that are not
# needed for the rest of the tutorial
full_sample.drop(
    columns=[
        "psu_prob",
        "ssu_prob",
        "inclusion_prob",
        "nr_weight2",
        "nr_weight3",
        "nr_weight4"
    ],
    inplace=True
)
```
Poststratification
Poststratification is useful to compensate for under-representation of the sample or to correct for nonsampling errors. The most common poststratification method consists of adjusting the sample weights so that they sum to known control values from reliable sources within adjustment classes (domains). Poststratification classes can be formed using variables beyond the ones involved in the sampling design. For example, socio-economic variables such as age group, gender, race and education are often used to form poststratification classes/cells.

Poststratifying to totals that are out of date, and thus likely inaccurate or unreliable, may not improve the estimates. Use this technique with caution.

Let's assume that we have a reliable external source, e.g. a recent census, that provides the number of households by region. The external source has the following control data: 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

We use the method poststratify() to ensure that the poststratified sample weights (ps_weight) sum to the known control totals by region. Note that the control totals are provided using the Python dictionary census_households.
= {"East": 3700, "North": 1500, "South": 2800, "West": 6500}
census_households
"ps_weight"] = SampleWeight().poststratify(
full_sample[=full_sample["nr_weight"],
samp_weight=census_households,
control=full_sample["region"]
domain
)
15) full_sample.head(
 | cluster | region | household | design_weight | response_status | nr_weight | ps_weight |
---|---|---|---|---|---|---|---|
0 | 7 | North | 72 | 46.166667 | ineligible | 49.464286 | 51.020408 |
1 | 7 | North | 73 | 46.166667 | respondent | 54.410714 | 56.122449 |
2 | 7 | North | 75 | 46.166667 | respondent | 54.410714 | 56.122449 |
3 | 7 | North | 715 | 46.166667 | respondent | 54.410714 | 56.122449 |
4 | 7 | North | 722 | 46.166667 | unknown | 0.000000 | 0.000000 |
5 | 7 | North | 724 | 46.166667 | respondent | 54.410714 | 56.122449 |
6 | 7 | North | 755 | 46.166667 | respondent | 54.410714 | 56.122449 |
7 | 7 | North | 761 | 46.166667 | ineligible | 49.464286 | 51.020408 |
8 | 7 | North | 764 | 46.166667 | respondent | 54.410714 | 56.122449 |
9 | 7 | North | 782 | 46.166667 | respondent | 54.410714 | 56.122449 |
10 | 7 | North | 795 | 46.166667 | respondent | 54.410714 | 56.122449 |
11 | 7 | North | 7111 | 46.166667 | non-respondent | 0.000000 | 0.000000 |
12 | 7 | North | 7112 | 46.166667 | respondent | 54.410714 | 56.122449 |
13 | 7 | North | 7117 | 46.166667 | ineligible | 49.464286 | 51.020408 |
14 | 7 | North | 7123 | 46.166667 | respondent | 54.410714 | 56.122449 |
The snippet of code below shows that the poststratified sample weights sum to the expected control totals, that is, 3700 households for East, 1500 for North, 2800 for South and 6500 for West.
```python
sum_of_weights = full_sample[
    ["region", "nr_weight", "ps_weight"]
].groupby("region").sum()
sum_of_weights.reset_index(inplace=True)

sum_of_weights.head()
```
 | region | nr_weight | ps_weight |
---|---|---|---|
0 | East | 3698.703391 | 3700.0 |
1 | North | 1454.250000 | 1500.0 |
2 | South | 2801.889620 | 2800.0 |
3 | West | 6485.783333 | 6500.0 |
The cross-tabulation below shows that exactly one adjustment factor was calculated and applied per adjustment class (region).
"ps_adj_fct"] = \
full_sample[round(full_sample["ps_weight"] / full_sample["nr_weight"], 12)
"ps_adj_fct"], full_sample["region"]) pd.crosstab(full_sample[
ps_adj_fct | East | North | South | West |
---|---|---|---|---|
0.999326 | 0 | 0 | 38 | 0 |
1.000351 | 37 | 0 | 0 | 0 |
1.002192 | 0 | 0 | 0 | 23 |
1.031460 | 0 | 24 | 0 | 0 |
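These factors can be reproduced by hand: within each region, the poststratification factor is simply the control total divided by the current sum of the nonresponse-adjusted weights. A minimal check using the census_households dictionary defined above:

```python
# factor_d = control_total_d / current_weight_sum_d
nr_sums = full_sample.groupby("region")["nr_weight"].sum()
print(pd.Series(census_households) / nr_sums)
# East ~1.000351, North ~1.031460, South ~0.999326, West ~1.002192
```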
In some surveys, there is interest in keeping the relative distribution of the strata aligned with some known distribution. For example, WHO EPI vaccination surveys often poststratify sample weights to ensure that the relative sizes of the strata reflect official statistics, e.g. census data. In most cases, the strata are based on administrative divisions.

For example, assume that according to census data East contains 25% of the households, North 10%, South 20% and West 45%. We can poststratify using the snippet of code below.
= {"East": 0.25, "North": 0.10, "South": 0.20, "West": 0.45}
known_ratios "ps_weight2"] = SampleWeight().poststratify(
full_sample[=full_sample["nr_weight"],
samp_weight=known_ratios,
factor=full_sample["region"]
domain
)
full_sample.head()
 | cluster | region | household | design_weight | response_status | nr_weight | ps_weight | ps_adj_fct | ps_weight2 |
---|---|---|---|---|---|---|---|---|---|
0 | 7 | North | 72 | 46.166667 | ineligible | 49.464286 | 51.020408 | 1.03146 | 49.117777 |
1 | 7 | North | 73 | 46.166667 | respondent | 54.410714 | 56.122449 | 1.03146 | 54.029554 |
2 | 7 | North | 75 | 46.166667 | respondent | 54.410714 | 56.122449 | 1.03146 | 54.029554 |
3 | 7 | North | 715 | 46.166667 | respondent | 54.410714 | 56.122449 | 1.03146 | 54.029554 |
4 | 7 | North | 722 | 46.166667 | unknown | 0.000000 | 0.000000 | NaN | 0.000000 |
```python
sum_of_weights2 = full_sample[
    ["region", "nr_weight", "ps_weight2"]
].groupby("region").sum()
sum_of_weights2.reset_index(inplace=True)
sum_of_weights2["ratio"] = \
    sum_of_weights2["ps_weight2"] / sum(sum_of_weights2["ps_weight2"])

sum_of_weights2.head()
```
 | region | nr_weight | ps_weight2 | ratio |
---|---|---|---|---|
0 | East | 3698.703391 | 3610.156586 | 0.25 |
1 | North | 1454.250000 | 1444.062634 | 0.10 |
2 | South | 2801.889620 | 2888.125269 | 0.20 |
3 | West | 6485.783333 | 6498.281855 | 0.45 |
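Here too the arithmetic is easy to verify by hand: with factor=known_ratios, each domain's weights are scaled so that the domain's share of the (unchanged) overall weight sum equals the given ratio. A minimal check, assuming that interpretation:

```python
# target_sum_d = ratio_d * overall_weight_sum
total = full_sample["nr_weight"].sum()
print(pd.Series(known_ratios) * total)
# East ~3610.16, North ~1444.06, South ~2888.13, West ~6498.28
```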
Calibration
Calibration is a more general approach for adjusting sample weights to sum to known constants. In this tutorial, we consider the generalized regression (GREG) class of calibration estimators. Assume that we have the estimator \(\hat{\mathbf{Y}} = \sum_{i \in s} w_i y_i\) and that known population totals \(\mathbf{X} = (X_1, ..., X_p)^T\) are available for a vector of auxiliary variables. Working under the model \(Y_i | \mathbf{x}_i = \mathbf{x}^T_i \mathbf{\beta} + \epsilon_i\), the GREG estimator of the population total is

\[\hat{\mathbf{Y}}_{GR} = \hat{\mathbf{Y}} + (\mathbf{X} - \hat{\mathbf{X}})^T\hat{\mathbf{B}}\]

where \(\hat{\mathbf{B}}\) is the weighted least squares estimate of \(\mathbf{\beta}\) and \(\hat{\mathbf{X}}\) is the survey estimate of \(\mathbf{X}\). The essence of the GREG approach is to find, under the regression model, adjusted weights \(w^{*}_i\) that are as close as possible to the \(w_i\), that is, that minimize the chi-square distance \(h(\mathbf{z}) = \sum_{i \in s} \frac{c_i (w_i - z_i)^2}{w_i}\) subject to the calibration constraint \(\sum_{i \in s} w^{*}_i \mathbf{x}_i = \mathbf{X}\).
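To build intuition, here is a self-contained numpy sketch of the textbook chi-square calibration solution with \(c_i = 1\): \(w^{*}_i = w_i(1 + \mathbf{x}^T_i \hat{\mathbf{\lambda}})\), where \(\hat{\mathbf{\lambda}}\) solves \(\left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}^T_i\right) \mathbf{\lambda} = \mathbf{X} - \hat{\mathbf{X}}\). The weights, covariates and totals below are made up for illustration; this is not samplics' internal code.

```python
import numpy as np

# Made-up base weights and two auxiliary variables for 8 sample units
w = np.array([46.2, 50.8, 62.1, 58.9, 65.7, 49.5, 54.4, 70.7])
x = np.column_stack([np.tile([0.0, 1.0], 4), np.arange(8.0) % 3])

x_hat = x.T @ w          # survey estimates of the auxiliary totals
x_known = x_hat * 1.05   # pretend the known totals are 5% higher

# lambda solves (sum_i w_i x_i x_i^T) lambda = X - X_hat
lam = np.linalg.solve((w[:, None] * x).T @ x, x_known - x_hat)
w_star = w * (1 + x @ lam)

print(x.T @ w_star)      # matches x_known: the weights are calibrated
```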
Let us simulate three auxiliary variables, namely education, poverty and under_five (the number of children under five in the household), and assume that we have the following control totals.

Total number of children under five: 6300 in the East, 4000 in the North, 6500 in the South and 14000 in the West.

Poverty (Yes: in poverty / No: not in poverty)

Region | Poverty | Number of households
---|---|---
East | No | 2600
East | Yes | 1200
North | No | 1500
North | Yes | 200
South | No | 1800
South | Yes | 1100
West | No | 4500
West | Yes | 2200

Education (Low: less than secondary, Medium: secondary completed, High: more than secondary)

Region | Education | Number of households
---|---|---
East | Low | 2000
East | Medium | 1400
East | High | 350
North | Low | 550
North | Medium | 700
North | High | 250
South | Low | 1300
South | Medium | 1200
South | High | 350
West | Low | 2100
West | Medium | 4000
West | High | 500
```python
np.random.seed(150)
full_sample["education"] = np.random.choice(
    ("Low", "Medium", "High"),
    size=150,
    p=(0.40, 0.50, 0.10)
)
full_sample["poverty"] = np.random.choice((0, 1), size=150, p=(0.70, 0.30))
full_sample["under_five"] = np.random.choice(
    (0, 1, 2, 3, 4, 5),
    size=150,
    p=(0.05, 0.35, 0.25, 0.20, 0.10, 0.05)
)

full_sample[[
    "cluster",
    "region",
    "household",
    "nr_weight",
    "education",
    "poverty",
    "under_five"
]].head()
```
 | cluster | region | household | nr_weight | education | poverty | under_five |
---|---|---|---|---|---|---|---|
0 | 7 | North | 72 | 49.464286 | High | 1 | 1 |
1 | 7 | North | 73 | 54.410714 | Low | 0 | 3 |
2 | 7 | North | 75 | 54.410714 | Medium | 0 | 2 |
3 | 7 | North | 715 | 54.410714 | Medium | 1 | 2 |
4 | 7 | North | 722 | 0.000000 | Medium | 0 | 2 |
We will now calibrate the nonresponse-adjusted weight (nr_weight) to ensure that the estimated number of households in poverty is equal to 4,700 and the estimated total number of children under five is equal to 30,800. The control values 4,700 and 30,800 are obtained from the tables above.
The class SampleWeight uses the method calibrate(samp_weight, aux_vars, control, domain, scale, bounded, modified) to adjust the weights using the GREG approach.
- The control values must be stored in a Python dictionary, i.e. totals = {"poverty": 4700, "under_five": 30800}. In this case, we have two numerical variables: poverty with values in {0, 1} and under_five with values in {0, 1, 2, 3, 4, 5}.
- aux_vars is the matrix of covariates.
= {"poverty": 4700, "under_five": 30800}
totals
"calib_weight"] = SampleWeight().calibrate(
full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals
full_sample[
)
"cluster", "region", "household", "nr_weight", "calib_weight"]].head(15) full_sample[[
 | cluster | region | household | nr_weight | calib_weight |
---|---|---|---|---|---|
0 | 7 | North | 72 | 49.464286 | 50.432441 |
1 | 7 | North | 73 | 54.410714 | 57.233887 |
2 | 7 | North | 75 | 54.410714 | 56.292829 |
3 | 7 | North | 715 | 54.410714 | 56.416743 |
4 | 7 | North | 722 | 0.000000 | 0.000000 |
5 | 7 | North | 724 | 54.410714 | 57.233887 |
6 | 7 | North | 755 | 54.410714 | 57.233887 |
7 | 7 | North | 761 | 49.464286 | 49.464286 |
8 | 7 | North | 764 | 54.410714 | 56.292829 |
9 | 7 | North | 782 | 54.410714 | 57.233887 |
10 | 7 | North | 795 | 54.410714 | 58.174944 |
11 | 7 | North | 7111 | 0.000000 | 0.000000 |
12 | 7 | North | 7112 | 54.410714 | 59.116002 |
13 | 7 | North | 7117 | 49.464286 | 50.319793 |
14 | 7 | North | 7123 | 54.410714 | 56.292829 |
We can confirm that the estimated totals for the auxiliary variables are equal to their control values.
= full_sample["poverty"]
poverty = full_sample["under_five"]
under_5 = full_sample["nr_weight"]
nr_weight = full_sample["calib_weight"]
calib_weight
print(
f"""\nTotal estimated number of poor households was
{sum(poverty*nr_weight):.2f} before adjustment and
{sum(poverty*calib_weight):.2f} after adjustment.\n"""
)print(
f"""Total estimated number of children under 5 was
{sum(under_5*nr_weight):.2f} before adjustment and
{sum(under_5*calib_weight):.2f} after adjustment.\n"""
)
Total estimated number of poor households was
4521.84 before adjustment and
4700.00 after adjustment.
Total estimated number of children under 5 was
29442.52 before adjustment and
30800.00 after adjustment.
If we want to control by domain, we can do so using the parameter domain of calibrate(). First, we need to update the Python dictionary holding the control values: those values now have to be provided for each domain. The dictionary becomes a nested dictionary, where the top-level keys hold the domain values, i.e. East, North, South and West, and the top-level values are dictionaries mapping the auxiliary variables to their control values.
```python
totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

full_sample["calib_weight_d"] = SampleWeight().calibrate(
    full_sample["nr_weight"],
    full_sample[["poverty", "under_five"]],
    totals_by_domain,
    full_sample["region"]
)

full_sample[[
    "cluster",
    "region",
    "household",
    "nr_weight",
    "calib_weight",
    "calib_weight_d"
]].head(15)
```
 | cluster | region | household | nr_weight | calib_weight | calib_weight_d |
---|---|---|---|---|---|---|
0 | 7 | North | 72 | 49.464286 | 50.432441 | 40.892864 |
1 | 7 | North | 73 | 54.410714 | 57.233887 | 61.852139 |
2 | 7 | North | 75 | 54.410714 | 56.292829 | 59.371664 |
3 | 7 | North | 715 | 54.410714 | 56.416743 | 47.462625 |
4 | 7 | North | 722 | 0.000000 | 0.000000 | 0.000000 |
5 | 7 | North | 724 | 54.410714 | 57.233887 | 61.852139 |
6 | 7 | North | 755 | 54.410714 | 57.233887 | 61.852139 |
7 | 7 | North | 761 | 49.464286 | 49.464286 | 49.464286 |
8 | 7 | North | 764 | 54.410714 | 56.292829 | 59.371664 |
9 | 7 | North | 782 | 54.410714 | 57.233887 | 61.852139 |
10 | 7 | North | 795 | 54.410714 | 58.174944 | 64.332614 |
11 | 7 | North | 7111 | 0.000000 | 0.000000 | 0.000000 |
12 | 7 | North | 7112 | 54.410714 | 59.116002 | 66.813089 |
13 | 7 | North | 7117 | 49.464286 | 50.319793 | 51.719263 |
14 | 7 | North | 7123 | 54.410714 | 56.292829 | 59.371664 |
Note that the GREG domain estimates above do not have the additive property, that is, the GREG domain estimates do not sum to the overall GREG estimate. To illustrate this, let's assume that we want to estimate the number of households.
print(f"\nThe number of households using the overall GREG is: \
{sum(full_sample['calib_weight']):.2f} \n")
print(f"The number of households using the domain GREG is: \
{sum(full_sample['calib_weight_d']):.2f} \n")
The number of households using the overall GREG is: 14960.15
The number of households using the domain GREG is: 14959.01
If the additive flag is set to True, the sum of the domain estimates will be equal to the overall GREG estimate. In this case, calibrate() returns an array of adjusted weights with one column per domain, as the snippet below illustrates.
```python
totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300},
    "North": {"poverty": 200, "under_five": 4000},
    "South": {"poverty": 1100, "under_five": 6500},
    "West": {"poverty": 2200, "under_five": 14000},
}

calib_weight3 = SampleWeight().calibrate(
    full_sample["nr_weight"],
    full_sample[["poverty", "under_five"]],
    totals_by_domain,
    full_sample["region"],
    additive=True,
)

under_5 = np.array(full_sample["under_five"])
print(f"\nEach column can be used to estimate a domain: \
{np.sum(np.transpose(calib_weight3) * under_5, axis=1)}\n")
```
Each column can be used to estimate a domain: [ 6300. 4000. 6500. 14000.]
print(f"The number of households using the overall GREG is: \
{sum(full_sample['calib_weight']):.2f} \n")
The number of households using the overall GREG is: 14960.15
print(f"The number of households using the domain GREG is: \
{sum(full_sample['calib_weight_d']):.2f} (additive=False)\n")
The number of households using the domain GREG is: 14959.01 (additive=False)
print(f"The number of households using the domain GREG is: \
{np.sum(np.transpose(calib_weight3)):.2f} (additive=True) \n")
The number of households using the domain GREG is: 14960.15 (additive=True)
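Based on the shapes used above (one column of calib_weight3 per domain), each column sum should estimate that domain's number of households; this is an assumption drawn from the code above, not a documented API guarantee.

```python
# Column sums = estimated households per domain (East, North, South, West),
# assuming calib_weight3 holds one adjusted-weight column per domain
print(np.sum(calib_weight3, axis=0))
```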
Normalization
DHS and MICS normalize the final sample weights to sum to the sample size. We can use the class method normalize() to ensure that the sample weights sum to some constant across the sample or by normalization domain, e.g. stratum.

Normalization is included here mostly for completeness; it is seldom the case that sample weights are normalized in large-scale household surveys. One major downside of normalized weights is that the estimation of totals does not make sense with them.
"norm_weight"] = \
full_sample[=full_sample["nr_weight"])
SampleWeight().normalize(samp_weight"cluster", "region", "nr_weight", "norm_weight"]].head(25)
full_sample[[
print((full_sample.shape[0], full_sample["norm_weight"].sum()))
(150, 150.0)
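Under the hood this default appears to be a single global rescaling; a minimal check, assuming the factor is simply n divided by the sum of the input weights:

```python
# Assumed default: factor = n / sum(weights)
n = full_sample.shape[0]
factor = n / full_sample["nr_weight"].sum()
print(np.allclose(full_sample["nr_weight"] * factor, full_sample["norm_weight"]))
# expected: True
```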
When normalize() is called with only the parameter samp_weight, the sample weights are normalized to sum to the length of the sample weight vector. To make the weights sum to a different constant, we pass that constant via the parameter control, as below.
"norm_weight2"] = \
full_sample[
SampleWeight().normalize(=full_sample["nr_weight"],
samp_weight=300
control
)
print(full_sample["norm_weight2"].sum())
300.0
"norm_weight3"] = SampleWeight().normalize(
full_sample[=full_sample["nr_weight"],
samp_weight=full_sample["region"]
domain
)
= full_sample.groupby(["region"]).sum()
weight_sum =True)
weight_sum.reset_index(inplace"region", "nr_weight", "norm_weight", "norm_weight3"]] weight_sum[[
 | region | nr_weight | norm_weight | norm_weight3 |
---|---|---|---|---|
0 | East | 3698.703391 | 38.419768 | 45.0 |
1 | North | 1454.250000 | 15.105820 | 30.0 |
2 | South | 2801.889620 | 29.104239 | 45.0 |
3 | West | 6485.783333 | 67.370173 | 30.0 |
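The norm_weight3 sums (45, 30, 45, 30) suggest that, when a domain is given without a control, the weights are normalized to the number of sampled units within each domain. A quick check under that assumption:

```python
# Sampled households per region (should match the norm_weight3 sums above)
print(full_sample.groupby("region").size())
```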
= {"East": 10, "North": 20, "South": 30, "West": 50}
norm_level
"norm_weight4"] = SampleWeight().normalize(
full_sample[=full_sample["nr_weight"],
samp_weight=norm_level,
control=full_sample["region"]
domain
)
= full_sample.groupby(["region"]).sum()
weight_sum =True)
weight_sum.reset_index(inplace"region", "nr_weight", "norm_weight", "norm_weight3", "norm_weight4",]] weight_sum[[
 | region | nr_weight | norm_weight | norm_weight3 | norm_weight4 |
---|---|---|---|---|---|
0 | East | 3698.703391 | 38.419768 | 45.0 | 10.0 |
1 | North | 1454.250000 | 15.105820 | 30.0 | 20.0 |
2 | South | 2801.889620 | 29.104239 | 45.0 | 30.0 |
3 | West | 6485.783333 | 67.370173 | 30.0 | 50.0 |