from samplics.datasets import load_nhanes2
from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam
Taylor-based Estimation
The samplics class TaylorEstimator uses Taylor linearization to estimate the variance of estimators of population parameters.
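
To make the idea concrete, the sketch below hand-computes a linearized standard error for a weighted mean under a stratified design, assuming PSUs are sampled with replacement within strata. It is a minimal illustration in plain Python, not samplics's internal implementation; the function name `linearized_mean_se` and the toy data are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(y, w, stratum, psu):
    """Weighted mean and its Taylor-linearized standard error (hypothetical helper).

    Assumes PSUs are sampled with replacement within strata and that
    every stratum contains at least two PSUs.
    """
    w_sum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Linearized score of the ratio sum(w*y)/sum(w), summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / w_sum

    # Between-PSU variance of the score totals, accumulated over strata.
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n_h = len(z)
        z_bar = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((zj - z_bar) ** 2 for zj in z)

    return mean, sqrt(variance)


# Toy data (made up): 2 strata, 2 PSUs per stratum, 2 units per PSU.
y = [10.0, 12.0, 9.0, 11.0, 20.0, 22.0, 19.0, 23.0]
w = [1.0, 1.0, 2.0, 2.0, 1.5, 1.5, 1.0, 1.0]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = linearized_mean_se(y, w, stratum, psu)
print(f"mean={mean:.4f}, se={se:.4f}")
```

As a sanity check on the formula, with equal weights, one stratum, and one unit per PSU, the result reduces to the familiar sample standard deviation divided by the square root of the sample size.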
# Load the NHANES sample data
nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

nhanes2.head(15)
| | stratid | psuid | race | highbp | highlead | zinc | diabetes | finalwgt |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 0 | NaN | 104.0 | 0.0 | 8995 |
1 | 1 | 1 | 1 | 0 | 0.0 | 111.0 | 0.0 | 25964 |
2 | 1 | 1 | 3 | 0 | NaN | 102.0 | 0.0 | 8752 |
3 | 1 | 1 | 1 | 1 | NaN | 109.0 | 1.0 | 4310 |
4 | 1 | 1 | 1 | 0 | 0.0 | 99.0 | 0.0 | 9011 |
5 | 1 | 1 | 1 | 1 | NaN | 101.0 | 0.0 | 4310 |
6 | 1 | 1 | 1 | 0 | 0.0 | 93.0 | 0.0 | 3201 |
7 | 1 | 1 | 1 | 1 | NaN | 83.0 | 0.0 | 25386 |
8 | 1 | 1 | 1 | 0 | NaN | 98.0 | 0.0 | 12102 |
9 | 1 | 1 | 2 | 0 | 0.0 | 98.0 | 0.0 | 4312 |
10 | 1 | 1 | 1 | 1 | NaN | 92.0 | 0.0 | 4031 |
11 | 1 | 1 | 2 | 0 | 0.0 | 90.0 | 0.0 | 3628 |
12 | 1 | 1 | 1 | 0 | NaN | 101.0 | 0.0 | 28590 |
13 | 1 | 1 | 1 | 0 | 0.0 | NaN | 0.0 | 22754 |
14 | 1 | 1 | 2 | 0 | 1.0 | 123.0 | 0.0 | 7119 |
Using samplics, we can estimate the average level of zinc in the blood as follows:
zinc_mean_str = TaylorEstimator(PopParam.mean)

zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

print(zinc_mean_str)
SAMPLICS - Estimation of Mean
Number of strata: 31
Number of psus: 62
Degree of freedom: 31
MEAN SE LCI UCI CV
87.182067 0.494483 86.173563 88.190571 0.005672
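
The columns of the printed summary are linked by simple relationships, which the following sketch checks using only the numbers shown above. The t-based interval and the PSUs-minus-strata degrees of freedom are inferred from the output rather than taken from the samplics documentation, so treat them as assumptions.

```python
# Values copied from the printed summary above.
mean, se = 87.182067, 0.494483
lci, uci = 86.173563, 88.190571
n_strata, n_psus = 31, 62

# Degrees of freedom: number of PSUs minus number of strata.
dof = n_psus - n_strata

# Coefficient of variation: standard error relative to the estimate.
cv = se / mean

# Half-width of the interval expressed in standard errors. A value close
# to 2.04 is consistent with a 95% Student-t interval on 31 degrees of freedom.
multiplier = (uci - mean) / se

print(dof, round(cv, 6), round(multiplier, 4))
```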
The results of the estimation are stored in the estimator object zinc_mean_str. Users can convert the main estimation results into a pd.DataFrame with the method to_dataframe().
zinc_mean_str.to_dataframe()
| | _param | _estimate | _stderror | _lci | _uci | _cv |
---|---|---|---|---|---|---|
0 | PopParam.mean | 87.182067 | 0.494483 | 86.173563 | 88.190571 | 0.005672 |
The method to_dataframe() is most useful for domain estimation, where it produces a table in which each row is a level of the domain of interest, as shown below.
zinc_mean_by_race = TaylorEstimator(PopParam.mean)

zinc_mean_by_race.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    domain=nhanes2["race"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

zinc_mean_by_race.to_dataframe()
| | _param | _domain | _estimate | _stderror | _lci | _uci | _cv |
---|---|---|---|---|---|---|---|
0 | PopParam.mean | 1 | 87.495389 | 0.479196 | 86.518062 | 88.472716 | 0.005477 |
1 | PopParam.mean | 2 | 85.085744 | 1.165209 | 82.709286 | 87.462203 | 0.013695 |
2 | PopParam.mean | 3 | 83.570910 | 1.585463 | 80.337338 | 86.804483 | 0.018971 |
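
Each row's point estimate is a weighted mean restricted to one domain level. The sketch below shows that part of the computation in plain Python on made-up data; `domain_means` is a hypothetical helper, and it deliberately omits the standard errors, which depend on the full sample design.

```python
from collections import defaultdict


def domain_means(y, w, domain):
    """Weighted mean of y within each domain level (hypothetical helper)."""
    wy_sum = defaultdict(float)
    w_sum = defaultdict(float)
    for yi, wi, d in zip(y, w, domain):
        if yi is None:  # skip missing responses, in the spirit of remove_nan=True
            continue
        wy_sum[d] += wi * yi
        w_sum[d] += wi
    return {d: wy_sum[d] / w_sum[d] for d in sorted(w_sum)}


# Made-up values, not NHANES data.
y = [90.0, 86.0, None, 84.0, 80.0]
w = [2.0, 2.0, 1.0, 1.0, 3.0]
domain = [1, 1, 1, 2, 2]

print(domain_means(y, w, domain))
```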
If we omit the stratum parameter, we get:
zinc_mean_nostr = TaylorEstimator(PopParam.mean)

zinc_mean_nostr.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

print(zinc_mean_nostr)
SAMPLICS - Estimation of Mean
Number of strata: 1
Number of psus: 2
Degree of freedom: 1
MEAN SE LCI UCI CV
87.182067 0.742622 77.746158 96.617976 0.008518
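
The drop from 62 PSUs to 2 suggests that the psuid labels repeat across strata, so without the stratum the estimator can only distinguish the distinct label values. The toy example below illustrates that effect; it is an interpretation of the output above, and the data are made up.

```python
# Made-up design: 3 strata, each with PSUs labelled 1 and 2.
stratum = [1, 1, 2, 2, 3, 3]
psu = [1, 2, 1, 2, 1, 2]

# With the stratum available, a PSU is identified by the (stratum, psu) pair.
n_psus_stratified = len(set(zip(stratum, psu)))

# Without it, only the distinct PSU labels remain.
n_psus_unstratified = len(set(psu))

# Degrees of freedom: PSUs minus strata (one implicit stratum when unstratified).
dof_stratified = n_psus_stratified - len(set(stratum))
dof_unstratified = n_psus_unstratified - 1

print(n_psus_stratified, dof_stratified)
print(n_psus_unstratified, dof_unstratified)
```

Assuming a 95% Student-t interval, a single degree of freedom pushes the multiplier to about 12.7, which is why the interval widens to roughly (77.75, 96.62) even though the point estimate is unchanged.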