from samplics.datasets import load_nhanes2
from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam

Taylor-based Estimation
Samplics’s class TaylorEstimator uses linearization methods to estimate the variance of population parameters.
# Load the NHANES sample data
nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]
nhanes2.head(15)

| | stratid | psuid | race | highbp | highlead | zinc | diabetes | finalwgt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | NaN | 104.0 | 0.0 | 8995 |
| 1 | 1 | 1 | 1 | 0 | 0.0 | 111.0 | 0.0 | 25964 |
| 2 | 1 | 1 | 3 | 0 | NaN | 102.0 | 0.0 | 8752 |
| 3 | 1 | 1 | 1 | 1 | NaN | 109.0 | 1.0 | 4310 |
| 4 | 1 | 1 | 1 | 0 | 0.0 | 99.0 | 0.0 | 9011 |
| 5 | 1 | 1 | 1 | 1 | NaN | 101.0 | 0.0 | 4310 |
| 6 | 1 | 1 | 1 | 0 | 0.0 | 93.0 | 0.0 | 3201 |
| 7 | 1 | 1 | 1 | 1 | NaN | 83.0 | 0.0 | 25386 |
| 8 | 1 | 1 | 1 | 0 | NaN | 98.0 | 0.0 | 12102 |
| 9 | 1 | 1 | 2 | 0 | 0.0 | 98.0 | 0.0 | 4312 |
| 10 | 1 | 1 | 1 | 1 | NaN | 92.0 | 0.0 | 4031 |
| 11 | 1 | 1 | 2 | 0 | 0.0 | 90.0 | 0.0 | 3628 |
| 12 | 1 | 1 | 1 | 0 | NaN | 101.0 | 0.0 | 28590 |
| 13 | 1 | 1 | 1 | 0 | 0.0 | NaN | 0.0 | 22754 |
| 14 | 1 | 1 | 2 | 0 | 1.0 | 123.0 | 0.0 | 7119 |
Using samplics, we can estimate the average level of zinc in the blood as follows:
zinc_mean_str = TaylorEstimator(PopParam.mean)
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)
print(zinc_mean_str)

SAMPLICS - Estimation of Mean
Number of strata: 31
Number of psus: 62
Degree of freedom: 31
MEAN SE LCI UCI CV
87.182067 0.494483 86.173563 88.190571 0.005672
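The columns of the printed summary are linked by simple arithmetic, which can be checked by hand. The sketch below uses only the numbers printed above. It assumes, although the output does not say so, that the interval is symmetric (`estimate ± multiplier × SE`) and that the degrees of freedom equal the number of PSUs minus the number of strata. The variable names are illustrative and are not part of samplics.

```python
# Values copied from the printed summary above
mean, se = 87.182067, 0.494483
lci, uci = 86.173563, 88.190571
n_strata, n_psus = 31, 62

# Coefficient of variation: standard error relative to the estimate
cv = se / mean

# Degrees of freedom under the assumed rule "PSUs minus strata"
dof = n_psus - n_strata

# Multiplier implied by the interval, assuming it is symmetric around the mean
multiplier = (uci - lci) / (2 * se)

print(f"cv = {cv:.6f}")
print(f"dof = {dof}")
print(f"implied multiplier = {multiplier:.4f}")
```

The implied multiplier is about 2.04, which is consistent with a 95% interval built from the Student t quantile with 31 degrees of freedom (about 2.0395) rather than from the normal quantile 1.96.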
The estimation results are stored in the object zinc_mean_str. Users can convert the main estimation information into a pd.DataFrame by using the method to_dataframe().
zinc_mean_str.to_dataframe()

| | _param | _estimate | _stderror | _lci | _uci | _cv |
|---|---|---|---|---|---|---|
| 0 | PopParam.mean | 87.182067 | 0.494483 | 86.173563 | 88.190571 | 0.005672 |
The method to_dataframe() is most useful for domain estimation, where it produces a table in which each row is a level of the domain of interest, as shown below.
zinc_mean_by_race = TaylorEstimator(PopParam.mean)
zinc_mean_by_race.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    domain=nhanes2["race"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)
zinc_mean_by_race.to_dataframe()

| | _param | _domain | _estimate | _stderror | _lci | _uci | _cv |
|---|---|---|---|---|---|---|---|
| 0 | PopParam.mean | 1 | 87.495389 | 0.479196 | 86.518062 | 88.472716 | 0.005477 |
| 1 | PopParam.mean | 2 | 85.085744 | 1.165209 | 82.709286 | 87.462203 | 0.013695 |
| 2 | PopParam.mean | 3 | 83.570910 | 1.585463 | 80.337338 | 86.804483 | 0.018971 |
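To make the `_estimate` column more concrete, here is a minimal sketch of the usual point estimate for a domain mean: the weighted mean of `y` restricted to the records in each domain. It is a plain-Python illustration, not samplics's implementation. The toy records are made up, the helper `weighted_mean_by_domain` is hypothetical, and dropping records with a missing `y` is only assumed to mirror what `remove_nan=True` does. The standard errors are not reproduced here, since they depend on the strata and PSUs.

```python
import math

# Made-up toy records (y, weight, domain); NOT the NHANES data
records = [
    (10.0, 2.0, "a"),
    (20.0, 1.0, "a"),
    (math.nan, 5.0, "a"),  # missing y: skipped (assumed analogue of remove_nan=True)
    (30.0, 3.0, "b"),
    (50.0, 1.0, "b"),
]


def weighted_mean_by_domain(rows):
    """Return {domain: sum(w * y) / sum(w)} over rows with non-missing y."""
    totals = {}
    for y, w, d in rows:
        if math.isnan(y):
            continue
        wy, ws = totals.get(d, (0.0, 0.0))
        totals[d] = (wy + w * y, ws + w)
    return {d: wy / ws for d, (wy, ws) in totals.items()}


domain_means = weighted_mean_by_domain(records)
print(domain_means)
```

Each domain gets its own ratio of weighted totals, which is why the table above has one row per level of `race`.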
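If we remove the stratum parameter, the design is treated as a single stratum. The point estimate is unchanged, but the standard error is larger and the confidence interval is much wider: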
zinc_mean_nostr = TaylorEstimator(PopParam.mean)
zinc_mean_nostr.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)
print(zinc_mean_nostr)

SAMPLICS - Estimation of Mean
Number of strata: 1
Number of psus: 2
Degree of freedom: 1
MEAN SE LCI UCI CV
87.182067 0.742622 77.746158 96.617976 0.008518
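The two summaries can be compared numerically to see where the wider interval comes from. The sketch below copies the printed values for both designs. It assumes a symmetric interval and degrees of freedom equal to PSUs minus strata; neither assumption is stated in the output, and the names used are illustrative only.

```python
# (strata, PSUs, SE, LCI, UCI) copied from the two printed summaries
designs = {
    "stratified": (31, 62, 0.494483, 86.173563, 88.190571),
    "unstratified": (1, 2, 0.742622, 77.746158, 96.617976),
}

results = {}
for name, (n_strata, n_psus, se, lci, uci) in designs.items():
    dof = n_psus - n_strata          # assumed rule: PSUs minus strata
    half_width = (uci - lci) / 2     # assumes a symmetric interval
    multiplier = half_width / se     # critical value implied by the output
    results[name] = (dof, half_width, multiplier)
    print(f"{name}: dof={dof}, half-width={half_width:.4f}, multiplier={multiplier:.4f}")

# How much each quantity grows when the stratum information is dropped
se_ratio = designs["unstratified"][2] / designs["stratified"][2]
width_ratio = results["unstratified"][1] / results["stratified"][1]
print(f"SE ratio = {se_ratio:.2f}, half-width ratio = {width_ratio:.2f}")
```

The standard error grows by a factor of about 1.5, while the interval becomes roughly nine times wider. Most of the widening comes from the implied multiplier, which rises from about 2.04 to about 12.71, the 97.5% Student t quantile with a single degree of freedom.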