from samplics.datasets import load_county_crop, load_county_crop_means
from samplics.sae import EblupUnitModel
from samplics.utils.types import FitMethod
Unit Level Modeling
Empirical best linear unbiased Prediction (EBLUP)
The Unit Level model refers to a class of SAE techniques that fit the linear mixed model at the sampling unit level. As for the Area Level model, generalized linear mixed models are also the modeling framework for Unit Level model. In this case, given that the model is happening at the sub-area level, both the random effects and unit level standard errors can be estimated from the model. In this tutorial, we will predict the area level means which is a linear parameter.
County Crop (Corn and Soybeans) Areas Data
For this example, we use the county crop data used by Battese, Harter, and Fuller (1988). The datasets contains 37 observations on areas under corn and under soybeans for each of the 12 counties in the north-central of Iowa. Each county was divided in segments and using interviews and LANDSAT satellite data, data on area used for corn and soybeans was obtained. In the unit level data, each observation is a segment with the following variables: county id (county_id), area in hectare under corn (corn_area), area in heactare under soybeans (soybeans_area), number of pixel classified as corn (corn_pixel), and number of pixels classified as soybeans (soybeans_pixel).
# Load County Crop sample data
= load_county_crop()
countycrop_dict = countycrop_dict["data"]
countycrop
print("First observations from the unit (segment) level crop areas data")
15) countycrop.head(
First observations from the unit (segment) level crop areas data
county_id | corn_area | soybeans_area | corn_pixel | soybeans_pixel | |
---|---|---|---|---|---|
0 | 1 | 165.76 | 8.09 | 374 | 55 |
1 | 2 | 96.32 | 106.03 | 209 | 218 |
2 | 3 | 76.08 | 103.60 | 253 | 250 |
3 | 4 | 185.35 | 6.47 | 432 | 96 |
4 | 4 | 116.43 | 63.82 | 367 | 178 |
5 | 5 | 162.08 | 43.50 | 361 | 137 |
6 | 5 | 152.04 | 71.43 | 288 | 206 |
7 | 5 | 161.75 | 42.49 | 369 | 165 |
8 | 6 | 92.88 | 105.26 | 206 | 218 |
9 | 6 | 149.94 | 76.49 | 316 | 221 |
10 | 6 | 64.75 | 174.34 | 145 | 338 |
11 | 7 | 127.07 | 95.67 | 355 | 128 |
12 | 7 | 133.55 | 76.57 | 295 | 147 |
13 | 7 | 77.70 | 93.48 | 223 | 204 |
14 | 8 | 206.39 | 37.84 | 459 | 77 |
In addition to the unit (segment) level data, we have the small area (county) level averages of the number of pixels classified as corn or soybeans.
# Load County Crop Area Means sample data
= load_county_crop_means()
countycropmeans_dict = countycropmeans_dict["data"]
countycrop_means
print(f"County level crop areas averages")
15) countycrop_means.head(
County level crop areas averages
county_id | samp_segments | pop_segments | ave_corn_pixel | ave_soybeans_pixel | |
---|---|---|---|---|---|
0 | 1 | 1 | 545 | 295.29 | 189.70 |
1 | 2 | 1 | 566 | 300.40 | 196.65 |
2 | 3 | 1 | 394 | 289.60 | 205.28 |
3 | 4 | 2 | 424 | 290.74 | 220.22 |
4 | 5 | 3 | 564 | 318.21 | 188.06 |
5 | 6 | 3 | 570 | 257.17 | 247.13 |
6 | 7 | 3 | 402 | 291.77 | 185.37 |
7 | 8 | 3 | 567 | 301.26 | 221.36 |
8 | 9 | 4 | 687 | 262.17 | 247.09 |
9 | 10 | 5 | 569 | 314.28 | 198.66 |
10 | 11 | 5 | 965 | 298.65 | 204.61 |
11 | 12 | 6 | 556 | 325.99 | 177.05 |
Empirical Bayes linear unbiased predictor (EBLUP)
Now we are going to estimates the average area size under corn and soybeans. To do so, we use the nested error linear regression (special case of the linear mixed model) to model the number of hectares. As auxiliary variables, we use the number of pixel classified as corn and soybeans.
First, we use the method fit()
to estimate the model parameters.
= countycrop["county_id"]
areas = countycrop["corn_area"]
ys = countycrop[["corn_pixel", "soybeans_pixel"]]
Xs = countycrop_means[["ave_corn_pixel", "ave_corn_pixel"]]
Xp_mean = countycrop_means[["samp_segments"]]
samp_size = countycrop_means[["pop_segments"]]
pop_size
import numpy as np
= np.linspace(1, 12, 12)
areap
"""REML Method"""
= EblupUnitModel()
eblup_bhf_reml
eblup_bhf_reml.fit(
ys,
Xs,
areas,
)
eblup_bhf_reml.predict(=Xp_mean,
Xmean=areap,
area
)
= eblup_bhf_reml.to_dataframe()
corn_est_reml
print(corn_est_reml)
_parameter _area _estimate _mse
0 mean 1.0 119.357558 139.514787
1 mean 2.0 120.364916 140.131214
2 mean 3.0 110.530444 114.058730
3 mean 4.0 112.879489 113.682072
4 mean 5.0 133.244364 152.310283
5 mean 6.0 108.640579 75.256679
6 mean 7.0 113.284824 121.930105
7 mean 8.0 120.335415 113.394707
8 mean 9.0 111.072462 67.357524
9 mean 10.0 120.669683 118.122346
10 mean 11.0 109.649316 99.856041
11 mean 12.0 126.735499 148.884533
/Users/msdiallo/Dev/survey-methods/samplics/.venv/lib/python3.12/site-packages/statsmodels/base/optimizer.py:19: FutureWarning: Keyword arguments have been passed to the optimizer that have no effect. The list of allowed keyword arguments for method bfgs is: gtol, norm, epsilon. The list of unsupported keyword arguments passed include: tol. After release 0.14, this will raise.
warnings.warn(
"""ML Method"""
= EblupUnitModel(method=FitMethod.ml)
eblup_bhf_ml
eblup_bhf_ml.fit(
ys,
Xs,
areas,
)
eblup_bhf_ml.predict(Xp_mean, areap)
= eblup_bhf_ml.to_dataframe()
corn_est_ml
print(corn_est_ml)
_parameter _area _estimate _mse
0 mean 1.0 118.987340 117.381060
1 mean 2.0 120.091296 119.200583
2 mean 3.0 111.315363 96.845753
3 mean 4.0 113.302456 99.756880
4 mean 5.0 132.143329 138.775182
5 mean 6.0 108.072837 67.896678
6 mean 7.0 113.637097 108.711569
7 mean 8.0 120.189563 103.560647
8 mean 9.0 110.480483 62.797278
9 mean 10.0 120.961246 111.594667
10 mean 11.0 110.577742 93.776110
11 mean 12.0 126.790379 139.817391
Bootstrap MSE estimation
As shown above, the predict()
method provides the taylor-based MSE estimates. However, we can also calculate MSE estimates using the bootstrap approach.