Unit Level Modeling

Empirical best linear unbiased Prediction (EBLUP)

The Unit Level model refers to a class of SAE techniques that fit the linear mixed model at the sampling unit level. As for the Area Level model, generalized linear mixed models are also the modeling framework for Unit Level model. In this case, given that the model is happening at the sub-area level, both the random effects and unit level standard errors can be estimated from the model. In this tutorial, we will predict the area level means which is a linear parameter.

County Crop (Corn and Soybeans) Areas Data

For this example, we use the county crop data used by Battese, Harter, and Fuller (1988). The datasets contains 37 observations on areas under corn and under soybeans for each of the 12 counties in the north-central of Iowa. Each county was divided in segments and using interviews and LANDSAT satellite data, data on area used for corn and soybeans was obtained. In the unit level data, each observation is a segment with the following variables: county id (county_id), area in hectare under corn (corn_area), area in heactare under soybeans (soybeans_area), number of pixel classified as corn (corn_pixel), and number of pixels classified as soybeans (soybeans_pixel).

from samplics.datasets import load_county_crop, load_county_crop_means
from samplics.sae import EblupUnitModel

from samplics.utils.types import FitMethod
# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]

print("First observations from the unit (segment) level crop areas data")
countycrop.head(15)
First observations from the unit (segment) level crop areas data
county_id corn_area soybeans_area corn_pixel soybeans_pixel
0 1 165.76 8.09 374 55
1 2 96.32 106.03 209 218
2 3 76.08 103.60 253 250
3 4 185.35 6.47 432 96
4 4 116.43 63.82 367 178
5 5 162.08 43.50 361 137
6 5 152.04 71.43 288 206
7 5 161.75 42.49 369 165
8 6 92.88 105.26 206 218
9 6 149.94 76.49 316 221
10 6 64.75 174.34 145 338
11 7 127.07 95.67 355 128
12 7 133.55 76.57 295 147
13 7 77.70 93.48 223 204
14 8 206.39 37.84 459 77

In addition to the unit (segment) level data, we have the small area (county) level averages of the number of pixels classified as corn or soybeans.

# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

print(f"County level crop areas averages")
countycrop_means.head(15)
County level crop areas averages
county_id samp_segments pop_segments ave_corn_pixel ave_soybeans_pixel
0 1 1 545 295.29 189.70
1 2 1 566 300.40 196.65
2 3 1 394 289.60 205.28
3 4 2 424 290.74 220.22
4 5 3 564 318.21 188.06
5 6 3 570 257.17 247.13
6 7 3 402 291.77 185.37
7 8 3 567 301.26 221.36
8 9 4 687 262.17 247.09
9 10 5 569 314.28 198.66
10 11 5 965 298.65 204.61
11 12 6 556 325.99 177.05

Empirical Bayes linear unbiased predictor (EBLUP)

Now we are going to estimates the average area size under corn and soybeans. To do so, we use the nested error linear regression (special case of the linear mixed model) to model the number of hectares. As auxiliary variables, we use the number of pixel classified as corn and soybeans.

First, we use the method fit() to estimate the model parameters.

areas = countycrop["county_id"]
ys = countycrop["corn_area"]
Xs = countycrop[["corn_pixel", "soybeans_pixel"]]
Xp_mean = countycrop_means[["ave_corn_pixel", "ave_corn_pixel"]]
samp_size = countycrop_means[["samp_segments"]]
pop_size = countycrop_means[["pop_segments"]]

import numpy as np
areap = np.linspace(1, 12, 12)

"""REML Method"""
eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    ys,
    Xs,
    areas,
)

eblup_bhf_reml.predict(
    Xmean=Xp_mean,
    area=areap,
)

corn_est_reml = eblup_bhf_reml.to_dataframe()

print(corn_est_reml)
   _parameter  _area   _estimate        _mse
0        mean    1.0  119.357558  139.514787
1        mean    2.0  120.364916  140.131214
2        mean    3.0  110.530444  114.058730
3        mean    4.0  112.879489  113.682072
4        mean    5.0  133.244364  152.310283
5        mean    6.0  108.640579   75.256679
6        mean    7.0  113.284824  121.930105
7        mean    8.0  120.335415  113.394707
8        mean    9.0  111.072462   67.357524
9        mean   10.0  120.669683  118.122346
10       mean   11.0  109.649316   99.856041
11       mean   12.0  126.735499  148.884533
/Users/msdiallo/Dev/survey-methods/samplics/.venv/lib/python3.12/site-packages/statsmodels/base/optimizer.py:19: FutureWarning: Keyword arguments have been passed to the optimizer that have no effect. The list of allowed keyword arguments for method bfgs is: gtol, norm, epsilon. The list of unsupported keyword arguments passed include: tol. After release 0.14, this will raise.
  warnings.warn(
"""ML Method"""
eblup_bhf_ml = EblupUnitModel(method=FitMethod.ml)
eblup_bhf_ml.fit(
    ys,
    Xs,
    areas,
)

eblup_bhf_ml.predict(Xp_mean, areap)

corn_est_ml = eblup_bhf_ml.to_dataframe()

print(corn_est_ml)
   _parameter  _area   _estimate        _mse
0        mean    1.0  118.987340  117.381060
1        mean    2.0  120.091296  119.200583
2        mean    3.0  111.315363   96.845753
3        mean    4.0  113.302456   99.756880
4        mean    5.0  132.143329  138.775182
5        mean    6.0  108.072837   67.896678
6        mean    7.0  113.637097  108.711569
7        mean    8.0  120.189563  103.560647
8        mean    9.0  110.480483   62.797278
9        mean   10.0  120.961246  111.594667
10       mean   11.0  110.577742   93.776110
11       mean   12.0  126.790379  139.817391

Bootstrap MSE estimation

As shown above, the predict() method provides the taylor-based MSE estimates. However, we can also calculate MSE estimates using the bootstrap approach.