Unit Level Modeling

Empirical best linear unbiased Prediction (EBLUP)

The Unit Level model refers to a class of SAE techniques that fit the linear mixed model at the sampling unit level. As for the Area Level model, generalized linear mixed models are also the modeling framework for Unit Level model. In this case, given that the model is happening at the sub-area level, both the random effects and unit level standard errors can be estimated from the model. In this tutorial, we will predict the area level means which is a linear parameter.

County Crop (Corn and Soybeans) Areas Data

For this example, we use the county crop data used by Battese, Harter, and Fuller (1988). The datasets contains 37 observations on areas under corn and under soybeans for each of the 12 counties in the north-central of Iowa. Each county was divided in segments and using interviews and LANDSAT satellite data, data on area used for corn and soybeans was obtained. In the unit level data, each observation is a segment with the following variables: county id (county_id), area in hectare under corn (corn_area), area in heactare under soybeans (soybeans_area), number of pixel classified as corn (corn_pixel), and number of pixels classified as soybeans (soybeans_pixel).

from samplics.datasets import load_county_crop, load_county_crop_means
from samplics.sae import EblupUnitModel

from samplics.utils.types import FitMethod

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]

print("First observations from the unit (segment) level crop areas data")
countycrop.head(15)

First observations from the unit (segment) level crop areas data

	county_id	corn_area	soybeans_area	corn_pixel	soybeans_pixel
0	1	165.76	8.09	374	55
1	2	96.32	106.03	209	218
2	3	76.08	103.60	253	250
3	4	185.35	6.47	432	96
4	4	116.43	63.82	367	178
5	5	162.08	43.50	361	137
6	5	152.04	71.43	288	206
7	5	161.75	42.49	369	165
8	6	92.88	105.26	206	218
9	6	149.94	76.49	316	221
10	6	64.75	174.34	145	338
11	7	127.07	95.67	355	128
12	7	133.55	76.57	295	147
13	7	77.70	93.48	223	204
14	8	206.39	37.84	459	77

In addition to the unit (segment) level data, we have the small area (county) level averages of the number of pixels classified as corn or soybeans.

# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

print(f"County level crop areas averages")
countycrop_means.head(15)

County level crop areas averages

	county_id	samp_segments	pop_segments	ave_corn_pixel	ave_soybeans_pixel
0	1	1	545	295.29	189.70
1	2	1	566	300.40	196.65
2	3	1	394	289.60	205.28
3	4	2	424	290.74	220.22
4	5	3	564	318.21	188.06
5	6	3	570	257.17	247.13
6	7	3	402	291.77	185.37
7	8	3	567	301.26	221.36
8	9	4	687	262.17	247.09
9	10	5	569	314.28	198.66
10	11	5	965	298.65	204.61
11	12	6	556	325.99	177.05

Empirical Bayes linear unbiased predictor (EBLUP)

Now we are going to estimates the average area size under corn and soybeans. To do so, we use the nested error linear regression (special case of the linear mixed model) to model the number of hectares. As auxiliary variables, we use the number of pixel classified as corn and soybeans.

First, we use the method fit() to estimate the model parameters.

areas = countycrop["county_id"]
ys = countycrop["corn_area"]
Xs = countycrop[["corn_pixel", "soybeans_pixel"]]
Xp_mean = countycrop_means[["ave_corn_pixel", "ave_corn_pixel"]]
samp_size = countycrop_means[["samp_segments"]]
pop_size = countycrop_means[["pop_segments"]]

import numpy as np
areap = np.linspace(1, 12, 12)

"""REML Method"""
eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    ys,
    Xs,
    areas,
)

eblup_bhf_reml.predict(
    Xmean=Xp_mean,
    area=areap,
)

corn_est_reml = eblup_bhf_reml.to_dataframe()

print(corn_est_reml)

   _parameter  _area   _estimate        _mse
0        mean    1.0  119.357558  139.514787
1        mean    2.0  120.364916  140.131214
2        mean    3.0  110.530444  114.058730
3        mean    4.0  112.879489  113.682072
4        mean    5.0  133.244364  152.310283
5        mean    6.0  108.640579   75.256679
6        mean    7.0  113.284824  121.930105
7        mean    8.0  120.335415  113.394707
8        mean    9.0  111.072462   67.357524
9        mean   10.0  120.669683  118.122346
10       mean   11.0  109.649316   99.856041
11       mean   12.0  126.735499  148.884533

/Users/msdiallo/Dev/survey-methods/samplics/.venv/lib/python3.13/site-packages/statsmodels/base/optimizer.py:19: FutureWarning: Keyword arguments have been passed to the optimizer that have no effect. The list of allowed keyword arguments for method bfgs is: gtol, norm, epsilon. The list of unsupported keyword arguments passed include: tol. After release 0.14, this will raise.
  warnings.warn(

"""ML Method"""
eblup_bhf_ml = EblupUnitModel(method=FitMethod.ml)
eblup_bhf_ml.fit(
    ys,
    Xs,
    areas,
)

eblup_bhf_ml.predict(Xp_mean, areap)

corn_est_ml = eblup_bhf_ml.to_dataframe()

print(corn_est_ml)

   _parameter  _area   _estimate        _mse
0        mean    1.0  118.987340  117.381060
1        mean    2.0  120.091296  119.200583
2        mean    3.0  111.315363   96.845753
3        mean    4.0  113.302456   99.756880
4        mean    5.0  132.143329  138.775182
5        mean    6.0  108.072837   67.896678
6        mean    7.0  113.637097  108.711569
7        mean    8.0  120.189563  103.560647
8        mean    9.0  110.480483   62.797278
9        mean   10.0  120.961246  111.594667
10       mean   11.0  110.577742   93.776110
11       mean   12.0  126.790379  139.817391

Bootstrap MSE estimation

As shown above, the predict() method provides the taylor-based MSE estimates. However, we can also calculate MSE estimates using the bootstrap approach.