from samplics.datasets import load_expenditure_milk
from samplics.sae import EblupAreaModel
from samplics.utils.types import FitMethod
Area Level Modeling
Empirical best linear unbiased predictor (EBLUP)
Small area estimation (SAE) techniques are useful when sample sizes are not sufficient to provide reliable direct domain estimates given the sampling design. In this tutorial, direct estimates refer to estimates obtained from the design-based approach, which usually consists of applying adjusted design weights to the variable of interest to compute sample statistics as estimates of the equivalent population parameters. When auxiliary information is available, model-assisted survey methods can be used to estimate population parameters.
In this tutorial, we go further and use modeling techniques to produce domain estimates. For the area level model, the modeling is done at the area level using generalized linear mixed models. The sections below show how to use the EblupAreaModel class from the samplics package to produce area level estimates.
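Concretely, the area level model used here is the classical Fay-Herriot model described in Rao and Molina (2015). As a sketch of the standard formulation, with $\psi_d$ the known sampling variance of area $d$:

```latex
% Sampling model: direct estimate = true area mean + sampling error
\hat{\theta}_d = \theta_d + e_d, \qquad e_d \sim N(0, \psi_d), \quad \psi_d \text{ assumed known}

% Linking model: true area mean as a linear function of auxiliary data
\theta_d = \mathbf{x}_d^{\top}\boldsymbol{\beta} + u_d, \qquad u_d \sim N(0, \sigma_u^2)

% EBLUP: weighted combination of the direct and synthetic estimators
\tilde{\theta}_d = \gamma_d\,\hat{\theta}_d + (1 - \gamma_d)\,\mathbf{x}_d^{\top}\hat{\boldsymbol{\beta}},
\qquad \gamma_d = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \psi_d}
```

The EBLUP thus shrinks each direct estimate toward the regression-based (synthetic) estimate, with more shrinkage for areas whose sampling variance is large relative to the between-area variance.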
Milk Expenditure data
To illustrate the EblupAreaModel class, we will use the Milk Expenditure dataset used in Rao and Molina (2015). As mentioned in the book, this dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). For R users, this dataset is also used by the R package sae (https://cran.r-project.org/web/packages/sae/index.html).
The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. The dataset has the following variables: major area (major_area), small area (small_area), sample size (samp_size), direct survey estimate of average expenditure (direct_est), standard error of the direct estimate (std_error), and coefficient of variation of the direct estimate (coef_var).
# Load Expenditure on Milk sample data
milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]
nb_obs = 15
print(f"Last {nb_obs} observations of the Milk Expenditure dataset\n")
milk_exp.tail(nb_obs)
Last 15 observations of the Milk Expenditure dataset
|    | major_area | small_area | samp_size | direct_est | std_error | coef_var |
|---:|-----------:|-----------:|----------:|-----------:|----------:|---------:|
| 28 | 4 | 29 | 238 | 0.796 | 0.106 | 0.133 |
| 29 | 4 | 30 | 207 | 0.565 | 0.089 | 0.158 |
| 30 | 4 | 31 | 165 | 0.886 | 0.225 | 0.254 |
| 31 | 4 | 32 | 153 | 0.952 | 0.205 | 0.215 |
| 32 | 4 | 33 | 210 | 0.807 | 0.119 | 0.147 |
| 33 | 4 | 34 | 383 | 0.582 | 0.067 | 0.115 |
| 34 | 4 | 35 | 255 | 0.684 | 0.106 | 0.155 |
| 35 | 4 | 36 | 226 | 0.787 | 0.126 | 0.160 |
| 36 | 4 | 37 | 224 | 0.440 | 0.092 | 0.209 |
| 37 | 4 | 38 | 212 | 0.759 | 0.132 | 0.174 |
| 38 | 4 | 39 | 211 | 0.770 | 0.100 | 0.130 |
| 39 | 4 | 40 | 179 | 0.800 | 0.113 | 0.141 |
| 40 | 4 | 41 | 312 | 0.756 | 0.083 | 0.110 |
| 41 | 4 | 42 | 241 | 0.865 | 0.121 | 0.140 |
| 42 | 4 | 43 | 205 | 0.640 | 0.129 | 0.202 |
EBLUP Predictor
As shown in the milk expenditure dataset, some of the coefficients of variation are not small, which indicates instability of the direct survey estimates. Hence, we can try to reduce the variability of the estimates by smoothing them through modeling. For illustration purposes, we will model the average expenditure on milk using the major areas as auxiliary variables.
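To make "not small" concrete, we can flag the areas whose CV exceeds a cutoff. A minimal sketch using a few rows transcribed from the table above; the 20% threshold is an illustrative choice, not something samplics prescribes:

```python
import pandas as pd

# A few rows transcribed from the milk expenditure table above
subset = pd.DataFrame({
    "small_area": [29, 31, 32, 34],
    "direct_est": [0.796, 0.886, 0.952, 0.582],
    "coef_var": [0.133, 0.254, 0.215, 0.115],
})

# Flag areas whose direct-estimate CV exceeds an illustrative 20% cutoff
unstable = subset.loc[subset["coef_var"] > 0.20, "small_area"].tolist()
print(unstable)  # [31, 32]
```

Areas 31 and 32 exceed the cutoff; these are exactly the kinds of areas the model-based estimates are meant to stabilize.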
First, we use the method fit() to estimate the model parameters. The pandas method get_dummies() creates a matrix of dummy values (0 and 1) from the categorical variable major_area.
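A toy illustration (unrelated to the milk data) of what get_dummies() produces and how drop_first affects it: with drop_first=True, the first category is dropped so the dummy columns are not collinear with the intercept.

```python
import pandas as pd

# Toy categorical series standing in for major_area
major = pd.Series([1, 1, 2, 3, 3], name="major_area")

# drop_first=True drops category 1, leaving one column per remaining category
dummies = pd.get_dummies(major, drop_first=True)
print(dummies.columns.tolist())  # [2, 3]
```

Since the model below is fitted with intercept=True, dropping the first category avoids the dummy-variable trap.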
import pandas as pd

area = milk_exp["small_area"]
yhat = milk_exp["direct_est"]
X = pd.get_dummies(milk_exp["major_area"], drop_first=True)
sigma_e = milk_exp["std_error"]
## REML method
fh_model_reml = EblupAreaModel(method=FitMethod.reml)

fh_model_reml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)
from pprint import pprint
print("The estimated fixed effects are:")
pprint(fh_model_reml.fixed_effects)
The estimated fixed effects are:
array([ 0.96818899, 0.13278031, 0.22694622, -0.24130104])
print("The estimated standard error of the area random effects is:")
pprint(fh_model_reml.re_std)
The estimated standard error of the area random effects is:
0.13619961367679143
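With the estimated random-effect standard deviation in hand, we can compute the shrinkage factor for one area by hand. A sketch assuming the standard Fay-Herriot formula gamma_d = sigma_u^2 / (sigma_u^2 + psi_d), with sigma_u taken from the REML output above and psi_d the squared std_error of one area from the table:

```python
# REML estimate of the area random-effect standard deviation (from the output above)
sigma_u = 0.13619961367679143

# Sampling variance psi_d for the area with std_error 0.106 in the table
psi_d = 0.106 ** 2

# Shrinkage weight given to the direct estimate in the EBLUP
gamma_d = sigma_u**2 / (sigma_u**2 + psi_d)
print(round(gamma_d, 3))  # 0.623
```

So for this area roughly 62% of the weight goes to the direct estimate and 38% to the synthetic (regression) estimate; areas with larger std_error get pulled more strongly toward the regression fit.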
print("The convergence statistics are:")
pprint(fh_model_reml.convergence)
The convergence statistics are:
{'achieved': True, 'iterations': 9, 'precision': 1.979926027448861e-09}
print("The goodness of fit statistics are:")
pprint(fh_model_reml.goodness)
The goodness of fit statistics are:
{'AIC': 30.806948817978125,
'BIC': 41.374149512139496,
'loglike': -9.403474408989062}
fh_model_reml.predict(
    X=X, area=area, intercept=True
)
pprint(fh_model_reml.area_est)
{1: 1.0219705441558897,
2: 1.0476019514456771,
3: 1.0679514263086398,
4: 0.7608165650758554,
5: 0.846157043770443,
6: 0.9743727061135177,
7: 1.0584526719577587,
8: 1.0977762561870672,
9: 1.221545489460039,
10: 1.195146014849563,
11: 0.7852149191679926,
12: 1.2139462053925376,
13: 1.2096597208051416,
14: 0.9834964411924456,
15: 1.186424709592169,
16: 1.1556981139181448,
17: 1.2263412506598237,
18: 1.285648988670589,
19: 1.2363248408581455,
20: 1.2349601393920557,
21: 1.0903016274841368,
22: 1.1923057228337586,
23: 1.121646766758834,
24: 1.2230297218953323,
25: 1.1938054443945292,
26: 0.7627195896418988,
27: 0.7649551532185552,
28: 0.7338443880846524,
29: 0.7699295541643935,
30: 0.6134416233555453,
31: 0.7695560722792779,
32: 0.7958253116952815,
33: 0.7723188477263883,
34: 0.6102300683088241,
35: 0.7001781896510537,
36: 0.7592788104101728,
37: 0.5298863364482943,
38: 0.743446678039724,
39: 0.7548996331139428,
40: 0.7701919657191105,
41: 0.748116423847363,
42: 0.8040775158100446,
43: 0.6810868850580025}
We can use the utility method to_dataframe() to output the estimates as a pandas dataframe. The method provides the parameter, the area, the point estimate, and its MSE estimate. We can use col_names to customize the column names, for example col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]. Otherwise, if col_names is not provided, default names such as "_area", "_estimates" and "_mse" are used.
milk_est_reml = fh_model_reml.to_dataframe(
    col_names=["parameter", "small_area", "eblup_estimate", "eblup_mse"]
)
print("The dataframe version of the area level estimates:")
print(milk_est_reml)
The dataframe version of the area level estimates:
parameter small_area eblup_estimate eblup_mse
0 mean 1 1.021971 0.013460
1 mean 2 1.047602 0.005373
2 mean 3 1.067951 0.005702
3 mean 4 0.760817 0.008542
4 mean 5 0.846157 0.009580
5 mean 6 0.974373 0.011671
6 mean 7 1.058453 0.015926
7 mean 8 1.097776 0.010587
8 mean 9 1.221545 0.014184
9 mean 10 1.195146 0.014902
10 mean 11 0.785215 0.007694
11 mean 12 1.213946 0.016337
12 mean 13 1.209660 0.012563
13 mean 14 0.983496 0.012117
14 mean 15 1.186425 0.012031
15 mean 16 1.155698 0.011709
16 mean 17 1.226341 0.010860
17 mean 18 1.285649 0.013691
18 mean 19 1.236325 0.011035
19 mean 20 1.234960 0.013080
20 mean 21 1.090302 0.009949
21 mean 22 1.192306 0.017244
22 mean 23 1.121647 0.011292
23 mean 24 1.223030 0.013625
24 mean 25 1.193805 0.008066
25 mean 26 0.762720 0.009205
26 mean 27 0.764955 0.009205
27 mean 28 0.733844 0.016477
28 mean 29 0.769930 0.007801
29 mean 30 0.613442 0.006099
30 mean 31 0.769556 0.015442
31 mean 32 0.795825 0.014658
32 mean 33 0.772319 0.009025
33 mean 34 0.610230 0.003871
34 mean 35 0.700178 0.007801
35 mean 36 0.759279 0.009646
36 mean 37 0.529886 0.006404
37 mean 38 0.743447 0.010156
38 mean 39 0.754900 0.007210
39 mean 40 0.770192 0.008470
40 mean 41 0.748116 0.005485
41 mean 42 0.804078 0.009205
42 mean 43 0.681087 0.009904
We could also fit the model parameters using the maximum likelihood (ML) method, which affects the MSE estimation as well. To estimate the area means using the ML method, we only need to set method=FitMethod.ml and then run the prediction as follows.
## ML method
fh_model_ml = EblupAreaModel(method=FitMethod.ml)

fh_model_ml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_ml.predict(
    X=X, area=area, intercept=True
)

milk_est_ml = fh_model_ml.to_dataframe(
    col_names=["parameter", "small_area", "eblup_estimate", "eblup_mse"]
)
print("The dataframe version of the ML area level estimates:")
print(milk_est_ml)
The dataframe version of the ML area level estimates:
parameter small_area eblup_estimate eblup_mse
0 mean 1 1.016228 0.013559
1 mean 2 1.043736 0.005507
2 mean 3 1.062868 0.005844
3 mean 4 0.775205 0.008724
4 mean 5 0.855398 0.009761
5 mean 6 0.973593 0.011823
6 mean 7 1.047581 0.015908
7 mean 8 1.095370 0.010806
8 mean 9 1.205566 0.014324
9 mean 10 1.181391 0.015012
10 mean 11 0.803191 0.007901
11 mean 12 1.196940 0.016378
12 mean 13 1.196292 0.012752
13 mean 14 0.991332 0.012316
14 mean 15 1.186879 0.012174
15 mean 16 1.159004 0.011859
16 mean 17 1.223267 0.011025
17 mean 18 1.275615 0.013783
18 mean 19 1.232324 0.011197
19 mean 20 1.230486 0.013193
20 mean 21 1.098496 0.010124
21 mean 22 1.192161 0.017163
22 mean 23 1.127931 0.011450
23 mean 24 1.219661 0.013719
24 mean 25 1.193628 0.008241
25 mean 26 0.759102 0.009332
26 mean 27 0.761161 0.009332
27 mean 28 0.731587 0.016359
28 mean 29 0.766310 0.007931
29 mean 30 0.619089 0.006215
30 mean 31 0.763002 0.015376
31 mean 32 0.786464 0.014629
32 mean 33 0.768021 0.009153
33 mean 34 0.614095 0.003944
34 mean 35 0.701300 0.007931
35 mean 36 0.755792 0.009768
36 mean 37 0.540557 0.006525
37 mean 38 0.741155 0.010271
38 mean 39 0.752476 0.007338
39 mean 40 0.766282 0.008601
40 mean 41 0.746552 0.005592
41 mean 42 0.797209 0.009332
42 mean 43 0.684069 0.010022
Similarly, we can use the Fay-Herriot (FH) fitting method as follows.
## FH method
fh_model_fh = EblupAreaModel(method=FitMethod.fh)

fh_model_fh.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_fh.predict(
    X=X, area=area, intercept=True
)

milk_est_fh = fh_model_fh.to_dataframe(
    col_names=["parameter", "small_area", "eblup_estimate", "eblup_mse"]
)
print("The dataframe version of the FH area level estimates:")
print(milk_est_fh)
The dataframe version of the FH area level estimates:
parameter small_area eblup_estimate eblup_mse
0 mean 1 1.017976 0.012757
1 mean 2 1.044964 0.005314
2 mean 3 1.064481 0.005632
3 mean 4 0.770692 0.008323
4 mean 5 0.852512 0.009284
5 mean 6 0.973826 0.011178
6 mean 7 1.050857 0.014868
7 mean 8 1.096165 0.010253
8 mean 9 1.210505 0.013471
9 mean 10 1.185640 0.014095
10 mean 11 0.797569 0.007558
11 mean 12 1.202150 0.015325
12 mean 13 1.200459 0.012039
13 mean 14 0.988971 0.011640
14 mean 15 1.186745 0.011467
15 mean 16 1.157992 0.011182
16 mean 17 1.224223 0.010424
17 mean 18 1.278680 0.012914
18 mean 19 1.233566 0.010581
19 mean 20 1.231860 0.012386
20 mean 21 1.095955 0.009600
21 mean 22 1.192213 0.015890
22 mean 23 1.125997 0.010811
23 mean 24 1.220695 0.012858
24 mean 25 1.193687 0.007865
25 mean 26 0.760244 0.008855
26 mean 27 0.762358 0.008855
27 mean 28 0.732288 0.015042
28 mean 29 0.767459 0.007569
29 mean 30 0.617310 0.005975
30 mean 31 0.764997 0.014212
31 mean 32 0.789315 0.013572
32 mean 33 0.769376 0.008692
33 mean 34 0.612861 0.003833
34 mean 35 0.700962 0.007569
35 mean 36 0.756891 0.009253
36 mean 37 0.537193 0.006264
37 mean 38 0.741882 0.009709
38 mean 39 0.753251 0.007020
39 mean 40 0.767519 0.008186
40 mean 41 0.747059 0.005391
41 mean 42 0.799363 0.008855
42 mean 43 0.683161 0.009484
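As a quick sanity check, the point estimates from the three fitting methods can be lined up side by side; for this dataset they differ only in the third decimal place. A minimal sketch using the values for the first three small areas copied from the tables above:

```python
import pandas as pd

# Point estimates for small areas 1-3, copied from the REML, ML, and FH tables above
comparison = pd.DataFrame({
    "small_area": [1, 2, 3],
    "reml": [1.021971, 1.047602, 1.067951],
    "ml": [1.016228, 1.043736, 1.062868],
    "fh": [1.017976, 1.044964, 1.064481],
})

# Largest absolute spread across the three fitting methods, per area
methods = comparison[["reml", "ml", "fh"]]
spread = (methods.max(axis=1) - methods.min(axis=1)).round(4)
print(spread.tolist())  # [0.0057, 0.0039, 0.0051]
```

The choice of fitting method therefore matters less for the point estimates than for the MSE estimates, which differ more noticeably across the three tables.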