Area Level Modeling

Empirical best linear unbiased predictor (EBLUP)

Small area estimation (SAE) techniques are useful when the sample sizes are not sufficient to provide reliable direct domain estimates under the sampling design. In this tutorial, direct estimates refer to estimates obtained from the design-based approach, which usually consists of applying adjusted design weights to the variable of interest to compute sample statistics as estimates of the equivalent population parameters. When auxiliary information is available, model-assisted survey methods can be used to estimate population parameters.
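As a minimal sketch of what a direct, design-based estimate looks like, the domain mean can be computed as the weighted (Hájek) ratio of the design weights and the variable of interest. The weights and values below are hypothetical and for illustration only:

```python
import numpy as np

# Hypothetical adjusted design weights and milk-expenditure values
# for the sampled units of one domain
weights = np.array([120.0, 95.0, 110.0, 130.0])
y = np.array([0.80, 0.65, 0.90, 0.75])

# Hajek (ratio) estimator of the domain mean: sum(w*y) / sum(w)
direct_est = np.sum(weights * y) / np.sum(weights)
print(round(direct_est, 4))
```

When the domain sample size is small, this direct estimate can have a large variance, which is the motivation for the model-based approach below.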

In this tutorial, we go further and use modeling techniques to produce domain estimates. For the area level model, the modeling is done at the area level using generalized linear mixed models. The sections below show how to use the EblupAreaModel class from the samplics package to produce area level estimates.
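The area level model used here is the Fay-Herriot model, which combines a sampling model for the direct estimates with a linking model relating the area means to auxiliary data (notation as in Rao and Molina, 2015):

```latex
% Sampling model: the direct estimator is unbiased for the area mean
\hat{\theta}_i = \theta_i + e_i, \qquad e_i \sim N(0, \sigma_{e_i}^2)

% Linking model: the area means depend on auxiliary variables x_i
\theta_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + u_i, \qquad u_i \sim N(0, \sigma_u^2)

% Combined model for areas i = 1, \dots, m
\hat{\theta}_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + u_i + e_i
```

The sampling variances σ²ₑᵢ are treated as known (here, the squared standard errors of the direct estimates), while β and σ²ᵤ are estimated from the data.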

Milk Expenditure data

To illustrate the EblupAreaModel class, we will use the Milk Expenditure dataset used in Rao and Molina (2015). As mentioned in the book, this dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). For R users, this dataset is also available in the R package sae (https://cran.r-project.org/web/packages/sae/index.html).

The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. The dataset has the following variables: major area (major_area), small area (small_area), sample size (samp_size), direct survey estimate of average expenditure (direct_est), standard error of the direct estimate (std_error), and coefficient of variation of the direct estimate (coef_var).

from samplics.datasets import load_expenditure_milk
from samplics.sae import EblupAreaModel
# Load Expenditure on Milk sample data
milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

nb_obs = 15
print(f"Last {nb_obs} observations of the Milk Expenditure dataset\n")
milk_exp.tail(nb_obs)
Last 15 observations of the Milk Expenditure dataset
major_area small_area samp_size direct_est std_error coef_var
28 4 29 238 0.796 0.106 0.133
29 4 30 207 0.565 0.089 0.158
30 4 31 165 0.886 0.225 0.254
31 4 32 153 0.952 0.205 0.215
32 4 33 210 0.807 0.119 0.147
33 4 34 383 0.582 0.067 0.115
34 4 35 255 0.684 0.106 0.155
35 4 36 226 0.787 0.126 0.160
36 4 37 224 0.440 0.092 0.209
37 4 38 212 0.759 0.132 0.174
38 4 39 211 0.770 0.100 0.130
39 4 40 179 0.800 0.113 0.141
40 4 41 312 0.756 0.083 0.110
41 4 42 241 0.865 0.121 0.140
42 4 43 205 0.640 0.129 0.202

EBLUP Predictor

As shown in the Milk Expenditure dataset, some of the coefficients of variation are not small, which indicates instability of the direct survey estimates. Hence, we can try to reduce the variability of the estimates by smoothing them through modeling. For illustration purposes, we will model the average expenditure on milk using the major areas as auxiliary variables.
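Under the Fay-Herriot model, the EBLUP of the area mean is a weighted average of the direct estimate and the regression-synthetic estimate, where the shrinkage weight γᵢ depends on the sampling variance of each area (a standard result, see Rao and Molina, 2015):

```latex
\hat{\theta}_i^{\mathrm{EBLUP}} = \hat{\gamma}_i \, \hat{\theta}_i
  + (1 - \hat{\gamma}_i) \, \mathbf{x}_i^{\top} \hat{\boldsymbol{\beta}},
\qquad
\hat{\gamma}_i = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \sigma_{e_i}^2}
```

Areas with large sampling variances get small γᵢ and are shrunk more strongly toward the regression prediction, which is how the smoothing reduces their variability.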

First, we use the method fit() to estimate the model parameters. The pandas function get_dummies() creates a matrix of dummy values (0 and 1) from the categorical variable major_area.
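As a small standalone illustration of get_dummies() with drop_first=True (the values below are hypothetical, not from the Milk Expenditure data):

```python
import pandas as pd

# A categorical variable with three levels
major = pd.Series([1, 1, 2, 3, 2], name="major_area")

# drop_first=True drops the first level (1) to avoid collinearity
# with the model intercept; levels 2 and 3 become indicator columns
dummies = pd.get_dummies(major, drop_first=True)
print(dummies.columns.tolist())
print(dummies.shape)
```

Dropping the first level matters here because we fit the model with intercept=True; keeping all levels would make the design matrix rank deficient.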

area = milk_exp["small_area"]
yhat = milk_exp["direct_est"]

import pandas as pd
X = pd.get_dummies(milk_exp["major_area"], drop_first=True)
sigma_e = milk_exp["std_error"]

## REML method
fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)
from pprint import pprint

print("The estimated fixed effects are:")
pprint(fh_model_reml.fixed_effects)
The estimated fixed effects are:
array([ 0.96818899,  0.13278031,  0.22694622, -0.24130104])
print("The estimated standard error of the area random effects is:")
pprint(fh_model_reml.re_std)
The estimated standard error of the area random effects is:
0.13619961367679143
print("The convergence statistics are:")
pprint(fh_model_reml.convergence)
The convergence statistics are:
{'achieved': True, 'iterations': 9, 'precision': 1.979926027448861e-09}
print("The goodness of fit statistics are:") 
pprint(fh_model_reml.goodness)
The goodness of fit statistics are:
{'AIC': 30.806948817978125,
 'BIC': 41.374149512139496,
 'loglike': -9.403474408989062}
fh_model_reml.predict(
    X=X, area=area, intercept=True
)

pprint(fh_model_reml.area_est)
{1: 1.0219705441558897,
 2: 1.0476019514456771,
 3: 1.0679514263086398,
 4: 0.7608165650758554,
 5: 0.846157043770443,
 6: 0.9743727061135177,
 7: 1.0584526719577587,
 8: 1.0977762561870672,
 9: 1.221545489460039,
 10: 1.195146014849563,
 11: 0.7852149191679926,
 12: 1.2139462053925376,
 13: 1.2096597208051416,
 14: 0.9834964411924456,
 15: 1.186424709592169,
 16: 1.1556981139181448,
 17: 1.2263412506598237,
 18: 1.285648988670589,
 19: 1.2363248408581455,
 20: 1.2349601393920557,
 21: 1.0903016274841368,
 22: 1.1923057228337586,
 23: 1.121646766758834,
 24: 1.2230297218953323,
 25: 1.1938054443945292,
 26: 0.7627195896418988,
 27: 0.7649551532185552,
 28: 0.7338443880846524,
 29: 0.7699295541643935,
 30: 0.6134416233555453,
 31: 0.7695560722792779,
 32: 0.7958253116952815,
 33: 0.7723188477263883,
 34: 0.6102300683088241,
 35: 0.7001781896510537,
 36: 0.7592788104101728,
 37: 0.5298863364482943,
 38: 0.743446678039724,
 39: 0.7548996331139428,
 40: 0.7701919657191105,
 41: 0.748116423847363,
 42: 0.8040775158100446,
 43: 0.6810868850580025}

We can use the utility method to_dataframe() to output the estimates as a pandas dataframe. The dataframe provides the parameter, the area, the point estimate, and its MSE estimate. We can use col_names to customize the column names, for example col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]. Otherwise, if col_names is not provided, "_area", "_estimates" and "_mse" are used as defaults.

milk_est_reml = fh_model_reml.to_dataframe(
    col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]
    )
print("The dataframe version of the area level estimates:")
print(milk_est_reml)
The dataframe version of the area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.021971   0.013460
1       mean           2        1.047602   0.005373
2       mean           3        1.067951   0.005702
3       mean           4        0.760817   0.008542
4       mean           5        0.846157   0.009580
5       mean           6        0.974373   0.011671
6       mean           7        1.058453   0.015926
7       mean           8        1.097776   0.010587
8       mean           9        1.221545   0.014184
9       mean          10        1.195146   0.014902
10      mean          11        0.785215   0.007694
11      mean          12        1.213946   0.016337
12      mean          13        1.209660   0.012563
13      mean          14        0.983496   0.012117
14      mean          15        1.186425   0.012031
15      mean          16        1.155698   0.011709
16      mean          17        1.226341   0.010860
17      mean          18        1.285649   0.013691
18      mean          19        1.236325   0.011035
19      mean          20        1.234960   0.013080
20      mean          21        1.090302   0.009949
21      mean          22        1.192306   0.017244
22      mean          23        1.121647   0.011292
23      mean          24        1.223030   0.013625
24      mean          25        1.193805   0.008066
25      mean          26        0.762720   0.009205
26      mean          27        0.764955   0.009205
27      mean          28        0.733844   0.016477
28      mean          29        0.769930   0.007801
29      mean          30        0.613442   0.006099
30      mean          31        0.769556   0.015442
31      mean          32        0.795825   0.014658
32      mean          33        0.772319   0.009025
33      mean          34        0.610230   0.003871
34      mean          35        0.700178   0.007801
35      mean          36        0.759279   0.009646
36      mean          37        0.529886   0.006404
37      mean          38        0.743447   0.010156
38      mean          39        0.754900   0.007210
39      mean          40        0.770192   0.008470
40      mean          41        0.748116   0.005485
41      mean          42        0.804078   0.009205
42      mean          43        0.681087   0.009904
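To see the smoothing gain, we can compare the EBLUP MSEs to the variances of the direct estimates (the squared standard errors). The snippet below uses a few values copied from the dataset and from the REML output above:

```python
# Direct standard errors (from the Milk Expenditure dataset) and
# REML EBLUP MSEs (from the output above) for three small areas
values = {
    30: (0.089, 0.006099),
    31: (0.225, 0.015442),
    38: (0.132, 0.010156),
}

for area_id, (std_err, eblup_mse) in values.items():
    direct_var = std_err ** 2  # variance of the direct estimate
    print(area_id, round(direct_var, 6), eblup_mse,
          round(eblup_mse / direct_var, 2))
```

For all three areas the EBLUP MSE is smaller than the direct variance, with the largest gain for area 31, whose direct estimate had the largest standard error.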

We could also fit the model parameters using the maximum likelihood (ML) method, which impacts the MSE estimation as well. To estimate the area means using the ML method, we only need to set method="ML" then run the prediction as follows.

## ML method
fh_model_ml = EblupAreaModel(method="ML")
fh_model_ml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_ml.predict(
    X=X, area=area, intercept=True
)

milk_est_ml = fh_model_ml.to_dataframe(
    col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]
    )


print("The dataframe version of the ML area level estimates:")
print(milk_est_ml)
The dataframe version of the ML area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.016173   0.013580
1       mean           2        1.043697   0.005513
2       mean           3        1.062817   0.005851
3       mean           4        0.775349   0.008735
4       mean           5        0.855490   0.009775
5       mean           6        0.973586   0.011841
6       mean           7        1.047478   0.015934
7       mean           8        1.095344   0.010822
8       mean           9        1.205409   0.014346
9       mean          10        1.181256   0.015036
10      mean          11        0.803370   0.007911
11      mean          12        1.196775   0.016405
12      mean          13        1.196159   0.012771
13      mean          14        0.991405   0.012335
14      mean          15        1.186883   0.012192
15      mean          16        1.159036   0.011877
16      mean          17        1.223237   0.011041
17      mean          18        1.275519   0.013805
18      mean          19        1.232285   0.011214
19      mean          20        1.230442   0.013214
20      mean          21        1.098577   0.010138
21      mean          22        1.192160   0.017194
22      mean          23        1.127992   0.011468
23      mean          24        1.219628   0.013742
24      mean          25        1.193626   0.008251
25      mean          26        0.759065   0.009345
26      mean          27        0.761123   0.009345
27      mean          28        0.731565   0.016390
28      mean          29        0.766273   0.007942
29      mean          30        0.619145   0.006222
30      mean          31        0.762939   0.015404
31      mean          32        0.786375   0.014656
32      mean          33        0.767978   0.009165
33      mean          34        0.614135   0.003947
34      mean          35        0.701311   0.007942
35      mean          36        0.755756   0.009782
36      mean          37        0.540665   0.006532
37      mean          38        0.741132   0.010286
38      mean          39        0.752451   0.007347
39      mean          40        0.766242   0.008613
40      mean          41        0.746536   0.005598
41      mean          42        0.797140   0.009345
42      mean          43        0.684098   0.010037

Similarly, we can use the Fay-Herriot (FH) method as follows.

## FH method
fh_model_fh = EblupAreaModel(method="FH")
fh_model_fh.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_fh.predict(
    X=X, area=area, intercept=True
)

milk_est_fh = fh_model_fh.to_dataframe(
    col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]
    )


print("The dataframe version of the FH area level estimates:")
print(milk_est_fh)
The dataframe version of the FH area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.017976   0.012757
1       mean           2        1.044964   0.005314
2       mean           3        1.064481   0.005632
3       mean           4        0.770692   0.008323
4       mean           5        0.852512   0.009284
5       mean           6        0.973826   0.011178
6       mean           7        1.050857   0.014868
7       mean           8        1.096165   0.010253
8       mean           9        1.210505   0.013471
9       mean          10        1.185640   0.014095
10      mean          11        0.797569   0.007558
11      mean          12        1.202150   0.015325
12      mean          13        1.200459   0.012039
13      mean          14        0.988971   0.011640
14      mean          15        1.186745   0.011467
15      mean          16        1.157992   0.011182
16      mean          17        1.224223   0.010424
17      mean          18        1.278680   0.012914
18      mean          19        1.233566   0.010581
19      mean          20        1.231860   0.012386
20      mean          21        1.095955   0.009600
21      mean          22        1.192213   0.015890
22      mean          23        1.125997   0.010811
23      mean          24        1.220695   0.012858
24      mean          25        1.193687   0.007865
25      mean          26        0.760244   0.008855
26      mean          27        0.762358   0.008855
27      mean          28        0.732288   0.015042
28      mean          29        0.767459   0.007569
29      mean          30        0.617310   0.005975
30      mean          31        0.764997   0.014212
31      mean          32        0.789315   0.013572
32      mean          33        0.769376   0.008692
33      mean          34        0.612861   0.003833
34      mean          35        0.700962   0.007569
35      mean          36        0.756891   0.009253
36      mean          37        0.537193   0.006264
37      mean          38        0.741882   0.009709
38      mean          39        0.753251   0.007020
39      mean          40        0.767519   0.008186
40      mean          41        0.747059   0.005391
41      mean          42        0.799363   0.008855
42      mean          43        0.683161   0.009484
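For reference, the FH fitting method used above estimates the random-effect variance by Fay and Herriot's method of moments, solving the following equation iteratively for σ̂²ᵤ (see Rao and Molina, 2015), where m is the number of areas and p the number of regression parameters:

```latex
\sum_{i=1}^{m}
\frac{\left( \hat{\theta}_i - \mathbf{x}_i^{\top}
  \hat{\boldsymbol{\beta}}(\hat{\sigma}_u^2) \right)^2}
{\hat{\sigma}_u^2 + \sigma_{e_i}^2} = m - p
```

Unlike REML and ML, this moment estimator does not rely on the normality of the random effects, which is one reason it is often used as a robustness check.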