Area Level Modeling

Empirical best linear unbiased predictor (EBLUP)

Small area estimation (SAE) techniques are useful when the sample sizes are not sufficient to provide reliable direct domain estimates given the sampling design. In this tutorial, direct estimates refer to estimates obtained from the design-based approach, which usually consists of applying adjusted design weights to the variable of interest to compute sample statistics as estimates of the corresponding population parameters. When auxiliary information is available, model assisted survey methods can be used to estimate population parameters.
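To make the notion of a direct estimate concrete, a direct estimate of a domain mean is essentially a design-weighted average computed within the domain. The small sketch below is purely illustrative; the data frame, weights, and column names are hypothetical and are not part of the Milk Expenditure data.

import numpy as np
import pandas as pd

# Hypothetical survey extract: a domain label, design weights and an outcome
svy = pd.DataFrame({
    "domain": ["a", "a", "b", "b", "b"],
    "weight": [120.0, 95.0, 80.0, 110.0, 60.0],
    "y": [1.2, 0.9, 0.7, 0.8, 0.6],
})

# Direct (design-based) estimate per domain: sum(w * y) / sum(w)
direct_est = svy.groupby("domain").apply(
    lambda g: np.average(g["y"], weights=g["weight"])
)
print(direct_est)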

In this tutorial, we will go further and use modeling techniques to produce domain estimates. For the area level model, the modeling is done at the area level using generalized linear mixed models. The sections below show how to use the EblupAreaModel class from the samplics package to produce area level estimates.
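For reference, the EBLUP produced below is based on the Fay-Herriot area level model (Rao and Molina, 2015), which combines a sampling model for the direct estimates with a linking model for the true area means:

\hat{\theta}_i = \theta_i + e_i, \qquad e_i \sim N(0, \psi_i),
\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad u_i \sim N(0, \sigma_u^2), \qquad i = 1, \ldots, m,

where the sampling variances \psi_i are treated as known (in the code below they come from the reported standard errors of the direct estimates), and \sigma_u^2 is the variance of the area random effects, estimated from the data.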

Milk Expenditure data

To illustrate the EblupAreaModel class, we will use the Milk Expenditure dataset used in Rao and Molina (2015). As mentioned in the book, this dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). For R users, this dataset is also available in the R package sae (https://cran.r-project.org/web/packages/sae/index.html).

The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. The dataset has the following variables: major area (major_area), small area (small_area), sample size (samp_size), direct survey estimate of average expenditure (direct_est), standard error of the direct estimate (std_error), and coefficient of variation of the direct estimate (coef_var).

from samplics.datasets import load_expenditure_milk
from samplics.sae import EblupAreaModel

from samplics.utils.types import FitMethod
# Load Expenditure on Milk sample data
milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

nb_obs = 15
print(f"First {nb_obs} observations of the Milk Expendure dataset\n")
milk_exp.tail(nb_obs)
Last 15 observations of the Milk Expenditure dataset
    major_area  small_area  samp_size  direct_est  std_error  coef_var
28           4          29        238       0.796      0.106     0.133
29           4          30        207       0.565      0.089     0.158
30           4          31        165       0.886      0.225     0.254
31           4          32        153       0.952      0.205     0.215
32           4          33        210       0.807      0.119     0.147
33           4          34        383       0.582      0.067     0.115
34           4          35        255       0.684      0.106     0.155
35           4          36        226       0.787      0.126     0.160
36           4          37        224       0.440      0.092     0.209
37           4          38        212       0.759      0.132     0.174
38           4          39        211       0.770      0.100     0.130
39           4          40        179       0.800      0.113     0.141
40           4          41        312       0.756      0.083     0.110
41           4          42        241       0.865      0.121     0.140
42           4          43        205       0.640      0.129     0.202

EBLUP Predictor

As shown in the milk expenditure dataset, some of the coefficients of variation are not small, which indicates instability of the direct survey estimates. Hence, we can try to reduce the variability of the estimates by smoothing them through modeling. For illustration purposes, we will model the average expenditure on milk using the major areas as auxiliary variables.
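For instance, we can list the areas whose direct estimates have a coefficient of variation above some threshold; the 0.20 cut-off below is only an illustrative choice, not a samplics convention.

# Areas with relatively unstable direct estimates (CV above 0.20)
print(milk_exp.loc[milk_exp["coef_var"] > 0.20, ["small_area", "direct_est", "coef_var"]])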

First, we use the method fit() to estimate the model parameters. The pandas function get_dummies() creates a matrix of dummy values (0 and 1) from the categorical variable major_area.

area = milk_exp["small_area"]
yhat = milk_exp["direct_est"]

import pandas as pd
X = pd.get_dummies(milk_exp["major_area"],drop_first=True)
sigma_e = milk_exp["std_error"]

## REML method
fh_model_reml = EblupAreaModel(method=FitMethod.reml)
fh_model_reml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)
from pprint import pprint

print(f"The estimated fixed effects are:") 
pprint(fh_model_reml.fixed_effects)
The estimated fixed effects are:
array([ 0.96818899,  0.13278031,  0.22694622, -0.24130104])
print("The estimated standard error of the area random effects is:")
pprint(fh_model_reml.re_std)
The estimated standard error of the area random effects is:
0.13619961367679143
print("The convergence statistics are:")
pprint(fh_model_reml.convergence)
The convergence statistics are:
{'achieved': True, 'iterations': 9, 'precision': 1.979926027448861e-09}
print("The goodness of fit statistics are:") 
pprint(fh_model_reml.goodness)
The goodness of fit statistics are:
{'AIC': 30.806948817978125,
 'BIC': 41.374149512139496,
 'loglike': -9.403474408989062}
fh_model_reml.predict(
    X=X, area=area, intercept=True
)

pprint(fh_model_reml.area_est)
{1: 1.0219705441558897,
 2: 1.0476019514456771,
 3: 1.0679514263086398,
 4: 0.7608165650758554,
 5: 0.846157043770443,
 6: 0.9743727061135177,
 7: 1.0584526719577587,
 8: 1.0977762561870672,
 9: 1.221545489460039,
 10: 1.195146014849563,
 11: 0.7852149191679926,
 12: 1.2139462053925376,
 13: 1.2096597208051416,
 14: 0.9834964411924456,
 15: 1.186424709592169,
 16: 1.1556981139181448,
 17: 1.2263412506598237,
 18: 1.285648988670589,
 19: 1.2363248408581455,
 20: 1.2349601393920557,
 21: 1.0903016274841368,
 22: 1.1923057228337586,
 23: 1.121646766758834,
 24: 1.2230297218953323,
 25: 1.1938054443945292,
 26: 0.7627195896418988,
 27: 0.7649551532185552,
 28: 0.7338443880846524,
 29: 0.7699295541643935,
 30: 0.6134416233555453,
 31: 0.7695560722792779,
 32: 0.7958253116952815,
 33: 0.7723188477263883,
 34: 0.6102300683088241,
 35: 0.7001781896510537,
 36: 0.7592788104101728,
 37: 0.5298863364482943,
 38: 0.743446678039724,
 39: 0.7548996331139428,
 40: 0.7701919657191105,
 41: 0.748116423847363,
 42: 0.8040775158100446,
 43: 0.6810868850580025}

We can use the utility method to_dataframe() to output the estimates as a pandas dataframe. The method provides the parameter, the area, the point estimate, and its MSE estimate. We can use col_names to customize the column names, for example col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]. Otherwise, if col_names is not provided, "_area", "_estimates" and "_mse" are used as defaults.

milk_est_reml = fh_model_reml.to_dataframe(
    col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]
    )
print("The dataframe version of the area level estimates:")
print(milk_est_reml)
The dataframe version of the area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.021971   0.013460
1       mean           2        1.047602   0.005373
2       mean           3        1.067951   0.005702
3       mean           4        0.760817   0.008542
4       mean           5        0.846157   0.009580
5       mean           6        0.974373   0.011671
6       mean           7        1.058453   0.015926
7       mean           8        1.097776   0.010587
8       mean           9        1.221545   0.014184
9       mean          10        1.195146   0.014902
10      mean          11        0.785215   0.007694
11      mean          12        1.213946   0.016337
12      mean          13        1.209660   0.012563
13      mean          14        0.983496   0.012117
14      mean          15        1.186425   0.012031
15      mean          16        1.155698   0.011709
16      mean          17        1.226341   0.010860
17      mean          18        1.285649   0.013691
18      mean          19        1.236325   0.011035
19      mean          20        1.234960   0.013080
20      mean          21        1.090302   0.009949
21      mean          22        1.192306   0.017244
22      mean          23        1.121647   0.011292
23      mean          24        1.223030   0.013625
24      mean          25        1.193805   0.008066
25      mean          26        0.762720   0.009205
26      mean          27        0.764955   0.009205
27      mean          28        0.733844   0.016477
28      mean          29        0.769930   0.007801
29      mean          30        0.613442   0.006099
30      mean          31        0.769556   0.015442
31      mean          32        0.795825   0.014658
32      mean          33        0.772319   0.009025
33      mean          34        0.610230   0.003871
34      mean          35        0.700178   0.007801
35      mean          36        0.759279   0.009646
36      mean          37        0.529886   0.006404
37      mean          38        0.743447   0.010156
38      mean          39        0.754900   0.007210
39      mean          40        0.770192   0.008470
40      mean          41        0.748116   0.005485
41      mean          42        0.804078   0.009205
42      mean          43        0.681087   0.009904
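As a rough check that the modeling reduced the variability, we can compare the coefficients of variation of the EBLUP estimates with the direct ones. The sketch below assumes the column names chosen above, uses the root MSE as an approximate standard error of the EBLUP, and relies on both tables being ordered by small area.

import numpy as np

# Approximate CV of the EBLUP: root MSE divided by the point estimate
cv_check = milk_est_reml.copy()
cv_check["eblup_cv"] = np.sqrt(cv_check["eblup_mse"]) / cv_check["eblup_estimate"]
cv_check["direct_cv"] = milk_exp["coef_var"].to_numpy()  # same area order as the estimates
print(cv_check[["small_area", "direct_cv", "eblup_cv"]].head(10))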

We could also fit the model parameters using the maximum likelihood (ML) method, which impacts the MSE estimation as well. To estimate the area means with ML, we only need to set method=FitMethod.ml and then run the prediction as follows.

## ML method
fh_model_ml = EblupAreaModel(method=FitMethod.ml)
fh_model_ml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_ml.predict(
    X=X, area=area, intercept=True
)

milk_est_ml = fh_model_ml.to_dataframe(
    col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"]
    )


print("The dataframe version of the ML area level estimates:")
print(milk_est_ml)
The dataframe version of the ML area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.016228   0.013559
1       mean           2        1.043736   0.005507
2       mean           3        1.062868   0.005844
3       mean           4        0.775205   0.008724
4       mean           5        0.855398   0.009761
5       mean           6        0.973593   0.011823
6       mean           7        1.047581   0.015908
7       mean           8        1.095370   0.010806
8       mean           9        1.205566   0.014324
9       mean          10        1.181391   0.015012
10      mean          11        0.803191   0.007901
11      mean          12        1.196940   0.016378
12      mean          13        1.196292   0.012752
13      mean          14        0.991332   0.012316
14      mean          15        1.186879   0.012174
15      mean          16        1.159004   0.011859
16      mean          17        1.223267   0.011025
17      mean          18        1.275615   0.013783
18      mean          19        1.232324   0.011197
19      mean          20        1.230486   0.013193
20      mean          21        1.098496   0.010124
21      mean          22        1.192161   0.017163
22      mean          23        1.127931   0.011450
23      mean          24        1.219661   0.013719
24      mean          25        1.193628   0.008241
25      mean          26        0.759102   0.009332
26      mean          27        0.761161   0.009332
27      mean          28        0.731587   0.016359
28      mean          29        0.766310   0.007931
29      mean          30        0.619089   0.006215
30      mean          31        0.763002   0.015376
31      mean          32        0.786464   0.014629
32      mean          33        0.768021   0.009153
33      mean          34        0.614095   0.003944
34      mean          35        0.701300   0.007931
35      mean          36        0.755792   0.009768
36      mean          37        0.540557   0.006525
37      mean          38        0.741155   0.010271
38      mean          39        0.752476   0.007338
39      mean          40        0.766282   0.008601
40      mean          41        0.746552   0.005592
41      mean          42        0.797209   0.009332
42      mean          43        0.684069   0.010022

Similarly, we can use the Fay-Herriot (FH) method as follows:

## FH method
fh_model_fh = EblupAreaModel(method=FitMethod.fh)
fh_model_fh.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_fh.predict(
    X=X, area=area, intercept=True
)

milk_est_fh = fh_model_fh.to_dataframe(col_names = ["parameter", "small_area", "eblup_estimate", "eblup_mse"])


print(f"The dataframe version of the ML area level estimates:")
print(milk_est_fh)
The dataframe version of the FH area level estimates:
   parameter  small_area  eblup_estimate  eblup_mse
0       mean           1        1.017976   0.012757
1       mean           2        1.044964   0.005314
2       mean           3        1.064481   0.005632
3       mean           4        0.770692   0.008323
4       mean           5        0.852512   0.009284
5       mean           6        0.973826   0.011178
6       mean           7        1.050857   0.014868
7       mean           8        1.096165   0.010253
8       mean           9        1.210505   0.013471
9       mean          10        1.185640   0.014095
10      mean          11        0.797569   0.007558
11      mean          12        1.202150   0.015325
12      mean          13        1.200459   0.012039
13      mean          14        0.988971   0.011640
14      mean          15        1.186745   0.011467
15      mean          16        1.157992   0.011182
16      mean          17        1.224223   0.010424
17      mean          18        1.278680   0.012914
18      mean          19        1.233566   0.010581
19      mean          20        1.231860   0.012386
20      mean          21        1.095955   0.009600
21      mean          22        1.192213   0.015890
22      mean          23        1.125997   0.010811
23      mean          24        1.220695   0.012858
24      mean          25        1.193687   0.007865
25      mean          26        0.760244   0.008855
26      mean          27        0.762358   0.008855
27      mean          28        0.732288   0.015042
28      mean          29        0.767459   0.007569
29      mean          30        0.617310   0.005975
30      mean          31        0.764997   0.014212
31      mean          32        0.789315   0.013572
32      mean          33        0.769376   0.008692
33      mean          34        0.612861   0.003833
34      mean          35        0.700962   0.007569
35      mean          36        0.756891   0.009253
36      mean          37        0.537193   0.006264
37      mean          38        0.741882   0.009709
38      mean          39        0.753251   0.007020
39      mean          40        0.767519   0.008186
40      mean          41        0.747059   0.005391
41      mean          42        0.799363   0.008855
42      mean          43        0.683161   0.009484
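Finally, since the REML, ML, and FH dataframes share the same layout, the three sets of estimates can be put side by side for comparison. This is only a sketch that assumes the column names used in the calls above.

import pandas as pd

# Combine the REML, ML and FH point estimates and MSEs into one comparison table
comparison = pd.DataFrame({
    "small_area": milk_est_reml["small_area"],
    "est_reml": milk_est_reml["eblup_estimate"],
    "est_ml": milk_est_ml["eblup_estimate"],
    "est_fh": milk_est_fh["eblup_estimate"],
    "mse_reml": milk_est_reml["eblup_mse"],
    "mse_ml": milk_est_ml["eblup_mse"],
    "mse_fh": milk_est_fh["eblup_mse"],
})
print(comparison.head(10))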