Replicate Weights

Replicate weights are usually created for the purpose of variance (uncertainty) estimation. One common use case for replication-based methods is the estimation of non-linear parameters for which Taylor-based approximations may not be accurate enough. Another use case is when the number of primary sampling units selected per stratum is small (low degrees of freedom).

In this tutorial, we will explore creating replicate weights using the class ReplicateWeight. Three replication methods have been implemented: balanced repeated replication (BRR), including Fay-BRR; bootstrap; and jackknife. The replication method of interest is specified when initializing the class using the parameter method, which takes the values “bootstrap”, “brr”, or “jackknife”. In this tutorial, we show how the API works for producing replicate weights.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import ReplicateWeight

from samplics.utils.types import RepMethod

We import the sample data…

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster")

full_sample["inclusion_prob"] = \
    full_sample["psu_prob"] * full_sample["ssu_prob"] 
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"] 

full_sample.head(15)
cluster region psu_prob household ssu_prob inclusion_prob design_weight
0 7 North 0.187726 72 0.115385 0.021661 46.166667
1 7 North 0.187726 73 0.115385 0.021661 46.166667
2 7 North 0.187726 75 0.115385 0.021661 46.166667
3 7 North 0.187726 715 0.115385 0.021661 46.166667
4 7 North 0.187726 722 0.115385 0.021661 46.166667
5 7 North 0.187726 724 0.115385 0.021661 46.166667
6 7 North 0.187726 755 0.115385 0.021661 46.166667
7 7 North 0.187726 761 0.115385 0.021661 46.166667
8 7 North 0.187726 764 0.115385 0.021661 46.166667
9 7 North 0.187726 782 0.115385 0.021661 46.166667
10 7 North 0.187726 795 0.115385 0.021661 46.166667
11 7 North 0.187726 7111 0.115385 0.021661 46.166667
12 7 North 0.187726 7112 0.115385 0.021661 46.166667
13 7 North 0.187726 7117 0.115385 0.021661 46.166667
14 7 North 0.187726 7123 0.115385 0.021661 46.166667
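The weight computation above can be verified by hand for the first row of the table. The overall inclusion probability is the product of the two stage-level probabilities, and the design weight is its inverse:

```python
# First row of the merged sample (values taken from the table above)
psu_prob = 0.187726
ssu_prob = 0.115385

# Overall inclusion probability: product of the first- and second-stage probabilities
inclusion_prob = psu_prob * ssu_prob  # ~0.021661

# Design weight: inverse of the inclusion probability
design_weight = 1 / inclusion_prob    # ~46.166667
```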

Balanced Repeated Replication (BRR)

The basic idea of BRR is to split the sample into independent random groups. The groups are then treated as independent replicates of the sample design. A special case is when the sample is split into two half-samples in each stratum. This design is suitable for many survey designs where only two PSUs are selected per stratum. In practice, one of the PSUs is assigned to the first random group and the other PSU is assigned to the second group. The sample weights are doubled for one group (say the first one) while the sample weights in the other group are set to zero. To ensure that the replicates are independent, we use Hadamard matrices to assign the random groups.

import scipy
scipy.linalg.hadamard(8)
array([[ 1,  1,  1,  1,  1,  1,  1,  1],
       [ 1, -1,  1, -1,  1, -1,  1, -1],
       [ 1,  1, -1, -1,  1,  1, -1, -1],
       [ 1, -1, -1,  1,  1, -1, -1,  1],
       [ 1,  1,  1,  1, -1, -1, -1, -1],
       [ 1, -1,  1, -1, -1,  1, -1,  1],
       [ 1,  1, -1, -1, -1, -1,  1,  1],
       [ 1, -1, -1,  1, -1,  1,  1, -1]])

In our example, we have 10 PSUs. If we do not have explicit stratification then replicate() will group the clusters into 5 strata (2 per stratum). In this case, the smallest number of replicates possible using the Hadamard matrix is 8.
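The rule behind the "8 replicates" figure can be sketched as follows. Hadamard matrices used for BRR have orders that are multiples of 4, so the number of replicates is the smallest multiple of 4 that is at least the number of strata. The helper below, min_hadamard_reps, is a hypothetical illustration (not part of samplics) of that rule:

```python
import math

def min_hadamard_reps(nb_strata: int) -> int:
    """Smallest multiple of 4 that is >= the number of strata.
    Hadamard matrices of order 4k are used to balance the half-samples."""
    return 4 * math.ceil(nb_strata / 4)

# With 5 strata (10 PSUs paired into 5 groups), the minimum is 8 replicates
min_hadamard_reps(5)  # 8
```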

The result below shows that replicate() created 5 strata by grouping clusters 7 and 10 in the first stratum, clusters 16 and 24 in the second stratum, and so on. We can achieve the same result by setting strat=True and providing the stratum variable to replicate().

brr = ReplicateWeight(method=RepMethod.brr, strat=False)
brr_wgt = brr.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

brr_wgt.drop_duplicates().head(10)
_stratum _psu _samp_weight _brr_wgt_1 _brr_wgt_2 _brr_wgt_3 _brr_wgt_4 _brr_wgt_5 _brr_wgt_6 _brr_wgt_7 _brr_wgt_8
0 1 7 46.166667 0.000000 92.333333 0.000000 92.333333 0.000000 92.333333 0.000000 92.333333
15 1 10 50.783333 101.566667 0.000000 101.566667 0.000000 101.566667 0.000000 101.566667 0.000000
30 2 16 62.149123 0.000000 0.000000 124.298246 124.298246 0.000000 0.000000 124.298246 124.298246
45 2 24 58.940741 117.881481 117.881481 0.000000 0.000000 117.881481 117.881481 0.000000 0.000000
60 3 29 65.702778 0.000000 131.405556 131.405556 0.000000 0.000000 131.405556 131.405556 0.000000
75 3 34 75.661566 151.323133 0.000000 0.000000 151.323133 151.323133 0.000000 0.000000 151.323133
90 4 45 85.398025 0.000000 0.000000 0.000000 0.000000 170.796049 170.796049 170.796049 170.796049
105 4 52 85.520635 171.041270 171.041270 171.041270 171.041270 0.000000 0.000000 0.000000 0.000000
120 5 64 218.893889 0.000000 437.787778 0.000000 437.787778 437.787778 0.000000 437.787778 0.000000
135 5 86 213.491667 426.983333 0.000000 426.983333 0.000000 0.000000 426.983333 0.000000 426.983333

An extension of BRR is Fay’s method. In Fay’s approach, instead of multiplying one half-sample by zero, we multiply its sample weights by a factor \(\alpha\) and the other half-sample by \(2-\alpha\). We refer to \(\alpha\) as the Fay coefficient. Note that when \(\alpha=0\), Fay’s method reduces to BRR.
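The arithmetic behind the Fay adjustment is simple to check by hand. For cluster 7 below (design weight 46.166667) with \(\alpha=0.3\), one half-sample is scaled by 0.3 and the other by 1.7; since the two factors average to 1, the total weight is preserved across each replicate:

```python
alpha = 0.3          # Fay coefficient
w = 46.166667        # design weight of cluster 7 in the example

down = alpha * w         # ~13.85, the "deweighted" half-sample
up = (2 - alpha) * w     # ~78.483, the "upweighted" half-sample

# The two factors alpha and 2 - alpha average to 1,
# so the pair of half-samples still represents the full weight
balance = (down + up) / 2  # equals w
```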

fay = ReplicateWeight(method=RepMethod.brr, strat=False, fay_coef=0.3)
fay_wgt = fay.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    rep_prefix="fay_weight_",
    psu_varname="cluster", 
    str_varname="stratum"
)

fay_wgt.drop_duplicates().head(10)
stratum cluster _samp_weight fay_weight_1 fay_weight_2 fay_weight_3 fay_weight_4 fay_weight_5 fay_weight_6 fay_weight_7 fay_weight_8
0 1 7 46.166667 13.850000 78.483333 13.850000 78.483333 13.850000 78.483333 13.850000 78.483333
15 1 10 50.783333 86.331667 15.235000 86.331667 15.235000 86.331667 15.235000 86.331667 15.235000
30 2 16 62.149123 18.644737 18.644737 105.653509 105.653509 18.644737 18.644737 105.653509 105.653509
45 2 24 58.940741 100.199259 100.199259 17.682222 17.682222 100.199259 100.199259 17.682222 17.682222
60 3 29 65.702778 19.710833 111.694722 111.694722 19.710833 19.710833 111.694722 111.694722 19.710833
75 3 34 75.661566 128.624663 22.698470 22.698470 128.624663 128.624663 22.698470 22.698470 128.624663
90 4 45 85.398025 25.619407 25.619407 25.619407 25.619407 145.176642 145.176642 145.176642 145.176642
105 4 52 85.520635 145.385079 145.385079 145.385079 145.385079 25.656190 25.656190 25.656190 25.656190
120 5 64 218.893889 65.668167 372.119611 65.668167 372.119611 372.119611 65.668167 372.119611 65.668167
135 5 86 213.491667 362.935833 64.047500 362.935833 64.047500 64.047500 362.935833 64.047500 362.935833
Important

For any of the three methods, we can request the replicate coefficient instead of the replicate weights by using rep_coefs=True.

#fay = ReplicateWeight(method="brr", strat=False, fay_coef=0.3)
fay_wgt = fay.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    rep_prefix="fay_weight_",
    psu_varname="cluster", 
    str_varname="stratum",
    rep_coefs=True
)

fay_wgt.drop_duplicates().head(10)
stratum cluster _samp_weight fay_weight_1 fay_weight_2 fay_weight_3 fay_weight_4 fay_weight_5 fay_weight_6 fay_weight_7 fay_weight_8
0 1 7 46.166667 0.3 1.7 0.3 1.7 0.3 1.7 0.3 1.7
15 1 10 50.783333 1.7 0.3 1.7 0.3 1.7 0.3 1.7 0.3
30 2 16 62.149123 0.3 0.3 1.7 1.7 0.3 0.3 1.7 1.7
45 2 24 58.940741 1.7 1.7 0.3 0.3 1.7 1.7 0.3 0.3
60 3 29 65.702778 0.3 1.7 1.7 0.3 0.3 1.7 1.7 0.3
75 3 34 75.661566 1.7 0.3 0.3 1.7 1.7 0.3 0.3 1.7
90 4 45 85.398025 0.3 0.3 0.3 0.3 1.7 1.7 1.7 1.7
105 4 52 85.520635 1.7 1.7 1.7 1.7 0.3 0.3 0.3 0.3
120 5 64 218.893889 0.3 1.7 0.3 1.7 1.7 0.3 1.7 0.3
135 5 86 213.491667 1.7 0.3 1.7 0.3 0.3 1.7 0.3 1.7

Bootstrap

For the bootstrap replicates, we need to provide the number of replicates; when it is not provided, ReplicateWeight defaults to 500. The bootstrap consists of selecting the same number of PSUs as in the sample, but with replacement. The selection is repeated independently for each replicate.
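The shape of the bootstrap weights can be sketched without samplics. The exact algorithm is internal to the library, but the increments of roughly 10/9 times the design weight visible in the output below are consistent with a Rao–Wu-type adjustment, in which n − 1 PSUs are drawn with replacement and each PSU's weight is rescaled by how often it was hit. This is an assumption about the internals, shown here only for intuition:

```python
import numpy as np

rng = np.random.default_rng(7)
n_psu = 10  # number of PSUs in the sample

# Draw n - 1 PSUs with replacement and count how often each one is selected
hits = np.bincount(rng.integers(0, n_psu, size=n_psu - 1), minlength=n_psu)

# Rao-Wu rescaling: replicate weight = design weight * hits * n / (n - 1).
# A PSU hit once gets factor 10/9 (~1.111), hit twice 20/9, never hit gets 0.
factors = hits * n_psu / (n_psu - 1)
```

The factors always sum to n, so each replicate represents the full sample size in expectation.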

bootstrap = ReplicateWeight(
    method=RepMethod.bootstrap, 
    strat=False, 
    nb_reps=50
    )
boot_wgt = bootstrap.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

boot_wgt.drop_duplicates().head(10)
_psu _samp_weight _boot_wgt_1 _boot_wgt_2 _boot_wgt_3 _boot_wgt_4 _boot_wgt_5 _boot_wgt_6 _boot_wgt_7 _boot_wgt_8 ... _boot_wgt_41 _boot_wgt_42 _boot_wgt_43 _boot_wgt_44 _boot_wgt_45 _boot_wgt_46 _boot_wgt_47 _boot_wgt_48 _boot_wgt_49 _boot_wgt_50
0 7 46.166667 0.000000 51.296296 0.000000 0.000000 0.000000 0.000000 51.296296 102.592593 ... 0.000000 0.000000 51.296296 153.888889 102.592593 51.296296 0.000000 51.296296 0.000000 51.296296
15 10 50.783333 112.851852 169.277778 56.425926 0.000000 0.000000 56.425926 0.000000 0.000000 ... 0.000000 0.000000 56.425926 0.000000 112.851852 0.000000 0.000000 112.851852 56.425926 56.425926
30 16 62.149123 69.054581 69.054581 69.054581 138.109162 138.109162 138.109162 0.000000 69.054581 ... 69.054581 69.054581 138.109162 0.000000 69.054581 0.000000 69.054581 207.163743 207.163743 0.000000
45 24 58.940741 0.000000 0.000000 196.469136 65.489712 65.489712 0.000000 0.000000 0.000000 ... 0.000000 65.489712 0.000000 0.000000 0.000000 196.469136 196.469136 0.000000 65.489712 0.000000
60 29 65.702778 146.006173 219.009259 0.000000 73.003086 146.006173 73.003086 0.000000 73.003086 ... 73.003086 146.006173 146.006173 0.000000 73.003086 0.000000 0.000000 73.003086 73.003086 73.003086
75 34 75.661566 84.068407 0.000000 252.205222 0.000000 0.000000 84.068407 84.068407 84.068407 ... 84.068407 0.000000 168.136814 84.068407 84.068407 0.000000 84.068407 0.000000 84.068407 84.068407
90 45 85.398025 0.000000 0.000000 0.000000 0.000000 94.886694 189.773388 189.773388 94.886694 ... 94.886694 189.773388 0.000000 94.886694 0.000000 94.886694 0.000000 94.886694 0.000000 0.000000
105 52 85.520635 95.022928 95.022928 0.000000 0.000000 95.022928 95.022928 95.022928 0.000000 ... 0.000000 0.000000 0.000000 0.000000 190.045855 95.022928 95.022928 0.000000 0.000000 0.000000
120 64 218.893889 486.430864 0.000000 0.000000 486.430864 0.000000 0.000000 729.646296 486.430864 ... 486.430864 486.430864 243.215432 486.430864 0.000000 486.430864 243.215432 243.215432 243.215432 729.646296
135 86 213.491667 0.000000 0.000000 237.212963 711.638889 474.425926 237.212963 237.212963 237.212963 ... 711.638889 237.212963 0.000000 474.425926 0.000000 237.212963 474.425926 0.000000 237.212963 474.425926

10 rows × 52 columns

Jackknife

Below, we illustrate the API for creating replicate weights using the jackknife method.

jackknife = ReplicateWeight(
    method=RepMethod.jackknife, 
    strat=False
    )
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

jkn_wgt.drop_duplicates().head(10)
_psu _samp_weight _jk_wgt_1 _jk_wgt_2 _jk_wgt_3 _jk_wgt_4 _jk_wgt_5 _jk_wgt_6 _jk_wgt_7 _jk_wgt_8 _jk_wgt_9 _jk_wgt_10
0 7 46.166667 0.000000 51.296296 51.296296 51.296296 51.296296 51.296296 51.296296 51.296296 51.296296 51.296296
15 10 50.783333 56.425926 0.000000 56.425926 56.425926 56.425926 56.425926 56.425926 56.425926 56.425926 56.425926
30 16 62.149123 69.054581 69.054581 0.000000 69.054581 69.054581 69.054581 69.054581 69.054581 69.054581 69.054581
45 24 58.940741 65.489712 65.489712 65.489712 0.000000 65.489712 65.489712 65.489712 65.489712 65.489712 65.489712
60 29 65.702778 73.003086 73.003086 73.003086 73.003086 0.000000 73.003086 73.003086 73.003086 73.003086 73.003086
75 34 75.661566 84.068407 84.068407 84.068407 84.068407 84.068407 0.000000 84.068407 84.068407 84.068407 84.068407
90 45 85.398025 94.886694 94.886694 94.886694 94.886694 94.886694 94.886694 0.000000 94.886694 94.886694 94.886694
105 52 85.520635 95.022928 95.022928 95.022928 95.022928 95.022928 95.022928 95.022928 0.000000 95.022928 95.022928
120 64 218.893889 243.215432 243.215432 243.215432 243.215432 243.215432 243.215432 243.215432 243.215432 0.000000 243.215432
135 86 213.491667 237.212963 237.212963 237.212963 237.212963 237.212963 237.212963 237.212963 237.212963 237.212963 0.000000
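The pattern in the table above is the delete-one jackknife: each replicate drops one PSU (weight set to 0) and inflates the remaining PSUs by n/(n − 1) so the replicate still represents the full population. For cluster 7:

```python
n_psu = 10
w = 46.166667  # design weight of cluster 7

# Delete-one jackknife inflation for the retained PSUs
jk_weight = w * n_psu / (n_psu - 1)  # ~51.296296, as in the table above
```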

With stratification…

jackknife = ReplicateWeight(method=RepMethod.jackknife, strat=True)
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    stratum=full_sample["region"]
    )

jkn_wgt.drop_duplicates().head(10)
_stratum _psu _samp_weight _jk_wgt_1 _jk_wgt_2 _jk_wgt_3 _jk_wgt_4 _jk_wgt_5 _jk_wgt_6 _jk_wgt_7 _jk_wgt_8 _jk_wgt_9 _jk_wgt_10
0 East 52 85.520635 0.000000 69.250000 69.250000 46.166667 46.166667 46.166667 46.166667 46.166667 46.166667 46.166667
1 East 45 85.398025 69.250000 0.000000 69.250000 46.166667 46.166667 46.166667 46.166667 46.166667 46.166667 46.166667
15 East 52 85.520635 0.000000 76.175000 76.175000 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333
16 East 45 85.398025 76.175000 0.000000 76.175000 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333
20 East 34 75.661566 76.175000 76.175000 0.000000 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333 50.783333
30 East 34 75.661566 93.223684 93.223684 0.000000 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123
33 East 45 85.398025 93.223684 0.000000 93.223684 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123
38 East 52 85.520635 0.000000 93.223684 93.223684 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123 62.149123
45 North 7 46.166667 58.940741 58.940741 58.940741 0.000000 117.881481 58.940741 58.940741 58.940741 58.940741 58.940741
59 North 10 50.783333 58.940741 58.940741 58.940741 117.881481 0.000000 58.940741 58.940741 58.940741 58.940741 58.940741
Important

For any of the three methods, we can request the replicate coefficient instead of the replicate weights by using rep_coefs=True.

#jackknife = ReplicateWeight(method=RepMethod.jackknife, strat=True)
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    stratum=full_sample["region"], 
    rep_coefs=True
)

jkn_wgt.drop_duplicates().sort_values(by="_stratum").head(15)
_stratum _psu _samp_weight _jk_wgt_1 _jk_wgt_2 _jk_wgt_3 _jk_wgt_4 _jk_wgt_5 _jk_wgt_6 _jk_wgt_7 _jk_wgt_8 _jk_wgt_9 _jk_wgt_10
0 East 52 85.520635 0.0 1.5 1.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 East 45 85.398025 1.5 0.0 1.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0
20 East 34 75.661566 1.5 1.5 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
45 North 7 46.166667 1.0 1.0 1.0 0.0 2.0 1.0 1.0 1.0 1.0 1.0
59 North 10 50.783333 1.0 1.0 1.0 2.0 0.0 1.0 1.0 1.0 1.0 1.0
75 South 29 65.702778 1.0 1.0 1.0 1.0 1.0 0.0 1.5 1.5 1.0 1.0
77 South 24 58.940741 1.0 1.0 1.0 1.0 1.0 1.5 0.0 1.5 1.0 1.0
82 South 16 62.149123 1.0 1.0 1.0 1.0 1.0 1.5 1.5 0.0 1.0 1.0
120 West 86 213.491667 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 2.0
133 West 64 218.893889 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 0.0
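The coefficients above follow the stratified delete-one rule: within a stratum containing n_h PSUs, the deleted PSU gets coefficient 0 and each retained PSU in that stratum gets n_h/(n_h − 1). The helper jk_coef below is a hypothetical illustration of that rule, not a samplics function:

```python
def jk_coef(nb_psus_in_stratum: int) -> float:
    """Stratified delete-one jackknife coefficient for the retained PSUs."""
    return nb_psus_in_stratum / (nb_psus_in_stratum - 1)

# East and South have 3 PSUs each, hence the 1.5 coefficients in the table;
# North and West have 2 PSUs each, hence 2.0.
jk_coef(3)  # 1.5
jk_coef(2)  # 2.0
```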