Replicate Weights

Replicate weights are usually created for the purpose of variance (uncertainty) estimation. One common use case for replication-based methods is the estimation of non-linear parameters fow which Taylor-based approximation may not be accurate enough. Another use case is when the number of primary sampling units selected per stratum is small (low degree of freedom). Replicate weights are usually created for the purpose of variance (uncertainty)estimation. One common use case for replication-based methods is the estimation of non-linear parameters fow which Taylor-based approximation may not be accurate enough. Another use case is when the number of primary sampling units selected per stratum is small (low degree of freedom).

In this tutorial, we will explore creating replicate weights using the class ReplicateWeight. Three replication methods have been implemented: balanced repeated replication (BRR) including the Fay-BRR, bootstrap and jackknife. The replicate method of interest is specified when initializing the class by using the parameter method. The parameter method takes the values “bootstrap”, “brr”, or “jackknife”. In this tutorial, we show how the API works for producing replicate weights.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import ReplicateWeight

from samplics.utils.types import RepMethod

We import the sample data…

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load PSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster")

full_sample["inclusion_prob"] = \
    full_sample["psu_prob"] * full_sample["ssu_prob"] 
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"] 

full_sample.head(15)

	cluster	region	psu_prob	household	ssu_prob	inclusion_prob	design_weight
0	7	North	0.187726	72	0.115385	0.021661	46.166667
1	7	North	0.187726	73	0.115385	0.021661	46.166667
2	7	North	0.187726	75	0.115385	0.021661	46.166667
3	7	North	0.187726	715	0.115385	0.021661	46.166667
4	7	North	0.187726	722	0.115385	0.021661	46.166667
5	7	North	0.187726	724	0.115385	0.021661	46.166667
6	7	North	0.187726	755	0.115385	0.021661	46.166667
7	7	North	0.187726	761	0.115385	0.021661	46.166667
8	7	North	0.187726	764	0.115385	0.021661	46.166667
9	7	North	0.187726	782	0.115385	0.021661	46.166667
10	7	North	0.187726	795	0.115385	0.021661	46.166667
11	7	North	0.187726	7111	0.115385	0.021661	46.166667
12	7	North	0.187726	7112	0.115385	0.021661	46.166667
13	7	North	0.187726	7117	0.115385	0.021661	46.166667
14	7	North	0.187726	7123	0.115385	0.021661	46.166667

Balanced Repeated Replication (BRR)

The basic idea of BRR is to slip the sample in independent random groups. The groups are then threated as independent replicates of the the sample design. A special case is when the sample is split into two half samples in each stratum. This design is suitable to many survey designs where only two psus are selected by stratum. In practice, one of the psu is asigned to the first random group and the other psu is assign to the second group. The sample weights are double for one group (say the first one) and the sample weights in the other group are set to zero. To ensure that the replicates are independent, we use hadamard matrices to assign the random groups.

import scipy
scipy.linalg.hadamard(8)

array([[ 1,  1,  1,  1,  1,  1,  1,  1],
       [ 1, -1,  1, -1,  1, -1,  1, -1],
       [ 1,  1, -1, -1,  1,  1, -1, -1],
       [ 1, -1, -1,  1,  1, -1, -1,  1],
       [ 1,  1,  1,  1, -1, -1, -1, -1],
       [ 1, -1,  1, -1, -1,  1, -1,  1],
       [ 1,  1, -1, -1, -1, -1,  1,  1],
       [ 1, -1, -1,  1, -1,  1,  1, -1]])

In our example, we have 10 psus. If we do not have explicit stratification then replicate() will group the clusters into 5 strata (2 per stratum). In this case, the smallest number of replicates possible using the hadamard matrix is 8.

The result below shows that replicate() created 5 strata by grouping clusters 7 and 10 in the first stratum, clusters 16 and 24 in the second stratum, and so on. We can achieve the same result by providing setting strat=True and providing the stratum variable to replicate().

brr = ReplicateWeight(method=RepMethod.brr, strat=False)
brr_wgt = brr.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

brr_wgt.drop_duplicates().head(10)

	_stratum	_psu	_samp_weight	_brr_wgt_1	_brr_wgt_2	_brr_wgt_3	_brr_wgt_4	_brr_wgt_5	_brr_wgt_6	_brr_wgt_7	_brr_wgt_8
0	1	7	46.166667	0.000000	92.333333	0.000000	92.333333	0.000000	92.333333	0.000000	92.333333
15	1	10	50.783333	101.566667	0.000000	101.566667	0.000000	101.566667	0.000000	101.566667	0.000000
30	2	16	62.149123	0.000000	0.000000	124.298246	124.298246	0.000000	0.000000	124.298246	124.298246
45	2	24	58.940741	117.881481	117.881481	0.000000	0.000000	117.881481	117.881481	0.000000	0.000000
60	3	29	65.702778	0.000000	131.405556	131.405556	0.000000	0.000000	131.405556	131.405556	0.000000
75	3	34	75.661566	151.323133	0.000000	0.000000	151.323133	151.323133	0.000000	0.000000	151.323133
90	4	45	85.398025	0.000000	0.000000	0.000000	0.000000	170.796049	170.796049	170.796049	170.796049
105	4	52	85.520635	171.041270	171.041270	171.041270	171.041270	0.000000	0.000000	0.000000	0.000000
120	5	64	218.893889	0.000000	437.787778	0.000000	437.787778	437.787778	0.000000	437.787778	0.000000
135	5	86	213.491667	426.983333	0.000000	426.983333	0.000000	0.000000	426.983333	0.000000	426.983333

An extension of BRR is the Fay’s method. In the Fay’s approach, instead of multiplying one half-sample by zero, we multiple the sampel weights by a factor \(\alpha\) and the other halh-sample by \(2-\alpha\). We refer to \(\alpha\) as the fay coefficient. Note that when \(\alpha=0\) then teh Fay’s method reduces to BRR.

fay = ReplicateWeight(method=RepMethod.brr, strat=False, fay_coef=0.3)
fay_wgt = fay.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    rep_prefix="fay_weight_",
    psu_varname="cluster", 
    str_varname="stratum"
)

fay_wgt.drop_duplicates().head(10)

	stratum	cluster	_samp_weight	fay_weight_1	fay_weight_2	fay_weight_3	fay_weight_4	fay_weight_5	fay_weight_6	fay_weight_7	fay_weight_8
0	1	7	46.166667	13.850000	78.483333	13.850000	78.483333	13.850000	78.483333	13.850000	78.483333
15	1	10	50.783333	86.331667	15.235000	86.331667	15.235000	86.331667	15.235000	86.331667	15.235000
30	2	16	62.149123	18.644737	18.644737	105.653509	105.653509	18.644737	18.644737	105.653509	105.653509
45	2	24	58.940741	100.199259	100.199259	17.682222	17.682222	100.199259	100.199259	17.682222	17.682222
60	3	29	65.702778	19.710833	111.694722	111.694722	19.710833	19.710833	111.694722	111.694722	19.710833
75	3	34	75.661566	128.624663	22.698470	22.698470	128.624663	128.624663	22.698470	22.698470	128.624663
90	4	45	85.398025	25.619407	25.619407	25.619407	25.619407	145.176642	145.176642	145.176642	145.176642
105	4	52	85.520635	145.385079	145.385079	145.385079	145.385079	25.656190	25.656190	25.656190	25.656190
120	5	64	218.893889	65.668167	372.119611	65.668167	372.119611	372.119611	65.668167	372.119611	65.668167
135	5	86	213.491667	362.935833	64.047500	362.935833	64.047500	64.047500	362.935833	64.047500	362.935833

Important

For any of the three methods, we can request the replicate coefficient instead of the replicate weights by using rep_coefs=True.

#fay = ReplicateWeight(method="brr", strat=False, fay_coef=0.3)
fay_wgt = fay.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    rep_prefix="fay_weight_",
    psu_varname="cluster", 
    str_varname="stratum",
    rep_coefs=True
)

fay_wgt.drop_duplicates().head(10)

	stratum	cluster	_samp_weight	fay_weight_1	fay_weight_2	fay_weight_3	fay_weight_4	fay_weight_5	fay_weight_6	fay_weight_7	fay_weight_8
0	1	7	46.166667	0.3	1.7	0.3	1.7	0.3	1.7	0.3	1.7
15	1	10	50.783333	1.7	0.3	1.7	0.3	1.7	0.3	1.7	0.3
30	2	16	62.149123	0.3	0.3	1.7	1.7	0.3	0.3	1.7	1.7
45	2	24	58.940741	1.7	1.7	0.3	0.3	1.7	1.7	0.3	0.3
60	3	29	65.702778	0.3	1.7	1.7	0.3	0.3	1.7	1.7	0.3
75	3	34	75.661566	1.7	0.3	0.3	1.7	1.7	0.3	0.3	1.7
90	4	45	85.398025	0.3	0.3	0.3	0.3	1.7	1.7	1.7	1.7
105	4	52	85.520635	1.7	1.7	1.7	1.7	0.3	0.3	0.3	0.3
120	5	64	218.893889	0.3	1.7	0.3	1.7	1.7	0.3	1.7	0.3
135	5	86	213.491667	1.7	0.3	1.7	0.3	0.3	1.7	0.3	1.7

Bootstrap

For the bootstrap replicates, we need to provide the number of replicates. When the number of replicates is not provided, ReplicateWeight will default to 500. The bootstrap consists of selecting the same number of psus as in the sample but with replacement. The selection is independently repeated for each replicate.

bootstrap = ReplicateWeight(
    method=RepMethod.bootstrap, 
    strat=False, 
    nb_reps=50
    )
boot_wgt = bootstrap.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

boot_wgt.drop_duplicates().head(10)

	_psu	_samp_weight	_boot_wgt_1	_boot_wgt_2	_boot_wgt_3	_boot_wgt_4	_boot_wgt_5	_boot_wgt_6	_boot_wgt_7	_boot_wgt_8	...	_boot_wgt_41	_boot_wgt_42	_boot_wgt_43	_boot_wgt_44	_boot_wgt_45	_boot_wgt_46	_boot_wgt_47	_boot_wgt_48	_boot_wgt_49	_boot_wgt_50
0	7	46.166667	102.592593	0.000000	51.296296	0.000000	0.000000	102.592593	51.296296	0.000000	...	51.296296	51.296296	0.000000	0.000000	0.000000	102.592593	51.296296	51.296296	102.592593	102.592593
15	10	50.783333	0.000000	0.000000	56.425926	56.425926	112.851852	0.000000	169.277778	56.425926	...	56.425926	56.425926	0.000000	0.000000	56.425926	56.425926	56.425926	56.425926	0.000000	0.000000
30	16	62.149123	69.054581	138.109162	138.109162	69.054581	0.000000	0.000000	0.000000	69.054581	...	69.054581	69.054581	0.000000	0.000000	69.054581	0.000000	0.000000	138.109162	138.109162	69.054581
45	24	58.940741	261.958848	65.489712	65.489712	130.979424	0.000000	65.489712	0.000000	65.489712	...	196.469136	130.979424	0.000000	65.489712	0.000000	0.000000	65.489712	130.979424	130.979424	0.000000
60	29	65.702778	0.000000	0.000000	73.003086	73.003086	73.003086	73.003086	73.003086	73.003086	...	0.000000	0.000000	73.003086	73.003086	146.006173	0.000000	219.009259	146.006173	0.000000	73.003086
75	34	75.661566	84.068407	84.068407	84.068407	84.068407	84.068407	84.068407	0.000000	0.000000	...	84.068407	84.068407	84.068407	84.068407	84.068407	84.068407	84.068407	0.000000	0.000000	252.205222
90	45	85.398025	0.000000	0.000000	94.886694	94.886694	0.000000	189.773388	284.660082	0.000000	...	0.000000	0.000000	94.886694	379.546776	0.000000	189.773388	189.773388	0.000000	94.886694	94.886694
105	52	85.520635	95.022928	190.045855	0.000000	95.022928	95.022928	0.000000	0.000000	95.022928	...	0.000000	95.022928	0.000000	95.022928	190.045855	0.000000	0.000000	0.000000	0.000000	95.022928
120	64	218.893889	0.000000	243.215432	243.215432	243.215432	486.430864	0.000000	243.215432	243.215432	...	0.000000	243.215432	729.646296	243.215432	243.215432	486.430864	0.000000	243.215432	0.000000	0.000000
135	86	213.491667	0.000000	474.425926	0.000000	0.000000	474.425926	474.425926	0.000000	711.638889	...	474.425926	237.212963	711.638889	0.000000	237.212963	237.212963	0.000000	0.000000	474.425926	0.000000

10 rows × 52 columns

Jackknife

Below, we illustrate the API for creating replicate weights using the jackknife method.

jackknife = ReplicateWeight(
    method=RepMethod.jackknife, 
    strat=False
    )
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"]
    )

jkn_wgt.drop_duplicates().head(10)

	_psu	_samp_weight	_jk_wgt_1	_jk_wgt_2	_jk_wgt_3	_jk_wgt_4	_jk_wgt_5	_jk_wgt_6	_jk_wgt_7	_jk_wgt_8	_jk_wgt_9	_jk_wgt_10
0	7	46.166667	0.000000	51.296296	51.296296	51.296296	51.296296	51.296296	51.296296	51.296296	51.296296	51.296296
15	10	50.783333	56.425926	0.000000	56.425926	56.425926	56.425926	56.425926	56.425926	56.425926	56.425926	56.425926
30	16	62.149123	69.054581	69.054581	0.000000	69.054581	69.054581	69.054581	69.054581	69.054581	69.054581	69.054581
45	24	58.940741	65.489712	65.489712	65.489712	0.000000	65.489712	65.489712	65.489712	65.489712	65.489712	65.489712
60	29	65.702778	73.003086	73.003086	73.003086	73.003086	0.000000	73.003086	73.003086	73.003086	73.003086	73.003086
75	34	75.661566	84.068407	84.068407	84.068407	84.068407	84.068407	0.000000	84.068407	84.068407	84.068407	84.068407
90	45	85.398025	94.886694	94.886694	94.886694	94.886694	94.886694	94.886694	0.000000	94.886694	94.886694	94.886694
105	52	85.520635	95.022928	95.022928	95.022928	95.022928	95.022928	95.022928	95.022928	0.000000	95.022928	95.022928
120	64	218.893889	243.215432	243.215432	243.215432	243.215432	243.215432	243.215432	243.215432	243.215432	0.000000	243.215432
135	86	213.491667	237.212963	237.212963	237.212963	237.212963	237.212963	237.212963	237.212963	237.212963	237.212963	0.000000

With stratification…

jackknife = ReplicateWeight(method=RepMethod.jackknife, strat=True)
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    stratum=full_sample["region"]
    )

jkn_wgt.drop_duplicates().head(10)

	_stratum	_psu	_samp_weight	_jk_wgt_1	_jk_wgt_2	_jk_wgt_3	_jk_wgt_4	_jk_wgt_5	_jk_wgt_6	_jk_wgt_7	_jk_wgt_8	_jk_wgt_9	_jk_wgt_10
0	East	52	85.520635	0.000000	69.250000	69.250000	46.166667	46.166667	46.166667	46.166667	46.166667	46.166667	46.166667
1	East	45	85.398025	69.250000	0.000000	69.250000	46.166667	46.166667	46.166667	46.166667	46.166667	46.166667	46.166667
15	East	52	85.520635	0.000000	76.175000	76.175000	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333
16	East	45	85.398025	76.175000	0.000000	76.175000	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333
20	East	34	75.661566	76.175000	76.175000	0.000000	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333	50.783333
30	East	34	75.661566	93.223684	93.223684	0.000000	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123
33	East	45	85.398025	93.223684	0.000000	93.223684	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123
38	East	52	85.520635	0.000000	93.223684	93.223684	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123	62.149123
45	North	7	46.166667	58.940741	58.940741	58.940741	0.000000	117.881481	58.940741	58.940741	58.940741	58.940741	58.940741
59	North	10	50.783333	58.940741	58.940741	58.940741	117.881481	0.000000	58.940741	58.940741	58.940741	58.940741	58.940741

Important

For any of the three methods, we can request the replicate coefficient instead of the replicate weights by using rep_coefs=True.

#jackknife = ReplicateWeight(method=RepMethod.jackknife, strat=True)
jkn_wgt = jackknife.replicate(
    samp_weight=full_sample["design_weight"], 
    psu=full_sample["cluster"], 
    stratum=full_sample["region"], 
    rep_coefs=True
)

jkn_wgt.drop_duplicates().sort_values(by="_stratum").head(15)

	_stratum	_psu	_samp_weight	_jk_wgt_1	_jk_wgt_2	_jk_wgt_3	_jk_wgt_4	_jk_wgt_5	_jk_wgt_6	_jk_wgt_7	_jk_wgt_8	_jk_wgt_9	_jk_wgt_10
0	East	52	85.520635	0.0	1.5	1.5	1.0	1.0	1.0	1.0	1.0	1.0	1.0
1	East	45	85.398025	1.5	0.0	1.5	1.0	1.0	1.0	1.0	1.0	1.0	1.0
20	East	34	75.661566	1.5	1.5	0.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
45	North	7	46.166667	1.0	1.0	1.0	0.0	2.0	1.0	1.0	1.0	1.0	1.0
59	North	10	50.783333	1.0	1.0	1.0	2.0	0.0	1.0	1.0	1.0	1.0	1.0
75	South	29	65.702778	1.0	1.0	1.0	1.0	1.0	0.0	1.5	1.5	1.0	1.0
77	South	24	58.940741	1.0	1.0	1.0	1.0	1.0	1.5	0.0	1.5	1.0	1.0
82	South	16	62.149123	1.0	1.0	1.0	1.0	1.0	1.5	1.5	0.0	1.0	1.0
120	West	86	213.491667	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.0	2.0
133	West	64	218.893889	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	2.0	0.0