import numpy as np
import pandas as pd
from samplics import SelectMethod
from samplics.sampling import SampleSelection
# Load the selected PSUs
psu_frame = pd.read_csv("./psu_frame.csv")
Selection of SSUs
To select the second stage sample, we need the second stage frame, which is the list of all the households in the 10 selected clusters (PSUs). DHS, PHIA, MICS and other large-scale surveys visit the selected clusters and construct the list of all households in them.
Before starting the second stage selection, let us import the output of the first stage sampling, that is, the PSU frame (psu_frame) that flags the first stage sample (psu_sample). For clarity, we explicitly import the packages and modules needed for this notebook.
In this tutorial, we will simulate the second stage frame. For the simulation, assume that the PSU frame was obtained from a previous census conducted several years before. We also assume that the number of households now follows a normal distribution with a mean 5% higher than the census value and a standard deviation of 0.15 times the number of households from the census. Under these assumptions, we generate the following second stage frame of households. Note that the frame is created only for the selected PSUs.
# Create a synthetic second stage frame
census_size = psu_frame.loc[
    psu_frame["psu_sample"] == 1, "number_households_census"
].values
stratum_names = psu_frame.loc[
    psu_frame["psu_sample"] == 1, "region"
].values
cluster = psu_frame.loc[psu_frame["psu_sample"] == 1, "cluster"].values

np.random.seed(15)

listing_size = np.zeros(census_size.size)
for k in range(census_size.size):
    listing_size[k] = np.random.normal(
        1.05 * census_size[k], 0.15 * census_size[k]
    )

listing_size = listing_size.astype(int)
hh_id = rr_id = cl_id = []
for k, s in enumerate(listing_size):
    hh_k1 = np.char.array(np.repeat(stratum_names[k], s)).astype(str)
    hh_k2 = np.char.array(np.arange(1, s + 1)).astype(str)
    cl_k = np.repeat(cluster[k], s)
    hh_k = np.char.add(np.char.array(cl_k).astype(str), hh_k2)
    hh_id = np.append(hh_id, hh_k)
    rr_id = np.append(rr_id, hh_k1)
    cl_id = np.append(cl_id, cl_k)

ssu_frame = pd.DataFrame(cl_id.astype(int))
ssu_frame.rename(columns={0: "cluster"}, inplace=True)
ssu_frame["region"] = rr_id
ssu_frame["household"] = hh_id

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the SSU frame\n")
ssu_frame.head(nb_obs)
First 15 observations of the SSU frame
| | cluster | region | household |
---|---|---|---|
0 | 7 | North | 71 |
1 | 7 | North | 72 |
2 | 7 | North | 73 |
3 | 7 | North | 74 |
4 | 7 | North | 75 |
5 | 7 | North | 76 |
6 | 7 | North | 77 |
7 | 7 | North | 78 |
8 | 7 | North | 79 |
9 | 7 | North | 710 |
10 | 7 | North | 711 |
11 | 7 | North | 712 |
12 | 7 | North | 713 |
13 | 7 | North | 714 |
14 | 7 | North | 715 |
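The same frame can also be built without an explicit loop. The block below is a minimal sketch under stated assumptions, not the tutorial's own code: the three-row `psu` table is a hypothetical stand-in for the selected PSUs, `np.random.default_rng` is used instead of `np.random.seed` (so the draws will differ from the ones shown above), and the clip to at least one household is an added safeguard.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the selected PSUs; in the tutorial these rows
# come from psu_frame[psu_frame["psu_sample"] == 1].
psu = pd.DataFrame(
    {
        "cluster": [7, 10, 16],
        "region": ["North", "North", "South"],
        "number_households_census": [130, 600, 190],
    }
)

rng = np.random.default_rng(15)  # not the same stream as np.random.seed(15)
census = psu["number_households_census"].to_numpy()

# One draw per cluster: mean 5% above the census count, scale 0.15 * census.
listing_size = rng.normal(1.05 * census, 0.15 * census).astype(int)
listing_size = np.clip(listing_size, 1, None)  # assumption: at least one household

# Repeat each PSU row once per listed household, then number the households
# within each cluster and build the same concatenated identifier.
frame = psu.loc[psu.index.repeat(listing_size), ["cluster", "region"]]
frame = frame.reset_index(drop=True)
hh_number = frame.groupby("cluster").cumcount() + 1
frame["household"] = frame["cluster"].astype(str) + hh_number.astype(str)

print(frame.head())
```

Repeating the index and numbering with `cumcount` keeps the cluster, region, and household columns aligned by construction, so there is no need to append to three separate arrays.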
psu_sample = psu_frame.loc[psu_frame["psu_sample"] == 1]
ssu_counts = ssu_frame.groupby("cluster").count()
ssu_counts.drop(columns="region", inplace=True)
ssu_counts.reset_index(inplace=True)
ssu_counts.rename(
    columns={"household": "number_households_listed"},
    inplace=True
)

pd.merge(
    psu_sample[["cluster", "region", "number_households_census"]],
    ssu_counts[["cluster", "number_households_listed"]],
    on=["cluster"],
)
| | cluster | region | number_households_census | number_households_listed |
---|---|---|---|---|
0 | 7 | North | 130 | 130 |
1 | 10 | North | 600 | 660 |
2 | 16 | South | 190 | 195 |
3 | 24 | South | 75 | 73 |
4 | 29 | South | 200 | 217 |
5 | 34 | East | 305 | 239 |
6 | 45 | East | 450 | 398 |
7 | 52 | East | 700 | 620 |
8 | 64 | West | 300 | 301 |
9 | 86 | West | 280 | 274 |
According to the simulated second stage frame, cluster 7 has the same number of households as in the census. In clusters 10, 16, 29, and 64, we listed more households than in the census. Finally, in the remaining clusters (24, 34, 45, 52, and 86), we found fewer households than in the census.
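The comparison between census and listing counts can be checked programmatically. The sketch below retypes the values from the table above into a small DataFrame (the name `counts` and the `change` column are illustrative, not part of the tutorial) and labels each cluster by the sign of the difference.

```python
import numpy as np
import pandas as pd

# Values copied from the merged table above.
counts = pd.DataFrame(
    {
        "cluster": [7, 10, 16, 24, 29, 34, 45, 52, 64, 86],
        "number_households_census": [130, 600, 190, 75, 200, 305, 450, 700, 300, 280],
        "number_households_listed": [130, 660, 195, 73, 217, 239, 398, 620, 301, 274],
    }
)

# Sign of (listed - census): -1 fewer, 0 same, +1 more.
diff = counts["number_households_listed"] - counts["number_households_census"]
counts["change"] = np.sign(diff).map({-1: "fewer", 0: "same", 1: "more"})

# Clusters grouped by the direction of the change.
by_change = counts.groupby("change")["cluster"].apply(list)
print(by_change)
```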
Now that we have a second stage frame, let’s use samplics to calculate the probabilities of selection and to select a sample. The second stage sample size is 150 households and the strategy is to select 15 households per cluster.
SSU (household) Probability of Selection
The second stage probabilities of selection are conditional on the first stage realization. For this stage, simple random selection (srs) and systematic selection (sys) are common methods used to select households. For this example, we use srs to select 15 households from each cluster. Conditional on the first stage, the second stage selection is a stratified srs where the clusters are the strata. More generally, we have \[ p_{hij} = \frac{m_{hi}}{M_{hi}^{'}} \] where \(p_{hij}\) is the conditional probability of selection for unit \(j\) from stratum \(h\) and cluster \(i\), and \(m_{hi}\) and \(M_{hi}^{'}\) are the sample size and the number of secondary sampling units listed for stratum \(h\) and cluster \(i\), respectively.
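As a quick check of the formula, the conditional probabilities can be computed by hand from the listed household counts. This is a minimal sketch using plain pandas rather than samplics; the Series `listed` retypes the `number_households_listed` column shown earlier, and `m` is the fixed take of 15 households per cluster.

```python
import pandas as pd

# Listed households per cluster (M'_hi), copied from the table above.
listed = pd.Series(
    {7: 130, 10: 660, 16: 195, 24: 73, 29: 217,
     34: 239, 45: 398, 52: 620, 64: 301, 86: 274},
    name="number_households_listed",
)
m = 15  # households selected per cluster (m_hi)

# p_hij = m_hi / M'_hi, identical for every household in a given cluster.
p = m / listed
print(p.round(6))

# Summing p over all listed households recovers the total sample size.
expected_total = (p * listed).sum()
```

For cluster 7 this gives 15/130, and the expected total is 150, the second stage sample size.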
In this scenario, the sample size is the same in each stratum. Hence, the parameter sample_size does not need to be a Python dictionary; we only provide 15 in the function call.
stage2_design = SampleSelection(
    method=SelectMethod.srs_wor, strat=True, wr=False
)

ssu_frame["ssu_prob"] = stage2_design.inclusion_probs(
    ssu_frame["household"], 15, ssu_frame["cluster"]
)

ssu_frame.sample(20)
| | cluster | region | household | ssu_prob |
---|---|---|---|---|
1438 | 34 | East | 34164 | 0.004828 |
2517 | 52 | East | 52606 | 0.004828 |
2943 | 86 | West | 86111 | 0.004828 |
3002 | 86 | West | 86170 | 0.004828 |
559 | 10 | North | 10430 | 0.004828 |
1216 | 29 | South | 29159 | 0.004828 |
2751 | 64 | West | 64220 | 0.004828 |
412 | 10 | North | 10283 | 0.004828 |
2549 | 64 | West | 6418 | 0.004828 |
1508 | 34 | East | 34234 | 0.004828 |
1517 | 45 | East | 454 | 0.004828 |
2658 | 64 | West | 64127 | 0.004828 |
1072 | 29 | South | 2915 | 0.004828 |
1074 | 29 | South | 2917 | 0.004828 |
1669 | 45 | East | 45156 | 0.004828 |
1286 | 34 | East | 3412 | 0.004828 |
1162 | 29 | South | 29105 | 0.004828 |
1051 | 24 | South | 2467 | 0.004828 |
3085 | 86 | West | 86253 | 0.004828 |
2148 | 52 | East | 52237 | 0.004828 |
SSU (household) Selection
The second stage sample is selected from the SSU frame (ssu_frame) using the variable cluster as the stratification variable. The sample is selected without replacement according to the specification of the second stage design. Hence, both ssu_sample and ssu_hits sum to 150 and each selected household is hit only once (i.e. ssu_hits = 1).
np.random.seed(11)
ssu_sample, ssu_hits, ssu_probs = stage2_design.select(
    ssu_frame["household"], 15, ssu_frame["cluster"]
)

ssu_frame["ssu_sample"] = ssu_sample
ssu_frame["ssu_hits"] = ssu_hits
ssu_frame["ssu_probs"] = ssu_probs

ssu_frame[ssu_frame["ssu_sample"] == 1].sample(15)
| | cluster | region | household | ssu_prob | ssu_sample | ssu_hits | ssu_probs |
---|---|---|---|---|---|---|---|
2319 | 52 | East | 52408 | 0.004828 | True | 1 | 0.024194 |
2931 | 86 | West | 8699 | 0.004828 | True | 1 | 0.054745 |
2642 | 64 | West | 64111 | 0.004828 | True | 1 | 0.049834 |
122 | 7 | North | 7123 | 0.004828 | True | 1 | 0.115385 |
60 | 7 | North | 761 | 0.004828 | True | 1 | 0.115385 |
945 | 16 | South | 16156 | 0.004828 | True | 1 | 0.076923 |
338 | 10 | North | 10209 | 0.004828 | True | 1 | 0.022727 |
2218 | 52 | East | 52307 | 0.004828 | True | 1 | 0.024194 |
2870 | 86 | West | 8638 | 0.004828 | True | 1 | 0.054745 |
1764 | 45 | East | 45251 | 0.004828 | True | 1 | 0.037688 |
630 | 10 | North | 10501 | 0.004828 | True | 1 | 0.022727 |
1441 | 34 | East | 34167 | 0.004828 | True | 1 | 0.062762 |
986 | 24 | South | 242 | 0.004828 | True | 1 | 0.205479 |
1796 | 45 | East | 45283 | 0.004828 | True | 1 | 0.037688 |
1264 | 29 | South | 29207 | 0.004828 | True | 1 | 0.069124 |
Let’s check that both ssu_sample and ssu_hits sum to 150 and that each selected household was hit only once (i.e. ssu_hits = 1).
print(f"The sum of `ssu_sample` is equal to: {ssu_frame['ssu_sample'].sum()}\n")
The sum of `ssu_sample` is equal to: 150
print(f"The sum of `ssu_hits` is equal to: {ssu_frame['ssu_hits'].sum()}\n")
The sum of `ssu_hits` is equal to: 150
print(f"The values of `ssu_hits` are: {np.unique(ssu_frame['ssu_hits']).tolist()}\n")
The values of `ssu_hits` are: [0, 1]
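To see what a stratified srs without replacement amounts to, the same kind of draw can be mimicked with pandas alone. The sketch below is only a stand-in for illustration and makes no claim about how samplics implements the selection; the small frame covers three clusters with the listing sizes from the table above, and `hh_number` and `selected` are hypothetical column names.

```python
import pandas as pd

# Listing sizes for three of the clusters, taken from the table above.
sizes = {7: 130, 10: 660, 16: 195}
frame = pd.DataFrame(
    [(c, h) for c, n in sizes.items() for h in range(1, n + 1)],
    columns=["cluster", "hh_number"],
)

# Draw 15 rows per cluster without replacement (the default for sample()).
sample = frame.groupby("cluster").sample(n=15, random_state=11)

# Flag the selected households on the frame, as ssu_sample does above.
frame["selected"] = frame.index.isin(sample.index)
print(frame.groupby("cluster")["selected"].sum())
```

Because the draw is without replacement, no row can appear twice in `sample`, which is the pandas analogue of every hit count being 0 or 1.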
To use systematic selection, we just need to replace method=SelectMethod.srs_wor by method=SelectMethod.sys.
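For readers unfamiliar with systematic selection, here is a minimal sketch of one common variant: a fractional sampling interval with a random start. It is written in plain NumPy, the function name `systematic_sample` is hypothetical, and it is not meant to reproduce the exact rule used by samplics.

```python
import numpy as np


def systematic_sample(n_units, n_sample, rng):
    """Return sorted 0-based positions chosen by fractional-interval systematic sampling."""
    step = n_units / n_sample             # sampling interval, may be fractional
    start = rng.uniform(0, step)          # random start in [0, step)
    positions = start + step * np.arange(n_sample)
    # Truncate to integer positions; clip guards against floating-point edge cases.
    return np.minimum(np.floor(positions).astype(int), n_units - 1)


rng = np.random.default_rng(22)
picked = systematic_sample(n_units=130, n_sample=15, rng=rng)
print(picked)
```

With 130 listed households and a take of 15, the interval is about 8.67, so consecutive selected positions are 8 or 9 apart.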
Another common approach is to use a rate for selecting the sample. Instead of selecting 15 households from 130 in the first cluster, we may want to select with a rate of 15/130, and similarly for the other clusters.
rates = np.repeat(15, 10) / ssu_counts["number_households_listed"].values
ssu_rates = dict(zip(np.unique(ssu_frame["cluster"]), rates))
ssu_rates
{7: 0.11538461538461539,
10: 0.022727272727272728,
16: 0.07692307692307693,
24: 0.2054794520547945,
29: 0.06912442396313365,
34: 0.06276150627615062,
45: 0.03768844221105527,
52: 0.024193548387096774,
64: 0.04983388704318937,
86: 0.05474452554744526}
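The rates can also be keyed by cluster directly, which avoids relying on two arrays being in the same order. The block below is a sketch with `ssu_counts` rebuilt from the table values so that it runs on its own; in the notebook the existing `ssu_counts` DataFrame would be used instead.

```python
import pandas as pd

# Rebuilt here from the table above so the example is self-contained.
ssu_counts = pd.DataFrame(
    {
        "cluster": [7, 10, 16, 24, 29, 34, 45, 52, 64, 86],
        "number_households_listed": [130, 660, 195, 73, 217, 239, 398, 620, 301, 274],
    }
)

# Index by cluster, then divide the fixed take of 15 by each listing size.
listed = ssu_counts.set_index("cluster")["number_households_listed"]
ssu_rates = (15 / listed).to_dict()

# Each rate times its listing size gives back 15 households per cluster.
expected_sizes = {c: rate * listed[c] for c, rate in ssu_rates.items()}
print(ssu_rates)
```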
np.random.seed(22)

stage2_design2 = SampleSelection(
    method=SelectMethod.sys, strat=True, wr=False
)

ssu_sample_r, ssu_hits_r, _ = stage2_design2.select(
    ssu_frame["household"],
    stratum=ssu_frame["cluster"],
    samp_rate=ssu_rates
)

ssu_sample2 = pd.DataFrame(
    data={
        "household": ssu_frame["household"],
        "ssu_sample_r": ssu_sample_r,
        "ssu_hits_r": ssu_hits_r,
    }
)

ssu_sample2.head(25)
| | household | ssu_sample_r | ssu_hits_r |
---|---|---|---|
0 | 71 | 0 | 0 |
1 | 72 | 0 | 0 |
2 | 73 | 0 | 0 |
3 | 74 | 0 | 0 |
4 | 75 | 0 | 0 |
5 | 76 | 1 | 1 |
6 | 77 | 0 | 0 |
7 | 78 | 0 | 0 |
8 | 79 | 0 | 0 |
9 | 710 | 0 | 0 |
10 | 711 | 0 | 0 |
11 | 712 | 0 | 0 |
12 | 713 | 0 | 0 |
13 | 714 | 1 | 1 |
14 | 715 | 0 | 0 |
15 | 716 | 0 | 0 |
16 | 717 | 0 | 0 |
17 | 718 | 0 | 0 |
18 | 719 | 0 | 0 |
19 | 720 | 0 | 0 |
20 | 721 | 0 | 0 |
21 | 722 | 1 | 1 |
22 | 723 | 0 | 0 |
23 | 724 | 0 | 0 |
24 | 725 | 0 | 0 |
Let’s store the first and second stage samples.
# First stage sample
psu_sample[["cluster", "region", "psu_prob"]].to_csv("psu_sample.csv")

# Second stage sample
ssu_sample = ssu_frame.loc[ssu_frame["ssu_sample"] == 1]
ssu_sample[["cluster", "household", "ssu_prob"]].to_csv("ssu_sample.csv")