Selection of SSUs

To select the second stage sample, we need the second stage frame, which is the list of all households in the 10 selected clusters (PSUs). Large-scale surveys such as DHS, PHIA, and MICS typically build this frame by visiting the selected clusters and listing all their households.

Before starting the second stage selection, we load the first stage sampling information, that is, the PSU frame with the indicator of the first stage sample (psu_sample). For clarity, we explicitly import the packages and modules needed for this notebook.

import numpy as np 
import pandas as pd 

from samplics import SelectMethod
from samplics.sampling import SampleSelection

# Load the PSU frame; the column psu_sample flags the selected PSUs
psu_frame = pd.read_csv("./psu_frame.csv")

In this tutorial, we simulate the second stage frame. For the simulation, assume that the PSU frame was obtained from a census conducted several years earlier. We also assume that the number of households found at listing follows a normal distribution with a mean 5% higher than the census count and a standard deviation of 0.15 times the census count (the second argument of np.random.normal in the code below is the standard deviation, not the variance). Under these assumptions, we generate the following second stage frame of households. Note that the frame is created only for the selected PSUs.
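
The simulation assumptions can also be sketched in a few vectorized lines. This is an illustrative sketch, not the tutorial's own code: it uses NumPy's `default_rng` generator with an arbitrary seed, so its draws will not match the values shown later, and the census counts are copied from the comparison table further down. Both `np.random.normal` and `Generator.normal` take the standard deviation as their second (`scale`) argument, so the 0.15 factor acts as a standard deviation relative to the census count.

```python
import numpy as np

# Census household counts of the 10 selected clusters
# (copied from the comparison table later in this section)
census_size = np.array([130, 600, 190, 75, 200, 305, 450, 700, 300, 280])

rng = np.random.default_rng(15)  # seed chosen arbitrarily for this sketch

# One draw per cluster: mean 5% above the census count,
# standard deviation 0.15 times the census count
listing_size = rng.normal(loc=1.05 * census_size, scale=0.15 * census_size)
listing_size = listing_size.astype(int)  # truncate to whole households

# Many repeated draws, to compare the ratio listed / census with the assumptions
draws = rng.normal(
    loc=1.05 * census_size, scale=0.15 * census_size, size=(20_000, 10)
)
ratio = draws / census_size
print(f"mean ratio: {ratio.mean():.3f}, std of ratio: {ratio.std():.3f}")
```

The printed mean and standard deviation of the ratio should be close to 1.05 and 0.15, respectively.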

# Create a synthetic second stage frame
census_size = psu_frame.loc[
    psu_frame["psu_sample"] == 1, 
    "number_households_census"
].values
stratum_names = psu_frame.loc[
    psu_frame["psu_sample"] == 1, 
    "region"
    ].values
cluster = psu_frame.loc[psu_frame["psu_sample"] == 1, "cluster"].values

np.random.seed(15)

listing_size = np.zeros(census_size.size)
for k in range(census_size.size):
    listing_size[k] = np.random.normal(
        1.05 * census_size[k], 0.15 * census_size[k]
        )

listing_size = listing_size.astype(int)
# Start from three separate empty lists (chained assignment would alias a single list)
hh_id, rr_id, cl_id = [], [], []
for k, s in enumerate(listing_size):
    hh_k1 = np.char.array(np.repeat(stratum_names[k], s)).astype(str)
    hh_k2 = np.char.array(np.arange(1, s + 1)).astype(str)
    cl_k = np.repeat(cluster[k], s)
    hh_k = np.char.add(np.char.array(cl_k).astype(str), hh_k2)
    hh_id = np.append(hh_id, hh_k)
    rr_id = np.append(rr_id, hh_k1)
    cl_id = np.append(cl_id, cl_k)

ssu_frame = pd.DataFrame(cl_id.astype(int))
ssu_frame.rename(columns={0: "cluster"}, inplace=True)
ssu_frame["region"] = rr_id
ssu_frame["household"] = hh_id

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the SSU frame\n")
ssu_frame.head(nb_obs)

First 15 observations of the SSU frame
cluster region household
0 7 North 71
1 7 North 72
2 7 North 73
3 7 North 74
4 7 North 75
5 7 North 76
6 7 North 77
7 7 North 78
8 7 North 79
9 7 North 710
10 7 North 711
11 7 North 712
12 7 North 713
13 7 North 714
14 7 North 715
psu_sample = psu_frame.loc[psu_frame["psu_sample"] == 1]
ssu_counts = ssu_frame.groupby("cluster").count()
ssu_counts.drop(columns="region", inplace=True)
ssu_counts.reset_index(inplace=True)
ssu_counts.rename(
    columns={"household": "number_households_listed"}, 
    inplace=True
    )

pd.merge(
    psu_sample[["cluster", "region", "number_households_census"]],
    ssu_counts[["cluster", "number_households_listed"]],
    on=["cluster"],
)
cluster region number_households_census number_households_listed
0 7 North 130 130
1 10 North 600 660
2 16 South 190 195
3 24 South 75 73
4 29 South 200 217
5 34 East 305 239
6 45 East 450 398
7 52 East 700 620
8 64 West 300 301
9 86 West 280 274

According to the simulated second stage frame, cluster 7 has the same number of households as in the census. In clusters 10, 16, 29, and 64, we listed more households than the census recorded. In the remaining clusters (24, 34, 45, 52, and 86), we listed fewer households than the census recorded.
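
The comparison can be reproduced directly from the table's values. The short sketch below is only an illustration: the counts are typed in by hand from the table above, and the variable names are not part of the tutorial.

```python
# (cluster, census count, listed count), copied from the table above
counts = [
    (7, 130, 130), (10, 600, 660), (16, 190, 195), (24, 75, 73),
    (29, 200, 217), (34, 305, 239), (45, 450, 398), (52, 700, 620),
    (64, 300, 301), (86, 280, 274),
]

more, same, fewer = [], [], []
for cluster, census, listed in counts:
    change = 100 * (listed - census) / census  # relative change in percent
    if listed > census:
        more.append(cluster)
    elif listed < census:
        fewer.append(cluster)
    else:
        same.append(cluster)
    print(f"cluster {cluster:>2}: {census:>3} -> {listed:>3} ({change:+.1f}%)")

print("more:", more, "| same:", same, "| fewer:", fewer)
```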

Now that we have a second stage frame, let’s use samplics to calculate the probabilities of selection and to select a sample. The second stage sample size is 150 households and the strategy is to select 15 households per cluster.

SSU (household) Probability of Selection

The second stage probabilities of selection are conditional on the first stage realization. For this stage, simple random selection (srs) and systematic selection (sys) are the methods commonly used to select households. For this example, we use srs to select 15 households from each cluster. Conditional on the first stage, the second stage selection is a stratified srs where the clusters are the strata. More generally, we have \[ p_{hij} = \frac{m_{hi}}{M_{hi}^{'}} \] where \(p_{hij}\) is the conditional probability of selection for unit \(j\) from stratum \(h\) and cluster \(i\), and \(m_{hi}\) and \(M_{hi}^{'}\) are the sample size and the number of secondary sampling units listed for stratum \(h\) and cluster \(i\), respectively.
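
To make the formula concrete, the sketch below computes the conditional probability for every household on the frame using plain NumPy. It is a hand-rolled illustration under the tutorial's setup (15 households per cluster, listed counts copied from the comparison table above), not a call into samplics.

```python
import numpy as np

# Households listed per selected cluster (from the comparison table above)
listed = {7: 130, 10: 660, 16: 195, 24: 73, 29: 217,
          34: 239, 45: 398, 52: 620, 64: 301, 86: 274}
m = 15  # households to select per cluster (m_hi)

# Cluster label of every household on the second stage frame
cluster = np.repeat(list(listed.keys()), list(listed.values()))

# M'_hi for each household: the listed size of the cluster it belongs to
labels, inverse, sizes = np.unique(
    cluster, return_inverse=True, return_counts=True
)
p_hij = m / sizes[inverse]

for c, size in zip(labels.tolist(), sizes.tolist()):
    print(f"cluster {c:>2}: p = {m}/{size} = {m / size:.6f}")
```

Within each cluster the probabilities sum to 15, and over the whole frame they sum to the second stage sample size of 150.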

In this scenario, the sample size is the same in each stratum (here, each cluster). Hence, the parameter sample_size does not need to be a Python dictionary; we only provide the value 15 in the function call.

stage2_design = SampleSelection(
    method=SelectMethod.srs_wor, strat=True, wr=False
)

ssu_frame["ssu_prob"] = stage2_design.inclusion_probs(
    ssu_frame["household"], 15, ssu_frame["cluster"]
)

ssu_frame.sample(20)
cluster region household ssu_prob
1438 34 East 34164 0.004828
2517 52 East 52606 0.004828
2943 86 West 86111 0.004828
3002 86 West 86170 0.004828
559 10 North 10430 0.004828
1216 29 South 29159 0.004828
2751 64 West 64220 0.004828
412 10 North 10283 0.004828
2549 64 West 6418 0.004828
1508 34 East 34234 0.004828
1517 45 East 454 0.004828
2658 64 West 64127 0.004828
1072 29 South 2915 0.004828
1074 29 South 2917 0.004828
1669 45 East 45156 0.004828
1286 34 East 3412 0.004828
1162 29 South 29105 0.004828
1051 24 South 2467 0.004828
3085 86 West 86253 0.004828
2148 52 East 52237 0.004828

SSU (household) Selection

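The second stage sample is selected from the SSU frame (ssu_frame) using the variable cluster as the stratification variable. Following the second stage design, selection is without replacement, so each selected household is hit exactly once (ssu_hits = 1) and both ssu_sample and ssu_hits sum to 150. The call also returns the selection probabilities, stored below as ssu_probs. In the output that follows, ssu_probs equals 15 divided by the number of households listed in the cluster (for example, 15/130 ≈ 0.115385 for cluster 7), as the formula above prescribes, whereas the ssu_prob column computed earlier shows 0.004828 ≈ 15/3107 for every household, that is, 15 divided by the total number of listed households.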

np.random.seed(11)
ssu_sample, ssu_hits, ssu_probs = stage2_design.select(
    ssu_frame["household"], 15, ssu_frame["cluster"]
)

ssu_frame["ssu_sample"] = ssu_sample
ssu_frame["ssu_hits"] = ssu_hits
ssu_frame["ssu_probs"] = ssu_probs

ssu_frame[ssu_frame["ssu_sample"] == 1].sample(15)
cluster region household ssu_prob ssu_sample ssu_hits ssu_probs
2319 52 East 52408 0.004828 True 1 0.024194
2931 86 West 8699 0.004828 True 1 0.054745
2642 64 West 64111 0.004828 True 1 0.049834
122 7 North 7123 0.004828 True 1 0.115385
60 7 North 761 0.004828 True 1 0.115385
945 16 South 16156 0.004828 True 1 0.076923
338 10 North 10209 0.004828 True 1 0.022727
2218 52 East 52307 0.004828 True 1 0.024194
2870 86 West 8638 0.004828 True 1 0.054745
1764 45 East 45251 0.004828 True 1 0.037688
630 10 North 10501 0.004828 True 1 0.022727
1441 34 East 34167 0.004828 True 1 0.062762
986 24 South 242 0.004828 True 1 0.205479
1796 45 East 45283 0.004828 True 1 0.037688
1264 29 South 29207 0.004828 True 1 0.069124

Let’s check that both ssu_sample and ssu_hits sum to 150 and that each selected household was hit only once (i.e. ssu_hits = 1).

print(f"The sum of `ssu_sample` is equal to: {ssu_frame['ssu_sample'].sum()}\n")
The sum of `ssu_sample` is equal to: 150
print(f"The sum of `ssu_hits` is equal to: {ssu_frame['ssu_hits'].sum()}\n")
The sum of `ssu_hits` is equal to: 150
print(f"The values of `ssu_hits` are: {np.unique(ssu_frame['ssu_hits']).tolist()}\n")
The values of `ssu_hits` are: [0, 1]
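
For intuition, stratified simple random selection without replacement can be sketched with NumPy alone. The function `stratified_srs_wor` below is a hypothetical helper written for this illustration; it is not how samplics implements `select`, and its random draws will differ from the sample above.

```python
import numpy as np

def stratified_srs_wor(stratum, n_per_stratum, rng):
    """Flag n_per_stratum units in each stratum, drawn without replacement."""
    stratum = np.asarray(stratum)
    selected = np.zeros(stratum.size, dtype=bool)
    for s in np.unique(stratum):
        positions = np.flatnonzero(stratum == s)  # rows belonging to stratum s
        chosen = rng.choice(positions, size=n_per_stratum, replace=False)
        selected[chosen] = True
    return selected

# Households listed per selected cluster (from the comparison table above)
listed = {7: 130, 10: 660, 16: 195, 24: 73, 29: 217,
          34: 239, 45: 398, 52: 620, 64: 301, 86: 274}
cluster = np.repeat(list(listed.keys()), list(listed.values()))

rng = np.random.default_rng(11)  # arbitrary seed for this sketch
selected = stratified_srs_wor(cluster, 15, rng)
hits = selected.astype(int)  # without replacement, hits are 0 or 1

print(int(selected.sum()), np.unique(hits).tolist())
```

As in the checks above, the flags sum to 150 and the only hit values are 0 and 1.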

To use systematic selection, we just need to replace method=SelectMethod.srs_wor with method=SelectMethod.sys.
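
As a reminder of what systematic selection does, here is a minimal sketch of one common variant: a fixed, possibly fractional, sampling interval with a random start. The helper `systematic_sample` is hypothetical and written only for this illustration; the exact algorithm used by samplics may differ.

```python
import numpy as np

def systematic_sample(n_units, n_sample, rng):
    """Return 0-based positions of a systematic sample of size n_sample."""
    step = n_units / n_sample      # sampling interval, possibly fractional
    start = rng.uniform(0, step)   # random start within the first interval
    points = start + step * np.arange(n_sample)
    return np.floor(points).astype(int)

rng = np.random.default_rng(22)  # arbitrary seed for this sketch

# 15 households out of the 130 listed in cluster 7
positions = systematic_sample(130, 15, rng)
print(positions.tolist())
print(np.diff(positions).tolist())  # gaps between consecutive selections
```

With 130 units and a sample of 15, the interval is 130/15 ≈ 8.67, so consecutive selected positions are 8 or 9 units apart.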

Another common approach is to use a rate for selecting the sample. Instead of selecting 15 households from 130 in the first cluster, we may want to select with a rate of 15/130, and similarly for the other clusters.

rates = np.repeat(15, 10) / ssu_counts["number_households_listed"].values
ssu_rates = dict(zip(np.unique(ssu_frame["cluster"]), rates))
ssu_rates
{7: 0.11538461538461539,
 10: 0.022727272727272728,
 16: 0.07692307692307693,
 24: 0.2054794520547945,
 29: 0.06912442396313365,
 34: 0.06276150627615062,
 45: 0.03768844221105527,
 52: 0.024193548387096774,
 64: 0.04983388704318937,
 86: 0.05474452554744526}
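
As a quick check on these rates, the sketch below recomputes them from the listed counts and shows the expected sample size (rate times cluster size) and the implied spacing between selected households (1/rate). It is an illustration only, with the counts copied from the comparison table above.

```python
# Households listed per selected cluster (from the comparison table above)
listed = {7: 130, 10: 660, 16: 195, 24: 73, 29: 217,
          34: 239, 45: 398, 52: 620, 64: 301, 86: 274}
m = 15  # target number of households per cluster

# Sampling rate per cluster: target size divided by listed size
rates = {c: m / size for c, size in listed.items()}

for c, rate in rates.items():
    expected = rate * listed[c]  # expected sample size under this rate
    print(
        f"cluster {c:>2}: rate = {rate:.4f}, "
        f"1/rate = {1 / rate:6.2f}, expected sample = {expected:.1f}"
    )
```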
np.random.seed(22)

stage2_design2 = SampleSelection(
    method=SelectMethod.sys, strat=True, wr=False
)

ssu_sample_r, ssu_hits_r, _ = stage2_design2.select(
    ssu_frame["household"], 
    stratum=ssu_frame["cluster"], 
    samp_rate=ssu_rates
)

ssu_sample2 = pd.DataFrame(
    data={
        "household": ssu_frame["household"],
        "ssu_sample_r": ssu_sample_r,
        "ssu_hits_r": ssu_hits_r,
    }
)

ssu_sample2.head(25)
household ssu_sample_r ssu_hits_r
0 71 0 0
1 72 0 0
2 73 0 0
3 74 0 0
4 75 0 0
5 76 1 1
6 77 0 0
7 78 0 0
8 79 0 0
9 710 0 0
10 711 0 0
11 712 0 0
12 713 0 0
13 714 1 1
14 715 0 0
15 716 0 0
16 717 0 0
17 718 0 0
18 719 0 0
19 720 0 0
20 721 0 0
21 722 1 1
22 723 0 0
23 724 0 0
24 725 0 0

Let’s store the first and second stage samples.

# First stage sample
psu_sample[["cluster", "region", "psu_prob"]].to_csv("psu_sample.csv")

# Second stage sample
ssu_sample = ssu_frame.loc[ssu_frame["ssu_sample"] == 1]
ssu_sample[["cluster", "household", "ssu_prob"]].to_csv("ssu_sample.csv")