Selection of PSUs

In the sections below, we draw primary sampling units (PSUs) using probability proportional to size (PPS) sampling techniques implemented in the SampleSelection class. The class has two main methods, inclusion_probs() and select(): inclusion_probs() computes the probabilities of selection, and select() draws the random sample.

The following illustrates the use of samplics for sample selection.

This example is not meant to be exhaustive. There are many use cases that are not covered in this tutorial. For example, some PSUs may be segmented due to their size, with segments selected in a subsequent step. Segment selection can be done with samplics in the same way as PSU selection, with PPS or SRS, after the segments have been created by the user.

First, let us import the Python packages necessary to run the tutorial.

import numpy as np 
from samplics.datasets import load_psu_frame

from samplics import SelectMethod
from samplics.sampling import SampleSelection

Sample Dataset

The file sample_frame.csv, shown below, contains synthetic data on 100 clusters classified by region (East, North, South and West). Each cluster represents a group of households. In the file, each cluster has an associated number of households (number_households_census) and a status variable (cluster_status) indicating whether the cluster is in scope or not.

This synthetic data represents a simplified version of the enumeration area (EA) frames found in many countries and used by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA) surveys and the Multiple Indicator Cluster Surveys (MICS).

psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]
psu_frame.head(25)

    cluster  region  number_households_census  cluster_status  comment
 0        1  North                        105               1  NaN
 1        2  North                         85               1  NaN
 2        3  North                         95               1  NaN
 3        4  North                         75               1  NaN
 4        5  North                        120               1  NaN
 5        6  North                         90               1  NaN
 6        7  North                        130               1  NaN
 7        8  North                         55               1  NaN
 8        9  North                         30               1  NaN
 9       10  North                        600               1  due to a large building
10       11  South                         25               1  NaN
11       12  South                        250               1  NaN
12       13  South                        105               1  NaN
13       14  South                         75               1  NaN
14       15  South                        205               1  NaN
15       16  South                        190               1  NaN
16       17  South                         95               1  NaN
17       18  South                         85               1  NaN
18       19  South                         50               1  NaN
19       20  South                        110               1  NaN
20       21  South                        130               1  NaN
21       22  South                        180               1  NaN
22       23  South                         65               1  NaN
23       24  South                         75               1  NaN
24       25  South                         95               1  NaN

Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have a list of all households or people living in the country. Even if such frames exist, it may not be operationally and financially feasible to directly select sampling units without any form of clustering.

Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people. At the first stage, geographic or administrative clusters of households are selected. At the second stage, a frame of households is created from the selected clusters and a sample of households is selected. At the third stage (if applicable), a sample of people is selected from the households in the sample. This is a high-level description of the process; actual implementations are usually much less straightforward and may require many adjustments to address complexities.
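To make the stage structure concrete, the short sketch below shows how stage-level selection probabilities combine into an overall probability for a household. All numbers are hypothetical and chosen only for illustration; they are not taken from the frame used in this tutorial.

```python
# Hypothetical two-stage design: a cluster is selected at stage 1,
# then households are selected by simple random sampling (SRS) at stage 2.
stage1_prob = 0.20          # assumed probability of selecting the cluster
households_listed = 120     # assumed number of households listed in the cluster
households_sampled = 24     # assumed number of households drawn by SRS

# Under SRS without replacement, every listed household has the same chance
# of selection, given that its cluster was selected.
stage2_prob = households_sampled / households_listed

# Overall probability = stage 1 probability x conditional stage 2 probability.
overall_prob = stage1_prob * stage2_prob

print(f"Stage 2 probability: {stage2_prob:.3f}")
print(f"Overall probability: {overall_prob:.3f}")
```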

PSU Probability of Selection

At the first stage, we use the probability proportional to size (PPS) method to select a random sample of clusters. The measure of size is the number of households (number_households_census) as provided in the PSU sampling frame. The sample is stratified by region. The probabilities of selection, for stratified PPS, are obtained as follows:
\[ p_{hi} = \frac{n_h M_{hi}}{\sum_{j=1}^{N_h}{M_{hj}}} \]
where \(p_{hi}\) is the probability of selection for unit \(i\) from stratum \(h\), \(M_{hi}\) is its measure of size (MOS), and \(n_h\) and \(N_h\) are the sample size and the total number of clusters in stratum \(h\), respectively.
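As a quick check of the formula, the sketch below applies it by hand to the North stratum using NumPy only. It assumes a sample of 2 clusters in that stratum (the allocation used later in this tutorial); the variable names are illustrative and the computation is not the samplics implementation.

```python
import numpy as np

# Measures of size (number_households_census) for the 10 North clusters,
# copied from the frame shown earlier in this tutorial.
mos_north = np.array([105, 85, 95, 75, 120, 90, 130, 55, 30, 600])
n_north = 2  # assumed number of clusters to select from the North stratum

# p_hi = n_h * M_hi / (sum of M_hj over the stratum)
probs_north = n_north * mos_north / mos_north.sum()

print(np.round(probs_north, 6))
# Within a stratum, the probabilities sum to the stratum sample size n_h.
print(f"Sum of probabilities: {probs_north.sum():.1f}")
```

The resulting values can be compared with the psu_prob column that samplics computes for the North clusters further down.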

Important

The PPS method is used in many surveys, not just multistage household surveys.

For example, in business surveys, establishments can vary greatly in size; hence PPS methods are often used to select samples. Similarly, facility-based surveys can benefit from PPS methods when frames with measures of size are available.

PSU Sample Size

For a stratified sampling design, the sample size is provided using a Python dictionary, which allows us to pair each stratum with its sample size. Let’s say that we want to select 3 clusters from stratum East, 2 from West, 2 from North and 3 from South. The snippet of code below demonstrates how to create the Python dictionary. Note that the keys of the dictionary must be spelled exactly as the values of the stratum variable (in our case, region).

psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}

print(f"\nThe sample size per domain is:\n {psu_sample_size}\n")

The sample size per domain is:
 {'East': 3, 'West': 2, 'North': 2, 'South': 3}

The function array_to_dict() converts an array to a dictionary by pairing the values of the array with their frequencies. We can use this function to calculate the number of clusters per stratum and store the result in a Python dictionary. Then, we modify the values of the dictionary to create the sample size dictionary.
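The behaviour described above (pairing each distinct value with its frequency) can be sketched with the standard library. This is an illustrative equivalent, not the samplics implementation of array_to_dict(); the helper name count_by_value and the shortened region list are hypothetical.

```python
from collections import Counter


def count_by_value(values):
    """Return a dict mapping each distinct value to its frequency."""
    return dict(Counter(values))


# Hypothetical, shortened region column (the real frame has 100 clusters).
regions = ["North", "North", "South", "East", "South", "North"]

frame_size_sketch = count_by_value(regions)
print(frame_size_sketch)

# Start from the frame counts, then overwrite each value with a target sample size.
sample_size_sketch = dict(frame_size_sketch)
sample_size_sketch.update({"North": 2, "South": 1, "East": 1})
print(sample_size_sketch)
```

Starting from the frame counts guarantees that the keys of the sample size dictionary match the stratum values exactly.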

If some of the clusters are certainties (that is, their computed probability of selection is 1 or more), an exception will be raised. Hence, the user will have to handle the certainties manually. Better handling of certainties is planned for future versions of samplics.
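One way to handle this manually is to look for certainties before calling samplics. The sketch below is an assumption about how such a check could look: find_certainties is a hypothetical helper, not part of samplics, and it simply applies the PPS formula to one stratum.

```python
import numpy as np


def find_certainties(mos, n):
    """Return indices of units whose PPS probability n * M_i / sum(M) is >= 1."""
    mos = np.asarray(mos, dtype=float)
    probs = n * mos / mos.sum()
    return np.flatnonzero(probs >= 1)


# Measures of size for the 10 North clusters from the frame above.
mos_north = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]

print(find_certainties(mos_north, n=2))  # no certainties when selecting 2 clusters
print(find_certainties(mos_north, n=3))  # the cluster with 600 households (index 9)
```

A common approach is then to take the flagged units into the sample with probability 1 and to select the remaining sample from the rest of the stratum.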

from samplics import array_to_dict

frame_size = array_to_dict(psu_frame["region"])
print(f"\nThe number of clusters per stratum is:\n {frame_size}")

The number of clusters per stratum is:
 {'East': 25, 'North': 10, 'South': 20, 'West': 45}

psu_sample_size = frame_size.copy()
psu_sample_size["East"] = 3
psu_sample_size["North"] = 2
psu_sample_size["South"] = 3
psu_sample_size["West"] = 2
print(f"\nThe sample size per stratum is:\n {psu_sample_size}\n")

The sample size per stratum is:
 {'East': 3, 'North': 2, 'South': 3, 'West': 2}

stage1_design = SampleSelection(method=SelectMethod.pps_sys, strat=True, wr=False)

psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"],
    psu_frame["number_households_census"],
    )

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame \n")
psu_frame.head(nb_obs)

First 15 observations of the PSU frame 
    cluster  region  number_households_census  cluster_status  comment                  psu_prob
 0        1  North                        105               1  NaN                      0.151625
 1        2  North                         85               1  NaN                      0.122744
 2        3  North                         95               1  NaN                      0.137184
 3        4  North                         75               1  NaN                      0.108303
 4        5  North                        120               1  NaN                      0.173285
 5        6  North                         90               1  NaN                      0.129964
 6        7  North                        130               1  NaN                      0.187726
 7        8  North                         55               1  NaN                      0.079422
 8        9  North                         30               1  NaN                      0.043321
 9       10  North                        600               1  due to a large building  0.866426
10       11  South                         25               1  NaN                      0.027523
11       12  South                        250               1  NaN                      0.275229
12       13  South                        105               1  NaN                      0.115596
13       14  South                         75               1  NaN                      0.082569
14       15  South                        205               1  NaN                      0.225688

PSU Selection

In this section, we select a sample of PSUs using PPS methods. In the section above, we calculated the probabilities of selection. That step is not necessary when using samplics: the method select() calculates the probabilities of selection and selects the sample in one run. As shown below, select() returns a tuple of three arrays.
* The first array indicates the selected units (i.e. psu_sample = 1 if selected, and 0 if not selected).
* The second array provides the number of hits, useful when the sample is selected with replacement.
* The third array is the probability of selection.
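To show what these three arrays represent, here is a minimal sketch of PPS systematic selection within a single stratum, written with NumPy only. It is not the samplics implementation: the function pps_systematic is hypothetical, it uses np.random.default_rng() rather than np.random.seed(), and it is not expected to reproduce the selections shown in this tutorial.

```python
import numpy as np


def pps_systematic(mos, n, rng):
    """Sketch of PPS systematic selection within one stratum.

    Returns (sample, hits, probs), mirroring the three arrays described above.
    """
    mos = np.asarray(mos, dtype=float)
    total = mos.sum()
    interval = total / n                       # sampling interval
    start = rng.uniform(0, interval)           # random start in [0, interval)
    points = start + interval * np.arange(n)   # n equally spaced selection points

    # A unit is hit by every selection point that falls in its cumulative-size range.
    cum_mos = np.cumsum(mos)
    idx = np.searchsorted(cum_mos, points, side="right")
    idx = np.minimum(idx, mos.size - 1)        # guard against rounding at the upper edge

    hits = np.bincount(idx, minlength=mos.size)
    sample = (hits > 0).astype(int)
    probs = n * mos / total
    return sample, hits, probs


# Measures of size for the 10 North clusters from the frame above.
mos_north = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
rng = np.random.default_rng(23)
sample, hits, probs = pps_systematic(mos_north, n=2, rng=rng)
print(sample, hits, np.round(probs, 6), sep="\n")
```

In this sketch a unit can receive more than one hit only when its measure of size exceeds the sampling interval, which is not the case for the North stratum with 2 selections.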

Note

np.random.seed() fixes the random seed to allow us to reproduce the random selection.

np.random.seed(23)

psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = \
    stage1_design.select(
        psu_frame["cluster"], 
        psu_sample_size, 
        psu_frame["region"], 
        psu_frame["number_households_census"]
    )
    
psu_frame.to_csv("./psu_frame.csv")

print(
    "\nFirst 15 obs of the PSU frame with the sampling information\n"
    )
psu_frame[
    ["cluster", "region", "psu_prob", "psu_sample", "psu_hits", "psu_probs"]
    ].head(15)

First 15 obs of the PSU frame with the sampling information
    cluster  region  psu_prob  psu_sample  psu_hits  psu_probs
 0        1  North   0.151625           0         0   0.151625
 1        2  North   0.122744           0         0   0.122744
 2        3  North   0.137184           0         0   0.137184
 3        4  North   0.108303           0         0   0.108303
 4        5  North   0.173285           0         0   0.173285
 5        6  North   0.129964           0         0   0.129964
 6        7  North   0.187726           1         1   0.187726
 7        8  North   0.079422           0         0   0.079422
 8        9  North   0.043321           0         0   0.043321
 9       10  North   0.866426           1         1   0.866426
10       11  South   0.027523           0         0   0.027523
11       12  South   0.275229           0         0   0.275229
12       13  South   0.115596           0         0   0.115596
13       14  South   0.082569           0         0   0.082569
14       15  South   0.225688           0         0   0.225688

The default setting sample_only=False returns the entire frame. We can easily reduce the output data to the sample by filtering, i.e. psu_sample == 1. However, if we are only interested in the sample, we can use sample_only=True when calling select(). This reduces the output data to the sampled units, and to_dataframe=True converts the data to a pandas dataframe (pd.DataFrame). Note that the columns in the dataframe are reduced to the minimum.

np.random.seed(23)

psu_sample = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe = True,
    sample_only = True
    )

print("\nPSU sample without the non-sampled units\n")
psu_sample

PSU sample without the non-sampled units
   _samp_unit  _stratum  _mos  _sample  _hits    _probs
0           7  North      130        1      1  0.187726
1          10  North      600        1      1  0.866426
2          16  South      190        1      1  0.209174
3          24  South       75        1      1  0.082569
4          29  South      200        1      1  0.220183
5          34  East       305        1      1  0.210587
6          45  East       450        1      1  0.310702
7          52  East       700        1      1  0.483314
8          64  West       300        1      1  0.091673
9          86  West       280        1      1  0.085561
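For comparison, the filtering approach mentioned above (keeping the rows where psu_sample == 1) can be sketched with plain pandas. The small dataframe below copies a few North rows from the full-frame output shown earlier; the variable names frame and sampled are illustrative.

```python
import pandas as pd

# A few rows copied from the full-frame output shown earlier in this section.
frame = pd.DataFrame(
    {
        "cluster": [6, 7, 8, 9, 10],
        "region": ["North"] * 5,
        "psu_sample": [0, 1, 0, 0, 1],
        "psu_hits": [0, 1, 0, 0, 1],
        "psu_probs": [0.129964, 0.187726, 0.079422, 0.043321, 0.866426],
    }
)

# Keep only the selected units; all original columns are preserved.
sampled = frame[frame["psu_sample"] == 1].reset_index(drop=True)
print(sampled)
```

Unlike sample_only=True, this keeps every column of the original frame, which can be convenient when later stages need variables such as the comment or status fields.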

The systematic selection method can be implemented with or without replacement. The other samplics algorithms for selecting samples with unequal probabilities of selection are the Brewer, Hanurav-Vijayan (hv), Murphy, and Rao-Sampford (rs) methods. As shown below, each of these sampling techniques can be specified when instantiating the SampleSelection class; select() is then called to draw the sample.

SampleSelection(method=SelectMethod.pps_sys, wr=True)
SampleSelection(method=SelectMethod.pps_sys, wr=False)
SampleSelection(method=SelectMethod.pps_brewer, wr=False)
SampleSelection(method=SelectMethod.pps_hv, wr=False) # Hanurav-Vijayan method
SampleSelection(method=SelectMethod.pps_murphy, wr=False)
SampleSelection(method=SelectMethod.pps_rs, wr=False) # Rao-Sampford method

For example, if we wanted to select the sample using the Rao-Sampford method, we could use the following snippet of code.

np.random.seed(23)

stage1_sampford = SampleSelection(
    method=SelectMethod.pps_rs, 
    strat=True, 
    wr=False
    )

psu_sample_sampford = stage1_sampford.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=False
    )

psu_sample_sampford

    _samp_unit  _stratum  _mos  _sample  _hits    _probs
 0           1  North      105        0      0  0.151625
 1           2  North       85        0      0  0.122744
 2           3  North       95        1      1  0.137184
 3           4  North       75        0      0  0.108303
 4           5  North      120        0      0  0.173285
..         ...  ...        ...      ...    ...       ...
95          96  West        95        1      1  0.029030
96          97  West        40        0      0  0.012223
97          98  West       105        0      0  0.032086
98          99  West       320        0      0  0.097785
99         100  West       200        0      0  0.061115

100 rows × 6 columns