Selection of PSUs

In the sections below, we draw primary sampling units (PSUs) using probability proportional to size (PPS) sampling techniques implemented in the SampleSelection class. The class has two main methods: inclusion_probs() and select(). The method inclusion_probs() computes the probabilities of selection, and select() draws the random sample.

The following illustrates the use of samplics for sample selection. For the illustration:

* We consider a stratified cluster design.
* We decide a priori how many PSUs to sample from each stratum.
* For the selection of clusters, we demonstrate PPS methods.

This example is not meant to be exhaustive. There are many use cases that are not covered in this tutorial. For example, some PSUs may be segmented due to their size, with segments selected in a subsequent step. Segment selection can be done with samplics in the same way as PSU selection, with PPS or SRS, after the segments have been created by the user.

First, let us import the Python packages necessary to run the tutorial.

import numpy as np 
from samplics.datasets import load_psu_frame

from samplics import SelectMethod
from samplics.sampling import SampleSelection

Sample Dataset

The file sample_frame.csv, shown below, contains synthetic data on 100 clusters classified by region (East, North, South and West). Each cluster represents a group of households. In the file, each cluster has an associated number of households (number_households_census) and a status variable (cluster_status) indicating whether the cluster is in scope or not.

This synthetic data represents a simplified version of the enumeration area (EA) frames found in many countries and used by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA) surveys and the Multiple Indicator Cluster Surveys (MICS).

psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]
psu_frame.head(25)

	cluster	region	number_households_census	cluster_status	comment
0	1	North	105	1	NaN
1	2	North	85	1	NaN
2	3	North	95	1	NaN
3	4	North	75	1	NaN
4	5	North	120	1	NaN
5	6	North	90	1	NaN
6	7	North	130	1	NaN
7	8	North	55	1	NaN
8	9	North	30	1	NaN
9	10	North	600	1	due to a large building
10	11	South	25	1	NaN
11	12	South	250	1	NaN
12	13	South	105	1	NaN
13	14	South	75	1	NaN
14	15	South	205	1	NaN
15	16	South	190	1	NaN
16	17	South	95	1	NaN
17	18	South	85	1	NaN
18	19	South	50	1	NaN
19	20	South	110	1	NaN
20	21	South	130	1	NaN
21	22	South	180	1	NaN
22	23	South	65	1	NaN
23	24	South	75	1	NaN
24	25	South	95	1	NaN

Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have a list of all households or people living in the country. Even if such frames exist, it may not be operationally and financially feasible to directly select sampling units without any form of clustering.

Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people. At the first stage, geographic or administrative clusters of households are selected. At the second stage, a frame of households is created from the selected clusters and a sample of households is selected. At the third stage (if applicable), a sample of people is selected from the households in the sample. This is a high-level description of the process; actual implementations are usually much less straightforward and may require many adjustments to address complexities.

PSU Probability of Selection

At the first stage, we use the probability proportional to size (PPS) method to select a random sample of clusters. The measure of size is the number of households (number_households_census) as provided in the PSU sampling frame. The sample is stratified by region. For stratified PPS, the probabilities are obtained as follows:

\[ p_{hi} = \frac{n_h M_{hi}}{\sum_{j=1}^{N_h} M_{hj}} \]

where \(p_{hi}\) is the probability of selection for unit \(i\) from stratum \(h\), \(M_{hi}\) is its measure of size (MOS), and \(n_h\) and \(N_h\) are the sample size and the total number of clusters in stratum \(h\), respectively.
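
As a quick check of the formula, the sketch below applies it by hand to the ten North clusters listed in the frame above, with a sample size of 2 for that stratum. It is plain Python and does not call samplics; the names north_mos and n_h are introduced here only for illustration.

```python
# Measures of size (number_households_census) of the 10 North clusters
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
n_h = 2  # number of PSUs to select from the North stratum

total_mos = sum(north_mos)  # denominator of the formula (stratum total)
north_probs = [n_h * m / total_mos for m in north_mos]

for cluster, prob in enumerate(north_probs, start=1):
    print(f"cluster {cluster:2d}: {prob:.6f}")
```

The stratum total is 1385 households, so the first cluster gets 2 × 105 / 1385 ≈ 0.1516. Within a stratum, the probabilities add up to the stratum sample size (here 2).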

Important

The PPS method is used in many surveys, not just multistage household surveys.

For example, in business surveys, establishments can vary greatly in size; hence PPS methods are often used to select samples. Similarly, facility-based surveys can benefit from PPS methods when frames with measures of size are available.

PSU Sample Size

For a stratified sampling design, the sample size is provided using a Python dictionary. Python dictionaries allow us to pair the strata with the sample sizes. Let’s say that we want to select 3 clusters from stratum East, 2 from West, 2 from North and 3 from South. The snippet of code below demonstrates how to create the Python dictionary. Note that it is important to spell the keys of the dictionary correctly: they must match the values of the stratum variable (in our case, region).

psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}

print(f"\nThe sample size per domain is:\n {psu_sample_size}\n")


The sample size per domain is:
 {'East': 3, 'West': 2, 'North': 2, 'South': 3}
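
Since the keys must match the stratum labels exactly, it can help to compare them with the strata present in the frame before selecting. The helper below is a hypothetical convenience function, not part of samplics; it accepts the stratum labels as any iterable, for example psu_frame["region"].

```python
def check_strata_keys(sample_size, strata):
    """Raise ValueError if the sample-size keys and the strata labels differ."""
    expected = set(strata)
    provided = set(sample_size)
    missing = expected - provided  # strata with no sample size
    unknown = provided - expected  # keys that match no stratum
    if missing or unknown:
        raise ValueError(
            f"missing strata: {sorted(missing)}; unknown keys: {sorted(unknown)}"
        )


regions = ["East", "North", "South", "West"]  # stand-in for psu_frame["region"]

# Correctly spelled keys: no exception is raised
check_strata_keys({"East": 3, "West": 2, "North": 2, "South": 3}, regions)

# A lowercase "north" is reported as both a missing stratum and an unknown key
try:
    check_strata_keys({"East": 3, "West": 2, "north": 2, "South": 3}, regions)
except ValueError as err:
    print(err)
```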

The function array_to_dict() converts an array to a dictionary by pairing the values of the array with their frequencies. We can use this function to calculate the number of clusters per stratum and store the result in a Python dictionary. Then, we modify the values of the dictionary to create the sample size dictionary.

If some of the clusters are certainties, that is, clusters so large that their PPS probability of selection would reach or exceed 1, then an exception will be raised. Hence, the user will have to handle the certainties manually. Better handling of certainties is planned for future versions of samplics.

from samplics import array_to_dict

frame_size = array_to_dict(psu_frame["region"])
print(f"\nThe number of clusters per stratum is:\n {frame_size}")


The number of clusters per stratum is:
 {'East': 25, 'North': 10, 'South': 20, 'West': 45}
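
For intuition, the same kind of frequency dictionary can be produced with the standard library. This is only a conceptual equivalent and makes no claim about how array_to_dict() is implemented; count_by_stratum is a name made up for this sketch.

```python
from collections import Counter


def count_by_stratum(labels):
    """Pair each distinct label with its frequency, sorted by label."""
    return dict(sorted(Counter(labels).items()))


# Toy labels; with the tutorial data one would pass psu_frame["region"]
labels = ["North", "South", "North", "East", "South", "North"]
print(count_by_stratum(labels))
```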

psu_sample_size = frame_size.copy()
psu_sample_size["East"] = 3
psu_sample_size["North"] = 2
psu_sample_size["South"] = 3
psu_sample_size["West"] = 2
print(f"\nThe sample size per stratum is:\n {psu_sample_size}\n")


The sample size per stratum is:
 {'East': 3, 'North': 2, 'South': 3, 'West': 2}

stage1_design = SampleSelection(method=SelectMethod.pps_sys, strat=True, wr=False)

psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"],
    psu_frame["number_households_census"],
    )

nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame \n")
psu_frame.head(nb_obs)


First 15 observations of the PSU frame

	cluster	region	number_households_census	cluster_status	comment	psu_prob
0	1	North	105	1	NaN	0.151625
1	2	North	85	1	NaN	0.122744
2	3	North	95	1	NaN	0.137184
3	4	North	75	1	NaN	0.108303
4	5	North	120	1	NaN	0.173285
5	6	North	90	1	NaN	0.129964
6	7	North	130	1	NaN	0.187726
7	8	North	55	1	NaN	0.079422
8	9	North	30	1	NaN	0.043321
9	10	North	600	1	due to a large building	0.866426
10	11	South	25	1	NaN	0.027523
11	12	South	250	1	NaN	0.275229
12	13	South	105	1	NaN	0.115596
13	14	South	75	1	NaN	0.082569
14	15	South	205	1	NaN	0.225688
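
Cluster 10 in the table above already has a probability of about 0.87. The sketch below shows one way to screen a stratum for certainties, meaning units for which the sample size times the measure of size is at least the stratum total, before calling inclusion_probs(). The function find_certainties is hypothetical and not part of samplics; how to treat the flagged units is left to the user.

```python
def find_certainties(units, mos, n):
    """Return the units whose PPS probability n * mos / total would reach 1."""
    total = sum(mos)
    return [u for u, m in zip(units, mos) if n * m >= total]


north_clusters = list(range(1, 11))
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]

# With 2 PSUs, no North cluster is a certainty (2 * 600 < 1385)
print(find_certainties(north_clusters, north_mos, n=2))

# With 3 PSUs, cluster 10 becomes a certainty (3 * 600 > 1385)
print(find_certainties(north_clusters, north_mos, n=3))
```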

PSU Selection

In this section, we select a sample of PSUs using PPS methods. In the section above, we calculated the probabilities of selection. That step is not required when using samplics: the method select() can calculate the probabilities of selection and select the sample in one run. As shown below, the select() method returns a tuple of three arrays.
* The first array indicates the selected units (i.e. psu_sample = 1 if selected, and 0 if not selected).
* The second array provides the number of hits, useful when the sample is selected with replacement.
* The third array is the probability of selection.

Note

np.random.seed() fixes the random seed to allow us to reproduce the random selection.

np.random.seed(23)

psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = \
    stage1_design.select(
        psu_frame["cluster"], 
        psu_sample_size, 
        psu_frame["region"], 
        psu_frame["number_households_census"]
    )
    
psu_frame.to_csv("./psu_frame.csv")

print(
    "\nFirst 15 obs of the PSU frame with the sampling information\n"
    )
psu_frame[
    ["cluster", "region", "psu_prob", "psu_sample", "psu_hits", "psu_probs"]
    ].head(15)


First 15 obs of the PSU frame with the sampling information

	cluster	region	psu_prob	psu_sample	psu_hits	psu_probs
0	1	North	0.151625	0	0	0.151625
1	2	North	0.122744	0	0	0.122744
2	3	North	0.137184	0	0	0.137184
3	4	North	0.108303	0	0	0.108303
4	5	North	0.173285	0	0	0.173285
5	6	North	0.129964	0	0	0.129964
6	7	North	0.187726	1	1	0.187726
7	8	North	0.079422	0	0	0.079422
8	9	North	0.043321	0	0	0.043321
9	10	North	0.866426	1	1	0.866426
10	11	South	0.027523	0	0	0.027523
11	12	South	0.275229	0	0	0.275229
12	13	South	0.115596	0	0	0.115596
13	14	South	0.082569	0	0	0.082569
14	15	South	0.225688	0	0	0.225688

The default setting sample_only=False returns the entire frame. We can easily reduce the output to the sample by filtering on psu_sample == 1. However, if we are only interested in the sample, we can pass sample_only=True when calling select(), which reduces the output to the sampled units. In addition, to_dataframe=True converts the output to a pandas dataframe (pd.DataFrame). Note that the columns in the dataframe will be reduced to the minimum.

np.random.seed(23)

psu_sample = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe = True,
    sample_only = True
    )

print("\nPSU sample without the non-sampled units\n")
psu_sample


PSU sample without the non-sampled units

	_samp_unit	_stratum	_mos	_sample	_hits	_probs
0	7	North	130	1	1	0.187726
1	10	North	600	1	1	0.866426
2	16	South	190	1	1	0.209174
3	24	South	75	1	1	0.082569
4	29	South	200	1	1	0.220183
5	34	East	305	1	1	0.210587
6	45	East	450	1	1	0.310702
7	52	East	700	1	1	0.483314
8	64	West	300	1	1	0.091673
9	86	West	280	1	1	0.085561
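
To give an idea of what a systematic PPS draw does within a single stratum, here is a textbook sketch in plain Python. It is not the samplics implementation, and pps_systematic and its arguments are names chosen for this illustration. It assumes selection without replacement and no certainty units.

```python
import random
from itertools import accumulate


def pps_systematic(mos, n, rng):
    """Return the 0-based positions of n units drawn by systematic PPS."""
    total = sum(mos)
    if any(n * m >= total for m in mos):
        raise ValueError("certainty unit found; handle it before selecting")
    interval = total / n             # sampling interval on the size scale
    start = rng.random() * interval  # random start in [0, interval)
    cum = list(accumulate(mos))      # cumulative measures of size
    selected, pos = [], 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative range contains the point
        while pos < len(cum) - 1 and cum[pos] <= point:
            pos += 1
        selected.append(pos)
    return selected


north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
picks = pps_systematic(north_mos, n=2, rng=random.Random(23))
print([p + 1 for p in picks])  # cluster numbers of the selected North PSUs
```

Because the selection points are equally spaced along the cumulated sizes, the result depends on the order of the units in the frame, and a unit smaller than the interval can be hit at most once.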

The systematic selection method can be implemented with or without replacement. The other samplics algorithms for selecting samples with unequal probabilities of selection are the Brewer, Hanurav-Vijayan (hv), Murphy, and Rao-Sampford (rs) methods. As shown below, each of these sampling techniques can be specified when instantiating the SampleSelection class; select() is then called to draw the sample.

SampleSelection(method=SelectMethod.pps_sys, wr=True)
SampleSelection(method=SelectMethod.pps_sys, wr=False)
SampleSelection(method=SelectMethod.pps_brewer, wr=False)
SampleSelection(method=SelectMethod.pps_hv, wr=False) # Hanurav-Vijayan method
SampleSelection(method=SelectMethod.pps_murphy, wr=False)
SampleSelection(method=SelectMethod.pps_rs, wr=False) # Rao-Sampford method

For example, if we wanted to select the sample using the Rao-Sampford method, we could use the following snippet of code.

np.random.seed(23)

stage1_sampford = SampleSelection(
    method=SelectMethod.pps_rs, 
    strat=True, 
    wr=False
    )

psu_sample_sampford = stage1_sampford.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=False
    )

psu_sample_sampford

	_samp_unit	_stratum	_mos	_sample	_hits	_probs
0	1	North	105	0	0	0.151625
1	2	North	85	0	0	0.122744
2	3	North	95	1	1	0.137184
3	4	North	75	0	0	0.108303
4	5	North	120	0	0	0.173285
...	...	...	...	...	...	...
95	96	West	95	1	1	0.029030
96	97	West	40	0	0	0.012223
97	98	West	105	0	0	0.032086
98	99	West	320	0	0	0.097785
99	100	West	200	0	0	0.061115

100 rows × 6 columns