import numpy as np
from samplics.datasets import load_psu_frame
from samplics import SelectMethod
from samplics.sampling import SampleSelection
Selection of PSUs
In the sections below, we draw primary sampling units (PSUs) using probability proportional to size (PPS) sampling techniques implemented in the SampleSelection class. The class SampleSelection has two main methods: inclusion_probs() and select(). The method inclusion_probs() computes the probabilities of selection, and select() draws the random sample.
The following will illustrate the use of samplics for sample selection. For the illustration,
- We consider a stratified cluster design.
- We decide a priori how many PSUs to sample from each stratum.
- For the selection of clusters, we demonstrate PPS methods.
This example is not meant to be exhaustive. There are many use cases that are not covered in this tutorial. For example, some PSUs may be segmented due to their size, with segments selected in a subsequent step. Segment selection can be done with samplics in the same way as PSU selection, with PPS or SRS, after the segments have been created by the user.
First, let us import the Python packages necessary to run the tutorial.
Sample Dataset
The file sample_frame.csv - shown below - contains synthetic data on 100 clusters classified by region (East, North, South and West). Each cluster represents a group of households. In the file, each cluster has an associated number of households (number_households_census) and a status variable (cluster_status) indicating whether the cluster is in scope or not.
This synthetic data represents a simplified version of the enumeration area (EA) frames found in many countries and used by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA) surveys and the Multiple Indicator Cluster Surveys (MICS).
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]
psu_frame.head(25)
| | cluster | region | number_households_census | cluster_status | comment |
---|---|---|---|---|---|
0 | 1 | North | 105 | 1 | NaN |
1 | 2 | North | 85 | 1 | NaN |
2 | 3 | North | 95 | 1 | NaN |
3 | 4 | North | 75 | 1 | NaN |
4 | 5 | North | 120 | 1 | NaN |
5 | 6 | North | 90 | 1 | NaN |
6 | 7 | North | 130 | 1 | NaN |
7 | 8 | North | 55 | 1 | NaN |
8 | 9 | North | 30 | 1 | NaN |
9 | 10 | North | 600 | 1 | due to a large building |
10 | 11 | South | 25 | 1 | NaN |
11 | 12 | South | 250 | 1 | NaN |
12 | 13 | South | 105 | 1 | NaN |
13 | 14 | South | 75 | 1 | NaN |
14 | 15 | South | 205 | 1 | NaN |
15 | 16 | South | 190 | 1 | NaN |
16 | 17 | South | 95 | 1 | NaN |
17 | 18 | South | 85 | 1 | NaN |
18 | 19 | South | 50 | 1 | NaN |
19 | 20 | South | 110 | 1 | NaN |
20 | 21 | South | 130 | 1 | NaN |
21 | 22 | South | 180 | 1 | NaN |
22 | 23 | South | 65 | 1 | NaN |
23 | 24 | South | 75 | 1 | NaN |
24 | 25 | South | 95 | 1 | NaN |
Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have a list of all households or people living in the country. Even if such frames exist, it may not be operationally and financially feasible to directly select sampling units without any form of clustering.
Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people. At the first stage, geographic or administrative clusters of households are selected. At the second stage, a frame of households is created from the selected clusters and a sample of households is selected. At the third stage (if applicable), a sample of people is selected from the households in the sample. This is a high-level description of the process; actual implementations are usually much less straightforward and may require many adjustments to address complexities.
PSU Probability of Selection
At the first stage, we use the probability proportional to size (PPS) method to select a random sample of clusters. The measure of size is the number of households (number_households_census) as provided in the PSU sampling frame. The sample is stratified by region. The probabilities for stratified PPS are obtained as follows: \[ p_{hi} = \frac{n_h M_{hi}}{\sum_{j=1}^{N_h}{M_{hj}}} \] where \(p_{hi}\) is the probability of selection for unit \(i\) from stratum \(h\), \(M_{hi}\) is the measure of size (MOS), and \(n_h\) and \(N_h\) are the sample size and the total number of clusters in stratum \(h\), respectively.
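As a quick check of the formula, the sketch below applies it by hand to the ten North clusters listed in the frame above, assuming a stratum sample size of \(n_h = 2\). It uses plain Python rather than samplics, and the variable names are illustrative only.

```python
# Measures of size (number_households_census) for the ten North clusters
# shown in the frame listing above.
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
n_h = 2  # assumed number of PSUs to select in the North stratum

total_mos = sum(north_mos)  # denominator of the formula

# p_hi = n_h * M_hi / (sum of the measures of size in the stratum)
north_probs = [n_h * m / total_mos for m in north_mos]

for cluster, p in enumerate(north_probs, start=1):
    print(f"cluster {cluster:2d}: p = {p:.6f}")

# Within a stratum, the probabilities of selection add up to the sample size.
print(f"sum of probabilities = {sum(north_probs):.6f}")
```

For cluster 1, this gives 2 × 105 / 1385 ≈ 0.1516, and the ten probabilities sum to 2, the number of PSUs requested in the stratum.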
The PPS method is used in many surveys, not just in multistage household surveys.
For example, in business surveys, establishments can vary greatly in size; hence PPS methods are often used to select samples. Similarly, facility-based surveys can benefit from PPS methods when frames with measures of size are available.
PSU Sample size
For a stratified sampling design, the sample size is provided using a Python dictionary. Python dictionaries allow us to pair the strata with the sample sizes. Let’s say that we want to select 3 clusters from stratum East, 2 from West, 2 from North and 3 from South. The snippet of code below demonstrates how to create the Python dictionary. Note that it is important to spell the keys of the dictionary correctly: they must correspond to the values of the stratum variable (in our case, region).
psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
print(f"\nThe sample size per domain is:\n {psu_sample_size}\n")
The sample size per domain is:
{'East': 3, 'West': 2, 'North': 2, 'South': 3}
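Because a misspelled key would not match any stratum, it can help to compare the dictionary keys with the strata found in the frame before selecting. The check below is a minimal sketch in plain Python; the `regions` list stands in for the distinct values of the region column and is hard-coded here as an assumption for illustration.

```python
# Strata as they appear in the frame (stand-in for the distinct values of
# psu_frame["region"]; hard-coded so the snippet runs on its own).
regions = ["North", "South", "East", "West"]

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}

# Keys with no matching stratum, and strata with no sample size.
unknown_keys = set(psu_sample_size) - set(regions)
missing_strata = set(regions) - set(psu_sample_size)

if unknown_keys or missing_strata:
    raise ValueError(
        f"unknown keys: {sorted(unknown_keys)}, "
        f"strata without a sample size: {sorted(missing_strata)}"
    )
print("Sample size dictionary matches the strata.")
```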
The function array_to_dict() converts an array to a dictionary by pairing the values of the array with their frequencies. We can use this function to calculate the number of clusters per stratum and store the result in a Python dictionary. Then, we modify the values of the dictionary to create the sample size dictionary.
If some of the clusters are certainties (that is, their computed probability of selection reaches or exceeds 1), an exception will be raised. Hence, the user will have to handle the certainties manually. Better handling of certainties is planned for future versions of samplics.
from samplics import array_to_dict
frame_size = array_to_dict(psu_frame["region"])
print(f"\nThe number of clusters per stratum is:\n {frame_size}")
The number of clusters per stratum is:
{'East': 25, 'North': 10, 'South': 20, 'West': 45}
psu_sample_size = frame_size.copy()
psu_sample_size["East"] = 3
psu_sample_size["North"] = 2
psu_sample_size["South"] = 3
psu_sample_size["West"] = 2
print(f"\nThe sample size per stratum is:\n {psu_sample_size}\n")
The sample size per stratum is:
{'East': 3, 'North': 2, 'South': 3, 'West': 2}
stage1_design = SampleSelection(method=SelectMethod.pps_sys, strat=True, wr=False)
psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
)
nb_obs = 15
print(f"\nFirst {nb_obs} observations of the PSU frame \n")
psu_frame.head(nb_obs)
First 15 observations of the PSU frame
| | cluster | region | number_households_census | cluster_status | comment | psu_prob |
---|---|---|---|---|---|---|
0 | 1 | North | 105 | 1 | NaN | 0.151625 |
1 | 2 | North | 85 | 1 | NaN | 0.122744 |
2 | 3 | North | 95 | 1 | NaN | 0.137184 |
3 | 4 | North | 75 | 1 | NaN | 0.108303 |
4 | 5 | North | 120 | 1 | NaN | 0.173285 |
5 | 6 | North | 90 | 1 | NaN | 0.129964 |
6 | 7 | North | 130 | 1 | NaN | 0.187726 |
7 | 8 | North | 55 | 1 | NaN | 0.079422 |
8 | 9 | North | 30 | 1 | NaN | 0.043321 |
9 | 10 | North | 600 | 1 | due to a large building | 0.866426 |
10 | 11 | South | 25 | 1 | NaN | 0.027523 |
11 | 12 | South | 250 | 1 | NaN | 0.275229 |
12 | 13 | South | 105 | 1 | NaN | 0.115596 |
13 | 14 | South | 75 | 1 | NaN | 0.082569 |
14 | 15 | South | 205 | 1 | NaN | 0.225688 |
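The probabilities above show why certainties need attention: cluster 10 already has a probability of about 0.87 with two PSUs selected in North. The sketch below, written in plain Python and independent of samplics, flags the units whose computed probability would reach or exceed 1 if three PSUs were requested instead. The helper name `find_certainties` is hypothetical.

```python
def find_certainties(mos, n):
    """Return 1-based positions of units with n * M_i / sum(M) >= 1."""
    total = sum(mos)
    return [i for i, m in enumerate(mos, start=1) if n * m / total >= 1]


# Measures of size for the ten North clusters (clusters 1 to 10).
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]

print(find_certainties(north_mos, n=2))  # no unit reaches probability 1
print(find_certainties(north_mos, n=3))  # cluster 10: 3 * 600 / 1385 > 1
```

One common manual approach is to take such a unit with certainty, remove it from the frame, and select the remaining PSUs from the rest of the stratum.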
PSU Selection
In this section, we select a sample of PSUs using PPS methods. In the section above, we calculated the probabilities of selection. That step is not necessary when using samplics: the method select() calculates the probabilities of selection and selects the sample in one run. As shown below, the select() method returns a tuple of three arrays.
* The first array indicates the selected units (i.e. psu_sample = 1 if selected, and 0 if not selected).
* The second array provides the number of hits, useful when the sample is selected with replacement.
* The third array is the probability of selection.
np.random.seed() fixes the random seed to allow us to reproduce the random selection.
np.random.seed(23)
psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = \
    stage1_design.select(
        psu_frame["cluster"],
        psu_sample_size,
        psu_frame["region"],
        psu_frame["number_households_census"]
    )
psu_frame.to_csv("./psu_frame.csv")
print(
    "\nFirst 15 obs of the PSU frame with the sampling information\n"
)
psu_frame[
    ["cluster", "region", "psu_prob", "psu_sample", "psu_hits", "psu_probs"]
].head(15)
First 15 obs of the PSU frame with the sampling information
| | cluster | region | psu_prob | psu_sample | psu_hits | psu_probs |
---|---|---|---|---|---|---|
0 | 1 | North | 0.151625 | 0 | 0 | 0.151625 |
1 | 2 | North | 0.122744 | 0 | 0 | 0.122744 |
2 | 3 | North | 0.137184 | 0 | 0 | 0.137184 |
3 | 4 | North | 0.108303 | 0 | 0 | 0.108303 |
4 | 5 | North | 0.173285 | 0 | 0 | 0.173285 |
5 | 6 | North | 0.129964 | 0 | 0 | 0.129964 |
6 | 7 | North | 0.187726 | 1 | 1 | 0.187726 |
7 | 8 | North | 0.079422 | 0 | 0 | 0.079422 |
8 | 9 | North | 0.043321 | 0 | 0 | 0.043321 |
9 | 10 | North | 0.866426 | 1 | 1 | 0.866426 |
10 | 11 | South | 0.027523 | 0 | 0 | 0.027523 |
11 | 12 | South | 0.275229 | 0 | 0 | 0.275229 |
12 | 13 | South | 0.115596 | 0 | 0 | 0.115596 |
13 | 14 | South | 0.082569 | 0 | 0 | 0.082569 |
14 | 15 | South | 0.225688 | 0 | 0 | 0.225688 |
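To make the three outputs concrete, here is a minimal sketch of a textbook systematic PPS draw for a single stratum. It is written with the standard library only and is not the samplics implementation; the function name `pps_systematic` and its arguments are illustrative assumptions, and its random draws will not match those of select().

```python
import random
from bisect import bisect_right
from itertools import accumulate


def pps_systematic(mos, n, rng):
    """Textbook systematic PPS draw of n units from one stratum.

    Returns (sample flags, hits, probabilities of selection).
    """
    total = sum(mos)
    interval = total / n               # sampling interval
    start = rng.random() * interval    # random start in [0, interval)
    upper = list(accumulate(mos))      # cumulative sizes (upper bounds)

    hits = [0] * len(mos)
    for k in range(n):
        point = start + k * interval
        # Index of the unit whose cumulative range contains the point.
        idx = min(bisect_right(upper, point), len(mos) - 1)
        hits[idx] += 1

    sample = [1 if h > 0 else 0 for h in hits]
    probs = [n * m / total for m in mos]
    return sample, hits, probs


# Measures of size for the ten North clusters.
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
rng = random.Random(23)  # fixed seed so the draw can be reproduced
sample, hits, probs = pps_systematic(north_mos, n=2, rng=rng)
print(sample)
print(hits)
print([round(p, 6) for p in probs])
```

Because no measure of size exceeds the sampling interval (1385 / 2 = 692.5), no unit can be hit twice in this example, so the sample flags and the hits coincide.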
The default setting sample_only=False returns the entire frame. We can easily reduce the output to the sample by filtering on psu_sample == 1. However, if we are only interested in the sample, we can pass sample_only=True when calling select(), which reduces the output to the sampled units. Setting to_dataframe=True converts the output to a pandas dataframe (pd.DataFrame). Note that the columns in the dataframe are reduced to the minimum.
np.random.seed(23)
psu_sample = stage1_design.select(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=True
)
print("\nPSU sample without the non-sampled units\n")
psu_sample
PSU sample without the non-sampled units
| | _samp_unit | _stratum | _mos | _sample | _hits | _probs |
---|---|---|---|---|---|---|
0 | 7 | North | 130 | 1 | 1 | 0.187726 |
1 | 10 | North | 600 | 1 | 1 | 0.866426 |
2 | 16 | South | 190 | 1 | 1 | 0.209174 |
3 | 24 | South | 75 | 1 | 1 | 0.082569 |
4 | 29 | South | 200 | 1 | 1 | 0.220183 |
5 | 34 | East | 305 | 1 | 1 | 0.210587 |
6 | 45 | East | 450 | 1 | 1 | 0.310702 |
7 | 52 | East | 700 | 1 | 1 | 0.483314 |
8 | 64 | West | 300 | 1 | 1 | 0.091673 |
9 | 86 | West | 280 | 1 | 1 | 0.085561 |
The systematic selection method can be implemented with or without replacement. The other samplics algorithms for selecting samples with unequal probabilities of selection are the Brewer, Hanurav-Vijayan (hv), Murphy, and Rao-Sampford (rs) methods. As shown below, each of these sampling techniques can be specified when instantiating the SampleSelection class; select() is then called to draw the sample.
SampleSelection(method=SelectMethod.pps_sys, wr=True)
SampleSelection(method=SelectMethod.pps_sys, wr=False)
SampleSelection(method=SelectMethod.pps_brewer, wr=False)
SampleSelection(method=SelectMethod.pps_hv, wr=False) # Hanurav-Vijayan method
SampleSelection(method=SelectMethod.pps_murphy, wr=False)
SampleSelection(method=SelectMethod.pps_rs, wr=False) # Rao-Sampford method
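The wr argument matters for the hits array: with replacement, the same unit can be drawn more than once. The sketch below illustrates this with independent PPS draws using random.choices from the standard library. This is a simpler scheme than systematic selection and is not how samplics implements wr=True; the number of draws is an arbitrary value chosen for illustration.

```python
import random
from collections import Counter

# Measures of size for the ten North clusters (clusters 1 to 10).
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
clusters = list(range(1, len(north_mos) + 1))

rng = random.Random(23)  # fixed seed so the draw can be reproduced
n_draws = 4              # illustrative number of draws

# Each draw picks one cluster with probability proportional to its size;
# a cluster drawn earlier stays in the pool for later draws.
draws = rng.choices(clusters, weights=north_mos, k=n_draws)

counts = Counter(draws)
hits = [counts[c] for c in clusters]        # times each cluster was drawn
sample = [1 if h > 0 else 0 for h in hits]  # drawn at least once

print(hits)
print(sample)
```

The hits always sum to the number of draws, while the number of distinct sampled clusters can be smaller when a large cluster is drawn repeatedly.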
For example, if we wanted to select the sample using the Rao-Sampford method, we could use the following snippet of code.
np.random.seed(23)
stage1_sampford = SampleSelection(
    method=SelectMethod.pps_rs,
    strat=True,
    wr=False
)
psu_sample_sampford = stage1_sampford.select(
    psu_frame["cluster"],
    psu_sample_size,
    psu_frame["region"],
    psu_frame["number_households_census"],
    to_dataframe=True,
    sample_only=False
)
psu_sample_sampford
| | _samp_unit | _stratum | _mos | _sample | _hits | _probs |
---|---|---|---|---|---|---|
0 | 1 | North | 105 | 0 | 0 | 0.151625 |
1 | 2 | North | 85 | 0 | 0 | 0.122744 |
2 | 3 | North | 95 | 1 | 1 | 0.137184 |
3 | 4 | North | 75 | 0 | 0 | 0.108303 |
4 | 5 | North | 120 | 0 | 0 | 0.173285 |
... | ... | ... | ... | ... | ... | ... |
95 | 96 | West | 95 | 1 | 1 | 0.029030 |
96 | 97 | West | 40 | 0 | 0 | 0.012223 |
97 | 98 | West | 105 | 0 | 0 | 0.032086 |
98 | 99 | West | 320 | 0 | 0 | 0.097785 |
99 | 100 | West | 200 | 0 | 0 | 0.061115 |
100 rows × 6 columns