from samplics.sampling import SampleSize
Sample Size for Stage Design
In the cells below, we illustrate a simple example of sample size calculation in the context of household surveys using stage sampling designs. Let’s assume that we want to calculate the sample size for a vaccination survey in Senegal, and that we want to stratify the sample by administrative region. We will use the 2017 Senegal Demographic and Health Survey (DHS) to get an idea of the vaccination coverage rates for some of the main vaccine-doses.
Below, we show coverage rates of the hepatitis B birth dose (hepB0) vaccine, the first and third doses of diphtheria, tetanus and pertussis (DTP) vaccine, the first dose of measles-containing vaccine (MCV1), and basic vaccination. Basic vaccination refers to children aged 12-23 months who received the BCG vaccine, three doses of DTP-containing vaccine, three doses of polio vaccine, and the first dose of measles-containing vaccine. The table below shows the 2017 Senegal DHS vaccination coverage (in percent) of a few vaccine-doses for children aged 12 to 23 months.
Region | HepB0 | DTP1 | DTP3 | MCV1 | Basic vaccination |
---|---|---|---|---|---|
Dakar | 53.6 | 99.1 | 98.5 | 97.0 | 84.9 |
Ziguinchor | 47.1 | 98.6 | 94.1 | 93.6 | 80.9 |
Diourbel | 62.8 | 94.6 | 88.2 | 86.1 | 68.2 |
Saint-Louis | 40.1 | 99.1 | 97.2 | 94.7 | 80.6 |
Tambacounda | 45.0 | 83.3 | 72.7 | 65.3 | 47.0 |
Kaolack | 63.9 | 99.6 | 92.2 | 89.3 | 79.7 |
Thies | 62.3 | 100.0 | 98.8 | 91.6 | 83.4 |
Louga | 49.8 | 96.2 | 87.8 | 81.5 | 67.8 |
Fatick | 62.7 | 98.5 | 93.8 | 90.3 | 76.6 |
Kolda | 32.8 | 94.4 | 87.3 | 85.6 | 63.7 |
Matam | 43.1 | 94.3 | 88.1 | 79.4 | 68.7 |
Kaffrine | 56.9 | 98.0 | 93.6 | 88.7 | 76.6 |
Kedougou | 44.4 | 70.7 | 60.2 | 46.5 | 33.6 |
Sedhiou | 46.6 | 96.8 | 90.4 | 89.9 | 74.2 |
The 2017 Senegal DHS data collection happened from April to December 2018. Therefore, the data shown in the table represent children born from October 2016 to December 2017. For the purpose of this tutorial, we will assume that these vaccine coverage rates still hold. Furthermore, we will use the basic vaccination coverage rates to calculate sample size.
Number of children
Wald method
The first step is to create an object using the SampleSize class with the parameter of interest, the sample size calculation method, and the stratification status. In this example, we want to calculate sample sizes for proportions, using the Wald method for a stratified design. This is achieved with the following snippet of code.
SampleSize(param="proportion", method="wald", strat=True)
Because we are using a stratified sample design, it is best to specify the expected coverage levels by stratum. If the information is not available, then aggregated values can be used across the strata. The 2017 Senegal DHS published the coverage rates by region, hence we have the information available by stratum. To provide the information to samplics, we use a Python dictionary as follows.
expected_coverage = {
    "Dakar": 0.849,
    "Ziguinchor": 0.809,
    "Diourbel": 0.682,
    "Saint-Louis": 0.806,
    "Tambacounda": 0.470,
    "Kaolack": 0.797,
    "Thies": 0.834,
    "Louga": 0.678,
    "Fatick": 0.766,
    "Kolda": 0.637,
    "Matam": 0.687,
    "Kaffrine": 0.766,
    "Kedougou": 0.336,
    "Sedhiou": 0.742,
}
Now, we want to calculate the sample size with a desired precision of 0.07, which means that we want the expected vaccination coverage rates to have confidence intervals with a half-width of 7 percentage points; e.g., an expected rate of 90% will have a confidence interval of [83%, 97%]. Note that the desired precision can be specified by stratum, in the same way as the target coverage, using a Python dictionary.
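As a small illustration of what a half-width means, and of the dictionary form mentioned above, the sketch below builds the expected interval for a few strata. The per-stratum precision values and the helper name `expected_interval` are hypothetical and chosen only for illustration; only the coverage values come from the DHS table.

```python
# Expected coverage for three strata (taken from the DHS table above)
coverage = {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}

# Hypothetical per-stratum half-widths (illustrative, not from the source)
half_ci_by_region = {"Dakar": 0.05, "Tambacounda": 0.07, "Kedougou": 0.07}


def expected_interval(p, half_ci):
    """Return (p - half_ci, p + half_ci), clipped to [0, 1] and rounded."""
    lower = max(0.0, p - half_ci)
    upper = min(1.0, p + half_ci)
    return round(lower, 3), round(upper, 3)


intervals = {
    region: expected_interval(p, half_ci_by_region[region])
    for region, p in coverage.items()
}

for region, (lower, upper) in intervals.items():
    print(f"{region}: [{lower:.1%}, {upper:.1%}]")
```

A dictionary shaped like `half_ci_by_region` is the kind of object the text refers to when it says the precision can be given by stratum.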
Given that information, we can calculate the sample size using the SampleSize class as follows.
from samplics.utils.types import SizeMethod, PopParam
# Declare the sample size calculation parameters
sen_vaccine_wald = SampleSize(
    param=PopParam.prop, method=SizeMethod.wald, strat=True
)
# Calculate the sample size
sen_vaccine_wald.calculate(target=expected_coverage, half_ci=0.07)
# Show the calculated sample size
print("\nCalculated sample sizes by stratum: ")
sen_vaccine_wald.samp_size
Calculated sample sizes by stratum:
{'Dakar': 101,
'Ziguinchor': 122,
'Diourbel': 171,
'Saint-Louis': 123,
'Tambacounda': 196,
'Kaolack': 127,
'Thies': 109,
'Louga': 172,
'Fatick': 141,
'Kolda': 182,
'Matam': 169,
'Kaffrine': 141,
'Kedougou': 175,
'Sedhiou': 151}
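The sizes above are consistent with the textbook Wald formula for a proportion, n = z^2 * p * (1 - p) / e^2, rounded up, where p is the expected coverage, e the half-width, and z the normal quantile. The sketch below is a hand check under those assumptions (a 95% confidence level and rounding up); it uses only the Python standard library, and the function name `wald_size` is illustrative rather than part of samplics.

```python
import math
from statistics import NormalDist


def wald_size(p, half_ci, alpha=0.05):
    """Textbook Wald sample size for a proportion, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return math.ceil(z**2 * p * (1 - p) / half_ci**2)


# Expected coverage for three strata from the table above
for region, p in {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}.items():
    print(region, wald_size(p, 0.07))
```

Because p * (1 - p) peaks at p = 0.5, strata with coverage near 50% (such as Tambacounda) need the largest samples for the same half-width.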
SampleSize calculates the sample sizes and stores them in the samp_size attribute, which is a Python dictionary. If a dataframe is better suited for the use case, the method to_dataframe() can be used to return a pandas dataframe.
sen_vaccine_wald.to_dataframe()
 | _param | _stratum | _target | _sigma | _half_ci | _samp_size |
---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 101 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 122 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 171 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 123 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 196 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 127 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 109 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 172 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 141 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 182 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 169 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 141 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 175 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 151 |
The sample size calculation above assumes that the design effect (DEFF) is equal to 1. A design effect of 1 corresponds to a sampling design with a variance equivalent to that of a simple random sample of the same size. In the context of complex sampling designs, DEFF is often different from 1. Stage sampling and unequal weights usually increase the design effect above 1. The 2017 Senegal DHS indicated a design effect equal to 1.963 (1.401^2) for basic vaccination. Hence, to calculate the sample size, we will use the design effect provided by DHS.
sen_vaccine_wald.calculate(
    target=expected_coverage, half_ci=0.07, deff=1.401 ** 2
)
sen_vaccine_wald.to_dataframe()
 | _param | _stratum | _target | _sigma | _half_ci | _samp_size |
---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 198 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 238 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 334 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 241 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 384 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 249 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 214 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 336 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 276 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 356 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 331 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 276 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 344 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 295 |
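The effect of DEFF can be checked by hand: the table is consistent with multiplying the unrounded simple-random-sampling size by DEFF and then rounding up. The sketch below assumes that order of operations and a 95% confidence level; `wald_size_deff` is an illustrative name, not a samplics function.

```python
import math
from statistics import NormalDist


def wald_size_deff(p, half_ci, deff=1.0, alpha=0.05):
    """Wald size for a proportion, inflated by a design effect."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_srs = z**2 * p * (1 - p) / half_ci**2  # unrounded SRS size
    return math.ceil(deff * n_srs)


deff_dhs = 1.401**2  # DEFT squared, about 1.963
for region, p in {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}.items():
    print(region, wald_size_deff(p, 0.07, deff=deff_dhs))
```

Note that the inflation is applied before rounding, which is why the results are not exactly 1.963 times the rounded sizes from the previous table.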
Since the sample design is stratified, the sample size calculation will be more precise if DEFF is specified at the stratum level, which is available from the 2017 Senegal DHS report. Some regions have a design effect below 1. To be conservative with our sample size calculation, we will use 1.21 (1.100^2) as the minimum design effect.
# Design effects by stratum (DEFT squared), with a minimum of 1.21
expected_deff = {
    "Dakar": 1.100 ** 2,
    "Ziguinchor": 1.100 ** 2,
    "Diourbel": 1.346 ** 2,
    "Saint-Louis": 1.484 ** 2,
    "Tambacounda": 1.366 ** 2,
    "Kaolack": 1.360 ** 2,
    "Thies": 1.109 ** 2,
    "Louga": 1.902 ** 2,
    "Fatick": 1.100 ** 2,
    "Kolda": 1.217 ** 2,
    "Matam": 1.403 ** 2,
    "Kaffrine": 1.256 ** 2,
    "Kedougou": 2.280 ** 2,
    "Sedhiou": 1.335 ** 2,
}
# Calculate sample sizes using deff at the stratum level
sen_vaccine_wald.calculate(
    target=expected_coverage, half_ci=0.07, deff=expected_deff
)
# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe()
 | _param | _stratum | _target | _sigma | _half_ci | _samp_size |
---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 122 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 147 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 309 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 270 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 365 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 235 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 134 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 620 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 171 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 269 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 332 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 222 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 910 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 268 |
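The conservative floor on the design effect described earlier can be expressed as a small helper. This is only a sketch: the raw DEFT values and region labels below are hypothetical placeholders (the unfloored DHS values are not listed in this tutorial), and `floored_deff` is an illustrative name.

```python
def floored_deff(deft, min_deft=1.100):
    """Raise DEFT to a minimum value, then square it to obtain DEFF."""
    return max(deft, min_deft) ** 2


# Hypothetical raw DEFT values, for illustration only
raw_deft = {"region_a": 0.95, "region_b": 1.346, "region_c": 2.280}

deff = {name: round(floored_deff(value), 3) for name, value in raw_deft.items()}
print(deff)
```

Any stratum with a DEFT below 1.100 ends up with a DEFF of 1.21, while larger values pass through unchanged.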
The sample size calculation above does not account for attrition of sample sizes due to non-response. In the 2017 Senegal DHS, the overall household and women response rate was about 94.2%.
# Calculate sample sizes with a resp_rate of 94.2%
sen_vaccine_wald.calculate(
    target=expected_coverage,
    half_ci=0.07,
    deff=expected_deff,
    resp_rate=0.942,
)
# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe(
    col_names=[
        "Parameter",
        "region",
        "vaccine_cov",
        "stderr",
        "half_ci",
        "count_12_23",
    ]
)
 | Parameter | region | vaccine_cov | stderr | half_ci | count_12_23 |
---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 130 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 156 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 328 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 287 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 387 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 250 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 142 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 658 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 181 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 286 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 353 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 236 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 966 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 284 |
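The non-response adjustment amounts to dividing the required size by the expected response rate, so that enough completed interviews remain after attrition. The sketch below chains the three steps (Wald size, design effect, response rate) and rounds up once at the end; that ordering is an assumption that happens to reproduce the table, and `wald_size_full` is an illustrative name rather than samplics code.

```python
import math
from statistics import NormalDist


def wald_size_full(p, half_ci, deff=1.0, resp_rate=1.0, alpha=0.05):
    """Wald size for a proportion, adjusted for DEFF and non-response."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_srs = z**2 * p * (1 - p) / half_ci**2  # unrounded SRS size
    return math.ceil(deff * n_srs / resp_rate)  # single rounding at the end


# (expected coverage, DEFT) for three strata, taken from the text above
strata = {"Dakar": (0.849, 1.100), "Louga": (0.678, 1.902), "Kedougou": (0.336, 2.280)}
for region, (p, deft) in strata.items():
    print(region, wald_size_full(p, 0.07, deff=deft**2, resp_rate=0.942))
```

Rounding only once matters: inflating an already rounded size (for example 620 / 0.942 for Louga) would give 659 rather than 658.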
Fleiss method
The World Health Organization (WHO) recommends using the Fleiss method for calculating sample sizes for vaccination coverage surveys, as specified in the following guideline document: https://www.who.int/immunization/documents/who_ivb_18.09/en/. To use the Fleiss method, the examples shown above remain the same except that method="fleiss" is specified.
sen_vaccine_fleiss = SampleSize(
    param=PopParam.prop,
    method=SizeMethod.fleiss,
    strat=True,
)
sen_vaccine_fleiss.calculate(
    target=expected_coverage,
    half_ci=0.07,
    deff=expected_deff,
    resp_rate=0.942,
)
sen_vaccine_sample = sen_vaccine_fleiss.to_dataframe(
    col_names=[
        "Parameter",
        "region",
        "vaccine_cov",
        "stderr",
        "half_ci",
        "count_12_23",
    ]
)
sen_vaccine_sample.head(15)
 | Parameter | region | vaccine_cov | stderr | half_ci | count_12_23 |
---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 190 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 210 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 398 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 384 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 410 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 329 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 201 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 794 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 228 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 325 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 432 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 297 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 1140 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 348 |
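To get a feel for how much larger the Fleiss sizes are in this example, the two result tables can be compared directly. The sketch below simply copies the `count_12_23` values from the Wald and Fleiss tables (both with stratum-level DEFF and a 94.2% response rate) and computes totals and per-region ratios; it does not re-implement either method.

```python
regions = ["Dakar", "Ziguinchor", "Diourbel", "Saint-Louis", "Tambacounda",
           "Kaolack", "Thies", "Louga", "Fatick", "Kolda", "Matam",
           "Kaffrine", "Kedougou", "Sedhiou"]

# Sample sizes copied from the two tables above
wald = dict(zip(regions, [130, 156, 328, 287, 387, 250, 142,
                          658, 181, 286, 353, 236, 966, 284]))
fleiss = dict(zip(regions, [190, 210, 398, 384, 410, 329, 201,
                            794, 228, 325, 432, 297, 1140, 348]))

# Ratio of Fleiss to Wald size per region, and overall totals
ratio = {region: round(fleiss[region] / wald[region], 2) for region in regions}
total_wald = sum(wald.values())
total_fleiss = sum(fleiss.values())

print(ratio)
print(total_wald, total_fleiss, round(total_fleiss / total_wald, 2))
```

In these tables the gap is widest where expected coverage is high (Dakar) and narrowest where it is close to 50% (Tambacounda).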
At this point, we have the number of children aged 12-23 months needed to achieve the desired precision given the expected proportions, using the Wald or Fleiss calculation methods.
Number of households
To obtain the number of households, we need to know the expected average number of children aged 12-23 months per household. This information can be obtained from census data or from surveys’ rosters. Since the design is stratified, it is best to obtain the information per stratum. In this example, we will assume that 5.2% of the population is between 12 and 23 months of age and apply that rate to all strata and households, which in the calculation below amounts to 0.052 eligible children per household. Hence, the minimum number of households to select is:
sen_vaccine_sample["number_hhs"] = round(
    sen_vaccine_sample["count_12_23"] / 0.052, 0
)
sen_vaccine_sample.head(15)
 | Parameter | region | vaccine_cov | stderr | half_ci | count_12_23 | number_hhs |
---|---|---|---|---|---|---|---|
0 | PopParam.prop | Dakar | 0.849 | 0.128199 | 0.07 | 190 | 3654.0 |
1 | PopParam.prop | Ziguinchor | 0.809 | 0.154519 | 0.07 | 210 | 4038.0 |
2 | PopParam.prop | Diourbel | 0.682 | 0.216876 | 0.07 | 398 | 7654.0 |
3 | PopParam.prop | Saint-Louis | 0.806 | 0.156364 | 0.07 | 384 | 7385.0 |
4 | PopParam.prop | Tambacounda | 0.470 | 0.249100 | 0.07 | 410 | 7885.0 |
5 | PopParam.prop | Kaolack | 0.797 | 0.161791 | 0.07 | 329 | 6327.0 |
6 | PopParam.prop | Thies | 0.834 | 0.138444 | 0.07 | 201 | 3865.0 |
7 | PopParam.prop | Louga | 0.678 | 0.218316 | 0.07 | 794 | 15269.0 |
8 | PopParam.prop | Fatick | 0.766 | 0.179244 | 0.07 | 228 | 4385.0 |
9 | PopParam.prop | Kolda | 0.637 | 0.231231 | 0.07 | 325 | 6250.0 |
10 | PopParam.prop | Matam | 0.687 | 0.215031 | 0.07 | 432 | 8308.0 |
11 | PopParam.prop | Kaffrine | 0.766 | 0.179244 | 0.07 | 297 | 5712.0 |
12 | PopParam.prop | Kedougou | 0.336 | 0.223104 | 0.07 | 1140 | 21923.0 |
13 | PopParam.prop | Sedhiou | 0.742 | 0.191436 | 0.07 | 348 | 6692.0 |
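The household counts above follow from a single division, which can be checked without pandas. The helper name `households_needed` is illustrative; the 0.052 rate is the assumption stated in the text.

```python
def households_needed(n_children, children_per_household=0.052):
    """Households to visit to find n_children eligible children, on average."""
    return round(n_children / children_per_household)


# Required numbers of children for three regions (from the Fleiss table)
for region, n_children in {"Dakar": 190, "Louga": 794, "Kedougou": 1140}.items():
    print(region, households_needed(n_children))
```

The tutorial rounds to the nearest integer; replacing `round` with `math.ceil` would be a slightly more conservative choice if the result is meant as a strict minimum.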
Similarly, the number of clusters to select can be obtained by dividing the number of households by the number of households per cluster to be selected.
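As a minimal sketch of that last step, the division can be rounded up so that the selected clusters cover at least the required number of households. The cluster take of 20 households is a hypothetical value chosen for illustration, and `clusters_needed` is an illustrative name.

```python
import math


def clusters_needed(n_households, households_per_cluster):
    """Number of clusters required, rounded up to a whole cluster."""
    return math.ceil(n_households / households_per_cluster)


HOUSEHOLDS_PER_CLUSTER = 20  # hypothetical take per cluster, not from the source

# Household counts for two regions (from the table above)
for region, n_households in {"Dakar": 3654, "Kedougou": 21923}.items():
    print(region, clusters_needed(n_households, HOUSEHOLDS_PER_CLUSTER))
```

In practice the take per cluster is a design decision that also influences the design effect, so it would normally be fixed before the DEFF values are chosen.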