Datasets

The datasets module lets the user load the datasets used in this tutorial. Note that the datasets are only used to illustrate the syntax and APIs of Samplics. Many of the datasets in this tutorial are subsets of actual samples but DO NOT represent these samples. The datasets were subsetted from existing samples to reduce the size of the files.

Tip

A dataset can be loaded using the function load_xxx(), where xxx indicates the dataset name.

For example, load_psu_frame() loads the PSU frame dataset.

These functions return a dictionary with the following members: name, description, nrows, ncols, design, source, and data. The current list of datasets is the following:

Birth: This dataset contains data for a city on categories of birth by age group. The dataset was obtained through the public Stata API. Use load_birth() to load the dataset.
CountyCrop and CountyCropMeans: These datasets were used by Battese, Harter, and Fuller (1988) in their pioneering paper on small area estimation. Use load_county_crop() and load_county_crop_means() to load the datasets.
ExpenditureMilk: The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. This dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). Use load_expenditure_milk() to load the dataset.
Nhanes2, Nhanes2brr, and Nhanes2jk: These datasets were obtained from the NHANES (McDowell et al. 1981). As mentioned above, the datasets are only subsets of the full sample and do not represent the NHANES II study. The data are only useful for illustrating the syntax of Samplics. These datasets should not be used to conduct any analysis of NHANES, and their numbers should not be used for any statistical analysis. The original data was obtained through the public Stata API. Use load_nhanes2(), load_nhanes2brr(), and load_nhanes2jk() to load the datasets.
Nmihs: The dataset is a subset of the National Maternal and Infant Health Survey (NMIHS) sample (Gonzalez Jr, Krauss, and Scott 1992). The dataset should not be used to conduct any analysis of NMIHS, and its numbers should not be used for any statistical analysis. The original data was obtained through the public Stata API. Use load_nmihs() to load the dataset.
PSUFrame, PSUSample, and SSUSample: These are simulated datasets that illustrate the selection of primary and secondary sampling units. Use load_psu_frame(), load_psu_sample(), and load_ssu_sample() to load the datasets.

Suppose we want to load the PSU frame. We could write the following code.

import samplics

# Import the appropriate loader function.
from samplics.datasets import load_psu_frame

# Load the dataset and its metadata into 
# the variable (dictionary) psu_frame_dict
psu_frame_dict = load_psu_frame()

# Store the dataset in the variable psu_frame (optional)
psu_frame = psu_frame_dict["data"]
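
The members of the returned dictionary can be read like any other dictionary entries. The following is a minimal sketch, not part of Samplics: summarize_dataset is a hypothetical helper, and the example dictionary is an invented stand-in with made-up values so the snippet runs without Samplics installed. It assumes only the member names listed above; the types of the individual members are assumptions.

```python
def summarize_dataset(dataset: dict) -> str:
    """Build a one-line summary from a dataset dictionary's metadata.

    Hypothetical helper for illustration; it is not part of Samplics.
    """
    # Member names taken from the list given in this section.
    expected = ("name", "description", "nrows", "ncols", "design", "source", "data")
    missing = [key for key in expected if key not in dataset]
    if missing:
        raise KeyError(f"missing members: {missing}")
    return f"{dataset['name']}: {dataset['nrows']} rows x {dataset['ncols']} columns"


# Invented stand-in with made-up values (NOT a Samplics dataset).
# With Samplics installed, pass the result of load_psu_frame() instead.
example = {
    "name": "Example",
    "description": "Invented placeholder used only for this sketch",
    "nrows": 2,
    "ncols": 2,
    "design": {},
    "source": "",
    "data": [{"cluster": 1, "size": 10}, {"cluster": 2, "size": 20}],
}

summary = summarize_dataset(example)
print(summary)
```

With Samplics installed, summarize_dataset(psu_frame_dict) would report the name and dimensions of the PSU frame in the same way.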

Important

The datasets should not be used for any statistical analysis.
No number shown in this tutorial shall be used for any statistical analysis.
All the examples are exclusively for illustrating the syntax and APIs of Samplics.

References

Battese, G E, R M Harter, and W A Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” J. Amer. Statist. Assoc. 83: 28–36.

Gonzalez Jr, J F, N Krauss, and C Scott. 1992. “Estimation in the 1988 National Maternal and Infant Health Survey.” In Proceedings of the Section on Statistics Education, edited by American Statistical Association, 343–48. https://doi.org/10.25080/Majora-92bf1922-00a.

McDowell, A, A Engel, J T Massey, and K Maurer. 1981. “Plan and Operation of the Second National Health and Nutrition Examination Survey, 1976–1980.” Vital and Health Statistics 1 (15): 1–144.