from pprint import pprint
from samplics.datasets import load_birth, load_nhanes2
from samplics.categorical import Tabulation, CrossTabulation
from samplics.utils.types import PopParam
Tabulation
In this tutorial, we will explore samplics’ APIs for creating design-based tabulations. There are two main Python classes for tabulation: Tabulation() for one-way tables and CrossTabulation() for two-way tables.
One-way tabulation
The birth dataset has four variables: region, agecat, birthcat, and pop. The variables agecat and birthcat are categorical. By default, pandas reads them as numerical because they are coded with numerical values. We use dtype="string" or dtype="category", or cast with astype(str) as in the code below, to ensure that pandas codes the variables as categorical responses.
# Load Birth sample data
birth_dict = load_birth()
birth = birth_dict["data"].astype(
    {"region": str, "agecat": str, "birthcat": str}
)

region = birth["region"]
agecat = birth["agecat"]
birthcat = birth["birthcat"]

birth.head(15)
 | region | agecat | birthcat | pop |
---|---|---|---|---|
0 | 1 | 1 | 1.0 | 28152 |
1 | 1 | 1 | 1.0 | 103101 |
2 | 1 | 1 | 1.0 | 113299 |
3 | 1 | 1 | 1.0 | 112028 |
4 | 1 | 1 | 1.0 | 99588 |
5 | 1 | 1 | 1.0 | 22356 |
6 | 1 | 1 | 1.0 | 102926 |
7 | 1 | 1 | 1.0 | 12627 |
8 | 1 | 1 | 1.0 | 112885 |
9 | 1 | 1 | 1.0 | 150297 |
10 | 1 | 1 | 1.0 | 52785 |
11 | 1 | 1 | 2.0 | 109108 |
12 | 1 | 1 | 2.0 | 87768 |
13 | 1 | 1 | 2.0 | 175886 |
14 | 1 | 1 | 2.0 | 107847 |
When requesting a table, the user can set param=PopParam.count, which produces a tabulation with counts in the cells, or param=PopParam.prop, which produces cells with proportions. The expression Tabulation(param=PopParam.count) instantiates the class Tabulation(), whose tabulate() method produces the table.
birth_count = Tabulation(param=PopParam.count)
birth_count.tabulate(birthcat, remove_nan=True)

print(birth_count)
Tabulation of birthcat
Number of strata: 1
Number of PSUs: 923
Number of observations: 923
Degrees of freedom: 922.00
variable category PopParam.count stderror lower_ci upper_ci
birthcat 1.0 240.0 13.333695 213.832087 266.167913
birthcat 2.0 450.0 15.193974 420.181215 479.818785
birthcat 3.0 233.0 13.204959 207.084737 258.915263
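The standard errors in the table above can be checked by hand. The following is a minimal sketch, assuming what the header reports (a single stratum, with every observation acting as its own PSU) together with unit weights. Under those assumptions, the linearized variance of an estimated count reduces to count * (n - count) / (n - 1). The helper name srs_count_stderr is hypothetical and is not part of samplics.

```python
import math


def srs_count_stderr(count: int, n: int) -> float:
    """Standard error of a category count, assuming unit weights,
    one stratum, and one PSU per observation (hypothetical helper)."""
    return math.sqrt(count * (n - count) / (n - 1))


n_obs = 923  # observations remaining after missing values are dropped
counts = {"1.0": 240, "2.0": 450, "3.0": 233}  # counts copied from the table

stderrs = {cat: srs_count_stderr(c, n_obs) for cat, c in counts.items()}
for cat, se in stderrs.items():
    print(f"birthcat {cat}: stderror = {se:.6f}")
```

If the assumptions hold, the printed values should agree with the stderror column above up to rounding.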
When remove_nan=False, missing values (np.nan in numpy, NaN in pandas) are treated as a valid category and added to the table, as shown below.
birth_count = Tabulation(param=PopParam.count)
birth_count.tabulate(birthcat, remove_nan=False)

print(birth_count)
Tabulation of birthcat
Number of strata: 1
Number of PSUs: 956
Number of observations: 956
Degrees of freedom: 955.00
variable category PopParam.count stderror lower_ci upper_ci
birthcat 1.0 240.0 13.414066 213.675550 266.324450
birthcat 2.0 450.0 15.441157 419.697485 480.302515
birthcat 3.0 233.0 13.281448 206.935807 259.064193
birthcat nan 33.0 5.647499 21.917060 44.082940
The data associated with the tabulation are stored in nested Python dictionaries. The outer key is the variable name and the inner keys are the response categories. Each of the last four columns shown above is stored in a separate dictionary. Two of those dictionaries, for the counts and the standard errors, are shown below.
print("\nThe design-based estimated counts are:")
pprint(birth_count.point_est)

print("\nThe design-based standard errors are:")
pprint(birth_count.stderror)

The design-based estimated counts are:
{'birthcat': {'1.0': 240.0, '2.0': 450.0, '3.0': 233.0, 'nan': 33.0}}

The design-based standard errors are:
{'birthcat': {'1.0': 13.414066228212418,
              '2.0': 15.441156672080245,
              '3.0': 13.281447911984001,
              'nan': 5.647498635475369}}
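Because both dictionaries share the same keys, they are easy to combine. The sketch below is illustrative only: it copies the two dictionaries printed above into literals (so that it runs without samplics) and flattens them into one row per variable and category. The names point_est, stderror, and rows are local to this example.

```python
# Literal copies of the two dictionaries printed above.
point_est = {"birthcat": {"1.0": 240.0, "2.0": 450.0, "3.0": 233.0, "nan": 33.0}}
stderror = {
    "birthcat": {
        "1.0": 13.414066228212418,
        "2.0": 15.441156672080245,
        "3.0": 13.281447911984001,
        "nan": 5.647498635475369,
    }
}

# Walk the outer (variable) and inner (category) keys together,
# looking up the matching standard error for each estimate.
rows = [
    (variable, category, estimate, stderror[variable][category])
    for variable, categories in point_est.items()
    for category, estimate in categories.items()
]

for variable, category, estimate, se in rows:
    print(f"{variable:10s} {category:5s} {estimate:8.1f} {se:10.6f}")
```

The same loop works unchanged when the dictionaries hold several variables, since each variable is just another outer key.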
Sometimes, the user may want to run one-way tables for several variables at once. In this case, the user can provide the data as a two-dimensional dataframe in which each column represents one categorical variable. Each categorical variable is tabulated individually, and the results are then combined into the Python dictionaries.
birth_count2 = Tabulation(param=PopParam.count)
birth_count2.tabulate(
    birth[["region", "agecat", "birthcat"]],
    remove_nan=True
)

print(birth_count2)
Tabulation of region
Number of strata: 1
Number of PSUs: 923
Number of observations: 923
Degrees of freedom: 922.00
variable category PopParam.count stderror lower_ci upper_ci
region 1 166.0 11.718335 143.003340 188.996660
region 2 284.0 14.136507 256.257795 311.742205
region 3 250.0 13.594733 223.321002 276.678998
region 4 256.0 13.698320 229.117716 282.882284
agecat 1 507.0 15.439224 476.701278 537.298722
agecat 2 316.0 14.552307 287.441809 344.558191
agecat 3 133.0 10.705921 111.990152 154.009848
birthcat 1.0 240.0 13.333695 213.832087 266.167913
birthcat 2.0 450.0 15.193974 420.181215 479.818785
birthcat 3.0 233.0 13.204959 207.084737 258.915263
Two of the associated Python dictionaries are shown below. The structure of the inner dictionaries remains the same, but additional key-value pairs are added at the outer level, one for each categorical variable.
print("\nThe design-based estimated counts are:")
pprint(birth_count2.point_est)

print("\nThe design-based standard errors are:")
pprint(birth_count2.stderror)

The design-based estimated counts are:
{'agecat': {'1': 507.0, '2': 316.0, '3': 133.0},
 'birthcat': {'1.0': 240.0, '2.0': 450.0, '3.0': 233.0},
 'region': {'1': 166.0, '2': 284.0, '3': 250.0, '4': 256.0}}

The design-based standard errors are:
{'agecat': {'1': 15.439223863518952,
            '2': 14.55230681053191,
            '3': 10.705921442206721},
 'birthcat': {'1.0': 13.333694861331516,
              '2.0': 15.19397357414444,
              '3.0': 13.20495864267966},
 'region': {'1': 11.718334853030537,
            '2': 14.13650726651876,
            '3': 13.594732580183488,
            '4': 13.698320300591277}}
In the examples above, we used pandas series and dataframes with labelled variables. In some situations, the user may want to tabulate numpy arrays, lists, or tuples that carry no variable-name attribute. For these situations, the varnames parameter provides a way to assign names to the categorical variables. Even when the variables have labels, users can leverage varnames to rename the categorical variables.
region_no_name = birth["region"].to_numpy()
agecat_no_name = birth["agecat"].to_numpy()
birthcat_no_name = birth["birthcat"].to_numpy()

birth_prop_new_name = Tabulation(param=PopParam.prop)
birth_prop_new_name.tabulate(
    vars=[region_no_name, agecat_no_name, birthcat_no_name],
    varnames=["Region", "AgeGroup", "BirthType"],
    remove_nan=True,
)

print(birth_prop_new_name)
Tabulation of Region
Number of strata: 1
Number of PSUs: 923
Number of observations: 923
Degrees of freedom: 922.00
variable category PopParam.prop stderror lower_ci upper_ci
Region 1 0.173640 0.012258 0.150883 0.199025
Region 2 0.297071 0.014787 0.268892 0.326883
Region 3 0.261506 0.014220 0.234574 0.290357
Region 4 0.267782 0.014329 0.240614 0.296819
AgeGroup 1 0.530335 0.016150 0.498562 0.561864
AgeGroup 2 0.330544 0.015222 0.301383 0.361068
AgeGroup 3 0.139121 0.011199 0.118564 0.162586
BirthType 1.0 0.260022 0.014446 0.232687 0.289357
BirthType 2.0 0.487541 0.016462 0.455331 0.519854
BirthType 3.0 0.252438 0.014307 0.225406 0.281533
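The confidence limits for proportions in the table above are not symmetric around the estimate, which is the pattern a logit-transformed interval produces. The sketch below is our reading of the numbers, not a description of samplics’ internals. It assumes unit weights, a single stratum with one PSU per observation, and a normal quantile in place of the t quantile with 922 degrees of freedom, so the limits match the table only approximately. The helper logit_ci is hypothetical.

```python
import math
from statistics import NormalDist


def logit_ci(p: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Interval built on the logit scale and mapped back to (0, 1).
    Hypothetical helper using a normal quantile as an approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = math.log(p / (1 - p))   # logit of the estimate
    scale = se / (p * (1 - p))       # delta-method standard error of the logit
    lo, hi = centre - z * scale, centre + z * scale
    return 1 / (1 + math.exp(-lo)), 1 / (1 + math.exp(-hi))


n_obs = 923
p = 240 / n_obs                              # BirthType 1.0
se = math.sqrt(p * (1 - p) / (n_obs - 1))    # stderror under the assumptions above
lower, upper = logit_ci(p, se)
print(f"prop={p:.6f} stderror={se:.6f} lower={lower:.4f} upper={upper:.4f}")
```

A logit interval always stays inside (0, 1), which matters for rare categories where a symmetric interval could dip below zero.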
If the user does not specify varnames, tabulate() creates generic variable names var_1, var_2, etc.
birth_prop_new_name2 = Tabulation(param=PopParam.prop)
birth_prop_new_name2.tabulate(
    vars=[region_no_name, agecat_no_name, birthcat_no_name],
    remove_nan=True
)

print(birth_prop_new_name2)
Tabulation of var_1
Number of strata: 1
Number of PSUs: 923
Number of observations: 923
Degrees of freedom: 922.00
variable category PopParam.prop stderror lower_ci upper_ci
var_1 1 0.173640 0.012258 0.150883 0.199025
var_1 2 0.297071 0.014787 0.268892 0.326883
var_1 3 0.261506 0.014220 0.234574 0.290357
var_1 4 0.267782 0.014329 0.240614 0.296819
var_2 1 0.530335 0.016150 0.498562 0.561864
var_2 2 0.330544 0.015222 0.301383 0.361068
var_2 3 0.139121 0.011199 0.118564 0.162586
var_3 1.0 0.260022 0.014446 0.232687 0.289357
var_3 2.0 0.487541 0.016462 0.455331 0.519854
var_3 3.0 0.252438 0.014307 0.225406 0.281533
If the data were collected from a complex survey sample, the user may provide the sample design information (sampling weights, strata, and PSUs) to derive design-based statistics for the tabulation.
# Load Nhanes sample data
nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

stratum = nhanes2["stratid"]
psu = nhanes2["psuid"]
weight = nhanes2["finalwgt"]

diabetes_nhanes = Tabulation(param=PopParam.prop)
diabetes_nhanes.tabulate(
    vars=nhanes2[["race", "diabetes"]],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(diabetes_nhanes)
Tabulation of race
Number of strata: 31
Number of PSUs: 62
Number of observations: 10335
Degrees of freedom: 31.00
variable category PopParam.prop stderror lower_ci upper_ci
race 1.0 0.879016 0.016722 0.840568 0.909194
race 2.0 0.095615 0.012778 0.072541 0.125039
race 3.0 0.025369 0.010554 0.010781 0.058528
diabetes 0.0 0.965715 0.001820 0.961803 0.969238
diabetes 1.0 0.034285 0.001820 0.030762 0.038197
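To make the role of the design information concrete, here is a minimal sketch of a common linearization-style variance estimator for a weighted total, in which the variability is measured between PSU totals within each stratum. Whether samplics uses exactly this form is an assumption on our part, and the records are hypothetical toy values, not NHANES data. The degrees of freedom are computed as the number of PSUs minus the number of strata, which is consistent with the header above (62 - 31 = 31).

```python
import math
from collections import defaultdict

# Hypothetical toy records: (stratum, psu, weight, indicator of the category)
records = [
    ("A", 1, 10.0, 1), ("A", 1, 10.0, 0),
    ("A", 2, 10.0, 1), ("A", 2, 10.0, 1),
    ("B", 1, 5.0, 0), ("B", 1, 5.0, 1),
    ("B", 2, 5.0, 0), ("B", 2, 5.0, 0),
]

# Weighted total of the indicator within each PSU of each stratum.
psu_totals = defaultdict(lambda: defaultdict(float))
for stratum_id, psu_id, w, y in records:
    psu_totals[stratum_id][psu_id] += w * y

# Estimated population count: sum of all weighted PSU totals.
total = sum(z for psus in psu_totals.values() for z in psus.values())

# Between-PSU variance, accumulated stratum by stratum.
variance = 0.0
for psus in psu_totals.values():
    n_h = len(psus)
    mean_h = sum(psus.values()) / n_h
    variance += n_h / (n_h - 1) * sum((z - mean_h) ** 2 for z in psus.values())

n_psus = sum(len(psus) for psus in psu_totals.values())
degrees_of_freedom = n_psus - len(psu_totals)

print(f"total={total}, stderror={math.sqrt(variance):.4f}, df={degrees_of_freedom}")
```

Note that the number of observations does not enter the variance directly: only the PSU totals do, which is why clustered designs can yield few degrees of freedom even with thousands of records.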
Two-way tabulation (cross-tabulation)
Cross-tabulation of two categorical variables is achieved by using the class CrossTabulation(). As above, cross-tabulation is possible for counts and proportions using CrossTabulation(param=PopParam.count) and CrossTabulation(param=PopParam.prop), respectively. The Python script below creates a design-based cross-tabulation of race by diabetes status. The sample design information is optional; when it is not provided, a simple random sample (SRS) is assumed.
crosstab_nhanes = CrossTabulation(param=PopParam.prop)
crosstab_nhanes.tabulate(
    vars=nhanes2[["race", "diabetes"]],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)
Cross-tabulation of race and diabetes
Number of strata: 31
Number of PSUs: 62
Number of observations: 10335
Degrees of freedom: 31.00
race diabetes PopParam.prop stderror lower_ci upper_ci
1 0.0 0.850866 0.015850 0.815577 0.880392
1 1.0 0.028123 0.001938 0.024430 0.032357
2 0.0 0.089991 0.012171 0.068062 0.118090
2 1.0 0.005646 0.000847 0.004157 0.007663
3 0.0 0.024858 0.010188 0.010702 0.056669
3 1.0 0.000516 0.000387 0.000112 0.002383
Pearson (with Rao-Scott adjustment):
Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
Adjusted - F(1.52, 47.26): 14.9435 with p-value of 0.0000
Likelihood ratio (with Rao-Scott adjustment):
Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
Adjusted - F(1.52, 47.26): 12.9242 with p-value of 0.0001
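The unadjusted lines of the test output can be checked with a few lines of arithmetic. A 3 x 2 table has (3 - 1) * (2 - 1) = 2 degrees of freedom, and for a chi-square distribution with exactly 2 degrees of freedom the upper-tail probability has the closed form exp(-x / 2). The sketch below uses that identity; the helper chi2_sf_2df is hypothetical and valid only for 2 degrees of freedom. It does not attempt to reproduce the Rao-Scott adjusted F statistics.

```python
import math


def chi2_sf_2df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


n_race, n_diabetes = 3, 2                  # categories in each variable
df = (n_race - 1) * (n_diabetes - 1)       # degrees of freedom of the table

p_pearson = chi2_sf_2df(21.2661)           # unadjusted Pearson statistic
p_lr = chi2_sf_2df(18.3925)                # unadjusted likelihood-ratio statistic

print(f"df={df}, Pearson p={p_pearson:.4f}, LR p={p_lr:.4f}")
```

Rounded to four decimals, these values line up with the p-values printed on the unadjusted lines above.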
In addition to a pandas dataframe, the categorical variables may be provided as a numpy array, list, or tuple. In the example below, the categorical variables are provided as a tuple, vars=(race, diabetes). In this case, race and diabetes are numpy arrays and do not have a name attribute. The parameter varnames allows the user to name the categorical variables. If varnames is not specified, var_1 and var_2 are used as variable names.
race = nhanes2["race"].to_numpy()
diabetes = nhanes2["diabetes"].to_numpy()

crosstab_nhanes = CrossTabulation(param=PopParam.prop)
crosstab_nhanes.tabulate(
    vars=(race, diabetes),
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)
Cross-tabulation of var_1 and var_2
Number of strata: 31
Number of PSUs: 62
Number of observations: 10335
Degrees of freedom: 31.00
var_1 var_2 PopParam.prop stderror lower_ci upper_ci
1.0 0.0 0.850866 0.015850 0.815577 0.880392
1.0 1.0 0.028123 0.001938 0.024430 0.032357
2.0 0.0 0.089991 0.012171 0.068062 0.118090
2.0 1.0 0.005646 0.000847 0.004157 0.007663
3.0 0.0 0.024858 0.010188 0.010702 0.056669
3.0 1.0 0.000516 0.000387 0.000112 0.002383
Pearson (with Rao-Scott adjustment):
Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
Adjusted - F(1.52, 47.26): 14.9435 with p-value of 0.0000
Likelihood ratio (with Rao-Scott adjustment):
Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
Adjusted - F(1.52, 47.26): 12.9242 with p-value of 0.0001
The example below is the same as the one above, with variable names specified by varnames=["Race", "DiabetesStatus"].
crosstab_nhanes = CrossTabulation(param=PopParam.prop)
crosstab_nhanes.tabulate(
    vars=(race, diabetes),
    varnames=["Race", "DiabetesStatus"],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)
Cross-tabulation of Race and DiabetesStatus
Number of strata: 31
Number of PSUs: 62
Number of observations: 10335
Degrees of freedom: 31.00
Race DiabetesStatus PopParam.prop stderror lower_ci upper_ci
1.0 0.0 0.850866 0.015850 0.815577 0.880392
1.0 1.0 0.028123 0.001938 0.024430 0.032357
2.0 0.0 0.089991 0.012171 0.068062 0.118090
2.0 1.0 0.005646 0.000847 0.004157 0.007663
3.0 0.0 0.024858 0.010188 0.010702 0.056669
3.0 1.0 0.000516 0.000387 0.000112 0.002383
Pearson (with Rao-Scott adjustment):
Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
Adjusted - F(1.52, 47.26): 14.9435 with p-value of 0.0000
Likelihood ratio (with Rao-Scott adjustment):
Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
Adjusted - F(1.52, 47.26): 12.9242 with p-value of 0.0001