Tabulation

In this tutorial, we explore the samplics APIs for creating design-based tabulations. There are two main Python classes for tabulation: Tabulation() for one-way tables and CrossTabulation() for two-way tables.

from pprint import pprint

from samplics.datasets import load_birth, load_nhanes2
from samplics.categorical import Tabulation, CrossTabulation

One-way tabulation

The birth dataset has four variables: region, agecat, birthcat, and pop. The variables agecat and birthcat are categorical. By default, pandas reads them as numerical because they are coded with numerical values. We cast them to a string or categorical type (the code below uses astype() with str; dtype="string" or dtype="category" are alternatives) to ensure that pandas treats the variables as categorical responses.
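As a small illustration of the point above, the sketch below uses a made-up two-column frame instead of the birth data (the values are hypothetical). It checks that numerically coded categories are inferred as numeric dtypes and that astype("category") converts them to categorical columns.

```python
import pandas as pd

# Toy data (hypothetical values): categories coded as numbers, one missing value
toy = pd.DataFrame({"agecat": [1, 2, 3, 1], "birthcat": [1.0, 2.0, None, 3.0]})

# pandas infers numeric dtypes from the numeric codes
numeric_before = {col: pd.api.types.is_numeric_dtype(toy[col]) for col in toy.columns}

# Cast both columns to the categorical dtype
toy_cat = toy.astype({"agecat": "category", "birthcat": "category"})
categorical_after = {
    col: isinstance(toy_cat[col].dtype, pd.CategoricalDtype) for col in toy_cat.columns
}

print(numeric_before)
print(categorical_after)
# The missing value stays missing; it does not become a category
print(list(toy_cat["birthcat"].cat.categories))
```

With the categorical dtype, the missing entry remains a missing value instead of becoming a category of its own.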

# Load Birth sample data
birth_dict = load_birth()
birth = birth_dict["data"].astype(
    {"region": str, "agecat": str, "birthcat": str}
)

region = birth["region"]
agecat = birth["agecat"]
birthcat = birth["birthcat"]

birth.head(15)
region agecat birthcat pop
0 1 1 1.0 28152
1 1 1 1.0 103101
2 1 1 1.0 113299
3 1 1 1.0 112028
4 1 1 1.0 99588
5 1 1 1.0 22356
6 1 1 1.0 102926
7 1 1 1.0 12627
8 1 1 1.0 112885
9 1 1 1.0 150297
10 1 1 1.0 52785
11 1 1 2.0 109108
12 1 1 2.0 87768
13 1 1 2.0 175886
14 1 1 2.0 107847

When requesting a table, the user can set param="count", which produces a table with counts in the cells, or param="proportion", which produces proportions. The expression Tabulation(param="count") instantiates the class Tabulation, whose method tabulate() produces the table.

birth_count = Tabulation(param="count")
birth_count.tabulate(birthcat, remove_nan=True)

print(birth_count)

Tabulation of birthcat
 Number of strata: 1
 Number of PSUs: 923
 Number of observations: 923
 Degrees of freedom: 922.00

 variable category  count  stderror   lower_ci   upper_ci
birthcat      1.0  240.0 13.333695 213.832087 266.167913
birthcat      2.0  450.0 15.193974 420.181215 479.818785
birthcat      3.0  233.0 13.204959 207.084737 258.915263
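The standard errors printed above can be checked by hand. The sketch below is not samplics' internal code; it assumes a simple random sample with unit weights and no finite population correction, in which case the estimated count is the total of a 0/1 indicator. The helper name srs_count_stderr is hypothetical.

```python
import math


def srs_count_stderr(k: int, n: int) -> float:
    """Standard error of an estimated count under SRS with unit weights.

    The count is the total of a 0/1 indicator, so its variance is
    n * s^2, where s^2 = k * (n - k) / (n * (n - 1)) is the sample
    variance of the indicator (no finite population correction).
    """
    return math.sqrt(k * (n - k) / (n - 1))


counts = {"1.0": 240, "2.0": 450, "3.0": 233}
n_obs = sum(counts.values())  # 923 non-missing observations

for category, count in counts.items():
    print(category, count, round(srs_count_stderr(count, n_obs), 6))
```

Under these assumptions the values agree with the stderror column of the table to the printed precision.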

When remove_nan=False, missing values (np.nan in numpy, NaN in pandas) are treated as a valid category and added to the table, as shown below.

birth_count = Tabulation(param="count")
birth_count.tabulate(birthcat, remove_nan=False)

print(birth_count)

Tabulation of birthcat
 Number of strata: 1
 Number of PSUs: 956
 Number of observations: 956
 Degrees of freedom: 955.00

 variable category  count  stderror   lower_ci   upper_ci
birthcat      1.0  240.0 13.414066 213.675550 266.324450
birthcat      2.0  450.0 15.441157 419.697485 480.302515
birthcat      3.0  233.0 13.281448 206.935807 259.064193
birthcat      nan   33.0  5.647499  21.917060  44.082940

The data associated with the tabulation are stored in nested Python dictionaries. The outer key is the variable name and the inner keys are the response categories. Each of the last four columns shown above is stored in a separate dictionary. The dictionaries for the counts and standard errors are shown below.

print("\nThe design-based estimated counts are:")
pprint(birth_count.point_est)

print("\nThe design-based standard errors are:")
pprint(birth_count.stderror)

The design-based estimated counts are:
{'birthcat': {'1.0': 240.0, '2.0': 450.0, '3.0': 233.0, 'nan': 33.0}}

The design-based standard errors are:
{'birthcat': {'1.0': 13.414066228212418,
              '2.0': 15.441156672080245,
              '3.0': 13.281447911984001,
              'nan': 5.647498635475369}}
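To make the nested layout concrete, the sketch below works on plain dictionaries copied from the printed output above rather than on a live Tabulation object. It looks up a single cell and derives a coefficient of variation per category; the cv name is introduced here for illustration only.

```python
# Values copied from the printed output above
point_est = {"birthcat": {"1.0": 240.0, "2.0": 450.0, "3.0": 233.0, "nan": 33.0}}
stderror = {
    "birthcat": {
        "1.0": 13.414066228212418,
        "2.0": 15.441156672080245,
        "3.0": 13.281447911984001,
        "nan": 5.647498635475369,
    }
}

# Outer key: variable name; inner key: category label (a string here)
count_cat2 = point_est["birthcat"]["2.0"]
se_cat2 = stderror["birthcat"]["2.0"]
print(count_cat2, se_cat2)

# Coefficient of variation (standard error relative to the estimate)
cv = {
    category: stderror["birthcat"][category] / estimate
    for category, estimate in point_est["birthcat"].items()
}
for category, value in cv.items():
    print(f"{category}: {value:.3f}")
```

Because both dictionaries share the same keys, any derived quantity can be built with a single pass over the inner dictionary.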

Sometimes the user wants one-way tables for several variables at once. In this case, the data can be provided as a two-dimensional dataframe in which each column is one categorical variable. Each variable is tabulated individually, and the results are combined into the same Python dictionaries.

birth_count2 = Tabulation(param="count")
birth_count2.tabulate(
    birth[["region", "agecat", "birthcat"]],
    remove_nan=True,
)

print(birth_count2)

Tabulation of region
 Number of strata: 1
 Number of PSUs: 923
 Number of observations: 923
 Degrees of freedom: 922.00

 variable category  count  stderror   lower_ci   upper_ci
  region        1  166.0 11.718335 143.003340 188.996660
  region        2  284.0 14.136507 256.257795 311.742205
  region        3  250.0 13.594733 223.321002 276.678998
  region        4  256.0 13.698320 229.117716 282.882284
  agecat        1  507.0 15.439224 476.701278 537.298722
  agecat        2  316.0 14.552307 287.441809 344.558191
  agecat        3  133.0 10.705921 111.990152 154.009848
birthcat      1.0  240.0 13.333695 213.832087 266.167913
birthcat      2.0  450.0 15.193974 420.181215 479.818785
birthcat      3.0  233.0 13.204959 207.084737 258.915263

Two of the associated Python dictionaries are shown below. The structure of the inner dictionaries remains the same, but additional key-value pairs are added at the outer level, one for each categorical variable.

print("\nThe design-based estimated counts are:")
pprint(birth_count2.point_est)

print("\nThe design-based standard errors are:")
pprint(birth_count2.stderror)

The design-based estimated counts are:
{'agecat': {'1': 507.0, '2': 316.0, '3': 133.0},
 'birthcat': {'1.0': 240.0, '2.0': 450.0, '3.0': 233.0},
 'region': {'1': 166.0, '2': 284.0, '3': 250.0, '4': 256.0}}

The design-based standard errors are:
{'agecat': {'1': 15.439223863518952,
            '2': 14.55230681053191,
            '3': 10.705921442206721},
 'birthcat': {'1.0': 13.333694861331516,
              '2.0': 15.19397357414444,
              '3.0': 13.20495864267966},
 'region': {'1': 11.718334853030537,
            '2': 14.13650726651876,
            '3': 13.594732580183488,
            '4': 13.698320300591277}}
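When several variables are tabulated together, it is often convenient to flatten the nested dictionary into one record per cell. The sketch below does this on a plain dictionary copied from the output above; the rows and totals names are illustrative and are not part of the samplics API.

```python
# Counts copied from the printed output above
point_est = {
    "agecat": {"1": 507.0, "2": 316.0, "3": 133.0},
    "birthcat": {"1.0": 240.0, "2.0": 450.0, "3.0": 233.0},
    "region": {"1": 166.0, "2": 284.0, "3": 250.0, "4": 256.0},
}

# One (variable, category, count) record per table cell
rows = [
    (variable, category, count)
    for variable, categories in point_est.items()
    for category, count in categories.items()
]

# Total count per variable
totals = {
    variable: sum(categories.values()) for variable, categories in point_est.items()
}

print(len(rows))
print(totals)
```

The totals differ across variables because only birthcat has missing values: its 923 observations plus the 33 NaN entries reported earlier give the 956 observed for region and agecat.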

In the examples above, we used pandas series and dataframes with labelled variables. In some situations, the user may want to tabulate numpy arrays, lists or tuples, which carry no variable-name attribute. For these situations, the varnames parameter provides a way to name the categorical variables. Even when the variables have labels, users can use varnames to rename them.

region_no_name = birth["region"].to_numpy()
agecat_no_name = birth["agecat"].to_numpy()
birthcat_no_name = birth["birthcat"].to_numpy()

birth_prop_new_name = Tabulation(param="proportion")
birth_prop_new_name.tabulate(
    vars=[region_no_name, agecat_no_name, birthcat_no_name],
    varnames=["Region", "AgeGroup", "BirthType"],
    remove_nan=True,
)

print(birth_prop_new_name)

Tabulation of Region
 Number of strata: 1
 Number of PSUs: 923
 Number of observations: 923
 Degrees of freedom: 922.00

  variable category  proportion  stderror  lower_ci  upper_ci
   Region        1    0.173640  0.012258  0.150883  0.199025
   Region        2    0.297071  0.014787  0.268892  0.326883
   Region        3    0.261506  0.014220  0.234574  0.290357
   Region        4    0.267782  0.014329  0.240614  0.296819
 AgeGroup        1    0.530335  0.016150  0.498562  0.561864
 AgeGroup        2    0.330544  0.015222  0.301383  0.361068
 AgeGroup        3    0.139121  0.011199  0.118564  0.162586
BirthType      1.0    0.260022  0.014446  0.232687  0.289357
BirthType      2.0    0.487541  0.016462  0.455331  0.519854
BirthType      3.0    0.252438  0.014307  0.225406  0.281533
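The proportions above are the earlier counts divided by the number of non-missing observations, and their confidence limits are not symmetric around the estimate. The sketch below assumes, without confirmation from the source, that the interval is built on the logit scale; it also uses the normal quantile from the standard library as a stand-in for the t quantile with 922 degrees of freedom, so the bounds match the table only to about four decimals. The function name logit_ci is hypothetical.

```python
import math
from statistics import NormalDist


def logit_ci(p: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Confidence interval built on the logit scale and mapped back to (0, 1)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal quantile, stand-in for t
    centre = math.log(p / (1 - p))
    half_width = z * se / (p * (1 - p))  # delta-method SE of logit(p)
    lower, upper = centre - half_width, centre + half_width
    return 1 / (1 + math.exp(-lower)), 1 / (1 + math.exp(-upper))


k, n = 240, 923  # BirthType category 1.0: count and non-missing sample size
p = k / n
se = math.sqrt(p * (1 - p) / (n - 1))  # SRS, unit weights, no fpc
lower, upper = logit_ci(p, se)

print(round(p, 6), round(se, 6), round(lower, 4), round(upper, 4))
```

Working on the logit scale keeps both limits inside the interval (0, 1), which a symmetric interval does not guarantee for proportions close to 0 or 1.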

If the user does not specify varnames, tabulate() creates generic variable names var_1, var_2, etc.

birth_prop_new_name2 = Tabulation(param="proportion")
birth_prop_new_name2.tabulate(
    vars=[region_no_name, agecat_no_name, birthcat_no_name],
    remove_nan=True,
)

print(birth_prop_new_name2)

Tabulation of var_1
 Number of strata: 1
 Number of PSUs: 923
 Number of observations: 923
 Degrees of freedom: 922.00

 variable category  proportion  stderror  lower_ci  upper_ci
   var_1        1    0.173640  0.012258  0.150883  0.199025
   var_1        2    0.297071  0.014787  0.268892  0.326883
   var_1        3    0.261506  0.014220  0.234574  0.290357
   var_1        4    0.267782  0.014329  0.240614  0.296819
   var_2        1    0.530335  0.016150  0.498562  0.561864
   var_2        2    0.330544  0.015222  0.301383  0.361068
   var_2        3    0.139121  0.011199  0.118564  0.162586
   var_3      1.0    0.260022  0.014446  0.232687  0.289357
   var_3      2.0    0.487541  0.016462  0.455331  0.519854
   var_3      3.0    0.252438  0.014307  0.225406  0.281533

If the data were collected from a complex survey sample, the user may provide the sample design information (weights, strata, and PSUs) to derive design-based statistics for the tabulation.

# Load Nhanes sample data
nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

stratum = nhanes2["stratid"]
psu = nhanes2["psuid"]
weight = nhanes2["finalwgt"]

diabetes_nhanes = Tabulation("proportion")
diabetes_nhanes.tabulate(
    vars=nhanes2[["race", "diabetes"]],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(diabetes_nhanes)

Tabulation of race
 Number of strata: 31
 Number of PSUs: 62
 Number of observations: 10335
 Degrees of freedom: 31.00

 variable  category  proportion  stderror  lower_ci  upper_ci
    race       1.0    0.879016  0.016722  0.840568  0.909194
    race       2.0    0.095615  0.012778  0.072541  0.125039
    race       3.0    0.025369  0.010554  0.010781  0.058528
diabetes       0.0    0.965715  0.001820  0.961803  0.969238
diabetes       1.0    0.034285  0.001820  0.030762  0.038197
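The degrees of freedom reported in the headers are consistent with the common rule for design-based inference, the number of PSUs minus the number of strata. The sketch below applies that rule to the two headers shown in this tutorial; the function name is hypothetical and the rule is stated here as an assumption, not as a description of the library's code.

```python
def design_degrees_of_freedom(n_psus: int, n_strata: int) -> int:
    """Common rule for design-based degrees of freedom: PSUs minus strata."""
    return n_psus - n_strata


# NHANES II example: 62 PSUs in 31 strata
df_nhanes = design_degrees_of_freedom(n_psus=62, n_strata=31)

# Birth data without design information: each observation acts as its own PSU
df_birth = design_degrees_of_freedom(n_psus=923, n_strata=1)

print(df_nhanes, df_birth)
```

With only two PSUs per stratum, the NHANES II design leaves far fewer degrees of freedom than its 10335 observations might suggest, which widens the confidence intervals.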

Two-way tabulation (cross-tabulation)

Cross-tabulation of two categorical variables is done with the class CrossTabulation(). As above, cross-tabulation is possible for counts and proportions using CrossTabulation(param="count") and CrossTabulation(param="proportion"), respectively. The Python script below creates a design-based cross-tabulation of race by diabetes status. The sample design information is optional; when it is not provided, a simple random sample (SRS) is assumed.

crosstab_nhanes = CrossTabulation("proportion")
crosstab_nhanes.tabulate(
    vars=nhanes2[["race", "diabetes"]],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)

Cross-tabulation of race and diabetes
 Number of strata: 31
 Number of PSUs: 62
 Number of observations: 10335
 Degrees of freedom: 31.00

 race diabetes  proportion  stderror  lower_ci  upper_ci
   1      0.0    0.850866  0.015850  0.815577  0.880392
   1      1.0    0.028123  0.001938  0.024430  0.032357
   2      0.0    0.089991  0.012171  0.068062  0.118090
   2      1.0    0.005646  0.000847  0.004157  0.007663
   3      0.0    0.024858  0.010188  0.010702  0.056669
   3      1.0    0.000516  0.000387  0.000112  0.002383

Pearson (with Rao-Scott adjustment):
    Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
    Adjusted - F(1.52, 47.26): 14.9435  with p-value of 0.0000

  Likelihood ratio (with Rao-Scott adjustment):
     Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
     Adjusted - F(1.52, 47.26): 12.9242  with p-value of 0.0001
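The unadjusted statistics in this output can be approximately reproduced from the printed cell proportions. The sketch below assumes the usual survey form of the tests, in which the observed and expected cell proportions are compared and the result is scaled by the number of observations; it does not attempt the Rao-Scott adjustment, which needs the design-based covariance of the cells. Small differences from the output come from the proportions being rounded to six decimals.

```python
import math

# Weighted cell proportions copied from the cross-tabulation above
# rows: race 1, 2, 3; columns: diabetes 0.0, 1.0
cells = [
    [0.850866, 0.028123],
    [0.089991, 0.005646],
    [0.024858, 0.000516],
]
n_obs = 10335

row_tot = [sum(row) for row in cells]
col_tot = [sum(col) for col in zip(*cells)]

pearson = 0.0
lik_ratio = 0.0
for i, row in enumerate(cells):
    for j, p_ij in enumerate(row):
        expected = row_tot[i] * col_tot[j]  # cell proportion under independence
        pearson += (p_ij - expected) ** 2 / expected
        lik_ratio += p_ij * math.log(p_ij / expected)

pearson *= n_obs
lik_ratio *= 2 * n_obs
dof = (len(cells) - 1) * (len(cells[0]) - 1)

print(f"chi2({dof}): Pearson {pearson:.2f}, likelihood ratio {lik_ratio:.2f}")
```

The adjusted F statistics are smaller than these unadjusted values divided by their degrees of freedom would suggest under SRS, reflecting the clustering and weighting of the design.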

In addition to a pandas dataframe, the categorical variables may be provided as a numpy array, list or tuple. In the example below, the categorical variables are provided as a tuple, vars=(race, diabetes). In this case, race and diabetes are numpy arrays and do not have a name attribute. The parameter varnames allows the user to name the categorical variables. If varnames is not specified, then var_1 and var_2 are used as variable names.

race = nhanes2["race"].to_numpy()
diabetes = nhanes2["diabetes"].to_numpy()

crosstab_nhanes = CrossTabulation("proportion")
crosstab_nhanes.tabulate(
    vars=(race, diabetes),
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)

Cross-tabulation of var_1 and var_2
 Number of strata: 31
 Number of PSUs: 62
 Number of observations: 10335
 Degrees of freedom: 31.00

 var_1 var_2  proportion  stderror  lower_ci  upper_ci
  1.0   0.0    0.850866  0.015850  0.815577  0.880392
  1.0   1.0    0.028123  0.001938  0.024430  0.032357
  2.0   0.0    0.089991  0.012171  0.068062  0.118090
  2.0   1.0    0.005646  0.000847  0.004157  0.007663
  3.0   0.0    0.024858  0.010188  0.010702  0.056669
  3.0   1.0    0.000516  0.000387  0.000112  0.002383

Pearson (with Rao-Scott adjustment):
    Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
    Adjusted - F(1.52, 47.26): 14.9435  with p-value of 0.0000

  Likelihood ratio (with Rao-Scott adjustment):
     Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
     Adjusted - F(1.52, 47.26): 12.9242  with p-value of 0.0001

The same example as above, with variable names specified by varnames=["Race", "DiabetesStatus"]:

crosstab_nhanes = CrossTabulation("proportion")
crosstab_nhanes.tabulate(
    vars=(race, diabetes),
    varnames=["Race", "DiabetesStatus"],
    samp_weight=weight,
    stratum=stratum,
    psu=psu,
    remove_nan=True,
)

print(crosstab_nhanes)

Cross-tabulation of Race and DiabetesStatus
 Number of strata: 31
 Number of PSUs: 62
 Number of observations: 10335
 Degrees of freedom: 31.00

 Race DiabetesStatus  proportion  stderror  lower_ci  upper_ci
 1.0            0.0    0.850866  0.015850  0.815577  0.880392
 1.0            1.0    0.028123  0.001938  0.024430  0.032357
 2.0            0.0    0.089991  0.012171  0.068062  0.118090
 2.0            1.0    0.005646  0.000847  0.004157  0.007663
 3.0            0.0    0.024858  0.010188  0.010702  0.056669
 3.0            1.0    0.000516  0.000387  0.000112  0.002383

Pearson (with Rao-Scott adjustment):
    Unadjusted - chi2(2): 21.2661 with p-value of 0.0000
    Adjusted - F(1.52, 47.26): 14.9435  with p-value of 0.0000

  Likelihood ratio (with Rao-Scott adjustment):
     Unadjusted - chi2(2): 18.3925 with p-value of 0.0001
     Adjusted - F(1.52, 47.26): 12.9242  with p-value of 0.0001