Census Disclosure

Helper functions related to the process of Census data disclosure.

Most functionality is provided by the Disclosure class—see Disclosure.create_summary() for an end-to-end example.

API Documentation

oi_tools.drb.MINIMUM_CELL_SIZES: Mapping[Literal['national', 'state', 'substate', 'zip'], int] = {'national': 3, 'state': 10, 'substate': 20, 'zip': 100}

The minimum number of observations for a valid cell, taken from section V.A of the procedures handbook.

class oi_tools.drb.Disclosure(
folder: Path | str,
*,
overwrite: bool = False,
)

Manages the creation of DRB disclosure files for a single release.

Creates and populates the following folder structure:

{folder}/
  to_disclose/tables_T13_T26.xlsx      # rounded/masked data
  support/tables_T13_T26_support.xlsx  # unmodified data
  support/tables_T13_T26_summary.xlsx  # estimate counts

Call add_table() once per table, then call create_summary() to write the summary file.

Parameters:
  • folder (Path | str) – Root directory for output (e.g. drb/2026_01_01). Created if it does not already exist.

  • overwrite (bool) – If True, delete the three output Excel files if they already exist. If False (default), raise FileExistsError if any of them exist.

TO_DISCLOSE = PosixPath('to_disclose')

Folder for rounded/noise-infused tables to be submitted for disclosure review.

SUPPORT = PosixPath('support')

Folder for supporting files (raw data, summaries, README files).

PAPERWORK = PosixPath('paperwork')

Folder for paperwork.

ROUNDED_EXCEL_FILE = PosixPath('to_disclose/tables_T13_T26.xlsx')

Excel workbook containing rounded/masked estimates to be submitted for release. Each table is stored as a separate sheet within the workbook.

RAW_EXCEL_FILE = PosixPath('support/tables_T13_T26_support.xlsx')

Unmodified estimates for DRB reference in the same structure as the rounded file.

SUMMARY_EXCEL_FILE = PosixPath('support/tables_T13_T26_summary.xlsx')

Excel file containing estimate counts, both by table and in total.

add_table(
name: str,
raw_table: DataFrame,
*,
count_columns: str | Collection[str] | Selector = [],
proportion_columns: str | Collection[str] | Selector = [],
other_columns: str | Collection[str] | Selector = [],
drop_columns: str | Collection[str] | Selector = [],
sample_size_column: str | Collection[str] | Selector = 'n',
allow_nulls: bool = False,
se_pattern: str | None = '^.*_se$',
geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
) None

Round the specified columns and add a sheet to the release Excel workbooks.

Each call appends a new sheet (named name) to the Excel workbooks. Call this method once per table you want to release, then call create_summary() when all tables have been added.

Columns are rounded according to the rules described in round_and_mask_columns().

Parameters:
  • name (str) – Worksheet name used in both output Excel files.

  • raw_table (DataFrame) – Input DataFrame. Should contain no null values unless allow_nulls=True.

  • count_columns (str | Collection[str] | Selector) – Columns containing unweighted counts to be rounded per section V.B.3 of the handbook (see round_count_column()).

  • proportion_columns (str | Collection[str] | Selector) – Columns containing proportions or ratios to be rounded per section V.B.4 of the handbook (see round_proportion_column()).

  • other_columns (str | Collection[str] | Selector) – All other estimate columns (e.g. weighted means, regression coefficients) to be rounded to 4 significant figures per section V.B.1 of the handbook.

  • drop_columns (str | Collection[str] | Selector) – Columns to drop from the rounded output (e.g. intermediate columns not intended for release).

  • sample_size_column (str | Collection[str] | Selector) – Column containing the sample size used for masking and rounding thresholds.

  • allow_nulls (bool) – If False (default), assert that the input DataFrame contains no null values.

  • se_pattern (str | None) – Regex pattern identifying standard error columns. Matching columns are excluded when counting non-null estimates. Pass None to include all columns.

  • geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see MINIMUM_CELL_SIZES).

Raises:

AssertionError – If allow_nulls=False and nulls are found in the input DataFrame.

Return type:

None

Examples

>>> import tempfile
>>> from pathlib import Path
>>> import polars as pl
>>> table = pl.DataFrame(
...     {
...         "county": [1001, 1003],
...         "population": [1523, 847],
...         "hs_graduation_rate": [0.872, 0.914],
...         "mean_income": [51234.5, 48765.4],
...         "median_wages": [37856.7, None],
...     }
... )
>>> tmp = Path(tempfile.mkdtemp()) / "drb" / "2026_01_01"
>>> disclosure = Disclosure(tmp)
>>> disclosure.add_table(
...     "statistics",
...     table,
...     count_columns="population",
...     proportion_columns="hs_graduation_rate",
...     other_columns=["mean_income", "median_wages"],
...     sample_size_column="population",
...     allow_nulls=True,
... )
>>> # print out the created file structure
>>> for p in sorted(tmp.rglob("*")):
...     print(p.relative_to(tmp))
support
support/tables_T13_T26_support.xlsx
to_disclose
to_disclose/tables_T13_T26.xlsx
>>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=20):
...     raw_statistics = pl.read_excel(tmp / Disclosure.RAW_EXCEL_FILE)
...     print(raw_statistics)
shape: (2, 5)
┌────────┬────────────┬────────────────────┬─────────────┬──────────────┐
│ county ┆ population ┆ hs_graduation_rate ┆ mean_income ┆ median_wages │
│ ---    ┆ ---        ┆ ---                ┆ ---         ┆ ---          │
│ i64    ┆ i64        ┆ f64                ┆ f64         ┆ f64          │
╞════════╪════════════╪════════════════════╪═════════════╪══════════════╡
│ 1001   ┆ 1523       ┆ 0.872              ┆ 51234.5     ┆ 37856.7      │
│ 1003   ┆ 847        ┆ 0.914              ┆ 48765.4     ┆ null         │
└────────┴────────────┴────────────────────┴─────────────┴──────────────┘
>>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=20):
...     rounded_statistics = pl.read_excel(tmp / Disclosure.ROUNDED_EXCEL_FILE)
...     print(rounded_statistics)
shape: (2, 5)
┌────────┬────────────┬────────────────────┬─────────────┬──────────────┐
│ county ┆ population ┆ hs_graduation_rate ┆ mean_income ┆ median_wages │
│ ---    ┆ ---        ┆ ---                ┆ ---         ┆ ---          │
│ i64    ┆ i64        ┆ f64                ┆ i64         ┆ i64          │
╞════════╪════════════╪════════════════════╪═════════════╪══════════════╡
│ 1001   ┆ 1500       ┆ 0.872              ┆ 51230       ┆ 37860        │
│ 1003   ┆ 850        ┆ 0.91               ┆ 48770       ┆ null         │
└────────┴────────────┴────────────────────┴─────────────┴──────────────┘
create_summary(
total_from_previous_releases: int | None = None,
) None

Write a summary Excel file aggregating estimate counts across tables.

Reads the per-variable counts accumulated by add_table() and writes a single summary sheet:

{folder}/
  support/tables_T13_T26_summary.xlsx  # estimate counts

Each table’s rows are followed by a subtotal row, and a grand-total row appears at the end of the sheet.

Parameters:

total_from_previous_releases (int | None) – If provided, two additional rows are appended after the this-release total: one for the previous-releases cumulative total and one for the grand cumulative total across all releases.

Return type:

None

Examples

>>> import tempfile
>>> from pathlib import Path
>>> import polars as pl
>>> table = pl.DataFrame(
...     {
...         "county": [1001, 1003],
...         "population": [1523, 847],
...         "hs_graduation_rate": [0.872, 0.914],
...         "mean_income": [51234.5, 48765.4],
...         "median_wages": [37856.7, None],
...     }
... )
>>> employment = pl.DataFrame(
...     {
...         "county": [1001, 1003, 1005],
...         "employed": [120, 95, 60],
...         "employment_rate": [0.801, 0.743, 0.812],
...     }
... )
>>> tmp = Path(tempfile.mkdtemp()) / "drb" / "2026_01_01"
>>> disclosure = Disclosure(tmp)  # create disclosure object
>>> # add two tables
>>> disclosure.add_table(
...     "statistics",
...     table,
...     count_columns="population",
...     proportion_columns="hs_graduation_rate",
...     other_columns=["mean_income", "median_wages"],
...     sample_size_column="population",
...     allow_nulls=True,
... )
>>> disclosure.add_table(
...     "employment",
...     employment,
...     count_columns="employed",
...     proportion_columns="employment_rate",
...     sample_size_column="employed",
... )
>>> # create summary excel sheet
>>> disclosure.create_summary(total_from_previous_releases=10)
>>> file = tmp / Disclosure.SUMMARY_EXCEL_FILE
>>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=60, tbl_rows=20):
...     print(pl.read_excel(file).with_columns(pl.col("Table").fill_null("")))
shape: (11, 3)
┌────────────┬──────────────────────────────────────────┬─────────────────────┐
│ Table      ┆ Variable                                 ┆ Number of Estimates │
│ ---        ┆ ---                                      ┆ ---                 │
│ str        ┆ str                                      ┆ i64                 │
╞════════════╪══════════════════════════════════════════╪═════════════════════╡
│ statistics ┆ population                               ┆ 2                   │
│            ┆ hs_graduation_rate                       ┆ 2                   │
│            ┆ mean_income                              ┆ 2                   │
│            ┆ median_wages                             ┆ 1                   │
│            ┆ — Total for this table                   ┆ 7                   │
│ employment ┆ employed                                 ┆ 3                   │
│            ┆ employment_rate                          ┆ 3                   │
│            ┆ — Total for this table                   ┆ 6                   │
│ TOTAL      ┆ — Grand total for this release           ┆ 13                  │
│            ┆ — Cumlative total from previous releases ┆ 10                  │
│            ┆ — Cumlative total INCLUDING this release ┆ 23                  │
└────────────┴──────────────────────────────────────────┴─────────────────────┘
oi_tools.drb.round_and_mask_columns(
df: DataFrame,
*,
count_columns: str | Collection[str] | Selector = [],
proportion_columns: str | Collection[str] | Selector = [],
other_columns: str | Collection[str] | Selector = [],
n: str | Collection[str] | Selector | Expr | int | float = 'n',
geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
) DataFrame

Mask small cells and round the columns of a DataFrame for disclosure.

Adheres to the rules defined in sections V.A and V.B of the disclosure methods handbook:

  • Small cells are censored according to Section V.A (see mask_small_cells()).

  • Counts are rounded according to section V.B.3 (see round_count_column()).

  • Ratio/proportions are rounded according to section V.B.4 (see round_proportion_column()).

  • Other estimates (including regression output and weighted means/variances) are rounded to four significant figures (see section V.B.1).

Parameters:
  • df (DataFrame) – Input DataFrame to round and mask.

  • count_columns (str | Collection[str] | Selector) – Columns containing unweighted counts (rounded per section V.B.3).

  • proportion_columns (str | Collection[str] | Selector) – Columns containing proportions or ratios (rounded per section V.B.4).

  • other_columns (str | Collection[str] | Selector) – All other estimate columns (rounded to 4 significant figures).

  • n (str | Collection[str] | Selector | Expr | int | float) – The sample size used to determine the rounding rule.

  • geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see MINIMUM_CELL_SIZES).

Returns:

A copy of df with the specified columns masked and rounded.

Return type:

pl.DataFrame

oi_tools.drb.mask_small_cells(
x: Expr,
n: Expr,
geographic_level: Literal['national', 'state', 'substate', 'zip'],
) Expr

Mask cells that fail the cell-size cutoffs.

From section V.A of the disclosure methods handbook:

For Title 26 counts and estimates from Internal Revenue Service (IRS) data and commingled data (from the Census Bureau and the IRS), we enforce the following thresholds based on IRS requirements:

  • At least 3 entities (unique firms, persons, or households) for national estimates

  • At least 10 entities for state-level estimates

  • At least 20 entities for substate-level estimates, except for zip codes

  • At least 100 entities for ZIP code-level estimates

Parameters:
  • x (Expr) – The expression to mask.

  • n (Expr) – The sample size used to determine the suppression rule.

  • geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level used to look up the minimum cell size in MINIMUM_CELL_SIZES.

Returns:

x where n meets the threshold, null otherwise.

Return type:

pl.Expr

oi_tools.drb.round_count_column(
col: Expr,
n: Expr,
) Expr

Round a count column to the appropriate precision.

From section V.B.3 of the disclosure methods handbook:

The rounding rule for unweighted counts is as follows:

  • If N is less than 15, report N < 15

  • If N is between 15 and 99, round to the nearest 10

  • If N is between 100-999, round to the nearest 50

  • If N is between 1,000-9,999, round to the nearest 100

  • If N is between 10,000-99,999, round to the nearest 500

  • If N is between 100,000-999,999, round to the nearest 1,000

  • If N is 1,000,000 or more, round to four significant digits as described earlier.

Parameters:
  • col (Expr) – The count expression to round.

  • n (Expr) – The sample size used to determine the rounding rule.

Returns:

A Polars expression with rounded values. Returns null when n < 15.

Return type:

pl.Expr

oi_tools.drb.round_proportion_column(
col: Expr,
n: Expr,
) Expr

Round a non-count column to the appropriate number of significant figures.

From section V.B.4 of the disclosure methods handbook:

For thresholds based on an unweighted denominator (D) and an unrounded unweighted proportion (P):

  • If 15 <= D < 100 then P should be rounded to 1 significant digit

  • Else if D < 1,000 then P should be rounded to no more than 2 significant digits

  • Else if D < 10,000 then P should be rounded to no more than 3 significant digits

  • Else if D >= 10,000 then P should be rounded to no more than 4 significant digits.

Parameters:
  • col (Expr) – The proportion/ratio expression to round.

  • n (Expr) – Denominator used to determine the number of significant digits.

Returns:

A Polars expression with rounded values. Returns null when n < 15.

Return type:

pl.Expr