Census Disclosure¶
Helper functions related to the process of Census data disclosure.
Most functionality is provided by the Disclosure class—see Disclosure.create_summary() for an end-to-end example.
See the procedures handbook for information about the logistics/procedure of release.
See the methods handbook for resources on disclosure avoidance methods (e.g. rounding, noise, local-area estimates).
API Documentation¶
- oi_tools.drb.MINIMUM_CELL_SIZES: Mapping[Literal['national', 'state', 'substate', 'zip'], int] = {'national': 3, 'state': 10, 'substate': 20, 'zip': 100}¶
The minimum number of observations for a valid cell, taken from section V.A of the procedures handbook.
- class oi_tools.drb.Disclosure( )¶
Manages the creation of DRB disclosure files for a single release.
Creates and populates the following folder structure:
{folder}/ to_disclose/tables_T13_T26.xlsx # rounded/masked data support/tables_T13_T26_support.xlsx # unmodified data support/tables_T13_T26_summary.xlsx # estimate countsCall
add_table()once per table, then callcreate_summary()to write the summary file.- Parameters:
folder (Path | str) – Root directory for output (e.g.
drb/2026_01_01). Created if it does not already exist.overwrite (bool) – If
True, delete the three output Excel files if they already exist. IfFalse(default), raiseFileExistsErrorif any of them exist.
- TO_DISCLOSE = PosixPath('to_disclose')¶
Folder for rounded/noise-infused tables to be submitted for disclosure review.
- SUPPORT = PosixPath('support')¶
Folder for supporting files (raw data, summaries, README files).
- PAPERWORK = PosixPath('paperwork')¶
Folder for paperwork.
- ROUNDED_EXCEL_FILE = PosixPath('to_disclose/tables_T13_T26.xlsx')¶
Excel workbook containing rounded/masked estimates to be submitted for release. Each table is stored as a separate sheet within the workbook.
- RAW_EXCEL_FILE = PosixPath('support/tables_T13_T26_support.xlsx')¶
Unmodified estimates for DRB reference in the same structure as the rounded file.
- SUMMARY_EXCEL_FILE = PosixPath('support/tables_T13_T26_summary.xlsx')¶
Excel file containing estimate counts, both by table and in total.
- add_table(
- name: str,
- raw_table: DataFrame,
- *,
- count_columns: str | Collection[str] | Selector = [],
- proportion_columns: str | Collection[str] | Selector = [],
- other_columns: str | Collection[str] | Selector = [],
- drop_columns: str | Collection[str] | Selector = [],
- sample_size_column: str | Collection[str] | Selector = 'n',
- allow_nulls: bool = False,
- se_pattern: str | None = '^.*_se$',
- geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
Round the specified columns and add a sheet to the release Excel workbooks.
Each call appends a new sheet (named
name) to the Excel workbooks. Call this method once per table you want to release, then callcreate_summary()when all tables have been added.Columns are rounded according to the rules described in
round_and_mask_columns().- Parameters:
name (str) – Worksheet name used in both output Excel files.
raw_table (DataFrame) – Input DataFrame. Should contain no null values unless
allow_nulls=True.count_columns (str | Collection[str] | Selector) – Columns containing unweighted counts to be rounded per section V.B.3 of the handbook (see
round_count_column()).proportion_columns (str | Collection[str] | Selector) – Columns containing proportions or ratios to be rounded per section V.B.4 of the handbook (see
round_proportion_column()).other_columns (str | Collection[str] | Selector) – All other estimate columns (e.g. weighted means, regression coefficients) to be rounded to 4 significant figures per section V.B.1 of the handbook.
drop_columns (str | Collection[str] | Selector) – Columns to drop from the rounded output (e.g. intermediate columns not intended for release).
sample_size_column (str | Collection[str] | Selector) – Column containing the sample size used for masking and rounding thresholds.
allow_nulls (bool) – If
False(default), assert that the input DataFrame contains no null values.se_pattern (str | None) – Regex pattern identifying standard error columns. Matching columns are excluded when counting non-null estimates. Pass
Noneto include all columns.geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see
MINIMUM_CELL_SIZES).
- Raises:
AssertionError – If
allow_nulls=Falseand nulls are found in the input DataFrame.- Return type:
None
Examples
>>> import tempfile >>> from pathlib import Path >>> import polars as pl >>> table = pl.DataFrame( ... { ... "county": [1001, 1003], ... "population": [1523, 847], ... "hs_graduation_rate": [0.872, 0.914], ... "mean_income": [51234.5, 48765.4], ... "median_wages": [37856.7, None], ... } ... ) >>> tmp = Path(tempfile.mkdtemp()) / "drb" / "2026_01_01" >>> disclosure = Disclosure(tmp) >>> disclosure.add_table( ... "statistics", ... table, ... count_columns="population", ... proportion_columns="hs_graduation_rate", ... other_columns=["mean_income", "median_wages"], ... sample_size_column="population", ... allow_nulls=True, ... ) >>> # print out the created file structure >>> for p in sorted(tmp.rglob("*")): ... print(p.relative_to(tmp)) support support/tables_T13_T26_support.xlsx to_disclose to_disclose/tables_T13_T26.xlsx >>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=20): ... raw_statistics = pl.read_excel(tmp / Disclosure.RAW_EXCEL_FILE) ... print(raw_statistics) shape: (2, 5) ┌────────┬────────────┬────────────────────┬─────────────┬──────────────┐ │ county ┆ population ┆ hs_graduation_rate ┆ mean_income ┆ median_wages │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │ ╞════════╪════════════╪════════════════════╪═════════════╪══════════════╡ │ 1001 ┆ 1523 ┆ 0.872 ┆ 51234.5 ┆ 37856.7 │ │ 1003 ┆ 847 ┆ 0.914 ┆ 48765.4 ┆ null │ └────────┴────────────┴────────────────────┴─────────────┴──────────────┘ >>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=20): ... rounded_statistics = pl.read_excel(tmp / Disclosure.ROUNDED_EXCEL_FILE) ... print(rounded_statistics) shape: (2, 5) ┌────────┬────────────┬────────────────────┬─────────────┬──────────────┐ │ county ┆ population ┆ hs_graduation_rate ┆ mean_income ┆ median_wages │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │ ╞════════╪════════════╪════════════════════╪═════════════╪══════════════╡ │ 1001 ┆ 1500 ┆ 0.872 ┆ 51230 ┆ 37860 │ │ 1003 ┆ 850 ┆ 0.91 ┆ 48770 ┆ null │ └────────┴────────────┴────────────────────┴─────────────┴──────────────┘
- create_summary( ) None¶
Write a summary Excel file aggregating estimate counts across tables.
Reads the per-variable counts accumulated by
add_table()and writes a single summary sheet:{folder}/ support/tables_T13_T26_summary.xlsx # estimate countsEach table’s rows are followed by a subtotal row, and a grand-total row appears at the end of the sheet.
- Parameters:
total_from_previous_releases (int | None) – If provided, two additional rows are appended after the this-release total: one for the previous-releases cumulative total and one for the grand cumulative total across all releases.
- Return type:
None
Examples
>>> import tempfile >>> from pathlib import Path >>> import polars as pl >>> table = pl.DataFrame( ... { ... "county": [1001, 1003], ... "population": [1523, 847], ... "hs_graduation_rate": [0.872, 0.914], ... "mean_income": [51234.5, 48765.4], ... "median_wages": [37856.7, None], ... } ... ) >>> employment = pl.DataFrame( ... { ... "county": [1001, 1003, 1005], ... "employed": [120, 95, 60], ... "employment_rate": [0.801, 0.743, 0.812], ... } ... ) >>> tmp = Path(tempfile.mkdtemp()) / "drb" / "2026_01_01" >>> disclosure = Disclosure(tmp) # create disclosure object >>> # add two tables >>> disclosure.add_table( ... "statistics", ... table, ... count_columns="population", ... proportion_columns="hs_graduation_rate", ... other_columns=["mean_income", "median_wages"], ... sample_size_column="population", ... allow_nulls=True, ... ) >>> disclosure.add_table( ... "employment", ... employment, ... count_columns="employed", ... proportion_columns="employment_rate", ... sample_size_column="employed", ... ) >>> # create summary excel sheet >>> disclosure.create_summary(total_from_previous_releases=10) >>> file = tmp / Disclosure.SUMMARY_EXCEL_FILE >>> with pl.Config(tbl_width_chars=120, fmt_str_lengths=60, tbl_rows=20): ... print(pl.read_excel(file).with_columns(pl.col("Table").fill_null(""))) shape: (11, 3) ┌────────────┬──────────────────────────────────────────┬─────────────────────┐ │ Table ┆ Variable ┆ Number of Estimates │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞════════════╪══════════════════════════════════════════╪═════════════════════╡ │ statistics ┆ population ┆ 2 │ │ ┆ hs_graduation_rate ┆ 2 │ │ ┆ mean_income ┆ 2 │ │ ┆ median_wages ┆ 1 │ │ ┆ — Total for this table ┆ 7 │ │ employment ┆ employed ┆ 3 │ │ ┆ employment_rate ┆ 3 │ │ ┆ — Total for this table ┆ 6 │ │ TOTAL ┆ — Grand total for this release ┆ 13 │ │ ┆ — Cumlative total from previous releases ┆ 10 │ │ ┆ — Cumlative total INCLUDING this release ┆ 23 │ └────────────┴──────────────────────────────────────────┴─────────────────────┘
- oi_tools.drb.round_and_mask_columns(
- df: DataFrame,
- *,
- count_columns: str | Collection[str] | Selector = [],
- proportion_columns: str | Collection[str] | Selector = [],
- other_columns: str | Collection[str] | Selector = [],
- n: str | Collection[str] | Selector | Expr | int | float = 'n',
- geographic_level: Literal['national', 'state', 'substate', 'zip'] = 'national',
Mask small cells and round the columns of a DataFrame for disclosure.
Adheres to the rules defined in sections V.A and V.B of the disclosure methods handbook:
Small cells are censored according to Section V.A (see
mask_small_cells()).Counts are rounded according to section V.B.3 (see
round_count_column()).Ratio/proportions are rounded according to section V.B.4 (see
round_proportion_column()).Other estimates (including regression output and weighted means/variances) are rounded to four significant figures (see section V.B.1).
- Parameters:
df (DataFrame) – Input DataFrame to round and mask.
count_columns (str | Collection[str] | Selector) – Columns containing unweighted counts (rounded per section V.B.3).
proportion_columns (str | Collection[str] | Selector) – Columns containing proportions or ratios (rounded per section V.B.4).
other_columns (str | Collection[str] | Selector) – All other estimate columns (rounded to 4 significant figures).
n (str | Collection[str] | Selector | Expr | int | float) – The sample size used to determine the rounding rule.
geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level of the estimates, used to determine the minimum cell size threshold (see
MINIMUM_CELL_SIZES).
- Returns:
A copy of
dfwith the specified columns masked and rounded.- Return type:
pl.DataFrame
- oi_tools.drb.mask_small_cells( ) Expr¶
Mask cells that fail the cell-size cutoffs.
From section V.A of the disclosure methods handbook:
For Title 26 counts and estimates from Internal Revenue Service (IRS) data and commingled data (from the Census Bureau and the IRS), we enforce the following thresholds based on IRS requirements:
At least 3 entities (unique firms, persons, or households) for national estimates
At least 10 entities for state-level estimates
At least 20 entities for substate-level estimates, except for zip codes
At least 100 entities for ZIP code-level estimates
- Parameters:
x (Expr) – The expression to mask.
n (Expr) – The sample size used to determine the suppression rule.
geographic_level (Literal['national', 'state', 'substate', 'zip']) – Geographic level used to look up the minimum cell size in
MINIMUM_CELL_SIZES.
- Returns:
xwherenmeets the threshold,nullotherwise.- Return type:
pl.Expr
- oi_tools.drb.round_count_column( ) Expr¶
Round a count column to the appropriate precision.
From section V.B.3 of the disclosure methods handbook:
The rounding rule for unweighted counts is as follows:
If N is less than 15, report N < 15
If N is between 15 and 99, round to the nearest 10
If N is between 100-999, round to the nearest 50
If N is between 1,000-9,999, round to the nearest 100
If N is between 10,000-99,999, round to the nearest 500
If N is between 100,000-999,999, round to the nearest 1,000
If N is 1,000,000 or more, round to four significant digits as described earlier.
- oi_tools.drb.round_proportion_column( ) Expr¶
Round a non-count column to the appropriate number of significant figures.
From section V.B.4 of the disclosure methods handbook:
For thresholds based on an unweighted denominator (D) and an unrounded unweighted proportion (P):
If 15 <= D < 100 then P should be rounded to 1 significant digit
Else if D < 1,000 then P should be rounded to no more than 2 significant digits
Else if D < 10,000 then P should be rounded to no more than 3 significant digits
Else if D >= 10,000 then P should be rounded to no more than 4 significant digits.