Convenience Functions
Various helper/utility functions.
API Documentation
- class oi_tools.helpers.ParquetCache( )
A thin wrapper around joblib.Memory that caches polars.DataFrame results to parquet files. See the examples section below or the joblib documentation for more.
Examples
>>> import time
>>> import polars as pl
>>> from oi_tools.helpers import ParquetCache
>>>
>>> CACHE = ParquetCache("/tmp/example/cache/path/", verbose=0)
>>>
>>> @CACHE.cache()
... def slow_query() -> pl.DataFrame:
...     print("Computing expensive function...")
...     time.sleep(2)
...     return pl.DataFrame({"x": [1, 2, 3]})
>>>
>>> slow_query()  # slow on first call
Computing expensive function...
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
>>> slow_query()  # fast on subsequent calls
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
In addition to using this wrapper, it’s also possible to use the parquet backend (ParquetStoreBackend) directly from joblib.Memory by passing backend="parquet":
>>> from joblib import Memory
>>> CACHE = Memory("/tmp/another/cache/path/", backend="parquet")
- oi_tools.helpers.clean_col_name(
- name: str,
Normalize a column name to snake_case. Inspired by janitor::clean_names in R.
- Parameters:
name (str) – Raw column name string.
- Returns:
Cleaned column name in snake case, with runs of non-alphanumeric characters replaced by underscores.
- Return type:
str
Examples
>>> clean_col_name("First Name")
'first_name'
>>> clean_col_name("testScore")
'test_score'
>>> clean_col_name("Total (USD)")
'total_usd'
>>> clean_col_name("FIPS Code")
'fips_code'
>>> clean_col_name("café")
'cafe'
This function is especially useful in polars.DataFrame.rename():
>>> import polars as pl
>>> df = pl.DataFrame({"First Name": ["Alice"], "testScore": [1]})
>>> df.columns
['First Name', 'testScore']
>>> df.rename(clean_col_name).columns
['first_name', 'test_score']
- oi_tools.helpers.complete( ) → DF
Turn implicitly missing rows into explicit null values.
Computes the cartesian product of all unique values for the specified columns, then left-joins the original data onto that spine so that missing combinations appear as null rows.
- Parameters:
df (DF) – Input DataFrame or LazyFrame.
*columns (str | Sequence[str]) – Column names whose unique values form the spine. Pass a list of names to treat multiple columns as a single composite key. (See below for an example.)
**kwargs (Sequence[Any]) – Explicit value sets for additional key columns. The kwarg name becomes the column name and the value becomes its unique levels.
- Returns:
DataFrame or LazyFrame with all combinations present and missing values filled with nulls.
- Return type:
DF
- Raises:
ValueError – If the same column appears in both columns and kwargs.
polars.exceptions.ComputeError – At collection time, if the specified columns do not uniquely identify rows in df (i.e. the 1:1 join constraint is violated).
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "val": [10, 20, 30]})
>>> df
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ val │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ x   ┆ 10  │
│ 1   ┆ y   ┆ 20  │
│ 2   ┆ x   ┆ 30  │
└─────┴─────┴─────┘
Passing column names causes implicitly missing rows to be converted to explicit null values:
>>> complete(df, "a", "b").sort("a", "b")
shape: (4, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ val  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ x   ┆ 10   │
│ 1   ┆ y   ┆ 20   │
│ 2   ┆ x   ┆ 30   │
│ 2   ┆ y   ┆ null │
└─────┴─────┴──────┘
You can use a kwarg to extend the expected values beyond what appears in the data:
>>> complete(df, "b", a=[1, 2, 3]).sort("b", "a")
shape: (6, 3)
┌─────┬─────┬──────┐
│ b   ┆ a   ┆ val  │
│ --- ┆ --- ┆ ---  │
│ str ┆ i64 ┆ i64  │
╞═════╪═════╪══════╡
│ x   ┆ 1   ┆ 10   │
│ x   ┆ 2   ┆ 30   │
│ x   ┆ 3   ┆ null │
│ y   ┆ 1   ┆ 20   │
│ y   ┆ 2   ┆ null │
│ y   ┆ 3   ┆ null │
└─────┴─────┴──────┘
Passing a list treats multiple columns as a single composite key, computing unique values for the pair rather than each column independently. This is useful when one column is only meaningful in the context of another. For example, three-digit county FIPS codes are only unique within a state, so ["state", "county"] should be completed together rather than separately:
>>> df2 = pl.DataFrame(
...     {
...         "year": [2020, 2020, 2021],
...         "state": ["CA", "TX", "CA"],
...         "county": ["001", "001", "003"],
...         "val": [10, 20, 30],
...     }
... )
>>> complete(df2, "year", ["state", "county"]).sort("year", "state", "county")
shape: (6, 4)
┌──────┬───────┬────────┬──────┐
│ year ┆ state ┆ county ┆ val  │
│ ---  ┆ ---   ┆ ---    ┆ ---  │
│ i64  ┆ str   ┆ str    ┆ i64  │
╞══════╪═══════╪════════╪══════╡
│ 2020 ┆ CA    ┆ 001    ┆ 10   │
│ 2020 ┆ CA    ┆ 003    ┆ null │
│ 2020 ┆ TX    ┆ 001    ┆ 20   │
│ 2021 ┆ CA    ┆ 001    ┆ null │
│ 2021 ┆ CA    ┆ 003    ┆ 30   │
│ 2021 ┆ TX    ┆ 001    ┆ null │
└──────┴───────┴────────┴──────┘
Notice that TX, 003 does not appear in the output.
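The spine-building logic can be illustrated without polars. Below is a pure-Python analogue (a hypothetical `complete_sketch`, not the library implementation): take the cartesian product of each key column's unique values, then look up each combination in the original rows, filling misses with None. The composite-key and kwargs features are omitted for brevity.

```python
from itertools import product


def complete_sketch(rows, *keys):
    """Pure-Python analogue of complete() for flat string keys (illustrative only)."""
    # Unique, sorted levels for each key column form the "spine"
    levels = [sorted({row[k] for row in rows}) for k in keys]
    # Index existing rows by their key tuple for the left-join step
    index = {tuple(row[k] for k in keys): row for row in rows}
    non_key_cols = [c for c in rows[0] if c not in keys]
    out = []
    for combo in product(*levels):
        if combo in index:
            out.append(index[combo])
        else:
            # Missing combination: key values plus nulls for everything else
            out.append({**dict(zip(keys, combo)), **{c: None for c in non_key_cols}})
    return out
```

Applied to the first example above, the missing (2, "y") combination comes back with a null val, matching the documented output.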
- oi_tools.helpers.inflation_adjust(
- col: str | Collection[str] | Selector | Expr | int | float,
- *,
- from_year: str | Collection[str] | Selector | Expr | int | float,
- to_year: str | Collection[str] | Selector | Expr | int | float,
- series: str = 'CUUR0000SA0',
Adjust for inflation using the Consumer Price Index.
- Parameters:
col (str | Collection[str] | Selector | Expr | int | float) – The column (or columns) to adjust.
from_year (str | Collection[str] | Selector | Expr | int | float) – The year in which the dollar value is currently measured.
to_year (str | Collection[str] | Selector | Expr | int | float) – The year to which you would like to inflation adjust.
series (str) – The CPI series used for inflation adjustment.
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame({"income": [50000, 75000], "year": [2010, 2015]})
>>> df.with_columns(
...     income_2023=inflation_adjust("income", from_year="year", to_year=2023)
... )
shape: (2, 3)
┌────────┬──────┬──────────────┐
│ income ┆ year ┆ income_2023  │
│ ---    ┆ ---  ┆ ---          │
│ i64    ┆ i64  ┆ f64          │
╞════════╪══════╪══════════════╡
│ 50000  ┆ 2010 ┆ 69867.832117 │
│ 75000  ┆ 2015 ┆ 96417.767502 │
└────────┴──────┴──────────────┘
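The underlying arithmetic is a simple ratio of index values: adjusted = nominal × CPI(to_year) / CPI(from_year). The sketch below (a hypothetical `inflation_adjust_sketch`) shows that formula in isolation. The index numbers used are placeholders for illustration, not real BLS CPI data.

```python
def inflation_adjust_sketch(value: float, from_year: int, to_year: int,
                            cpi: dict[int, float]) -> float:
    """Adjusted dollars = nominal dollars * CPI(to) / CPI(from)."""
    return value * cpi[to_year] / cpi[from_year]


# Placeholder index values for illustration only (NOT real CPI data)
fake_cpi = {2010: 100.0, 2023: 140.0}
adjusted = inflation_adjust_sketch(50000, 2010, 2023, fake_cpi)  # 50000 * 1.4
```

With a 40% cumulative index increase, $50,000 in 2010 dollars becomes $70,000 in 2023 dollars under these placeholder numbers.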
- oi_tools.helpers.regress(
- lhs: str | Collection[str] | Selector | Expr | int | float,
- *rhs: str | Collection[str] | Selector | Expr | int | float,
- include_intercept: bool = True,
- weights: str | Collection[str] | Selector | Expr | int | float | None = None,
- output: Literal['predictions', 'residuals', 'coefficients', 'statistics'] = 'predictions',
- **kwargs,
Regress an expression on some covariates.
At the moment, this is just a thin wrapper around polars_ols.compute_least_squares().
- Parameters:
lhs (str | Collection[str] | Selector | Expr | int | float) – The dependent variable (outcome) to regress.
*rhs (str | Collection[str] | Selector | Expr | int | float) – One or more independent variables (predictors).
include_intercept (bool) – Whether to add an intercept term.
weights (str | Collection[str] | Selector | Expr | int | float | None) – Optional observation weights for weighted least squares.
output (Literal['predictions', 'residuals', 'coefficients', 'statistics']) – What to return from the regression. See the “Returns” section for how this affects the output.
**kwargs – Additional keyword arguments forwarded to polars_ols.OLSKwargs.
- Returns:
The result depends on output:
"predictions": fitted values from the regression (same length as lhs).
"residuals": difference between observed and fitted values (same length as lhs).
"coefficients": estimated coefficients, one per column that rhs expands to.
"statistics": coefficient estimates with standard errors and p-values.
- Return type:
pl.Expr
Examples
>>> import polars as pl
>>> from oi_tools.helpers import regress
>>> df = pl.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [2.0, 4.0, 5.0, 7.0]})
Fitted values (default):
>>> df.select(regress("y", "x"))
shape: (4, 1)
┌──────────┐
│ y        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.961538 │
│ 2.192308 │
│ 2.807692 │
│ 4.038462 │
└──────────┘
Residuals:
>>> df.select(regress("y", "x", output="residuals"))
shape: (4, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.038462  │
│ -0.192308 │
│ 0.192308  │
│ -0.038462 │
└───────────┘
Coefficients as a one-row struct, one field per predictor:
>>> df.select(regress("y", "x", output="coefficients")).unnest("coefficients")
shape: (1, 2)
┌──────────┬───────────┐
│ x        ┆ const     │
│ ---      ┆ ---       │
│ f64      ┆ f64       │
╞══════════╪═══════════╡
│ 0.615385 ┆ -0.269231 │
└──────────┴───────────┘
Model statistics (r2, mae, mse, and per-predictor coefficients, standard errors, t-values, and p-values):
>>> stats = df.select(regress("y", "x", output="statistics")).unnest("statistics")
>>> stats.select(["r2", "feature_names", "coefficients", "p_values"])
shape: (1, 4)
┌──────────┬────────────────┬──────────────────┬──────────────────┐
│ r2       ┆ feature_names  ┆ coefficients     ┆ p_values         │
│ ---      ┆ ---            ┆ ---              ┆ ---              │
│ f64      ┆ list[str]      ┆ list[f64]        ┆ list[f64]        │
╞══════════╪════════════════╪══════════════════╪══════════════════╡
│ 0.984615 ┆ ["x", "const"] ┆ [0.615385, -0.2… ┆ [0.007722, 0.41… │
└──────────┴────────────────┴──────────────────┴──────────────────┘
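As a sanity check, the single-predictor coefficients shown in the examples can be reproduced by hand with the closed-form simple-regression formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).

```python
# Reproduce the single-predictor fit from the examples by hand.
x = [2.0, 4.0, 5.0, 7.0]
y = [1.0, 2.0, 3.0, 4.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
# Centered sums of squares and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx                 # matches the "x" coefficient: 0.615385
intercept = y_bar - slope * x_bar   # matches "const": -0.269231
```

These agree with the "coefficients" output above to six decimal places.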
- oi_tools.helpers.to_expr( ) → Expr
Convert the input to a Polars expression.
- Parameters:
x (str | Collection[str] | Selector | Expr | int | float) – Something that can be coerced to an expression.
- Returns:
A Polars expression.
- Return type:
pl.Expr
Examples
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
Strings become column references:
>>> df.select(to_expr("a"))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Numerics become literals:
>>> df.select(to_expr(10).mul("a"))
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 10      │
│ 20      │
│ 30      │
└─────────┘
Lists of strings become column selectors:
>>> columns = ["a", "b"]
>>> df.select(*columns, to_expr(columns).mul(2).name.suffix("_times_two"))
shape: (3, 4)
┌─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ a_times_two ┆ b_times_two │
│ --- ┆ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64 ┆ i64         ┆ i64         │
╞═════╪═════╪═════════════╪═════════════╡
│ 1   ┆ 4   ┆ 2           ┆ 8           │
│ 2   ┆ 5   ┆ 4           ┆ 10          │
│ 3   ┆ 6   ┆ 6           ┆ 12          │
└─────┴─────┴─────────────┴─────────────┘
Expressions pass through unchanged:
>>> expr = pl.col("a") * 2
>>> to_expr(expr) is expr
True
- oi_tools.helpers.to_masked_expr( ) → Sequence[Expr]
Create a set of expressions with a standardized null mask.
All output expressions evaluate to null wherever any input expression is null.
- Parameters:
*xs (str | Collection[str] | Selector | Expr | int | float) – Expressions to mask.
- Returns:
Expressions that evaluate to null when any input is null.
- Return type:
Sequence[pl.Expr]
Examples
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 4, 5, None],
...         "b": [1, None, 3, 4, None, None],
...     }
... )
>>> df
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ null │
│ null ┆ 3    │
│ 4    ┆ 4    │
│ 5    ┆ null │
│ null ┆ null │
└──────┴──────┘
>>> df.with_columns(*to_masked_expr("a", "b"))
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ null ┆ null │
│ null ┆ null │
│ 4    ┆ 4    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘
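The masking rule itself is row-wise: a row is valid only if every input column is non-null, and every output column is nulled wherever that check fails. A pure-Python analogue of this behavior (a hypothetical `to_masked_rows_sketch`, not the polars implementation):

```python
def to_masked_rows_sketch(*columns):
    """Null out whole rows wherever any column is None (illustrative only)."""
    masked = [
        # If any value in the row is None, blank the entire row
        tuple(None for _ in row) if any(v is None for v in row) else row
        for row in zip(*columns)
    ]
    # Transpose back from row-major tuples to column lists
    return [list(col) for col in zip(*masked)]


a = [1, 2, None, 4, 5, None]
b = [1, None, 3, 4, None, None]
masked_a, masked_b = to_masked_rows_sketch(a, b)
```

On the example data above, both columns end up null in rows 2, 3, 5, and 6, matching the documented output.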
- oi_tools.helpers.to_selector(
- x: str | Collection[str] | Selector,
Convert the input to a Polars selector.
- Parameters:
x (str | Collection[str] | Selector) – Something that can be coerced to a column selector.
- Returns:
A Polars column selector.
- Return type:
cs.Selector
Examples
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
>>> df.select(to_selector(["a", "b"]))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘