Convenience Functions
Various helper/utility functions.
API Documentation
- class oi_tools.helpers.ParquetCache( )
A thin wrapper around joblib.Memory that caches polars.DataFrame results to parquet files. See the examples section below or the joblib documentation for more.
Examples
>>> import time
>>> import polars as pl
>>> from oi_tools.helpers import ParquetCache
>>>
>>> CACHE = ParquetCache("/tmp/example/cache/path/", verbose=0)
>>>
>>> @CACHE.cache()
... def slow_query() -> pl.DataFrame:
...     print("Computing expensive function...")
...     time.sleep(2)
...     return pl.DataFrame({"x": [1, 2, 3]})
>>>
>>> slow_query()  # slow on first call
Computing expensive function...
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
>>> slow_query()  # fast on subsequent calls
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
In addition to using this wrapper, it’s also possible to use the parquet backend (ParquetStoreBackend) directly from joblib.Memory by passing backend="parquet":
>>> from joblib import Memory
>>> CACHE = Memory("/tmp/another/cache/path/", backend="parquet")
- oi_tools.helpers.clean_col_name(
- name: str,
Normalize a column name to snake_case. Inspired by janitor::clean_names in R.
- Parameters:
name (str) – Raw column name string.
- Returns:
Cleaned column name in snake case, with runs of non-alphanumeric characters replaced by underscores.
- Return type:
str
Examples
>>> clean_col_name("First Name")
'first_name'
>>> clean_col_name("testScore")
'test_score'
>>> clean_col_name("Total (USD)")
'total_usd'
>>> clean_col_name("FIPS Code")
'fips_code'
>>> clean_col_name("café")
'cafe'
This function is especially useful in polars.DataFrame.rename():
>>> import polars as pl
>>> df = pl.DataFrame({"First Name": ["Alice"], "testScore": [1]})
>>> df.columns
['First Name', 'testScore']
>>> df.rename(clean_col_name).columns
['first_name', 'test_score']
- oi_tools.helpers.complete( ) → DF
Turn implicitly missing rows into explicit null values.
Computes the cartesian product of all unique values for the specified columns, then left-joins the original data onto that spine so that missing combinations appear as null rows.
- Parameters:
df (DF) – Input DataFrame or LazyFrame.
*columns (str | Sequence[str]) – Column names whose unique values form the spine. Pass a list of names to treat multiple columns as a single composite key. (See below for an example.)
**kwargs (Sequence[Any]) – Explicit value sets for additional key columns. The kwarg name becomes the column name and the value becomes its unique levels.
- Returns:
DataFrame or LazyFrame with all combinations present and missing values filled with nulls.
- Return type:
DF
- Raises:
ValueError – If the same column appears in both columns and kwargs.
polars.exceptions.ComputeError – At collection time, if the specified columns do not uniquely identify rows in df (i.e. the 1:1 join constraint is violated).
Examples
>>> df = pl.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "val": [10, 20, 30]})
>>> df
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ val │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ x   ┆ 10  │
│ 1   ┆ y   ┆ 20  │
│ 2   ┆ x   ┆ 30  │
└─────┴─────┴─────┘
Passing column names causes implicitly missing rows to be converted to explicit null values:
>>> complete(df, "a", "b").sort("a", "b")
shape: (4, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ val  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ x   ┆ 10   │
│ 1   ┆ y   ┆ 20   │
│ 2   ┆ x   ┆ 30   │
│ 2   ┆ y   ┆ null │
└─────┴─────┴──────┘
You can use a kwarg to extend the expected values beyond what appears in the data:
>>> complete(df, "b", a=[1, 2, 3]).sort("b", "a")
shape: (6, 3)
┌─────┬─────┬──────┐
│ b   ┆ a   ┆ val  │
│ --- ┆ --- ┆ ---  │
│ str ┆ i64 ┆ i64  │
╞═════╪═════╪══════╡
│ x   ┆ 1   ┆ 10   │
│ x   ┆ 2   ┆ 30   │
│ x   ┆ 3   ┆ null │
│ y   ┆ 1   ┆ 20   │
│ y   ┆ 2   ┆ null │
│ y   ┆ 3   ┆ null │
└─────┴─────┴──────┘
Passing a list treats multiple columns as a single composite key, computing unique values for the pair rather than each column independently. This is useful when one column is only meaningful in the context of another. For example, three-digit county FIPS codes are only unique within a state, so ["state", "county"] should be completed together rather than separately:
>>> df2 = pl.DataFrame(
...     {
...         "year": [2020, 2020, 2021],
...         "state": ["CA", "TX", "CA"],
...         "county": ["001", "001", "003"],
...         "val": [10, 20, 30],
...     }
... )
>>> complete(df2, "year", ["state", "county"]).sort("year", "state", "county")
shape: (6, 4)
┌──────┬───────┬────────┬──────┐
│ year ┆ state ┆ county ┆ val  │
│ ---  ┆ ---   ┆ ---    ┆ ---  │
│ i64  ┆ str   ┆ str    ┆ i64  │
╞══════╪═══════╪════════╪══════╡
│ 2020 ┆ CA    ┆ 001    ┆ 10   │
│ 2020 ┆ CA    ┆ 003    ┆ null │
│ 2020 ┆ TX    ┆ 001    ┆ 20   │
│ 2021 ┆ CA    ┆ 001    ┆ null │
│ 2021 ┆ CA    ┆ 003    ┆ 30   │
│ 2021 ┆ TX    ┆ 001    ┆ null │
└──────┴───────┴────────┴──────┘
Notice that TX, 003 does not appear in the output.
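The spine-building logic can be illustrated without polars. Below is a pure-Python analogue (a hypothetical `complete_sketch`, not the library implementation): take the cartesian product of each key column's unique values, then look up each combination in the original rows, filling misses with None. The composite-key and kwargs features are omitted for brevity.

```python
from itertools import product


def complete_sketch(rows, *keys):
    """Pure-Python analogue of complete() for flat string keys (illustrative only)."""
    # Unique, sorted levels for each key column form the "spine"
    levels = [sorted({row[k] for row in rows}) for k in keys]
    # Index existing rows by their key tuple for the left-join step
    index = {tuple(row[k] for k in keys): row for row in rows}
    non_key_cols = [c for c in rows[0] if c not in keys]
    out = []
    for combo in product(*levels):
        if combo in index:
            out.append(index[combo])
        else:
            # Missing combination: key values plus nulls for everything else
            out.append({**dict(zip(keys, combo)), **{c: None for c in non_key_cols}})
    return out
```

Applied to the first example above, the missing (2, "y") combination comes back with a null val, matching the documented output.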
- oi_tools.helpers.inflation_adjust(
- col: str | Collection[str] | Selector | Expr | int | float,
- *,
- from_year: str | Collection[str] | Selector | Expr | int | float,
- to_year: str | Collection[str] | Selector | Expr | int | float,
- series: str = 'CUUR0000SA0',
Adjust for inflation using the Consumer Price Index.
- Parameters:
col (str | Collection[str] | Selector | Expr | int | float) – The column (or columns) to adjust.
from_year (str | Collection[str] | Selector | Expr | int | float) – The year in which the dollar value is currently measured.
to_year (str | Collection[str] | Selector | Expr | int | float) – The year to which you would like to inflation adjust.
series (str) – The CPI series used for inflation adjustment.
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame({"income": [50000, 75000], "year": [2010, 2015]})
>>> df.with_columns(
...     income_2023=inflation_adjust("income", from_year="year", to_year=2023)
... )
shape: (2, 3)
┌────────┬──────┬──────────────┐
│ income ┆ year ┆ income_2023  │
│ ---    ┆ ---  ┆ ---          │
│ i64    ┆ i64  ┆ f64          │
╞════════╪══════╪══════════════╡
│ 50000  ┆ 2010 ┆ 69867.832117 │
│ 75000  ┆ 2015 ┆ 96417.767502 │
└────────┴──────┴──────────────┘
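The underlying arithmetic is a simple ratio of index values: adjusted = nominal × CPI(to_year) / CPI(from_year). The sketch below (a hypothetical `inflation_adjust_sketch`) shows that formula in isolation. The index numbers used are placeholders for illustration, not real BLS CPI data.

```python
def inflation_adjust_sketch(value: float, from_year: int, to_year: int,
                            cpi: dict[int, float]) -> float:
    """Adjusted dollars = nominal dollars * CPI(to) / CPI(from)."""
    return value * cpi[to_year] / cpi[from_year]


# Placeholder index values for illustration only (NOT real CPI data)
fake_cpi = {2010: 100.0, 2023: 140.0}
adjusted = inflation_adjust_sketch(50000, 2010, 2023, fake_cpi)  # 50000 * 1.4
```

With a 40% cumulative index increase, $50,000 in 2010 dollars becomes $70,000 in 2023 dollars under these placeholder numbers.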
- oi_tools.helpers.regress(
- lhs: str | Collection[str] | Selector | Expr | int | float,
- *rhs: str | Collection[str] | Selector | Expr | int | float,
- include_intercept: bool = True,
- weights: str | Collection[str] | Selector | Expr | int | float | None = None,
- output: Literal['predictions', 'residuals', 'coefficients', 'statistics'] = 'predictions',
- **kwargs,
Regress an expression on some covariates.
At the moment, this is just a thin wrapper around polars_ols.compute_least_squares().
- Parameters:
lhs (str | Collection[str] | Selector | Expr | int | float) – The dependent variable (outcome) to regress.
*rhs (str | Collection[str] | Selector | Expr | int | float) – One or more independent variables (predictors).
include_intercept (bool) – Whether to add an intercept term.
weights (str | Collection[str] | Selector | Expr | int | float | None) – Optional observation weights for weighted least squares.
output (Literal['predictions', 'residuals', 'coefficients', 'statistics']) – What to return from the regression. See the “Returns” section for how this affects the output.
**kwargs – Additional keyword arguments forwarded to polars_ols.OLSKwargs.
- Returns:
The result depends on output:
"predictions": fitted values from the regression (same length as lhs).
"residuals": difference between observed and fitted values (same length as lhs).
"coefficients": estimated coefficients, one per column that rhs expands to.
"statistics": coefficient estimates with standard errors and p-values.
- Return type:
pl.Expr
Examples
>>> import polars as pl
>>> from oi_tools.helpers import regress
>>> df = pl.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [2.0, 4.0, 5.0, 7.0]})
Fitted values (default):
>>> df.select(regress("y", "x"))
shape: (4, 1)
┌──────────┐
│ y        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.961538 │
│ 2.192308 │
│ 2.807692 │
│ 4.038462 │
└──────────┘
Residuals:
>>> df.select(regress("y", "x", output="residuals"))
shape: (4, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.038462  │
│ -0.192308 │
│ 0.192308  │
│ -0.038462 │
└───────────┘
Coefficients as a one-row struct, one field per predictor:
>>> df.select(regress("y", "x", output="coefficients")).unnest("coefficients")
shape: (1, 2)
┌──────────┬───────────┐
│ x        ┆ const     │
│ ---      ┆ ---       │
│ f64      ┆ f64       │
╞══════════╪═══════════╡
│ 0.615385 ┆ -0.269231 │
└──────────┴───────────┘
Model statistics (r2, mae, mse, and per-predictor coefficients, standard errors, t-values, and p-values):
>>> stats = df.select(regress("y", "x", output="statistics")).unnest("statistics")
>>> stats.select(["r2", "feature_names", "coefficients", "p_values"])
shape: (1, 4)
┌──────────┬────────────────┬──────────────────┬──────────────────┐
│ r2       ┆ feature_names  ┆ coefficients     ┆ p_values         │
│ ---      ┆ ---            ┆ ---              ┆ ---              │
│ f64      ┆ list[str]      ┆ list[f64]        ┆ list[f64]        │
╞══════════╪════════════════╪══════════════════╪══════════════════╡
│ 0.984615 ┆ ["x", "const"] ┆ [0.615385, -0.2… ┆ [0.007722, 0.41… │
└──────────┴────────────────┴──────────────────┴──────────────────┘
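As a sanity check, the single-predictor coefficients shown in the examples can be reproduced by hand with the closed-form simple-regression formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).

```python
# Reproduce the single-predictor fit from the examples by hand.
x = [2.0, 4.0, 5.0, 7.0]
y = [1.0, 2.0, 3.0, 4.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
# Centered sums of squares and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx                 # matches the "x" coefficient: 0.615385
intercept = y_bar - slope * x_bar   # matches "const": -0.269231
```

These agree with the "coefficients" output above to six decimal places.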
- oi_tools.helpers.to_expr( ) → Expr
Convert the input to a Polars expression.
- Parameters:
x (str | Collection[str] | Selector | Expr | int | float) – Something that can be coerced to an expression.
- Returns:
A Polars expression.
- Return type:
pl.Expr
Examples
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
Strings become column references:
>>> df.select(to_expr("a"))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Numerics become literals:
>>> df.select(to_expr(10).mul("a"))
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 10      │
│ 20      │
│ 30      │
└─────────┘
Lists of strings become column selectors:
>>> columns = ["a", "b"]
>>> df.select(*columns, to_expr(columns).mul(2).name.suffix("_times_two"))
shape: (3, 4)
┌─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ a_times_two ┆ b_times_two │
│ --- ┆ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64 ┆ i64         ┆ i64         │
╞═════╪═════╪═════════════╪═════════════╡
│ 1   ┆ 4   ┆ 2           ┆ 8           │
│ 2   ┆ 5   ┆ 4           ┆ 10          │
│ 3   ┆ 6   ┆ 6           ┆ 12          │
└─────┴─────┴─────────────┴─────────────┘
Expressions pass through unchanged:
>>> expr = pl.col("a") * 2
>>> to_expr(expr) is expr
True
- oi_tools.helpers.to_masked_expr( ) → Sequence[Expr]
Create a set of expressions with a standardized null mask.
All output expressions evaluate to null wherever any input expression is null.
- Parameters:
*xs (str | Collection[str] | Selector | Expr | int | float) – Expressions to mask.
- Returns:
Expressions that evaluate to null when any input is null.
- Return type:
Sequence[pl.Expr]
Examples
>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 4, 5, None],
...         "b": [1, None, 3, 4, None, None],
...     }
... )
>>> df
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ null │
│ null ┆ 3    │
│ 4    ┆ 4    │
│ 5    ┆ null │
│ null ┆ null │
└──────┴──────┘
>>> df.with_columns(*to_masked_expr("a", "b"))
shape: (6, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
│ null ┆ null │
│ null ┆ null │
│ 4    ┆ 4    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘
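The masking rule itself is row-wise: a row is valid only if every input column is non-null, and every output column is nulled wherever that check fails. A pure-Python analogue of this behavior (a hypothetical `to_masked_rows_sketch`, not the polars implementation):

```python
def to_masked_rows_sketch(*columns):
    """Null out whole rows wherever any column is None (illustrative only)."""
    masked = [
        # If any value in the row is None, blank the entire row
        tuple(None for _ in row) if any(v is None for v in row) else row
        for row in zip(*columns)
    ]
    # Transpose back from row-major tuples to column lists
    return [list(col) for col in zip(*masked)]


a = [1, 2, None, 4, 5, None]
b = [1, None, 3, 4, None, None]
masked_a, masked_b = to_masked_rows_sketch(a, b)
```

On the example data above, both columns end up null in rows 2, 3, 5, and 6, matching the documented output.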
- oi_tools.helpers.to_selector(
- x: str | Collection[str] | Selector,
Convert the input to a Polars selector.
- Parameters:
x (str | Collection[str] | Selector) – Something that can be coerced to a column selector.
- Returns:
A Polars column selector.
- Return type:
cs.Selector
Examples
>>> import polars as pl
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
>>> df.select(to_selector(["a", "b"]))
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘