Weighted Statistics

Statistics-related helper functions.

Polars expression helpers are available as regular functions and through the oi namespace:

>>> import polars as pl
>>> from oi_tools.stats import weighted_mean
>>> df = pl.DataFrame({"x": [1.0, 2.0], "w": [1.0, 3.0]})
>>> df.select(pl.col("x").pipe(weighted_mean, "w")).item()
1.75
>>> df.select(pl.col("x").oi.weighted_mean("w")).item()
1.75

API Documentation

oi_tools.stats.center(x: Expr | str | int | float, w: Expr | str | int | float) → Expr

Subtract the weighted mean from an expression.

Parameters:

x (Expr | str | int | float) – The expression to center.
w (Expr | str | int | float) – Weights.

Returns:

An expression with weighted mean zero.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})
>>> df.select(center("x", "w")).to_series().to_list()
[-1.25, -0.25, 0.75]

oi_tools.stats.scale(x: Expr | str | int | float, w: Expr | str | int | float, *, weight_type: Literal['frequency', 'precision'] = 'precision', ddof: int = 1) → Expr

Divide an expression by its weighted standard deviation.

Parameters:

x (Expr | str | int | float) – The expression to scale.
w (Expr | str | int | float) – Weights.
weight_type (Literal['frequency', 'precision']) – The type of weight; see weighted_variance().
ddof (int) – Delta degrees of freedom.

Returns:

An expression with unit weighted variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(scale("x", "w", weight_type="frequency")).to_series().round(
...     4
... ).to_list()
[0.0, 1.4142, 2.8284]
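
Because frequency weights act as repeat counts, the example above can be cross-checked in plain Python by repeating each x value w times and dividing by the standard library's unweighted sample standard deviation. This is only a sketch: scale_by_repetition is a hypothetical helper, not part of oi_tools, and it assumes integer-valued weights and ddof=1.

```python
import statistics


def scale_by_repetition(x, w):
    """Hypothetical helper: divide x by the std. dev. of the expanded sample."""
    # Repeat each value w times, which is what a frequency weight stands for.
    expanded = [xi for xi, wi in zip(x, w) for _ in range(int(wi))]
    # statistics.stdev divides by n - 1, i.e. sum(w) - ddof with ddof=1.
    sd = statistics.stdev(expanded)
    return [xi / sd for xi in x]


scaled = scale_by_repetition([0.0, 1.0, 2.0], [1.0, 3.0, 1.0])
print([round(v, 4) for v in scaled])  # [0.0, 1.4142, 2.8284]
```

The expanded sample is [0, 1, 1, 1, 2], whose sample variance is 2 / 4 = 0.5, matching the frequency-weighted variance used by the documented example.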

oi_tools.stats.standardize(x: Expr | str | int | float, w: Expr | str | int | float, *, weight_type: Literal['frequency', 'precision'] = 'precision', ddof: int = 1) → Expr

Center and scale an expression to zero mean and unit variance.

Equivalent to composing center() and scale().

Parameters:

x (Expr | str | int | float) – The expression to standardize.
w (Expr | str | int | float) – Weights.
weight_type (Literal['frequency', 'precision']) – The type of weight; see weighted_variance().
ddof (int) – Delta degrees of freedom.

Returns:

An expression with weighted mean zero and unit weighted variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 1.0, 1.0]})
>>> df.select(standardize("x", "w")).to_series().to_list()
[-1.0, 0.0, 1.0]
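
The composition of centering and scaling can be sketched in plain Python. The helpers below (weighted_mean_sketch, center_sketch, scale_sketch) are hypothetical stand-ins, not the library's implementation; they assume no null values and use the "precision" normalization with ddof=1, matching the documented defaults.

```python
import math


def weighted_mean_sketch(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)


def center_sketch(x, w):
    # Subtract the weighted mean from every value.
    mean = weighted_mean_sketch(x, w)
    return [xi - mean for xi in x]


def scale_sketch(x, w, ddof=1):
    # Divide by the weighted standard deviation ("precision" normalization).
    mean = weighted_mean_sketch(x, w)
    sum_w = sum(w)
    sum_w2 = sum(wi * wi for wi in w)
    ss = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    sd = math.sqrt(sum_w / (sum_w**2 - ddof * sum_w2) * ss)
    return [xi / sd for xi in x]


x = [0.0, 1.0, 2.0]
w = [1.0, 1.0, 1.0]
z = scale_sketch(center_sketch(x, w), w)
print(z)  # [-1.0, 0.0, 1.0]
```

Centering does not change the variance, so applying the two steps in either order gives the same result.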

oi_tools.stats.weighted_covariance(x: Expr | str | int | float, y: Expr | str | int | float, w: Expr | str | int | float | None = None, *, weight_type: Literal['frequency', 'precision'] = 'precision', ddof: int = 1) → Expr

Compute the weighted covariance between two expressions.

Rows where any of x, y, or w is null are omitted. See the documentation for np.cov for more.

Parameters:

x (Expr | str | int | float) – First expression.
y (Expr | str | int | float) – Second expression.
w (Expr | str | int | float | None) – Weights.
weight_type (Literal['frequency', 'precision']) –
The type of weight:
- "frequency" weights treat each weight as a repeat count, giving normalization 1 / (sum(w) - ddof). These are like fweights in Stata.
- "precision" (analytic/reliability) weights treat each weight as an inverse variance, giving normalization sum(w) / (sum(w)**2 - ddof * sum(w**2)). These are like aweights in Stata.
ddof (int) – Delta degrees of freedom.

Returns:

The weighted covariance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 3.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="frequency")).item()
-0.5

>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 2.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="precision")).item()
-0.8
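
The two normalizations listed under weight_type can be checked by hand. The following plain-Python sketch applies them to the same data as the examples above; weighted_covariance_sketch is a hypothetical helper written for illustration, not the library's implementation, and it represents nulls as None.

```python
def weighted_covariance_sketch(x, y, w, weight_type="precision", ddof=1):
    """Hypothetical helper illustrating the documented normalizations."""
    # Omit rows where any of x, y, or w is null.
    rows = [
        (xi, yi, wi)
        for xi, yi, wi in zip(x, y, w)
        if xi is not None and yi is not None and wi is not None
    ]
    sum_w = sum(wi for _, _, wi in rows)
    mean_x = sum(xi * wi for xi, _, wi in rows) / sum_w
    mean_y = sum(yi * wi for _, yi, wi in rows) / sum_w
    # Weighted sum of cross-products around the weighted means.
    cross = sum(wi * (xi - mean_x) * (yi - mean_y) for xi, yi, wi in rows)
    if weight_type == "frequency":
        norm = 1.0 / (sum_w - ddof)
    elif weight_type == "precision":
        sum_w2 = sum(wi * wi for _, _, wi in rows)
        norm = sum_w / (sum_w**2 - ddof * sum_w2)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    return norm * cross


x = [0.0, 1.0, 2.0]
y = [2.0, 1.0, 0.0]
freq = weighted_covariance_sketch(x, y, [1.0, 3.0, 1.0], weight_type="frequency")
prec = weighted_covariance_sketch(x, y, [1.0, 2.0, 1.0], weight_type="precision")
print(freq, prec)  # -0.5 -0.8
```

In both cases the weighted sum of cross-products is -2; the frequency normalization is 1 / (5 - 1) = 0.25 and the precision normalization is 4 / (16 - 6) = 0.4.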

See also

weighted_variance

oi_tools.stats.weighted_mean(x: Expr | str | int | float, w: Expr | str | int | float | None = None) → Expr

Compute the weighted mean of an expression.

Rows where either x or w is null are omitted.

Parameters:

x (Expr | str | int | float) – The expression.
w (Expr | str | int | float | None) – Weights.

Returns:

The weighted mean.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0], "w": [1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75

Null values in either x or w are omitted:

>>> df = pl.DataFrame({"x": [0.0, None, 1.0], "w": [1.0, 1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75
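
The null handling amounts to dropping a whole row before either sum is taken, so the weight of a row with a null x does not enter the denominator. A minimal plain-Python sketch, with None standing in for null and weighted_mean_sketch a hypothetical helper rather than the library's code:

```python
def weighted_mean_sketch(x, w):
    """Hypothetical helper: weighted mean that omits rows with a null x or w."""
    # Keep only rows where both the value and its weight are present.
    pairs = [(xi, wi) for xi, wi in zip(x, w) if xi is not None and wi is not None]
    total_w = sum(wi for _, wi in pairs)
    return sum(xi * wi for xi, wi in pairs) / total_w


result = weighted_mean_sketch([0.0, None, 1.0], [1.0, 1.0, 3.0])
print(result)  # 0.75
```

Had the middle row's weight been kept in the denominator, the result would be 3 / 5 = 0.6 rather than 0.75.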

oi_tools.stats.weighted_rank

Compute the weighted quantile rank of an expression.

Parameters:

x (str | Collection[str] | Selector | Expr | int | float) – The values to rank.
w (str | Collection[str] | Selector | Expr | int | float) – The weights associated with each value.
ties (Literal['arbitrary', 'average']) –
How to handle assigning quantiles in the case of ties:
- "arbitrary": break ties arbitrarily,
- "average": assign each unit the average rank of all units with the same x value.

Returns:

A Polars expression producing ranks in (0, 1).

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [1.0, 2.0], "w": [1.0, 3.0]})
>>> df.select(weighted_rank("x", "w")).to_series().to_list()
[0.125, 0.625]
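
One convention that reproduces the output above places each unit at the midpoint of its own weight within the cumulative weight distribution: (weight strictly below + w / 2) / total weight. The sketch below assumes that convention; it is inferred from the example, not confirmed as the library's algorithm, and weighted_rank_sketch is a hypothetical helper. Ties are broken by input order, which corresponds loosely to ties="arbitrary".

```python
def weighted_rank_sketch(x, w):
    """Hypothetical helper: midpoint-of-weight quantile ranks in (0, 1)."""
    total = sum(w)
    # Visit units in ascending order of x; sorted() is stable, so ties keep input order.
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    below = 0.0  # cumulative weight of units already visited
    for i in order:
        ranks[i] = (below + w[i] / 2) / total
        below += w[i]
    return ranks


ranks = weighted_rank_sketch([1.0, 2.0], [1.0, 3.0])
print(ranks)  # [0.125, 0.625]
```

With strictly positive weights, every rank lies strictly between 0 and 1, consistent with the documented return range.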

Notes

Behavior is undefined if w contains null values.

oi_tools.stats.weighted_variance(x: Expr | str | int | float, w: Expr | str | int | float | None = None, *, weight_type: Literal['frequency', 'precision'] = 'precision', ddof: int = 1) → Expr

Compute the weighted variance of an expression.

Rows where either x or w is null are omitted. See the documentation for np.cov for more.

Parameters:

x (Expr | str | int | float) – The expression.
w (Expr | str | int | float | None) – Weights.
weight_type (Literal['frequency', 'precision']) – The type of weight; see weighted_covariance().
ddof (int) – Delta degrees of freedom.

Returns:

The weighted variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="frequency")).item()
0.5

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 2.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="precision")).item()
0.8