Weighted Statistics

Helper functions for weighted statistics, including weighted means, variances, covariances, and quantile ranks.

API Documentation

oi_tools.stats.center(
x: Expr | str | int | float,
w: Expr | str | int | float,
) → Expr

Subtract the weighted mean from an expression.

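Parameters:
  • x (Expr | str | int | float) – The expression to center.

  • w (Expr | str | int | float) – Weights.

Returns: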

An expression with mean zero.

Return type:

pl.Expr

Examples

>>> import polars as pl
>>> from oi_tools.stats import center
>>> df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 2.0]})
>>> df.select(center("x", "w")).to_series().to_list()
[-1.25, -0.25, 0.75]
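
The arithmetic behind this example can be cross-checked in plain Python. This is only a sketch of the calculation, not the library's implementation; `center_reference` is a hypothetical helper name, and it assumes no null values.

```python
def center_reference(x, w):
    """Subtract the weighted mean of x (with weights w) from each element."""
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sum(w)  # (1 + 2 + 6) / 4 = 2.25
    return [xi - mean for xi in x]


centered = center_reference([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
print(centered)  # [-1.25, -0.25, 0.75]

# The *weighted* mean of the result is zero; the unweighted mean need not be.
print(sum(wi * ci for ci, wi in zip(centered, [1.0, 1.0, 2.0])))  # 0.0
```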

oi_tools.stats.scale(
x: Expr | str | int | float,
w: Expr | str | int | float,
*,
weight_type: Literal['frequency', 'precision'] = 'precision',
ddof: int = 1,
) → Expr

Divide an expression by its weighted standard deviation.

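Parameters:
  • x (Expr | str | int | float) – The expression to scale.

  • w (Expr | str | int | float) – Weights.

  • weight_type (Literal['frequency', 'precision']) – The type of weight. See weighted_covariance for the two normalizations.

  • ddof (int) – Delta degrees of freedom.

Returns: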

An expression with unit variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(scale("x", "w", weight_type="frequency")).to_series().round(
...     4
... ).to_list()
[0.0, 1.4142, 2.8284]
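
A plain-Python sketch of the same calculation, assuming frequency weights and no null values. `scale_reference` is a hypothetical name used for illustration, not part of `oi_tools`.

```python
import math


def scale_reference(x, w, ddof=1):
    """Divide x by its frequency-weighted standard deviation."""
    total = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / total
    # Weighted sum of squared deviations, normalized by 1 / (sum(w) - ddof).
    ss = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    sd = math.sqrt(ss / (total - ddof))
    return [xi / sd for xi in x]


scaled = scale_reference([0.0, 1.0, 2.0], [1.0, 3.0, 1.0])
print([round(v, 4) for v in scaled])  # [0.0, 1.4142, 2.8284]
```

As the output shows, the values are rescaled but not shifted: an input of 0.0 stays at 0.0.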

oi_tools.stats.weighted_covariance(
x: Expr | str | int | float,
y: Expr | str | int | float,
w: Expr | str | int | float | None = None,
*,
weight_type: Literal['frequency', 'precision'] = 'precision',
ddof: int = 1,
) → Expr

Compute the weighted covariance between two expressions.

Rows where any of x, y, or w is null are omitted. See the documentation for np.cov (its fweights and aweights parameters) for more on the two weight types.

Parameters:
  • x (Expr | str | int | float) – First expression.

  • y (Expr | str | int | float) – Second expression.

  • w (Expr | str | int | float | None) – Weights.

  • weight_type (Literal['frequency', 'precision']) –

    The type of weight.

    • "frequency" weights treat each weight as a repeat count, giving normalization 1 / (sum(w) - ddof). Like fweights in Stata.

    • "precision" (analytic/reliability) weights treat each weight as an inverse variance, giving normalization sum(w) / (sum(w)**2 - ddof * sum(w**2)). Like aweights in Stata.

  • ddof (int) – Delta degrees of freedom.

Returns:

The weighted covariance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 3.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="frequency")).item()
-0.5
>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 2.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="precision")).item()
-0.8
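
The two normalizations listed under weight_type can be written out directly. The sketch below is a plain-Python illustration of those formulas under the assumption of no null values; `weighted_cov_reference` is a hypothetical name, not the library's implementation.

```python
def weighted_cov_reference(x, y, w, weight_type="precision", ddof=1):
    """Weighted covariance using the normalizations described above."""
    sw = sum(w)
    mx = sum(wi * xi for xi, wi in zip(x, w)) / sw
    my = sum(wi * yi for yi, wi in zip(y, w)) / sw
    # Weighted sum of cross-products of deviations from the weighted means.
    cross = sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w))
    if weight_type == "frequency":
        norm = 1.0 / (sw - ddof)
    elif weight_type == "precision":
        sw2 = sum(wi ** 2 for wi in w)
        norm = sw / (sw ** 2 - ddof * sw2)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    return norm * cross


x, y = [0.0, 1.0, 2.0], [2.0, 1.0, 0.0]
# Frequency: cross = -2, norm = 1 / (5 - 1), giving -0.5.
freq = weighted_cov_reference(x, y, [1.0, 3.0, 1.0], weight_type="frequency")
# Precision: cross = -2, norm = 4 / (16 - 6), giving -0.8.
prec = weighted_cov_reference(x, y, [1.0, 2.0, 1.0], weight_type="precision")
print(round(freq, 10), round(prec, 10))  # -0.5 -0.8
```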

oi_tools.stats.weighted_mean(
x: Expr | str | int | float,
w: Expr | str | int | float | None = None,
) → Expr

Compute the weighted mean of an expression.

Rows where either x or w is null are omitted.

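Parameters:
  • x (Expr | str | int | float) – The expression to average.

  • w (Expr | str | int | float | None) – Weights.

Returns: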

The weighted mean.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0], "w": [1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75

Null values in either x or w are omitted:

>>> df = pl.DataFrame({"x": [0.0, None, 1.0], "w": [1.0, 1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75
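
The null-handling rule can be illustrated in plain Python, with `None` standing in for a Polars null. This is a sketch of the documented behavior, not the library's code, and `weighted_mean_reference` is a hypothetical name.

```python
def weighted_mean_reference(x, w):
    """Weighted mean over rows where neither x nor w is None."""
    pairs = [(xi, wi) for xi, wi in zip(x, w) if xi is not None and wi is not None]
    return sum(wi * xi for xi, wi in pairs) / sum(wi for _, wi in pairs)


# The middle row is dropped entirely, so its weight of 1.0 does not
# enter the denominator: (0 * 1 + 1 * 3) / (1 + 3) = 0.75, not 3 / 5.
print(weighted_mean_reference([0.0, None, 1.0], [1.0, 1.0, 3.0]))  # 0.75
```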

oi_tools.stats.weighted_rank(
x: str | Collection[str] | Selector | Expr | int | float,
w: str | Collection[str] | Selector | Expr | int | float,
*,
ties: Literal['arbitrary', 'average'] = 'average',
) → Expr

Compute the weighted quantile rank of an expression.

Parameters:
  • x (str | Collection[str] | Selector | Expr | int | float) – The values to rank.

  • w (str | Collection[str] | Selector | Expr | int | float) – The weights associated with each value.

  • ties (Literal['arbitrary', 'average']) –

    How to handle assigning quantiles in the case of ties:

    • "arbitrary": break ties arbitrarily,

    • "average": assign each unit the average rank of all units with the same x value.

Returns:

A Polars expression producing ranks in (0, 1).

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [1.0, 2.0], "w": [1.0, 3.0]})
>>> df.select(weighted_rank("x", "w")).to_series().to_list()
[0.125, 0.625]

Notes

Behavior is undefined if w contains null values.
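
The example output above is consistent with a midpoint convention: each unit's rank is the total weight strictly below it plus half its own weight, divided by the total weight. That convention is inferred from the example, not stated by the source. The sketch below assumes it, along with distinct x values (no ties) and no nulls; `weighted_rank_reference` is a hypothetical name.

```python
def weighted_rank_reference(x, w):
    """Midpoint weighted quantile rank, assuming all x values are distinct."""
    total = sum(w)
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    below = 0.0  # cumulative weight of units with smaller x
    for i in order:
        ranks[i] = (below + w[i] / 2) / total
        below += w[i]
    return ranks


# (0 + 0.5) / 4 = 0.125 and (1 + 1.5) / 4 = 0.625
print(weighted_rank_reference([1.0, 2.0], [1.0, 3.0]))  # [0.125, 0.625]
```

Under this convention, with strictly positive weights, every rank falls strictly between 0 and 1, matching the stated range (0, 1).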

oi_tools.stats.weighted_variance(
x: Expr | str | int | float,
w: Expr | str | int | float | None = None,
*,
weight_type: Literal['frequency', 'precision'] = 'precision',
ddof: int = 1,
) → Expr

Compute the weighted variance of an expression.

Rows where either x or w is null are omitted. See the documentation for np.cov (its fweights and aweights parameters) for more on the two weight types.

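Parameters:
  • x (Expr | str | int | float) – The expression whose variance is computed.

  • w (Expr | str | int | float | None) – Weights.

  • weight_type (Literal['frequency', 'precision']) – The type of weight. See weighted_covariance for the two normalizations.

  • ddof (int) – Delta degrees of freedom.

Returns: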

The weighted variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="frequency")).item()
0.5
>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 2.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="precision")).item()
0.8
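
Two properties help when choosing between the weight types. Frequency weights behave like physically repeating rows, and precision weights are unaffected by rescaling all weights by a constant. The plain-Python sketch below checks both against the examples above. `weighted_var_reference` is a hypothetical name, and the normalizations are the ones documented for weighted_covariance, which are assumed to apply here with y = x.

```python
import statistics


def weighted_var_reference(x, w, weight_type="precision", ddof=1):
    """Weighted variance, assuming no null values."""
    sw = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / sw
    ss = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    if weight_type == "frequency":
        return ss / (sw - ddof)
    sw2 = sum(wi ** 2 for wi in w)
    return ss * sw / (sw ** 2 - ddof * sw2)


x = [0.0, 1.0, 2.0]

# Frequency weights [1, 3, 1] match the sample variance of the expanded data.
freq = weighted_var_reference(x, [1.0, 3.0, 1.0], weight_type="frequency")
expanded = statistics.variance([0.0, 1.0, 1.0, 1.0, 2.0])
print(freq, expanded)  # 0.5 0.5

# Precision weights: doubling every weight leaves the result unchanged.
prec = weighted_var_reference(x, [1.0, 2.0, 1.0])
prec_doubled = weighted_var_reference(x, [2.0, 4.0, 2.0])
print(round(prec, 10), round(prec_doubled, 10))  # 0.8 0.8
```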