Reading and Writing Files

IO utilities for reading fixed-width, line-based, and HTML files.

For reading Stata (.dta) and SAS (.sas7bdat, .xpt) files, see the excellent polars_readstat package written by Jon Rothbaum at Census.

API Documentation

oi_tools.io.read_fwf(
file: str | Path,
cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
*,
infer_schema: bool = True,
infer_schema_length: int = 100,
**kwargs,
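) DataFrame

An eager version of oi_tools.io.scan_fwf(). See that function for documentation.

Parameters:
  • file (str | Path)

  • cols (Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]])

  • infer_schema (bool)

  • infer_schema_length (int)

Return type:

polars.DataFrame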

oi_tools.io.read_html(
source: str | TextIO | BinaryIO,
*,
header: str | Sequence[str] | None = './/thead[1]',
table_index: int | None = None,
table_xpath: str | None = None,
table_css: str | None = None,
encoding: str = 'utf-8',
) DataFrame

Read an HTML table into a DataFrame.

Parameters:
  • source (str | TextIO | BinaryIO) – A URL string (fetched with httpx), or a file-like object (text or binary) containing HTML.

  • header (str | Sequence[str] | None) –

    Controls how column names are determined.

    • str (default ".//thead[1]"): an XPath expression whose text nodes are joined to form column names.

    • Sequence[str]: explicit column names supplied by the caller.

    • None: no header; columns are numbered column_0, column_1, …

  • table_index (int | None) – Zero-based index of the table to parse when the selector matches multiple tables. Leave as None when table_xpath or table_css already identifies exactly one table.

  • table_xpath (str | None) – XPath expression used to locate the table element. Defaults to "descendant-or-self::table". Mutually exclusive with table_css.

  • table_css (str | None) – CSS selector used to locate the table element (converted to XPath internally). Mutually exclusive with table_xpath.

  • encoding (str) – Character encoding used to decode a binary source. Ignored when source is a text stream (already decoded) or a URL (the response is decoded by httpx).

Returns:

A DataFrame with one column per table column and one row per <tr> in the <tbody>.

Return type:

polars.DataFrame

Raises:
  • ValueError – If no table matches the selector, if multiple tables match and table_index is not provided, or if both table_xpath and table_css are specified.

  • TypeError – If source is not a URL string or a file-like object.
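
The three header modes described above can be pictured with a rough standard-library sketch. This is not the library's implementation: parse_table and the sample markup are hypothetical, only the first table is considered, and the input must be well-formed XML because xml.etree.ElementTree is not a general HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup used only for illustration.
HTML = """
<html><body>
<table id="summary-table">
  <thead><tr><th>name</th><th>value</th><th>unit</th></tr></thead>
  <tbody>
    <tr><td>length</td><td>12</td><td>m</td></tr>
    <tr><td>mass</td><td>3</td><td>kg</td></tr>
  </tbody>
</table>
</body></html>
"""


def parse_table(html, header=".//thead[1]"):
    """Return (column_names, rows) for the first table in `html`."""
    root = ET.fromstring(html.strip())
    table = root.find(".//table")
    if table is None:
        raise ValueError("no table matches the selector")

    # One row per <tr> inside <tbody>; cell text nodes are joined.
    rows = [
        ["".join(td.itertext()).strip() for td in tr]
        for tr in table.findall("./tbody/tr")
    ]

    if header is None:
        # No header: number the columns column_0, column_1, ...
        width = len(rows[0]) if rows else 0
        names = [f"column_{i}" for i in range(width)]
    elif isinstance(header, str):
        # A string is treated as an XPath expression locating the header.
        head = table.find(header)
        names = ["".join(th.itertext()).strip() for th in head.iter("th")]
    else:
        # Any other sequence is taken as explicit column names.
        names = list(header)
    return names, rows


names, rows = parse_table(HTML)
auto_names, _ = parse_table(HTML, header=None)
explicit_names, _ = parse_table(HTML, header=["a", "b", "c"])
```

The string check comes before the generic sequence case because a str is itself a sequence of characters.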

Examples

Read the only table from a URL:

>>> from oi_tools.io import read_html
>>> df = read_html("https://example.com/table.html")

Read the second table from a local HTML file:

>>> with open("report.html") as f:
...     df = read_html(f, table_index=1)

Select a table by CSS selector and supply column names explicitly:

>>> with open("report.html", "rb") as f:
...     df = read_html(
...         f,
...         table_css="#summary-table",
...         header=["name", "value", "unit"],
...     )

oi_tools.io.read_lines(
file: str | Path,
col_name: str = 'line',
col_dtype: PolarsDataType = String,
**kwargs,
) pl.DataFrame

An eager version of oi_tools.io.scan_lines(). See that function for documentation.

Parameters:
  • file (str | Path)

  • col_name (str)

  • col_dtype (PolarsDataType)

  • kwargs

Return type:

pl.DataFrame

oi_tools.io.scan_fwf(
file: str | Path,
cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
*,
infer_schema: bool = True,
infer_schema_length: int = 100,
**kwargs,
) LazyFrame

Lazily read a fixed-width file.

Parameters:
  • file (str | Path) – The file to read.

  • cols (Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]]) – The locations of the relevant columns in the fixed-width file. Either a mapping from column names to (start, end) pairs or a sequence of (name, width) pairs.

  • infer_schema (bool) – Whether to infer column data types from the first infer_schema_length rows. If False, all columns are returned as pl.String.

  • infer_schema_length (int) – Number of rows to use for schema inference. Only used when infer_schema=True.

  • kwargs – Other kwargs to pass to pl.scan_csv.

Returns:

A LazyFrame with one column per entry in cols.

Return type:

polars.LazyFrame
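
To see why infer_schema_length matters, here is a conceptual sketch of prefix-based type inference. It is not Polars's inference algorithm: infer_type and infer_column_types are hypothetical helpers that only try int, then float, then fall back to str, looking at no more than the first infer_schema_length rows.

```python
def infer_type(values):
    """Return int, float, or str: the first type every value parses as."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue  # at least one value failed; try the next type
        return caster
    return str


def infer_column_types(rows, infer_schema_length=100):
    """Infer one type per column from the first `infer_schema_length` rows."""
    sample = rows[:infer_schema_length]
    columns = list(zip(*sample))  # transpose rows into columns
    return [infer_type(column) for column in columns]


rows = [
    ["1", "2.5", "a"],
    ["2", "3", "b"],
    ["x", "4", "c"],  # the first column stops being numeric here
]

short_sample = infer_column_types(rows, infer_schema_length=2)
full_sample = infer_column_types(rows, infer_schema_length=3)
```

With a two-row sample the first column looks like an integer, although the third row holds "x". A larger sample, or infer_schema=False to keep every column as a string, avoids a type guess that later rows contradict.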

Examples

Read the file based on column start/end positions:

>>> from oi_tools.io import scan_fwf
>>> lf = scan_fwf(
...     "data.txt", cols={"name": (0, 20), "age": (20, 23)}
... )

If the columns are packed contiguously with no gaps between them, you can instead specify their locations using (name, width) pairs:

>>> lf = scan_fwf("data.txt", cols=[("name", 20), ("age", 3)])
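
The two cols forms in the examples above describe the same layout. A minimal sketch of the conversion follows, assuming zero-based, end-exclusive offsets (which is what makes the two examples agree). widths_to_spans and slice_line are hypothetical helpers, and treating a None name as a column to skip is an assumption suggested by the type signature, not something the documentation states.

```python
from itertools import accumulate


def widths_to_spans(cols):
    """Convert (name, width) pairs into a {name: (start, end)} mapping."""
    widths = [width for _, width in cols]
    ends = list(accumulate(widths))  # running total gives each end offset
    starts = [0] + ends[:-1]         # each column starts where the last ended
    return {
        name: (start, end)
        for (name, _), start, end in zip(cols, starts, ends)
        if name is not None  # ASSUMPTION: unnamed columns are skipped
    }


def slice_line(line, spans):
    """Cut one fixed-width line into stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in spans.items()}


spans = widths_to_spans([("name", 20), ("age", 3)])
record = slice_line("Ada Lovelace".ljust(20) + " 36", spans)
with_gap = widths_to_spans([("id", 2), (None, 3), ("flag", 1)])
```

Here spans equals the mapping used in the first example, {"name": (0, 20), "age": (20, 23)}.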

oi_tools.io.scan_lines(
file: str | Path,
col_name: str = 'line',
col_dtype: PolarsDataType = String,
**kwargs,
) pl.LazyFrame

Read a newline-delimited text file into a single-column LazyFrame.

Parameters:
  • file (str | Path) – The file to read.

  • col_name (str) – The name to assign to the column.

  • col_dtype (PolarsDataType) – The data type to assign the single column. String by default.

  • kwargs – Other kwargs to pass to pl.scan_csv.

Returns:

A LazyFrame with a single column named col_name.

Return type:

polars.LazyFrame

Examples

Read every line of a plain-text file:

>>> from oi_tools.io import scan_lines
>>> lf = scan_lines("data.txt")
>>> lf.collect()

Read with a custom column name and limit to the first 1,000 rows:

>>> lf = scan_lines("data.txt", col_name="record", n_rows=1_000)
>>> lf.collect()
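
The difference between a lazy scan and an eager read can be pictured with a plain-Python generator. This is only an analogy, not how scan_lines is implemented: iter_lines is a hypothetical helper that yields one single-field record per line, and its n_rows argument mimics the row limit used in the example above.

```python
import itertools
import tempfile
from pathlib import Path


def iter_lines(path, col_name="line", n_rows=None):
    """Lazily yield one {col_name: text} record per line of `path`."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n_rows lines; n_rows=None reads everything.
        for raw in itertools.islice(f, n_rows):
            yield {col_name: raw.rstrip("\n")}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    path.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")

    lazy = iter_lines(path, col_name="record", n_rows=2)  # nothing read yet
    records = list(lazy)  # consuming the generator is the analogue of .collect()
    everything = list(iter_lines(path))
```

Creating the generator does no I/O; the file is opened only when the records are consumed, much as a LazyFrame does no work until it is collected.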