Reading and Writing Files

IO utilities for reading fixed-width, line-based, and HTML files.

For reading Stata (.dta) and SAS (.sas7bdat, .xpt) files, see the excellent polars_readstat package written by Jon Rothbaum at Census.

API Documentation

oi_tools.io.read_fwf(
    file: str | Path,
    cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
    *,
    infer_schema: bool = True,
    infer_schema_length: int = 100,
    **kwargs,
) → DataFrame

An eager version of oi_tools.io.scan_fwf(). See that function for documentation.

Parameters:

file (str | Path)
cols (Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]])
infer_schema (bool)
infer_schema_length (int)

Return type:

DataFrame

oi_tools.io.read_html()

Read an HTML table into a DataFrame.

Parameters:

source (str | TextIO | BinaryIO) – A URL string (fetched with httpx), or a file-like object (text or binary) containing HTML.
header (str | Sequence[str] | None) – Controls how column names are determined:
- str (default ".//thead[1]"): an XPath expression whose text nodes are joined to form column names.
- Sequence[str]: explicit column names supplied by the caller.
- None: no header; columns are numbered column_0, column_1, …
table_index (int | None) – Zero-based index of the table to parse when the selector matches multiple tables. Not needed when table_xpath or table_css already identifies a single table.
table_xpath (str | None) – XPath expression used to locate the table element. Defaults to "descendant-or-self::table". Mutually exclusive with table_css.
table_css (str | None) – CSS selector used to locate the table element (converted to XPath internally). Mutually exclusive with table_xpath.
encoding (str) – Character encoding used to decode a binary source. Ignored when source is a text stream, which is already decoded, or a URL, whose response is decoded by httpx.

Returns:

A DataFrame with one column per table column and one row per <tr> in the <tbody>.

Return type:

polars.DataFrame

Raises:

ValueError – If no table matches the selector, if multiple tables match and table_index is not provided, or if both table_xpath and table_css are specified.
TypeError – If source is not a URL string or a file-like object.

Examples

Read the only table from a URL:

>>> df = read_html("https://example.com/table.html")

Read the second table from a local HTML file:

>>> with open("report.html") as f:
...     df = read_html(f, table_index=1)

Select a table by CSS selector and supply column names explicitly:

>>> with open("report.html", "rb") as f:
...     df = read_html(
...         f,
...         table_css="#summary-table",
...         header=["name", "value", "unit"],
...     )
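
The selection and extraction rules described above can be sketched with the standard library alone. This is only an illustration, not the read_html implementation: it assumes well-formed markup that xml.etree.ElementTree can parse, uses that module's limited XPath subset, and the helper names pick_table, cell_text, and table_to_columns are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# A small, well-formed document used only for this illustration.
HTML = """
<html><body>
  <table id="summary-table">
    <thead><tr><th>name</th><th>value</th><th>unit</th></tr></thead>
    <tbody>
      <tr><td>length</td><td>12</td><td>cm</td></tr>
      <tr><td>mass</td><td>3</td><td>kg</td></tr>
    </tbody>
  </table>
</body></html>
"""


def pick_table(root: ET.Element, table_index: int | None = None) -> ET.Element:
    """Apply the documented selection rules to the tables found under root."""
    tables = root.findall(".//table")
    if not tables:
        raise ValueError("no table matches the selector")
    if table_index is not None:
        return tables[table_index]
    if len(tables) > 1:
        raise ValueError("multiple tables match; pass table_index")
    return tables[0]


def cell_text(cell: ET.Element) -> str:
    """Join all text nodes inside a cell and trim surrounding whitespace."""
    return "".join(cell.itertext()).strip()


def table_to_columns(table: ET.Element) -> dict[str, list[str]]:
    """Return {column name: values}, with one value per <tr> in <tbody>."""
    names = [cell_text(th) for th in table.findall("./thead/tr/th")]
    rows = [
        [cell_text(td) for td in tr.findall("./td")]
        for tr in table.findall("./tbody/tr")
    ]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


root = ET.fromstring(HTML.strip())
columns = table_to_columns(pick_table(root))
```

Here columns maps each header cell to the list of body values beneath it, which is the shape a DataFrame constructor would accept. Real-world HTML is often not well-formed XML, so a dedicated HTML parser would be needed in practice.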

oi_tools.io.read_lines(
    file: str | Path,
    col_name: str = 'line',
    col_dtype: PolarsDataType = String,
    **kwargs,
) → pl.DataFrame

An eager version of oi_tools.io.scan_lines(). See that function for documentation.

Parameters:

file (str | Path)
col_name (str)
col_dtype (PolarsDataType)

Return type:

pl.DataFrame

oi_tools.io.scan_fwf(
    file: str | Path,
    cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
    *,
    infer_schema: bool = True,
    infer_schema_length: int = 100,
    **kwargs,
) → LazyFrame

Lazily read a fixed-width file.

Parameters:

file (str | Path) – The file to read.
cols (Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]]) – The locations of the relevant columns in the fixed-width file. Either a mapping from column names to (start, end) pairs or a sequence of (name, width) pairs.
infer_schema (bool) – Whether to infer column data types from the first infer_schema_length rows. If False, all columns are returned as pl.String.
infer_schema_length (int) – Number of rows to use for schema inference. Only used when infer_schema=True.
kwargs – Other kwargs to pass to pl.scan_csv.

Returns:

A LazyFrame with one column per entry in cols.

Return type:

polars.LazyFrame

Examples

Read the file based on column start/end positions:

>>> lf = scan_fwf(
...     "data.txt", cols={"name": (0, 20), "age": (20, 23)}
... )

If the columns are packed densely, you can also specify their locations using (name, width) pairs:

>>> lf = scan_fwf("data.txt", cols=[("name", 20), ("age", 3)])
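
The two forms of cols shown above describe the same layout if (start, end) offsets are zero-based and end-exclusive. The sketch below converts one form into the other and slices a single record under that assumption. It also assumes that a name of None marks filler characters to skip, which the documentation does not state. The helpers widths_to_spans and slice_line are hypothetical and are not part of oi_tools.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def widths_to_spans(
    cols: Sequence[tuple[str | None, int]],
) -> dict[str, tuple[int, int]]:
    """Convert (name, width) pairs into {name: (start, end)} spans.

    Assumption: offsets are zero-based and end-exclusive, and a name of
    None means "skip these characters".
    """
    spans: dict[str, tuple[int, int]] = {}
    offset = 0
    for name, width in cols:
        if name is not None:
            spans[name] = (offset, offset + width)
        offset += width  # advance past the field whether or not it is kept
    return spans


def slice_line(line: str, spans: Mapping[str, tuple[int, int]]) -> dict[str, str]:
    """Cut one fixed-width record into whitespace-stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in spans.items()}


spans = widths_to_spans([("name", 20), ("age", 3)])
record = slice_line("Ada Lovelace".ljust(20) + " 36", spans)
```

With these assumptions, spans matches the mapping passed in the first example, and record holds the fields as strings, comparable to what infer_schema=False is documented to return.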

oi_tools.io.scan_lines(
    file: str | Path,
    col_name: str = 'line',
    col_dtype: PolarsDataType = String,
    **kwargs,
) → pl.LazyFrame

Read a newline-delimited text file into a single-column LazyFrame.

Parameters:

file (str | Path) – The file to read.
col_name (str) – The name to assign the column.
col_dtype (PolarsDataType) – The data type to assign the single column. String by default.
kwargs – Other kwargs to pass to pl.scan_csv.

Returns:

A LazyFrame with a single column named col_name.

Return type:

polars.LazyFrame

Examples

Read every line of a plain-text file:

>>> lf = scan_lines("data.txt")
>>> lf.collect()

Read with a custom column name and limit to the first 1,000 rows:

>>> lf = scan_lines("data.txt", col_name="record", n_rows=1_000)
>>> lf.collect()