Reading and Writing Files
IO utilities for reading fixed-width, line-based, and HTML files.
For reading Stata (.dta) and SAS (.sas7bdat, .xpt) files, see the excellent polars_readstat package written by Jon Rothbaum at Census.
API Documentation
- oi_tools.io.read_fwf(
- file: str | Path,
- cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
- *,
- infer_schema: bool = True,
- infer_schema_length: int = 100,
- **kwargs,
An eager version of oi_tools.io.scan_fwf(). See that function for documentation.
- oi_tools.io.read_html(
- source: str | TextIO | BinaryIO,
- *,
- header: str | Sequence[str] | None = './/thead[1]',
- table_index: int | None = None,
- table_xpath: str | None = None,
- table_css: str | None = None,
- encoding: str = 'utf-8',
Read an HTML table into a DataFrame.
- Parameters:
source (str | TextIO | BinaryIO) – A URL string (fetched with httpx), or a file-like object (text or binary) containing HTML.
header (str | Sequence[str] | None) – Controls how column names are determined.
- str (default ".//thead[1]"): an XPath expression whose text nodes are joined to form column names.
- Sequence[str]: explicit column names supplied by the caller.
- None: no header; columns are numbered column_0, column_1, …
table_index (int | None) – Zero-based index of the table to parse when the selector matches multiple tables. Mutually exclusive with table_xpath/table_css uniquely identifying a single table.
table_xpath (str | None) – XPath expression used to locate the table element. Defaults to "descendant-or-self::table". Mutually exclusive with table_css.
table_css (str | None) – CSS selector used to locate the table element (converted to XPath internally). Mutually exclusive with table_xpath.
encoding (str) – Character encoding used to decode a binary source. Ignored when source is already a text stream or URL (decoded by httpx).
- Returns:
A DataFrame with one column per table column and one row per <tr> in the <tbody>.
- Return type:
pl.DataFrame
- Raises:
ValueError – If no table matches the selector, if multiple tables match and table_index is not provided, or if both table_xpath and table_css are specified.
TypeError – If source is not a URL string or a file-like object.
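The selection rules in the Raises section can be sketched as a small stand-alone function. This is a hypothetical helper written only to illustrate the documented behaviour; `pick_table` and its error messages are assumptions and are not part of `oi_tools`.

```python
from __future__ import annotations


def pick_table(tables, table_index=None, table_xpath=None, table_css=None):
    """Hypothetical helper mirroring the documented ValueError conditions.

    `tables` stands in for whatever elements the selector matched.
    """
    # Both selectors at once is documented as an error.
    if table_xpath is not None and table_css is not None:
        raise ValueError("specify at most one of table_xpath and table_css")
    # An empty match is documented as an error.
    if not tables:
        raise ValueError("no table matches the selector")
    # An explicit index resolves ambiguity between several matches.
    if table_index is not None:
        return tables[table_index]
    # Several matches without an index is documented as an error.
    if len(tables) > 1:
        raise ValueError("multiple tables match; pass table_index")
    return tables[0]


chosen = pick_table(["first", "second"], table_index=1)
```

The order of the checks here is a design guess; the documentation only states which conditions raise, not which is tested first.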
Examples
Read the only table from a URL:
>>> df = read_html("https://example.com/table.html")
Read the second table from a local HTML file:
>>> with open("report.html") as f:
...     df = read_html(f, table_index=1)
Select a table by CSS selector and supply column names explicitly:
>>> with open("report.html", "rb") as f:
...     df = read_html(
...         f,
...         table_css="#summary-table",
...         header=["name", "value", "unit"],
...     )
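To make the header and row rules concrete, here is a minimal standard-library sketch of pulling column names from `<thead>` and one row per `<tr>` in `<tbody>`. It is an approximation for illustration only: the docs above indicate that `read_html` selects elements with XPath, whereas this sketch uses `html.parser`, and the class name `TableSketch` and the sample markup are assumptions.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TableSketch(HTMLParser):
    """Collect header cells from <thead> and data rows from <tbody>."""

    def __init__(self) -> None:
        super().__init__()
        self.header: list[str] = []
        self.rows: list[list[str]] = []
        self._section: str | None = None  # "thead", "tbody", or None
        self._in_cell = False
        self._row: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tbody"):
            self._section = tag
        elif tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a cell contributes to the current row.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr":
            cells = [cell.strip() for cell in self._row]
            if self._section == "thead":
                self.header = cells
            elif self._section == "tbody":
                self.rows.append(cells)
        elif tag in ("thead", "tbody"):
            self._section = None


html = """<table>
<thead><tr><th>name</th><th>value</th></tr></thead>
<tbody><tr><td>a</td><td>1</td></tr><tr><td>b</td><td>2</td></tr></tbody>
</table>"""

parser = TableSketch()
parser.feed(html)
# With no header, fall back to numbered names, as header=None is documented to do.
columns = parser.header or [f"column_{i}" for i in range(len(parser.rows[0]))]
```

The sketch handles a single, well-formed table; nested tables and multiple `<thead>` sections are deliberately out of scope.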
- oi_tools.io.read_lines( ) pl.LazyFrame
An eager version of oi_tools.io.scan_lines(). See that function for documentation.
- oi_tools.io.scan_fwf(
- file: str | Path,
- cols: Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]],
- *,
- infer_schema: bool = True,
- infer_schema_length: int = 100,
- **kwargs,
Lazily read a fixed-width file.
- Parameters:
cols (Mapping[str, tuple[int, int]] | Sequence[tuple[str | None, int]]) – The locations of the relevant columns in the fixed-width file. Either a mapping from column names to (start, end) pairs or a sequence of (name, width) pairs.
infer_schema (bool) – Whether to infer column data types from the first infer_schema_length rows. If False, all columns are returned as pl.String.
infer_schema_length (int) – Number of rows to use for schema inference. Only used when infer_schema=True.
kwargs – Other kwargs to pass to pl.scan_csv.
- Returns:
A LazyFrame with one column per entry in cols.
- Return type:
pl.LazyFrame
Examples
Read the file based on column start/end positions:
>>> lf = scan_fwf(
...     "data.txt", cols={"name": (0, 20), "age": (20, 23)}
... )
If the columns are packed densely, you can also specify their locations using (name, width) pairs:
>>> lf = scan_fwf("data.txt", cols=[("name", 20), ("age", 3)])
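The two ways of specifying `cols` in the examples above describe the same layout. The sketch below shows one way the (name, width) form could be translated into (start, end) ranges and applied to a single line. The helper names `widths_to_ranges` and `slice_line` are hypothetical, and three points are assumptions inferred from the examples rather than stated in the docs: offsets are zero-based, `end` is exclusive, and a name of `None` marks a field to skip.

```python
from __future__ import annotations

from collections.abc import Mapping, Sequence


def widths_to_ranges(
    cols: Sequence[tuple[str | None, int]],
) -> dict[str, tuple[int, int]]:
    """Convert (name, width) pairs into a name -> (start, end) mapping."""
    ranges: dict[str, tuple[int, int]] = {}
    offset = 0
    for name, width in cols:
        if name is not None:  # assumption: None means "skip this field"
            ranges[name] = (offset, offset + width)
        offset += width  # skipped fields still advance the offset
    return ranges


def slice_line(line: str, ranges: Mapping[str, tuple[int, int]]) -> dict[str, str]:
    """Cut one fixed-width line into whitespace-stripped string fields."""
    return {name: line[start:end].strip() for name, (start, end) in ranges.items()}


ranges = widths_to_ranges([("name", 20), ("age", 3)])
record = slice_line("Ada Lovelace".ljust(20) + " 36", ranges)
```

Under these assumptions, `ranges` matches the mapping used in the first `scan_fwf` example, which is why the dense form is a convenient shorthand when no bytes need to be skipped.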
- oi_tools.io.scan_lines( ) pl.LazyFrame
Read a newline-delimited text file into a single-column LazyFrame.
- Parameters:
file (str | Path) – The file to read.
col_name (str) – The name to assign the column.
col_dtype (PolarsDataType) – The data type to assign the single column. String by default.
kwargs – Other kwargs to pass to pl.scan_csv.
- Returns:
A LazyFrame with a single column named col_name.
- Return type:
pl.LazyFrame
Examples
Read every line of a plain-text file:
>>> lf = scan_lines("data.txt")
>>> lf.collect()
Read with a custom column name and limit to the first 1 000 rows:
>>> lf = scan_lines("data.txt", col_name="record", n_rows=1_000)
>>> lf.collect()