Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. The goal of Pandera is to make data processing pipelines more readable and robust with statistically typed dataframes.
Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:
With pip:

    pip install 'pandera[pandas]'

With uv:

    uv pip install 'pandera[pandas]'

With conda:

    conda install -c conda-forge pandera-pandas

First, create a dataframe:

    import pandas as pd
    import pandera.pandas as pa

    # data to validate
    df = pd.DataFrame({
        "column1": [1, 2, 3],
        "column2": [1.1, 1.2, 1.3],
        "column3": ["a", "b", "c"],
    })

Validate the data using the object-based API:

    # define a schema
    schema = pa.DataFrameSchema({
        "column1": pa.Column(int, pa.Check.ge(0)),
        "column2": pa.Column(float, pa.Check.lt(10)),
        "column3": pa.Column(
            str,
            [
                pa.Check.isin([*"abc"]),
                pa.Check(lambda series: series.str.len() == 1),
            ]
        ),
    })

    print(schema.validate(df))
    #    column1  column2 column3
    # 0        1      1.1       a
    # 1        2      1.2       b
    # 2        3      1.3       c

Or validate the data using the class-based API:

    # define a schema
    class Schema(pa.DataFrameModel):
        column1: int = pa.Field(ge=0)
        column2: float = pa.Field(lt=10)
        column3: str = pa.Field(isin=[*"abc"])

        @pa.check("column3")
        def custom_check(cls, series: pd.Series) -> pd.Series:
            return series.str.len() == 1

    print(Schema.validate(df))
    #    column1  column2 column3
    # 0        1      1.1       a
    # 1        2      1.2       b
    # 2        3      1.3       c

Warning
Pandera v0.24.0 introduces the pandera.pandas module, which is now the (highly) recommended way of defining DataFrameSchemas and DataFrameModels for pandas data structures like DataFrames. Defining a dataframe schema from the top-level pandera module will produce a FutureWarning:

    import pandera as pa

    schema = pa.DataFrameSchema({"col": pa.Column(str)})

Update your import to:

    import pandera.pandas as pa

The rest of your pandera code should then work unchanged. Access to DataFrameSchema and the other pandera classes and functions through the top-level pandera module will be deprecated in version 0.29.0.
See the official documentation to learn more.
