Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy was I wrong.
The task seems somewhat reasonable from the surface. Surely we can spot columns where the data is always in integers, or match the expected format of a date, right? If anything, we'll fall back to text or varchar and call it a day. Hell, even MongoDB's Compass does a great job of this by merely uploading a CSV... this has got to be some trivial task handled by third-party libraries by now.
For one reason or another, searching for a solution to this problem almost always comes up empty. Software developers probably have little need for dynamically generated tables if their applications run solely on self-defined models. Full-time Data Scientists have access to plenty of expensive tools which seem to claim this functionality, yet it all seems so... inaccessible.
Is This NOT a Job For Pandas?
From my experience, no. Pandas does offer hope but doesn't seem to get the job done quite right. Let's start with a dataset so you can see what I mean. Here's a bunch of fake identities I'll be using to mimic the outcome I experienced when working with real data:
id | initiated | hiredate | firstname | lastname | title | department | location | country | type | |
---|---|---|---|---|---|---|---|---|---|---|
100035435 | 2015-12-11T09:16:20.722-08:00 | 3/22/67 | GretchenRMorrow@jourrapide.com | Gretchen | Morrow | Power plant operator | Physical Product | Britling Cafeterias | United Kingdom | Employee |
100056435 | 2015-12-15T10:11:24.604-08:00 | 6/22/99 | ElizabethLSnow@armyspy.com | Elizabeth | Snow | Oxygen therapist | Physical Product | Grade A Investment | United States of America | Employee |
100037955 | 2015-12-16T14:31:32.765-08:00 | 5/31/74 | AlbertMPeterson@einrot.com | Albert | Peterson | Psychologist | Physical Product | Grass Roots Yard Services | United States of America | Employee |
100035435 | 2016-01-20T11:15:47.249-08:00 | 9/9/69 | JohnMLynch@dayrep.com | John | Lynch | Environmental hydrologist | Physical Product | Waccamaw's Homeplace | United States of America | Employee |
100057657 | 2016-01-21T12:45:38.261-08:00 | 4/9/83 | TheresaJCahoon@teleworm.us | Theresa | Cahoon | Personal chef | Physical Product | Cala Foods | United States of America | Employee |
100056747 | 2016-02-01T11:25:39.317-08:00 | 6/26/98 | KennethHPayne@dayrep.com | Kenneth | Payne | Central office operator | Frontline | Magna Consulting | United States of America | Employee |
100035435 | 2016-02-01T11:28:11.953-08:00 | 4/16/82 | LeifTSpeights@fleckens.hu | Leif | Speights | Staff development director | Frontline | Rivera Property Maintenance | United States of America | Employee |
100035435 | 2016-02-01T12:21:01.756-08:00 | 8/6/80 | JamesSRobinson@teleworm.us | James | Robinson | Scheduling clerk | Frontline | Discount Furniture Showcase | United States of America | Employee |
100074688 | 2016-02-01T13:29:19.147-08:00 | 12/14/74 | AnnaDMoberly@jourrapide.com | Anna | Moberly | Playwright | Physical Product | The Wiz | United States of America | Employee |
100665778 | 2016-02-04T14:40:05.223-08:00 | 9/13/66 | MarjorieBCrawford@armyspy.com | Marjorie | Crawford | Court, municipal, and license clerk | Physical Product | The Serendipity Dip | United Kingdom | Employee |
100876876 | 2016-02-24T12:39:25.872-08:00 | 12/19/67 | LyleCHackett@fleckens.hu | Lyle | Hackett | Airframe mechanic | Physical Product | Infinity Investment Plan | United States of America | Employee |
100658565 | 2016-02-29T15:52:12.933-08:00 | 11/17/83 | MaryJDensmore@jourrapide.com | Mary | Densmore | Employer relations representative | Frontline | One-Up Realtors | United States of America | Employee |
100766547 | 2016-03-01T12:32:53.357-08:00 | 10/1/87 | CindyRDiaz@armyspy.com | Cindy | Diaz | Student affairs administrator | Physical Product | Mr. AG's | United States of America | Employee |
100045677 | 2016-03-02T12:07:44.264-08:00 | 8/16/65 | AndreaTLigon@einrot.com | Andrea | Ligon | Railroad engineer | Central Growth | Robinson Furniture | United States of America | Employee |
There are some juicy datatypes in there: integers , timestamps , dates , strings.... and those are only the first four columns! Let's load this thing into a DataFrame and see what information we can get that way:
import pandas as pd csv = 'data/fake.csv' workers_df = pd.read_csv(csv, header=0, encoding='utf-8') meta = workers_df.info(verbose=True) print(meta)
Using Pandas' info()
should do the trick! This returns a list of columns and their data types:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 22 entries, 0 to 21 Data columns (total 11 columns): id 14 non-null float64 initiated 14 non-null object hiredate 14 non-null object email 14 non-null object firstname 14 non-null object lastname 14 non-null object title 14 non-null object department 14 non-null object location 14 non-null object country 14 non-null object type 14 non-null object dtypes: float64(1), object(10) memory usage: 2.0+ KB None
...Or not. What is this garbage? Only one of our 11 columns identified a data type, and it was incorrectly listed as a float! Okay, so maybe Pandas doesn't have a secret one-liner for this. So who does?
What about PySpark?
It's always been a matter of time before we'd turn to Apache's family of aged data science products. Hadoop, Spark, Kafka... all of them have a particular musty stench about them that tastes like "I feel like I should be writing in Java right now." Heads up: they do want you to write in Java. Misery loves company.
Nonetheless, PySpark does support reading data as DataFrames in Python, and also comes with the elusive ability to infer schemas. Installing Hadoop and Spark locally still kind of sucks for solving this one particular problem. Cue Databricks: a company that spun off from the Apache team way back in the day, and offers free cloud notebooks integrated with- you guessed it: Spark.
With Databricks, we can upload our CSV and load it into a DataFrame by spinning up a free notebook. The source looks something like this:
# File location and type file_location = "/FileStore/tables/fake.csv" file_type = "csv" # CSV options infer_schema = "true" first_row_is_header = "true" delimiter = "," df = spark.read.format(file_type) \ .option("inferSchema", infer_schema) \ .option("header", first_row_is_header) \ .option("sep", delimiter) \ .load(file_location) display(df)
Let's see out the output looks:
df:pyspark.sql.dataframe.DataFrame id:integer initiated:timestamp hiredate:string email:string firstname:string lastname:string title:string department:string location:string country:string type:string
Not bad! We correctly 'upgraded' our ID from float to integer, and we managed to get the timestamp correct also. With a bit of messing around, we could probably have even gotten the date correct too, given that we stated the format beforehand.
And Yet, This Still Kind of Sucks
Even though we can solve our problem in a notebook, we still haven't solved the use case: I want a drop-in solution to create tables out of CSVs... whenever I want! I want to accomplish this while writing any app, at the drop of a hat without warning. I don't want to install Hadoop and have Java errors coming back at me through my terminal. Don't EVER let me see Java in my terminal. UGH:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.count. : java.lang.IllegalArgumentException: Unsupported class file major version 55 at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166) at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148) at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136) at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500) at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
Python's "tableschema" Library
Thankfully, there's at least one other person out there who has shared this desire. That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy. Let's give it a go:
import csv from tableschema import Table data = 'data/fake.csv' schema = infer(data, limit=500, headers=1, confidence=0.85) print(schema)
If our dataset is particularly large, we can use the limit
attribute to limit the sample size to the first X number of rows. Another nice feature is the confidence
attribute: a 0-1 ratio for allowing casting errors during the inference. Here's what comes back:
{ "fields": [{ "name": "id", "type": "integer", "format": "default" }, { "name": "initiated", "type": "string", "format": "default" }, { "name": "hiredate", "type": "date", "format": "default" }, { "name": "email", "type": "string", "format": "default" }, { "name": "firstname", "type": "string", "format": "default" }, { "name": "lastname", "type": "string", "format": "default" }, { "name": "title", "type": "string", "format": "default" }, { "name": "department", "type": "string", "format": "default" }, { "name": "location", "type": "string", "format": "default" }, { "name": "country", "type": "string", "format": "default" }, { "name": "type", "type": "string", "format": "default" }], "missingValues": [""] }
Hey, that's good enough for me! Now let's automate the shit out this.
Creating a Table in SQLAlchemy With Our New Schema
I'm about to throw a bunch in your face right here. Here's a monster of a class:
from sqlalchemy import create_engine import config import pandas as pd import psycopg2 from tableschema import Table, infer, Schema from functions.recursivejson import extract_values from sqlalchemy.types import Integer, Text, Date class CreateTablesFromCSVs: """Infer a table schema from a CSV.""" __uri = config.PG_URI __engine = create_engine(__uri, convert_unicode=True, echo=True) __data = 'data/fake.csv' @classmethod def get_data(cls): """Pull latest data.""" test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8') return test_df @classmethod def get_schema_from_csv(cls, csv): """Infers schema from CSV.""" table = Table(csv) table.infer(limit=500, confidence=0.55) schema = table.schema.descriptor names = cls.get_column_names(schema, 'name') datatypes = cls.get_column_datatypes(schema, 'type') schema_dict = dict(zip(names, datatypes)) return schema_dict @classmethod def get_column_names(cls, schema, key): """Get names of columns.""" names = extract_values(schema, key) return names @classmethod def get_column_datatypes(cls, schema, key): """Convert schema to recognizable by SQLAlchemy.""" values = extract_values(schema, key) for i, value in enumerate(values): if value == 'integer': values[i] = Integer elif value == 'string': values[i] = Text elif value == 'date': values[i] = Date return values @classmethod def create_new_table(cls, data, schema): """Create new table from CSV and generated schema.""" workday_table.to_sql('faketable', con=cls.__engine, schema='testschema', if_exists='replace', chunksize=300, dtype=schema) data = CreateTablesFromCSVs.get_schema_from_csv() schema = CreateTablesFromCSVs.get_schema_from_csv(data) CreateTablesFromCSVs.create_new_table(data, schema)
The first thing worth mentioning is I'm importing a function from my personal secret library to extract values from JSON objects. I've spoken about it before.
Let's break down this class:
-
get_data()
reads our CSV into a Pandas DataFrame. -
get_schema_from_csv()
kicks off building a Schema that SQLAlchemy can use to build a table. -
get_column_names()
simply pulls column names as half our schema. -
get_column_datatypes()
manually replaces the datatype names we received from tableschema and replaces them with SQLAlchemy datatypes. -
create_new_table
Uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped.
Promising Potential, Room to Grow
While tableschema works some of the time, it isn't perfect. The base of what we accomplish still stands: we now have a reliable formula for how we would create schemas on the fly if we trust our schemas to be accurate.
Just wait until next time when we introduce Google BigQuery into the mix.
Top comments (0)