CSV Processor

CSV Processor#

This page explains how to Load and Process CSV files CSV files that are saved in the local filesystem or remote. For newcomers, CSV (Comma-Separated Values) is a simple, text-based file format used for storing tabular data. Each line in a CSV file represents a row of data, with individual values separated by commas.

👉 New to App-Generator? Sign IN with GitHub or Generate Web Apps in no time (free service).

CSV files are widely used for data exchange between different software applications due to their simplicity and compatibility.

Regarding our processor, here is the list with all supported operations:

  • load local and remote files

  • print values

  • print column types

  • print the mapping types to a Django Model

Quick Start#

The parser can be used using the CLI and the files shipped in the media directory.

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect.json 

The tool performs the following tasks:

  • validate the input

  • locate the CSV file (exit with error if not found)

  • loads the information and detects the column types

  • detects the Django column type

  • print the first 10 rows

The same can be applied for local and remote files. For instance, we can analyze the notorious Titanic.cvs by running this one-liner:

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect_distant.json # Output > Processing .\media\tool_inspect\csv_inspect_distant.json  |-- file: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv  |-- type: csv Field CSV Type Django Types ----------- ---------- ------------------------------------------ PassengerId int64 models.IntegerField(blank=True, null=True) Survived int64 models.IntegerField(blank=True, null=True) Pclass int64 models.IntegerField(blank=True, null=True) Name object models.TextField(blank=True, null=True) Sex object models.TextField(blank=True, null=True) Age float64 models.FloatField(blank=True, null=True) SibSp int64 models.IntegerField(blank=True, null=True) Parch int64 models.IntegerField(blank=True, null=True) Ticket object models.TextField(blank=True, null=True) Fare float64 models.FloatField(blank=True, null=True) Cabin object models.TextField(blank=True, null=True) Embarked object models.TextField(blank=True, null=True) [1] - PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked [2] - 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S [3] - 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C [4] - 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S [5] - 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S [6] - 5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S [7] - 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q [8] - 7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S [9] - 8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S [10] - 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S ... (truncated output) 

Source Code#

The source for the above can be found on GitHub here. The relevant parts of this simple CSV Processor are:

Loads the information and prior checks the source if is local or remote

print( '> Processing ' + ARG_JSON ) print( ' |-- file: ' + JSON_DATA['source'] ) print( ' |-- type: ' + JSON_DATA['type' ] ) print( '\n') tmp_file_path = None if 'http' in JSON_DATA['source']: url = JSON_DATA['source'] r = requests.get(url) tmp_file = h_random_ascii( 8 ) + '.csv' tmp_file_path = os.path.join( DIR_TMP, tmp_file ) if not file_write(tmp_file_path, r.text ): return JSON_DATA['source'] = tmp_file_path else: if not file_exists( JSON_DATA['source'] ): print( ' > Err loading SOURCE: ' + JSON_DATA['source'] ) return csv_types = parse_csv( JSON_DATA['source'] ) 

Analyze the headers and maps the detected types to Django Types. For the tabular view, Tabulate Library is used:

csv_types = parse_csv( JSON_DATA['source'] ) #pprint.pp ( csv_types ) table_headers = ['Field', 'CSV Type', 'Django Types'] table_rows = [] for t in csv_types: t_type = csv_types[t]['type'] t_type_django = django_fields[ t_type ] table_rows.append( [t, t_type, t_type_django] ) print(tabulate(table_rows, table_headers)) 

The last step is to Print the CSV data:

csv_data = load_csv_data( JSON_DATA['source'] ) idx = 0 for l in csv_data: idx += 1 print( '['+str(idx)+'] - ' + str(l) ) # Truncate output .. if idx == 10: print( ' ... (truncated output) ' ) break