Data science at the command line Rapid prototyping and reproducible science Sharat Chikkerur sharat@alum.mit.edu Principal Data Scientist Nanigans Inc. 1
Outline Introduction Motivation Data science workflow Obtaining data Scrubbing data Exploring data Managing large workflows Modeling using vowpal wabbit 2
Introduction
Setup • Follow instructions at https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3
About me • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4
About the workshop • Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5
POSIX command line • Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, fish etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6
Why command line ? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7
Data science workflow A typical workflow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8
Workshop outline • Background • Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9
Workflow example: Boston housing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10
Boston housing dataset Python workflow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workflow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11
Obtaining data
Obtaining data from the web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12
Scrubbing data
Scrubbing web data: Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13
CSS selectors .class selects all elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14
XPath Query author selects all <author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15
Getting data walkthrough https: //github.com/sharatsc/cdse/tree/master/curl 16
Scrubbing JSON data: JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17
Basic filters # Identity ’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18
Object construction Object construction allows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19
Operators Addition • Numbers are added by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20
JQ walkthrough https: //github.com/sharatsc/cdse/tree/master/jq 21
Exploring data
CSV • CSV (Comma separated value) is the common demoniator for data exchange • Tabular data with ’,’ as a separator • Can be ingested by R, python, excel etc. • No explicit specification of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22
CSVKit (Groskopf and contributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the defacto standard for tabular file formats. Example use cases • Importing data from excel, sql • Select subset of columns • Reorder columns • Mergeing multiple files (row and column wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23
Importing data # Fetch data in XLS format # (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed # This data was widely cited in the aftermath of the Ferguson, Missouri protests. T curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10 # Convert to csv in2csv ne_1033_data.xlxs > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county, quantity data.csv | csvlook 24
CSVKit: Examining data csvstat provides a summary view of the data similar to summary() function in R. # Get summary for county, and cost csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.072924710424710424710425 Median: 6000.0 Standard Deviation: 13368.07836799839045093904423 Unique values: 75 5 most frequent values: 6800.0: 304 25
CSVKit: searching data csvgrep can be used to search the content of the CSV file. Options include • Exact match -m • Regex match -r • Invert match -i • Search specific columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26
CSVKit: Power tools • csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge mutiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27
CSVKit: SQL • csvsql allows queryring against one or more CSV files. • The results of the query can be inserted back into a db Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28
CSVKit Walkthrough https: //github.com/sharatsc/cdse/tree/master/csvkit 29
Managing large workflows
GNU parallel • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines and files. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C --header "mv {source} {dest}" 30
Parallel (cont.) • By default, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31
Parallel (cont.) • Remote execution parallel --nonall --slf instances hostname # nonall - no argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe it to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" #transmit, retrieve and cleanup # sends jq to all instances # transmits the input file, retrives the results into {.}csv and cleanup ls *.gz | parallel -v --basefile jq --trc {.} csv 32
Modeling using vowpal wabbit
Overview • Fast, online , scalable learning system • Supports out of core execution with in memory model • Scalable (Terascale) • 1000 nodes (??) • Billions of examples • Trillions of unique features. • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33
Swiss army knife of online algorithms • Binary classification • Multiclass classification • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34
Features • Flexible input format: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35
Optimization VW solves optimization of the form i l(wT xi; yi) + λR(w) Here, l() is convex, R(w) = λ1|w| + λ2||w||2. VW support a variety of loss function Linear regression (y − wT x)2 Logistic regression log(1 + exp(−ywT x)) SVM regression max(0, 1 − ywT x) Quantile regression τ(wT x − y) ∗ I(y < wT x) + (1 − τ)(y − wT x)I 36
Detour: Feature hashing • Feature hashing can be used to reduce dimensions of sparse features. • Unlike random projections ? , retains sparsity • Preserves dot products (random projection preserves distances). • Model can fit in memory. • Unsigned ? Consider a hash function h(x) : [0 . . . N] → [0 . . . m], m << N. φi(x) = j:h(j)=i xj • Signed ? Consider additionaly a hash function ξ(x) : [0 . . . N] → {1, −1}. φi(x) = ξ(j)xj 37
Detour: Generalized linear models A generalized linear predictor specifies • A linear predictor of the form η(x) = wT x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x) that relates the mean estimate to the linear predictor. This framework supports a variety of regression problems Linear regression µ = wT x Logistic regression log( µ 1−µ) = wT x Poisson regression log(µ) = wT x 38
fragileInput format Label Importance [Tag]|namespace Feature . . . | namespace Feature . . . namespace = String[:Float] feature = String[:Float] Examples: • 1 | 1:0.01 32:-0.1 • example|namespace normal text features • 1 3 tag|ad-features ad description |user-features name address age 39
Input options • data file -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40
Manipulation options • ngrams --ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41
Output options • Examining feature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42
Model options • Model size --bit_precision or -b . Number of coefficients limited to 2b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model definition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43
Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linear regression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <=0.5> 44
Binary classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification • Note: a linear regressor can be used as a classifier as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45
Multiclass classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass • One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees ---log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46
LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda • Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47
Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request • Loads model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables demon mode • --testonly or -t. Does not update the model in response to requests • --initial_model or -i. Model to load • --port <arg>. Port to listen to the request • --num_children <arg>. Number of threads listening to request 48
References
Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48

Data science at the command line

  • 1.
    Data science atthe command line Rapid prototyping and reproducible science Sharat Chikkerur sharat@alum.mit.edu Principal Data Scientist Nanigans Inc. 1
  • 2.
    Outline Introduction Motivation Data science workflow Obtainingdata Scrubbing data Exploring data Managing large workflows Modeling using vowpal wabbit 2
  • 3.
  • 4.
    Setup • Follow instructionsat https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3
  • 5.
    About me • UBAlumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4
  • 6.
    About the workshop •Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5
  • 7.
    POSIX command line •Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, fish etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6
  • 8.
    Why command line? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7
  • 9.
    Data science workflow Atypical workflow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8
  • 10.
    Workshop outline • Background •Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9
  • 11.
    Workflow example: Bostonhousing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10
  • 12.
    Boston housing dataset Pythonworkflow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workflow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11
  • 13.
  • 14.
    Obtaining data fromthe web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12
  • 15.
  • 16.
    Scrubbing web data:Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13
  • 17.
    CSS selectors .class selectsall elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14
  • 18.
    XPath Query author selects all<author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15
  • 19.
  • 20.
    Scrubbing JSON data:JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17
  • 21.
    Basic filters # Identity’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18
  • 22.
    Object construction Object constructionallows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19
  • 23.
    Operators Addition • Numbers areadded by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20
  • 24.
  • 25.
  • 26.
    CSV • CSV (Commaseparated value) is the common demoniator for data exchange • Tabular data with ’,’ as a separator • Can be ingested by R, python, excel etc. • No explicit specification of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22
  • 27.
    CSVKit (Groskopf andcontributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the defacto standard for tabular file formats. Example use cases • Importing data from excel, sql • Select subset of columns • Reorder columns • Mergeing multiple files (row and column wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23
  • 28.
    Importing data # Fetchdata in XLS format # (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed # This data was widely cited in the aftermath of the Ferguson, Missouri protests. T curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10 # Convert to csv in2csv ne_1033_data.xlxs > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county, quantity data.csv | csvlook 24
  • 29.
    CSVKit: Examining data csvstatprovides a summary view of the data similar to summary() function in R. # Get summary for county, and cost csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.072924710424710424710425 Median: 6000.0 Standard Deviation: 13368.07836799839045093904423 Unique values: 75 5 most frequent values: 6800.0: 304 25
  • 30.
    CSVKit: searching data csvgrepcan be used to search the content of the CSV file. Options include • Exact match -m • Regex match -r • Invert match -i • Search specific columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26
  • 31.
    CSVKit: Power tools •csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge mutiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27
  • 32.
    CSVKit: SQL • csvsqlallows queryring against one or more CSV files. • The results of the query can be inserted back into a db Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28
  • 33.
  • 34.
  • 35.
    GNU parallel • GNUparallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines and files. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C --header "mv {source} {dest}" 30
  • 36.
    Parallel (cont.) • Bydefault, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31
  • 37.
    Parallel (cont.) • Remoteexecution parallel --nonall --slf instances hostname # nonall - no argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe it to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" #transmit, retrieve and cleanup # sends jq to all instances # transmits the input file, retrives the results into {.}csv and cleanup ls *.gz | parallel -v --basefile jq --trc {.} csv 32
  • 38.
  • 39.
    Overview • Fast, online, scalable learning system • Supports out of core execution with in memory model • Scalable (Terascale) • 1000 nodes (??) • Billions of examples • Trillions of unique features. • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33
  • 40.
    Swiss army knifeof online algorithms • Binary classification • Multiclass classification • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34
  • 41.
    Features • Flexible inputformat: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35
  • 42.
    Optimization VW solves optimizationof the form i l(wT xi; yi) + λR(w) Here, l() is convex, R(w) = λ1|w| + λ2||w||2. VW support a variety of loss function Linear regression (y − wT x)2 Logistic regression log(1 + exp(−ywT x)) SVM regression max(0, 1 − ywT x) Quantile regression τ(wT x − y) ∗ I(y < wT x) + (1 − τ)(y − wT x)I 36
  • 43.
    Detour: Feature hashing •Feature hashing can be used to reduce dimensions of sparse features. • Unlike random projections ? , retains sparsity • Preserves dot products (random projection preserves distances). • Model can fit in memory. • Unsigned ? Consider a hash function h(x) : [0 . . . N] → [0 . . . m], m << N. φi(x) = j:h(j)=i xj • Signed ? Consider additionaly a hash function ξ(x) : [0 . . . N] → {1, −1}. φi(x) = ξ(j)xj 37
  • 44.
    Detour: Generalized linearmodels A generalized linear predictor specifies • A linear predictor of the form η(x) = wT x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x) that relates the mean estimate to the linear predictor. This framework supports a variety of regression problems Linear regression µ = wT x Logistic regression log( µ 1−µ) = wT x Poisson regression log(µ) = wT x 38
  • 45.
    fragileInput format Label Importance[Tag]|namespace Feature . . . | namespace Feature . . . namespace = String[:Float] feature = String[:Float] Examples: • 1 | 1:0.01 32:-0.1 • example|namespace normal text features • 1 3 tag|ad-features ad description |user-features name address age 39
  • 46.
    Input options • datafile -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40
  • 47.
    Manipulation options • ngrams--ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41
  • 48.
    Output options • Examiningfeature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42
  • 49.
    Model options • Modelsize --bit_precision or -b . Number of coefficients limited to 2b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model definition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43
  • 50.
    Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linearregression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <=0.5> 44
  • 51.
    Binary classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification •Note: a linear regressor can be used as a classifier as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45
  • 52.
    Multiclass classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass •One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees ---log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46
  • 53.
    LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda •Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47
  • 54.
    Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request •Loads model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables demon mode • --testonly or -t. Does not update the model in response to requests • --initial_model or -i. Model to load • --port <arg>. Port to listen to the request • --num_children <arg>. Number of threads listening to request 48
  • 55.
  • 56.
    Christopher Groskopf andcontributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48