Data science at the command line

Data science at the command line Rapid prototyping and reproducible science Sharat Chikkerur sharat@alum.mit.edu Principal Data Scientist Nanigans Inc. 1

Outline Introduction Motivation Data science workﬂow Obtaining data Scrubbing data Exploring data Managing large workﬂows Modeling using vowpal wabbit 2

Setup • Follow instructions at https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3

About me • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4

About the workshop • Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5

POSIX command line • Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, ﬁsh etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6

Why command line ? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7

Data science workﬂow A typical workﬂow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8

Workshop outline • Background • Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9

Workﬂow example: Boston housing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10

Boston housing dataset Python workﬂow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workﬂow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11

Obtaining data from the web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12

Scrubbing web data: Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13

CSS selectors .class selects all elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14

XPath Query author selects all <author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15

Getting data walkthrough https: //github.com/sharatsc/cdse/tree/master/curl 16

Scrubbing JSON data: JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17

Basic ﬁlters # Identity ’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18

Object construction Object construction allows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19

Operators Addition • Numbers are added by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20

JQ walkthrough https: //github.com/sharatsc/cdse/tree/master/jq 21

CSV • CSV (Comma separated value) is the common demoniator for data exchange • Tabular data with ’,’ as a separator • Can be ingested by R, python, excel etc. • No explicit speciﬁcation of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22

CSVKit (Groskopf and contributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the defacto standard for tabular ﬁle formats. Example use cases • Importing data from excel, sql • Select subset of columns • Reorder columns • Mergeing multiple ﬁles (row and column wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23

Importing data # Fetch data in XLS format # (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed # This data was widely cited in the aftermath of the Ferguson, Missouri protests. T curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10 # Convert to csv in2csv ne_1033_data.xlxs > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county, quantity data.csv | csvlook 24

CSVKit: Examining data csvstat provides a summary view of the data similar to summary() function in R. # Get summary for county, and cost csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.072924710424710424710425 Median: 6000.0 Standard Deviation: 13368.07836799839045093904423 Unique values: 75 5 most frequent values: 6800.0: 304 25

CSVKit: searching data csvgrep can be used to search the content of the CSV ﬁle. Options include • Exact match -m • Regex match -r • Invert match -i • Search speciﬁc columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26

CSVKit: Power tools • csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge mutiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27

CSVKit: SQL • csvsql allows queryring against one or more CSV ﬁles. • The results of the query can be inserted back into a db Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28

CSVKit Walkthrough https: //github.com/sharatsc/cdse/tree/master/csvkit 29

GNU parallel • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines and ﬁles. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C --header "mv {source} {dest}" 30

Parallel (cont.) • By default, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31

Parallel (cont.) • Remote execution parallel --nonall --slf instances hostname # nonall - no argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe it to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" #transmit, retrieve and cleanup # sends jq to all instances # transmits the input file, retrives the results into {.}csv and cleanup ls *.gz | parallel -v --basefile jq --trc {.} csv 32

Overview • Fast, online , scalable learning system • Supports out of core execution with in memory model • Scalable (Terascale) • 1000 nodes (??) • Billions of examples • Trillions of unique features. • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33

Swiss army knife of online algorithms • Binary classiﬁcation • Multiclass classiﬁcation • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34

Features • Flexible input format: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35

Optimization VW solves optimization of the form i l(wT xi; yi) + λR(w) Here, l() is convex, R(w) = λ1|w| + λ2||w||2. VW support a variety of loss function Linear regression (y − wT x)2 Logistic regression log(1 + exp(−ywT x)) SVM regression max(0, 1 − ywT x) Quantile regression τ(wT x − y) ∗ I(y < wT x) + (1 − τ)(y − wT x)I 36

Detour: Feature hashing • Feature hashing can be used to reduce dimensions of sparse features. • Unlike random projections ? , retains sparsity • Preserves dot products (random projection preserves distances). • Model can ﬁt in memory. • Unsigned ? Consider a hash function h(x) : [0 . . . N] → [0 . . . m], m << N. φi(x) = j:h(j)=i xj • Signed ? Consider additionaly a hash function ξ(x) : [0 . . . N] → {1, −1}. φi(x) = ξ(j)xj 37

Detour: Generalized linear models A generalized linear predictor speciﬁes • A linear predictor of the form η(x) = wT x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x) that relates the mean estimate to the linear predictor. This framework supports a variety of regression problems Linear regression µ = wT x Logistic regression log( µ 1−µ) = wT x Poisson regression log(µ) = wT x 38

Input options • data ﬁle -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40

Manipulation options • ngrams --ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41

Output options • Examining feature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42

Model options • Model size --bit_precision or -b . Number of coefﬁcients limited to 2b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model deﬁnition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43

Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linear regression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <=0.5> 44

Binary classiﬁcation (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification • Note: a linear regressor can be used as a classiﬁer as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45

Multiclass classiﬁcation (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass • One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees ---log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46

LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda • Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47

Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request • Loads model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables demon mode • --testonly or -t. Does not update the model in response to requests • --initial_model or -i. Model to load • --port <arg>. Port to listen to the request • --num_children <arg>. Number of threads listening to request 48

Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48

Data science at the command line

More Related Content

What's hot

Viewers also liked

Similar to Data science at the command line

Recently uploaded

Data science at the command line