Introduction to Machine Learning with Python and scikit-learn
The document is an introduction to machine learning (ML) using Python and the scikit-learn library, focusing on practical applications and simple examples. It covers concepts such as supervised learning, model fitting, the variance/bias trade-off, and text classification techniques. Additional topics include image recognition and predicting tags for Stack Overflow questions, emphasizing that no extensive theoretical knowledge is required to start using ML methods.
Overview of machine learning (ML), its patterns in data, modeling techniques, and ease of use with Python and scikit-learn.
Introduction of a simple example to demonstrate ML model fitting using Linear Regression and basic Python code implementation.
Discussion on the variance/bias trade-off in ML, including model adaptability and regularization strategies, illustrated with support vector regression.
Fundamentals of supervised learning, focusing on input-output relationships, with multiple inputs showcased.
Application of ML in classifying handwritten digits using Random Forest Classifier with Python code examples.
Text classification techniques, feature extraction methods like term frequency, and a case study involving Stack Overflow questions.
Final thoughts on machine learning's accessibility, job opportunities in data science, and a call to action for potential applicants.
Machine Learning (ML):
• Finding patterns in data
• Modeling patterns
• Using models to make predictions
Slide #2 Intro to Machine Learning with Python matt@liveramp.com
ML can be easy*
• You already have ML applications!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
*Gaining more background, theory, and experience will help
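To illustrate how little code a first model takes, here is a minimal sketch of fitting a linear regression with scikit-learn. The data is synthetic and made up for this illustration (a noisy line with slope 2 and intercept 1); it is not taken from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data (an assumption for illustration):
# y is roughly 2*x + 1 plus a little random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# scikit-learn expects a 2-D input array:
# one row per sample, one column per feature.
X = x.reshape(-1, 1)

# Find the pattern: fit a straight line to the data.
model = LinearRegression()
model.fit(X, y)

# Use the model to predict the output for an unseen input.
X_new = np.array([[12.0]])
y_new = model.predict(X_new)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction at x=12:", y_new[0])
```

The same `fit` / `predict` pair is used by the other scikit-learn models in this talk, so swapping in a different model usually changes only the line that constructs it.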
Variance/Bias Trade-Off
• Need models that can adapt to relationships in our data
• Highly adaptable models can over-fit and will not generalize
• Regularization: a common strategy to address the variance/bias trade-off
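The trade-off can be sketched with support vector regression. The example below is an illustration under stated assumptions, not the talk's own demo: the data is a made-up noisy sine curve, and the two parameter settings are arbitrary choices. In scikit-learn's `SVR`, a smaller `C` means stronger regularization, and a larger `gamma` makes the RBF kernel narrower and the model more adaptable.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

def make_data(n):
    """Made-up data: a sine curve with added noise (assumption for illustration)."""
    x = rng.uniform(0, 6, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(40)
X_test, y_test = make_data(200)

# Highly adaptable model: weak regularization (large C), narrow kernel (large gamma).
flexible = SVR(kernel="rbf", C=1000.0, gamma=50.0).fit(X_train, y_train)

# More regularized model: small C, wide kernel (small gamma).
smooth = SVR(kernel="rbf", C=1.0, gamma=0.5).fit(X_train, y_train)

# score() returns R^2; compare performance on seen vs. unseen data.
scores = {
    "flexible": (flexible.score(X_train, y_train), flexible.score(X_test, y_test)),
    "smooth": (smooth.score(X_train, y_train), smooth.score(X_test, y_test)),
}
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

A large gap between the training and test scores of the flexible model is the usual sign of over-fitting; the regularized model gives up some training accuracy in exchange for predictions that generalize.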
Example: Image Classification
• Classify handwritten digits with ML models
• Each input is an entire image
• Output is the digit in the image
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the training images and their digit labels
with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']

# Load the unlabeled test images
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a single row of pixel values
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

# Train a random forest and predict the digit in each test image
model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)
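The slide's code depends on `train.npz` and `test.npz`, which are not included here. As a self-contained variant, the sketch below swaps in the small 8x8 digits dataset bundled with scikit-learn (`load_digits`) and holds out part of it to measure accuracy. The dataset choice, the 25% split, and the fixed random seeds are assumptions made for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): 1797 grayscale digit images, 8x8 pixels each.
digits = load_digits()

# Flatten each image into a single row of 64 pixel values.
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

# Hold out a quarter of the images to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same model as on the slide; random_state is fixed only for repeatability.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Compare predicted digits with the true labels of the held-out images.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("held-out accuracy:", accuracy)
```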
Predicting the tags of Stack Overflow questions with machine learning
Kaggle Data Science Competition
• Given 6 million training questions labeled with tags
• Predict the tags for 2 million unlabeled test questions
www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework
Text Classification Overview
Raw Posts -> Feature Extraction & Selection -> Vector Space -> Model Selection & Training -> Machine Learning Model
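The stages in this overview can be sketched as a scikit-learn `Pipeline`. This is a minimal illustration, not the approach used in the competition: the four posts and their tags are made up, each post gets a single tag for simplicity (real Stack Overflow questions carry several), and `MultinomialNB` is just one reasonable choice of model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Raw posts: a tiny made-up training set (assumption for illustration).
posts = [
    "How do I sort a list in Python?",
    "Python dictionary comprehension syntax",
    "NullPointerException in Java when calling a method",
    "How to declare an interface in Java",
]
tags = ["python", "python", "java", "java"]

pipeline = Pipeline([
    # Feature extraction: map each post to a vector of word counts.
    ("features", CountVectorizer(stop_words="english")),
    # Model training: fit a classifier in that vector space.
    ("model", MultinomialNB()),
])
pipeline.fit(posts, tags)

# Predict the tag of an unseen post.
new_posts = ["Sorting a dictionary in Python"]
predicted = pipeline.predict(new_posts)
print(predicted[0])
```

Wrapping both stages in one pipeline means the same feature extraction is applied automatically at training and prediction time.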
Term Frequency Feature Extraction
Characterize text by the frequency of specific words in each text entry.
Example Title: "Why is processing a sorted array faster than processing an array that is not sorted?"
Term Frequencies: why 1, processing 2, sorted 2, array 2, faster 1
Ignore common words (i.e., stop words)
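The term counts for the example title can be reproduced with scikit-learn's `CountVectorizer`. This is a sketch under one assumption: the stop-word list below is hand-picked for this title so that "why" is kept, as on the slide. scikit-learn's built-in `stop_words="english"` list would give a different set of terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

title = (
    "Why is processing a sorted array faster than "
    "processing an array that is not sorted?"
)

# Hand-picked stop words (assumption): common words to ignore in this title.
# The single-letter word "a" is already dropped by the default tokenizer.
stop_words = ["is", "than", "an", "that", "not"]

vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform([title]).toarray()[0]

# vocabulary_ maps each term to its column index in the count vector.
term_counts = {
    term: int(counts[index]) for term, index in vectorizer.vocabulary_.items()
}
for term in sorted(term_counts):
    print(term, term_counts[term])
```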
ML can be easy*
• You already have ML problems!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
scikit-learn.org
github.com/scikit-learn
Helping companies use their marketing data to delight customers
Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers
Tools
• Java
• Hadoop (Map/Reduce)
• Ruby
Build and work with large distributed systems that process massive data sets.
Check out: liveramp.com/careers