This toolkit is a free collection of data analysis and machine learning code samples for data science. Its purpose is to get you started in a matter of minutes. You can run the collection either in a Jupyter notebook or with plain Python.

Olow304/Data-Science-Machine-Learning

Complete-Data-Science-Toolkits

Features

Machine Learning

  • Cross-Validation
  • Evaluating Classification Metrics
  • Evaluating Clustering Metrics
  • Evaluating Regression Metrics
  • Grid Search
  • Preprocessing Encoding Categorical Features
  • Preprocessing Binarization
  • Preprocessing Imputing Missing Values
  • Preprocessing Normalization
  • Preprocessing StandardScaler
  • Randomized Parameter Optimization

Numpy

  • Adding, Removing, and Splitting Arrays
  • Sorting arrays
  • Matrix object
  • Statistics Vector Math
  • Structured Arrays
  • Import, Export, Slicing, Indexing
  • Data to and from strings

Pandas

  • Complete pandas
  • Groupby in Pandas
  • Mapping
  • Filtering
  • Applying

Visualization

  • BarPlots
  • Customization Matplotlib
  • Working with Image
  • Working with text

Naming Conventions

  • The naming convention I followed is:
  • [yyyy-mm-dd-in-project-name-library].extension
  • yyyy = year
  • mm = month
  • dd = day
  • in = my initials, for example: Saleban Olow = so
  • project-name = the name of each project
  • library = numpy, pandas, sklearn, matplotlib
  • extension = .ipynb, .py, .html
  • Example: 2017-11-25-so-cross-validation-sklearn.ipynb
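
File names that follow the yyyy-mm-dd pattern described above can also be generated in code. The sketch below is only an illustration: the helper `notebook_filename` and its parameters are hypothetical and are not part of this repository.

```python
from datetime import date


def notebook_filename(day, initials, project, library, extension="ipynb"):
    """Build a name of the form yyyy-mm-dd-in-project-name-library.extension.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    # Lowercase the project name and join its words with hyphens.
    project_slug = "-".join(project.lower().split())
    # date.isoformat() yields yyyy-mm-dd.
    return f"{day.isoformat()}-{initials.lower()}-{project_slug}-{library.lower()}.{extension}"


name = notebook_filename(date(2017, 11, 25), "SO", "Cross Validation", "sklearn")
print(name)  # 2017-11-25-so-cross-validation-sklearn.ipynb
```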

Code Samples:

Cross Validation

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    model = SVC(kernel='linear', C=1)
    # 5-fold cross-validation; X is the feature matrix and y the labels
    scores = cross_val_score(model, X, y, cv=5)
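
The snippet above leaves `X` and `y` undefined. A self-contained sketch follows, assuming the iris dataset bundled with scikit-learn as stand-in data (the repository notebooks may use different data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: iris stands in for the X and y of the snippet above.
X, y = load_iris(return_X_y=True)

model = SVC(kernel="linear", C=1)
# cv=5 splits the data into five folds and returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)

print(scores.shape)   # one entry per fold
print(scores.mean())  # average accuracy across the folds
```

Reporting the mean together with the spread of `scores` gives a fairer picture than a single train/test split.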

Grid Search

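    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier()
    params = {"n_neighbors": np.arange(1, 5), "metric": ["euclidean", "cityblock"]}
    # try every combination of the parameters above
    grid = GridSearchCV(estimator=knn, param_grid=params)
    grid.fit(X_train, y_train)
    print(grid.best_score_)
    print(grid.best_estimator_.n_neighbors)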
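
For a version that runs end to end, the sketch below assumes the iris dataset and a 75/25 train/test split. Both are illustrative choices, not taken from the repository.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris and this split stand in for X_train and y_train.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

params = {"n_neighbors": np.arange(1, 5), "metric": ["euclidean", "cityblock"]}
# 4 values of n_neighbors x 2 metrics = 8 candidate settings,
# each evaluated with 5-fold cross-validation on the training data.
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # best parameter combination found
print(grid.score(X_test, y_test))  # accuracy of the refitted best model
```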

Preprocessing Imputing Missing Values

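    # Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
    from sklearn.impute import SimpleImputer

    # treat zeros as missing and replace them with the column mean
    impute = SimpleImputer(missing_values=0, strategy='mean')
    impute.fit_transform(X_train)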
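
To make the effect of mean imputation concrete, here is a small sketch using `sklearn.impute.SimpleImputer`, the imputation class in current scikit-learn releases. The array `X_demo` is made-up data in which zeros mark missing entries.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data for illustration: 0 marks a missing entry.
X_demo = np.array([[1.0, 2.0],
                   [0.0, 4.0],
                   [5.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
# Each 0 is replaced by the mean of the non-missing values in its column:
# column 0 -> (1 + 5) / 2 = 3, column 1 -> (2 + 4) / 2 = 3.
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```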

Randomized Parameter Optimization

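    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier()
    params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
    # sample n_iter=8 parameter settings, each scored with 4-fold cross-validation
    rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,
                                 cv=4, n_iter=8, random_state=5)
    rsearch.fit(X_train, y_train)
    print(rsearch.best_score_)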

Model fitting supervised and unsupervised learning

    # supervised learning
    from sklearn import neighbors
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # unsupervised learning
    from sklearn.decomposition import PCA
    # keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    pca_model = pca.fit_transform(X_train)
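
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch, assuming the iris dataset as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: iris stands in for X_train.
X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # fewer columns after reduction
print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_.sum())  # at least 0.95
```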

Working with numpy arrays

    import numpy as np

    # append values to the end of arr (returns a new array)
    np.append(arr, values)
    # insert values into arr before index 2 (returns a new array)
    np.insert(arr, 2, values)
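
Both functions return a new array and leave the original untouched, which is easy to overlook. A short sketch with made-up values:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
values = np.array([9, 9])

appended = np.append(arr, values)     # [1 2 3 4 9 9]
inserted = np.insert(arr, 2, values)  # [1 2 9 9 3 4]

print(appended)
print(inserted)
print(arr)  # still [1 2 3 4]; neither call modified it
```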

Indexing and Slicing arrays

    import numpy as np

    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    # return the element at index 5
    arr[5]
    # assign the value 4 to the element at index 1
    arr[1] = 4

    arr2d = np.arange(24).reshape(4, 6)
    # return the 2D array element at row 2, column 5
    arr2d[2, 5]
    # assign the value 10 to the element at row 1, column 3
    arr2d[1, 3] = 10

Creating DataFrame

    import pandas as pd

    # specify the values for each row, plus the row and column labels
    df = pd.DataFrame(
        [[4, 7, 10], [5, 8, 11], [6, 9, 12]],
        index=[1, 2, 3],
        columns=['a', 'b', 'c'])

Groupby in Pandas

    import pandas as pd

    # return a groupby object, grouped by the values in the column named 'Cities'
    df.groupby(by="Cities")
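
A groupby object does nothing until an aggregation is applied to it. The sketch below uses a made-up DataFrame with a `Cities` column and a hypothetical `sales` column to show one such aggregation:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "Cities": ["Oslo", "Oslo", "Rome", "Rome", "Rome"],
    "sales": [10, 20, 5, 15, 20],
})

grouped = df.groupby(by="Cities")
# Sum the sales within each city.
totals = grouped["sales"].sum()
print(totals)
```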

Handling missing values

    import pandas as pd

    # drop rows in which any column holds NA/null data
    df.dropna()
    # replace all NA/null data with value
    df.fillna(value)
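
The difference between the two calls is easiest to see on a small example. The DataFrame below is made-up data, and 0 is an arbitrary fill value chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: two of the three rows contain a NaN.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

dropped = df.dropna()  # keeps only rows without any missing value
filled = df.fillna(0)  # keeps all rows, replacing NaN with 0

print(dropped)
print(filled)
```

Both methods return new DataFrames by default, so `df` itself still contains its missing values afterwards.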

Melt function

    import pandas as pd

    # most pandas methods return a DataFrame, so calls can be chained,
    # which improves the readability of the code
    df = (pd.melt(df)
          .rename(columns={'variable': 'var', 'value': 'val'})
          .query('val >= 200')
          )
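
`pd.melt` reshapes a wide table into two columns named `variable` and `value` by default. The sketch below chains melt, rename, and query on a made-up DataFrame; the short names `var` and `val` are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up wide-format data for illustration.
wide = pd.DataFrame({"a": [100, 250], "b": [300, 50]})

long = (
    pd.melt(wide)                                          # columns: variable, value
    .rename(columns={"variable": "var", "value": "val"})   # shorter column names
    .query("val >= 200")                                   # keep rows with val >= 200
)
print(long)
```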

Save plot

    import matplotlib.pyplot as plt

    # save the current plot/figure to an image file
    plt.savefig('pic_name.png')

Marker, lines

    import matplotlib.pyplot as plt

    # add a * for every data point
    plt.plot(x, y, marker='*')
    # add a dot for every data point
    plt.plot(x, y, marker='.')

Figures, Axis

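    import matplotlib.pyplot as plt

    # a figure is the container that holds all plot elements
    fig = plt.figure()
    # add an axes at [left, bottom, width, height], given as figure fractions
    fig.add_axes([0.1, 0.1, 0.8, 0.8])
    # a subplot is an axes on a grid: 222 = 2 rows, 2 columns, position 2
    a = fig.add_subplot(222)
    # create a new figure together with a 3x2 grid of subplots
    fig, b = plt.subplots(nrows=3, ncols=2)
    # plt.subplots returns a (figure, axes) pair, so unpack both
    fig, ax = plt.subplots(2, 2)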
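
`plt.subplots` returns the figure together with an array of axes, which can then be indexed like a grid. A minimal sketch follows. The `Agg` backend is selected only so that it runs without a display, and the plotted points are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# One figure holding a 3x2 grid of axes.
fig, axes = plt.subplots(nrows=3, ncols=2)
print(axes.shape)  # (3, 2)

# Draw on the axes in row 0, column 1.
axes[0, 1].plot([0, 1, 2], [0, 1, 4], marker="*")
axes[0, 1].set_title("row 0, column 1")
```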

Working with text plot

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # place text at coordinates (1, 1)
    plt.text(1, 1, 'Example text', style='italic')
    # annotate the point at coordinates xy with text
    ax.annotate('some annotation', xy=(10, 10))
    # titles accept math formulas written in TeX notation
    plt.title(r'$\delta_i=20$', fontsize=10)
