Complete-Data-Science-Toolkits

This toolkit is a free collection of data analysis and machine learning notebooks and code snippets for data science. Its purpose is to get you started in a matter of minutes. You can run the collection either in a Jupyter notebook or as plain Python scripts.

Features

Machine Learning

Cross-Validation
Evaluating Classification Metrics
Evaluating Clustering Metrics
Evaluating Regression Metrics
Grid Search
Preprocessing Encoding Categorical Features
Preprocessing Binarization
Preprocessing Imputing Missing Values
Preprocessing Normalization
Preprocessing StandardScaler
Randomized Parameter Optimization

Numpy

Adding, Removing, and Splitting Arrays
Sorting arrays
Matrix object
Statistics Vector Math
Structured Arrays
Import, Export, Slicing, Indexing
Data to/from string

Pandas

Complete pandas
Groupby in Pandas
Mapping
Filtering
Applying

Visualization

BarPlots
Customization Matplotlib
Working with Image
Working with text

Naming Conventions

The naming convention I follow is:
[yyyy-mm-dd-in-project-name-library].extension
yyyy = year
mm = month
dd = day
in = my initials, for example: Saleban Olow = so
project-name = the name of each project
library = numpy, pandas, sklearn, matplotlib
extension = .ipynb, .py, .html
Example: 2017-11-25-so-cross-validation-sklearn.ipynb
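
The convention above can be applied programmatically. The following is a minimal sketch using only the Python standard library; the helper name build_filename and its parameters are hypothetical and are not part of this repository.

```python
from datetime import date


def build_filename(day, initials, project_name, library, extension):
    """Build a name of the form yyyy-mm-dd-in-project-name-library.extension."""
    # Lowercase the project name and join its words with hyphens.
    slug = "-".join(project_name.lower().split())
    # date.isoformat() yields yyyy-mm-dd with zero-padded month and day.
    stem = "-".join([day.isoformat(), initials.lower(), slug, library.lower()])
    # Accept the extension with or without a leading dot.
    return f"{stem}.{extension.lstrip('.')}"


example = build_filename(date(2017, 11, 25), "so", "Cross Validation", "sklearn", ".ipynb")
print(example)  # 2017-11-25-so-cross-validation-sklearn.ipynb
```

Taking a date object rather than three separate numbers keeps the year, month, and day in the right order by construction.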

Code Samples:

Cross Validation

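    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # X (features) and y (labels) are assumed to be loaded already
    model = SVC(kernel='linear', C=1)
    # score the model with 5-fold cross-validation
    scores = cross_val_score(model, X, y, cv=5)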
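
To illustrate what cv=5 means, here is a rough pure-Python sketch of plain k-fold splitting. It is not how scikit-learn implements it (cross_val_score does the splitting internally and, for classifiers, uses a stratified variant by default), and the function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, unshuffled folds."""
    # The first n_samples % k folds get one extra sample so every sample is used once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, so the model is fitted k times and every score comes from data it did not see during that fit.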

Grid Search

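    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # X_train and y_train are assumed to be loaded already
    knn = KNeighborsClassifier()
    params = {"n_neighbors": np.arange(1, 5), "metric": ["euclidean", "cityblock"]}
    grid = GridSearchCV(estimator=knn, param_grid=params)
    grid.fit(X_train, y_train)
    print(grid.best_score_)
    print(grid.best_estimator_.n_neighbors)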
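
A grid search tries every combination of the listed parameter values. As a small sketch of what that means for the grid above (np.arange(1,5) yields 1 through 4), the candidates can be enumerated with the standard library; the helper name expand_grid is hypothetical and is not a scikit-learn function.

```python
from itertools import product

params = {"n_neighbors": [1, 2, 3, 4], "metric": ["euclidean", "cityblock"]}


def expand_grid(param_grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(params)
print(len(candidates))  # 8 combinations: 4 values of n_neighbors x 2 metrics
print(candidates[0])    # {'n_neighbors': 1, 'metric': 'euclidean'}
```

The number of model fits grows with the product of the list lengths (times the number of cross-validation folds), which is why large grids become expensive quickly.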

Preprocessing Imputing Missing Values

    from sklearn.impute import SimpleImputer

    # treat 0 as the missing-value marker and replace it with the column mean
    impute = SimpleImputer(missing_values=0, strategy='mean')
    impute.fit_transform(X_train)
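
To make the mean strategy concrete, here is a rough pure-Python sketch of column-wise mean imputation with 0 as the missing-value marker. It only illustrates the idea and is not the scikit-learn implementation; the function name impute_column_means and the sample data are made up for this example.

```python
def impute_column_means(rows, missing=0):
    """Replace `missing` entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] != missing]
        # A column with no observed values has no mean; leave the marker in place.
        means.append(sum(observed) / len(observed) if observed else missing)
    return [[means[j] if row[j] == missing else row[j] for j in range(n_cols)]
            for row in rows]


X_small = [[1, 0, 6],
           [3, 4, 0],
           [0, 8, 2]]
print(impute_column_means(X_small))  # [[1, 6.0, 6], [3, 4, 4.0], [2.0, 8, 2]]
```

Note that using 0 as the marker means genuine zeros in the data are overwritten as well, so this choice only suits features where 0 cannot occur as a real value.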

Randomized Parameter Optimization

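    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # X_train and y_train are assumed to be loaded already
    knn = KNeighborsClassifier()
    params = {"n_neighbors": list(range(1, 5)), "weights": ["uniform", "distance"]}
    rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
    rsearch.fit(X_train, y_train)
    print(rsearch.best_score_)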

Model fitting: supervised and unsupervised learning

    # supervised learning
    from sklearn import neighbors
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # unsupervised learning
    from sklearn.decomposition import PCA
    # keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    pca_model = pca.fit_transform(X_train)

Working with numpy arrays

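    import numpy as np

    arr = np.array([1, 2, 3, 4])
    values = [5, 6]
    # returns a new array with values appended to the end of arr
    np.append(arr, values)
    # returns a new array with values inserted into arr before index 2
    np.insert(arr, 2, values)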

Indexing and Slicing arrays

    import numpy as np

    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    # returns the element at index 5
    arr[5]
    # assigns the value 4 to the element at index 1
    arr[1] = 4

    arr2d = np.arange(24).reshape(4, 6)
    # returns the 2D array element at row 2, column 5
    arr2d[2, 5]
    # assigns the value 10 to the element at row 1, column 3
    arr2d[1, 3] = 10

Creating DataFrame

    import pandas as pd

    # specify the values for each row, plus the index and column labels
    df = pd.DataFrame(
        [[4, 7, 10],
         [5, 8, 11],
         [6, 9, 12]],
        index=[1, 2, 3],
        columns=['a', 'b', 'c'])

Groupby in pandas

    import pandas as pd

    # returns a GroupBy object, grouped by the values in the column named 'Cities'
    df.groupby(by="Cities")

Handling missing values

    import pandas as pd

    # drop rows with NA/null data in any column
    df.dropna()
    # replace all NA/null data with value
    df.fillna(value)

Melt function

    import pandas as pd

    # most pandas methods return a DataFrame, so calls can be chained;
    # this improves the readability of the code
    # (pd.melt names its output columns 'variable' and 'value' by default)
    df = (pd.melt(df)
            .rename(columns={'variable': 'var', 'value': 'val'})
            .query('val >= 200')
         )

Save plot

    import matplotlib.pyplot as plt

    # saves the current plot/figure to an image file
    plt.savefig('pic_name.png')

Marker, lines

    import matplotlib.pyplot as plt

    # draws a * at every data point
    plt.plot(x, y, marker='*')
    # draws a dot at every data point
    plt.plot(x, y, marker='.')

Figures, Axis

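    import matplotlib.pyplot as plt

    # a figure is the container that holds all plot elements
    fig = plt.figure()
    # adds an axes at [left, bottom, width, height] in figure fractions (example values)
    fig.add_axes([0.1, 0.1, 0.8, 0.8])
    # a subplot is an axes on a grid: 222 = 2 rows, 2 columns, position 2
    a = fig.add_subplot(222)
    # creates a new figure with a 3x2 grid of subplots
    fig, b = plt.subplots(nrows=3, ncols=2)
    # plt.subplots returns the figure and an array of axes
    fig, ax = plt.subplots(2, 2)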

Working with text plot

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # places text at coordinates (1, 1)
    plt.text(1, 1, 'Example text', style='italic')
    # annotates the point at coordinates xy with text
    ax.annotate('some annotation', xy=(10, 10))
    # sets a title that contains a math formula
    plt.title(r'$\delta_i=20$', fontsize=10)


Olow304/Data-Science-Machine-Learning
