tongjiangwei/awesome-python-data-science

Awesome Python Data Science


Probably the best curated list of data science software in Python

Contents

Machine Learning

General Purpose Machine Learning

  • scikit-learn - Machine learning in Python.
  • Shogun - Machine learning toolbox.
  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
  • cuML - RAPIDS Machine Learning Library.
  • modAL - Modular active learning framework for Python3.
  • Sparkit-learn - PySpark + Scikit-learn = Sparkit-learn.
  • mlpack - A scalable C++ machine learning library (Python bindings).
  • dlib - Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
  • MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries.
  • Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans.
  • scikit-multilearn - Multi-label classification for Python.
  • seqlearn - Sequence classification toolkit for Python.
  • pystruct - Simple structured learning framework for Python.
  • sklearn-expertsys - Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black box models.
  • RuleFit - Implementation of the RuleFit algorithm.
  • metric-learn - Metric learning algorithms in Python.
  • pyGAM - Generalized Additive Models in Python.

Time Series

  • tslearn - Machine learning toolkit dedicated to time-series data.
  • tick - Module for statistical learning, with a particular emphasis on time-dependent modelling.
  • Prophet - Automatic Forecasting Procedure.
  • PyFlux - Open source time series library for Python.
  • bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
  • luminol - Anomaly Detection and Correlation library.

Automated Machine Learning

  • TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
  • auto-sklearn - An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
  • MLBox - A powerful Automated Machine Learning python library.

Ensemble Methods

  • ML-Ensemble - High performance ensemble learning.
  • Stacking - Simple and useful stacking library, written in Python.
  • stacked_generalization - Library for machine learning stacking generalization.
  • vecstack - Python package for stacking (machine learning technique).

Imbalanced Datasets

  • imbalanced-learn - Module to perform under-sampling and over-sampling with various techniques.
  • imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.

Random Forests

Extreme Learning Machine

  • Python-ELM - Extreme Learning Machine implementation in Python.
  • Python Extreme Learning Machine (ELM) - A machine learning technique used for classification/regression tasks.
  • hpelm - High performance implementation of Extreme Learning Machines (fast randomized neural networks).

Kernel Methods

  • pyFM - Factorization machines in Python.
  • fastFM - A library for Factorization Machines.
  • tffm - TensorFlow implementation of an arbitrary order Factorization Machine.
  • liquidSVM - An implementation of SVMs.
  • scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API.
  • ThunderSVM - A fast SVM Library on GPUs and CPUs.

Gradient Boosting

  • XGBoost - Scalable, Portable and Distributed Gradient Boosting.
  • LightGBM - A fast, distributed, high-performance gradient boosting framework by Microsoft.
  • CatBoost - An open-source gradient boosting on decision trees library by Yandex.
  • ThunderGBM - Fast GBDTs and Random Forests on GPUs.

Deep Learning

PyTorch

  • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
  • torchvision - Datasets, Transforms and Models specific to Computer Vision.
  • torchtext - Data loaders and abstractions for text and NLP.
  • torchaudio - An audio library for PyTorch.
  • ignite - High-level library to help with training neural networks in PyTorch.
  • PyToune - A Keras-like framework and utilities for PyTorch.
  • skorch - A scikit-learn compatible neural network library that wraps PyTorch.
  • PyTorchNet - An abstraction to train neural networks.
  • Aorun - Aims to implement an API similar to Keras with PyTorch as backend.
  • pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch.

TensorFlow

  • TensorFlow - Computation using data flow graphs for scalable machine learning by Google.
  • TensorLayer - Deep Learning and Reinforcement Learning Library for Researcher and Engineer.
  • TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
  • Sonnet - TensorFlow-based neural network library by DeepMind.
  • TensorForce - A TensorFlow library for applied reinforcement learning.
  • tensorpack - A Neural Net Training Interface on TensorFlow.
  • Polyaxon - A platform that helps you build, manage and monitor deep learning models.
  • NeuPy - A Python library for Artificial Neural Networks and Deep Learning.
  • tfdeploy - Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running NumPy.
  • tensorflow-upstream - TensorFlow ROCm port.
  • TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow.
  • tensorlm - Wrapper library for text generation / language models at char and word level with RNN.
  • TensorLight - A high-level framework for TensorFlow.
  • Mesh TensorFlow - Model Parallelism Made Easier.
  • Ludwig - A toolbox that lets you train and test deep learning models without writing code.

Keras

  • Keras - A high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
  • keras-contrib - Keras community contributions.
  • Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization.
  • Elephas - Distributed Deep learning with Keras & Spark.
  • Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
  • dist-keras - Distributed Deep Learning, with a focus on distributed training.
  • Spektral - Deep learning on graphs.
  • qkeras - A quantization deep learning library.
  • Keras add-ons...

MXNet

  • MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
  • Gluon - A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet).
  • MXbox - Simple, efficient and flexible vision toolbox for the MXNet framework.
  • gluon-cv - Provides implementations of the state-of-the-art deep learning models in computer vision.
  • gluon-nlp - NLP made easy.
  • Xfer - Transfer Learning library for Deep Neural Networks.
  • MXNet - HIP Port of MXNet.

Chainer

  • Chainer - A flexible framework for neural networks.
  • ChainerRL - A deep reinforcement learning library built on top of Chainer.
  • ChainerCV - A Library for Deep Learning in Computer Vision.
  • ChainerMN - Scalable distributed deep learning with Chainer.

Theano

WARNING: Theano development has been stopped

  • Theano - A Python library that allows you to define, optimize, and evaluate mathematical expressions.
  • Lasagne - Lightweight library to build and train neural networks in Theano. Lasagne add-ons...
  • nolearn - A scikit-learn compatible neural network library (mainly for Lasagne).
  • Blocks - A Theano framework for building and training neural networks.
  • scikit-neuralnetwork - Deep neural networks without the learning cliff.
  • platoon - Multi-GPU mini-framework for Theano.
  • Theano-MPI - MPI Parallel framework for training deep learning models built in Theano.

Others

  • CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit.
  • Neon - Intel® Nervana™ reference deep learning framework committed to best performance on all hardware.
  • Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
  • autograd - Efficiently computes derivatives of numpy code.
  • Myia - Deep Learning framework (pre-alpha).
  • nnabla - Neural Network Libraries by Sony.
  • Caffe - A fast open framework for deep learning.
  • Caffe2 - A lightweight, modular, and scalable deep learning framework (now a part of PyTorch).
  • hipCaffe - The HIP port of Caffe.

Data Manipulation

Data Containers

  • pandas - Powerful Python data analysis toolkit.
  • cuDF - GPU DataFrame Library.
  • blaze - NumPy and Pandas interface to Big Data.
  • pandasql - Allows you to query pandas DataFrames using SQL syntax.
  • pandas-gbq - Pandas Google Big Query.
  • xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
  • pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces.
  • Arctic - High performance datastore for time series and tick data.
  • datatable - Data.table for Python.
  • koalas - Pandas API on Apache Spark.
  • modin - Speed up your Pandas workflows by changing a single line of code.
  • swifter - A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

Pipelines

  • pdpipe - Easy pipelines for pandas DataFrames.
  • SSPipe - Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
  • pandas-ply - Functional data manipulation for pandas.
  • Dplython - Dplyr for Python.
  • sklearn-pandas - Pandas integration with sklearn.
  • Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
  • pyjanitor - Clean APIs for data cleaning.
  • meza - A Python toolkit for processing tabular data.
  • Prodmodel - Build system for data science pipelines.

Feature Engineering

General

  • Featuretools - Automated feature engineering.
  • skl-groups - Scikit-learn addon to operate on set/"group"-based features.
  • Feature Forge - A set of tools for creating and testing machine learning features.
  • few - A feature engineering wrapper for sklearn.
  • scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
  • tsfresh - Automatic extraction of relevant features from time series.

Feature Selection

  • scikit-feature - Feature selection repository in Python.
  • boruta_py - Implementations of the Boruta all-relevant feature selection method.
  • BoostARoota - A fast xgboost feature selection algorithm.
  • scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.

Visualization

  • Matplotlib - Plotting with Python.
  • seaborn - Statistical data visualization using matplotlib.
  • Bokeh - Interactive Web Plotting for Python.
  • HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
  • Alphalens - Performance analysis of predictive (alpha) stock factors by Quantopian.
  • prettyplotlib - Painlessly create beautiful matplotlib plots.
  • python-ternary - Ternary plotting library for python with matplotlib.
  • missingno - Missing data visualization module for Python.

Model Explanation

  • Alibi - Algorithms for monitoring and explaining machine learning models.
  • anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
  • aequitas - Bias and Fairness Audit Toolkit.
  • Contrastive Explanation - Contrastive Explanation (Foil Trees).
  • yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection.
  • scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects.
  • shap - A unified approach to explain the output of any machine learning model.
  • ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
  • Lime - Explaining the predictions of any machine learning classifier.
  • FairML - A Python toolbox for auditing machine learning models for bias.
  • L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
  • PDPbox - Partial dependence plot toolbox.
  • pyBreakDown - Python implementation of R package breakDown.
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
  • Skater - Python Library for Model Interpretation.
  • model-analysis - Model analysis tools for TensorFlow.
  • themis-ml - A library that implements fairness-aware machine learning algorithms.
  • treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models.
  • Auralisation - Auralisation of learned features in CNN (for audio).
  • CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
  • lucid - A collection of infrastructure and tools for research in neural network interpretability.
  • Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
  • FlashLight - Visualization Tool for your NeuralNetwork.
  • tensorboard-pytorch - Tensorboard for pytorch (and chainer, mxnet, numpy, ...).
  • mxboard - Logging MXNet data for visualization in TensorBoard.

Reinforcement Learning

  • OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.

Distributed Computing

  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • PySpark - Exposes the Spark programming model to Python.
  • Veles - Distributed machine learning platform by Samsung.
  • Jubatus - Framework and Library for Distributed Online Machine Learning.
  • DMTK - Microsoft Distributed Machine Learning Toolkit.
  • PaddlePaddle - PArallel Distributed Deep LEarning by Baidu.
  • dask-ml - Distributed and parallel machine learning.
  • Distributed - Distributed computation in Python.

Probabilistic Methods

  • pomegranate - Probabilistic and graphical models for Python.
  • pyro - A flexible, scalable deep probabilistic programming library built on PyTorch.
  • ZhuSuan - Bayesian Deep Learning.
  • PyMC - Bayesian Stochastic Modelling in Python.
  • PyMC3 - Python package for Bayesian statistical modeling and Probabilistic Machine Learning.
  • sampled - Decorator for reusable models in PyMC3.
  • Edward - A library for probabilistic modeling, inference, and criticism.
  • InferPy - Deep Probabilistic Modelling Made Easy.
  • GPflow - Gaussian processes in TensorFlow.
  • PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
  • gelato - Bayesian dessert for Lasagne.
  • sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API.
  • skggm - Estimation of general graphical models.
  • pgmpy - A Python library for working with Probabilistic Graphical Models.
  • skpro - Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
  • Aboleth - A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation.
  • PtStat - Probabilistic Programming and Statistical Inference in PyTorch.
  • PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch.
  • emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
  • hsmmlearn - A library for hidden semi-Markov models with explicit durations.
  • pyhsmm - Bayesian inference in HSMMs and HMMs.
  • GPyTorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
  • MXFusion - Modular Probabilistic Programming on MXNet.
  • sklearn-crfsuite - Scikit-learn inspired API for CRFsuite.

Genetic Programming

  • gplearn - Genetic Programming in Python.
  • DEAP - Distributed Evolutionary Algorithms in Python.
  • karoo_gp - A Genetic Programming platform for Python with GPU support.
  • monkeys - A strongly-typed genetic programming framework for Python.
  • sklearn-genetic - Genetic feature selection module for scikit-learn.

Optimization

  • Spearmint - Bayesian optimization.
  • BoTorch - Bayesian optimization in PyTorch.
  • SMAC3 - Sequential Model-based Algorithm Configuration.
  • Optunity - A library containing various optimizers for hyperparameter tuning.
  • hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn.
  • sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn.
  • sigopt_sklearn - SigOpt wrappers for scikit-learn methods.
  • Bayesian Optimization - A Python implementation of global optimization with Gaussian processes.
  • SafeOpt - Safe Bayesian Optimization.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • Solid - A comprehensive gradient-free optimization framework written in Python.
  • PySwarms - A research toolkit for particle swarm optimization in Python.
  • Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
  • GPflowOpt - Bayesian Optimization using GPflow.
  • POT - Python Optimal Transport library.
  • Talos - Hyperparameter Optimization for Keras Models.
  • nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).

Natural Language Processing

  • NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
  • CLTK - The Classical Language Toolkit.
  • gensim - Topic Modelling for Humans.
  • PSI-Toolkit - A natural language processing toolkit by Adam Mickiewicz University in Poznań.
  • pyMorfologik - Python binding for Morfologik (Polish morphological analyzer).
  • skift - Scikit-learn wrappers for Python fastText.
  • Phonemizer - Simple text to phonemes converter for multiple languages.
  • flair - Very simple framework for state-of-the-art NLP by Zalando Research.

Computer Audition

  • librosa - Python library for audio and music analysis.
  • Yaafe - Audio features extraction.
  • aubio - A library for audio and music analysis.
  • Essentia - Library for audio and music analysis, description and synthesis.
  • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals.
  • muda - A library for augmenting annotated audio data.
  • madmom - Python audio and music signal processing library.
  • more: Python for Scientific Audio

Computer Vision

  • OpenCV - Open Source Computer Vision Library.
  • scikit-image - Image Processing SciKit (Toolbox for SciPy).
  • imgaug - Image augmentation for machine learning experiments.
  • imgaug_extension - Additional augmentations for imgaug.
  • Augmentor - Image augmentation library in Python for machine learning.
  • albumentations - Fast image augmentation library and easy to use wrapper around other libraries.

Statistics

  • pandas_summary - Extension to the pandas DataFrame describe function.
  • Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects.
  • statsmodels - Statistical modeling and econometrics in Python.
  • stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • weightedcalcs - pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
  • scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.

Experimentation

  • Sacred - A tool to help you configure, organize, log and reproduce experiments by IDSIA.
  • Xcessiv - A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
  • Persimmon - A visual dataflow programming language for sklearn.
  • Ax - Adaptive Experimentation Platform.

Evaluation

  • recmetrics - Library of useful metrics and plots for evaluating recommender systems.
  • Metrics - Machine learning evaluation metric.
  • sklearn-evaluation - Scikit-learn model evaluation made easy: plots, tables and markdown reports.
  • AI Fairness 360 - Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.

Computations

  • numpy - The fundamental package needed for scientific computing with Python.
  • Dask - Parallel computing with task scheduling.
  • bottleneck - Fast NumPy array functions written in C.
  • minpy - NumPy interface with mixed backend execution.
  • CuPy - NumPy-like API accelerated with CUDA.
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
  • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
  • quaternion - Add built-in support for quaternions to numpy.
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions.

Spatial Analysis

  • GeoPandas - Python tools for geographic data.
  • PySal - Python Spatial Analysis Library.

Quantum Computing

  • QML - A Python Toolkit for Quantum Machine Learning.

Conversion

  • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
  • ONNX - Open Neural Network Exchange.
  • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.

Contributing

Contributions are welcome! 😎
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0
