Curated list of data science software in Python
- scikit-learn compatible (or inspired) API
- Theano based project
- TensorFlow based project
- PyTorch based project
- CuPy based project
- Apache Spark based project
- GPU-accelerated computations (if not based on Theano, TensorFlow, PyTorch or CuPy)
- possible to run on AMD GPU
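As a rough illustration of the first legend entry: a "scikit-learn compatible" project usually exposes estimators with `fit` and `predict` methods plus `get_params`/`set_params` for hyperparameters. The sketch below is plain Python with no dependencies; `MeanRegressor` and its `shrinkage` parameter are hypothetical names invented for this example and do not come from any library in this list.

```python
class MeanRegressor:
    """Hypothetical toy estimator that mimics scikit-learn's calling conventions."""

    def __init__(self, shrinkage=0.0):
        # By convention the constructor only stores hyperparameters.
        self.shrinkage = shrinkage

    def get_params(self, deep=True):
        # Return hyperparameters as a dict so tools can inspect or clone the model.
        return {"shrinkage": self.shrinkage}

    def set_params(self, **params):
        # Update hyperparameters in place and return self to allow chaining.
        # (Unlike scikit-learn, this sketch does not validate parameter names.)
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Learned state gets a trailing underscore, following the convention.
        mean = sum(y) / len(y)
        self.mean_ = (1.0 - self.shrinkage) * mean
        return self

    def predict(self, X):
        # Predict the (optionally shrunk) training mean for every input row.
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[0], [1], [2]], [1.0, 2.0, 3.0])
predictions = model.predict([[5], [6]])
```

Projects marked as compatible generally aim to work with scikit-learn utilities such as pipelines and grid search; how complete that support is varies from project to project.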
- Machine Learning
- Deep Learning
- Reinforcement Learning
- Distributed computing systems
- Probabilistic methods
- Genetic Programming
- Optimization
- Natural Language Processing
- Computer Audition
- Computer Vision
- Feature engineering
- Data manipulation & pipelines
- Statistics
- Experiments tools
- Visualization
- Evaluation
- Computations
- Quantum computing
- Conversion
- scikit-learn - machine learning in Python
- Shogun - machine learning toolbox
- xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package
- Sparkit-learn - PySpark + Scikit-learn = Sparkit-learn
- mlpack - a scalable C++ machine learning library (Python bindings)
- dlib - A toolkit for making real world machine learning and data analysis applications in C++ (Python bindings)
- MLxtend - extension and helper modules for Python's data analysis and machine learning libraries
- tick - module for statistical learning, with a particular emphasis on time-dependent modelling
- sklearn-extensions - a consolidated package of small extensions to scikit-learn
- civisml-extensions - scikit-learn-compatible estimators from Civis Analytics
- scikit-multilearn - multi-label classification for Python
- tslearn - machine learning toolkit dedicated to time-series data
- seqlearn - a sequence classification toolkit for Python
- pystruct - Simple structured learning framework for Python
- sklearn-expertsys - Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black box models
- skutil - A set of scikit-learn and h2o extension classes (as well as caret classes for Python)
- sklearn-crfsuite - scikit-learn inspired API for CRFsuite
- RuleFit - implementation of the RuleFit algorithm
- metric-learn - metric learning algorithms in Python
- TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming
- auto-sklearn - an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator
- MLBox - a powerful Automated Machine Learning Python library.
- ML-Ensemble - high performance ensemble learning
- brew - Python Ensemble Learning API
- Stacking - Simple and useful stacking library, written in Python.
- stacked_generalization - library for machine learning stacking generalization.
- vecstack - Python package for stacking (machine learning technique)
- imbalanced-learn - module to perform under sampling and over sampling with various techniques
- imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
- rpforest - a forest of random projection trees
- Random Forest Clustering - Unsupervised Clustering using Random Forests
- sklearn-random-bits-forest - wrapper of the Random Bits Forest program written by (Wang et al., 2016)
- rgf_python - Python Wrapper of Regularized Greedy Forest
- Python-ELM - Extreme Learning Machine implementation in Python
- Python Extreme Learning Machine (ELM) - a machine learning technique used for classification/regression tasks
- hpelm - High performance implementation of Extreme Learning Machines (fast randomized neural networks).
- pyFM - Factorization machines in Python
- fastFM - a library for Factorization Machines
- tffm - TensorFlow implementation of an arbitrary order Factorization Machine
- liquidSVM - an implementation of SVMs
- scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API
- XGBoost - Scalable, Portable and Distributed Gradient Boosting
- LightGBM - a fast, distributed, high performance gradient boosting by Microsoft
- CatBoost - an open-source gradient boosting on decision trees library by Yandex
- TGBoost - Tiny Gradient Boosting Tree
- Keras - a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
- Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization
- Elephas - Distributed Deep learning with Keras & Spark
- Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
- dist-keras - Distributed Deep Learning, with a focus on distributed training
- TensorFlow - Computation using data flow graphs for scalable machine learning by Google
- TensorLayer - Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
- TFLearn - Deep learning library featuring a higher-level API for TensorFlow
- Sonnet - TensorFlow-based neural network library by DeepMind
- TensorForce - a TensorFlow library for applied reinforcement learning
- tensorpack - a Neural Net Training Interface on TensorFlow
- Polyaxon - a platform that helps you build, manage and monitor deep learning models
- Horovod - Distributed training framework for TensorFlow
- tfdeploy - Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running NumPy
- hiptensorflow - ROCm/HIP enabled TensorFlow
- TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow
WARNING: Theano development has been stopped
- Theano - a Python library that allows you to define, optimize, and evaluate mathematical expressions
- Lasagne - Lightweight library to build and train neural networks in Theano
- nolearn - scikit-learn compatible neural network library (mainly for Lasagne)
- Blocks - a Theano framework for building and training neural networks
- platoon - Multi-GPU mini-framework for Theano
- NeuPy - a Python library for Artificial Neural Networks and Deep Learning
- scikit-neuralnetwork - Deep neural networks without the learning cliff
- Theano-MPI - MPI Parallel framework for training deep learning models built in Theano
- PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
- skorch - a scikit-learn compatible neural network library that wraps PyTorch
- PyTorchNet - an abstraction to train neural networks
- MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler
- Gluon - a clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet)
- MXbox - simple, efficient and flexible vision toolbox for mxnet framework.
- MXNet - HIP Port of MXNet
- Caffe - a fast open framework for deep learning
- Caffe2 - a lightweight, modular, and scalable deep learning framework
- hipCaffe - the HIP port of Caffe
- CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
- Chainer - a flexible framework for neural networks
- ChainerRL - a deep reinforcement learning library built on top of Chainer.
- ChainerCV - a Library for Deep Learning in Computer Vision
- ChainerMN - scalable distributed deep learning with Chainer
- scikit-chainer - scikit-learn like interface to Chainer
- chainer_sklearn - Sklearn (Scikit-learn) like interface for Chainer
- Neon - Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
- Tangent - Source-to-Source Debuggable Derivatives in Pure Python
- autograd - Efficiently computes derivatives of numpy code
- Myia - deep learning framework (pre-alpha)
- OpenAI Gym - a toolkit for developing and comparing reinforcement learning algorithms.
- PySpark - exposes the Spark programming model to Python
- Veles - Distributed machine learning platform by Samsung
- Jubatus - Framework and Library for Distributed Online Machine Learning
- DMTK - Microsoft Distributed Machine Learning Toolkit
- PaddlePaddle - PArallel Distributed Deep LEarning by Baidu
- dask-ml - Distributed and parallel machine learning
- Distributed - Distributed computation in Python
- pomegranate - probabilistic and graphical models for Python
- bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models
- PyFlux - Open source time series library for Python
- pyro - a flexible, scalable deep probabilistic programming library built on PyTorch.
- ZhuSuan - Bayesian Deep Learning
- PyMC - Bayesian Stochastic Modelling in Python
- PyMC3 - Python package for Bayesian statistical modeling and Probabilistic Machine Learning
- Edward - A library for probabilistic modeling, inference, and criticism.
- GPflow - Gaussian processes in TensorFlow
- PyStan - Bayesian inference using the No-U-Turn sampler (Python interface)
- gelato - Bayesian dessert for Lasagne
- sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API
- skggm - estimation of general graphical models
- pgmpy - a Python library for working with Probabilistic Graphical Models.
- skpro - supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute
- Aboleth - a bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation
- PtStat - Probabilistic Programming and Statistical Inference in PyTorch
- emcee - The Python ensemble sampling toolkit for affine-invariant MCMC
- gplearn - Genetic Programming in Python
- DEAP - Distributed Evolutionary Algorithms in Python
- karoo_gp - A Genetic Programming platform for Python with GPU support
- monkeys - A strongly-typed genetic programming framework for Python
- Spearmint - Bayesian optimization
- SMAC3 - Sequential Model-based Algorithm Configuration
- Optunity - a library containing various optimizers for hyperparameter tuning.
- hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python
- hyperopt-sklearn - hyper-parameter optimization for sklearn
- sklearn-deap - use evolutionary algorithms instead of gridsearch in scikit-learn
- sigopt_sklearn - SigOpt wrappers for scikit-learn methods
- Bayesian Optimization - A Python implementation of global optimization with Gaussian processes.
- SafeOpt - Safe Bayesian Optimization
- scikit-optimize - Sequential model-based optimization with a `scipy.optimize` interface
- Solid - A comprehensive gradient-free optimization framework written in Python
- PySwarms - A research toolkit for particle swarm optimization in Python
- Platypus - A Free and Open Source Python Library for Multiobjective Optimization
- NLTK - modules, data sets, and tutorials supporting research and development in Natural Language Processing
- CLTK - The Classical Language Toolkit
- gensim - Topic Modelling for Humans
- PSI-Toolkit - a natural language processing toolkit by Adam Mickiewicz University in Poznań
- pyMorfologik - Python binding for Morfologik (Polish morphological analyzer)
- librosa - Python library for audio and music analysis
- Yaafe - Audio features extraction
- aubio - a library for audio and music analysis
- Essentia - library for audio and music analysis, description and synthesis
- LibXtract - a simple, portable, lightweight library of audio feature extraction functions
- Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals
- muda - a library for augmenting annotated audio data
- madmom - Python audio and music signal processing library
- OpenCV - Open Source Computer Vision Library
- scikit-image - Image Processing SciKit (Toolbox for SciPy)
- Featuretools - automated feature engineering
- scikit-feature - feature selection repository in python
- skl-groups - scikit-learn addon to operate on set/"group"-based features
- Feature Forge - a set of tools for creating and testing machine learning features
- boruta_py - implementations of the Boruta all-relevant feature selection method
- BoostARoota - a fast xgboost feature selection algorithm
- pandas - powerful Python data analysis toolkit
- sklearn-pandas - Pandas integration with sklearn
- alexander - wrapper that aims to make scikit-learn fully compatible with pandas
- blaze - NumPy and Pandas interface to Big Data
- pandasql - allows you to query pandas DataFrames using SQL syntax
- pandas-gbq - Pandas Google Big Query
- xpandas - universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute
- Fuel - data pipeline framework for machine learning
- Arctic - high performance datastore for time series and tick data
- pdpipe - easy pipelines for pandas DataFrames.
- meza - a Python toolkit for processing tabular data
- pandas-ply - functional data manipulation for pandas
- Dplython - Dplyr for Python
- pysparkling - a pure Python implementation of Apache Spark's RDD and DStream interfaces
- quinn - pyspark methods to enhance developer productivity
- statsmodels - statistical modeling and econometrics in Python
- stockstats - Supply a wrapper `StockDataFrame` based on the `pandas.DataFrame` with inline stock statistics/indicators support.
- simplestatistics - simple statistical functions implemented in readable Python.
- weightedcalcs - pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more
- Sacred - a tool to help you configure, organize, log and reproduce experiments by IDSIA
- Xcessiv - a web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling
- Persimmon - A visual dataflow programming language for sklearn
- Matplotlib - plotting with Python
- seaborn - statistical data visualization using matplotlib
- Bokeh - Interactive Web Plotting for Python
- HoloViews - stop plotting your data - annotate your data and let it visualize itself
- Alphalens - performance analysis of predictive (alpha) stock factors by Quantopian
- yellowbrick - visual analysis and diagnostic tools to facilitate machine learning model selection
- scikit-plot - an intuitive library to add plotting functionality to scikit-learn objects
- Lime - Explaining the predictions of any machine learning classifier
- kaggle-metrics - Metrics for Kaggle competitions
- Metrics - machine learning evaluation metrics
- sklearn-evaluation - scikit-learn model evaluation made easy: plots, tables and markdown reports
- numpy - the fundamental package needed for scientific computing with Python.
- Dask - parallel computing with task scheduling
- bottleneck - Fast NumPy array functions written in C
- minpy - NumPy interface with mixed backend execution
- CuPy - NumPy-like API accelerated with CUDA
- scikit-tensor - Python library for multilinear algebra and tensor factorizations
- QML - a Python Toolkit for Quantum Machine Learning
- sklearn-porter - transpile trained scikit-learn estimators to C, Java, JavaScript and others