Top 23 Python Dataset Projects

public-apis

Rank: 1 | Mentions: 418 | Stars: 386,541 | Activity: 2.7 | Language: Python

A collective list of free APIs

Project mention: What is an API? A Beginner's Guide | dev.to | 2025-10-25

Public APIs Directory
faker

Rank: 2 | Mentions: 9 | Stars: 18,975 | Activity: 9.3 | Language: Python

Faker is a Python package that generates fake data for you. (by joke2k)
LaTeX-OCR

Rank: 3 | Mentions: 22 | Stars: 16,037 | Activity: 5.4 | Language: Python

pix2tex: Using a ViT to convert images of equations into LaTeX code.
fashion-mnist

Rank: 4 | Mentions: 17 | Stars: 12,555 | Activity: 0.0 | Language: Python

A MNIST-like fashion product database. Benchmark 👇
doccano

Rank: 5 | Mentions: 13 | Stars: 10,451 | Activity: 3.6 | Language: Python

Open source annotation tool for machine learning practitioners.
awesome-pretrained-chinese-nlp-models

Rank: 6 | Mentions: 1 | Stars: 5,485 | Activity: 6.9 | Language: Python

Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
datasets

Rank: 7 | Mentions: 5 | Stars: 4,518 | Activity: 8.3 | Language: Python

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)
transformer-pytorch

Rank: 8 | Mentions: 2 | Stars: 4,326 | Activity: 1.7 | Language: Python

Transformer: PyTorch Implementation of "Attention Is All You Need"
img2dataset

Rank: 9 | Mentions: 15 | Stars: 4,244 | Activity: 6.2 | Language: Python

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
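
To put the throughput claim in perspective, the arithmetic below converts "100M urls in 20h" into an average per-second rate. This is a back-of-the-envelope sketch only; the `urls_per_second` helper is hypothetical and not part of img2dataset.

```python
def urls_per_second(total_urls: int, hours: float) -> float:
    """Average rate implied by a total URL count and a wall-clock duration."""
    return total_urls / (hours * 3600)


# Figures taken from the project description: 100M URLs in 20 hours.
rate = urls_per_second(100_000_000, 20)
print(f"{rate:.0f} URLs per second")  # about 1,389 URLs per second
```

The result is an average over the whole run, so it says nothing about peak bandwidth or how the rate varies with image size.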

Project mention: Anthropic reverses privacy stance, will train on Claude chats | news.ycombinator.com | 2025-08-29

> By default, you are opted in. Perfectly clear.
That's called opt-out. You're doing exactly what I described: gaslighting people into believing that opt-in and opt-out are synonyms, which makes the entire concept meaningless. The audacity of you calling me "political" while resorting to such manipulation is astounding.
These are examples of what "opt-in by default" actually means. It means having the user manually consent to something every time, the polar opposite of your definition.
- https://arstechnica.com/gadgets/2024/06/report-new-apple-int...
- https://github.com/rom1504/img2dataset/issues/293
It's also just pure laziness to label me as "hysterical" when PR departments of companies like Google have, like you, misused the terms opt-out and opt-in in deceptive ways.
https://news.ycombinator.com/item?id=37314981
TextRecognitionDataGenerator

Rank: 10 | Mentions: 1 | Stars: 3,579 | Activity: 3.6 | Language: Python

A synthetic data generator for text recognition
waymo-open-dataset

Rank: 11 | Mentions: 7 | Stars: 3,157 | Activity: 2.7 | Language: Python

Waymo Open Dataset
pandas-datareader

Rank: 12 | Mentions: 3 | Stars: 3,135 | Activity: 7.1 | Language: Python

Extract data from a wide range of Internet sources into a pandas DataFrame.
Colour

Rank: 13 | Mentions: 8 | Stars: 2,469 | Activity: 9.2 | Language: Python

Colour Science for Python

Project mention: What Is a Color Space? | news.ycombinator.com | 2025-08-25

Nice article. I came across a very cool Python library for colour science recently too: https://www.colour-science.org/
Just started playing with it with my spectrometer based on one of the examples they have, to convert spectral data to a single RGB value.
beir

Rank: 14 | Mentions: 10 | Stars: 2,023 | Activity: 7.1 | Language: Python

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31

It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset, is that it can grow as your data grows. And you can quickly test new models to see how it performs with YOUR data.
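
The advice in the comment above, evaluating models on your own data, can be sketched without any library. The example below computes recall@k over a toy custom test set. The query ids, document ids, and both helper functions are hypothetical, and this is not BEIR's API, which ships its own loaders and evaluators.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, qrels, k):
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical ground truth: query id -> ids of the relevant documents.
qrels = {
    "q1": {"d1", "d4"},
    "q2": {"d2"},
}

# Hypothetical ranked output of the model under test: query id -> ranked doc ids.
results = {
    "q1": ["d1", "d3", "d4", "d2"],
    "q2": ["d3", "d1", "d2", "d4"],
}

print(mean_recall_at_k(results, qrels, k=2))
```

Because the judgments live in a plain dictionary, new queries can be appended as the data grows, and the same function can be rerun against each candidate model.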
fastdup

Rank: 15 | Mentions: 19 | Stars: 1,802 | Activity: 7.3 | Language: Python

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
ESC-50

Rank: 16 | Mentions: 2 | Stars: 1,700 | Activity: 2.9 | Language: Python

ESC-50: Dataset for Environmental Sound Classification
linusrants

Rank: 17 | Mentions: 13 | Stars: 1,693 | Activity: 0.0 | Language: Python

Dataset of Linus Torvalds' rants classified by negativity using sentiment analysis

Project mention: Asahi Linux Lead Developer Hector Martin Resigns from Linux Kernel | news.ycombinator.com | 2025-02-07

Of course not.
https://github.com/corollari/linusrants
DataProfiler

Rank: 18 | Mentions: 61 | Stars: 1,535 | Activity: 5.8 | Language: Python

What's in your data? Extract schema, statistics and entities from datasets
streaming

Rank: 19 | Mentions: 1 | Stars: 1,433 | Activity: 8.3 | Language: Python

A Data Streaming Library for Efficient Neural Network Training (by mosaicml)
VBench

Rank: 20 | Mentions: 1 | Stars: 1,373 | Activity: 9.4 | Language: Python

[CVPR2024 Highlight] VBench - We Evaluate Video Generation
chatgpt-comparison-detection

Rank: 21 | Mentions: 1 | Stars: 1,327 | Activity: 4.8 | Language: Python

Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥
covid-19

Rank: 22 | Mentions: 7 | Stars: 1,164 | Activity: 0.0 | Language: Python

Novel Coronavirus 2019 time series data on cases (by datasets)
RecSysDatasets

Rank: 23 | Mentions: 1 | Stars: 1,112 | Activity: 2.7 | Language: Python

This is a repository of public data sources for Recommender Systems (RS).

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
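
The ordering rule in the note can be expressed as a simple sort. The sketch below uses three star counts from this list, deliberately shuffled, and a hypothetical `rank_by_stars` helper. It illustrates the rule only and is not how the site itself computes its ranking.

```python
def rank_by_stars(projects):
    """Return (rank, name, stars) tuples ordered by star count, highest first."""
    ordered = sorted(projects.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, count) for rank, (name, count) in enumerate(ordered, start=1)]


# Star counts copied from the list, in shuffled order.
stars = {"LaTeX-OCR": 16_037, "public-apis": 386_541, "faker": 18_975}

for rank, name, count in rank_by_stars(stars):
    print(f"{rank}\t{name}\t{count:,}")
```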

Python Dataset related posts

McBroken
1 project | news.ycombinator.com | 26 Aug 2025

McDonald's Gives Its Restaurants an AI Makeover
1 project | news.ycombinator.com | 7 Mar 2025

Chain of Draft: Thinking Faster by Writing Less
1 project | dev.to | 28 Feb 2025

FashionMNIST in PyTorch
1 project | dev.to | 8 Dec 2024

Datasets for Computer Vision (3)
1 project | dev.to | 25 Nov 2024

Victory Is Sweet: We Can Now Fix McDonald's Ice Cream Machines
3 projects | news.ycombinator.com | 25 Oct 2024

McBroken – Is the mcdonald's ice cream machine broken?
1 project | news.ycombinator.com | 16 Jul 2024

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

#	Project	Stars
1	public-apis	386,541
2	faker	18,975
3	LaTeX-OCR	16,037
4	fashion-mnist	12,555
5	doccano	10,451
6	awesome-pretrained-chinese-nlp-models	5,485
7	datasets	4,518
8	transformer-pytorch	4,326
9	img2dataset	4,244
10	TextRecognitionDataGenerator	3,579
11	waymo-open-dataset	3,157
12	pandas-datareader	3,135
13	Colour	2,469
14	beir	2,023
15	fastdup	1,802
16	ESC-50	1,700
17	linusrants	1,693
18	DataProfiler	1,535
19	streaming	1,433
20	VBench	1,373
21	chatgpt-comparison-detection	1,327
22	covid-19	1,164
23	RecSysDatasets	1,112

Did you know that Python is the 2nd most popular programming language based on number of references?
