Top 23 Python Dataset Projects
-
Public APIs Directory
-
awesome-pretrained-chinese-nlp-models
Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models
-
img2dataset
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
Project mention: Anthropic reverses privacy stance, will train on Claude chats | news.ycombinator.com | 2025-08-29
> By default, you are opted in. Perfectly clear.
That's called opt-out. You're doing exactly what I described: gaslighting people into believing that opt-in and opt-out are synonyms, which makes the entire concept meaningless. The audacity of you calling me "political" while resorting to such manipulation is astounding.
These are examples of what "opt-in by default" actually means. It means having the user manually consent to something every time, the polar opposite of your definition.
- https://arstechnica.com/gadgets/2024/06/report-new-apple-int...
- https://github.com/rom1504/img2dataset/issues/293
It's also just pure laziness to label me as "hysterical" when PR departments of companies like Google have, like you, misused the terms opt-out and opt-in in deceptive ways.
https://news.ycombinator.com/item?id=37314981
-
Nice article. I recently came across a very cool Python library for colour science too: https://www.colour-science.org/
I just started playing with it and my spectrometer, following one of their examples, to convert spectral data to a single RGB value.
-
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31
It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows. You can also quickly test new models to see how they perform with YOUR data.
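As a rough illustration of the "build a test dataset with your own data" advice, here is a minimal standard-library sketch. It assumes the commonly documented BEIR custom-dataset layout (`corpus.jsonl`, `queries.jsonl`, and a tab-separated `qrels/test.tsv` with a `query-id`, `corpus-id`, `score` header); check the wiki page linked above before relying on it. The sample documents, the `write_beir_dataset` and `recall_at_k` helpers, and the `results` rankings are all hypothetical, and the recall function is a hand-rolled stand-in, not BEIR's own evaluator.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_beir_dataset(root, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl and qrels/test.tsv under root (assumed layout)."""
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])
    return root


def recall_at_k(qrels, results, k):
    """Mean fraction of each query's relevant documents found in its top-k results."""
    per_query = []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, score in judged.items() if score > 0}
        if not relevant:
            continue  # skip queries with no relevant documents
        top_k = set(results.get(query_id, [])[:k])
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical sample data standing in for "your own data".
corpus = {
    "d1": {"title": "Refunds", "text": "How to request a refund."},
    "d2": {"title": "Shipping", "text": "Shipping times by region."},
    "d3": {"title": "Returns", "text": "How to return an item."},
}
queries = {"q1": "get my money back", "q2": "send an item back"}
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d3": 1}}

# Hypothetical ranked output of some embedding model, best match first.
results = {"q1": ["d1", "d2"], "q2": ["d2", "d1"]}

with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = write_beir_dataset(tmp, corpus, queries, qrels)
    print(sorted(p.name for p in dataset_dir.iterdir()))

print(recall_at_k(qrels, results, k=2))
```

Because the judgments live in plain files, adding new queries as your data grows is just a matter of appending rows, and swapping in a different model only changes how `results` is produced.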
-
fastdup
fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.
- Project mention: Asahi Linux Lead Developer Hector Martin Resigns from Linux Kernel | news.ycombinator.com | 2025-02-07
Of course not.
https://github.com/corollari/linusrants
Python Dataset discussion
Python Dataset related posts
-
McBroken
-
McDonald's Gives Its Restaurants an AI Makeover
-
Chain of Draft: Thinking Faster by Writing Less
-
FashionMNIST in PyTorch
-
Datasets for Computer Vision (3)
-
Victory Is Sweet: We Can Now Fix McDonald's Ice Cream Machines
-
McBroken – Is the mcdonald's ice cream machine broken?
- A note from our sponsor - InfluxDB www.influxdata.com | 23 Dec 2025
Index
What are some of the best open-source Dataset projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | public-apis | 386,541 |
| 2 | faker | 18,975 |
| 3 | LaTeX-OCR | 16,037 |
| 4 | fashion-mnist | 12,555 |
| 5 | doccano | 10,451 |
| 6 | awesome-pretrained-chinese-nlp-models | 5,485 |
| 7 | datasets | 4,518 |
| 8 | transformer-pytorch | 4,326 |
| 9 | img2dataset | 4,244 |
| 10 | TextRecognitionDataGenerator | 3,579 |
| 11 | waymo-open-dataset | 3,157 |
| 12 | pandas-datareader | 3,135 |
| 13 | Colour | 2,469 |
| 14 | beir | 2,023 |
| 15 | fastdup | 1,802 |
| 16 | ESC-50 | 1,700 |
| 17 | linusrants | 1,693 |
| 18 | DataProfiler | 1,535 |
| 19 | streaming | 1,433 |
| 20 | VBench | 1,373 |
| 21 | chatgpt-comparison-detection | 1,327 |
| 22 | covid-19 | 1,164 |
| 23 | RecSysDatasets | 1,112 |