Python Dataset

Open-source Python projects categorized as Dataset

Top 23 Python Dataset Projects

  1. public-apis

    A collective list of free APIs

    Project mention: What is an API? A Beginner's Guide | dev.to | 2025-10-25

    Public APIs Directory

  2. faker

    Faker is a Python package that generates fake data for you. (by joke2k)

  3. LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

  4. fashion-mnist

    A MNIST-like fashion product database. Benchmark 👇

  5. doccano

    Open source annotation tool for machine learning practitioners.

  6. awesome-pretrained-chinese-nlp-models

    Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models

  7. datasets

    TFDS is a collection of datasets ready to use with TensorFlow, Jax, ... (by tensorflow)

  8. transformer-pytorch

    Transformer: PyTorch Implementation of "Attention Is All You Need"

  9. img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Project mention: Anthropic reverses privacy stance, will train on Claude chats | news.ycombinator.com | 2025-08-29

    > By default, you are opted in. Perfectly clear.

    That's called opt-out. You're doing exactly what I described: gaslighting people into believing that opt-in and opt-out are synonyms, which makes the entire concept meaningless. The audacity of you calling me "political" while resorting to such manipulation is astounding.

    These are examples of what "opt-in by default" actually means. It means having the user manually consent to something every time, the polar opposite of your definition.

    - https://arstechnica.com/gadgets/2024/06/report-new-apple-int...

    - https://github.com/rom1504/img2dataset/issues/293

    It's also just pure laziness to label me as "hysterical" when PR departments of companies like Google have, like you, misused the terms opt-out and opt-in in deceptive ways.

    https://news.ycombinator.com/item?id=37314981

  10. TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  11. waymo-open-dataset

    Waymo Open Dataset

  12. pandas-datareader

    Extract data from a wide range of Internet sources into a pandas DataFrame.

  13. Colour

    Colour Science for Python

    Project mention: What Is a Color Space? | news.ycombinator.com | 2025-08-25

    Nice article. I recently came across a very cool Python library for colour science too - https://www.colour-science.org/

    I just started playing with it and my spectrometer, based on one of their examples, to convert spectral data to a single RGB value.

  14. beir

    A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

    Project mention: Gemini Embedding: Powering RAG and context engineering | news.ycombinator.com | 2025-07-31

    It's always worth checking out the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

    There are some good open models there that have longer context limits and fewer dimensions.

    The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...

    Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.

  15. fastdup

    fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

  16. ESC-50

    ESC-50: Dataset for Environmental Sound Classification

  17. linusrants

    Dataset of Linus Torvalds' rants classified by negativity using sentiment analysis

    Project mention: Asahi Linux Lead Developer Hector Martin Resigns from Linux Kernel | news.ycombinator.com | 2025-02-07

    Of course not.

    https://github.com/corollari/linusrants

  18. DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  19. streaming

    A Data Streaming Library for Efficient Neural Network Training (by mosaicml)

  20. VBench

    [CVPR2024 Highlight] VBench - We Evaluate Video Generation

  21. chatgpt-comparison-detection

    Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

  22. covid-19

    Novel Coronavirus 2019 time series data on cases (by datasets)

  23. RecSysDatasets

    This is a repository of public data sources for Recommender Systems (RS).

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Dataset related posts

  • McBroken

    1 project | news.ycombinator.com | 26 Aug 2025
  • McDonald's Gives Its Restaurants an AI Makeover

    1 project | news.ycombinator.com | 7 Mar 2025
  • Chain of Draft: Thinking Faster by Writing Less

    1 project | dev.to | 28 Feb 2025
  • FashionMNIST in PyTorch

    1 project | dev.to | 8 Dec 2024
  • Datasets for Computer Vision (3)

    1 project | dev.to | 25 Nov 2024
  • Victory Is Sweet: We Can Now Fix McDonald's Ice Cream Machines

    3 projects | news.ycombinator.com | 25 Oct 2024
  • McBroken – Is the mcdonald's ice cream machine broken?

    1 project | news.ycombinator.com | 16 Jul 2024

Index

What are some of the best open-source Dataset projects in Python? This list will help you:

#    Project                                  Stars
1    public-apis                              386,541
2    faker                                    18,975
3    LaTeX-OCR                                16,037
4    fashion-mnist                            12,555
5    doccano                                  10,451
6    awesome-pretrained-chinese-nlp-models    5,485
7    datasets                                 4,518
8    transformer-pytorch                      4,326
9    img2dataset                              4,244
10   TextRecognitionDataGenerator             3,579
11   waymo-open-dataset                       3,157
12   pandas-datareader                        3,135
13   Colour                                   2,469
14   beir                                     2,023
15   fastdup                                  1,802
16   ESC-50                                   1,700
17   linusrants                               1,693
18   DataProfiler                             1,535
19   streaming                                1,433
20   VBench                                   1,373
21   chatgpt-comparison-detection             1,327
22   covid-19                                 1,164
23   RecSysDatasets                           1,112

