FineData

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).

Recent Activity

eliebak submitted a paper 6 days ago

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

hynky new activity 10 days ago

HuggingFaceFW/finepdfs:Which language detector did you use

hynky new activity 13 days ago

HuggingFaceFW/finepdfs:The "file_path" data field appears to primarily contain cc-index paths rather than WARC paths.

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Organization Card

🍷 FineData

This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.

🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web.
🌐 FineWiki: an updated, better extracted version of Wikipedia in 300+ languages.
📄 FinePDFs-Edu: 350B+ highly educational tokens filtered from 📄 FinePDFs.
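
Datasets of this size are usually sampled by streaming rather than downloaded in full. The sketch below shows one way to peek at a few rows with the `datasets` library. The repo id is taken from this page; the `eng_Latn` config name and the `text` column are assumptions based on the naming seen here and should be checked against the dataset card. `first_texts` and `stream_rows` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def first_texts(rows: Iterable[dict[str, Any]], n: int, field: str = "text") -> Iterator[str]:
    """Yield the `field` value of the first `n` rows without reading further."""
    for row in islice(rows, n):
        yield row[field]


def stream_rows(repo_id: str, config: str, split: str = "train"):
    """Open a Hub dataset lazily so nothing is downloaded up front."""
    # Imported here so first_texts stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config, split=split, streaming=True)


if __name__ == "__main__":
    # Assumption: "eng_Latn" is a valid config and rows carry a "text" column.
    rows = stream_rows("HuggingFaceFW/finepdfs-edu", "eng_Latn")
    for text in first_texts(rows, 3):
        print(text[:200].replace("\n", " "))
```

Because `first_texts` accepts any iterable of dicts, it can be tried on a small in-memory list before pointing it at a real stream.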

Collections 7

spaces 6

FineWiki Viewer

Viewer to explore the finewiki dataset

FineWeb: decanting the web for the finest text data at scale

Generate high-quality text data for LLMs using FineWeb

Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks

Evaluate multilingual models using FineTasks

Tasks Explorer

Explore and analyze experiment results

Datasets Metrics Explorer

Launch an interactive demo interface

models 105

HuggingFaceFW/finepdfs_edu_classifier_eng_Latn

0.4B • Updated Nov 11 • 13 • 2

HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn

0.4B • Updated Oct 6 • 7

HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn

0.4B • Updated Oct 6 • 8

HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn

0.4B • Updated Oct 6 • 8

HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr

0.3B • Updated Oct 6 • 8

HuggingFaceFW/finepdfs_edu_classifier_nno_Latn

0.3B • Updated Oct 6 • 8

HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl

0.3B • Updated Oct 6 • 6

HuggingFaceFW/finepdfs_edu_classifier_tam_Taml

0.3B • Updated Oct 6 • 5

HuggingFaceFW/finepdfs_edu_classifier_azj_Latn

0.3B • Updated Oct 6 • 7

HuggingFaceFW/finepdfs_edu_classifier_afr_Latn

0.3B • Updated Oct 6 • 8
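
The classifier ids listed above appear to share one shape: a task name, an optional version tag, then a three-letter language code and a four-letter script code. The sketch below splits an id into those parts. The pattern is inferred only from the ten ids shown here, not from a documented convention, so other models in the organization may not match it; `ClassifierId` and `parse_classifier_id` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the ids listed on this page:
#   <org>/finepdfs_<kind>_classifier[_v<N>]_<lang>_<Script>
_PATTERN = re.compile(
    r"^(?P<org>[^/]+)/finepdfs_(?P<kind>[a-z_]+?)_classifier"
    r"(?:_(?P<version>v\d+))?_(?P<lang>[a-z]{3})_(?P<script>[A-Z][a-z]{3})$"
)


class ClassifierId(NamedTuple):
    org: str
    kind: str
    version: Optional[str]
    lang: str
    script: str


def parse_classifier_id(repo_id: str) -> Optional[ClassifierId]:
    """Return the parts of a classifier repo id, or None if it does not match."""
    m = _PATTERN.match(repo_id)
    if m is None:
        return None
    return ClassifierId(
        m.group("org"), m.group("kind"), m.group("version"),
        m.group("lang"), m.group("script"),
    )


if __name__ == "__main__":
    print(parse_classifier_id("HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr"))
```

Returning `None` for non-matching ids keeps the helper safe to run over a mixed list that also contains dataset or unrelated model ids.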

datasets 15

HuggingFaceFW/finepdfs

Viewer • Updated 23 days ago • 476M • 31.5k • 689

HuggingFaceFW/finepdfs-edu

Viewer • Updated Nov 11 • 49.5M • 7.72k • 57

HuggingFaceFW/fineweb-2

Viewer • Updated Oct 27 • 4.48B • 60k • 707

HuggingFaceFW/finewiki

Viewer • Updated Oct 22 • 61.6M • 10.1k • 267

HuggingFaceFW/clean-wikipedia

Viewer • Updated Oct 21 • 61.2M • 1.47k • 23

HuggingFaceFW/finepdfs_lang_classification_tmp

Updated Oct 21 • 13

HuggingFaceFW/ocr-annotations

Viewer • Updated Oct 20 • 1.62k • 181 • 16

HuggingFaceFW/finepdfs_lang_classification

Viewer • Updated Oct 17 • 3.08M • 5.13k • 4

HuggingFaceFW/finepdfs_eng_Latn_labeled

Viewer • Updated Oct 6 • 1.3M • 643 • 2

HuggingFaceFW/finepdfs_fw_edu_labeled

Viewer • Updated Oct 4 • 18.8M • 470 • 4
