FineData
community
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web.
- FineWiki: an updated, better extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from FinePDFs.
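As a rough illustration of how one of these datasets might be sampled without downloading it in full, here is a minimal sketch built on the `load_dataset` function from the Hugging Face `datasets` library in streaming mode. The helper name `peek_dataset`, the `text` column name, the `train` split, and the `eng_Latn` config in the commented usage are assumptions made for illustration and are not stated on this page; check each dataset card for the actual configs and fields.

```python
from itertools import islice


def peek_dataset(repo_id, config=None, n=3, text_field="text", loader=None):
    """Return the first `n` values of `text_field` from a streamed Hub dataset.

    `peek_dataset` is a hypothetical helper; `text_field="text"` and the
    "train" split are assumptions, not facts taken from the dataset cards.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised with a stub loader.
        from datasets import load_dataset
        loader = load_dataset
    # streaming=True yields rows lazily instead of downloading every shard.
    stream = loader(repo_id, name=config, split="train", streaming=True)
    return [row[text_field] for row in islice(stream, n)]


# Example call (needs network access and `pip install datasets`);
# the "eng_Latn" config name is an assumption:
# for text in peek_dataset("HuggingFaceFW/finepdfs", config="eng_Latn"):
#     print(text[:200])
```

Streaming is used here because the datasets listed above range from hundreds of billions to trillions of tokens, so a full download is rarely practical for a quick inspection.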
- HuggingFaceFW/finepdfs: Viewer • Updated • 476M • 31.5k • 689
- HuggingFaceFW/finepdfs-edu: Viewer • Updated • 49.5M • 7.72k • 57
- HuggingFaceFW/ocr-annotations: Viewer • Updated • 1.62k • 181 • 16
- HuggingFaceFW/finepdfs_lang_classification: Viewer • Updated • 3.08M • 5.13k • 4
spaces 6
- FineWiki Viewer (Running • 9)
  Viewer to explore the finewiki dataset
- 🍷 FineWeb: decanting the web for the finest text data at scale (Running • Featured • 1.23k)
  Generate high-quality text data for LLMs using FineWeb
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (Running • 85)
  Evaluate multilingual models using FineTasks
- Tasks Explorer (Build error)
  Explore and analyze experiment results
- Datasets Metrics Explorer (Runtime error • 4)
  Launch an interactive demo interface
models 105
- HuggingFaceFW/finepdfs_edu_classifier_eng_Latn: 0.4B • Updated • 13 • 2
- HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn: 0.4B • Updated • 7
- HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn: 0.4B • Updated • 8
- HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn: 0.4B • Updated • 8
- HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr: 0.3B • Updated • 8
- HuggingFaceFW/finepdfs_edu_classifier_nno_Latn: 0.3B • Updated • 8
- HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl: 0.3B • Updated • 6
- HuggingFaceFW/finepdfs_edu_classifier_tam_Taml: 0.3B • Updated • 5
- HuggingFaceFW/finepdfs_edu_classifier_azj_Latn: 0.3B • Updated • 7
- HuggingFaceFW/finepdfs_edu_classifier_afr_Latn: 0.3B • Updated • 8
datasets 15
- HuggingFaceFW/finepdfs: Viewer • Updated • 476M • 31.5k • 689
- HuggingFaceFW/finepdfs-edu: Viewer • Updated • 49.5M • 7.72k • 57
- HuggingFaceFW/fineweb-2: Viewer • Updated • 4.48B • 60k • 707
- HuggingFaceFW/finewiki: Viewer • Updated • 61.6M • 10.1k • 267
- HuggingFaceFW/clean-wikipedia: Viewer • Updated • 61.2M • 1.47k • 23
- HuggingFaceFW/finepdfs_lang_classification_tmp: Updated • 13
- HuggingFaceFW/ocr-annotations: Viewer • Updated • 1.62k • 181 • 16
- HuggingFaceFW/finepdfs_lang_classification: Viewer • Updated • 3.08M • 5.13k • 4
- HuggingFaceFW/finepdfs_eng_Latn_labeled: Viewer • Updated • 1.3M • 643 • 2
- HuggingFaceFW/finepdfs_fw_edu_labeled: Viewer • Updated • 18.8M • 470 • 4