Loaders for various machine learning datasets for testing and example scripts. Previously in thinc.extra.datasets.
The package can be installed via pip:

    pip install ml-datasets

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments; see the source for details.

    # Import directly
    from ml_datasets import imdb
    train_data, dev_data = imdb()

    # Load via registry
    from ml_datasets import loaders
    imdb_loader = loaders.get("imdb")
    train_data, dev_data = imdb_loader()

| ID / Function | Description | NLP task | From URL |
|---|---|---|---|
| imdb | IMDB sentiment dataset | Binary classification: sentiment analysis | ✓ |
| dbpedia | DBPedia ontology dataset | Multi-class single-label classification | ✓ |
| cmu | CMU movie genres dataset | Multi-class, multi-label classification | ✓ |
| quora_questions | Duplicate Quora questions dataset | Detecting duplicate questions | ✓ |
| reuters | Reuters dataset (texts not included) | Multi-class multi-label classification | ✓ |
| snli | Stanford Natural Language Inference corpus | Recognizing textual entailment | ✓ |
| stack_exchange | Stack Exchange dataset | Question Answering | |
| ud_ancora_pos_tags | Universal Dependencies Spanish AnCora corpus | POS tagging | ✓ |
| ud_ewtb_pos_tags | Universal Dependencies English EWT corpus | POS tagging | ✓ |
| wikiner | WikiNER data | Named entity recognition | |

| ID / Function | Description | ML task | From URL |
|---|---|---|---|
| mnist | MNIST data | Image recognition | ✓ |
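
Because each ID in the tables above is also a registry key, a script can take the dataset name from the command line. The following is a minimal sketch, not part of the package: `build_parser`, `load_by_name`, `StubRegistry` and `fake_imdb` are hypothetical names, and `load_by_name` assumes only that the registry offers the `get(name)` method shown earlier. The stub registry stands in for `ml_datasets.loaders` so that the example runs offline.

```python
import argparse


def build_parser():
    """Build a parser with one positional argument: the loader ID."""
    parser = argparse.ArgumentParser(description="Load a dataset by its string name.")
    parser.add_argument("dataset", help='Loader ID, e.g. "imdb" or "dbpedia"')
    return parser


def load_by_name(name, registry):
    """Look up a loader by its string name and call it without arguments."""
    loader = registry.get(name)
    return loader()


class StubRegistry:
    """Hypothetical stand-in for ml_datasets.loaders (offline demo only)."""

    def __init__(self, loaders):
        self._loaders = loaders

    def get(self, name):
        return self._loaders[name]


def fake_imdb():
    # Toy data in the (train, dev) shape shown for imdb above.
    return [("great film", 1)], [("dull film", 0)]


# In a real script, call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["imdb"])
train_data, dev_data = load_by_name(args.dataset, StubRegistry({"imdb": fake_imdb}))
print(len(train_data), len(dev_data))
```

In a real script, `ml_datasets.loaders` would be passed as the registry in place of the stub.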

Each instance contains the text of a movie review and a sentiment label, expressed as 0 or 1.

    import ml_datasets

    train_data, dev_data = ml_datasets.imdb()
    for text, annot in train_data[0:5]:
        print(f"Review: {text}")
        print(f"Sentiment: {annot}")

- Download URL: http://ai.stanford.edu/~amaas/data/sentiment/
- Citation: Andrew L. Maas et al., 2011

| Property | Training | Dev |
|---|---|---|
| # Instances | 25000 | 25000 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Balanced (50/50) | Balanced (50/50) |
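
The label distribution in the table can be checked on loaded data. The sketch below assumes the `(text, label)` tuple format shown above; `label_distribution` is a hypothetical helper, and a toy list stands in for `train_data` so that nothing needs to be downloaded.

```python
from collections import Counter


def label_distribution(data):
    """Return {label: fraction of instances} for (text, label) pairs."""
    counts = Counter(label for _, label in data)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for train_data; the real data would come from ml_datasets.imdb().
sample = [("loved it", 1), ("hated it", 0), ("fine", 1), ("awful", 0)]
print(label_distribution(sample))
```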

Each instance contains an ontological description and a classification into one of 14 distinct labels.

    import ml_datasets

    train_data, dev_data = ml_datasets.dbpedia()
    for text, annot in train_data[0:5]:
        print(f"Text: {text}")
        print(f"Category: {annot}")

- Download URL: Via fast.ai
- Original citation: Xiang Zhang et al., 2015

| Property | Training | Dev |
|---|---|---|
| # Instances | 560000 | 70000 |
| Label values | 1-14 | 1-14 |
| Labels per instance | Single | Single |
| Label distribution | Balanced | Balanced |
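
The labels run from 1 to 14, while many training libraries expect class indices that start at 0. A minimal sketch of the conversion follows; `to_zero_based` is a hypothetical helper. Whether the loader yields labels as integers or as strings is not stated here, so the sketch accepts both.

```python
def to_zero_based(data, num_classes=14):
    """Map labels 1..num_classes to indices 0..num_classes-1."""
    converted = []
    for text, label in data:
        index = int(label) - 1  # int() accepts both 3 and "3"
        if not 0 <= index < num_classes:
            raise ValueError(f"Label {label!r} is outside 1..{num_classes}")
        converted.append((text, index))
    return converted


# Toy stand-in for train_data; the real data would come from ml_datasets.dbpedia().
sample = [("Some company description", "1"), ("Some film description", 14)]
print(to_zero_based(sample))
```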

Each instance contains a movie description and a list of the genres that apply to it.

    import ml_datasets

    train_data, dev_data = ml_datasets.cmu()
    for text, annot in train_data[0:5]:
        print(f"Text: {text}")
        print(f"Genres: {annot}")

- Download URL: http://www.cs.cmu.edu/~ark/personas/
- Original citation: David Bamman et al., 2013

| Property | Training | Dev |
|---|---|---|
| # Instances | 41793 | 0 |
| Label values | 363 different genres | - |
| Labels per instance | Multiple | - |
| Label distribution | Imbalanced: 147 labels have fewer than 20 examples, while Drama occurs more than 19000 times | - |
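
With so many genres appearing only a handful of times, one option is to drop rare labels before training. The sketch below is an illustration rather than anything the package provides: `drop_rare_labels` is a hypothetical helper, the threshold of 20 simply mirrors the figure in the table, and each annotation is assumed to be a list of genre strings.

```python
from collections import Counter


def drop_rare_labels(data, min_count=20):
    """Remove labels seen fewer than min_count times.

    Instances left without any label are dropped as well.
    """
    counts = Counter(label for _, labels in data for label in labels)
    filtered = []
    for text, labels in data:
        kept = [label for label in labels if counts[label] >= min_count]
        if kept:
            filtered.append((text, kept))
    return filtered


# Toy stand-in for train_data; the real data would come from ml_datasets.cmu().
sample = [
    ("A family drama.", ["Drama"]),
    ("A war drama.", ["Drama", "War film"]),
    ("A silent short.", ["Silent film"]),
]
print(drop_rare_labels(sample, min_count=2))
```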

Each instance contains two Quora questions and a label indicating whether or not they are duplicates (0: no, 1: yes). The ground-truth labels contain some noise and are not guaranteed to be correct.

    import ml_datasets

    train_data, dev_data = ml_datasets.quora_questions()
    for questions, annot in train_data[0:50]:
        q1, q2 = questions
        print(f"Question 1: {q1}")
        print(f"Question 2: {q2}")
        print(f"Similarity: {annot}")

- Download URL: http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
- Original citation: Kornél Csernai et al., 2017

| Property | Training | Dev |
|---|---|---|
| # Instances | 363859 | 40429 |
| Label values | {0, 1} | {0, 1} |
| Labels per instance | Single | Single |
| Label distribution | Imbalanced: 63% label 0 | Imbalanced: 63% label 0 |
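
Since 63% of the pairs carry label 0, a model that always predicts 0 already reaches roughly 63% accuracy, which makes a useful floor when evaluating a classifier. The sketch below computes that floor; `majority_baseline` is a hypothetical helper, and toy data in the `((q1, q2), label)` format stands in for the real sets.

```python
from collections import Counter


def majority_baseline(train_data, dev_data):
    """Return (most common training label, its accuracy on dev_data)."""
    counts = Counter(label for _, label in train_data)
    majority_label, _ = counts.most_common(1)[0]
    hits = sum(1 for _, label in dev_data if label == majority_label)
    return majority_label, hits / len(dev_data)


# Toy stand-ins; the real data would come from ml_datasets.quora_questions().
train = [(("q1", "q2"), 0), (("q3", "q4"), 0), (("q5", "q6"), 1)]
dev = [(("q7", "q8"), 0), (("q9", "q10"), 1)]
print(majority_baseline(train, dev))
```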

Loaders can be registered externally by using the `loaders` registry as a decorator. For example:

    import ml_datasets

    @ml_datasets.loaders("my_custom_loader")
    def my_custom_loader():
        # load_some_data is a placeholder for your own loading code
        return load_some_data()

    assert "my_custom_loader" in ml_datasets.loaders
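
As one illustration of what the body of a custom loader might look like, the sketch below reads tab-separated `text<TAB>label` rows and returns a `(train, dev)` pair, matching the shape used by the loaders above. Everything here is an assumption made for the example: the file format, the helper names `read_tsv_pairs` and `split_train_dev`, and the 80/20 split. An in-memory string replaces a real file so that the example runs as is.

```python
import csv
import io


def read_tsv_pairs(handle):
    """Parse 'text<TAB>label' rows into (text, label) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def split_train_dev(examples, dev_fraction=0.2):
    """Put the last dev_fraction of the examples into the dev set."""
    n_dev = int(len(examples) * dev_fraction)
    cut = len(examples) - n_dev
    return examples[:cut], examples[cut:]


def my_custom_loader(handle):
    # Hypothetical loader body; it could be registered with the
    # decorator shown above once the data source is fixed.
    return split_train_dev(read_tsv_pairs(handle))


demo = io.StringIO("good\t1\nbad\t0\nokay\t1\nawful\t0\ngreat\t1\n")
train_data, dev_data = my_custom_loader(demo)
print(len(train_data), len(dev_data))
```

Labels stay as strings in this sketch; convert them inside the loader if downstream code expects integers.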