ir_datasets : Washington Post

import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2 docs

 [doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2', meta={"docno": 36})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wapo.v2')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } } }

`"wapo/v2/trec-core-2018"`

The TREC Common Core 2018 benchmark.

Queries: TREC-style (keyword, description, narrative)
Relevance: Deeply-annotated
Shared Task Website

50 queries

Language: en

Query type:

TrecQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-core-2018 queries

 [query_id]    [title]    [description]    [narrative]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wapo.v2.trec-core-2018.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

595K docs

Inherits docs from wapo/v2

Language: en

Document type:

WapoDoc: (namedtuple)

doc_id: str
url: str
title: str
author: str
published_date: Optional[int]
kicker: str
body: str
body_paras_html: Tuple[str, ...]
body_media: Tuple[WapoDocMedia, ...]

WapoDocMedia: (namedtuple)

type: str
url: str
text: str
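The published_date field listed above is an optional integer, so code that consumes it has to handle a missing value. The sketch below shows one way to turn it into a timezone-aware datetime. It assumes the integer is milliseconds since the Unix epoch, which is not stated on this page; the helper name published_datetime and the sample value are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def published_datetime(published_date: Optional[int]) -> Optional[datetime]:
    """Convert a published_date value to an aware UTC datetime.

    Assumption (not confirmed by this page): the integer counts
    milliseconds since the Unix epoch.
    """
    if published_date is None:
        # Documents without a publication date are passed through as None.
        return None
    return datetime.fromtimestamp(published_date / 1000, tz=timezone.utc)


# Hypothetical value, chosen only to illustrate the conversion.
example = published_datetime(1325376000000)
print(example.isoformat())
```

If the values turn out to be seconds rather than milliseconds, dropping the division by 1000 is the only change needed.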

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-core-2018 docs

 [doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2', meta={"docno": 36})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wapo.v2.trec-core-2018')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

26K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`22K`	85.0%
1	relevant	`2.1K`	7.9%
2	highly relevant	`1.9K`	7.1%

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-core-2018 qrels --format tsv

 [query_id]    [doc_id]    [relevance]    [iteration]
 ...
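The tab-separated export above can be read back with the standard library. The sketch below assumes the export has no header row and that the columns appear in the order shown (query_id, doc_id, relevance, iteration); both points should be checked against actual output. The name parse_qrels_tsv and the sample text are hypothetical.

```python
import csv
import io

# Column order as shown in the export listing above.
QREL_FIELDS = ("query_id", "doc_id", "relevance", "iteration")


def parse_qrels_tsv(stream):
    """Yield one dict per non-empty line of a tab-separated qrels export."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(QREL_FIELDS, row))
        record["relevance"] = int(record["relevance"])
        yield record


# Hypothetical export content standing in for a real file.
sample_export = "q1\tdoc-a\t2\t0\nq1\tdoc-b\t0\t0\n"
rows = list(parse_qrels_tsv(io.StringIO(sample_export)))
print(rows)
```

For a real export, passing an open file handle in place of the StringIO object should work the same way.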

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wapo.v2.trec-core-2018.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 26233, "fields": { "relevance": { "counts_by_value": { "0": 22285, "2": 1865, "1": 2083 } } } } }
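The percentages in the relevance table can be recomputed from the counts in the metadata above. The sketch below copies only the qrels part of that metadata into a string and derives the share of each relevance level; the function name relevance_shares is hypothetical.

```python
import json

# The qrels portion of the metadata shown above.
metadata = json.loads(
    '{"qrels": {"count": 26233, "fields": {"relevance": '
    '{"counts_by_value": {"0": 22285, "2": 1865, "1": 2083}}}}}'
)


def relevance_shares(meta):
    """Return {relevance_level: percentage of all qrels}, rounded to one decimal."""
    qrels = meta["qrels"]
    total = qrels["count"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    return {
        int(level): round(100 * count / total, 1)
        for level, count in sorted(counts.items(), key=lambda item: int(item[0]))
    }


shares = relevance_shares(metadata)
print(shares)
```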

`"wapo/v2/trec-news-2018"`

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

50 queries

Language: en

Query type:

TrecBackgroundLinkingQuery: (namedtuple)

query_id: str
doc_id: str
url: str
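Because a background-linking query only names an article through its doc_id, a text query has to be derived from that article before a keyword retriever can use it. The sketch below shows one possible approach using stand-in namedtuples and a plain dictionary, so it runs without the corpus. The reduced SimpleDoc type, the function query_text, and the choice of title plus body as query text are all assumptions for illustration; with the real corpus, a lookup such as dataset.docs_store().get(doc_id) would presumably replace the dictionary.

```python
from collections import namedtuple

# Stand-in with the same fields as TrecBackgroundLinkingQuery.
TrecBackgroundLinkingQuery = namedtuple(
    "TrecBackgroundLinkingQuery", ["query_id", "doc_id", "url"]
)
# Hypothetical reduced document type holding only the fields used here.
SimpleDoc = namedtuple("SimpleDoc", ["doc_id", "title", "body"])


def query_text(query, docs_by_id):
    """Return the title and body of the article a query points to, or None."""
    doc = docs_by_id.get(query.doc_id)
    if doc is None:
        return None  # the referenced article is not in the lookup
    return f"{doc.title} {doc.body}".strip()


# Hypothetical data for illustration only.
docs = {"doc-a": SimpleDoc("doc-a", "Example headline", "Example body text.")}
query = TrecBackgroundLinkingQuery("321", "doc-a", "https://example.com/a")
print(query_text(query, docs))
```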

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2018 queries

 [query_id]    [doc_id]    [url]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wapo.v2.trec-news-2018.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

595K docs

Inherits docs from wapo/v2

Language: en

Document type:

WapoDoc: (namedtuple)

doc_id: str
url: str
title: str
author: str
published_date: Optional[int]
kicker: str
body: str
body_paras_html: Tuple[str, ...]
body_media: Tuple[WapoDocMedia, ...]

WapoDocMedia: (namedtuple)

type: str
url: str
text: str
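The body_paras_html field holds one HTML fragment per paragraph. When plain text per paragraph is needed, the markup can be stripped with the standard library, as in the sketch below. The helper paragraph_text and the sample fragments are hypothetical, and real paragraphs may contain markup this simple extractor does not treat specially.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def paragraph_text(html_fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical fragments standing in for doc.body_paras_html.
sample_paras = (
    '<p>First <a href="https://example.com">linked</a> paragraph.</p>',
    "<p>Second &amp; last.</p>",
)
plain = [paragraph_text(para) for para in sample_paras]
print(plain)
```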

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2018 docs

 [doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2', meta={"docno": 36})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wapo.v2.trec-news-2018')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

8.5K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	The document provides little or no useful background information.	`6.5K`	76.0%
2	The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.	`1.2K`	14.0%
4	The document provides significantly useful background ...	`584`	6.9%
8	The document provides essential useful background ...	`164`	1.9%
16	The document _must_ appear in the sidebar otherwise critical context is missing.	`106`	1.2%
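The relevance levels above double with each step (0, 2, 4, 8, 16), so a metric that uses the label directly as gain rewards the top grades heavily. The sketch below computes nDCG under that assumption, with a log2(rank + 1) discount. It is an illustration only and is not claimed to match the official track scoring or any particular evaluation tool; the function names and sample judgments are hypothetical.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain over the first k gains."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg(ranked_doc_ids, judgments, k=20):
    """nDCG@k using the relevance label itself as the gain (an assumption)."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal = dcg(ideal_gains, k)
    if ideal == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg(gains, k) / ideal


# Hypothetical judgments for one query, using the label scale above.
judgments = {"doc-a": 16, "doc-b": 2, "doc-c": 0}
print(ndcg(["doc-a", "doc-b", "doc-c"], judgments))
print(ndcg(["doc-b", "doc-a"], judgments))
```

Unjudged documents are treated as gain 0 here, which is a common convention but also an assumption.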

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2018 qrels --format tsv

 [query_id]    [doc_id]    [relevance]    [iteration]
 ...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wapo.v2.trec-news-2018.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Soboroff2018News}

Bibtex:

@inproceedings{Soboroff2018News,
  title={TREC 2018 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2018}
}

{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 8508, "fields": { "relevance": { "counts_by_value": { "16": 106, "2": 1189, "0": 6465, "4": 584, "8": 164 } } } } }

`"wapo/v2/trec-news-2019"`

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

Queries: Articles via the doc_id field
Shared Task Website
Shared task paper

60 queries

Language: en

Query type:

TrecBackgroundLinkingQuery: (namedtuple)

query_id: str
doc_id: str
url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2019 queries

 [query_id]    [doc_id]    [url]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wapo.v2.trec-news-2019.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

595K docs

Inherits docs from wapo/v2

Language: en

Document type:

WapoDoc: (namedtuple)

doc_id: str
url: str
title: str
author: str
published_date: Optional[int]
kicker: str
body: str
body_paras_html: Tuple[str, ...]
body_media: Tuple[WapoDocMedia, ...]

WapoDocMedia: (namedtuple)

type: str
url: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2019 docs

 [doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2', meta={"docno": 36})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wapo.v2.trec-news-2019')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

16K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	The document provides little or no useful background information.	`13K`	80.6%
2	The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.	`1.7K`	10.7%
4	The document provides significantly useful background ...	`660`	4.2%
8	The document provides essential useful background ...	`431`	2.8%
16	The document _must_ appear in the sidebar otherwise critical context is missing.	`273`	1.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v2/trec-news-2019 qrels --format tsv

 [query_id]    [doc_id]    [relevance]    [iteration]
 ...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wapo.v2.trec-news-2019.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Soboroff2019News}

Bibtex:

@inproceedings{Soboroff2019News,
  title={TREC 2019 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2019}
}

{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 60 }, "qrels": { "count": 15655, "fields": { "relevance": { "counts_by_value": { "2": 1677, "0": 12614, "8": 431, "16": 273, "4": 660 } } } } }

`"wapo/v3/trec-news-2020"`

The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.

If you have a copy of the v3 dataset, we would appreciate a pull request to add support!

Queries: Articles via the doc_id field
Shared Task Website

50 queries

Language: en

Query type:

TrecBackgroundLinkingQuery: (namedtuple)

query_id: str
doc_id: str
url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v3/trec-news-2020 queries

 [query_id]    [doc_id]    [url]
 ...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wapo.v3.trec-news-2020.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

18K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	The document provides little or no useful background information.	`15K`	86.4%
2	The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.	`1.6K`	9.0%
4	The document provides significantly useful background ...	`631`	3.6%
8	The document provides essential useful background ...	`132`	0.7%
16	The document _must_ appear in the sidebar otherwise critical context is missing.	`50`	0.3%

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v3/trec-news-2020 qrels --format tsv

 [query_id]    [doc_id]    [relevance]    [iteration]
 ...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wapo.v3.trec-news-2020.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{ "queries": { "count": 50 }, "qrels": { "count": 17764, "fields": { "relevance": { "counts_by_value": { "0": 15348, "2": 1603, "4": 631, "8": 132, "16": 50 } } } } }

`"wapo/v4"`

(no description provided)

docs

729K docs

Language: en

Document type:

WapoDoc: (namedtuple)

doc_id: str
url: str
title: str
author: str
published_date: Optional[int]
kicker: str
body: str
body_paras_html: Tuple[str, ...]
body_media: Tuple[WapoDocMedia, ...]

WapoDocMedia: (namedtuple)

type: str
url: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wapo/v4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI

ir_datasets export wapo/v4 docs

 [doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
 ...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v4')
# Index wapo/v4
indexer = pt.IterDictIndexer('./indices/wapo_v4', meta={"docno": 36})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wapo.v4')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.