Github: datasets/codec.py

ir_datasets: CODEC

Index
  1. codec
  2. codec/economics
  3. codec/history
  4. codec/politics

Data Access Information

To use this dataset, you need a copy of the document corpus from here.

The process involves emailing a dataset author, who will provide instructions for downloading the dataset.

ir_datasets expects the source file to be copied/linked under ~/.ir_datasets/codec/v1/codec_documents.jsonl.
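Putting the file in place can be scripted. The following is a minimal sketch using only the standard library; `link_into_place` is a hypothetical helper (not part of ir_datasets), and the source path in the usage comment is a placeholder for wherever the downloaded corpus was saved.

```python
from pathlib import Path


def link_into_place(source, target):
    """Symlink `source` at `target`, creating parent directories as needed.

    Hypothetical helper for illustration; not part of ir_datasets.
    """
    source = Path(source).expanduser().resolve()
    target = Path(target).expanduser()
    if not source.is_file():
        raise FileNotFoundError(f"corpus file not found: {source}")
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists() or target.is_symlink():
        return target  # leave an existing file or link untouched
    target.symlink_to(source)
    return target


# Usage (placeholder source path; the target is the location named above):
# link_into_place("/path/to/downloaded/corpus.jsonl",
#                 "~/.ir_datasets/codec/v1/<file name given above>")
```

Linking rather than copying avoids duplicating a large corpus file on disk; copying works equally well if symlinks are not an option.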


"codec"

CODEC Document Ranking sub-task.

  • Documents: curated web articles
  • Queries: challenging, entity-focused queries
  • Task Repository
  • See also: kilt/codec, the entity ranking subtask
queries
42 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec queries 
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
730K docs

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec docs 
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
6.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str
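These four fields correspond to the conventional TREC qrels line layout (`query_id iteration doc_id relevance`, whitespace-separated). As a rough illustration of how such a line maps onto the fields above, here is a minimal sketch; the local `TrecQrel` tuple and `parse_trec_qrel` are stand-ins written for this example, not the classes ir_datasets itself uses, and the sample line is made up.

```python
from typing import NamedTuple


class TrecQrel(NamedTuple):
    # Local stand-in with the same field order as listed above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def parse_trec_qrel(line: str) -> TrecQrel:
    """Parse one 'query_id iteration doc_id relevance' line."""
    query_id, iteration, doc_id, relevance = line.split()
    return TrecQrel(query_id, doc_id, int(relevance), iteration)


# Made-up sample line, for illustration only.
qrel = parse_trec_qrel("q1 0 doc-abc 2")
print(qrel.relevance)
```

Note that the on-disk column order differs from the namedtuple field order: the iteration column comes second in the file but last in the tuple.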

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    2.4K    38.0%
1    Not Valuable. Consists of definitions or background.    2.2K    35.7%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    1.2K    19.5%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    416    6.7%
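Graded labels like these are what a measure such as nDCG consumes. The sketch below shows one common formulation (linear gain, log2 discount) to illustrate how the four levels contribute to a score; it is not necessarily the exact variant computed by any particular evaluation toolkit, and the sample ranking is made up.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, all_judged_relevances, k=20):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ordering."""
    ideal = sorted(all_judged_relevances, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up example: labels (0-3) of the documents a system returned, in rank order.
ranking = [3, 0, 2, 1]
print(round(ndcg(ranking, ranking, k=20), 3))
```

Because level 1 still carries a non-zero gain here, a document that only provides background raises the score slightly; whether that is desirable depends on the evaluation setup.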

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export codec qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

"codec/economics"

Subset of codec that only contains topics about economics.
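Conceptually, a subset like this corresponds to filtering the full query set on the `domain` field. A minimal sketch, assuming the field holds the subset name as a plain string (an assumption, not confirmed here); the `CodecQuery` tuple below is a local stand-in and the sample queries are invented.

```python
from typing import NamedTuple


class CodecQuery(NamedTuple):
    # Local stand-in with the same fields as the query type listed for codec.
    query_id: str
    query: str
    domain: str
    guidelines: str


def queries_for_domain(queries, domain):
    """Keep only the queries whose domain matches (assumed plain-string match)."""
    return [q for q in queries if q.domain == domain]


# Invented sample queries, for illustration only.
sample = [
    CodecQuery("q1", "invented economics question", "economics", ""),
    CodecQuery("q2", "invented history question", "history", ""),
]
economics = queries_for_domain(sample, "economics")
print([q.query_id for q in economics])
```

With the real data, loading the subset by its own identifier makes this filtering unnecessary.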

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics queries 
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.economics.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics docs 
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.economics')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    660    33.5%
1    Not Valuable. Consists of definitions or background.    693    35.2%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    458    23.2%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    159    8.1%
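The percentage column follows directly from the counts. A small sketch that recomputes it from the four counts in the table above; `percentages` is a helper name chosen for this example.

```python
def percentages(counts):
    """Map each relevance level to its share of all judgments, in percent."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


# Counts per relevance level, copied from the table above.
economics_counts = {0: 660, 1: 693, 2: 458, 3: 159}
print(sum(economics_counts.values()))
print(percentages(economics_counts))
```

The counts sum to 1,970 judgments, which is the figure rounded to 2.0K in the qrels heading.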

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics qrels --format tsv 
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.economics.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

"codec/history"

Subset of codec that only contains topics about history.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history queries 
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.history.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history docs 
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.history')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    998    49.3%
1    Not Valuable. Consists of definitions or background.    618    30.5%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    292    14.4%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    116    5.7%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history qrels --format tsv 
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.history.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

"codec/politics"

Subset of codec that only contains topics about politics.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics queries 
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.politics.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics docs 
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.politics')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
2.2K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    695    31.7%
1    Not Valuable. Consists of definitions or background.    899    41.0%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    457    20.8%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    141    6.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics qrels --format tsv 
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.politics.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}