Github: datasets/gov.py

ir_datasets: GOV

Index
  1. gov
  2. gov/trec-web-2002
  3. gov/trec-web-2002/named-page
  4. gov/trec-web-2003
  5. gov/trec-web-2003/named-page
  6. gov/trec-web-2004

Data Access Information

To use this dataset, you need a copy of GOV, provided by the University of Glasgow.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational agreement" and pay a fee to UoG to get a copy. The data are shipped to you on hard drives.

Once you have the data, ir_datasets will need the directories that look like the following:

ir_datasets expects the above directories to be copied/linked under ~/.ir_datasets/gov/corpus.
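Before loading the dataset, it can help to confirm that the corpus sits where the text above says it should. The following is a minimal sketch that only builds and checks the ~/.ir_datasets/gov/corpus path; the helper names gov_corpus_dir and corpus_is_present are made up for illustration and are not part of ir_datasets.

```python
from pathlib import Path
from typing import Optional


def gov_corpus_dir(home: Optional[Path] = None) -> Path:
    """Return the directory where the GOV corpus is expected to live.

    `home` defaults to the current user's home directory (an assumption
    that matches the ~/.ir_datasets/gov/corpus path given above).
    """
    base = home if home is not None else Path.home()
    return base / ".ir_datasets" / "gov" / "corpus"


def corpus_is_present(home: Optional[Path] = None) -> bool:
    """True if the corpus directory exists and contains at least one entry."""
    corpus = gov_corpus_dir(home)
    return corpus.is_dir() and any(corpus.iterdir())


if __name__ == "__main__":
    print(gov_corpus_dir(), "present:", corpus_is_present())
```

This does not validate the contents of the directory; it only reports whether something has been copied or linked there.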


"gov"

GOV web document collection. Used for early TREC Web Tracks. Not to be confused with gov2.

The dataset is obtained for a fee from UoG, and is shipped as a hard drive. More information is provided here.

docs
1.2M docs

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str
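
Because body is bytes rather than str, it has to be decoded before any text processing. The sketch below uses a stand-in namedtuple with the same fields as GovDoc and tries UTF-8 first, then Latin-1. The encoding order is an assumption made for illustration, since the page does not say which encodings occur in the corpus, and body_text is a hypothetical helper, not an ir_datasets function.

```python
from typing import NamedTuple, Sequence


class GovDoc(NamedTuple):
    """Stand-in with the same fields as the GovDoc type listed above."""
    doc_id: str
    url: str
    http_headers: str
    body: bytes
    body_content_type: str


def body_text(doc: GovDoc, encodings: Sequence[str] = ("utf-8", "latin-1")) -> str:
    """Decode doc.body, trying each encoding in order (assumed order)."""
    for encoding in encodings:
        try:
            return doc.body.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Reached only if every listed encoding failed; replace undecodable bytes.
    return doc.body.decode("utf-8", errors="replace")


# Made-up record for illustration; real records come from dataset.docs_iter().
example = GovDoc(
    doc_id="example-doc-id",
    url="http://example.gov/",
    http_headers="HTTP/1.1 200 OK",
    body=b"<html><body>Caf\xe9</body></html>",
    body_content_type="text/html",
)
print(body_text(example))
```

The sample body is not valid UTF-8, so the helper falls through to Latin-1 for it.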

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.


"gov/trec-web-2002"

The TREC Web Track 2002 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 queries 
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.
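
The exported query lines can be read back into structured records. This is a minimal sketch that assumes the export is tab-separated with the four fields in the order shown above and one query per line; TrecQuery here is a local stand-in, parse_query_lines is a hypothetical helper, and the sample line is made up.

```python
from typing import Iterable, Iterator, NamedTuple


class TrecQuery(NamedTuple):
    """Stand-in with the same fields as the TrecQuery type listed above."""
    query_id: str
    title: str
    description: str
    narrative: str


def parse_query_lines(lines: Iterable[str]) -> Iterator[TrecQuery]:
    """Yield one TrecQuery per non-empty tab-separated line."""
    expected = len(TrecQuery._fields)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != expected:
            raise ValueError(f"expected {expected} fields, got {len(fields)}")
        yield TrecQuery(*fields)


# Made-up line for illustration, not taken from the benchmark.
sample = ["q1\tsample title\tsample description\tsample narrative\n"]
for query in parse_query_lines(sample):
    print(query.query_id, query.title)
```

In practice the lines would come from a file or pipe holding the output of the export command rather than from an in-memory list.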

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2002.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2002')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
57K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition      Count    %
0       Not Relevant    55K      97.2%
1       Relevant        1.6K     2.8%
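
A count-and-percentage breakdown like the one in this table can be recomputed from any iterable of qrels. The sketch below uses a stand-in TrecQrel namedtuple and a handful of made-up judgments; relevance_distribution is a hypothetical helper, not part of ir_datasets.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class TrecQrel(NamedTuple):
    """Stand-in with the same fields as the TrecQrel type listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevance_distribution(qrels: Iterable[TrecQrel]) -> Dict[int, Tuple[int, float]]:
    """Map each relevance level to (count, percentage of all judgments)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


# Made-up judgments for illustration only.
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q2", "d3", 0, "0"),
    TrecQrel("q2", "d4", 0, "0"),
]
for level, (count, percent) in relevance_distribution(sample_qrels).items():
    print(level, count, f"{percent:.1f}%")
```

With the real data, the same function could be fed dataset.qrels_iter() directly, since it only reads the relevance field.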

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002 qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2002.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2002TrecWeb}

Bibtex:

@inproceedings{Craswell2002TrecWeb,
  title={Overview of the TREC-2002 Web Track},
  author={Nick Craswell and David Hawking},
  booktitle={TREC},
  year={2002}
}

"gov/trec-web-2002/named-page"

The TREC Web Track 2002 named page ranking benchmark.

queries
150 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page queries 
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2002.named-page.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2002.named-page')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
170 qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                  Count    %
1       Name refers to this page    170      100.0%
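
Since every judgment here marks a page that the query names, a natural first step is to group the target pages by query; with 170 qrels for 150 queries, some queries have more than one target. The sketch below does that grouping and then scores a ranking by reciprocal rank. Reciprocal rank is a common choice for known-item tasks but is an assumption here, as this page does not name an evaluation measure; the helper names and sample data are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class TrecQrel(NamedTuple):
    """Stand-in with the same fields as the TrecQrel type listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def targets_by_query(qrels: Iterable[TrecQrel]) -> Dict[str, Set[str]]:
    """Group the doc_ids judged relevant (relevance > 0) under their query_id."""
    targets: Dict[str, Set[str]] = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            targets[qrel.query_id].add(qrel.doc_id)
    return dict(targets)


def reciprocal_rank(ranking: List[str], targets: Set[str]) -> float:
    """Return 1/rank of the first target in the ranking, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in targets:
            return 1.0 / rank
    return 0.0


# Made-up judgments and ranking for illustration only.
sample_qrels = [
    TrecQrel("q1", "d7", 1, "0"),
    TrecQrel("q1", "d9", 1, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]
targets = targets_by_query(sample_qrels)
print(reciprocal_rank(["d1", "d9", "d7"], targets["q1"]))
```

Averaging reciprocal_rank over all queries would give a mean reciprocal rank for a run.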

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2002/named-page")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2002/named-page qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2002.named-page.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2002TrecWeb}

Bibtex:

@inproceedings{Craswell2002TrecWeb,
  title={Overview of the TREC-2002 Web Track},
  author={Nick Craswell and David Hawking},
  booktitle={TREC},
  year={2002}
}

"gov/trec-web-2003"

The TREC Web Track 2003 ad-hoc ranking benchmark.

queries
50 queries

Language: en

Query type:
GovWeb02Query: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 queries 
[query_id]    [title]    [description]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2003.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2003')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
51K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition      Count    %
0       Not Relevant    51K      99.0%
1       Relevant        516      1.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003 qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2003.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2003TrecWeb}

Bibtex:

@inproceedings{Craswell2003TrecWeb,
  title={Overview of the TREC 2003 Web Track},
  author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu},
  booktitle={TREC},
  year={2003}
}

"gov/trec-web-2003/named-page"

The TREC Web Track 2003 named page ranking benchmark.

queries
300 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page queries 
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2003.named-page.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2003.named-page')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
352 qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                  Count    %
1       Name refers to this page    352      100.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2003/named-page")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2003/named-page qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2003.named-page.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2003TrecWeb}

Bibtex:

@inproceedings{Craswell2003TrecWeb,
  title={Overview of the TREC 2003 Web Track},
  author={Nick Craswell and David Hawking and Ross Wilkinson and Mingfang Wu},
  booktitle={TREC},
  year={2003}
}

"gov/trec-web-2004"

The TREC Web Track 2004 ad-hoc ranking benchmark.

Queries include a combination of topic distillation, homepage finding, and named page finding.

queries
225 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 queries 
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.gov.trec-web-2004.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
1.2M docs

Inherits docs from gov

Language: en

Document type:
GovDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. http_headers: str
  4. body: bytes
  5. body_content_type: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, url, http_headers, body, body_content_type>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 docs 
[doc_id]    [url]    [http_headers]    [body]    [body_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.gov.trec-web-2004')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
89K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition      Count    %
0       Not Relevant    87K      98.0%
1       Relevant        1.8K     2.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("gov/trec-web-2004")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export gov/trec-web-2004 qrels --format tsv 
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.gov.trec-web-2004.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2004TrecWeb}

Bibtex:

@inproceedings{Craswell2004TrecWeb,
  title={Overview of the TREC-2004 Web Track},
  author={Nick Craswell and David Hawking},
  booktitle={TREC},
  year={2004}
}