This is a BM25S index created with the bm25s library (version 0.0.1dev0), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.
You can install the bm25s library with pip:
```bash
pip install "bm25s==0.1.3"

# Include extra dependencies like stemmer
pip install "bm25s[full]==0.1.3"

# For huggingface hub usage
pip install huggingface_hub
```

You can use this index for information retrieval tasks. Here is an example:
```python
import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("xhluca/bm25s-scidocs-index", revision="main")

# Tokenize the query, then retrieve the top 3 results
query = "a cat is a feline"
query_tokens = bm25s.tokenize(query)
results = retriever.retrieve(query_tokens, k=3)
```

You can save a bm25s index to the Hugging Face Hub. Here is an example:
```python
import bm25s
from bm25s.hf import BM25HF

# Create a BM25 index and add documents
retriever = BM25HF()
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
corpus_tokens = bm25s.tokenize(corpus)
retriever.index(corpus_tokens)

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("xhluca/bm25s-scidocs-index", token=token)
```

This dataset was created using the following data:
| Statistic | Value |
|---|---|
| Number of documents | 25657 |
| Number of tokens | 2076690 |
| Average tokens per document | 80.94048407841915 |
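
The average in the last row follows from the two rows above it. You can check it in plain Python; the variable names below are chosen for illustration and are not part of the bm25s library:

```python
# Values copied from the statistics table above.
num_documents = 25657
num_tokens = 2076690

# Average tokens per document = total tokens / number of documents.
avg_tokens_per_doc = num_tokens / num_documents

print(f"{avg_tokens_per_doc:.2f}")
```

The result rounds to about 80.94 tokens per document, matching the table.
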
The index was created with the following parameters:
| Parameter | Value |
|---|---|
| k1 | 1.5 |
| b | 0.75 |
| delta | 0.5 |
| method | lucene |
| idf method | lucene |
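
To give a rough sense of how `k1` and `b` enter the score, here is a minimal sketch of the commonly cited Lucene variant of BM25 in plain Python. It is not the bm25s implementation, and the function names `lucene_idf` and `lucene_bm25_term_score` are hypothetical. `delta` is left out on the assumption that it only affects the BM25L and BM25+ variants, not the `lucene` method.

```python
import math

K1 = 1.5   # term-frequency saturation, from the parameter table above
B = 0.75   # strength of document-length normalization, from the table above


def lucene_idf(num_docs: int, doc_freq: int) -> float:
    """Lucene-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=K1, b=B):
    """Score contribution of a single query term for a single document."""
    # Documents longer than average are penalized, shorter ones boosted.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: repeated occurrences add less and less.
    tf_component = tf / (tf + k1 * length_norm)
    return lucene_idf(num_docs, doc_freq) * tf_component


# Illustrative numbers only (assumed, not taken from the index):
# the term occurs twice in an average-length document and in 100 of 25657 documents.
score = lucene_bm25_term_score(
    tf=2, doc_len=81, avg_doc_len=81.0, num_docs=25657, doc_freq=100
)
print(round(score, 4))
```

A document's full score for a query is the sum of these per-term contributions over the query terms it contains.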