Pyversity is a fast, lightweight library for diversifying retrieval results.
Retrieval systems often return highly similar items. Pyversity efficiently re-ranks these results to encourage diversity, surfacing items that remain relevant but are less redundant.

-It implements several popular strategies such as MMR, MSD, DPP, and Cover with a clear, unified API. More information about the supported strategies can be found in the [supported strategies section](#supported-strategies).
+It implements several popular diversification strategies such as MMR, MSD, DPP, and Cover with a clear, unified API. More information about the supported strategies can be found in the [supported strategies section](#supported-strategies). The only dependency is NumPy, making the package very lightweight.
The returned `DiversificationResult` can be used to access the diversified `indices`, as well as the `marginal gains` of the selected strategy and other useful info. The strategies are extremely fast and scalable: this example runs in 0.0001s.
## Supported Strategies

-The following table describes the supported strategies, how they work, their time complexity, and when to use them.
+The following table describes the supported strategies, how they work, their time complexity, and when to use them. The papers linked in the [references](#references) section provide more in-depth information on the strengths/weaknesses of the supported strategies.
| Strategy | What It Does | Time Complexity | When to Use |
| --- | --- | --- | --- |
-|**MMR** (Maximum Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what’s already picked. |**O(k · n · d)**|Best **default**. Fast, simple, and works well when you just want to avoid near-duplicates. |
+|**MMR** (Maximum Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what’s already picked. |**O(k · n · d)**|Good default. Fast, simple, and works well when you just want to avoid near-duplicates. |
|**MSD** (Max Sum of Distances) | Prefers items that are both relevant and far from *all* previous selections. |**O(k · n · d)**| Use when you want stronger spread, i.e. results that cover a wider range of topics or styles. |
|**DPP** (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic “repulsion.” |**O(k · n · d + n · k²)**| Ideal when you want to eliminate redundancy or ensure diversity is built into selection. |
|**COVER** (Facility-Location) | Ensures selected items collectively represent the full dataset’s structure. |**O(k · n²)**| Great for topic coverage or clustering scenarios, but slower for large `n`. |
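
To make the MMR row concrete, here is a minimal NumPy sketch of textbook greedy MMR. It is not Pyversity's implementation: the function name `mmr_select`, its parameters, and the choice of cosine similarity are assumptions made for illustration only.

```python
import numpy as np


def mmr_select(embeddings, scores, k, diversity=0.5):
    """Illustrative greedy MMR (hypothetical helper, not the Pyversity API).

    Returns the indices of up to k items, trading relevance against
    cosine similarity to the items already selected.
    """
    emb = np.asarray(embeddings, dtype=float)
    rel = np.asarray(scores, dtype=float)
    n = emb.shape[0]
    k = min(k, n)
    if k <= 0:
        return []

    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    selected = []
    available = np.ones(n, dtype=bool)
    max_sim = np.full(n, -np.inf)  # highest similarity to any selected item

    for _ in range(k):
        if selected:
            # Relevance minus a penalty for resembling earlier picks.
            value = (1.0 - diversity) * rel - diversity * max_sim
        else:
            # Nothing selected yet: the first pick is the most relevant item.
            value = rel.copy()
        value[~available] = -np.inf

        best = int(np.argmax(value))
        selected.append(best)
        available[best] = False

        # One matrix-vector product per step: O(n * d).
        max_sim = np.maximum(max_sim, unit @ unit[best])

    return selected
```

Each iteration costs one matrix-vector product over the `n` candidates, which is where the O(k · n · d) entry in the table comes from. With `diversity=0.0` the sketch reduces to plain top-k by score.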
+
+## Motivation
+
+Traditional retrieval systems rank results purely by relevance (how closely each item matches the query). While effective, this can lead to redundancy: top results often look nearly identical, which can create a poor user experience.
+
+Diversification techniques like MMR, MSD, COVER, and DPP help balance relevance and variety.
+Each new item is chosen not only because it’s relevant, but also because it adds new information that wasn’t already covered by earlier results.
+
+This improves exploration, user satisfaction, and coverage across many domains, for example:
+
+- E-commerce: Show different product styles, not multiple copies of the same black pants.
+- News search: Highlight articles from different outlets or viewpoints.
+- Academic retrieval: Surface papers from different subfields or methods.
+- RAG / LLM contexts: Avoid feeding the model near-duplicate passages.
+
## References
The implementations in this package are based on the following research papers:
@@ -61,3 +88,7 @@ The implementations in this package are based on the following research papers:
- **DPP (efficient greedy implementation)**: Chen, L., Zhang, G., & Zhou, H. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity.