
Commit 189805f

feat: Added DiversificationResult, added docs (Pringled#5)
* Added docs
* Made returning gains optional
* Added return datamodel
* Updated docs
1 parent 5046ca4 commit 189805f

File tree

11 files changed: 214 additions & 72 deletions

README.md

Lines changed: 62 additions & 0 deletions
# Pyversity — Diversified Re‑Ranking for Retrieval

Pyversity is a small, fast library for diversifying retrieval results.
Retrieval systems often return highly similar items. Pyversity efficiently re-ranks these results to encourage diversity, surfacing items that remain relevant but less redundant.

It implements several popular strategies, such as MMR, MSD, DPP, and Cover, behind a clear, unified API. More information about the supported strategies can be found in the [supported strategies section](#supported-strategies).

## Quickstart

Install `pyversity` with:

```bash
pip install pyversity
```

Diversify retrieval results:

```python
import numpy as np
from pyversity import diversify, Strategy

# Define embeddings and scores
embeddings = np.random.randn(100, 256).astype(np.float32)
scores = np.random.rand(100).astype(np.float32)

# Diversify with a chosen strategy (in this case MMR)
diversified_result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=10,
    strategy=Strategy.MMR,
)

# Get the indices of the diversified result
diversified_indices = diversified_result.indices
```

## Supported Strategies

The following table describes the supported strategies, how they work, their time complexity, and when to use them.

| Strategy | What It Does | Time Complexity | When to Use |
| --- | --- | --- | --- |
| **MMR** (Maximum Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what’s already picked. | **O(k · n · d)** | Best **default**. Fast, simple, and works well when you just want to avoid near-duplicates. |
| **MSD** (Max Sum of Distances) | Prefers items that are both relevant and far from *all* previous selections. | **O(k · n · d)** | Use when you want a stronger spread, i.e. results that cover a wider range of topics or styles. |
| **DPP** (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic “repulsion.” | **O(k · n · d + n · k²)** | Ideal when you want to eliminate redundancy or ensure diversity is built into the selection itself. |
| **COVER** (Facility-Location) | Ensures the selected items collectively represent the full dataset’s structure. | **O(k · n²)** | Great for topic coverage or clustering scenarios, but slower for large `n`. |
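
Strategy-specific parameters are forwarded to the chosen strategy through `**kwargs`. For example, MMR's `lambda_param` (1.0 = pure relevance, 0.0 = pure diversity) can be tuned directly; a short sketch with illustrative values:

```python
import numpy as np
from pyversity import diversify, Strategy

embeddings = np.random.randn(100, 256).astype(np.float32)
scores = np.random.rand(100).astype(np.float32)

# lambda_param is forwarded to the MMR strategy via **kwargs
result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=10,
    strategy=Strategy.MMR,
    lambda_param=0.3,  # lean toward diversity
)

print(result.indices)         # selected item indices
print(result.marginal_gains)  # marginal gain of each selection
```
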
## References

The implementations in this package are based on the following research papers:

- **MMR**: Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. [Link](https://dl.acm.org/doi/pdf/10.1145/290941.291025)
- **MSD**: Borodin, A., Lee, H. C., & Ye, Y. (2012). Max-sum diversification, monotone submodular functions and dynamic updates. [Link](https://arxiv.org/pdf/1203.6397)
- **COVER**: Puthiya Parambath, S. A., Usunier, N., & Grandvalet, Y. (2016). A coverage-based approach to recommendation diversity on similarity graph. [Link](https://dl.acm.org/doi/10.1145/2959100.2959149)
- **DPP**: Kulesza, A., & Taskar, B. (2012). Determinantal Point Processes for Machine Learning. [Link](https://arxiv.org/pdf/1207.6083)
- **DPP (efficient greedy implementation)**: Chen, L., Zhang, G., & Zhou, H. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity. [Link](https://arxiv.org/pdf/1709.05135)

src/pyversity/__init__.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -1,4 +1,6 @@
 from pyversity.core import diversify
+from pyversity.datatypes import DiversificationResult, Metric, Strategy
 from pyversity.strategies import cover, dpp, mmr, msd
+from pyversity.version import __version__

-__all__ = ["diversify", "mmr", "msd", "cover", "dpp", "__version__"]
+__all__ = ["diversify", "Strategy", "Metric", "DiversificationResult", "mmr", "msd", "cover", "dpp", "__version__"]
```

src/pyversity/core.py

Lines changed: 10 additions & 9 deletions
```diff
@@ -2,27 +2,28 @@

 import numpy as np

-from pyversity.datatypes import Strategy
+from pyversity.datatypes import DiversificationResult, Strategy
 from pyversity.strategies import cover, dpp, mmr, msd


 def diversify(
-    strategy: Strategy,
     embeddings: np.ndarray,
     scores: np.ndarray,
     k: int,
+    strategy: Strategy = Strategy.MMR,
     **kwargs: Any,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Diversify a retrieval result using a selected strategy.

-    :param strategy: The diversification strategy to apply. Supported strategies are: MMR, MSD, COVER, and DPP.
-    :param embeddings: Array of embeddings for the items.
-    :param scores: Array of relevance scores for the items.
-    :param k: The number of items to select in the diversified result.
+    :param embeddings: Embeddings of the items to be diversified.
+    :param scores: Scores (relevances) of the items to be diversified.
+    :param k: The number of items to select for the diversified result.
+    :param strategy: The diversification strategy to apply.
+        Supported strategies are: 'mmr' (default), 'msd', 'cover', and 'dpp'.
     :param **kwargs: Additional keyword arguments passed to the specific strategy function.
-    :return: A tuple containing an array of indices of the selected items
-        and an array of corresponding relevance scores for the selected items.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     :raises ValueError: If the provided strategy is not recognized.
     """
     if strategy == Strategy.MMR:
```
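
Note that `strategy` moved from the first parameter to after `k` and now defaults to `Strategy.MMR`, so positional callers of the old API need updating. A minimal sketch against the new signature:

```python
import numpy as np
from pyversity import diversify

embeddings = np.random.randn(50, 64).astype(np.float32)
scores = np.random.rand(50).astype(np.float32)

# strategy can now be omitted; MMR is the default
result = diversify(embeddings, scores, k=5)
print(result.strategy)  # Strategy.MMR
```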

src/pyversity/datatypes.py

Lines changed: 23 additions & 0 deletions
```diff
@@ -1,5 +1,8 @@
+from dataclasses import dataclass
 from enum import Enum

+import numpy as np
+

 class Strategy(str, Enum):
     """Supported diversification strategies."""
@@ -15,3 +18,23 @@ class Metric(str, Enum):

     COSINE = "cosine"
     DOT = "dot"
+
+
+@dataclass
+class DiversificationResult:
+    """
+    Result of a diversification operation.
+
+    Attributes
+    ----------
+    indices: Diversified item indices.
+    marginal_gains: Marginal gains/relevance scores for the diversified items.
+    strategy: Diversification strategy used.
+    parameters: Additional parameters used in the strategy.
+
+    """
+
+    indices: np.ndarray
+    marginal_gains: np.ndarray
+    strategy: Strategy
+    parameters: dict
```
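
Because `DiversificationResult` is a plain dataclass, its fields can be read directly. A small sketch (values illustrative):

```python
import numpy as np
from pyversity import diversify

result = diversify(
    embeddings=np.random.randn(20, 8).astype(np.float32),
    scores=np.random.rand(20).astype(np.float32),
    k=3,
)

print(result.indices)         # np.ndarray of selected item indices
print(result.marginal_gains)  # np.ndarray of per-selection marginal gains
print(result.strategy)        # strategy used, e.g. Strategy.MMR
print(result.parameters)      # strategy parameters, e.g. {"lambda_param": 0.5, "metric": Metric.COSINE}
```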

src/pyversity/strategies/cover.py

Lines changed: 28 additions & 6 deletions
```diff
@@ -1,6 +1,6 @@
 import numpy as np

-from pyversity.datatypes import Metric
+from pyversity.datatypes import DiversificationResult, Metric, Strategy
 from pyversity.utils import normalize_rows, pairwise_similarity, prepare_inputs


@@ -12,7 +12,7 @@ def cover(
     gamma: float = 0.5,
     metric: Metric = Metric.COSINE,
     normalize: bool = True,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Select a subset of items that balances relevance and coverage.

@@ -27,7 +27,8 @@ def cover(
     :param gamma: Concavity parameter in (0, 1]; lower values emphasize diversity.
     :param metric: Similarity metric to use. Default is Metric.COSINE.
     :param normalize: Whether to normalize embeddings before computing similarity.
-    :return: Tuple of selected indices and their marginal gains.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     :raises ValueError: If theta is not in [0, 1].
     :raises ValueError: If gamma is not in (0, 1].
     """
@@ -37,11 +38,22 @@ def cover(
     if not (0.0 < float(gamma) <= 1.0):
         raise ValueError("gamma must be in (0, 1]")

+    params = {
+        "theta": theta,
+        "gamma": gamma,
+        "metric": metric,
+    }
+
     # Prepare inputs
     feature_matrix, relevance_scores, top_k, early_exit = prepare_inputs(embeddings, scores, k)
     if early_exit:
         # Nothing to select: return empty arrays
-        return np.empty(0, np.int32), np.empty(0, np.float32)
+        return DiversificationResult(
+            indices=np.empty(0, np.int32),
+            marginal_gains=np.empty(0, np.float32),
+            strategy=Strategy.COVER,
+            parameters=params,
+        )

     if metric == Metric.COSINE and normalize:
         # Normalize feature vectors to unit length for cosine similarity
@@ -51,7 +63,12 @@ def cover(
     # Pure relevance: select top-k by relevance scores
     topk = np.argsort(-relevance_scores)[:top_k].astype(np.int32)
     gains = relevance_scores[topk].astype(np.float32, copy=False)
-    return topk, gains
+    return DiversificationResult(
+        indices=topk,
+        marginal_gains=gains,
+        strategy=Strategy.COVER,
+        parameters=params,
+    )

     # Compute non-negative similarities for coverage to avoid concave-power NaNs
     similarity_matrix = pairwise_similarity(feature_matrix, metric)
@@ -82,4 +99,9 @@ def cover(
     # Update accumulated coverage
     accumulated_coverage += similarity_matrix[:, best_index]

-    return selected_indices, marginal_gains
+    return DiversificationResult(
+        indices=selected_indices,
+        marginal_gains=marginal_gains,
+        strategy=Strategy.COVER,
+        parameters=params,
+    )
```
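
The strategy functions can also be called directly. A sketch for `cover`, assuming the positional order `embeddings, scores, k` used by the other strategies, with the `theta` and `gamma` parameters validated above:

```python
import numpy as np
from pyversity import cover

embeddings = np.random.randn(200, 32).astype(np.float32)
scores = np.random.rand(200).astype(np.float32)

# theta in [0, 1] trades relevance against coverage;
# gamma in (0, 1] controls concavity (lower emphasizes diversity)
result = cover(embeddings, scores, k=10, theta=0.7, gamma=0.5)
print(result.parameters)  # {"theta": 0.7, "gamma": 0.5, "metric": Metric.COSINE}
```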

src/pyversity/strategies/dpp.py

Lines changed: 16 additions & 5 deletions
```diff
@@ -1,5 +1,6 @@
 import numpy as np

+from pyversity.datatypes import DiversificationResult, Strategy
 from pyversity.utils import EPS32, normalize_rows, prepare_inputs


@@ -16,7 +17,7 @@ def dpp(
     scores: np.ndarray,
     k: int,
     beta: float = 1.0,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Greedy determinantal point process (DPP) selection.

@@ -29,14 +30,19 @@ def dpp(
     :param k: Number of items to select.
     :param beta: Controls the influence of relevance scores in the DPP kernel.
         Higher values increase the emphasis on relevance.
-    :return: Tuple of selected indices and their marginal gains.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     """
     # Prepare inputs
     feature_matrix, relevance_scores, top_k, early_exit = prepare_inputs(embeddings, scores, k)
     if early_exit:
         # Nothing to select: return empty arrays
-        return np.empty(0, np.int32), np.empty(0, np.float32)
-
+        return DiversificationResult(
+            indices=np.empty(0, np.int32),
+            marginal_gains=np.empty(0, np.float32),
+            strategy=Strategy.DPP,
+            parameters={"beta": beta},
+        )
     # Normalize feature vectors to unit length for cosine similarity
     feature_matrix = normalize_rows(feature_matrix)

@@ -87,4 +93,9 @@ def dpp(
     residual_variance -= update_component * update_component
     np.maximum(residual_variance, 0.0, out=residual_variance)

-    return selected_indices[:step], marginal_gains[:step]
+    return DiversificationResult(
+        indices=selected_indices[:step],
+        marginal_gains=marginal_gains[:step],
+        strategy=Strategy.DPP,
+        parameters={"beta": beta},
+    )
```
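
A sketch of calling `dpp` directly, varying `beta` to shift the kernel between relevance and diversity (values illustrative):

```python
import numpy as np
from pyversity import dpp

embeddings = np.random.randn(100, 16).astype(np.float32)
scores = np.random.rand(100).astype(np.float32)

# Higher beta weights relevance more heavily in the DPP kernel
relevance_heavy = dpp(embeddings, scores, k=10, beta=2.0)
diversity_heavy = dpp(embeddings, scores, k=10, beta=0.5)

print(relevance_heavy.parameters)  # {"beta": 2.0}
```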

src/pyversity/strategies/mmr.py

Lines changed: 4 additions & 3 deletions
```diff
@@ -1,6 +1,6 @@
 import numpy as np

-from pyversity.datatypes import Metric
+from pyversity.datatypes import DiversificationResult, Metric
 from pyversity.strategies.utils import greedy_select


@@ -11,7 +11,7 @@ def mmr(
     lambda_param: float = 0.5,
     metric: Metric = Metric.COSINE,
     normalize: bool = True,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Maximal Marginal Relevance (MMR) selection.

@@ -26,7 +26,8 @@ def mmr(
         1.0 = pure relevance, 0.0 = pure diversity.
     :param metric: Similarity metric to use. Default is Metric.COSINE.
     :param normalize: Whether to normalize embeddings before computing similarity.
-    :return: Tuple of selected indices and their marginal gains.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     """
     return greedy_select(
         "mmr",
```

src/pyversity/strategies/msd.py

Lines changed: 4 additions & 3 deletions
```diff
@@ -1,6 +1,6 @@
 import numpy as np

-from pyversity.datatypes import Metric
+from pyversity.datatypes import DiversificationResult, Metric
 from pyversity.strategies.utils import greedy_select


@@ -11,7 +11,7 @@ def msd(
     lambda_param: float = 0.5,
     metric: Metric = Metric.COSINE,
     normalize: bool = True,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Maximal Sum of Distances (MSD) selection.

@@ -27,7 +27,8 @@ def msd(

     :param metric: Similarity metric to use. Default is Metric.COSINE.
     :param normalize: Whether to normalize embeddings before computing similarity.
-    :return: Tuple of selected indices and their marginal gains.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     """
     return greedy_select(
         "msd",
```

src/pyversity/strategies/utils.py

Lines changed: 21 additions & 5 deletions
```diff
@@ -2,7 +2,7 @@

 import numpy as np

-from pyversity.datatypes import Metric
+from pyversity.datatypes import DiversificationResult, Metric, Strategy
 from pyversity.utils import normalize_rows, prepare_inputs, vector_similarity


@@ -15,7 +15,7 @@ def greedy_select(
     metric: Metric,
     normalize: bool,
     lambda_param: float,
-) -> tuple[np.ndarray, np.ndarray]:
+) -> DiversificationResult:
     """
     Greedy selection for MMR/MSD strategies.

@@ -32,19 +32,30 @@ def greedy_select(
     :param normalize: Whether to normalize embeddings before computing similarity.
     :param lambda_param: Trade-off parameter in [0, 1].
         1.0 = pure relevance, 0.0 = pure diversity.
-    :return: Tuple of selected indices and their marginal gains.
+    :return: A DiversificationResult containing the selected item indices,
+        their marginal gains, the strategy used, and the parameters.
     :raises ValueError: If lambda_param is not in [0, 1].
     :raises ValueError: If input shapes are inconsistent.
     """
     # Validate parameters
     if not (0.0 <= float(lambda_param) <= 1.0):
         raise ValueError("lambda_param must be in [0, 1]")

+    params = {
+        "lambda_param": lambda_param,
+        "metric": metric,
+    }
+
     # Prepare inputs
     feature_matrix, relevance_scores, top_k, early_exit = prepare_inputs(embeddings, scores, k)
     if early_exit:
         # Nothing to select: return empty arrays
-        return np.empty(0, np.int32), np.empty(0, np.float32)
+        return DiversificationResult(
+            indices=np.empty(0, np.int32),
+            marginal_gains=np.empty(0, np.float32),
+            strategy=Strategy.MMR if strategy == "mmr" else Strategy.MSD,
+            parameters=params,
+        )

     if metric == Metric.COSINE and normalize:
         # Normalize feature vectors to unit length for cosine similarity
@@ -93,4 +104,9 @@ def greedy_select(
     marginal_gains[step] = float(candidate_scores[best_index])
     selected_mask[best_index] = True

-    return selected_indices, marginal_gains
+    return DiversificationResult(
+        indices=selected_indices,
+        marginal_gains=marginal_gains,
+        strategy=Strategy.MMR if strategy == "mmr" else Strategy.MSD,
+        parameters=params,
+    )
```
