Is this a new bug?
Current Behavior
I noticed that the _encode_single_document method may be missing the standard (k1 + 1) multiplier in the numerator of the BM25 term frequency calculation. I wanted to raise this for clarification, as it differs from the standard BM25 formula (https://en.wikipedia.org/wiki/Okapi_BM25).
Pinecone Text Source: https://github.com/pinecone-io/pinecone-text/blob/main/pinecone_text/sparse/bm25_encoder.py#L120
Current Implementation
def _encode_single_document(self, text: str) -> SparseVector:
    indices, doc_tf = self._tf(text)
    tf = np.array(doc_tf)
    tf_sum = sum(tf)
    tf_normed = tf / (
        self.k1 * (1.0 - self.b + self.b * (tf_sum / self.avgdl)) + tf
    )
    return {
        "indices": indices,
        "values": tf_normed.tolist(),
    }
Expected Implementation
def _encode_single_document(self, text: str) -> SparseVector:
    indices, doc_tf = self._tf(text)
    tf = np.array(doc_tf)
    tf_sum = sum(tf)
    tf_normed = (tf * (self.k1 + 1.0)) / (
        self.k1 * (1.0 - self.b + self.b * (tf_sum / self.avgdl)) + tf
    )
    return {
        "indices": indices,
        "values": tf_normed.tolist(),
    }
Expected Behavior
Unless k1 = 0 (in which case the factor is 1 and changes nothing), I would expect the numerator of tf_normed to include the (k1 + 1) multiplier, as in the standard BM25 formula.
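To make the difference concrete, here is a minimal standalone sketch comparing the two variants. It is not the library's code: the function names and the values of tf, k1, b, and avgdl are made up for illustration. It only shows that, for the same document, the two results differ by the constant factor (k1 + 1).

```python
from typing import List


def tf_norm_current(tf: List[float], k1: float, b: float, avgdl: float) -> List[float]:
    """Term-frequency normalization without the (k1 + 1) factor."""
    doc_len = sum(tf)
    length_term = k1 * (1.0 - b + b * (doc_len / avgdl))
    return [f / (length_term + f) for f in tf]


def tf_norm_standard(tf: List[float], k1: float, b: float, avgdl: float) -> List[float]:
    """Term-frequency normalization with the (k1 + 1) factor in the numerator."""
    doc_len = sum(tf)
    length_term = k1 * (1.0 - b + b * (doc_len / avgdl))
    return [f * (k1 + 1.0) / (length_term + f) for f in tf]


# Illustrative values only (hypothetical document and parameters).
tf = [1.0, 2.0, 5.0]
k1, b, avgdl = 1.2, 0.75, 10.0

current = tf_norm_current(tf, k1, b, avgdl)
standard = tf_norm_standard(tf, k1, b, avgdl)

# Element-wise ratio between the two variants.
ratios = [s / c for s, c in zip(standard, current)]
print(current)
print(standard)
print(ratios)
```

Every ratio equals k1 + 1, so the relative weights of terms within a document are identical in both variants; only the absolute scale of the values differs.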
Steps To Reproduce
This is a formula question so no steps to reproduce.
Relevant log output
Environment
- **OS**:
- **Language version**:
- **Pinecone client version**:
Additional Context
No response