A Roaringish phrase search algorithm

Doug Turnbull
SearchArray adds boring, old BM25 full text search to the PyData stack.
A full-text search index comes with one underappreciated problem: phrase search. In this article, I want to recap a roaringish phrase search algorithm, encoding positions in a roaring-like numpy array for fast bit intersections.
But before such magic, I’m ashamed to say, SearchArray started with very primitive approaches. SearchArray took almost 5 minutes (!!) for a phrase search to complete on 100k MSMarco docs. Luckily, I’ve since gotten those numbers down quite a bit.
SearchArray refresher
Let’s remember how SearchArray works. It’s built as a Pandas extension array that represents an inverted index, demoed below.
In [1]: import pandas as pd
   ...: from searcharray import SearchArray

In [2]: df = pd.DataFrame({"text": ["mary had a little lamb the lamb ate mary",
   ...:                             "uhoh little mary dont eat the lamb it will get revenge",
   ...:                             "the cute little lamb ran past the little lazy sheep",
   ...:                             "little mary ate mutton then ran to the barn yard"]})

In [3]: df['text_indexed'] = SearchArray.index(df['text'])
In [4]: df
Out[4]:
text text_indexed
0 mary had a little lamb the lamb ate mary Terms({'little', 'mary', 'a', 'lamb', 'had', '...
1 uhoh little mary dont eat the lamb it will get... Terms({'eat', 'it', 'will', 'little', 'revenge...
2 the cute little lamb ran past the little lazy ... Terms({'little', 'sheep', 'lazy', 'cute', 'lam...
3 little mary ate mutton then ran to the barn yard Terms({'barn', 'little', 'mary', 'mutton', 'ya...
Now we can get a BM25 score for each row of text_indexed, looking for the phrase “little lamb”. Under the hood of text_indexed, we’ve got an inverted index. If a user searches text_indexed for lamb, we can recover its term frequencies (doc 0 has 2 occurrences, docs 1 and 2 have 1 each) and its document frequency (lamb occurs in 3 docs) — everything we need for a BM25 score.
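A hedged sketch of what that looks like (the method names score, termfreqs, and docfreq are my recollection of SearchArray’s extension array API; treat them as assumptions and check the project’s docs):

arr = df['text_indexed'].array
arr.score(["little", "lamb"])  # assumed API: BM25 score per row for the phrase
arr.termfreqs("lamb")          # assumed API: per-doc term frequencies
arr.docfreq("lamb")            # assumed API: count of docs containing "lamb"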
But what about phrase search? How do we compute not just term frequencies, but phrase frequencies?
How would we count all the places where “little” comes one position before “lamb”? Looking at doc 2 above: if little occurs at positions [2, 7] (the 2nd and 7th locations) and lamb at position [3], how do we detect that lamb has one case of occurring exactly one position after little?
Bigram by bigram search
To implement phrase search, we connect the dots, one bigram at a time.
If a user searches for “mary had a little lamb”, we first connect mary -> had. We then repeat the process, except now we find where mary+had bigram positions occur immediately before a’s. If the mary+had bigram ends at positions [1, 6] in a document, and that document has an a at position [2], then the mary+had+a trigram occurs, ending at position [2].
We continue to the quadgram mary+had+a+little. Maybe little occurs at position [3] in this doc. Now we have narrowed to the quadgram mary+had+a+little ending at [3].
Now we connect mary+had+a+little positions to lamb, getting to the end of our phrase.
The number of times mary+had+a+little occurs before lamb is our phrase frequency.
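To make the connecting step concrete, here’s a minimal sketch (illustrative only, with made-up arrays; not SearchArray’s actual code) of extending an n-gram ending at some positions with the next term:

import numpy as np

prev_ends = np.array([1, 6])   # positions where the "mary had" bigram ends
next_posns = np.array([2])     # positions of "a"

# The (n+1)-gram ends wherever prev_ends + 1 lands on a next-term position
new_ends = np.intersect1d(prev_ends + 1, next_posns)  # -> array([2])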
It’s this connecting detail that concerns us. How do we find where one set of positions occurs immediately before another set of positions (i.e. mary before had, etc.)? How do we do this fast enough to make SearchArray usable?
From Boring to Roaring
Let’s dig into how to intersect two sets of positions. Early, naive versions of SearchArray stored each doc/term’s position integers directly as raw integer arrays. I had many different ways of finding bigram adjacencies, from subtracting one term’s positions from another to intersecting the positions.
Storing the integers directly takes up way too much memory. For a meaty search dataset, storing giant positional arrays per term is simply a non-starter for meaningful work.
Time to rethink things.
I knew a bit about roaring bitmaps. Roaring bitmaps store sets of integers as bitmaps. I won’t take you through a tutorial of roaring bitmaps themselves. What I did was roaring-adjacent, or roaringish. Let’s walk through that.
The main intuition I took from roaring bitmaps is the following (follow along in this notebook).
Let’s say we want to pack positions into a 32 bit value, looking at just one term. Dividing each position by 16 gives its group, and taking it modulo 16 gives its value within that group. We can bitwise OR each group’s values together into a 16 bit bitmap per group, then, moving to a 32 bit value, store the group number in the upper 16 bits. We can do this in numpy pretty easily: the key is a bitwise_or over a specific set of indices, change_indices, packing at the boundaries where the MSBs (the groups) differ. And now… 7 original positions are packed into 4 32 bit values. Shrinkage!

Finding bigrams through bitwise intersection

Now, given two sets of term posns, we might find their bigram frequency with simple bit arithmetic. Walking through an example: say we have the term little with positions encoded as above, and likewise lamb, and we note they share a “group 21” set of posns, corresponding to posns >= 336 but < 352, where adjacent positions might be found. We zero in on just the LSBs to see if any littles occur immediately before lambs in group 21. That’s simple bit twiddling: shift little’s bitmap left by one and AND it with lamb’s; each set bit in the result is one phrase match, so counting the bits of each value in the output gives us the phrase frequency. All in numpy, we can just do this:
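(A minimal sketch, with hypothetical positions standing in for the article’s notebook arrays; the packing scheme itself is the one described above.)

import numpy as np

posns = np.array([1, 5, 20, 21, 31, 100, 340], dtype=np.uint64)  # hypothetical
groups = posns >> 4   # position // 16: which group of 16 posns
values = posns & 0xF  # position % 16: offset within the group

# Group number in the upper 16 bits, one set bit per position in the lower 16
encoded = (groups << 16) | (1 << values)

# Bitwise OR together runs of entries sharing a group (identical MSBs)
change_indices = np.argwhere(np.diff(groups) != 0).flatten() + 1
change_indices = np.insert(change_indices, 0, 0)
packed = np.bitwise_or.reduceat(encoded, change_indices)  # 7 posns -> 4 values

# Bigram twiddling for two packed terms sharing "group 21" (posns 336-351)
little = np.array([(21 << 16) | 0b0000000000100010], dtype=np.uint64)  # 337, 341
lamb = np.array([(21 << 16) | 0b0000000000000100], dtype=np.uint64)    # 338
_, little_idx, lamb_idx = np.intersect1d(little >> 16, lamb >> 16,
                                         return_indices=True)
little_lsb = little[little_idx] & 0xFFFF
lamb_lsb = lamb[lamb_idx] & 0xFFFF
matches = (little_lsb << 1) & lamb_lsb
phrase_freq = bin(int(matches[0])).count("1")  # -> 1: little@337 -> lamb@338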
Wait there’s more! Also packing the doc id

Here’s where roaringish gets really fast. So far we’ve looked at phrase search one document at a time. What if we could phrase match the entire corpus for two terms in a handful of numpy operations?
Indeed, we can do just that by also packing the doc id… and concatenating every doc’s packed posns together.
# Using 64 bits, put doc id in upper 32
doc1_encoded = doc_1_id << 32 | packed_doc_1
doc2_encoded = doc_2_id << 32 | packed_doc_2

# For every doc with 'lamb', store all positions
term_posns["lamb"] = np.concatenate([doc1_encoded, doc2_encoded])
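To make the layout concrete, here’s a hypothetical entry decoded (plain Python, my illustration):

#   [ doc id : 32 bits ][ group : 16 bits ][ position bitmap : 16 bits ]
entry = (42 << 32) | (21 << 16) | 0b0000000000100010
doc_id = entry >> 32            # -> 42
group = (entry >> 16) & 0xFFFF  # -> 21, covering posns 336-351
bitmap = entry & 0xFFFF         # -> bits 1 and 5 set: posns 337 and 341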
Now the 48 MSBs (doc id plus bit group) let us intersect docs and bit groups in one shot. At search time we do the exact same as before:
lamb = term_posns["lamb"]      # Every doc's posns for "lamb"
little = term_posns["little"]  # Every doc's posns for "little"
# Find where common MSBs _and_ documents occur between terms (doc id AND bit group)
_, lamb_indices, little_indices = np.intersect1d(lamb >> 16, little >> 16,
                                                 return_indices=True)
# Keep just the 16 LSBs (the position bitmaps) of the intersected entries
lamb_intersect_lsb = lamb[lamb_indices] & 0xFFFF
little_intersect_lsb = little[little_indices] & 0xFFFF
# Get every doc's phrase matches for 'little lamb'
matches = (little_intersect_lsb << 1) & lamb_intersect_lsb
This now finds phrase matches over all documents for the two terms.
To count term matches by doc id, we just add a little step to count bits, then sum the phrase freq at the doc id boundary.
# Grab the intersected values so we can get doc ids (of either term, they're identical)
intersections = lamb[lamb_indices]

# Demarcate the indices where doc ids differ
doc_id_change = np.argwhere(np.diff(intersections >> 32) != 0).flatten() + 1
doc_id_change = np.insert(doc_id_change, 0, 0)

# Sum set bits per doc id
counted_bits = bit_count64((little_intersect_lsb << 1) & lamb_intersect_lsb)
phrase_freqs = np.add.reduceat(counted_bits, doc_id_change)
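bit_count64 isn’t defined in the article; one way to write such a popcount helper in plain numpy (a sketch; numpy 2.0 also ships np.bitwise_count):

def bit_count64(arr):
    # Count set bits per uint64 value: view as bytes, unpack to bits, sum per element
    return np.unpackbits(arr.view(np.uint8)).reshape(-1, 64).sum(axis=1)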
We’ve now got bigram phrase frequencies for “little->lamb”. If we had a trigram “little->lamb->tacos”, we’d repeat the same algorithm, just with little+lamb posns -> tacos.
Optimizations and annoying little issues
Our data, with doc ids in the MSBs, can be stored sorted (by doc id, then bit group), making intersections faster. Numpy’s own intersect1d doesn’t particularly know about sorted data, but luckily Frank Sauerburger has us covered with a handy sortednp library for intersections, merges, and other operations on sorted numpy arrays, letting us find intersections with exponential search.
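For example (a sketch going off sortednp’s documented interface as I recall it; treat the exact flags as assumptions):

import numpy as np
import sortednp as snp

lamb_keys = np.array([10, 42, 99], dtype=np.uint64)    # hypothetical sorted MSB keys
little_keys = np.array([42, 77, 99], dtype=np.uint64)

# Intersection plus indices back into each input, via galloping/exponential search
common, (lamb_idx, little_idx) = snp.intersect(
    lamb_keys, little_keys, indices=True, algorithm=snp.GALLOPING_SEARCH)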
Another issue, of course: the algorithm described so far will miss phrase matches that straddle two MSB groups, i.e. “little” before “lamb” but across two adjacent groups:
"little" group 21
0000000000010101 1000000000000000
"lamb" group 22
0000000000010110 0000000000000001
This isn’t that hard to handle: we just need a second bit twiddling pass, intersecting the first term’s group + 1 with the second term’s group, and looking for this exact situation.
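A sketch of that second pass (my reconstruction, not necessarily SearchArray’s exact code, and ignoring group overflow at 0xFFFF for brevity):

# Align little's group g with lamb's group g+1
_, little_idx, lamb_idx = np.intersect1d((little >> 16) + 1, lamb >> 16,
                                         return_indices=True)
little_lsb = little[little_idx] & 0xFFFF
lamb_lsb = lamb[lamb_idx] & 0xFFFF

# A straddling match: little's bit 15 (last posn of its group) meets lamb's bit 0
boundary_matches = (little_lsb >> 15) & (lamb_lsb & 1)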
Additionally, my bigram intersection above walks bigram by bigram: mary -> mary+had, etc. We might be smarter and intersect the two rarest terms first, dramatically limiting the search space for the other bigrams. Maybe, for example, we note lamb and little are rarest; we can intersect those two terms first and limit the other intersections to a reasonable search space.
Finally, this algorithm focuses on direct phrase matching. What do we do when a user asks for slop, indicating the terms can be within N positions of a direct phrase match (“little bobby lamb” still matches “little lamb” with a slop of 1)? This algorithm doesn’t (yet) handle that case (and all its little edge cases :) ). That’s very much WIP.
Results
The speedup from this was impressive. On 100K very text heavy MSMarco docs, searches on my local machine went from several seconds on a stopwordish search (what is the) to 10s of ms. The primary bottleneck is now the intersection (snp.intersect).
The next step, I can see now, would be to optimize this layer. Sorted intersections on id sets can also be done with skip lists, which I believe is how most search engines implement this. But for now, this performance is good enough for my needs!
Additionally, memory usage of 100K MSMarco docs shrunk by an order of magnitude, making it closer to reasonable to do meaty, medium-data search relevance experimentation with SearchArray.
Roaringish taking over SearchArray
A nice side effect of all this: in addition to optimizing positions, I can use the packed positions to implement the rest of the inverted index, as follows (see the sketch after this list):
Term freq - count all of a term's set positions for a document. Get term_posns["lamb"], get just the doc I want (via intersecting), and count set bits.
Doc freq - how many unique doc ids exist in a roaringish data structure for a term? Get term_posns["lamb"] and count the unique doc ids (32 MSBs).
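As a sketch (hypothetical helpers, reusing bit_count64 from above; SearchArray’s real implementation differs in its details), both reduce to a few numpy ops over a term’s packed array:

def term_freq(posns, doc_id):
    # Rows for one doc (doc ids live in the upper 32 bits), then popcount
    doc_rows = posns[(posns >> 32) == doc_id]
    return int(bit_count64(doc_rows & 0xFFFF).sum())

def doc_freq(posns):
    # Distinct doc ids in the upper 32 bits
    return len(np.unique(posns >> 32))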
This shrinks memory usage even further, and still maintains the needed speed.
That’s it for now on phrase search, but if you’re curious to dig deeper into roaringish and see all the itty-bitty optimizations, it’s been pulled out into a utility within SearchArray. Check it out and give me some feedback.
via Doug Turnbull's Blog: https://ift.tt/fTYkrdV
December 16, 2024 at 10:31AM