Building an effective search for AI and other applications (part 2 of 3)
In the previous blog, we discussed how full-text search works, including preprocessing steps, the BM25 scoring function, and indexing strategies. In this article we’ll focus on the following advanced aspects of search:
- vector search,
- considerations for using different search methods,
- hybrid search - combining full-text search with vector search,
- reranking.
Vector search basics
Vector search has become very popular recently due to the rise of RAG-based AI systems. It is particularly well suited to conversational contexts, because it can find relevant documents for arbitrary text queries. So how does vector search work? Well, the short answer is:
- We create vectors from text - a vector for the query and vectors for source documents.
- We calculate the similarity scores of vectors, using pretty simple formulas, like cosine similarity, dot product of vectors, etc.
- After getting the similarity scores, we order the source documents by their scores and consider the highest-rated documents the most relevant to the given query.
After reading this summary, a few questions may arise, for example: “How do we create these vectors?”, “How do we choose a formula to calculate a score?”.
Creating vectors
The answer to the first question is that we use a vector embedding model to create a vector from any given text. An embedding model is a machine-learning model trained on a large amount of text data, and there are plenty of such models out there. Some come from large companies like OpenAI, Google and Cohere, and there are also many open-source models on the Hugging Face Hub, which are free to use and can be deployed anywhere. The embedding model produces a vector from a source text, which is basically an array of numbers, like:
[0.25,0.13,−0.44,0.98,−0.67]
Most of the time it is an array of floating-point numbers, where each position represents some learned feature of the text. The catch is that we don't know which features these numbers actually represent, because every ML model is essentially a black box, so we just have to trust the provider or the evaluations.
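For example, here is a minimal sketch of creating such a vector with the OpenAI Python SDK and the "text-embedding-3-small" model mentioned below (any other provider or open-source model follows the same pattern):

```python
# A minimal sketch of turning text into a vector, assuming the OpenAI Python SDK
# and the "text-embedding-3-small" model; other providers work analogously.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)

vector = response.data[0].embedding  # a list of floats, 1536 numbers for this model
print(len(vector), vector[:5])
```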
Choosing similarity function
This particular part isn't that important most of the time, and the simple answer is to use whatever the model provider suggests. For example, OpenAI recommends using cosine similarity with their "text-embedding-3-small". There are slight differences between the use cases for different similarity functions, but it mostly comes down to how the model was trained. The mentioned OpenAI model, for example, produces normalized (unit-length) vectors, which means that cosine similarity and dot product will produce the same result for any two vectors; other models behave differently, so there is no universal answer. Still, it is usually a good idea to start with cosine similarity, see if it fulfils your requirements in practice, and stop right there if it does.
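To make the point about normalised vectors concrete, here is a small sketch with NumPy and two made-up vectors, showing that for unit-length vectors cosine similarity and the dot product give the same number:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two made-up embedding vectors, normalised to unit length
# (models like "text-embedding-3-small" already return unit-length vectors).
a = np.array([0.25, 0.13, -0.44, 0.98, -0.67])
b = np.array([0.10, 0.40, -0.20, 0.70, -0.50])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

print(cosine_similarity(a, b))  # same value...
print(float(np.dot(a, b)))      # ...as the plain dot product
```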
Preparing documents
You might be wondering by now whether you can just pass any amount of text to the embedding model and let it do all the work. The answer is no: if you pass too much or too little text, the embedding model simply doesn't perform very well. So what counts as too much and too little? In practice, for something like "text-embedding-3-small", you should be good to go with the following values:
- Minimum: 3-5 words.
- Maximum: 500-700 words.
These aren't well-known published numbers, just values derived from practical usage. The essence of an embedding model is encoding the "meaning" of text into vector form, so given that each vector has a limited number of dimensions, it is understandable that there should be an upper boundary. The reason for the minimum of 3 to 5 words is simply that it's very hard to put much meaningful information into so little text.
Having said all the above, a usual flow of creating a vector database looks like the following:
- chunk documents - split documents into chunks of a certain length, so that the embedding model can effectively work with them,
- create embeddings from the chunks,
- store embeddings in a database.
There are plenty of strategies for chunking documents, like:
- simple text splitting - just split text every 600 words,
- stop sequences-based splitting - if we have the source text in a structured form, like HTML or Markdown, we can split texts on headers,
- semantic chunking - using an LLM to process each document and group relevant pieces of information into 1 or more chunks.
Each strategy has its advantages and drawbacks, which we’re not going to cover in this article. It is a good rule of thumb to just start with a simple chunking strategy, like the one based on stop sequences, and see if it performs well enough before going into more advanced ones.
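To illustrate, here is a minimal sketch of two of these strategies combined: splitting Markdown text on headers, with a fixed-size fallback for sections longer than the practical limit discussed above (the helper name and regex are just one way to do it):

```python
import re

MAX_WORDS = 600  # practical upper bound from the section above

def split_markdown_on_headers(text: str) -> list[str]:
    """Split a Markdown document on headers, then cap each chunk at MAX_WORDS."""
    # Split right before lines that start with one or more '#' characters.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Fall back to simple fixed-size splitting for overly long sections.
        for i in range(0, len(words), MAX_WORDS):
            chunks.append(" ".join(words[i:i + MAX_WORDS]))
    return chunks
```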
Performing a vector search
To perform a vector search, all modern solutions require creating a vector index. In its simplest form, a vector index is just an array of vectors (a 2-dimensional array), where the vectors are stored in their original form. This is usually called a flat index. It is perfectly fine to have this kind of index if you're dealing with a small amount of data, say under 10,000 vectors. If you've got more data, though, you might start looking into some advanced options:
- vector quantisation,
- indexes optimised for performance.
Vector quantisation basically means converting the original values in a vector into a smaller data type, trading a bit of precision for space. For example, the "text-embedding-3-small" model by default produces an array of 4-byte floating-point numbers; using quantisation, you can convert this array into 1-byte integers, or even into an array of bits, which has the following advantages (a small sketch follows the list below):
- reduces the amount of memory needed to store vectors,
- increases the speed of search - any computations are faster with numbers that take up less space in memory, though the impact of this isn’t that large.
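Here is a small sketch of the simplest form of this idea, scalar quantisation of float vectors to 1-byte integers with NumPy (real vector databases usually do this for you, with smarter calibration):

```python
import numpy as np

def quantise_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Naive scalar quantisation: map float32 values to int8, keeping the scale."""
    scale = float(np.abs(vectors).max()) / 127.0
    quantised = np.round(vectors / scale).astype(np.int8)
    return quantised, scale

def dequantise(quantised: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float values."""
    return quantised.astype(np.float32) * scale

vectors = np.random.rand(1000, 1536).astype(np.float32) * 2 - 1
q, scale = quantise_int8(vectors)
print(vectors.nbytes, "bytes ->", q.nbytes, "bytes")  # 4x smaller
```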
Let’s now talk about types of indexes:
- flat index - the one presented at the beginning and the best choice for a small amount of vectors,
- HNSW index - index optimised for lookup speed.
As we've already discussed the first type of index, let's briefly talk about HNSW (Hierarchical Navigable Small World). This kind of index is much faster than the flat index and allows fast (approximate) lookups on pretty much any number of vectors:
- this kind of index operates in layers, where each layer contains a progressively denser graph over the vectors; search starts at the sparse top layer and moves down, narrowing the candidate set at each step - the closest analogy is a binary search tree, where you keep narrowing down the amount of data you'll actually search at the end,
- at the bottom layer, the index performs the actual nearest-neighbour lookup on this narrowed-down subset of vectors.
The main advantages of this index are its speed and its configurability (you can tune it for higher precision or higher speed). The main drawbacks are:
- there's an element of randomness to the algorithm behind this index, so you can't expect exactly the same results for the same query every time,
- it is a much more memory-intensive type of index than a flat index.
To conclude this section: use a flat index if you've got fewer than 10,000 vectors or don't care much about speed; otherwise, use the HNSW index.
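As an illustration, here is how both index types could be created with the faiss library (the dimension and parameter values are just example numbers):

```python
import numpy as np
import faiss

dim = 1536
vectors = np.random.rand(50_000, dim).astype("float32")
faiss.normalize_L2(vectors)  # on unit-length vectors, L2 ranking matches cosine ranking

# Flat index: exact search, perfectly fine for small collections.
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(vectors)

# HNSW index: approximate but much faster lookups on large collections.
hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # 32 = number of graph neighbours (M)
hnsw_index.hnsw.efSearch = 64              # higher = more precise but slower search
hnsw_index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
distances, ids = hnsw_index.search(query, 10)  # 10 approximate nearest neighbours
```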
Considerations for using different types of search
Now that we know the two primary search methods - full-text search and vector search - it is a good time to think about when we want to use each of them and why. Full-text search by its nature doesn't capture the meaning of the source texts, but it's fast, deterministic and allows searching for exact information. On the other hand, vector search with ML embedding models and something like an HNSW index is not deterministic and doesn't let us search for anything exact, but it's very good at capturing the "meaning" of the source texts and providing decent results for long queries. So here are some use cases for each of these methods and the thinking process behind each decision:
- Basic search for any kind of application - pretty much every application out there can benefit from having the ability to search the content. For example, a blog with articles can have a search bar at the top, which would allow searching the articles using full-text search on titles and texts. This is a good use case because most queries written in such context are short - from 3 to 8 words. Such short texts aren’t a very good fit for a vector search but are perfect for a full-text search.
- Search for conversational AI applications - conversational RAG systems need to search for relevant information in company data to provide the best answers. A big part of such systems is incorporating conversational context into queries, for example when a user asks a follow-up question like "What about item 5 in the list?". The question itself doesn't contain any relevant context for the search but is related to a previous answer. To solve this, systems either pass the previous response along with the last query from the user, or use a query expansion technique - prompting an LLM to incorporate the necessary context into the query (a small sketch of this follows the list). Both techniques might produce a large textual query (10-500 words), and vector search is a much better fit for this kind of query.
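Here is a rough sketch of the query expansion idea with the OpenAI chat API (the model choice and prompt wording are illustrative only):

```python
# A rough sketch of query expansion: an LLM rewrites the user's follow-up question
# into a self-contained search query. Model and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

def expand_query(conversation_history: str, user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's question as a standalone search query, "
                           "using the conversation history to fill in missing context. "
                           "Return only the rewritten query.",
            },
            {
                "role": "user",
                "content": f"History:\n{conversation_history}\n\nQuestion: {user_question}",
            },
        ],
    )
    return response.choices[0].message.content
```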
Hybrid search
What if a user for some reason decides to write a more detailed query in a search bar? What if a user decides to look for a particular piece of content in a RAG system? After reading all the above, one more question might arise: "Is there a way to get the best of both worlds?". The answer is "kinda".
A naive approach would be to just perform both full-text search and vector search, then combine the scores and rank the results. The problem is the scale of each method:
- full-text search scores might vary from 0 to 50 or 100, depending on your source texts,
- vector search scores lie roughly in the range from -1 to 1 (in practice usually 0 to 1) if we're using a cosine similarity function.
Given the different ranges of values, each search method doesn't contribute equally to the final ranking of a document. To solve this problem one might try to normalise the scores, but then the next question is how. Should we normalise scores per query, or establish some normalisation process beforehand? These questions are very hard to answer and would require a lot of manual evaluation of the search results, but there's a simpler answer - the RRF reranker.
RRF (Reciprocal Rank Fusion) allows combining the search results from an arbitrary number of queries using only the rank of each document in each result list. Although the name might sound fancy, the actual logic behind it is dead simple:
score(d) = Σ_q 1 / (k + rank_q(d))
where rank_q(d) is the position of document d in the result list of query q, and the sum runs over all the result lists being combined.
Basically, we only care about the position of the document in each set of search results. This allows us to ignore the different score scales and combine the results of each search in a reasonable way. The constant k in the formula regulates how much each query impacts the position of the document. Let's look at an example with low and high values of k.
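For example (the numbers here are made up purely for illustration), suppose document A is ranked 1st by the full-text search and 10th by the vector search, while document B is ranked 5th by both:
- with k = 1, A scores 1/2 + 1/11 ≈ 0.59 while B scores 1/6 + 1/6 ≈ 0.33, so A wins comfortably thanks to its single top ranking,
- with k = 60 (a commonly used default), A scores 1/61 + 1/70 ≈ 0.0307 while B scores 2/65 ≈ 0.0308, so the two documents end up practically tied.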
What we can see here is that low values of k prioritise highly ranked documents over lower-ranked ones: if any query places a document high in its list, after reranking that document will likely stay among the top results. Conversely, a higher value of k makes the impact of each query more even.
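A minimal implementation of RRF could look like this (a sketch that takes lists of document IDs, each ordered by one search method):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Combine several ranked lists of document IDs into one, using RRF scores."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

full_text_results = ["doc_a", "doc_c", "doc_b"]
vector_results = ["doc_b", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([full_text_results, vector_results]))
```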
Reranking
Reranking means reordering a set of search results. We've already covered one kind of reranker (RRF) in the previous section, but there are a couple of other ways to do it:
- using cross-encoder models,
- using LLMs.
Unlike the RRF reranker, these models don't combine results from different search queries, so you would need RRF or some other strategy to obtain a single list of search results first.
Cross-encoder models
In essence, a cross-encoder model is a model built for a specific use case - reranking documents (there are some others, but we won't cover them here) - and in that sense it is similar to an embedding model. The key difference is that a cross-encoder scores the query and a document together, in a single pass, which lets it understand the context better than an embedding model, but also makes it more resource-intensive, so it is usually applied only to a small subset of search results. In theory, we could run such a model over the whole database of documents, but that would be very slow and memory-intensive. For that reason, we first perform other kinds of search and rerank, for example, just the top 50 documents, whilst returning maybe 10 documents from the search query. One example of such a model that can actually be used in production is Cohere Reranker 3.5, and as with embedding models, plenty of open-source cross-encoder models can be found on the Hugging Face Hub.
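As a sketch, here is how reranking with an open-source cross-encoder from the sentence-transformers library could look (the model name is just one popular example from the Hub):

```python
from sentence_transformers import CrossEncoder

# A small, widely used open-source cross-encoder; just one example of many on the Hub.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
candidates = [
    "To reset your password, open Settings and choose 'Security'.",
    "Our pricing plans start at $10 per month.",
    "Password changes take effect immediately on all devices.",
]

# The model scores each (query, document) pair jointly, which is what makes
# cross-encoders more accurate (and slower) than embedding models.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```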
LLMs
At the moment, it is a well-known fact that large language models are a pinnacle of natural language processing models, so they are the most flexible and allow tailoring their behaviour to the specific needs of each project. The main advantage of using an LLM for this kind of task is that it’s probably going to perform even better than a cross-encoder model, because you can also provide some relevant context in the system prompt, which would help the model assess the relevancy of each document. LLMs aren’t used that commonly for such use cases for several good reasons:
- Speed - when using an LLM from any of the big providers, the response time for such a task can vary from 1.5s to 3s, which might not be fast enough for your system.
- Cost - even when having a reasonably small set of search results for reranking, you are probably still going to use a lot of input tokens every time, so using LLMs might get expensive.
- Hallucinations - a word by now very familiar to everyone who's heard of LLMs, and even with the progress these models have made over the years, it is still something you have to worry about. When using LLMs for reranking, you're probably going to use tool calling or structured outputs, which sometimes fail to produce the defined structure.
Even though the drawbacks are serious, you can still achieve better results for a conversational RAG system if this is set up correctly.
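To make this concrete, here is a rough sketch of what LLM-based reranking with structured outputs could look like (the model choice, prompt and JSON shape are illustrative assumptions, not a fixed recipe):

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_rerank(query: str, documents: dict[str, str], top_n: int = 10) -> list[str]:
    """Ask an LLM to order document IDs by relevance to the query; illustrative only."""
    doc_listing = "\n".join(f"{doc_id}: {text}" for doc_id, text in documents.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "Rank the documents by relevance to the query. "
                           'Respond as JSON: {"ranked_ids": ["id1", "id2", ...]}.',
            },
            {"role": "user", "content": f"Query: {query}\n\nDocuments:\n{doc_listing}"},
        ],
    )
    ranked = json.loads(response.choices[0].message.content)["ranked_ids"]
    # Guard against hallucinated IDs and truncate to the requested size.
    return [doc_id for doc_id in ranked if doc_id in documents][:top_n]
```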