Tech Corner - 4. April 2025

Building an Effective Search for AI and Other Applications (Part 1 of 3)

In this series of articles, we’re going to learn about the key components of an effective search. Then, we’re going to look at the solutions (DBs) that provide us with the tools we need and compare them side by side. The main focus here will be on search methods that are generally available across many modern DBs, so that you can create a great search engine for your application without much effort. We’re going to look at these methods in particular:

  1. Full-text search,
  2. Vector search,
  3. Hybrid search.

The first article will be solely dedicated to full-text search.

So what’s full-text search?

The simple definition is that full-text search is the method that allows you to find the most relevant text documents for a given query. A definition like this is also fully applicable to other search methods, so the main difference is how these methods achieve this, and what their strengths and weaknesses are.

In the case of full-text search, there are three main components to how it works:

  1. Text pre-processing – converting textual information into smaller units that we’re going to use later.
  2. Scoring function – in full-text search, the relevancy of the document is represented by a numeric score, and each algorithm has its own function for calculating the score.
  3. Index – the structure that stores the relevant metadata for the search, increasing the effectiveness of information retrieval. This part is just for optimization purposes, so theoretically, having just the scoring function is enough to have a full-text search.

There are several full-text search algorithms, but the most commonly used one at the moment is BM25 (Best Match 25), so we’re going to look at this one.

Pre-processing

This is the first step in the information retrieval process with full-text search, and it is performed both for queries and for source texts. Suppose we have the query “how to perform an effective full-text search?” and a collection of articles about search. The scoring function described later calculates scores for individual tokens, so we first need to extract them from the text. There are several things we can do in this step:

  1. Extracting terms – basically splitting the text on some chosen characters, usually whitespace and punctuation. There are other ways to do this, but we’re not going to cover them here.
  2. Filtering – in this phase, we want to get rid of some words that don’t really matter much for the search:
     - Removing words that contain special characters or consist only of special characters.
     - Removing stop words – in the example query above, we had words like “to” and “an”; try removing these words and see if you can still understand the query. There are plenty of words in any text that don’t really impact the meaning much and can therefore be safely removed.
  3. Stemming – the process of reducing the variation in different forms of a single word. The source text might contain the word “eat” in forms like “eats” and “eating.” With stemming, we would reduce them all to a single form, “eat.”
  4. Lemmatization – an advanced processing technique that not only reduces variations in forms of a single word but can also recognize, for example, that “ate” is just another form of “eat.” This technique mostly isn’t available out of the box in modern databases, as it’s much harder to implement and requires more resources than stemming.

So let’s take our example query and try to process it as described above: “how to perform an effective full-text search?” → [“how”, “perform”, “effect”, “full”, “text”, “search”]

In this example, we’ve:

  1. Split the text on whitespace characters, the question mark, and the “-” character.
  2. Removed non-meaningful words like “to” and “an.”
  3. Performed stemming – “effective” → “effect.”
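
To make the pipeline concrete, here’s a minimal Python sketch of these three steps. Everything in it is simplified for illustration: the stop-word list is tiny and the suffix-stripping “stemmer” is naive, whereas real databases use proper algorithms such as Porter or Snowball stemming.

```python
import re

# A tiny stop-word list; real engines ship much larger, language-specific lists.
STOP_WORDS = {"to", "an", "a", "the", "and", "or", "of"}

# Naive suffix stripping, purely for illustration.
SUFFIXES = ("ive", "ing", "ed", "s")


def stem(term: str) -> str:
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text: str) -> list[str]:
    # 1. Extract terms: split on whitespace and punctuation (including "-").
    terms = re.split(r"[^\w]+", text.lower())
    # 2. Filter: drop empty tokens, tokens with special characters or digits, and stop words.
    terms = [t for t in terms if t and t.isalpha() and t not in STOP_WORDS]
    # 3. Stem what remains.
    return [stem(t) for t in terms]


print(preprocess("how to perform an effective full-text search?"))
# ['how', 'perform', 'effect', 'full', 'text', 'search']
```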

One thing to remember is that pre-processing doesn’t necessarily need to include all of these steps, so you either have to decide based on your particular use case or just trust the implementation in your DB of choice 🙂.

Each of the steps described above involves much more than what’s mentioned here, but going into detail is beyond the scope of this article.

BM25 Scoring Function

In this section, I’ll break down what the scoring function consists of and how individual parts of it affect the final score – in other words, how we can adjust the scoring function to our specific needs. The function consists of two main parts:

Scoring function:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Inverse document frequency:

$$\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

Term frequency:

$$\frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$


  1. q_i - a query term (keyword).
  2. f(q_i, D) - how many times the term q_i occurs in the document D.
  3. |D| - the length of the document.
  4. avgdl - the average document length across the collection.
  5. N - the total number of documents.
  6. n(q_i) - the number of documents containing the term q_i.
  7. k, b - free parameters, which we can set manually, but usually they’re set by the DB to some default values, like k = 1.2 and b = 0.75.

Consider the query we processed in the previous section: [“how”, “perform”, “effect”, “full”, “text”, “search”]. Each term here would be scored individually, and then we’d simply add up all the scores, giving us the final score for this particular query. To understand how the scoring function works, let’s break it down into its individual parts and, for the sake of simplicity, score only a single term.
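
Before breaking the parts down, here’s the whole formula translated into a minimal Python sketch, assuming queries and documents have already been pre-processed into lists of terms; the function names and the brute-force corpus scan are mine, not taken from any particular DB.

```python
import math


def idf(term: str, corpus: list[list[str]]) -> float:
    # N = total number of documents, n = number of documents containing the term.
    n_total = len(corpus)
    n_term = sum(1 for doc in corpus if term in doc)
    return math.log((n_total - n_term + 0.5) / (n_term + 0.5) + 1)


def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k: float = 1.2, b: float = 0.75) -> float:
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query:
        freq = doc.count(term)                      # f(q_i, D)
        norm = 1 - b + b * len(doc) / avgdl         # document length normalization
        score += idf(term, corpus) * (freq * (k + 1)) / (freq + k * norm)
    return score
```

Each query term is scored independently, and the per-term scores are simply summed, exactly as described above.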

Inverse document frequency

IDF(q_i) - the value of this part of the equation is high for terms that occur very rarely across the documents and low for terms that occur very frequently. This is important because there are words like “and,” “this,” and so on, which appear in pretty much any text, but there might also be some domain-specific words that occur rarely, for example if your company has a collection of medical articles. So the logic here is not to ignore generic words entirely but to give much more weight to rarely occurring ones. This part of the equation stays the same per term, so we’re not yet concerned with the relationship between the term and the document we’re scoring.
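
For a rough feel of the numbers (purely illustrative): with N = 1000 documents, a rare term found in only 5 of them gets IDF = ln((1000 - 5 + 0.5) / (5 + 0.5) + 1) ≈ 5.2, while a common term found in 900 of them gets IDF = ln((1000 - 900 + 0.5) / (900 + 0.5) + 1) ≈ 0.11, so the rare term contributes roughly fifty times more to the score of a document containing each term once.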

Term frequency

In the second part of the scoring function, we’re now concerned with the relationship between the term and the document we’re scoring. f(q_i, D) is the basis of our score and is pretty self-explanatory. Then we have the free parameters k and b, which allow us to adjust how the score is calculated in the following ways:

  1. k - controls the saturation of the term:
     - a lower k, like 0.5, means a higher degree of saturation, which in turn means that when a document contains the term q_i many times, the resulting score won’t increase much because of these repeated occurrences,
     - a higher k, like 2.0, means a lower degree of saturation, which means that repeated occurrences will contribute more to the resulting score.
  2. b - controls the normalization, or, in other words, how much the term’s concentration relative to document length impacts the resulting score:
     - a lower b, like 0, means no normalization, which would produce basically the same score for a short document and a long document if they contain q_i the same number of times,
     - a higher b, like 1, means full normalization, which decreases the score for long documents containing the same number of occurrences of q_i compared to shorter documents.
Figure: term saturation.


Choosing appropriate values for k and b is very hard and requires a lot of domain knowledge, so there are some standard ranges these parameters should fall into:

  1. k - standard values range from 0.5 to 2.0, and modern DBs usually choose 1.2 as the default value, although there isn’t actually a hard upper bound,
  2. b - as this is the degree of normalization, the value should be in the range from 0 to 1; modern DBs usually set it to 0.75 by default.
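
To get a concrete feel for what k and b do, here’s a small illustrative snippet (all the numbers are made up) that evaluates just the term-frequency part of the formula for different parameter values:

```python
def tf_part(freq: int, k: float, b: float, doc_len: int, avgdl: float) -> float:
    # Term-frequency part of BM25: (f * (k + 1)) / (f + k * (1 - b + b * |D| / avgdl))
    return (freq * (k + 1)) / (freq + k * (1 - b + b * doc_len / avgdl))


# Saturation: with a low k, additional occurrences of the term quickly stop adding score.
for freq in (1, 5, 20):
    low_k = tf_part(freq, k=0.5, b=0.75, doc_len=100, avgdl=100)
    high_k = tf_part(freq, k=2.0, b=0.75, doc_len=100, avgdl=100)
    print(f"freq={freq}: k=0.5 -> {low_k:.2f}, k=2.0 -> {high_k:.2f}")

# Normalization: with b = 1, a long document scores lower than a short one for the
# same number of occurrences; with b = 0, document length doesn't matter at all.
for doc_len in (50, 400):
    no_norm = tf_part(5, k=1.2, b=0.0, doc_len=doc_len, avgdl=100)
    full_norm = tf_part(5, k=1.2, b=1.0, doc_len=doc_len, avgdl=100)
    print(f"doc_len={doc_len}: b=0 -> {no_norm:.2f}, b=1 -> {full_norm:.2f}")
```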

Creating an Index

The purpose of the index for BM25 is pretty much the same as the purpose of regular DB indexes – making search more effective. In theory, the index isn’t required for the search algorithm to work, but without it, performing a search would take a very long time for any larger collection of documents. We’re just going to briefly look into what this index contains, as the specific implementation is up to the DB vendor or you.

First, we would do the pre-processing described at the beginning of the article for each document and save the unique terms.

Everything that comes after extracting the terms is up to whoever implements the solution, as it all comes down to whether you want to use more memory or more compute power when searching. If memory isn’t an issue, it’s possible to pre-calculate all the parameters needed for the BM25 scoring function. Some of these parameters would be stored in a map-like data structure per term, and some would be stored per document. If we didn’t have that much memory and weren’t that concerned about speed, we could simply pre-calculate fewer things.

In the end, we would have an array of all unique terms. For each individual document, we would store a structure called a sparse vector, which usually contains an array of tuples like ({index of the term}, {BM25 score}). Well, it would contain a BM25 score if you were to go with the memory-heavy approach.
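
As a rough illustration of the memory-heavy variant, assuming documents have already been pre-processed into lists of terms, a toy index might look something like the following (the structure and names are mine, not any vendor’s implementation):

```python
import math


def build_index(corpus: list[list[str]], k: float = 1.2, b: float = 0.75):
    # Vocabulary: every unique term gets a fixed index position.
    vocab = {term: i for i, term in enumerate(sorted({t for doc in corpus for t in doc}))}

    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs

    # Pre-calculate per-term values: document frequency and IDF.
    doc_freq = {term: sum(1 for doc in corpus if term in doc) for term in vocab}
    idf = {term: math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
           for term, df in doc_freq.items()}

    # One sparse vector per document: a list of (term index, BM25 score) tuples.
    sparse_vectors = []
    for doc in corpus:
        norm = 1 - b + b * len(doc) / avgdl
        vector = []
        for term in set(doc):
            freq = doc.count(term)
            score = idf[term] * (freq * (k + 1)) / (freq + k * norm)
            vector.append((vocab[term], score))
        sparse_vectors.append(sorted(vector))
    return vocab, sparse_vectors
```

Searching then reduces to pre-processing the query, looking up each query term’s index in the vocabulary, and summing the matching scores from every document’s sparse vector.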

Further Enhancements

Now that you understand the basics of how full-text search works, we can briefly discuss some improvements that might be useful for real-world applications:

  1. Synonym Search – In the query pre-processing step, we can use a dictionary of synonyms to expand the query. As you may recall, in BM25 the score is calculated per term, so we can look up two or three synonyms for a word and include them in the query when calculating the score. This is particularly useful for domain-specific applications.
  2. Fuzzy Search – It’s common to search for something on a web page, accidentally hit the wrong key, and see the results disappear, which is incredibly frustrating. Fuzziness allows for typos in queries by using a fuzzy matching algorithm to find several candidates in our index for a mistyped word. These algorithms usually accept a “fuzziness ratio” parameter, which defines the degree of similarity required between two compared words (a toy sketch of both ideas follows below):
     - Suppose we set a fuzziness ratio of 0.75, meaning we require a 75% match between two strings according to our chosen algorithm.
     - We can then order the results by match percentage.
     - When calculating the BM25 score for a term, we can assume that the top three results by match percentage represent the same term. We then take the highest BM25 score among them and use it in the final document score.

The specific implementations of these features might be very different across different DBs, so the approaches described above are just to give you an idea of how they work.
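
To make the ideas above a bit more tangible, here’s a toy sketch of both: query expansion with a hand-written synonym dictionary and fuzzy matching using Python’s standard-library difflib. The synonym dictionary and function names are hypothetical, and real databases typically rely on edit distance or n-gram matching rather than difflib, so treat this purely as an illustration.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written synonym dictionary used for query expansion.
SYNONYMS = {"search": ["lookup", "retrieval"]}


def expand_with_synonyms(query_terms: list[str]) -> list[str]:
    # Add a few synonyms for each query term; every added term is scored like any other.
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def fuzzy_candidates(term: str, index_terms: list[str],
                     fuzziness: float = 0.75, top: int = 3) -> list[str]:
    # Compare the (possibly mistyped) query term against every indexed term and
    # keep the best matches above the fuzziness threshold.
    scored = [(SequenceMatcher(None, term, candidate).ratio(), candidate)
              for candidate in index_terms]
    matches = sorted((s for s in scored if s[0] >= fuzziness), reverse=True)
    return [candidate for _, candidate in matches[:top]]


print(expand_with_synonyms(["full", "text", "search"]))  # ['full', 'text', 'search', 'lookup', 'retrieval']
print(fuzzy_candidates("serach", ["search", "perform", "text"]))  # ['search']
```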


about the author

Vitalii Tsimbolynets

I’m a web developer with a passion for problem-solving. The tougher the challenge, the more motivated I am to find a clean, effective solution.