Tech Corner - 9. May 2025

Building an effective search for AI and other applications (part 3 of 3)

This is the final article in the search series, and we’re finally going to talk about some practical solutions. There are a lot of considerations when choosing a good solution for search, so to make things easier, I’m going to outline the strengths and weaknesses of some popular options.

Missed the earlier parts? Check out the first and second articles in this search series.

High-level overview of solutions

As I see it, there are three important things to consider before choosing a solution:

  1. Are there a lot of relationships among the data?
  2. How complex is the expected filtering logic?
  3. Are we looking for just a search solution, or a database with search capability?

There are plenty of vector DBs out there, so going through every one of them would be nearly impossible. Because of that, I’ve chosen just three solutions, which I think fit into very different use cases based on the criteria listed above:

  1. Pinecone – the easiest solution to set up initially, which is very good for simple use cases like internal docs, or data that doesn’t need to be updated often and where there’s no need for advanced search and filtering.
  2. Postgres – lies somewhere in the middle in terms of initial setup complexity and usage. Perfect if you’ve already got your data in Postgres or your team is familiar with it. It has a decent number of search features and even allows for some advanced search techniques.
  3. Elasticsearch – the most complex in terms of initial setup and management, but it also grants access to the most advanced search techniques and simplifies complex use cases. Pretty much everything performance-related is handled automatically. There’s also a completely free open-source alternative called OpenSearch. It started as a fork of Elasticsearch 7.10, created after Elastic’s license change in 2021, and has since taken its own path.

The selection might be a bit biased, so take it with a grain of salt.

One aspect not mentioned in the above list is performance, and there’s a good reason for that: it isn’t tied to the DB itself that much. In my experience, most benchmarks on the internet in this regard are useless, because the search algorithms for both full-text and vector search don’t depend on the database you’re using. For both search methods, separate indexes will be created, so it’s mostly a matter of how well those search methods are implemented. Open-source DBs have an advantage here, because someone can always come along and build a faster extension.

For the sake of this article, we’re going to go through the setup of a Postgres database with features like full-text search, vector search, and hybrid search.

About the example dataset

The dataset used for demonstration is “All the News Dataset”, which can be found on Kaggle. It’s basically just a large collection of articles.

Postgres setup

The version of Postgres used in this demonstration is 17.4. We’re going to be using native Postgres functionality for full-text search and the pgvector extension for vector search. Note that there’s also the pgvectorscale extension from the team that built TimescaleDB. The latter extension is supposed to be more performant than pgvector. There’s also an extension that adds the BM25 full-text search algorithm to Postgres, though it’s not as easy to set up.

Let’s assume we’ve already created a base table from the dataset above, called news_articles. First, let’s set up full-text search for this table:

```sql
ALTER TABLE news_articles
  ADD COLUMN article_vector tsvector;

UPDATE news_articles
SET article_vector =
     setweight(to_tsvector('english', coalesce(title, '')), 'A')
  || setweight(to_tsvector('english', coalesce(article, '')), 'B');
```

Here, we’re basically adding search functionality for both the title and content of the article. Here’s an example query:

```sql
SELECT
  id,
  title,
  ts_rank(article_vector,
    plainto_tsquery('english', 'photographs of flowers')
  ) AS rank
FROM news_articles
WHERE article_vector @@ plainto_tsquery('english', 'photographs of flowers')
ORDER BY rank DESC;
```

The syntax may seem a bit odd, but that’s usually the tradeoff when you’re trying to add functionality like this into a general-purpose database. There are also additional steps needed for an actual production setup, which we won’t cover here, such as:

  1. adding a trigger to create a vector for each new record,
  2. setting up a GIN index for efficient search.
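
For completeness, those two steps can be sketched as follows (the index, function, and trigger names here are my own placeholders, not from the article):

```sql
-- 2. GIN index so @@ queries don't scan the whole table
CREATE INDEX idx_news_articles_fts
  ON news_articles USING GIN (article_vector);

-- 1. Trigger function that rebuilds the weighted vector on every insert/update
CREATE FUNCTION news_articles_tsvector_update() RETURNS trigger AS $$
BEGIN
  NEW.article_vector :=
       setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A')
    || setweight(to_tsvector('english', coalesce(NEW.article, '')), 'B');
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER news_articles_tsvector_trigger
  BEFORE INSERT OR UPDATE ON news_articles
  FOR EACH ROW EXECUTE FUNCTION news_articles_tsvector_update();
```

The built-in tsvector_update_trigger helper would also work, but it doesn’t apply setweight, so a custom function is needed to keep the title/body weighting.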

Setting up the pgvector extension is pretty straightforward. First, install the extension via the official installation guide:

```bash
cd /tmp
git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo
```

Then just add the extension to the database:

```sql
CREATE EXTENSION vector;
```

Let’s assume that at this point we’ve created a main table for articles from the dataset. Next, we create a separate table for the embeddings, since some articles will be split into multiple chunks:

```sql
CREATE TABLE article_chunks (
    id TEXT PRIMARY KEY,
    original_id INTEGER REFERENCES news_articles(id),
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding VECTOR(1536)
);
```
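
The chunking itself happens in application code before rows are inserted. Here’s a minimal character-based sketch; the chunk size and overlap values are arbitrary choices for illustration, not something prescribed by the dataset or the article:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping the previous one.

    The overlap keeps sentences that straddle a boundary visible to both chunks.
    Requires chunk_size > overlap so the window always advances.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

In practice you’d likely split on sentence or paragraph boundaries (or by tokens) rather than raw characters, but the bookkeeping is the same: each chunk gets its chunk_index and a reference back to the original article.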

We’ll be using the “text-embedding-3-small” model from OpenAI for demonstration purposes, which has 1536 dimensions, so we’ve defined a vector of that size. As discussed in the previous article, you might want to use a specialized index to improve search speed, which you can add with a simple command:

CREATE INDEX ON article_chunks USING hnsw (embedding vector_cosine_ops);

Here’s an example query for searching the database:

```sql
SELECT
  ac.id AS chunk_id,
  ac.original_id,
  ac.chunk_index,
  ac.chunk_text,
  na.title,
  1 - (ac.embedding <=> '<EMBEDDING_VECTOR>'::vector) AS similarity_score
FROM article_chunks AS ac
JOIN news_articles AS na
  ON ac.original_id = na.id
ORDER BY similarity_score DESC
LIMIT 10;
```

A couple of important things to note:

  1. We’re using the <=> operator, which computes cosine distance, which is why the query takes 1 minus the distance to get a similarity score.
  2. The embedding vector is just a placeholder, as inserting a full 1536-dimensional vector wouldn’t look great here.
  3. Ordering by the computed similarity alias can prevent Postgres from using the HNSW index; to take advantage of it, order by the raw distance expression (ac.embedding <=> '<EMBEDDING_VECTOR>'::vector) ascending instead.
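
To make the scoring concrete, here’s a pure-Python sketch of what <=> computes, and why the query subtracts it from 1 (for illustration only; in the database this all happens inside pgvector):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity.

    0 means identical direction, 1 means orthogonal, 2 means opposite.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

So `1 - (embedding <=> query)` in the SQL above recovers plain cosine similarity, where higher means more similar.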

Hybrid search

There are several ways to handle hybrid search in Postgres:

  1. Just run two separate queries and implement a reranking function in your language of choice.
  2. Use PL/pgSQL to define a combined ranking function inside the database.

Option 1 is easier to implement but less efficient, since both full result sets have to leave the database before merging. Option 2 is harder to write and adjust, but much more efficient. There’s a lot involved in the second approach, so I’ve decided not to include it here, as it would be too long.
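
The reranking step in option 1 is commonly done with Reciprocal Rank Fusion (RRF), which only needs the rank positions from each result list, not their incompatible scores. A minimal sketch (the constant k = 60 is the widely used default, not something from this article):

```python
def rrf_merge(fulltext_ids: list[int], vector_ids: list[int], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: merge two ranked ID lists into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    ranked well by both search methods rise to the top.
    """
    scores: dict[int, float] = {}
    for results in (fulltext_ids, vector_ids):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Run your full-text query and your vector query separately, pass the two ordered ID lists to rrf_merge, and fetch the top fused IDs from the database.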

Final thoughts

If you need something really simple for search, Pinecone or a similar solution is probably your best choice. If you’ve already got your data in Postgres, understand search mechanisms well, or have time to spare, Postgres will likely be the best fit. If the limitations of a NoSQL database — such as lack of support for referencing and limited general-purpose capabilities — don’t bother you, Elasticsearch or OpenSearch might be your best option. However, be aware that some Elasticsearch features require a license, which can be quite expensive for smaller projects.

Even though I’ve only demonstrated a minimal setup for Postgres, I hope that by the end of this series you have a decent understanding of what search is and how to do it right.

About the author

Vitalii Tsimbolynets

I’m a web developer with a passion for problem-solving. The tougher the challenge, the more motivated I am to find a clean, effective solution.