AI Monthly Insights #2

In the first edition of the AI Monthly Insights, we listed a few useful resources, mostly related to the development of RAG systems. In this edition, we look at RAG systems from the perspective of quality.
Assuring that RAG systems produce good-quality results is still a concern, because the results depend on the various AI models and retrieval methods involved.
RAG systems are pipelines consisting of steps from retrieval to generation, and each step requires quality measurement in order to:
- identify the issues when the outcomes of the generation phase are not good enough,
- improve the probability that the outcomes are correct.
Tuning the pipeline requires trying numerous configurations and settings of different parameters and thresholds, and then comparing the results. Automation is needed to make this process efficient.
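To give an idea of what such automation can look like, here is a minimal, purely hypothetical sketch: `run_pipeline` and `score_answer` are made-up stand-ins for your own retrieval/generation code and whatever metric you choose, and the parameter grid is just an example.

```python
from itertools import product

def run_pipeline(question, chunk_size, similarity_threshold):
    # Hypothetical stand-in for your real retrieval + generation pipeline.
    # It would embed the question, fetch chunks above the similarity
    # threshold, and generate an answer; here it just returns a canned string.
    return "Refunds are accepted within 30 days of purchase."

def score_answer(answer, ground_truth):
    # Hypothetical stand-in for a real metric: naive word overlap with the
    # reference answer, in place of e.g. a faithfulness or relevance score.
    answer_words = set(answer.lower().split())
    truth_words = set(ground_truth.lower().split())
    return len(answer_words & truth_words) / len(truth_words)

question = "What is the refund policy?"
ground_truth = "Refunds are accepted within 30 days."

# Sweep a small grid of pipeline settings and compare the resulting scores.
results = {}
for chunk_size, threshold in product([256, 512], [0.7, 0.8]):
    answer = run_pipeline(question, chunk_size, threshold)
    results[(chunk_size, threshold)] = score_answer(answer, ground_truth)

best = max(results, key=results.get)
print("best configuration:", best, "score:", results[best])
```

In practice the scoring step is exactly where the evaluation toolkits described below come in.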
Obviously, human-in-the-loop testing (ideally with an end user who has the domain knowledge) is necessary, but it is pointless to conduct it while the system is not yet considered stable from a development perspective.
Important note: bear in mind that all these tools are very new, mostly in beta versions, and you will very likely be an early adopter, as the entire generative AI era is still in its early stages.
Our first recommendation is RAGAS (https://github.com/explodinggradients/ragas) - a toolkit for evaluating various aspects of your RAG applications. It is based on a set of metrics that provide insight into the performance of your system. You will find implemented metrics for evaluating context precision and recall, the system's sensitivity to noise in the retrieved context, general answer relevance, as well as faithfulness, obviously one of the most important metrics for factual consistency.
In most cases, this tool (or alternatives) will be enough to cover what you need as a developer:
- You can adjust the similarity threshold used by semantic search to identify relevant chunks and immediately see how it affects the different metrics: what improves and what worsens.
- You can change the method of retrieval from simple semantic search to combined hybrid search or any other method and determine whether that is helpful or harmful for your pipeline.
- You can also modify the chunking implementation and see whether that improves context relevance and overall faithfulness.
- Last but not least, you can switch the embedding model or the LLM, experiment with the parameters, and easily compare the models for your use case (a minimal evaluation sketch follows below).
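As a concrete starting point, here is a minimal sketch of a RAGAS evaluation run. It assumes the 0.1-style API (`evaluate` plus metric objects from `ragas.metrics`) and an LLM/embeddings backend configured for RAGAS (by default an OpenAI API key in the environment); the sample question, answer, contexts, and ground truth are made up, and since the tools are still evolving, the exact imports and column names may differ in your installed version. Re-running the same script after each configuration change gives you the before/after comparison described above.

```python
from datasets import Dataset          # pip install datasets
from ragas import evaluate            # pip install ragas
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# A tiny, made-up evaluation set; in practice you would load your own
# questions, the answers your pipeline produced, the retrieved contexts,
# and reference answers ("ground truth").
eval_data = Dataset.from_dict({
    "question": ["What is the warranty period for the X200 laptop?"],
    "answer": ["The X200 comes with a two-year warranty."],
    "contexts": [["All X200 laptops include a 24-month manufacturer warranty."]],
    "ground_truth": ["The X200 has a two-year (24-month) warranty."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)          # per-metric scores for this run
```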
If for some reason the above is not enough, you can pick DeepEval (https://github.com/confident-ai/deepeval) and cover the same ground and much more, e.g. a real-time UI presentation layer - the Confident AI platform (https://www.confident-ai.com/) - for the metrics dashboards you need.
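For comparison, a minimal DeepEval sketch might look like the following, assuming its `LLMTestCase` and metric API; the test-case content is invented, the thresholds are arbitrary, and an LLM backend (by default an OpenAI key) is required for the metrics to run.

```python
from deepeval import evaluate                     # pip install deepeval
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# A made-up test case: the question, the answer your pipeline produced,
# and the chunks that were retrieved for it.
test_case = LLMTestCase(
    input="What is the warranty period for the X200 laptop?",
    actual_output="The X200 comes with a two-year warranty.",
    retrieval_context=["All X200 laptops include a 24-month manufacturer warranty."],
)

# A metric "passes" when its score is above the given threshold.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
)
```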
Yours Truly,
L-AI-dy Whistledown
As Co-CEO, I bring together deep technical expertise and strategic vision to drive business growth. I enjoy solving problems through smart architecture, data, and a bit of math. Outside of work, you’ll probably find me on a bike, at the gym, or just tackling something new — because I don’t sit still for long.