RAG on Azure Cognitive Search

gksriharsha
4 min read · Feb 5, 2024

Retrieval Augmented Generation (RAG) is a technique for fetching relevant information so that LLMs ground their answers in real data instead of hallucinating. You have probably read a lot about RAG by now; it is everywhere! The rise of LLMs with relatively short context windows has fueled an explosion of vector databases, which store millions if not billions of tokens while the LLM processes only a small fraction of them to answer a question.

The most important functionality of a vector database is similarity search, often with a pre-filtering option to exclude the vast majority of irrelevant text up front. While working on a project, I wanted to explore existing vector database solutions. In this article, I want to share the boilerplate code to get your RAG application up in minutes.

Enter Cognitive Search


from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import (
    VectorizedQuery,
    VectorFilterMode,
)

Cognitive Search (recently renamed Azure AI Search) is Azure’s vector database offering. Its adoption of the OData expression language for metadata handling and search makes it easier for developers to build solutions in this new paradigm of applications.

The first and most crucial capability of a vector database is similarity search: finding relevant documents according to a metric (typically cosine similarity). Because documents are often walls of text, we are forced to use chunking strategies to break the information into meaningful pieces and fetch only the relevant chunks. But that means even more “rows” to search in the vector database. The true power of these vector databases becomes visible once we get into millions of documents.

Yes! Millions!

And then comes the situation where we can have multiple shards (partitions), each filled to the brim with information. That means the true full scale of operation is in the billions of documents.

Milvus, an open-source vector database, has already published benchmarks in which it retrieved a handful of relevant documents from over a billion entries in fractions of a second (Link). It is therefore necessary for us to understand how retrieval works, flaws and all, in order to design systems that scale.
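To make the chunking point above concrete, here is a minimal sketch of a fixed-size chunker with overlap; the word-based splitting and the sizes are illustrative, not a recommendation:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into overlapping word-based chunks.

    Each chunk holds `chunk_size` words, and consecutive chunks share
    `overlap` words so that sentences cut at a boundary still appear
    whole in at least one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Example: 10 words, chunks of 4 with an overlap of 2
sample = " ".join(str(i) for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=2))
# ['0 1 2 3', '2 3 4 5', '4 5 6 7', '6 7 8 9', '8 9']
```

Each chunk would then be embedded and stored as its own “row” in the vector database, which is exactly why the row count balloons so quickly.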

A basic vector search can be performed with the following Python code:

client = SearchClient(
    endpoint=uri,
    index_name=index_name,
    credential=AzureKeyCredential(AzureKey),
)

vector_query = VectorizedQuery(
    vector=vector,  # the query embedding
    k_nearest_neighbors=k,
    fields=vector_fields,  # comma-separated name(s) of the vector field(s) to search
)

search_results = client.search(
    search_text=None,  # pure vector search, no keyword query
    vector_queries=[vector_query],
    vector_filter_mode=VectorFilterMode.PRE_FILTER,
    select=select,  # select the specific set of columns that we really require
    filter=filter if filter else None,  # apply metadata filtering before vector search
)

The select variable in the snippet above is a list of column names to retrieve after performing the search operation on the database. The filter variable holds a string expression (in OData syntax) that shrinks the target search size using the metadata associated with each row. Such metadata could be a list/collection, a string, or a number.

  • Collection: when the attribute attached to each vector is a list and the search request should match all of the requested values, the following function can be used. In other words, if a particular vector is tagged with [“Hilton”, “Downtown”, “Seattle”] and the request mentions the tags [“Downtown”, “Seattle”], that vector would be considered for the vector search, along with possibly other hotels such as Marriott, etc.
def filter_two_lists(values: list, field_name: str):
    return " and ".join(
        f"{field_name}/any(value: value eq '{value}')" for value in values
    )
  • There is another collection case, in which the list holds mutually exclusive strings/numbers, so the request only ever searches for one value:
def filter_single_list(value, field_name):
    # search.in compares the field against a comma-delimited list of values
    return f"search.in({field_name},'{value}',',')"
  • Single value: in this mode of metadata search, the vector has an attribute that holds only one value, and the request also supplies a single value. In such a case, the following function can be used:
def filter_single_value(value, field_name):
    # numeric comparison; wrap the value in single quotes for string fields
    return f"{field_name} eq {value}"
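As a quick sanity check, the helpers above can be combined into a single OData expression to pass as the filter argument; the field names Tags and Rating here are hypothetical examples, not part of any real index:

```python
def filter_two_lists(values: list, field_name: str):
    return " and ".join(f"{field_name}/any(value: value eq '{value}')" for value in values)

def filter_single_value(value, field_name):
    return f"{field_name} eq {value}"

# Build one combined pre-filter expression from several metadata conditions
tag_filter = filter_two_lists(["Downtown", "Seattle"], "Tags")
rating_filter = filter_single_value(4, "Rating")
combined = " and ".join([tag_filter, rating_filter])
print(combined)
# Tags/any(value: value eq 'Downtown') and Tags/any(value: value eq 'Seattle') and Rating eq 4
```

The combined string can be supplied directly as the filter parameter of client.search, so the pre-filter trims the candidate set before the vector comparison runs.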

With the above filtering techniques combined with vector similarity search, it becomes possible for the search to be performed across millions if not billions of entries. In the next articles, we can look at custom scoring profiles of the vector database. These allow us to influence how results are ranked, leading to better search relevance than relying on a plain cosine similarity/L2 metric alone.
