The architecture of a Retrieval Augmented Generation (RAG) system

This system is deployed on Azure, using Azure Foundry as the application platform, Azure OpenAI for Large Language Model (LLM) capabilities, and Azure Cosmos DB for vector storage and retrieval.

Core Concept: Why RAG?

Standard LLMs generate responses based solely on their training data, which can be outdated or lack specific domain knowledge. RAG addresses this in three steps (sketched in code after the list):

  1. Retrieving: Fetching relevant, up-to-date information from an external knowledge source (like documents stored and indexed in Cosmos DB) based on the user's query.

  2. Augmenting: Adding this retrieved information as context to the original user query.

  3. Generating: Sending the augmented prompt (original query + retrieved context) to the LLM (Azure OpenAI) to generate a more informed, accurate, and grounded response.
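
In code, the three steps form a short loop. The sketch below is purely illustrative: retrieve_chunks and generate are stand-ins for the Cosmos DB vector search and Azure OpenAI call detailed in the component descriptions that follow.

# Illustrative only: retrieve_chunks() and generate() are placeholders for the
# Cosmos DB vector search and Azure OpenAI call described in the components below.

def retrieve_chunks(query: str, top_k: int = 5) -> list[str]:
    # Placeholder for the vector similarity search against Cosmos DB.
    return ["<relevant text chunk retrieved from the knowledge base>"]

def generate(prompt: str) -> str:
    # Placeholder for the Azure OpenAI chat/completion call.
    return "<grounded answer produced by the LLM>"

def answer(user_query: str) -> str:
    # 1. Retrieve relevant context for the query.
    chunks = retrieve_chunks(user_query)
    # 2. Augment the original query with that context.
    prompt = ("Answer the question using only the context below.\n\n"
              "Context:\n" + "\n\n".join(chunks) +
              "\n\nQuestion: " + user_query)
    # 3. Generate the final, grounded response.
    return generate(prompt)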

Main Architectural Components:

Here's a breakdown of the typical components involved in such a system:

+---------------------+      +--------------------------------+      +---------------------+
|   User Interface    |<---->|   RAG Application Service      |<---->| Azure OpenAI Service|
| (Web App, Chatbot,  |      |   (Hosted on Azure Foundry)    |      | (LLM & Embeddings)  |
|  API Client, etc.)  |      +--------------------------------+      +---------------------+
+---------------------+                 |          ^
                                        |          |
                                  (Query Vector)  (Retrieved Context)
                                        |          |
                                        v          |
                         +-------------------------------------+
                         | Azure Cosmos DB (w/ Vector Search)  |
                         | - Vector Embeddings                 |
                         | - Original Text Chunks              |
                         | - Metadata                          |
                         +-------------------------------------+
                                        ^
                                        | (Data Ingestion & Indexing - Often Offline)
                                        |
+------------------------------------+  |  +-------------------------+
| Data Ingestion & Indexing Pipeline |<-+--| Source Knowledge Base   |
| (e.g., Azure Function, Data Factory|     | (Docs, DBs, Websites)   |
| Uses Azure OpenAI Embedding Model) |     +-------------------------+
+------------------------------------+

Components:

  1. User Interface (UI) / Client Application:

    • Role: The entry point for user interaction. This could be a web application, a chatbot interface, an API endpoint for other services, etc.

    • Function: Captures the user's query and sends it to the RAG Application Service. Displays the final generated response back to the user.
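
A client can be as thin as one HTTP call to the RAG Application Service. The sketch below assumes a hypothetical /chat route and JSON payload exposed by your own service; neither is part of any Azure API.

import requests  # pip install requests

# Hypothetical endpoint and payload exposed by the RAG Application Service;
# the route name and JSON fields are illustrative assumptions.
RAG_SERVICE_URL = "https://my-rag-app.example.com/chat"

def ask(question: str) -> str:
    response = requests.post(RAG_SERVICE_URL, json={"query": question}, timeout=60)
    response.raise_for_status()
    return response.json()["answer"]

if __name__ == "__main__":
    print(ask("What is our refund policy?"))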

  2. RAG Application Service (Hosted on Azure Foundry):

    • Role: This is the central orchestrator of the RAG process. It's the custom application code you would write (e.g., in Go, Rust, C#, Java, Python) and deploy to Azure Foundry.

    • Platform: Azure Foundry provides the Platform-as-a-Service (PaaS) environment. You'd cf push your application artifact. Foundry manages the underlying infrastructure, scaling, routing, and service binding.

    • Key Functions (sketched in code after this component):

      • Receives Query: Accepts the user query from the UI/Client.

      • Generates Query Embedding: Sends the user query to the Azure OpenAI Embedding endpoint (e.g., text-embedding-ada-002) to convert it into a vector representation (an array of floating-point numbers).

      • Queries Vector Database (Cosmos DB): Sends the generated query vector to Azure Cosmos DB to perform a vector similarity search.

      • Retrieves Context: Receives the top 'k' most relevant document chunks (text and potentially metadata) from Cosmos DB based on vector similarity.

      • Constructs Augmented Prompt: Creates a new prompt for the LLM, carefully combining the original user query with the retrieved text chunks as context. Prompt engineering is key here.

      • Calls Generative LLM (Azure OpenAI): Sends the augmented prompt to the Azure OpenAI Completion or Chat Completion endpoint (e.g., gpt-4, gpt-35-turbo).

      • Processes Response: Receives the generated text response from Azure OpenAI. May perform minor formatting or post-processing.

      • Returns Response: Sends the final response back to the UI/Client.

    • Service Binding (Azure Foundry): Your application running on Foundry would need to securely connect to Azure OpenAI and Azure Cosmos DB. This is typically done via Azure Foundry's service binding mechanism, which injects connection details (like endpoint URLs, API keys, or managed identity information) into the application's environment. Using Azure Managed Identities is the recommended practice for secure, keyless authentication where supported.
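
A minimal sketch of this orchestration, assuming a Python application using the openai package's AzureOpenAI client, could look like the following. The environment variable names, deployment names, and API version are assumptions to adapt to your own service bindings, and the Cosmos DB query is stubbed out here (a concrete query appears under component 4).

import os
from openai import AzureOpenAI  # pip install openai

# Endpoint, key, and deployment names are assumed to arrive via Azure Foundry
# service bindings as environment variables; the names here are illustrative.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)
EMBEDDING_DEPLOYMENT = os.environ.get("EMBEDDING_DEPLOYMENT", "text-embedding-ada-002")
CHAT_DEPLOYMENT = os.environ.get("CHAT_DEPLOYMENT", "gpt-4")

def embed(text: str) -> list[float]:
    # Same embedding model as the ingestion pipeline, so vectors are comparable.
    return client.embeddings.create(model=EMBEDDING_DEPLOYMENT, input=text).data[0].embedding

def vector_search(query_vector: list[float], top_k: int = 5) -> list[str]:
    # Placeholder: the actual Cosmos DB vector query is shown under component 4.
    return ["<retrieved text chunk>"]

def answer(user_query: str) -> str:
    # Embed the query, retrieve context, build the augmented prompt, call the LLM.
    chunks = vector_search(embed(user_query))
    context = "\n\n".join(chunks)
    messages = [
        {"role": "system",
         "content": "Answer using only the provided context; say so if the context is insufficient."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {user_query}"},
    ]
    completion = client.chat.completions.create(model=CHAT_DEPLOYMENT, messages=messages)
    return completion.choices[0].message.content

In production, the API key could be replaced with a Microsoft Entra ID token obtained through a managed identity, in line with the keyless recommendation above.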

  3. Azure OpenAI Service:

    • Role: Provides the core AI models needed for RAG.

    • Key Endpoints Used:

      • Embedding Model Endpoint (e.g., text-embedding-ada-002): Used by both the Data Ingestion Pipeline (to embed knowledge base chunks) and the RAG Application Service (to embed incoming user queries). It's crucial to use the same embedding model in both places for the similarity search to work correctly (the similarity computation is illustrated in the sketch below).

      • Completion/Chat Model Endpoint (e.g., gpt-4, gpt-35-turbo): Used by the RAG Application Service to generate the final response based on the augmented prompt.
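
Vector similarity is what ties these two endpoints together: retrieval simply ranks stored chunk vectors by their similarity to the query vector, which only makes sense if both were produced by the same embedding model. The snippet below computes cosine similarity by hand purely to illustrate the idea (Cosmos DB performs the equivalent ranking internally); the endpoint variables, deployment name, and API version are assumptions.

import math
import os
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def embed(text: str) -> list[float]:
    # The same deployment must be used for both documents and queries.
    return client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = embed("How do I reset my password?")
relevant = embed("To reset your password, open Settings > Security and choose Reset.")
unrelated = embed("The cafeteria serves lunch between 11:30 and 14:00.")

# The password-related chunk should score noticeably higher than the unrelated one.
print(cosine(query, relevant), cosine(query, unrelated))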

  4. Azure Cosmos DB (with Vector Search Capability):

    • Role: Acts as the persistent, scalable knowledge base and enables efficient retrieval of relevant context using vector similarity search.

    • Vector Profiles/Storage: Cosmos DB stores "vector profiles" which typically consist of:

      • Vector Embedding: The numerical vector representation of a text chunk, generated by the Azure OpenAI embedding model.

      • Original Text Chunk: The actual piece of text that the vector represents. This is needed to provide context to the LLM.

      • Metadata: Additional information about the chunk, such as the source document ID, page number, section headers, URLs, access control information, etc. This metadata can be useful for filtering searches or providing citations in the final response.

    • Vector Search Implementation: Azure Cosmos DB offers integrated vector search capabilities (currently in public preview for specific APIs):

      • Azure Cosmos DB for MongoDB (vCore): Allows storing, indexing (using algorithms like HNSW - Hierarchical Navigable Small World), and querying vector data alongside your MongoDB documents.

      • Azure Cosmos DB for PostgreSQL: Leverages the pgvector extension for storing, indexing, and querying vectors within a PostgreSQL database.

    • How it's used in RAG (a query sketch follows this item):

      1. The RAG Application Service sends the query vector (generated from the user's query).

      2. The application executes a vector search query against Cosmos DB (e.g., using the specific API's query language/SDK features for vector search). This query asks Cosmos DB to find the 'k' items whose stored vectors are most similar (e.g., using cosine similarity or Euclidean distance) to the provided query vector.

      3. Cosmos DB utilizes its vector index for efficient searching (Approximate Nearest Neighbor - ANN search) and returns the matching documents/items, including the Original Text Chunk and Metadata.
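
As one example for the MongoDB vCore API, the vector query can be expressed with the cosmosSearch aggregation stage through pymongo. The connection string variable, database, collection, and field names below are assumptions (matching the ingestion sketch under component 5), and the exact stage options should be checked against the current Cosmos DB documentation for your chosen API.

import os
from pymongo import MongoClient  # pip install pymongo

# Assumed collection "ragdb.chunks" with documents {text, embedding, metadata}
# and a vector index on "embedding".
collection = MongoClient(os.environ["COSMOS_MONGO_CONNECTION_STRING"])["ragdb"]["chunks"]

def vector_search(query_vector: list[float], top_k: int = 5) -> list[dict]:
    # ANN search over the vector index; returns the stored chunk text and
    # metadata together with a similarity score.
    pipeline = [
        {"$search": {
            "cosmosSearch": {"vector": query_vector, "path": "embedding", "k": top_k},
            "returnStoredSource": True,
        }},
        {"$project": {"_id": 0, "text": 1, "metadata": 1,
                      "score": {"$meta": "searchScore"}}},
    ]
    return list(collection.aggregate(pipeline))

The returned items already carry the original text chunk and metadata, which the application service folds into the augmented prompt (and can use for citations).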

  5. Data Ingestion & Indexing Pipeline (Often Offline/Asynchronous):

    • Role: Prepares the knowledge base data and populates Cosmos DB. This is typically a separate process that runs periodically or whenever the source data changes.

    • Platform: Can be implemented using various Azure services like Azure Functions, Azure Logic Apps, Azure Data Factory, Azure Databricks, or even a simple script running on a VM or container.

    • Steps (an ingestion sketch follows this item):

      • Load Data: Read data from source knowledge bases (e.g., documents in Blob Storage, SharePoint, web pages, SQL databases).

      • Parse & Clean: Extract relevant text content and clean it (remove irrelevant formatting, etc.).

      • Chunk Data: Break down large documents into smaller, meaningful chunks (e.g., paragraphs, sections). Chunking strategy significantly impacts retrieval quality.

      • Generate Embeddings: For each text chunk, call the Azure OpenAI Embedding endpoint to get its vector representation.

      • Store in Cosmos DB: Write the text chunk, its corresponding vector embedding, and any relevant metadata as a single item/document into the configured Azure Cosmos DB collection/table.

      • Build/Update Index: Ensure Cosmos DB's vector index is updated to include the new vectors.
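
A compact ingestion sketch, assuming the MongoDB vCore API with pymongo and the openai package, is shown below. The fixed-size chunking is a deliberately naive placeholder for a real chunking strategy, and the connection string, database, collection, and field names are illustrative assumptions.

import os
from openai import AzureOpenAI  # pip install openai
from pymongo import MongoClient  # pip install pymongo

openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)
collection = MongoClient(os.environ["COSMOS_MONGO_CONNECTION_STRING"])["ragdb"]["chunks"]

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Naive fixed-size chunks with overlap; splitting on paragraphs or sections
    # usually retrieves better, and the strategy strongly affects quality.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def ingest_document(doc_id: str, text: str, source: str) -> None:
    items = []
    for n, chunk in enumerate(chunk_text(text)):
        embedding = openai_client.embeddings.create(
            model="text-embedding-ada-002", input=chunk
        ).data[0].embedding
        items.append({
            "text": chunk,           # original text chunk
            "embedding": embedding,  # vector representation of the chunk
            "metadata": {"doc_id": doc_id, "chunk": n, "source": source},
        })
    if items:
        collection.insert_many(items)
    # A vector index on "embedding" (e.g., HNSW) must exist so that queries can
    # use ANN search; it is created once via the API-specific index options.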

How Cosmos DB Vector Profiles Enable the RAG Service:

  1. Co-location: Cosmos DB allows you to store the vector embedding directly alongside the original text chunk and its metadata within the same database record/document. This avoids needing separate lookups (first finding a vector ID, then looking up the text elsewhere).

  2. Efficient Retrieval: The integrated vector indexing and search capabilities (like HNSW in MongoDB vCore or pgvector in PostgreSQL) allow Cosmos DB to perform fast Approximate Nearest Neighbor (ANN) searches, quickly finding the most relevant text chunks based on the semantic meaning captured in the query vector.

  3. Scalability & Reliability: Cosmos DB is a globally distributed, highly available, and scalable database service, ensuring your knowledge base can grow and handle query load reliably.

  4. Filtering with Metadata: You can often combine vector search with traditional metadata filtering. For example, retrieve chunks relevant to the query vector but only from documents tagged with a specific category or access level stored in the metadata (see the filtered-query sketch after this list).

  5. Unified Data Platform: Storing vectors, text, and metadata together simplifies the data management aspect of the RAG system.
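
With the PostgreSQL API and pgvector, combining a metadata filter with vector ranking is ordinary SQL, as sketched below. The table layout (a chunks table with text, embedding, and category columns) and the connection string variable are assumptions that mirror the vector profile described above.

import os
import psycopg2  # pip install psycopg2-binary

def filtered_search(query_vector: list[float], category: str, top_k: int = 5) -> list[tuple]:
    # Assumed table: chunks(text TEXT, embedding vector(1536), category TEXT).
    # "<=>" is pgvector's cosine-distance operator; the WHERE clause narrows the
    # candidates by metadata before ranking by vector similarity.
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    with psycopg2.connect(os.environ["COSMOS_PG_CONNECTION_STRING"]) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT text, category
                FROM chunks
                WHERE category = %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s
                """,
                (category, vector_literal, top_k),
            )
            return cur.fetchall()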

Workflow Summary (Runtime):

  1. User submits a query via the UI.

  2. UI sends the query to the RAG Application Service on Azure Foundry.

  3. App Service gets a vector embedding for the query from Azure OpenAI.

  4. App Service queries Cosmos DB's vector search with the query vector.

  5. Cosmos DB finds the most similar vector profiles (vector + text + metadata) and returns the corresponding text chunks.

  6. App Service constructs an augmented prompt (original query + retrieved text chunks).

  7. App Service sends the augmented prompt to Azure OpenAI's generative model.

  8. Azure OpenAI generates a response based on the augmented prompt.

  9. App Service sends the final response back to the UI.

  10. UI displays the response to the user.

Conclusion:

This architecture provides a robust way to leverage the power of Azure OpenAI LLMs while grounding their responses in your specific, up-to-date knowledge base, stored and efficiently queried using Azure Cosmos DB's vector capabilities, with the whole flow orchestrated by an application hosted on Azure Foundry. Choose the Cosmos DB API (MongoDB vCore or PostgreSQL) that best fits your existing ecosystem or preferences, keeping in mind that the vector search feature is currently in preview.
