Talk to your data!

RAG with MongoDB, VoyageAI, Python, and OpenAI

Arturo Nereu - MongoDB

@ArturoNereu

Session Goals

Learn to improve AI responses using Retrieval Augmented Generation (RAG)
Implement a full RAG stack with MongoDB, VoyageAI, Python, and OpenAI
Have fun!

DEMO

💬 RAG Chat Interface

RETRIEVAL

AUGMENTED

GENERATION

RAG

Enhances LLMs' knowledge by providing up-to-date or domain-specific expertise that wasn't in their original training data.

INGEST DATA

STORE DATA

CHUNK DATA

GENERATE EMBEDDINGS

PERFORM SEMANTIC SEARCH

PROVIDE A RESPONSE
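
One rough way to picture how these steps connect at query time is the toy sketch below. The function names (`retrieve`, `augment`, `generate`), the two sample strings, and the keyword-overlap scoring are all placeholders invented for illustration; in the stack built in this session they correspond to MongoDB vector search, a prompt template, and an OpenAI call.

```python
# Toy RAG flow. Every function here is a hypothetical stand-in, not the talk's code.

DOCS = [
    "AI Engineering covers building applications with foundation models.",
    "Deep Learning - A Visual Approach builds intuition with illustrations.",
]

def retrieve(query, docs, top_k=1):
    """RETRIEVAL: rank docs by naive word overlap (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def augment(query, context_docs):
    """AUGMENTED: place the retrieved text into the prompt."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """GENERATION: placeholder for the LLM call."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

question = "Which book builds intuition with illustrations?"
prompt = augment(question, retrieve(question, DOCS))
answer = generate(prompt)
```

The point of the sketch is the order of operations: the model only sees the documents that retrieval placed into the prompt.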

Ingest Data

Our Data Source

https://github.com/ArturoNereu/AI-Study-Group

MongoDB Collection Schema

{
  "books": [
    {
      "_id": "ObjectId",
      "title": "string",        // Book title
      "author": "string",       // Author name
      "review": "string",       // Personal review/summary
      "link": "string"          // Purchase link
    }
  ]
}

Sample Book Document

{
  "_id": "67f4a74759d7b45f2e180317",
  "title": "AI Engineering: Building Applications with Foundation Models",
  "author": "Chip Huyen",
  "review": "If you feel lost and don't know where to start, ...",
  "link": "https://www.oreilly.com/library/view/ai-engineering/9781098166298/"
}
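
MongoDB does not enforce this shape unless you configure validation, so a small check before inserting can catch incomplete records. The sketch below is one possible way to do that; `REQUIRED_FIELDS` and `missing_fields` are illustrative names, not part of the talk's code.

```python
# Hypothetical pre-insert check for the book schema shown above.
REQUIRED_FIELDS = ("title", "author", "review", "link")

def missing_fields(book):
    """Return the required fields that are absent or not non-empty strings."""
    return [
        field for field in REQUIRED_FIELDS
        if not isinstance(book.get(field), str) or not book[field].strip()
    ]

sample = {
    "title": "AI Engineering: Building Applications with Foundation Models",
    "author": "Chip Huyen",
    "review": "If you feel lost and don't know where to start, ...",
    "link": "https://www.oreilly.com/library/view/ai-engineering/9781098166298/",
}

# A made-up incomplete record, to show what the check reports.
incomplete = {"title": "Untitled", "author": ""}
```

Calling `missing_fields(sample)` returns an empty list, while `missing_fields(incomplete)` lists the fields that still need values.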

Store Data

Storing our Data in MongoDB

> pip3 install pymongo

Why MongoDB?

Document Model
Vector Search
Driver for Python

Storing our Data in MongoDB

import json
from pymongo import MongoClient
import os

def store_data():
    # MongoDB connection (use environment variable for security)
    connection_string = os.getenv('MONGODB_CONNECTION_STRING')
    client = MongoClient(connection_string)
    db = client['books_db']
    collection = db['ai_books']
    
    # Load books from JSON
    with open('books.json', 'r') as f:
        books = json.load(f)
    
    # Clear existing data and insert new books
    collection.delete_many({})
    result = collection.insert_many(books)
    
    print(f"✅ Successfully stored {len(result.inserted_ids)} books")
    client.close()

Our "book" document

{
  "title": "Deep Learning - A Visual Approach",
  "author": "Andrew Glassner",
  "review": "Probably the best resource out there for building solid intuition about the many
    concepts surrounding deep learning. Andrew, the author, did a wonderful job illustrating
    these concepts, making it much easier to develop a real understanding of them.",
  "link": "https://www.glassner.com/portfolio/deep-learning-a-visual-approach/"
}

Chunk Data

Sentence-Based Chunking

TL;DR. If you only have limited time to learn Artificial Intelligence, here's what I recommend:

📘 Read this book: AI Engineering: Building Applications with Foundation Models
🎥 Watch this video: Deep Dive into LLMs like ChatGPT
🧠 Follow this course: 🤗 Agents Course
If you want more (and there's a lot more) keep reading.

Why this repo exists. Learning often feels like walking down a road that forks every few meters; you're always exploring, never really arriving. And that's the beauty of it.

When I was working in games, people would ask me: "How do I learn to make games?" My answer was always: "Pick a game, and build it, learn the tools and concepts along the way." I've taken the same approach with AI.

This repository is a collection of the material I've used (and continue to use) to learn AI: books, courses, papers, tools, models, datasets, and notes. It's not a curriculum, it's more like a journal. One that's helped me build, get stuck, and keep going.

Do I know AI? Not really. But I'm learning, building, and having a great time doing it.

I hope something in here is useful to you too. And if you have suggestions or feedback, I'd love to hear it.
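
A minimal sketch of sentence-based chunking, assuming a simple regular expression is good enough: it splits after `.`, `!`, or `?` when followed by whitespace. The function name `sentence_chunks` is made up for this example, and abbreviations such as "e.g." would be split incorrectly, so treat it as an illustration rather than a production splitter.

```python
import re

def sentence_chunks(text):
    """Split text into sentence-level chunks on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = (
    "Do I know AI? Not really. "
    "But I'm learning, building, and having a great time doing it."
)
chunks = sentence_chunks(text)
for number, chunk in enumerate(chunks, 1):
    print(f"{number}: {chunk}")
```

Each chunk keeps a complete thought, which tends to give embeddings a cleaner meaning than arbitrary cuts.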

Fixed-Length Chunking (100 chars)

TL;DR. If you only have limited time to learn Artificial Intelligence, here's what I recommend: 📘 Read this book: AI Engineering: Building Applications with Foundation Models 🎥 Watch this video: Deep Dive into LLMs like ChatGPT 🧠 Follow this course: 🤗 Agents Course If you want more (and there's a lot more) keep reading. Why this repo exists. Learning often feels like walking down a road that forks every few meters; you're always exploring, never really arriving. And that's the beauty of it. When I was working in games, people would ask me: "How do I learn to make games?" My answer was always: "Pick a game, and build it, learn the tools and concepts along the way." I've taken the same approach with AI. This repository is a collection of the material I've used (and continue to use) to learn AI: books, courses, papers, tools, models, datasets, and notes. It's not a curriculum, it's more like a journal. One that's helped me build, get stuck, and keep going. Do I know AI? Not really. But I'm learning, building, and having a great time doing it. I hope something in here is useful to you too. And if you have suggestions or feedback, I'd love to hear it.
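
Fixed-length chunking can be sketched in a few lines. The `size=100` default mirrors the slide title; the `overlap` parameter is an optional extra that the slide does not mention, and `fixed_length_chunks` is a name chosen for this example.

```python
def fixed_length_chunks(text, size=100, overlap=0):
    """Cut text into pieces of `size` characters, optionally overlapping."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

sample = (
    "TL;DR. If you only have limited time to learn Artificial Intelligence, "
    "here's what I recommend: Read this book, watch this video, follow this course."
)
for piece in fixed_length_chunks(sample):
    print(repr(piece))
```

Note how the cuts land mid-word and mid-sentence. That is the trade-off against sentence-based chunking: simpler and predictable in size, but blind to meaning.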

Generate Embeddings

Embeddings

A numerical representation of data: each item becomes a vector of numbers, and similar items end up with similar vectors.

triangle = [0.0, 1.0, -1.0]
square = [1.0, 0.0, -1.0]
circle = [1.0, 1.0, 0.0]
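
To make "similar vectors" concrete, the sketch below compares the three toy vectors with cosine similarity, the same measure used later for the vector index. The helper `cosine_similarity` is written out here for illustration; the toy values are the ones from the slide and carry no real meaning.

```python
import math

triangle = [0.0, 1.0, -1.0]
square = [1.0, 0.0, -1.0]
circle = [1.0, 1.0, 0.0]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(triangle, square), 3))
print(round(cosine_similarity(triangle, circle), 3))
print(round(cosine_similarity(circle, circle), 3))
```

With these toy values every pair of different shapes scores about 0.5, and a vector compared with itself scores 1.0. Real embeddings have hundreds or thousands of dimensions, which is what lets them separate meanings more finely.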

Setup

Embedding Models

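You can use any embedding model.
Use the same model for generating and retrieving; vectors from different models are not comparable.
VoyageAI by MongoDB (used in this session).

> pip3 install voyageai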

Generating our embeddings

import os
import voyageai

voyage_client = voyageai.Client(api_key=os.getenv("VOYAGE_API_KEY"))

# Generate embeddings for each book review
for book in books:
    try:
        # Generate embedding using Voyage AI client
        result = voyage_client.embed(
            texts=[book['review']],
            model='voyage-3'
        )
        
        embedding = result.embeddings[0]
        print(f"Generated embedding (dim: {len(embedding)})")
        
    except Exception as e:
        print(f"Failed to generate embedding: {e}")
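
The loop above makes one API call per book. Since `embed` takes a list of texts, the reviews can also be sent in batches. The sketch below assumes a client object that behaves like `voyageai.Client`; the helper name `embed_in_batches` and the batch size of 64 are arbitrary choices for this example, so check the current API limits before relying on them. `input_type` is an optional parameter of `embed` and is left at `None` here to match the calls on the slides.

```python
def embed_in_batches(client, texts, model="voyage-3", batch_size=64, input_type=None):
    """Embed texts in batches; `client` is expected to behave like voyageai.Client."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # One request per batch instead of one request per text
        result = client.embed(texts=batch, model=model, input_type=input_type)
        embeddings.extend(result.embeddings)
    return embeddings

# Hypothetical usage with the client and books from the slides:
# reviews = [book["review"] for book in books]
# vectors = embed_in_batches(voyage_client, reviews)
```

The returned list is in the same order as the input texts, so `vectors[i]` belongs to `books[i]`.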

Update our MongoDB collection

# Generate an embedding for each book and store it on its document
for i, book in enumerate(books, 1):
    print(f"Processing book {i}/{len(books)}: {book['title']}")
    
    try:
        # Generate embedding using Voyage AI client
        result = voyage_client.embed(
            texts=[book['review']],
            model='voyage-3'
        )
        
        embedding = result.embeddings[0]
        
        # Update the book document with embedding
        collection.update_one(
            {'_id': book['_id']},
            {'$set': {'embedding': embedding}}
        )
        
        print(f"   ✅ Generated embedding (dim: {len(embedding)})")
        
    except Exception as e:
        print(f"   ❌ Failed to generate embedding: {e}")

Perform Semantic Search

Generate Query Embedding

import os
import voyageai

def semantic_search(query, top_k=3):
    # Example query: "I want to learn the very basics of AI"
    
    voyage_api_key = os.getenv('VOYAGE_API_KEY')
    # ...
    
    result = voyage_client.embed(
        texts=[query],
        model='voyage-3'
    )
    
    query_embedding = result.embeddings[0]

Perform Vector Search

def semantic_search(query, top_k=3):
    # ...
    
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Name of your vector search index
                "path": "embedding",      # Field containing the embeddings
                "queryVector": query_embedding,
                "numCandidates": top_k * 20,  # Candidates to consider; must be >= limit
                "limit": top_k            # Number of results to return
            }
        },
        {
            "$project": {
                "title": 1, "author": 1, "review": 1, "link": 1,
                "score": {"$meta": "vectorSearchScore"}
            }
        }
    ]

    results = list(collection.aggregate(pipeline))
    return results
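
To build intuition for what the `$vectorSearch` stage returns, here is a brute-force, in-memory version of the same idea: score every document against the query vector and keep the best `top_k`. This is only a conceptual illustration with made-up names and two-dimensional toy embeddings. The real stage searches an index approximately instead of scanning every document, and the score it reports is normalized, so it may not equal the raw cosine value computed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, documents, top_k=3):
    """Score every document and return the top_k most similar ones."""
    scored = [
        {"title": doc["title"],
         "score": cosine_similarity(query_vector, doc["embedding"])}
        for doc in documents
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]

# Toy documents with 2-dimensional embeddings (illustrative values only)
docs = [
    {"title": "A", "embedding": [1.0, 0.0]},
    {"title": "B", "embedding": [0.0, 1.0]},
    {"title": "C", "embedding": [1.0, 1.0]},
]
results = brute_force_search([1.0, 0.1], docs, top_k=2)
for item in results:
    print(item["title"], round(item["score"], 3))
```

The query points almost entirely along the first axis, so "A" ranks first and "C" second, while "B" is dropped.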

The vector_index

{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
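
The index can be created in the Atlas UI, or from Python. The sketch below shows the second route under some assumptions: a recent PyMongo release that provides `SearchIndexModel` with a `type` argument, and a deployment that supports vector search. The helper names are invented for this example. `numDimensions` has to match the length of the vectors stored in the `embedding` field, which is 1024 for the embeddings generated earlier.

```python
def vector_index_definition(num_dimensions=1024, path="embedding", similarity="cosine"):
    """Build the same index definition shown on the slide."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }

def create_vector_index(collection, name="vector_index"):
    """Create the vector search index on a PyMongo collection (assumed setup)."""
    # Imported lazily so the definition helper works without pymongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)

print(vector_index_definition())
```

Index builds are asynchronous, so a search issued right after creation may return nothing until the index is ready.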

Provide a Response

Generate the Response with GPT

> pip3 install openai

The provide_response Function

from openai import OpenAI

def provide_response(query):
    """
    Generate final AI-powered book recommendation using OpenAI
    """
    
    # Check OpenAI API key
    openai_api_key = os.getenv('OPENAI_API_KEY')
    if not openai_api_key:
        print("❌ Please set OPENAI_API_KEY environment variable")
        return
    
    # Initialize OpenAI client
    client = OpenAI(api_key=openai_api_key)
    
    # Step 1: Get search results from vector database (from previous step)
    search_results = get_search_results(query)

The Prompt

def provide_response(query):
    #...
    
    context = ""
    for i, book in enumerate(search_results, 1):
        context += f"{i}. {book['title']} by {book['author']}\n"
        context += f"   Review: {book['review']}\n"
        context += f"   Link: {book['link']}\n\n"
    
    prompt = f"""You are an AI book recommendation assistant specializing in AI and machine learning books.

User Query: {query}

Based on the following relevant books from our database:

{context}

Please provide a helpful recommendation response that:
1. Addresses the user's specific query
2. Recommends the most suitable books from the list above
3. Explains why each book is relevant to their needs
4. Provides a brief summary of what they can expect from each recommendation
5. Suggests a reading order if applicable

Keep your response conversational and helpful."""
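
Search results are not always complete, and long reviews can inflate the prompt. A defensive variant of the context-building loop might look like the sketch below. The function name `build_context`, the 500-character cap, and the fallback strings are assumptions made for this example, not part of the talk's code.

```python
def build_context(search_results, max_review_chars=500):
    """Format search results for the prompt, tolerating missing fields."""
    if not search_results:
        return "No matching books were found."
    lines = []
    for i, book in enumerate(search_results, 1):
        review = book.get("review", "")
        if len(review) > max_review_chars:
            # Trim long reviews so the prompt stays a manageable size
            review = review[:max_review_chars].rstrip() + "..."
        title = book.get("title", "Unknown title")
        author = book.get("author", "Unknown author")
        lines.append(f"{i}. {title} by {author}")
        lines.append(f"   Review: {review}")
        lines.append(f"   Link: {book.get('link', 'N/A')}")
        lines.append("")
    return "\n".join(lines)

example = [{"title": "Deep Learning - A Visual Approach", "author": "Andrew Glassner"}]
print(build_context(example))
```

Using `.get` with a fallback keeps one incomplete document from raising a `KeyError` and failing the whole response.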

Getting the LLM Response

def provide_response(query):
    #...
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful AI book recommendation assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=800,
            temperature=0.7
        )
        
        ai_response = response.choices[0].message.content
        
        return ai_response
        
    except Exception as e:
        print(f"❌ Error generating AI response: {e}")
        return None

Before you go!

Is RAG dead?
What does optimization look like for you?
Data is not always structured and complete.
MongoDB GenAI Showcase:
- https://github.com/mongodb-developer/GenAI-Showcase
AI Study Group repository:
- https://github.com/ArturoNereu/AI-Study-Group

Thank you!

Arturo Nereu - MongoDB

@ArturoNereu