A knowledge base outside Optimizely One
As more teams put Opal agents in front of real users, a recurring question surfaces: what do you do with the institutional knowledge that isn't in your CMS? The technical docs, the compliance reports, the internal research that informs decisions but has no business living in a content management system.
Optimizely Opal uses Optimizely Graph as its built-in RAG (Retrieval-Augmented Generation) layer. At its core, RAG is a context engineering technique: language models have finite context windows, and rather than loading all your documents into every prompt, you retrieve only the passages relevant to the current query and inject just those into the model's context. Better retrieval means better answers. It also means lower token costs, because the model isn't processing irrelevant material on every request. Graph handles this for your CMS content, indexing it and making it searchable by Opal agents. For many teams, that's enough.
But there's a scenario Graph doesn't cover: you have documents you don't want to put into Optimizely One at all. Not because of a technical limitation, but by design. Internal knowledge bases, technical whitepapers, compliance documentation, research reports. Content that has no business living in a CMS, and that you'd rather not push into a Graph index tied to a specific Optimizely environment.
What you want instead is a separate RAG layer, one that your Opal agents can query as a tool, but that also remains accessible to other AI systems independently of Optimizely. A knowledge service that isn't coupled to any single platform.
The pattern: build your own RAG, expose it as an Opal tool
The good news is that Opal's tool system gives you an extension point. You can build a standalone RAG service (completely outside Optimizely One) and register it as a custom tool. Once registered, Opal agents can query it just like they query Graph: by asking a natural language question and getting back relevant content.
The solution has two independent parts:
- A RAG service: a pipeline that ingests your documents, indexes them for semantic and keyword search, and exposes a search API
- An Opal tool wrapper: a thin integration layer that bridges Opal's tool protocol to your search API
These parts are deliberately separate. The RAG service is general-purpose infrastructure; it doesn't need to know anything about Opal. The tool wrapper is a thin adapter: it handles the Opal protocol and delegates the actual search logic to the service.
Designing the RAG service
A RAG pipeline has three stages: ingestion, indexing, and retrieval. Each has design decisions worth thinking through.
Ingestion: getting documents into a consistent format
Documents come in many formats: PDF, Word, PowerPoint, HTML, plain text. Before you can index them, you need a canonical representation. Markdown works well for this: it preserves structure (headings, lists, tables) while being straightforward to parse and chunk. Tools like Docling can convert most office formats to clean markdown automatically.
The ingestion step is usually a one-off or scheduled batch job, not part of the hot path. Think of it as a preprocessing pipeline that runs whenever your document corpus changes.
A practical setup is to store the converted markdown files in a version-controlled repository. Your indexer watches that repository and re-indexes only files that have changed since the last run. This keeps ingestion lightweight and auditable: you can always see what version of a document was indexed and when.
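A minimal sketch of that change detection, assuming the converted markdown sits in a local checkout and the hashes from the previous run have been loaded from storage (repoRoot, previousHashes, and ReindexDocumentAsync are illustrative names, not part of any particular library):

```csharp
using System.Security.Cryptography;
using System.Text;

// Incremental re-index loop: only files whose content hash changed since the
// last run go through chunking and embedding again.
foreach (var path in Directory.EnumerateFiles(repoRoot, "*.md", SearchOption.AllDirectories))
{
    var content = await File.ReadAllTextAsync(path);
    var hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(content)));

    // previousHashes: path -> hash recorded at the last indexing run
    if (previousHashes.TryGetValue(path, out var storedHash) && storedHash == hash)
        continue; // unchanged since last run, skip

    await ReindexDocumentAsync(path, content, hash); // re-chunk, re-embed, upsert
}
```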
Indexing: chunks, embeddings, and storage
Raw documents are too large to pass directly to a language model. A single whitepaper might run to tens of thousands of tokens, and passing the whole thing for every query would flood the context window with irrelevant material. The model's attention would dilute across content that has nothing to do with the question, and your token costs would climb accordingly. Chunking solves this: you break documents into smaller passages so that retrieval can be selective, putting only what's relevant into the model's context.
The chunking strategy matters more than it might seem. Fixed-size chunking (split every N characters or tokens) is simple but often cuts across logical boundaries, splitting a paragraph mid-sentence or separating a heading from the content it introduces. A better approach is structure-aware chunking: split on heading boundaries, so each chunk corresponds to a coherent section of the document. When a section exceeds your target token budget, fall back to paragraph-level splitting within that section.
One underrated technique: prefix each chunk with its heading hierarchy. A chunk extracted from a section three levels deep might be opaque on its own. But if you prepend something like Product Overview > Technical Specifications > Authentication, the chunk becomes self-contained. An LLM reading it has immediate context about where in the document structure this information lives.
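A simplified sketch of both ideas, splitting on markdown heading boundaries and prefixing each chunk with the heading path; a real implementation would add the paragraph-level fallback for oversized sections:

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Structure-aware chunker: one chunk per heading section, prefixed with the
// heading hierarchy. Omits the fallback split for sections over the token budget.
static IEnumerable<string> ChunkByHeadings(string markdown)
{
    var headingPath = new List<string>();   // current heading hierarchy
    var section = new StringBuilder();

    foreach (var line in markdown.Split('\n'))
    {
        var match = Regex.Match(line, @"^(#{1,6})\s+(.*)");
        if (match.Success)
        {
            if (section.Length > 0)
                yield return string.Join(" > ", headingPath) + "\n\n" + section.ToString().Trim();

            // truncate the path to the new heading's depth, then append the new heading
            var level = match.Groups[1].Value.Length;
            var keep = Math.Min(level - 1, headingPath.Count);
            headingPath.RemoveRange(keep, headingPath.Count - keep);
            headingPath.Add(match.Groups[2].Value.Trim());
            section.Clear();
        }
        else
        {
            section.AppendLine(line);
        }
    }

    if (section.Length > 0)
        yield return string.Join(" > ", headingPath) + "\n\n" + section.ToString().Trim();
}
```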
Once you have chunks, you need to embed them. Embedding models convert text into high-dimensional vectors that encode semantic meaning. Chunks with similar meaning end up close together in vector space, which is what enables semantic search. Azure OpenAI's text-embedding-3-small is a solid, cost-effective choice with good multilingual coverage.
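Generating the vectors is then one call per chunk (or per batch). A minimal sketch assuming the Azure.AI.OpenAI 2.x client; type and method names can differ between SDK versions, and the endpoint, key, and chunkText values are placeholders:

```csharp
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Embeddings;

var azureClient = new AzureOpenAIClient(
    new Uri("https://your-resource.openai.azure.com/"),   // placeholder endpoint
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!));

EmbeddingClient embeddings = azureClient.GetEmbeddingClient("text-embedding-3-small");

// One embedding per chunk; batch calls are usually cheaper at indexing scale.
var response = await embeddings.GenerateEmbeddingAsync(chunkText);
ReadOnlyMemory<float> vector = response.Value.ToFloats();  // persist alongside the chunk
```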
For storage, PostgreSQL with the pgvector extension is a practical choice that avoids introducing a separate vector database. You get HNSW indexes for approximate nearest-neighbor search, full SQL query flexibility, transactions, and the operational simplicity of a database you likely already know how to run. Dedicated vector databases (Pinecone, Weaviate, Qdrant) are worth evaluating if you're indexing millions of chunks or need advanced filtering, but for most enterprise knowledge base scenarios, pgvector is sufficient.
A minimal schema has two tables: documents (path, hash, full content) and chunks (chunk text, embedding vector, full-text search index, reference to the parent document). The document hash enables incremental re-indexing: compare the current file hash against the stored hash, and only re-chunk and re-embed files that have changed.
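A hedged sketch of that schema as the indexer might create it via Npgsql; table and column names are illustrative, and 1536 matches the default output size of text-embedding-3-small:

```csharp
using Npgsql;

// connection is assumed to be an open NpgsqlConnection; the DDL is the interesting part.
const string createSchemaSql = """
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE IF NOT EXISTS documents (
        id      BIGSERIAL PRIMARY KEY,
        path    TEXT UNIQUE NOT NULL,
        hash    TEXT NOT NULL,
        content TEXT NOT NULL
    );

    CREATE TABLE IF NOT EXISTS chunks (
        id            BIGSERIAL PRIMARY KEY,
        document_id   BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        content       TEXT NOT NULL,
        embedding     vector(1536),
        search_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
    );

    CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);
    CREATE INDEX IF NOT EXISTS idx_chunks_fts ON chunks USING gin (search_vector);
    """;

await using var createSchema = new NpgsqlCommand(createSchemaSql, connection);
await createSchema.ExecuteNonQueryAsync();
```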
Retrieval: why hybrid search beats pure vector search
When a user asks a question, you need to find the most relevant chunks. Two approaches are common:
Vector (semantic) search: embed the query, find chunks with the most similar embedding. This works well for conceptual questions and paraphrased queries where the exact words don't match.
Full-text search (BM25/keyword): traditional inverted index search. This works well for specific terms, product names, error codes, and other cases where the exact words matter.
Neither approach is universally superior. Vector search misses exact-match queries where vocabulary matters. Keyword search misses semantic similarity. The solution is to run both and fuse the results, a technique called Reciprocal Rank Fusion (RRF).
RRF is straightforward: rank each result set independently, then give every chunk a score of 1/(k + rank) for each list it appears in and sum those scores, so appearing high in either list is rewarded. The constant k (commonly 60) dampens the advantage of the very top positions: a larger k flattens the difference between rank 1 and rank 10. The combined ranking tends to be more robust than either approach alone.
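In code, the fusion step is only a few lines. A minimal in-memory version, assuming each search path has already produced an ordered list of chunk ids:

```csharp
// Reciprocal Rank Fusion over two ranked lists of chunk ids.
// Each list contributes 1 / (k + rank) for every chunk it contains; scores are summed.
static IEnumerable<long> FuseWithRrf(IReadOnlyList<long> vectorRanked,
                                     IReadOnlyList<long> keywordRanked,
                                     int k = 60)
{
    var scores = new Dictionary<long, double>();

    void Accumulate(IReadOnlyList<long> ranked)
    {
        for (var rank = 1; rank <= ranked.Count; rank++)
            scores[ranked[rank - 1]] = scores.GetValueOrDefault(ranked[rank - 1]) + 1.0 / (k + rank);
    }

    Accumulate(vectorRanked);
    Accumulate(keywordRanked);

    return scores.OrderByDescending(pair => pair.Value).Select(pair => pair.Key);
}
```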
PostgreSQL makes this convenient. The same database serves both search paths: pgvector handles the vector similarity search, while PostgreSQL's native tsvector and tsquery handle full-text search. A CTE-based SQL query can compute both rankings and fuse them in a single round-trip.
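A sketch of what that query can look like against the illustrative schema above, using pgvector's cosine distance operator; @query and @query_embedding are bound from the user's question and its embedding, and the CTE and column names are assumptions rather than a fixed contract:

```csharp
// Hybrid search in one round trip: vector ranking, full-text ranking, RRF fusion.
const string hybridSearchSql = """
    WITH vector_hits AS (
        SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> @query_embedding) AS rank
        FROM chunks
        ORDER BY embedding <=> @query_embedding
        LIMIT 20
    ),
    text_hits AS (
        SELECT id, ROW_NUMBER() OVER (
                   ORDER BY ts_rank_cd(search_vector, websearch_to_tsquery('english', @query)) DESC) AS rank
        FROM chunks
        WHERE search_vector @@ websearch_to_tsquery('english', @query)
        LIMIT 20
    )
    SELECT c.id, c.content, d.path,
           COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + t.rank), 0) AS rrf_score
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    LEFT JOIN vector_hits v ON v.id = c.id
    LEFT JOIN text_hits t ON t.id = c.id
    WHERE v.id IS NOT NULL OR t.id IS NOT NULL
    ORDER BY rrf_score DESC
    LIMIT 10;
    """;
```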
Wiring it to Opal
With the RAG service running, plugging it into Opal is straightforward using the Opal Tools SDK. If you haven't worked with Opal tools before, Building AI-Powered Tools with Optimizely Opal covers the SDK setup, tool registration, and authorization from scratch.
The integration needs two tools:
- search: takes a natural language query, runs it against the RAG service, returns ranked chunks with their source path and content
- get-document: takes a document path, returns the full markdown content of that document
The search tool is what Opal calls first to understand what's in the knowledge base. The get-document tool lets it retrieve broader context when chunks alone are not enough, for example when summarizing a full whitepaper rather than answering a specific question.
In the Opal SDK, tools are C# methods decorated with attributes:
```csharp
[OpalTool(Name = "search")]
[Description("Search the knowledge base for relevant content. Returns ranked chunks with source path and content.")]
public async Task<object> Search(SearchParameters parameters)
{
    // embed the query, call the search service, return results
}

[OpalTool(Name = "get-document")]
[Description("Retrieve the full content of a document by path. Use when you need broader context than chunks provide.")]
public async Task<object> GetDocument(GetDocumentParameters parameters)
{
    // fetch full document from the service
}
```

The description on each tool is the signal Opal uses to decide when to invoke it. Write descriptions that explain when the tool is appropriate, not just what it does. The reasoning agent reads these descriptions when deciding which tools to call for a given query.
Register the tool in Program.cs alongside your other services:
```csharp
builder.Services.AddOpalToolService();
builder.Services.AddOpalTool<KnowledgeBaseOpalTools>();
// ...
app.MapOpalTools();
```

Then go to Opal configuration in Optimizely One and add your service's base URL as a custom tool source. Opal calls the discovery endpoint automatically, reads the tool definitions, and from that point the tools are available to any agent configured to use them. That's the extent of the integration.
Trade-offs and honest limitations
This approach solves the "data outside Optimizely One" problem, but it introduces new responsibilities.
You own the infrastructure. Graph is managed for you. A custom RAG service is yours to deploy, monitor, and maintain. That means a PostgreSQL instance, an Azure OpenAI account for embeddings, container hosting for the service, and a CI/CD pipeline to trigger re-indexing when documents change. The operational overhead is real.
Search quality depends on your chunking and prompting. RRF is not magic. Results are only as good as your chunks. If documents are poorly structured, chunks will be incoherent. If tool descriptions are vague, Opal won't know when to use the search tool versus answering from its general knowledge. Expect to iterate.
There's no editorial workflow. Optimizely Graph stays in sync with CMS content automatically. Your custom RAG service stays in sync only because you built an indexer that runs on document changes. If someone updates a document and forgets to push it, the knowledge base goes stale. This is solvable with automation, but it's a design responsibility you take on.
You need to secure the endpoint. Your RAG service is exposing potentially sensitive documents: compliance records, internal policies, research that hasn't been published. The Opal tool wrapper has no built-in auth beyond what you implement. At a minimum, the endpoint should require a bearer token or API key. If different user groups should see different documents, you'll also need to think about tenancy at the search layer.
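As a rough illustration of the minimum gate, here is a shared API-key check as ASP.NET Core middleware in Program.cs; the header name and configuration key are assumptions, and a production setup may prefer proper bearer-token validation:

```csharp
// Illustrative API-key gate in front of the tool endpoints.
app.Use(async (context, next) =>
{
    var expectedKey = app.Configuration["KnowledgeBase:ApiKey"];
    var providedKey = context.Request.Headers["X-Api-Key"].FirstOrDefault();

    if (string.IsNullOrEmpty(expectedKey) || providedKey != expectedKey)
    {
        context.Response.StatusCode = StatusCodes.Status401Unauthorized;
        return;
    }

    await next(context);
});
```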
Latency adds up. Every query that hits your RAG tool requires an embedding call to Azure OpenAI, a database round-trip, and a return trip to Opal. On a warm connection, the embedding call alone takes 100–200ms; the PostgreSQL vector search adds another 20–50ms. Budget 300–500ms of added latency per tool call, and measure it against your UX expectations before committing to the architecture.
Where this can go
The architecture described here is a foundation. A few natural extensions:
Query expansion: before searching, use an LLM to rewrite the user's query into multiple variations. This improves recall for queries that are too terse or use vocabulary that differs from the documents.
Reranking: after RRF returns the top 10 chunks, pass them through a cross-encoder reranking model that scores each chunk against the query more precisely than embedding similarity allows. This can meaningfully improve precision for complex questions.
Multiple corpora: the document-agnostic design means you can run multiple instances of this service against different document sets, and expose each as a separate Opal tool with a clear description of what it covers. Opal's reasoning layer will pick the right one based on the question.
MCP for developer access: the same service can expose an MCP (Model Context Protocol) endpoint alongside the Opal tools. This gives developers in Claude Code or Claude Desktop access to the same knowledge base, without needing to duplicate any infrastructure.
Summary
Optimizely Graph is a solid RAG layer for content that lives in Optimizely One. When you have valuable knowledge that doesn't belong in the CMS (and most organizations do), you can extend Opal's reach by building a custom RAG service and registering it as an Opal tool.
The key architectural decisions are: structure-aware chunking with heading hierarchy prefixes, hybrid search combining vector similarity and full-text search via RRF, and a clean separation between the RAG service and the Opal tool wrapper. None of these decisions are unique to Optimizely; they're general RAG patterns that happen to integrate cleanly with Opal's tool SDK.
The result is an Opal agent that can reason about your internal documents just as naturally as it reasons about your CMS content, without those documents ever needing to enter the CMS.