RAG in Delphi: Ground an LLM in Your Own Documents


Quick answer: Retrieval-Augmented Generation (RAG) is two ideas glued together. First, you embed your documents once and store the resulting vectors. Then, on every question, you embed the question, retrieve the closest chunks by meaning, and hand them to the LLM as context so the answer comes from your data instead of the model's training memory. In sgcWebSockets the whole pipeline is three components: TsgcAIOpenAIEmbeddings to turn text into vectors, TsgcAIDatabaseVectorFile or TsgcAIDatabaseVectorPinecone to store and search them, and TsgcHTTP_API_OpenAI to write the final grounded answer.

A general-purpose chat model knows a lot about the world and nothing about your product manual, your support tickets, or last quarter's internal report. Ask it about those and it will either refuse or, worse, invent something plausible. RAG fixes that without retraining anything: you keep the model as-is and feed it the right passages from your own corpus at question time. Below is the full loop in Delphi, end to end, with real component names.

What RAG actually does

An embedding is a list of numbers that captures the meaning of a piece of text. Two passages about the same topic land close together in that numeric space, even when they share no keywords. A vector database stores those numbers and, given a query vector, returns the nearest entries ranked by similarity. RAG strings these together:

Stage | What happens | sgcWebSockets piece
1. Ingest (once) | Split documents into chunks, embed each chunk, store the vectors | TsgcAIOpenAIEmbeddings.CreateEmbeddingsFromFile
2. Store | Keep vectors in a local file or a cloud index | TsgcAIDatabaseVectorFile · TsgcAIDatabaseVectorPinecone
3. Retrieve (per question) | Embed the question, find the closest chunks | GetEmbedding · QueryData
4. Answer | Put the chunks in the prompt, ask the model | TsgcHTTP_API_OpenAI._CreateChatCompletion

Stages 1 and 2 run when your data changes. Stages 3 and 4 run on every user question. The steps below build each one, with retrieval and answering combined in Step 3.
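Before wiring up the components, it may help to see what "nearest by similarity" means in code. The sketch below uses cosine similarity, one common way to compare embeddings. Whether the sgcWebSockets vector stores use this exact measure internally is an assumption, and TVector and CosineSimilarity are hypothetical names used only for illustration.

```pascal
program CosineDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TVector = array of Double;

// Cosine similarity: dot(A, B) / (|A| * |B|).
// Returns 0 when the vectors are empty, differ in length,
// or either one has zero magnitude.
function CosineSimilarity(const A, B: TVector): Double;
var
  I: Integer;
  Dot, NormA, NormB: Double;
begin
  Result := 0;
  if (Length(A) = 0) or (Length(A) <> Length(B)) then
    Exit;
  Dot := 0;
  NormA := 0;
  NormB := 0;
  for I := 0 to High(A) do
  begin
    Dot := Dot + A[I] * B[I];
    NormA := NormA + A[I] * A[I];
    NormB := NormB + B[I] * B[I];
  end;
  if (NormA > 0) and (NormB > 0) then
    Result := Dot / (Sqrt(NormA) * Sqrt(NormB));
end;

var
  Query, ChunkA, ChunkB: TVector;
begin
  // Toy 3-dimensional vectors; real embeddings have many more dimensions.
  Query  := TVector.Create(1.0, 0.0, 1.0);
  ChunkA := TVector.Create(0.9, 0.1, 0.8);  // points roughly the same way
  ChunkB := TVector.Create(0.0, 1.0, 0.0);  // orthogonal to the query
  Writeln(Format('Query vs A: %.3f', [CosineSimilarity(Query, ChunkA)]));
  Writeln(Format('Query vs B: %.3f', [CosineSimilarity(Query, ChunkB)]));
end.
```

ChunkA scores close to 1 and ChunkB scores 0, so a store ranking by this measure would return ChunkA first.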

Step 1 — embed your documents

Create a TsgcAIOpenAIEmbeddings, give it an OpenAI key, point its Database property at a vector store, and call CreateEmbeddingsFromFile. That one call reads the file, splits it into chunks (controlled by EmbeddingsOptions.ChunkSize), embeds each chunk, and writes the vectors into the store via the BeginAddData / AddData / EndAddData sequence for you.

uses
  sgcAI, sgcAI_OpenAI_Embeddings,
  sgcAI_DB_Vector, sgcAI_DB_Vector_File, sgcAI_DB_Vector_Pinecone;

var
  Embeddings: TsgcAIOpenAIEmbeddings;
  DBFile: TsgcAIDatabaseVectorFile;
begin
  Embeddings := TsgcAIOpenAIEmbeddings.Create(nil);
  Embeddings.OpenAIOptions.ApiKey := 'sk-...';

  // local, file-based vector store
  DBFile := TsgcAIDatabaseVectorFile.Create(nil);
  DBFile.VectorFileOptions.InputFilename  := 'corpus.sgcif';
  DBFile.VectorFileOptions.VectorFilename := 'corpus.sgcvf';

  Embeddings.Database := DBFile;
  Embeddings.CreateEmbeddingsFromFile('docs.txt');
end;

That is the entire ingest step. Run it once, or whenever your documents change. The default embedding model is text-embedding-3-small; change it through EmbeddingsOptions.Model if you need a different one. There is more detail on the Embeddings component page.

Step 2 — choose where the vectors live

Both backends descend from the same base component, TsgcAIDatabaseVector, so they are interchangeable: swap one for the other and your ingest and query code does not change. The only difference is where the data sits.

For a desktop app, an offline tool, or a smaller corpus, TsgcAIDatabaseVectorFile keeps everything in a local file with no external service. When the index is large, must be shared across processes or users, or has to scale beyond one machine, switch to TsgcAIDatabaseVectorPinecone, which upserts every chunk through the managed Pinecone REST API:

var
  DBPinecone: TsgcAIDatabaseVectorPinecone;
begin
  DBPinecone := TsgcAIDatabaseVectorPinecone.Create(nil);
  DBPinecone.PineconeOptions.ApiKey         := 'pc-...';
  DBPinecone.PineconeIndexOptions.IndexName := 'sgc-embeddings';

  // Embeddings is the TsgcAIOpenAIEmbeddings instance created in Step 1
  Embeddings.Database := DBPinecone;
  Embeddings.CreateEmbeddingsFromFile('docs.txt');
end;

Notice that the ingest line is identical to Step 1. That is the whole point of the shared base class. See the Vector Databases page for the file backend and the Pinecone page for the cloud one.
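One way to make that interchangeability explicit is to write the ingest code against the base class only. The fragment below is a sketch, not library code: IngestInto is a hypothetical helper, and it assumes that the Database property is typed as TsgcAIDatabaseVector and that the base class is declared in the sgcAI_DB_Vector unit. It needs sgcWebSockets installed to compile.

```pascal
uses
  sgcAI_OpenAI_Embeddings, sgcAI_DB_Vector;

// Hypothetical helper: ingests a text file into whichever vector store
// is passed in. It only knows about the shared base class, so the file
// backend and the Pinecone backend are handled by the same two lines.
// Assumption: Database accepts any TsgcAIDatabaseVector descendant.
procedure IngestInto(aEmbeddings: TsgcAIOpenAIEmbeddings;
  aDatabase: TsgcAIDatabaseVector; const aFileName: string);
begin
  aEmbeddings.Database := aDatabase;
  aEmbeddings.CreateEmbeddingsFromFile(aFileName);
end;
```

With this in place, IngestInto(Embeddings, DBFile, 'docs.txt') and IngestInto(Embeddings, DBPinecone, 'docs.txt') differ only in the argument.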

Step 3 — retrieve and answer

Now the per-question path. Embed the user's question and find the closest stored chunks in one call: GetEmbedding embeds the text and runs it through the database's QueryData, returning the most relevant passages from your corpus. Those passages are your context. Concatenate them with the question and send the whole thing to the chat model:

var
  Question, Context, Prompt, Answer: string;
  OpenAI: TsgcHTTP_API_OpenAI;
begin
  Question := 'How do I enable the WatchDog reconnect?';

  // retrieve the closest chunks from your own data
  Context := Embeddings.GetEmbedding(Question, '');

  // build a grounded prompt
  Prompt :=
    'Answer the question using only the context below.' + sLineBreak +
    'If the context does not contain the answer, say you do not know.' +
    sLineBreak + sLineBreak +
    'Context:' + sLineBreak + Context + sLineBreak + sLineBreak +
    'Question: ' + Question;

  // ask the model
  OpenAI := TsgcHTTP_API_OpenAI.Create(nil);
  OpenAI.OpenAIOptions.ApiKey := 'sk-...';
  Answer := OpenAI._CreateChatCompletion('gpt-4o-mini', Prompt);

  Memo1.Lines.Text := Answer;
end;

That is RAG. The model never saw your documents during training, yet it answers from them, because you put the relevant passages in front of it at request time. Change the corpus and the answers change with it, no fine-tuning involved. The instruction to say it does not know when the context lacks the answer is what keeps the model honest instead of guessing.
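The prompt instruction covers context that lacks the answer, but not retrieval that returns nothing at all. A minimal sketch of one way to handle that is to build the prompt in a helper that reports whether there is any context to send. TryBuildGroundedPrompt is a hypothetical name, not part of sgcWebSockets, and the context string here is a placeholder for whatever GetEmbedding returns.

```pascal
program GroundedPromptDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper, not part of sgcWebSockets.
// Returns False (and an empty prompt) when retrieval produced no usable
// context, so the caller can skip the chat call altogether.
function TryBuildGroundedPrompt(const aContext, aQuestion: string;
  out aPrompt: string): Boolean;
begin
  aPrompt := '';
  Result := Trim(aContext) <> '';
  if not Result then
    Exit;
  aPrompt :=
    'Answer the question using only the context below.' + sLineBreak +
    'If the context does not contain the answer, say you do not know.' +
    sLineBreak + sLineBreak +
    'Context:' + sLineBreak + Trim(aContext) + sLineBreak + sLineBreak +
    'Question: ' + aQuestion;
end;

var
  Prompt: string;
begin
  // Placeholder context; in the article this string comes from GetEmbedding.
  if TryBuildGroundedPrompt('(retrieved passages go here)',
    'How do I enable the WatchDog reconnect?', Prompt) then
    Writeln(Prompt)
  else
    Writeln('No relevant passages found.');
end.
```

Skipping the request when the context is blank saves a round trip and avoids relying on the model to notice an empty Context section.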

Local or cloud, same code

One detail worth repeating: the choice between the file store and Pinecone is reversible. Because TsgcAIDatabaseVectorFile and TsgcAIDatabaseVectorPinecone share the TsgcAIDatabaseVector base, you can prototype against the local file (zero infrastructure, runs offline) and move to Pinecone later by swapping the component you assign to Embeddings.Database. Nothing in your ingest or query code changes. The same is true for the LLM at the end: _CreateChatCompletion on TsgcHTTP_API_OpenAI can be swapped for the Anthropic or Gemini component if you prefer a different model to write the final answer.

A note on chunking and quality

Retrieval quality depends on how your documents are split. Smaller chunks make matches more precise but can lose context; larger chunks keep context but dilute the match. EmbeddingsOptions.ChunkSize controls this for CreateEmbeddingsFromFile, so it is worth tuning to your material. For finer control you can also embed individual strings with CreateEmbeddings and shape the chunks yourself before ingesting.
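If you do shape the chunks yourself, a fixed-size splitter with a small overlap is a common starting point. The sketch below is one possible approach under stated assumptions: SplitIntoChunks is a hypothetical helper, sizes are measured in characters (the unit that EmbeddingsOptions.ChunkSize uses is not stated here), and the resulting strings would still need to be passed to the embeddings component one by one.

```pascal
program ChunkDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: splits aText into chunks of at most aChunkSize
// characters. Neighbouring chunks share aOverlap characters so that a
// sentence cut at a boundary still appears whole in one of them.
// An invalid overlap (negative, or not smaller than the chunk size)
// is treated as no overlap.
procedure SplitIntoChunks(const aText: string; aChunkSize, aOverlap: Integer;
  aChunks: TStrings);
var
  Start, Step: Integer;
begin
  aChunks.Clear;
  if (aChunkSize <= 0) or (aText = '') then
    Exit;
  if (aOverlap < 0) or (aOverlap >= aChunkSize) then
    aOverlap := 0;
  Step := aChunkSize - aOverlap;
  Start := 1;
  while Start <= Length(aText) do
  begin
    aChunks.Add(Copy(aText, Start, aChunkSize));
    // Stop once the current chunk has reached the end of the text.
    if Start + aChunkSize > Length(aText) then
      Break;
    Inc(Start, Step);
  end;
end;

var
  Chunks: TStringList;
  I: Integer;
begin
  Chunks := TStringList.Create;
  try
    SplitIntoChunks('abcdefghij', 4, 1, Chunks);
    for I := 0 to Chunks.Count - 1 do
      Writeln(Chunks[I]);  // prints abcd, defg, ghij on separate lines
  finally
    Chunks.Free;
  end;
end.
```

Real documents usually split better on paragraph or sentence boundaries than on raw character counts, so treat the fixed-size version as a baseline to compare against.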

Getting started

All three components ship in sgcWebSockets. Grab the free trial, drop the embeddings and vector-store components on a form, point them at a text file, and you will have a working RAG loop in well under a hundred lines. Browse the full set of AI building blocks on the AI & LLM components hub.

Questions about applying this to your own corpus? Get in touch — you will get a reply from the people who wrote the code.