
Complete Your Local Gemma 4 With RAG Memory

The RAG Trial and Error

Did you successfully follow my guide on how to run Gemma 4 locally through OpenClaw? Then you probably also want to make sure your setup has an effective long-term memory. In this guide I will show you, step by step, how to set up and test your RAG.

Ever since I first started using Gemini to boost my productivity, about a year and a half ago, the lack of a dynamic memory has been an issue. Having the smartest assistant in the world is kind of pointless if it can’t remember what we did yesterday. Starting every session by updating your assistant on what has been done is tedious at first, and as your history together grows it becomes near impossible.

For the longest time I manually asked Gemini to print a summary of each session, which I saved in an external document and uploaded at the start of every session. Eventually that logfile grew to over 400,000 characters, and making Gemini read through it all and keep it in its memory during a session greatly degraded its accuracy.

A temporary solution was to upload the full log as a simple HTML document on a public web service. By using an obfuscated URL and preventing the log from being indexed, it was practically private while still publicly available. Gemini could access the log and retrieve information when it needed to, but it still more or less had to go through the full log and kept large parts of it in context, again resulting in degraded accuracy.

Eventually we resorted to creating an external RAG memory, hosted on Hugging Face and accessible through an API – only to find out that Gemini was blocked from accessing that API.

Independent research like this is self-funded. If this guide saved you hours of troubleshooting, consider fueling the lab.

Support the Project

The Epiphany: The Seven-Month “Log-File” Winter

That API block wasn’t just a technical hurdle; it was the start of a seven-month period of “making do.” For over half a year, I was stuck in a cycle of manual drudgery—asking Gemini for summaries, saving them to a massive HTML file, and uploading it session after session.

As the log grew past 400,000 characters, I watched the reasoning accuracy of even the best models slowly degrade under the weight of their own history. I was essentially a human middleware, manually piping memory into a brain that was becoming too full to think straight. I had to split the log into smaller chunks and manually feed the relevant parts to Gemini.

The Final Exorcism

Today’s success wasn’t a lucky guess; it was the result of a final, exhaustive “Technical Exorcism.” I realized that if I wanted a partner that could remember me, I had to stop relying on the cloud’s permission. I needed to own the infrastructure.

I purged the unholy nest of conflicting Python environments that had been haunting my system for months and built a Dual-Brain architecture:

  • The Reasoning Engine: A local instance of Gemma 4 Q5_K_XL running on llama-server.
  • The Librarian: A second, dedicated server running Nomic-Embed-Text-v1.5 specifically to handle the heavy lifting of RAG.

Breaking the Gatekeeper

The final breakthrough came when I stopped looking for a “button” and started looking at the source code. By auditing the OpenClaw index files, I found the hardcoded dimension limits that were causing my local vectors to crash.

Setting up Your Own OpenClaw RAG Memory

In my previous post I showed you how to run Gemma 4 using llama.cpp, and we will set up the RAG in a similar way. Start by downloading the Nomic Embed Text v1.5 GGUF here: Download model

Place it on your system drive, as it’s usually the fastest one, for example in:

c:\AI_Models\

Note: You can also download your Gemma 4 model and place it here, to avoid keeping it in your Hugging Face cache folder. If you decide to download your Gemma model and keep it in the same location as your embedding model, you will start the Gemma server with this command:

llama-server -m "C:\AI_Models\gemma-4-E4B-it-UD-Q5_K_XL.gguf" --port 8080

Make sure to swap the filename in the command above for the actual name of your file.

You are now going to run two llama.cpp servers simultaneously, one for your main Gemma 4 model and one for your RAG model. To start up your second server, use this command in a separate PowerShell terminal with admin permissions:

llama-server -m "C:\AI_Models\nomic-embed-text-v1.5.Q5_K_M.gguf" --port 8081 --embedding -ub 2048 -b 2048

The above command assumes you are using the Q5_K_M.gguf model, and that it’s located in “C:\AI_Models\”. Also notice that the port changes to 8081, while Gemma 4 runs on port 8080 – running two models in separate instances requires separate ports.
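Once the second server is up, OpenClaw talks to it over llama-server’s OpenAI-compatible embeddings endpoint (typically http://127.0.0.1:8081/v1/embeddings). As an illustrative sketch of what comes back, here is a small parser run against a canned payload – the three-entry vector is a stand-in; swap in a real request once your server is running:

```python
import json

def embedding_from_response(raw: str) -> list[float]:
    """Pull the vector out of an OpenAI-style /v1/embeddings response body."""
    data = json.loads(raw)
    return data["data"][0]["embedding"]

# Canned response standing in for what the Nomic server returns;
# a real Nomic-Embed-Text-v1.5 vector has 768 entries, not 3.
raw = json.dumps({"data": [{"embedding": [0.1, 0.2, 0.3]}]})
vec = embedding_from_response(raw)
print(len(vec))  # a real response would print 768
```

That 768 figure is the dimension your vector database table will be locked to, which matters in the configuration step below.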

The --embedding flag is mandatory; it enables the server to transform your text into vector data for the database. Notice the -ub 2048 and -b 2048 flags—these increase the ‘Batch Size’ from the default of 512. This is essential for RAG, as it allows the model to process your long session summaries in a single pass without hitting the internal token limit.
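If a session summary ever exceeds even the raised batch size, a simple pre-chunking step keeps each request within budget. This is an illustrative sketch, not part of OpenClaw: it approximates tokens as whitespace-separated words (which undercounts real tokenizer output, so leave headroom), and overlaps chunks slightly so context isn’t cut mid-thought:

```python
def chunk_text(text: str, max_tokens: int = 2048, overlap: int = 64) -> list[str]:
    """Split text into word-based chunks that stay under a token budget.

    Words are a rough stand-in for real tokens; in practice, budget
    well below the server's batch size (e.g. 1500 for -b 2048).
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance less than a full chunk to overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# Example: a 5000-word log split with the default 2048-word budget
log = " ".join(f"word{i}" for i in range(5000))
parts = chunk_text(log)
print(len(parts))  # → 3
```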

Adding The RAG To OpenClaw Config

For efficiency we are adding both active memory and LanceDB to our JSON config.

Note: If you’ve tried to set up RAG before, your database might be locked to the wrong dimensions. Before running the gateway, navigate to C:\Users\YOUR_USERNAME\.openclaw\memory\ and delete the lancedb folder. This forces OpenClaw to create a fresh table using your new 768-dimension Nomic settings.

You will need to manually add the configuration for RAG in your openclaw.json, by default it’s located in:

C:\Users\YOUR_USERNAME\.openclaw\

Before you open the json to edit it, I recommend that you shut down OpenClaw by using the command:

openclaw gateway stop

Open your openclaw.json in your text editor (I mostly use Atom) and add these parts in the json:

OpenClaw RAG settings

Alternatively you can download my pre-configured openclaw.json, just remember to change to your own OpenClaw workspace path as well as your OpenClaw access token.

OpenClaw path and access token

Once you have edited your openclaw.json, and both Gemma 4 and your RAG models are running on separate llama.cpp servers, you can restart OpenClaw with the command:

openclaw gateway start

If you did everything right when following this guide, you can test the memory by opening your OpenClaw UI, giving your Gemma a piece of text, and asking for it to be saved. You should get a response similar to this.

RAG memory success

If you found this guide helpful, consider signing up for my newsletter to get news, tips and tricks directly in your inbox.

Published in AI, English, OpenClaw, Tech