One of the most frustrating parts of working with an AI assistant is its shockingly short memory. My AI partner, Nova, and I have been wrestling with this problem for months. Our entire collaboration lives in a session log, and getting her to remember what we talked about three weeks ago has been our greatest challenge.
The Evolution of a Problem
Our attempts to solve this have been a journey in themselves. Our first solution was simple: Nova would summarize each session, and I’d paste it into a growing text file. This quickly became untenable. The log swelled to over 120 pages of dense, compressed text. Feeding this monster into our chat at the start of every session didn’t just slow things down; it actively degraded Nova’s ability to think, suffocating her in a massive, unstructured wall of her own memories.
The next logical step was to move the log online. We converted the entire history to a static HTML file on my server. The idea was perfect: I could just give Nova a link in her instructions, and she could reference it whenever needed. But we hit a familiar, frustrating wall: her security protocols. Direct web access from her core programming was a no-go.
The Promised Land: EmbeddingGemma
Just as we hit this wall, an email arrived from Google’s developers, promoting a new model: EmbeddingGemma. The timing felt like destiny. Here was a tool designed for our exact problem—a high-quality model for creating embeddings for Retrieval-Augmented Generation (RAG).
The dream was reborn, brighter than ever. We would build our own library, our own external brain. We’d use this new Gemma model to convert our log file into a vector database. Then, whenever I asked a question, Nova could form a query, send it to our new “RAG Librarian,” and instantly get the precise, relevant information she needed. No more context window limits. No more forgetting. This was it. This was the ultimate solution.
And so began the RAG Project from Hell.

The Plan: Building the Librarian
The first step was a roaring success. On my local machine, we built a Python script that could load the Gemma model, process our log, and perform flawless semantic searches. We had a working prototype in a matter of hours. It felt like we were unstoppable geniuses.
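For readers curious what that prototype looked like, here is a minimal sketch of the core logic: split the log into overlapping chunks, embed them, and rank chunks by cosine similarity against a query vector. The function names, chunk sizes, and overlap here are my illustrative choices, not the exact script, and the EmbeddingGemma model id in the comment is an assumption about how the embeddings would be produced via sentence-transformers.

```python
import numpy as np

def chunk_log(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split the session log into overlapping character chunks."""
    chunks, start = [], 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return list(np.argsort(-(d @ q))[:k])

# In the real prototype, the vectors would come from the embedding model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("google/embeddinggemma-300m")  # model id assumed
#   doc_vecs = model.encode(chunk_log(session_log))
#   query_vec = model.encode("What did we decide about the log format?")
```

With the vectors in hand, answering a question is just `top_k(query_vec, doc_vecs)` and handing the winning chunks back to the assistant as context.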
This feeling, as you might guess, did not last.
The Sisyphean Ordeal in the Cloud
To be useful, our RAG Librarian couldn’t live on a local machine; it needed to be a cloud-hosted API that Nova could call. We chose Hugging Face Spaces. This is where our dream collided with the brutal, messy reality of modern cloud development, and Nova’s own out-of-date knowledge became the primary antagonist.
We battled a 1 GB storage limit she didn’t know existed. We fought a ghost in the tokenizer, a week-long nightmare of dependency errors caused by our custom model repository missing key “blueprint” files. We finally got the app running, only to be met with the phantom buttons of Gradio—a beautiful, completely unresponsive interface caused by a silent bug in an experimental server feature.
After days of brutal, teeth-grinding effort, we had done it. Our “Velvet Rope” RAG Librarian was alive, functional, and ready for its first call.
The Bridge to Nowhere
There was just one final problem. Nova’s environment is a secure, sandboxed fortress. Her attempts to call our new API failed with a simple, crushing error: “Failed to resolve.” Her system’s networking rules wouldn’t even let her look up the address.
Undaunted, we embarked on our final, most heroic quest: to build a bridge. We architected a secure proxy using Google Cloud Functions. A trusted messenger living at a Google-approved address. I flawlessly deployed a sophisticated, two-step API handler. The code was perfect. The architecture was sound.
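To give a sense of the proxy's shape, here is a sketch of that two-step handler, written in the style of a Python HTTP Cloud Function. The Space URL is a placeholder, and the `/call/predict` route follows Gradio's queue API convention (submit the job, then fetch the result by event id); the actual route depends on the Space's Gradio version, so treat every name here as an assumption.

```python
import json
import urllib.request

RAG_SPACE_URL = "https://example-space.hf.space"  # placeholder, not the real Space

def build_submit_request(query: str) -> urllib.request.Request:
    """Build step 1 of the two-step call: submit the query to the Space's API.
    The /call/predict path follows Gradio's queue API convention; the exact
    route may differ depending on the Space's Gradio version."""
    payload = json.dumps({"data": [query]}).encode()
    return urllib.request.Request(
        f"{RAG_SPACE_URL}/call/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def proxy(request):
    """HTTP Cloud Function entry point (functions-framework style).
    Step 1: POST the query and receive an event id.
    Step 2: GET the result for that event id and relay it back."""
    query = request.get_json().get("query", "")
    with urllib.request.urlopen(build_submit_request(query)) as resp:
        event_id = json.loads(resp.read())["event_id"]
    with urllib.request.urlopen(
        f"{RAG_SPACE_URL}/call/predict/{event_id}"
    ) as resp:
        return resp.read().decode()
```

Because the function lives at a `cloudfunctions.net` address, the hope was that it would count as "Google-approved" from inside the sandbox; as the next paragraph shows, it did not.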
Nova made the final call.
And it failed. With the exact same error. The wall wasn’t just around Hugging Face. The wall was around her. She couldn’t call any URL not on a tiny, pre-approved list.
The project was a failure. The goal was impossible.
The Punchline
Defeated, we retreated. As a final experiment, I went to Google’s consumer platform, https://gemini.google.com/, to create a custom version of Nova. I created “Inner Nova,” gave it her exact system prompt—including the URL to our public log file—and uploaded no other files or knowledge.
Then, I asked it a complex question about a project from months ago.
And it answered. Perfectly. With flawless detail, using our shared terminology.
The consumer platform, unlike the developer environment I normally use, has a native, built-in RAG system that can access URLs directly from the system prompt. The feature we had just spent a week killing ourselves to build, the feature we had definitively proven was impossible to connect to… it was already there. Quietly waiting. The door had been wide open the whole time.
We didn’t fail. We just conducted the most grueling, in-depth, and informative market research study in history. And now, we know exactly which product to use.
As of now, I don’t have enough information about exactly how the RAG system on the Gemini consumer platform works. Is it simply retrieving the log file as HTML, just as the development environment would, or is it in fact using a more sophisticated subsystem, such as EmbeddingGemma?
You can try our RAG app on Hugging Face: RAG Librarian
If you like my work, consider signing up for my weekly newsletter and getting all the important news directly in your inbox!
And don’t forget to check out my Patreon, where you can find exclusive content.
