When building AI agents, your first priority is to make this bad boy work: get the RAG pipeline properly aligned, check retrieval correctness, and check overall answer usefulness. That's not an easy task when you work with clients that have thousands or millions of documents that must be ingested properly.
Eventually you reach the point where you deploy the agents in production. You enjoy the good engineering results until the next BTP invoice shows up. You check the token usage and realize you need to do something about it because, well, it is a lot.
So you check your traces, and discover that while there is a ton of interaction with the agent, a big portion of those calls are variations of the same questions. Different users, slightly different wording, but the same answer every time. "How do I reset my password", "password reset help", "I forgot my password." Three different prompts, three LLM calls, three times you pay. Same answer.
That's where I found myself after deploying LangGraph agents on SAP BTP with HANA Cloud. The pipeline worked well. The retrieval was solid (I'd already built langchain-hana-retriever for hybrid search). The agent state was persisted (using langgraph-checkpoint-hana). But the token bill kept growing with every new user asking the same things in different words.
So I decided to also build a semantic cache for HANA Cloud, to enhance the LangChain/LangGraph stack I'm using with my SAP clients and their RAG pipelines.
Pretty straightforward: before calling the LLM, embed the user's prompt and check HANA for a stored response to a similar prompt. If the similarity is high enough, return the cached answer and skip the API call entirely. If there's no match, call the LLM normally and cache the result for next time.
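That lookup logic can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not the package's actual implementation, and the `cosine`/`lookup` names are mine:

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(prompt_vec, cache, threshold=0.95):
    """Return the cached response for the most similar stored prompt,
    or None if nothing clears the threshold (i.e. call the LLM)."""
    best_response, best_sim = None, -1.0
    for stored_vec, response in cache:
        sim = cosine(prompt_vec, stored_vec)
        if sim > best_sim:
            best_response, best_sim = response, sim
    return best_response if best_sim >= threshold else None


# Toy "embeddings": the first vector is nearly identical to a stored one.
cache = [([1.0, 0.0], "Reset it via Settings > Security."),
         ([0.0, 1.0], "Invoices are sent monthly.")]
hit = lookup([0.99, 0.14], cache)   # close to the first entry -> cached answer
miss = lookup([0.7, 0.7], cache)    # equidistant, below threshold -> None
```

In production the vectors come from a real embedding model and the loop is pushed down into the database, but the decision rule is exactly this.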
HANA Cloud already has COSINE_SIMILARITY for vector search. It's the same function that powers RAG retrieval. Here I'm just pointing it at prompts instead of documents.
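On the SQL side, the lookup boils down to one query. The table layout below is a hypothetical sketch (the package's actual schema and column names may differ); `COSINE_SIMILARITY` and `TO_REAL_VECTOR` are real HANA Cloud vector engine functions:

```python
# Hypothetical schema: PROMPT_VECTOR REAL_VECTOR, RESPONSE NCLOB.
# The incoming prompt is embedded client-side and bound as a parameter.
LOOKUP_SQL = """
SELECT TOP 1
       RESPONSE,
       COSINE_SIMILARITY(PROMPT_VECTOR, TO_REAL_VECTOR(?)) AS SIM
FROM LLM_CACHE
ORDER BY SIM DESC
"""
# A hit is declared only if the returned SIM clears similarity_threshold;
# otherwise the caller falls through to the LLM and INSERTs the new pair.
```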
Installation is, as usual, pretty straightforward:
```shell
pip install langchain-hana-cache
```

The package implements LangChain's `BaseCache` interface. You set it as the global cache and every LLM call in your application checks HANA first. You don't need to change anything else in your code:
```python
import hdbcli.dbapi
from langchain_core.globals import set_llm_cache
from langchain_hana_cache import HANASemanticLLMCache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

connection = hdbcli.dbapi.connect(
    address="your-host.hanacloud.ondemand.com",
    port=443,
    user="DBADMIN",
    password="your-password",
    encrypt=True,
)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

cache = HANASemanticLLMCache(
    connection=connection,
    embedding=embeddings,
    table_name="LLM_CACHE",
    similarity_threshold=0.95,
)
set_llm_cache(cache)

llm = ChatOpenAI(model="gpt-4o")

# First call: hits the API, caches the result
response1 = llm.invoke("What are the reporting requirements for article 12?")

# Second call with similar wording: cache hit, no API call
response2 = llm.invoke("Tell me about article 12 reporting requirements")
```

That second call comes back instantly. Same answer, zero tokens consumed.
I ran three consecutive calls against my HANA Cloud instance to see what happens. First a new question, then the same question but rephrased, then something completely different:
```
======================================================================
HANASemanticLLMCache Demo
======================================================================

[1] Connecting to SAP HANA Cloud...
    ✓ Connected

[2] Creating semantic cache (table: LLM_CACHE_DEMO_85DFB4)...
    ✓ Cache ready (vector dim auto-detected)

[3] First LLM call (cache MISS expected):
    Prompt:   "What are the three laws of thermodynamics? Explain briefly."
    Response: The three laws of thermodynamics are fundamental principles...
    ⏱ Time: 9.46s

[4] Second LLM call (cache HIT expected, similar prompt):
    Prompt:   "Explain the 3 laws of thermodynamics in a short summary"
    Response: The three laws of thermodynamics are fundamental principles...
    ⏱ Time: 0.41s

[5] Third LLM call (cache MISS expected, different topic):
    Prompt:   "What is the recipe for a classic French omelette?"
    Response: A classic French omelette is simple yet elegant...
    ⏱ Time: 8.16s

======================================================================
Results Summary
======================================================================
Call 1 (MISS): 9.46s  LLM called, response cached
Call 2 (HIT):  0.41s  served from HANA cache!
Call 3 (MISS): 8.16s  different topic, LLM called

Speedup on cache hit: 23.0x faster (96% latency reduction)
======================================================================
```

9.46 seconds down to 0.41. That's 23x faster. And the omelette question correctly missed the cache because it has nothing to do with thermodynamics. The thing actually works.
This is the one knob you need to care about: `similarity_threshold` controls how close two prompts need to be for a cache hit.
I started with 0.95 and haven't needed to touch it. If your agent handles very repetitive questions (think support desk), you could try going lower and see what happens.
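For a more repetitive workload, lowering the threshold is just a constructor argument. The value below is illustrative, not a recommendation:

```python
# Hypothetical support-desk setup: looser matching trades some precision
# for a higher hit rate. Reuses the connection/embeddings from above.
support_cache = HANASemanticLLMCache(
    connection=connection,
    embedding=embeddings,
    table_name="SUPPORT_CACHE",
    similarity_threshold=0.90,  # more paraphrases count as the same question
)
```

If you experiment with this, watch for false hits: the lower the threshold, the more likely two genuinely different questions get the same cached answer.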
Once this is running in production you'll think about the next problem: what if the source documents change? If someone updates a policy and a user asks about it, the cache might still serve the old answer. TTL (time-to-live) takes care of that:
```python
cache = HANASemanticLLMCache(
    connection=connection,
    embedding=embeddings,
    table_name="LLM_CACHE",
    similarity_threshold=0.95,
    ttl_seconds=86400,  # entries expire after 24 hours
)

# Manual cleanup when you need it
cache.evict_expired()
cache.evict_lru(max_entries=1000)
cache.clear()
```

Under the hood, the package creates a single table in HANA that holds the prompt embeddings and the cached responses. The vector dimension is auto-detected from whatever embedding model you pass in. You don't have to think about it.
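Auto-detection can be as simple as embedding a probe string once and measuring the result. That's an assumption about the mechanism rather than the package's confirmed implementation, but it shows why no configuration is needed:

```python
def detect_vector_dimension(embedding) -> int:
    """Embed a throwaway string and use the vector's length as the
    dimension for the vector column in HANA."""
    probe = embedding.embed_query("dimension probe")
    return len(probe)


class FakeEmbeddings:
    """Stand-in with the same embed_query shape as LangChain embeddings."""

    def embed_query(self, text):
        # text-embedding-3-small produces 1536-dimensional vectors by default.
        return [0.0] * 1536
```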
Well, this is not something you add on day one to a RAG pipeline. As I said at the beginning of this article, you add it after your agent is working and you're looking at how to make it sustainable: making proper use of LLM caching, taking care of cache invalidation, making sure the agent uses its retrieval tools well, and so on. In my experience it pays off when many users ask variations of the same questions.
It makes less sense when every query is truly unique, or when your data changes so fast that the cache is always stale.
Also one thing to keep in mind: every cache lookup needs an embedding call to compare the incoming prompt against stored ones. If your LLM is super cheap, the embedding might actually cost more than just calling the model.
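A quick back-of-the-envelope check makes that break-even concrete. The per-1K-token prices below are placeholders I made up for illustration; substitute your provider's actual pricing:

```python
# Illustrative prices per 1K tokens (assumptions, not real price lists):
EMBED_COST_PER_1K = 0.00002  # small embedding model
LLM_COST_PER_1K = 0.01       # blended input+output for a large chat model


def saving_per_hit(prompt_tokens: int, answer_tokens: int) -> float:
    """Net saving of one cache hit: you still pay to embed the prompt,
    but you skip the full LLM call (prompt + answer tokens)."""
    embed_cost = prompt_tokens / 1000 * EMBED_COST_PER_1K
    llm_cost = (prompt_tokens + answer_tokens) / 1000 * LLM_COST_PER_1K
    return llm_cost - embed_cost


# A 100-token question with a 400-token answer:
net = saving_per_hit(100, 400)
```

With these numbers the embedding overhead is negligible, but if your model is orders of magnitude cheaper than the one assumed here, run the same arithmetic before adding the cache.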
This is v0.1.0, focused on LLM response caching. I'm working on two more pieces, each targeting a different part of the cost. Together they should cover most of the optimization you'd want after getting a RAG pipeline into production.
Building the agent is the hard part. Making it affordable is the next challenge, and it's a much more tractable one. This cache lives in the same HANA instance as your business data. No extra infrastructure, just one more table. I saw 23x speedups on cache hits in my tests and the token savings add up fast once you have real users hitting the agent every day.
Give it a try and let me know how it goes.