
Introduction

When building AI agents, your first priority is to make this bad boy work: get the RAG pipeline properly aligned, check retrieval correctness, and check overall answer usefulness. That's not an easy task when you work with clients that have thousands or millions of documents that must be ingested properly.

Eventually you reach the point where you deploy the agents in production. You enjoy the good engineering results until the next BTP invoice shows up. You check the token usage and realize you need to do something about it because, well, it is a lot.

So you check your traces, and discover that while there is a ton of interaction with the agent, a big portion of those calls are variations of the same questions. Different users, slightly different wording, but the same answer every time. "How do I reset my password", "password reset help", "I forgot my password." Three different prompts, three LLM calls, three times you pay. Same answer.

That's where I found myself after deploying LangGraph agents on SAP BTP with HANA Cloud. The pipeline worked well. The retrieval was solid (I'd already built langchain-hana-retriever for hybrid search). The agent state was persisted (using langgraph-checkpoint-hana). But the token bill kept growing with every new user asking the same things in different words.

So I decided to also build a semantic cache for HANA Cloud, one that complements the LangChain/LangGraph stack I'm already using with my SAP clients and their RAG pipelines.

The idea

Pretty straightforward: before calling the LLM, embed the user's prompt and check HANA for a stored response to a similar prompt. If the similarity is high enough, return the cached answer and skip the API call entirely. If there's no match, call the LLM normally and cache the result for next time.

HANA Cloud already has COSINE_SIMILARITY for vector search. It's the same function that powers RAG retrieval. Here I'm just pointing it at prompts instead of documents.

  1. User sends a prompt
  2. Embed the prompt into a vector
  3. Search HANA for cached entries using COSINE_SIMILARITY
  4. If there's a match above the threshold, return the cached response. No LLM call.
  5. If no match, call the LLM, store the prompt embedding + response in HANA, return the response

[Infographic: the langchain-hana-cache flow, from prompt embedding to cache hit or LLM call]

Installation and setup

As usual, pretty straightforward:

pip install langchain-hana-cache

The package implements LangChain's BaseCache interface. You set it as the global cache and every LLM call in your application checks HANA first. You don't need to change anything else in your code:

import hdbcli.dbapi
from langchain_hana_cache import HANASemanticLLMCache
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.globals import set_llm_cache

connection = hdbcli.dbapi.connect(
    address="your-host.hanacloud.ondemand.com",
    port=443,
    user="DBADMIN",
    password="your-password",
    encrypt=True,
)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

cache = HANASemanticLLMCache(
    connection=connection,
    embedding=embeddings,
    table_name="LLM_CACHE",
    similarity_threshold=0.95,
)

set_llm_cache(cache)

llm = ChatOpenAI(model="gpt-4o")

# First call: hits the API, caches the result
response1 = llm.invoke("What are the reporting requirements for article 12?")

# Second call with similar wording: cache hit, no API call
response2 = llm.invoke("Tell me about article 12 reporting requirements")

That second call comes back instantly. Same answer, zero tokens consumed.

What I actually saw when I tested it

I ran three consecutive calls against my HANA Cloud instance to see what happens. First a new question, then the same question but rephrased, then something completely different:

     ======================================================================
       HANASemanticLLMCache Demo
     ======================================================================

     [1] Connecting to SAP HANA Cloud...
         ✓ Connected

     [2] Creating semantic cache (table: LLM_CACHE_DEMO_85DFB4)...
         ✓ Cache ready (vector dim auto-detected)

     [3] First LLM call (cache MISS expected):
         Prompt: "What are the three laws of thermodynamics? Explain briefly."
         Response: The three laws of thermodynamics are fundamental principles...
         ⏱  Time: 9.46s

     [4] Second LLM call (cache HIT expected, similar prompt):
         Prompt: "Explain the 3 laws of thermodynamics in a short summary"
         Response: The three laws of thermodynamics are fundamental principles...
         ⏱  Time: 0.41s

     [5] Third LLM call (cache MISS expected, different topic):
         Prompt: "What is the recipe for a classic French omelette?"
         Response: A classic French omelette is simple yet elegant...
         ⏱  Time: 8.16s

     ======================================================================
       Results Summary
     ======================================================================
       Call 1 (MISS):  9.46s  LLM called, response cached
       Call 2 (HIT):   0.41s  served from HANA cache!
       Call 3 (MISS):  8.16s  different topic, LLM called

       Speedup on cache hit: 23.0x faster (96% latency reduction)
     ======================================================================

9.46 seconds down to 0.41. That's 23x faster. And the omelette question correctly missed the cache because it has nothing to do with thermodynamics. The thing actually works.

Tuning the similarity threshold

This is the one knob you need to care about. The similarity_threshold controls how close two prompts need to be for a cache hit:

  • 0.98: very strict. Only nearly identical prompts match. Safe, but you won't save much.
  • 0.95: what I use. Catches rephrasings without being too loose.
  • 0.90: aggressive. You save more, but you risk serving a wrong answer when two different questions happen to look similar to the model.

I started with 0.95 and haven't needed to touch it. If your agent handles very repetitive questions (think support desk), you could try going lower and see what happens.
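
If you want a feel for where your threshold should sit before touching production, you can score a pair of prompts directly with your embedding model. This is a quick standalone check with plain numpy, independent of the cache package itself:

import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two rephrasings of the same question: if this score clears your threshold,
# they would share a cache entry.
v1 = embeddings.embed_query("What are the reporting requirements for article 12?")
v2 = embeddings.embed_query("Tell me about article 12 reporting requirements")
print(f"Similarity: {cosine(v1, v2):.3f}")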

Dealing with stale answers

Once this is running in production, you'll hit the next problem: what if the source documents change? If someone updates a policy and a user asks about it, the cache might still serve the old answer. TTL (time-to-live) takes care of that:

cache = HANASemanticLLMCache(
    connection=connection,
    embedding=embeddings,
    table_name="LLM_CACHE",
    similarity_threshold=0.95,
    ttl_seconds=86400,  # entries expire after 24 hours
)

# Manual cleanup when you need it
cache.evict_expired()
cache.evict_lru(max_entries=1000)
cache.clear()
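
How you schedule those calls is up to you. One low-tech option is a small housekeeping job that runs them periodically; this is just a sketch, use whatever scheduler you already have in place:

import time

while True:
    cache.evict_expired()              # drop entries past their TTL
    cache.evict_lru(max_entries=1000)  # trim least-recently-used entries beyond the cap
    time.sleep(3600)                   # once an hour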

What the cache table looks like

Under the hood, the package creates a single table in HANA with:

  • PROMPT_HASH: SHA256 of the exact prompt to prevent storing duplicates
  • PROMPT_TEXT: the original prompt text
  • PROMPT_EMBEDDING: the vector for semantic matching (REAL_VECTOR column)
  • LLM_STRING: which model generated the response. GPT-4o cache won't serve GPT-3.5 queries.
  • RESPONSE: the serialized LLM response
  • CREATED_AT, LAST_ACCESSED, HIT_COUNT: for TTL and eviction

The vector dimension is auto-detected from whatever embedding model you pass in. You don't have to think about it.
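
If you're curious what a lookup against that table roughly looks like, it boils down to a single COSINE_SIMILARITY query. The snippet below is an illustrative sketch that reuses the connection and embeddings objects from the setup above; it is not the exact SQL the package issues, and the literal "gpt-4o" stands in for whatever model string the cache actually stores:

query_vector = embeddings.embed_query("How do I reset my password?")

cursor = connection.cursor()
cursor.execute(
    """
    SELECT RESPONSE,
           COSINE_SIMILARITY(PROMPT_EMBEDDING, TO_REAL_VECTOR(?)) AS SIMILARITY
    FROM LLM_CACHE
    WHERE LLM_STRING = ?  -- only reuse answers produced by the same model
    ORDER BY SIMILARITY DESC
    LIMIT 1
    """,
    (str(query_vector), "gpt-4o"),
)
row = cursor.fetchone()
if row is not None and row[1] >= 0.95:
    cached_response = row[0]  # cache hit: serialized response, no LLM call needed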

When does all this make sense?

Well, this is not something you add on day one to a RAG pipeline. As I said at the beginning of this article, you add it after your agent is working and you're looking at how to make it sustainable: whether you're making use of LLM caching, how you handle cache invalidation, whether the agent is making good use of its retrieval tools, and so on. In my experience it pays off when:

  • Multiple users are hitting the agent with variations of the same questions
  • You're on an expensive model and the monthly bill reflects it
  • The answers stay valid for hours or days (policies, regulations, product docs, internal knowledge bases)

It makes less sense when every query is truly unique, or when your data changes so fast that the cache is always stale.

Also one thing to keep in mind: every cache lookup needs an embedding call to compare the incoming prompt against stored ones. If your LLM is super cheap, the embedding might actually cost more than just calling the model.
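
A quick back-of-envelope check makes that trade-off concrete. The numbers below are placeholders, not current list prices; plug in what you actually pay per embedding lookup and per LLM answer, plus the hit rate you see in your traces:

# Placeholder figures, not real prices: substitute your own.
embedding_cost_per_lookup = 0.000002  # cost to embed one incoming prompt
llm_cost_per_answer = 0.01            # average cost of one full LLM answer
hit_rate = 0.30                       # share of prompts served from the cache

# Every prompt pays the embedding lookup; only hits skip the LLM call.
net_savings_per_prompt = hit_rate * llm_cost_per_answer - embedding_cost_per_lookup
print(f"Net savings per prompt: ${net_savings_per_prompt:.6f}")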

What's next

This is v0.1.0, focused on LLM response caching. I'm working on two more pieces:

  • Retriever cache: cache the retrieved document chunks so similar queries skip the embedding + vector search step entirely
  • Pipeline cache: cache the full pipeline result, retrieval + LLM together, for maximum savings on repeated queries

Each one targets a different part of the cost. Together they should cover most of the optimization you'd want after getting a RAG pipeline into production.

Conclusion

Building the agent is the hard part. Making it affordable is the next challenge, and it's a much more tractable one. This cache lives in the same HANA instance as your business data. No extra infrastructure, just one more table. I saw 23x speedups on cache hits in my tests and the token savings add up fast once you have real users hitting the agent every day.

Give it a try and let me know how it goes.