Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add exact match caching #1717

Merged
merged 26 commits into from
Dec 15, 2024

Conversation

ayulockin
Copy link
Contributor

@ayulockin ayulockin commented Nov 28, 2024

This PR introduces a robust caching abstraction to improve performance and prevent progress loss during API calls (e.g., LLM or embedding requests). This was brought up in #1522 and later added as an enhancement request in #1602.

Key Features:

  • Flexible Backend Support: Added CacheInterface with DiskCacheBackend (using diskcache) and InMemoryCacheBackend (not tested yet).
  • Deterministic Cache Key Generation: So far, I was successful in generating deterministic and unique cache keys.
  • Decorator-Based Integration: Simple @cacher decorator to seamlessly apply caching to both synchronous and asynchronous functions. This can act as a singular interface. Additionally, I can create a CacheMixin class from which other classes can inherit. But personally liked the elegance of a decorator.

Questions:

  1. What's would be a good assumption for the size of dataset used for evaluation/number of LLM calls made? Asking this so that I can plan benchmarking accordingly. cc: @jjmachan

  2. The implemented _generate_cache_key uses the args and kwargs to find attributes that should be deterministic. For now I am excluding keys that have memory address. This can be done more gracefully maybe. Or we can leverage the __repr__ or __hash__ written for PydanticPrompt but not sure if this is the consensus for writing most core components in the lib.

Few obvious TODOs:

  • proper abstractions
  • test for dataset generation
  • test in memory caching
  • write tests
  • benchmark for latency
  • docs

@jjmachan
Copy link
Member

Great to see this draft 🙂

This should give you an average distribution for last month @ayulockin
image

this ignores rows which had the value 1 (assume single evaluations)

@ayulockin
Copy link
Contributor Author

Thanks @jjmachan, I will stress test on an evaluation set of 300 rows (on the top of my mind). I think it would be sufficient amount give the percentile 95 is 120 and average is 40 rows. Will benchmark a few methods and let you know. Super useful data points.

@jjmachan
Copy link
Member

hey @ayulockin I was thinking, instead of benchmarking evaluation, we have to evaluate prompt inputs here right - that will be more accurate benchmark. We can synthically create a few of these too

but a few questions

  1. what aspects are we benchmarking/measuring?
  2. what different situations should we consider (I think mostly it will be the different key_length but is there any other factors?)

@ayulockin
Copy link
Contributor Author

instead of benchmarking evaluation, we have to evaluate prompt inputs here right

Caching should add some latency. I want to benchmark evaluation/testset generation (as a case study) with caching enabled and without it. The difference should not be noticeable imo. Running this is easy.

I am not sure what you exactly mean by "evaluate prompt inputs here". My guess is on unique key generation given a prompt input and that it is deterministic (we get the same key when we give the same prompt input) and the lookup speed/insertion speed, etc. I think with evaluation benchmark I meant the same thing.

So to answer your question on "what aspects are we benchmarking/measuring?" > It's latency. Also because we are saving the raw output as the value to the key, need to be careful about the disk memory usage too imo (but should not be such a big deal considering the percentile usage data you shared -- one will never run LLM based eval on 100k samples 😅).

what different situations should we consider (I think mostly it will be the different key_length but is there any other factors?)

I am generating key like this:

def _generate_cache_key(func, args, kwargs):

it's quite generic but want to check it with the embeddings as well (embedding models are very cheap so up to you if you want this to be cached too.)

TLDR;

Aspect Metric How to Measure
Cache Lookup Speed Lookup latency (ms) Measure time to fetch from cache for keys of varying lengths.
Cache Insertion Speed Insertion latency (ms) Measure time to insert new cache entries.
End-to-End Latency Total function time (ms) Compare cached vs. uncached function calls.
Cache Hit Rate Hit rate (%) Track cache hits vs. misses during repeated evaluations.
Memory/Disk Overhead Cache size (MB/GB) Monitor cache size with different inputs and backends.
Key Length Impact Latency by key length Test small vs. large keys for hashing and lookup time.
Concurrency Latency under load (ms) Simulate multiple threads/processes using the cache concurrently.
Backend Comparison Latency for disk vs. memory Compare diskcache and in-memory caching for similar workloads.

@ayulockin
Copy link
Contributor Author

ayulockin commented Nov 30, 2024

Ran evaluation with and without caching on the rungalileo/ragbench "tech" subset with 314 samples in the test set.

Without caching: 6 minutes 32.21 seconds
image

With caching: 5 minutes 46.31 seconds

image

Hence diskcache is practically having no overhead because it gets overlapped with the openai's latency.

Running the same eval that was cached took only 3.27 seconds.

image

The resulting .cache/cache.db is only taking 2 MB for this example. I don't think key_length is a concern here.

image

Thoughts:

  • diskcache is good enough. anything else would be over engineering at this point.
  • caching should be parameterised i.e, the user should easily be able to turn it on or off given the non-deterministic nature of evals -- one would like to run without cache.

@jjmachan
Copy link
Member

(but should not be such a big deal considering the percentile usage data you shared -- one will never run LLM based eval on 100k samples 😅).

this actually doesn't hold for testset generation, especially since there are a lot of transforms involved which process a lot of documents

I am not sure what you exactly mean by "evaluate prompt inputs here". My guess is on unique key generation given a prompt input and that it is deterministic (we get the same key when we give the same prompt input) and the lookup speed/insertion speed, etc. I think with evaluation benchmark I meant the same thing.

one are that might perform different is testset generation which will behave differently when compared to things that are cached when evaluating. This is why I though we would consider a different benchmark but that is an overkill. Instead what we could do is to benchmark how caching affects the testset generation module too - what do you think?

  • diskcache is good enough. anything else would be over engineering at this point.

  • caching should be parameterised i.e, the user should easily be able to turn it on or off given the non-deterministic nature of evals -- one would like to run without cache.

  • should we consider the trade-offs for in memory? I don't think we have too because even though disk caching is way slower than in-memory it is cheaper and can store more and given the biggest bottleneck is LLMs it should be a great fit
  • yep, lets do that. how many aspects of this should we configure. 2 things I can think off are
  1. on/off (if there are any caching backend specific args, that will come here too)
  2. how to which backend to use

how users will use these config is the question

@ayulockin
Copy link
Contributor Author

Instead what we could do is to benchmark how caching affects the testset generation module too - what do you think?

You are right. Test set generation is where I am right now.

should we consider the trade-offs for in memory? I don't think we have too because even though disk caching is way slower than in-memory it is cheaper and can store more and given the biggest bottleneck is LLMs it should be a great fit

In memory should be an option for the users -- maybe test generation benefits from it (just started investigating). The default should be diskcache.

on/off (if there are any caching backend specific args, that will come here too)

I think I have a way to do it cleanly. Will let you know.

@ayulockin
Copy link
Contributor Author

The caching works fine with test generation.

But I faced a few issues from my dev branch while running test generation so, did a fresh install in a new environment by cloning directly from the main branch (no fork) to make sure if the issues persist.

I have documented them here #1718. I am either doing something stupid or the issues are real.

@jjmachan
Copy link
Member

jjmachan commented Dec 2, 2024

thanks a lot for the update @ayulockin 🙂

should we aim to just work on diskcache for now and then open an issue to get more feedback on this

@ayulockin ayulockin marked this pull request as ready for review December 7, 2024 11:05
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 7, 2024
@ayulockin
Copy link
Contributor Author

@jjmachan the PR is ready for review.

@ayulockin
Copy link
Contributor Author

@jjmachan moved caching to the async def generate method of BaseRagasLLM as per our discussion.

Comment on lines 85 to 88
@cacher()
async def generate(
self,
prompt: PromptValue,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this approach

Copy link
Member

@jjmachan jjmachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ayulockin there are a couple of things here

because cacher is a decorator users don't have control over this other than the environment variable, so this has the same problem as this being in PydanticPrompt. In my mind there where 2 usecases

  1. user has to enable/disable caching
  2. user has to be able to change the caching service

I'll write done how I was thinking about this

1. enable caching

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

gpt4o = ChatOpenAI()
evaluator_llm = LangchainLLMWrapper(gpt4o, caching=True)

# now you can use the LLM as you want
...

the same applies for embeddings too - this was something missing in this PR.

customize caching

# implement a cache backend from the interface
from ragas.cache import CacheInterface

# maybe with Redis backend
class RedisCacher(CacheInterface):
	# implementation

# use with LLMWrapper itself
gpt4o = ChatOpenAI()
evaluator_llm = LangchainLLMWrapper(gpt4o, cacher=RedisCacher())

now one con I see of this approach is that there are 2 keywords we have to add so for the "ennable caching usecase" I was thinking myabe we can join them

from ragas.cache import DefaultCacher
# we can name it DiskCacheCacher too

# use with LLMWrapper itself
gpt4o = ChatOpenAI()
evaluator_llm = LangchainLLMWrapper(gpt4o, cacher=DefaultCacher())

what do you think?

@ayulockin
Copy link
Contributor Author

Hey @jjmachan, this is great. Let me quickly implement this.

what do you think?

This is clearly a cleaner approach from a user pov (env can be scary especially in a big project). My approach was to make caching a hidden thing with just the control to turn on/off and select backend but I find your idea more appealing.

@ayulockin
Copy link
Contributor Author

I went with the evaluator_llm = LangchainLLMWrapper(gpt4o, cache=DiskCacheBackend()) approach.

Copy link
Member

@jjmachan jjmachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 🥳

@jjmachan jjmachan linked an issue Dec 15, 2024 that may be closed by this pull request
@jjmachan jjmachan merged commit 9f2ba5b into explodinggradients:main Dec 15, 2024
15 checks passed
shahules786 pushed a commit to shahules786/ragas that referenced this pull request Dec 16, 2024
This PR introduces a robust caching abstraction to improve performance
and prevent progress loss during API calls (e.g., LLM or embedding
requests). This was brought up in explodinggradients#1522 and later added as an
enhancement request in explodinggradients#1602.
shahules786 pushed a commit to shahules786/ragas that referenced this pull request Dec 20, 2024
This PR introduces a robust caching abstraction to improve performance
and prevent progress loss during API calls (e.g., LLM or embedding
requests). This was brought up in explodinggradients#1522 and later added as an
enhancement request in explodinggradients#1602.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-311] Adding exact match caching
2 participants