---
name: context-cache-compress
description: >
  Use this skill to enable ADK 2.0 context caching and compression so long-
  running agent sessions don't blow the model's context window or burn
  tokens on repeated prefixes. Triggers on: "ADK context cache", "compress
  ADK history", "ADK long conversation memory", "prevent context overflow
  ADK", "ADK token budget", "summarize old turns ADK", "ADK conversation
  compaction". Generates configuration for cached prefixes and a
  compression callback that summarizes old events.
---

# context-cache-compress

Long sessions hit the model's context window. ADK 2.0 supports two mitigations: prompt caching (Gemini-side) and event compression (ADK-side).

## Prompt caching (Gemini)

Cache long static prefixes (system prompt + few-shot examples) so subsequent calls reuse them at lower cost/latency.

```python
from google.adk.agents import LlmAgent

LARGE_INSTRUCTION = open("./few_shot_examples.md").read()  # ~50KB

root_agent = LlmAgent(
    name="cached_agent",
    model="gemini-2.5-flash",
    instruction=LARGE_INSTRUCTION,
    cache_config={
        "cache_instruction": True,
        "cache_ttl_seconds": 3600,
    },
)
```

ADK creates an explicit Vertex cache resource and reuses it across invocations.

## Event compression

Summarize old events when total tokens exceed a threshold:

```python
from google.adk.callbacks import on_before_model_call

@on_before_model_call
async def compress_history(ctx, request):
    if ctx.session.token_count > 100_000:
        # Drop oldest 50% of events, replace with a summary event
        old = ctx.session.events[: len(ctx.session.events) // 2]
        summary = await summarize_events(old)
        ctx.session.events = [summary, *ctx.session.events[len(old):]]
    return request
```

## Sliding window

Cap to last N turns:

```python
@on_before_model_call
async def sliding_window(ctx, request):
    MAX_TURNS = 20
    if len(ctx.session.events) > MAX_TURNS * 2:  # user+assistant pairs
        ctx.session.events = ctx.session.events[-MAX_TURNS * 2:]
    return request
```

## Hierarchical summary

Keep recent verbatim, summarize middle, archive oldest:

```python
@on_before_model_call
async def hierarchical(ctx, request):
    events = ctx.session.events
    if len(events) > 60:
        recent = events[-20:]
        middle_summary = await summarize_events(events[-60:-20])
        archive_summary = ctx.session.state.get("archive_summary", "")
        new_archive = await summarize_events(events[:-60])
        ctx.session.state["archive_summary"] = archive_summary + "\n" + new_archive
        ctx.session.events = [middle_summary, *recent]
    return request
```

## Validation

- Token count drops after compression (`ctx.session.token_count`)
- Summary preserves key facts (test with retrieval questions about old turns)
- Cache hits visible in Vertex logs / billing
- Compression callback runs before model call, not after (avoid losing the response)

## See also

- `session-rewind-checkpoint` if you need to revert compressions
