
The Three-Buffer Architecture for Agent Memory

Vladimir Kesler · May 26, 2025 · 8 min read


Most agent memory is an append-only log

If you are building an agent system, you have likely hit this wall: the conversation history grows with every turn, every tool call, every result. Eventually the context window fills up. The agent starts losing track of earlier decisions. Responses degrade.

The default fix is to reach for external storage. Vector databases, RAG pipelines, embedding models. But before adding infrastructure, there is a simpler question worth answering: are you using the context window well?

In most agent systems I have built or seen discussed, the answer is no. The context window is treated as an append-only log. Everything goes in, nothing comes out, until it overflows.

This post describes three composable buffers that handle agent memory within the context window. Each one solves a specific problem. Each one has clear failure modes. Together, they form an architecture that handles complex multi-step workflows without external memory infrastructure.

Buffer 1: The Summary Buffer

The highest-leverage technique for context management is summarization. Instead of keeping the full conversation history, summarize older turns into a compact representation and keep only recent turns verbatim.

from openai import OpenAI

client = OpenAI()

def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate token count from message content length.
    Replace with tiktoken or your model's tokenizer for precision."""
    # `content` can be None on assistant messages that only carry tool calls
    return sum(len(m.get("content") or "") // 4 for m in messages)

def summarize_messages(messages: list[dict]) -> str:
    """Ask the model to compress a block of conversation into key points."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Summarize the key decisions, facts, and open questions "
                "from this conversation. Be concise. Preserve exact values "
                "(names, numbers, IDs) when they appear."
            )},
            *messages,
        ],
    )
    return response.choices[0].message.content

def apply_summary_buffer(
    messages: list[dict],
    token_limit: int = 16_000,
    recent_turns: int = 4,
) -> list[dict]:
    """Keep recent turns verbatim, summarize everything older."""
    if count_tokens(messages) <= token_limit:
        return messages

    if len(messages) <= recent_turns + 1:
        return messages  # too short to split; nothing older to summarize

    system = messages[0]  # assumes messages[0] is the system prompt
    recent = messages[-recent_turns:]
    older = messages[1:-recent_turns]

    summary = summarize_messages(older)
    return [
        system,
        {"role": "system", "content": f"Summary of earlier conversation:\n{summary}"},
        *recent,
    ]

This is cheap, effective, and preserves what matters. The model gets full fidelity on recent context and a compressed view of everything older.
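As the `count_tokens` docstring notes, the chars/4 heuristic can be swapped for a real tokenizer when you need precision. A hedged sketch that uses tiktoken when it is installed and falls back to the heuristic otherwise (note it ignores the small per-message framing overhead that real chat formats add):

```python
def count_tokens_precise(messages: list[dict], model: str = "gpt-4o") -> int:
    """Count tokens with tiktoken when available; fall back to chars/4.
    Ignores per-message framing overhead, so treat it as a close estimate."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return sum(len(enc.encode(m.get("content") or "")) for m in messages)
    except Exception:  # tiktoken missing or model name unknown to it
        return sum(len(m.get("content") or "") // 4 for m in messages)
```

Either way, keep the counter behind one function so the rest of the architecture never cares which estimator is in use.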

The key insight: summarization is lossy, but so is retrieval. A RAG system that returns the wrong chunks is worse than a summary that captured the essential points. At least the summary has seen the full conversation.

Where it fails: when the agent needs exact values from early turns. A customer ID mentioned in turn 2 may not survive summarization at turn 40. If your workflow depends on precise recall of early details, the summary buffer alone is not enough. That is what the scratchpad solves.

Buffer 2: The Scratchpad

Instead of relying on the model to extract important information from a wall of conversation history, maintain a structured scratchpad that tracks key state explicitly.

from pydantic import BaseModel

class Scratchpad(BaseModel):
    """Typed state that the agent carries across turns.

    The fields here are a contract between the developer and the agent:
    the developer decides what state matters, and the agent is responsible
    for keeping it current."""
    objective: str
    key_facts: list[str] = []
    decisions: list[str] = []
    current_step: str = ""
    blockers: list[str] = []

# Initialize at the start of a workflow
pad = Scratchpad(
    objective="Migrate auth from JWT to session-based",
    key_facts=["User pool: AWS Cognito", "~12k active sessions"],
    decisions=["Redis for session storage", "30-min TTL, sliding expiry"],
    current_step="Implement session middleware",
)

def scratchpad_to_message(pad: Scratchpad) -> dict:
    """Inject the scratchpad as a system message."""
    return {
        "role": "system",
        "content": f"Current task state:\n{pad.model_dump_json(indent=2)}",
    }

Inject this scratchpad into the system prompt. A 200-token scratchpad replaces thousands of tokens of raw conversation history because it contains only the structured signal the agent needs.
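To put a rough number on that claim, here is the same scratchpad as a plain dict (so the snippet stands alone without pydantic), costed with the same chars/4 heuristic as `count_tokens`:

```python
import json

# Plain-dict stand-in for the pydantic Scratchpad above, so this runs alone.
pad = {
    "objective": "Migrate auth from JWT to session-based",
    "key_facts": ["User pool: AWS Cognito", "~12k active sessions"],
    "decisions": ["Redis for session storage", "30-min TTL, sliding expiry"],
    "current_step": "Implement session middleware",
    "blockers": [],
}

msg = {"role": "system", "content": "Current task state:\n" + json.dumps(pad, indent=2)}
approx_tokens = len(msg["content"]) // 4  # same heuristic as count_tokens
```

Even fully populated, this comes in well under the 200-token ballpark.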

The scratchpad also needs to be updated as the workflow progresses. After each significant step, ask the model to produce an updated scratchpad:

import json

def update_scratchpad(
    pad: Scratchpad,
    recent_messages: list[dict],
) -> Scratchpad:
    """Ask the model to update the scratchpad based on recent activity."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are updating a task scratchpad. Based on the recent messages, "
                "return an updated JSON object with the same schema. "
                "Add new decisions and facts. Update current_step. "
                "Remove resolved blockers."
            )},
            {"role": "user", "content": (
                f"Current scratchpad:\n{pad.model_dump_json(indent=2)}\n\n"
                f"Recent activity:\n{json.dumps(recent_messages, indent=2)}"
            )},
        ],
        response_format={"type": "json_object"},
    )
    return Scratchpad.model_validate_json(
        response.choices[0].message.content
    )

The design trade-off: scratchpad fields are a contract between the developer and the agent. Too few fields and the agent loses important state. Too many and the scratchpad becomes a second conversation log. I recommend starting with four or five fields and adjusting based on what the agent actually forgets.

Where it fails: the scratchpad tracks what the developer anticipated needing. It does not capture unexpected information. If the conversation takes an unforeseen direction, important details may exist only in the (now summarized) conversation history. The scratchpad and the summary buffer compensate for each other's weaknesses.

Buffer 3: The Context Budget

Not all context is equally important. The context budget makes this explicit by dividing the context window into zones with different retention policies.

Tier 1: System context (always present). Identity, capabilities, rules, current scratchpad. This is your fixed-cost memory. Keep it tight.

Tier 2: Working context (recent, full fidelity). The last few turns of conversation, recent tool results, current task state. This is your working memory.

Tier 3: Archive context (compressed or dropped). Older conversation turns, previous task results, resolved sub-tasks. Fill remaining budget with these, summarized or truncated.

def build_context(
    system_prompt: dict,
    scratchpad: dict,
    messages: list[dict],
    token_budget: int = 16_000,
    recent_turns: int = 4,
) -> list[dict]:
    """Assemble context with explicit budget allocation."""

    # Tier 1: system context (always included)
    tier1 = [system_prompt, scratchpad]

    # Tier 2: working context (recent turns, full fidelity)
    tier2 = messages[-recent_turns:]

    # Calculate remaining budget for archive
    fixed_tokens = count_tokens(tier1) + count_tokens(tier2)
    archive_budget = token_budget - fixed_tokens

    # Tier 3: archive context (fill remaining budget)
    older = messages[:-recent_turns]
    archive = []
    running_total = 0
    # Work backwards from most recent older messages
    for msg in reversed(older):
        msg_tokens = count_tokens([msg])
        if running_total + msg_tokens > archive_budget:
            break
        archive.insert(0, msg)
        running_total += msg_tokens

    return [*tier1, *archive, *tier2]

The budget approach gives you explicit control. When context runs tight, you know exactly what to compress first: tier 3 goes before tier 2, and tier 1 is never touched.

Where it fails: a static budget does not adapt to the shape of the conversation. Some turns carry more information density than others. A tool call that returns a 2000-token JSON result is not the same as a two-sentence user message. More sophisticated implementations can weight messages by type, but the simple tier system is a strong starting point.
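One lightweight way to weight by type, as a sketch: truncate oversized tool results in place before dropping whole messages from the archive. The 1,000-character cap is an arbitrary assumption to tune for your workload:

```python
def truncate_tool_results(messages: list[dict], max_chars: int = 1_000) -> list[dict]:
    """Shrink bulky tool outputs before resorting to dropping messages.
    A 2000-token JSON blob becomes a stub; short user turns pass through."""
    out = []
    for msg in messages:
        content = msg.get("content") or ""
        if msg.get("role") == "tool" and len(content) > max_chars:
            # Copy the message so the caller's history is left untouched
            msg = {**msg, "content": content[:max_chars] + "\n[...truncated]"}
        out.append(msg)
    return out
```

Run this over tier 3 before the budget loop and far more messages survive intact, because the budget is no longer dominated by a few verbose tool results.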

Composing the three buffers

In practice, you use all three together. The summary buffer compresses old conversation history. The scratchpad captures structured state. The context budget allocates space across categories.

Here is what this looks like in a single function:

def build_agent_context(
    system_prompt: str,
    scratchpad: Scratchpad,
    messages: list[dict],
    token_budget: int = 16_000,
    recent_turns: int = 4,
) -> list[dict]:
    """Compose all three buffers into a single context array."""

    system_msg = {"role": "system", "content": system_prompt}
    scratchpad_msg = scratchpad_to_message(scratchpad)
    recent = messages[-recent_turns:]
    older = messages[:-recent_turns]  # system prompt is passed in separately

    # Calculate available space for archive
    fixed = [system_msg, scratchpad_msg, *recent]
    fixed_tokens = count_tokens(fixed)
    archive_budget = token_budget - fixed_tokens

    # If older messages fit in budget, keep them as-is
    older_tokens = count_tokens(older)
    if older_tokens <= archive_budget:
        archive = older
    else:
        # Summarize to fit within budget
        summary = summarize_messages(older)
        archive = [
            {
                "role": "system",
                "content": f"Summary of earlier conversation:\n{summary}",
            }
        ]

    return [system_msg, scratchpad_msg, *archive, *recent]

The agent loop ties these pieces together: after each turn, update the scratchpad from the recent messages, then rebuild context for the next call.

# The agent loop (sketch: `done`, `system_prompt`, and `messages` come from
# your application; call_model wraps the LLM API and returns a message dict)
recent_turns = 4
while not done:
    context = build_agent_context(system_prompt, scratchpad, messages)
    response = call_model(context)
    messages.append(response)
    scratchpad = update_scratchpad(scratchpad, messages[-recent_turns:])

The arithmetic is straightforward: a 200-token scratchpad, a 2000-token summary, and 4 recent turns of conversation leave ample room for system instructions and tool results within a 16k-token budget. For a model with 128k tokens of context, this architecture handles workflows that span hundreds of turns without external storage.
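Spelled out, assuming roughly 800 tokens per verbatim turn (an assumption; your turns will vary):

```python
budget = 16_000
scratchpad = 200       # structured state
summary = 2_000        # compressed older history
recent = 4 * 800       # four verbatim turns at ~800 tokens each (assumed)

headroom = budget - (scratchpad + summary + recent)
# headroom == 10_600 tokens left for system instructions and tool results
```
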

When you need something else

Context-window memory has real limits. Four cases where these buffers are insufficient:

Multi-session persistence. If your agent needs to remember things across separate conversations (user preferences, past interactions, learned facts), you need storage outside the context window. No amount of summarization helps if the context does not exist yet.

Large knowledge bases. If your agent needs to reference more than roughly 100k tokens of domain knowledge (documentation, policies, product catalogs), you cannot fit it all in context. Retrieval becomes necessary, and vector search is a reasonable approach.

Cross-agent state. If multiple agents need to share memory, where one agent's output becomes another agent's input across different sessions, you need a shared store. This is less about vectors and more about structured state, but it is a legitimate infrastructure need.

Regulatory or audit requirements. If you need a durable, queryable record of what the agent knew and when, that is a storage problem by definition.

These are the cases where external memory infrastructure earns its complexity. If none of them apply, the three-buffer architecture is likely sufficient.

The cost of skipping this step

Teams that jump straight to vector databases without structuring their context window pay a compounding cost.

You need an embedding pipeline, which means choosing an embedding model, versioning it, and re-embedding your entire corpus when you switch models (and you will switch models). You need a chunking strategy, which sounds simple until your 800-word product docs and your 50-page compliance PDFs need completely different chunk sizes, overlap settings, and metadata schemas.

Then there is retrieval quality. Your agent starts hallucinating and you cannot tell if it is the prompt, the model, the chunk boundaries, or the retrieval ranking. Debugging a context window is reading a prompt. Debugging RAG is tracing a query through embedding, search, reranking, and assembly, with each step lossy in its own way.

The three-buffer architecture is debuggable by inspection. You can print the entire context array and read it. That property disappears the moment you add retrieval.
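"Print the entire context array" can itself be a ten-line helper. A sketch that renders one skimmable line per message:

```python
def dump_context(messages: list[dict], width: int = 80) -> str:
    """Render the context array as one numbered, skimmable block."""
    lines = []
    for i, msg in enumerate(messages):
        content = (msg.get("content") or "").replace("\n", " ")
        if len(content) > width:
            content = content[: width - 3] + "..."
        lines.append(f"{i:2d} [{msg.get('role', '?'):9s}] {content}")
    return "\n".join(lines)
```

Printing this before every model call makes drift obvious: you can watch the summary message appear, the scratchpad update, and the archive shrink turn by turn.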

I recommend building the three-buffer architecture first, even if you know you will eventually need external memory. The discipline of structuring your context window does not go away when you add a vector database. It becomes more important, because now your retrieval results need to fit within a budget alongside everything else.

The playbook

  1. Define your scratchpad schema. Start with objective, decisions, current step, and blockers. Adjust based on what the agent forgets.
  2. Implement the summary buffer. Summarize older turns, keep recent turns verbatim.
  3. Set your context budget. Allocate fixed space for system context and scratchpad, flexible space for working and archive context.
  4. Wire all three into a single build_context function that runs before every LLM call.
  5. Log your token usage per tier. This tells you where the budget is going and when compression needs to get more aggressive.
  6. Add external memory only when you hit one of the four triggers: multi-session persistence, large knowledge bases, cross-agent state, or audit requirements.
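Step 5 of the playbook can be a few lines on top of the same chars/4 heuristic; the tier names mirror the budget section above:

```python
def tier_usage(tier1: list[dict], tier2: list[dict], tier3: list[dict],
               budget: int = 16_000) -> dict:
    """Report tokens spent per tier and the remaining headroom."""
    def toks(msgs: list[dict]) -> int:
        return sum(len(m.get("content") or "") // 4 for m in msgs)

    usage = {"tier1": toks(tier1), "tier2": toks(tier2), "tier3": toks(tier3)}
    usage["total"] = sum(usage.values())
    usage["headroom"] = budget - usage["total"]
    return usage
```

Log the returned dict on every call; when `headroom` trends toward zero, it tells you which tier to compress before the agent starts degrading.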

The context window is a capable memory system. The three-buffer architecture is a way to use it deliberately instead of treating it as a log that eventually overflows.


Transparency note: this post was AI-assisted. The architecture, opinions, and technical recommendations are Vladimir's. The drafting process used an AI writing tool.