Fundamentals

Long-Term vs Short-Term Memory in AI Agents

Devasheesh Mishra··7 min read

Two Kinds of Forgetting

AI agents forget in two distinct ways.

The first is immediate: within a single long conversation, the model loses track of earlier messages as the context window fills up. Important facts mentioned at turn 3 are effectively invisible by turn 50. This is a short-term memory problem.

The second is structural: between sessions, the model has no state at all. Everything said yesterday is gone. The user has to re-introduce themselves, re-state their preferences, re-explain their situation. This is a long-term memory problem.

Most developers encounter the second one first. But both need architectural solutions.

Short-Term Memory: The Context Window

Short-term memory in AI agents is the context window — the set of tokens currently visible to the model. It includes system instructions, recent messages, tool definitions, and any retrieved information.

Short-term memory has three defining characteristics:

Volatile. It exists only for the duration of a single API call. Nothing persists between calls unless explicitly re-sent.

Bounded. Even models with million-token context windows have hard limits. And accuracy degrades well before those limits — the “lost in the middle” phenomenon means models effectively ignore information in the middle of long contexts.

Expensive. Every token in the context window costs money on every API call. Re-sending full conversation history causes quadratic cost growth.

Slow at scale. More tokens means more inference time. A 200-token prompt returns almost instantly. A 500,000-token prompt with dozens of sessions stuffed in takes noticeably longer — and that latency hits on every single exchange. For voice agents and real-time applications, this makes large context windows impractical even when you can afford them.

Practical Short-Term Memory Strategies

  • Sliding window: Keep only the last N messages. Simple but loses context from earlier in the conversation.
  • Summarization: Periodically summarize older messages into a compressed form. Reduces token count but introduces lossy compression — details get dropped.
  • Selective inclusion: Only include messages relevant to the current query. Requires a retrieval mechanism within the session.

All three are approximations. Short-term memory management is about making the best use of a finite, expensive resource.

Long-Term Memory: Persistent State

Long-term memory is what survives between sessions. It is the system that remembers the user’s name, preferences, history, and context — even if the last conversation was three months ago.

Long-term memory has different characteristics:

Persistent. Information is stored in an external system (database, knowledge graph, vector store) and retrieved on demand. It does not depend on the context window.

Evolving. User facts change over time. The user moves to a new city, changes jobs, develops new preferences. A long-term memory system must handle these updates without destroying history.

Structured. Raw conversation text is not memory. Long-term memory requires extraction — pulling structured facts, entities, and relationships out of unstructured conversations. The quality of this extraction determines everything downstream.

Selective. Not everything said in a conversation is worth remembering. The user mentioning the weather is noise. The user mentioning a medication allergy is critical. Long-term memory systems need filtering, not just logging.

Why Both Matter

Short-term and long-term memory serve different purposes, and one cannot substitute for the other.

You cannot use short-term memory as long-term memory. Stuffing the entire conversation history into the context window is the naive solution. It works for 5 sessions. It fails at 50. The cost is quadratic, the accuracy degrades, and there is no temporal model to distinguish current facts from outdated ones.

You cannot skip short-term memory. Long-term memory handles cross-session persistence, but within a session, the agent still needs coherent access to the current conversation. If a user says “as I mentioned earlier in this conversation” and the earlier message was truncated from the context window, long-term memory will not help — the fact was never persisted because the session is still ongoing.

The right architecture uses both:

  • Short-term: Manage the context window efficiently within each session using summarization and selective inclusion.
  • Long-term: Process completed sessions into structured knowledge that persists across conversations.

The Implementation Spectrum

From simplest to most sophisticated:

Level 0: No Memory

Every session starts from scratch. The model has no history. This is the default for most LLM applications.

Level 1: Session Persistence

Conversation history is saved and re-sent on the next session. Simple but costs grow quadratically and accuracy degrades with history length.

Level 2: Vector Store

Facts are extracted and stored as vector embeddings. Semantic search retrieves relevant facts at query time. This handles basic recall but has no temporal model, no conflict resolution, and no version history. This is essentially RAG applied to conversation history.

Level 3: Knowledge Graph

Memories are organized as a graph of entities and relationships. Richer queries are possible — traversing connections, understanding relationships between facts. But most graph implementations still treat updates as overwrites without versioning.

Level 4: Versioned Memory Graph

The most advanced approach. Each fact maintains a version chain — a linked history of how it evolved over time. Dual timestamps distinguish when events occurred from when the system learned about them. Conflicts are detected and resolved explicitly rather than silently overwritten.

The Recallr Architecture

Recallr bridges short-term and long-term memory through a two-loop design:

The asynchronous curation loop (long-term): After each session completes, a background pipeline extracts structured knowledge — entities, relationships, temporal metadata — and integrates it into the versioned memory graph. Deduplication, conflict detection, and version chaining happen here. The user never waits for this processing.

The synchronous retrieval loop (short-term augmentation): When the user starts a new session, relevant long-term memories are injected into the context alongside recent conversation history and session summaries. Three retrieval strategies let developers control depth:

  • Low-Latency (median 299ms): Direct vector search with NER-based keyword filtering. No LLM calls. Built for voice agents.
  • Balanced (median 1.2s): LLM-assisted query analysis with temporal filtering and deeper search.
  • Agentic (median 7s): An autonomous agent iteratively explores the memory graph until it has high confidence it found everything relevant.

The key insight: because the curation loop does thorough structuring asynchronously, even the fast retrieval strategies achieve strong accuracy. On LongMemEval, Low-Latency scores 87.8% — higher than Mem0’s best strategy (62.6%) despite being significantly faster. Full benchmark results.

Practical Recommendations

  1. If your agent has fewer than 10 sessions per user: Simple session persistence may be sufficient. The cost is manageable and the accuracy degradation is minimal.

  2. If your agent has 10-100 sessions per user: You need a memory layer. The cost of naive history re-sending is already significant, and temporal reasoning becomes necessary.

  3. If your agent has 100+ sessions or handles sensitive domains (health, legal, education): You need versioned memory with conflict resolution. Silent overwrites and lost history are not acceptable when memory errors have real consequences.

See how Recallr integrates in two lines of code.