BenchmarkMem0Comparison

Recallr vs Mem0: Benchmark Comparison for AI Agent Memory

·10 min read

The question every AI developer asks

When you need to add long-term memory to an AI agent, two names come up immediately: Mem0 and Recallr. Mem0 has the brand recognition — YC-backed, $24M Series A, tens of thousands of GitHub stars. Recallr has the benchmark numbers. This post goes through the actual data so you can make an informed decision.

We will cover accuracy, temporal reasoning, latency, developer experience, and architecture. All benchmark numbers cited here are from the LongMemEval open benchmark, which we have fully open-sourced with reproducible evaluation scripts.

LongMemEval: the benchmark that matters

LongMemEval is the closest thing the industry has to a standard evaluation for long-term conversational memory. It tests 500 questions across six task types that represent real production workloads:

Task TypeWhat it tests
Single-Session UserRecall facts from a single conversation
Single-Session PreferenceRemember stated preferences
Knowledge UpdateTrack facts that change over time
Temporal ReasoningReason about *when* things happened
Multi-SessionPersist memory across separate conversations
Single-Session AssistantRecall what the assistant itself said

A memory system that scores well on simple recall but fails on temporal reasoning or knowledge updates will cause real problems in production — especially in healthcare, legal, or long-running support workflows where facts evolve.

The numbers

Task TypeRecallrMem0Mem0 GraphSupermemory
Single-Session User100%90%90%30%
Single-Session Preference100%90%90%20%
Knowledge Update97.4%76.9%75.6%60.3%
Temporal Reasoning97.0%48.9%50.4%27.1%
Multi-Session89.5%65.4%63.2%35.3%
Single-Session Assistant100%19.6%19.6%3.6%
Overall97.5%65.4%64.9%~30%

Recallr scores 97.5% overall. Mem0 scores 65.4%. That is a 32 percentage-point gap on the same benchmark, with open, reproducible evaluation scripts anyone can run.

The gaps are not uniform. On simple recall (single-session facts and preferences), Mem0 performs reasonably well at 90%. The failures compound on harder tasks.

Where Mem0 breaks down

Temporal reasoning: 48.9% vs 97.0%

This is the most important gap for any agent that serves users over time. Temporal reasoning means understanding *when* something was true, not just *what* was said.

A concrete example: a user tells your agent "I am in my freshman year of undergrad" in 2024, then "I am in my sophomore year" in 2025. A system without temporal reasoning either creates a duplicate memory (user is freshman AND sophomore) or silently overwrites the old fact with no understanding of why it changed.

Recallr's architecture treats every memory entity as a versioned node in a knowledge graph. Each update creates a new version with a "supersedes" edge to the previous version, along with two timestamps: event time (when the fact was true) and ingestion time (when the system learned about it). The result is a full provenance trail — the system knows what changed, when it changed, and why.

Mem0's architecture is closer to a key-value store with semantic search. It handles simple fact updates reasonably well but has no model of time. Queries like "what was the user's situation last year?" have no answer because the previous state was either overwritten or is disconnected from the current one.

Knowledge Update: 76.9% vs 97.4%

When a user corrects a previous statement, Mem0 updates the stored fact about 77% of the time. The 23% failure rate means roughly one in four corrections is silently ignored or mishandled. In healthcare or legal contexts, a 23% failure rate on fact correction is not acceptable.

Recallr's conflict resolution system classifies contradictions into four categories before resolving them: temporal update (facts changed over time), correction (user is fixing a mistake), preference change (opinions evolved), and true contradiction (needs clarification). Each type has a different resolution strategy. For ambiguous conflicts, Recallr can trigger a webhook that asks the user to clarify directly — a human-in-the-loop feature no other memory provider currently offers.

Single-Session Assistant: 19.6% vs 100%

This task type tests whether the memory system can recall what the *assistant* said in previous sessions — not just what the user said. Examples: "what restaurants did we decide on for my NYC trip last year?" or "what did you suggest I do about my shoulder pain in March?"

Mem0 scores 19.6% on this category. Recallr scores 100%. This is not a marginal difference — it is a completely different capability. Recallr stores assistant responses as first-class memory entities alongside user statements. Mem0 treats the conversation primarily from the user's perspective.

Multi-Session: 65.4% vs 89.5%

Mem0's multi-session recall drops to 65.4% — meaning it fails to retrieve relevant cross-session context about a third of the time. For any agent with regular returning users, this is the number that determines whether the experience feels personalized or amnesia-prone.

Latency comparison

Mem0 has a single recall mode with a median (p50) latency of ~800ms. There is no way to trade accuracy for speed or vice versa.

Recallr ships three modes that let you choose the right tradeoff for your use case:

ModeP50 LatencyBest for
Low-Latency~400msVoice agents, real-time chat
Balanced~1,200msStandard chatbots, copilots
Agentic~8sBackground agents, deep research

For voice agents, 800ms is often too slow for a natural conversation flow. The 400ms Low-Latency mode was built specifically for this constraint, and it still achieves 97% accuracy on temporal reasoning in that mode.

For background agents that run asynchronously, the Agentic mode provides accuracy that matches the benchmark ceiling — including the 100% score on Single-Session Assistant queries.

Developer experience

Both Mem0 and Recallr offer Python and TypeScript SDKs. The integration models differ significantly.

Mem0 requires you to explicitly call m.add() after each exchange and m.search() before each LLM call. This means modifying your agent's core loop, tracking user IDs manually, and handling the memory injection yourself.

Recallr uses a forward proxy model. Change one line — the base_url pointing to your LLM provider — and Recallr intercepts every call automatically. Memory extraction and context injection happen without additional code. Your existing agent logic is unchanged.

# Before: direct OpenAI call
client = OpenAI(api_key="...")

# After: Recallr proxy (one line changed) client = OpenAI( api_key="...", base_url="https://api.recallrai.com/v1", default_headers={"X-Recallr-User-ID": user_id} ) `

The SDK design also differs. Mem0's SDK uses a generic search() function with untyped keyword arguments — developers have no IDE guidance on what parameters are available. Recallr's SDK exposes fully typed functions with named parameters, default values, and documented return types. Both SDKs were designed differently: Mem0 used Stainless for SDK generation; Recallr's SDKs were hand-written.

Architecture differences

Mem0 uses a hybrid architecture: vector embeddings for semantic search, optional graph layer, and key-value storage for facts. It works well as a semantic fact store but does not model time or entity relationships deeply.

Recallr uses a versioned knowledge graph as its core data structure. Every memory entity is a node. Relationships between entities are edges. Updates create new versioned nodes rather than overwriting existing ones. Temporal metadata is first-class. The graph can be queried at any point in time — "what did the system know about this user in January?" is a valid, answerable query.

This architectural difference is why temporal reasoning scores are so far apart. It is not a tuning gap — it is a structural one.

When to choose Mem0

Mem0 is a reasonable choice if:

  • You need the widest ecosystem integrations and community support
  • Your use case is primarily simple fact retrieval (preferences, basic profile data)
  • You are prototyping and want to move fast with minimal setup
  • Your users have short session histories and minimal context evolution over time

When to choose Recallr

Choose Recallr if:

  • Temporal reasoning matters — users' situations change and your agent needs to track that
  • You are building for healthcare, legal, education, or sales where memory accuracy has real consequences
  • You need low-latency recall for voice agents
  • You want a drop-in proxy integration rather than modifying your agent loop
  • You need human-in-the-loop conflict resolution for contradictory facts
  • Your users will have long histories (months, years) with evolving context

The bottom line

Mem0 wins on ecosystem and community adoption. Recallr wins on every accuracy metric that matters for production-grade agents. The 32 percentage-point gap on LongMemEval is not a marginal difference — it represents failures that show up as user-facing bugs: agents that forget important context, mishandle corrections, or cannot answer questions about the past.

If you are building a prototype or a simple assistant where occasional memory failures are acceptable, Mem0 is easier to get started with. If you are building production agents where memory quality directly affects user trust, the benchmark data points clearly to Recallr.

View the full benchmark results with reproducible scripts.