BenchmarkZepComparison

Recallr vs Zep: Which Memory Layer Wins on Temporal Reasoning?

·9 min read

Two companies that took temporal reasoning seriously

Most AI memory layers treat time as an afterthought. Mem0 scores 48.9% on temporal reasoning. Supermemory scores 27.1%. Both Zep and Recallr took a different approach — building knowledge graph architectures specifically designed to understand how facts evolve over time.

Zep created Graphiti, an open-source temporal knowledge graph engine that has attracted significant attention in the developer community. Recallr built a versioned knowledge graph from the ground up, submitted a 100+ page research paper to ICML, and scored 97.5% on LongMemEval — the highest publicly verified score in the category.

This post compares the two architectures, their benchmark results, and where each system fits in practice.

Benchmark results

Zep does not publish a complete LongMemEval score. Their Graphiti architecture corresponds to the "Mem0 Graph" category in our open benchmark results, which uses the same graph-based approach Graphiti pioneered.

Task TypeRecallrMem0 Graph (Graphiti architecture)Mem0 (standard)
Single-Session User100%90%90%
Single-Session Preference100%90%90%
Knowledge Update97.4%75.6%76.9%
Temporal Reasoning97.0%50.4%48.9%
Multi-Session89.5%63.2%65.4%
Single-Session Assistant100%19.6%19.6%
Overall97.5%~65%65.4%

The graph-based architecture improves temporal reasoning slightly over standard vector approaches — 50.4% vs 48.9%. But Recallr's 97% on temporal reasoning is not a marginal improvement. It represents a fundamentally different capability.

Why graph architectures still struggle with time

Zep's Graphiti and similar graph-based approaches represent knowledge as nodes and edges, which is an improvement over flat vector search. But the standard graph model has a critical limitation for temporal reasoning: it treats the current state of the graph as ground truth.

When a fact changes, a graph system typically either: 1. Overwrites the existing node with the new value, losing history, or 2. Creates a new node and marks the old one as deprecated, but has no explicit time model for *when* the change occurred

This is why Graphiti scores 50.4% on temporal reasoning rather than 97%. The graph structure models *what* entities exist and *how* they relate, but not *when* facts were true or the causal chain that led to changes.

Recallr's approach: versioned knowledge graph

Recallr's architecture introduces two innovations on top of the standard knowledge graph model.

Dual timestamps on every entity

Every memory entity in Recallr carries two timestamps:

  • Event time: when the fact was actually true in the real world
  • Ingestion time: when the system learned about it

This distinction is what makes temporal queries answerable. When a user says "I moved to Delhi last month," the event time is last month — not today when they mentioned it. A system without event time has no way to answer "where was the user living six months ago?" correctly.

Version chains instead of overwrites

When information changes, Recallr does not update the existing node. It creates a new version node with a "supersedes" edge to the previous version. The old version is archived, not deleted.

This means: - The full history of any fact is always available - "What did the system know about this user in January?" is a valid query - The reason for a change (temporal update vs correction vs preference change) is preserved - Audit trails are complete

A standard graph — including Graphiti — does not have this property by default. Update semantics overwrite the existing state.

The conflict resolution gap

Zep handles fact contradictions through Graphiti's graph update logic — when a new fact contradicts an existing node, the graph resolves it according to recency or confidence rules.

Recallr adds a layer that no other memory provider currently offers: human-in-the-loop conflict resolution via webhooks. When Recallr detects a true contradiction that cannot be resolved automatically, it sends a multiple-choice question to the user through a webhook. The user's answer then updates the knowledge graph with explicit human confirmation.

This matters most in high-stakes contexts. In a healthcare agent, if a patient says "I've never had surgery" after previously mentioning a procedure, automatic resolution in either direction could be wrong. Recallr's webhook system surfaces the contradiction to the right person rather than making a unilateral decision.

Latency: a critical practical difference

Zep does not publish detailed latency percentiles publicly. Based on community reports, graph-based recall adds meaningful overhead compared to vector-only approaches.

Recallr ships three explicit latency modes:

ModeP50 LatencyUse case
Low-Latency~400msVoice agents, real-time interfaces
Balanced~1,200msStandard chatbots and copilots
Agentic~8sBackground agents, long-running tasks

The 400ms Low-Latency mode still achieves 97% accuracy on temporal reasoning. This is important: you do not have to choose between speed and temporal correctness. Most knowledge graph approaches cannot offer this because graph traversal is inherently more expensive than vector similarity search. Recallr achieves the low-latency mode through architectural optimizations including a custom clustering embeddings model built in-house.

Integration model

Both systems require integration work, but the approach differs.

Zep integrates at the application layer. You call Zep's memory API explicitly to add memories and retrieve context, then inject that context into your LLM calls manually. This gives you control but requires modifying your agent's core loop and maintaining the integration logic.

Recallr uses a forward proxy model. You change one line — the base URL for your LLM provider — and Recallr handles everything automatically. No explicit add/search calls. No manual context injection. Your existing agent code is unchanged.

# Before Recallr
client = OpenAI(api_key="sk-...")

# After Recallr (one line) client = OpenAI( api_key="sk-...", base_url="https://api.recallrai.com/v1", default_headers={"X-Recallr-User-ID": user_id} ) `

This is a meaningful difference for teams that want to ship fast. The proxy model also means Recallr works identically with OpenAI, Anthropic, and Gemini — no provider-specific integration work.

Enterprise and self-hosting

Zep has strong enterprise traction and offers an enterprise tier. Recallr supports full self-hosting for customers who need complete data control — useful for healthcare, legal, and financial services applications where data residency requirements prevent cloud-only deployments. This is something neither Zep nor Mem0 currently offer at the same level.

Research backing

Recallr has a 100+ page research paper submitted to ICML (Core A* publication). The benchmark runs are fully open-sourced at github.com/recallrai/benchmarks with all evaluation JSON files and reproduction scripts. Zep has published Graphiti as open source, which has helped drive adoption and community trust — a different but equally valid approach to establishing credibility.

When to choose Zep

Zep is a strong choice if:

  • You need enterprise support and an established vendor track record
  • You are already using LangChain or LangGraph and want tight ecosystem integration
  • Graphiti's open-source nature gives you confidence to audit and modify the underlying memory engine
  • Your use case is primarily cross-session recall without demanding temporal reasoning

When to choose Recallr

Choose Recallr if:

  • Temporal reasoning accuracy is a core requirement (97% vs 50.4%)
  • You are building voice agents where 400ms latency matters
  • You need human-in-the-loop conflict resolution
  • You want a zero-code-change proxy integration rather than explicit API calls
  • You need self-hosting for data control and compliance
  • You want the highest publicly verified LongMemEval score in the category

The core tradeoff

Zep and Graphiti are well-engineered, open-source, and have genuine enterprise traction. For teams that value ecosystem maturity and community transparency, they are legitimate choices.

But on the benchmark that matters — temporal reasoning — the gap is 46 percentage points. That gap is not about tuning. It is about whether the architecture models time as a first-class concept or as an afterthought. Recallr's versioned knowledge graph with dual timestamps was designed specifically to close that gap.

See the full benchmark breakdown with open-source reproduction scripts.