AI-Agents 04
Architectural Persistence: Efficient Management of Short-term and Long-term Memory in Long-lived Agentic Systems

Michael Schöffel
February 19, 2026 | 20 min. read
Content
- 1. Summary
- 2. Introduction
- 3. The Matter of Memory: Functional Differentiation of STM and LTM
- 4. The Physical Memory Hierarchy and the GPU Bottleneck
- 5. Context Management in STM: Strategies Against Information Loss
- 6. Virtual Context Management: The Operating System Paradigm
- 7. Long-Term Memory (LTM): Structured Persistence Architectures
- 8. Temporal Intelligence: Bi-Temporal Graphs and State Versioning
- 9. The Mathematics of Forgetting and Remembering: Scoring and Retrieval Logic
- 10. Quantifying Memory: Benchmarking Agentic Memory Systems
- 11. Research and New Horizons: E-mem and Episodic Context Reconstruction
- 12. Practical Example: The HMLR System (Hierarchical Memory Lookup & Routing)
- 13. The Dark Side of Persistence: Privacy and the "Right to Be Forgotten"
- 14. Curated Memetics: Human-in-the-Loop in Memory Management
- 15. Case Studies: XBOW, PentestGPT, and CAI in Practice
- 16. The Economics of Memory: Cost-Benefit Analysis of Background Curation
- 17. Conclusion: The Strategic Importance of Memory Architecture
- Further Articles
1. Summary
Long-lived AI agents require a carefully designed memory architecture that goes far beyond simply storing conversation histories. Short-term memory (STM) is realized through the physical context window of the LLM and the KV cache on the GPU - a fast but limited and volatile resource. Long-term memory (LTM) is divided into episodic, semantic, and procedural memory, persisted in external vector databases and graph databases. Even with context windows of several million tokens, dedicated memory management remains necessary, as increasing token volumes degrade instruction-following quality and linearly increase costs.
The physical memory hierarchy of modern AI infrastructure spans from GPU HBM (G1) through CPU RAM (G2) to flash-based ICMS layers (G3.5) and disaggregated network storage (G4). Three strategies dominate STM management: sliding window with truncation for simple scenarios, recursive summarization for the hierarchical compression of long histories, and virtual context management following the MemGPT paradigm, in which the LLM manages its own context like an operating system kernel via paging, eviction, and interrupt mechanisms.
For LTM, vector databases (Pinecone, FAISS) are available for semantic similarity searches and knowledge graphs (Neo4j, Neptune) for structural multi-hop reasoning. Knowledge graphs enable queries such as "all servers in the same subnet as a compromised system" that cannot be resolved with pure vectors. Bi-temporal database models (Graphiti) track both valid time and transaction time of each piece of information, enabling point-in-time queries and the detection of outdated facts. A Truth Maintenance System (TMS) invalidates all dependent inferences in a cascade as soon as a base fact is refuted.
Memory prioritization is handled via an RFM formula combining recency (exponential decay), frequency (retrieval count), and importance (an explicitly assigned score). The "lost-in-the-middle" phenomenon shows that LLMs tend to ignore information in the middle of long contexts - critical facts must therefore be placed at the edges of the prompt. Concrete systems such as XBOW use a persistent coordinator with a central knowledge graph and short-lived solver agents that receive only minimal context snapshots. PentestGPT structures the entire memory as a Pentesting Task Tree (PTT). GitHub Copilot validates memories against the current codebase and automatically discards outdated citations.
More recent research approaches such as E-mem avoid the destructive de-contextualization of classical RAG systems through uncompressed episode segments and specialized assistant agents that derive local evidence from complete contexts - while reducing token costs by over 70%. The HMLR system combines short-term bridge blocks with a gardener process that promotes important facts into persistent dossiers, along with a governor agent and context hydrator for runtime-optimized prompt assembly. From a regulatory perspective, long-lived memory systems require TTL-based expiry control, ownership tags, and cross-context isolation to ensure GDPR compliance and cross-tenant data separation. Human-in-the-loop interfaces with memory editing, pinning, and audit logs ultimately enable traceable correction of erroneous decisions - a prerequisite for explainable memory in production agentic systems.

2. Introduction
The transformation of artificial intelligence from stateless language models toward autonomous, long-lived agents represents one of the most significant challenges in contemporary computer science. While early LLM implementations primarily acted as transient inference engines that treated each request in isolation, modern agentic systems such as XBOW, PentestGPT, or CAI demand a form of continuity that extends beyond individual sessions [1]. An agent's memory is far more than a simple data store; it is the structural foundation for reasoning, strategic planning, and the ability to learn from past interactions [2]. An agent without effective memory management is not viable in a productive enterprise environment, as session-based contexts inevitably lead to repetitive mistakes, inconsistent outputs, and an inability to intelligently orchestrate complex, multi-step workflows [3].
This fourth analysis in the series on long-lived AI agents focuses on the technological realization of memory architectures. Building on the cognitive archetypes and the architectural spectrum of agentic systems discussed previously, this article examines how the physical constraints of context windows can be overcome through innovative concepts drawn from operating system theory and graph research.
With the emergence of models like Gemini 1.5 Pro supporting context windows of up to 2 million tokens, a radical question arises: is complex management of LTM and STM even necessary anymore? The answer is a clear yes, grounded in token economics and reasoning density [4].
While a model can theoretically hold entire libraries in context, precise instruction following degrades as the context fills [5]. Moreover, both cost and latency increase linearly with context length. A dedicated memory architecture acts as a "cognitive filter" here. Rather than flooding the model with raw data, the LTM delivers only the distilled essence. We are moving away from "brute-force context" toward hierarchical contextualization, where the enormous window serves only as a temporary cache while durable knowledge resides in structured graphs and vectors.
At its core, the challenge is overcoming the "context amnesia" problem, in which agents lose track of the causal relationships between their actions despite enormous token limits [1]. The systems examined here must be capable not only of storing information across days, weeks, or months, but also of organizing it in a way that enables precise and cost-efficient retrieval [6].
3. The Matter of Memory: Functional Differentiation of STM and LTM
In the architecture of agentic systems, the distinction between short-term memory (STM) and long-term memory (LTM) is often described through the analogy of the human brain or classical computer architectures. This separation is not merely conceptual; it defines the mode of data processing and the choice of storage technologies [3]. While STM is responsible for immediate action control, LTM serves as a persistent archive of knowledge and experience [3].
3.1 Short-Term Memory (STM) and Working Memory
An agent's short-term memory is functionally equivalent to a computer's RAM. It encompasses all information immediately accessible to the model during an active reasoning loop. Technically, this is realized through the LLM's context window, with data held in the Key-Value (KV) cache of the GPU [6]. This "working memory" is extremely fast, but also extremely expensive and physically limited in capacity [6]. A key characteristic of STM is its volatility: once an inference is complete or the context limit is exceeded, information is lost unless explicitly transferred to LTM [7]. In long-lived systems, STM must therefore be managed as a dynamic buffer containing only the instructions, observations, and thoughts relevant to the current step [3].
3.2 Long-Term Memory (LTM): Episodic, Semantic, and Procedural
Long-term memory, by contrast, is the layer of persistence. It enables agents to retain information across different sessions and tasks [3]. Research distinguishes three essential types:
| Memory Type | Description | Example in Agentic Systems |
|---|---|---|
| Episodic Memory | Stores specific events, interactions, and the history of past actions. | Record of a specific pentesting scan from the previous day [3]. |
| Semantic Memory | Contains generalized knowledge, facts, and stable rules about the world or a domain. | Knowledge about how the HTTP protocol works or corporate policies [8]. |
| Procedural Memory | Encodes "how-to" knowledge, workflows, and proven strategies (skills). | Scripts for standard exploits or workflow patterns in LangChain [8]. |
LTM is typically realized through external databases, with vector databases dominating for semantic similarity searches and graph databases for structural relations [9]. A long-lived agent must be capable of continuously "evicting" information from STM into LTM and "paging" relevant memories back into STM on demand [10].
```python
# Example: Structuring the three LTM types in a simple
# Python data structure as an agent might use internally.
import time
from dataclasses import dataclass, field
from typing import Literal

MemoryType = Literal["episodic", "semantic", "procedural"]

@dataclass
class MemoryEntry:
    content: str
    memory_type: MemoryType
    timestamp: float = field(default_factory=time.time)
    importance: float = 1.0  # normalized 0.0 - 1.0

class LongTermMemory:
    def __init__(self):
        self._store: list[MemoryEntry] = []

    def store(self, content: str, memory_type: MemoryType, importance: float = 1.0):
        entry = MemoryEntry(content=content, memory_type=memory_type, importance=importance)
        self._store.append(entry)
        print(f"[LTM] Stored ({memory_type}): {content!r}")

    def retrieve_by_type(self, memory_type: MemoryType) -> list[MemoryEntry]:
        return [e for e in self._store if e.memory_type == memory_type]

# --- Usage ---
ltm = LongTermMemory()

# Episodic: specific events
ltm.store("Nmap scan on 10.0.1.5 found port 80 open (2026-02-18 09:00)", "episodic", importance=0.9)

# Semantic: general domain knowledge
ltm.store("HTTP uses port 80 by default; HTTPS port 443", "semantic", importance=0.7)

# Procedural: proven workflows / skills
ltm.store("Standard recon workflow: nmap -sV -> gobuster -> nikto", "procedural", importance=0.8)

print("\nEpisodic memories:", ltm.retrieve_by_type("episodic"))
```

4. The Physical Memory Hierarchy and the GPU Bottleneck
The shift to agentic AI moves the bottleneck of modern infrastructure from raw compute toward memory capacity and bandwidth [6]. Since agentic workflows often accumulate millions of tokens across many interaction steps, conventional memory hierarchies break down. The KV cache, which holds the key and value tensors of already-processed tokens so that attention does not have to recompute them, grows linearly with sequence length and occupies valuable space in the High-Bandwidth Memory (HBM) of GPUs [6].
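To make this linear growth tangible, here is a back-of-the-envelope sizing sketch; the model dimensions (80 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions in the ballpark of a 70B-class model, not figures for any specific deployment:

```python
# Rough KV-cache sizing: per token, every layer stores one key and one
# value vector per KV head. All model dimensions are illustrative assumptions.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # Factor 2: keys AND values; bytes_per_value=2 assumes fp16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gib = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1_000_000) / 2**30
print(f"KV cache for a 1M-token context: ~{gib:.0f} GiB")  # ~305 GiB at fp16
```

A single million-token session can thus exceed the HBM of even the largest accelerators, which is precisely why a tiered hierarchy is needed.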
To address this bottleneck, new memory classes and architectures are emerging:
- G1 to G4 Tiers: The hierarchy spans from ultra-fast HBM (G1) through CPU RAM (G2) to local SSDs (G3) and network storage (G4) [6].
- Inference Context Memory Storage (ICMS): A new "G3.5" tier proposed by vendors such as NVIDIA. It consists of Ethernet-attached flash storage optimized specifically for streaming KV caches [6].
- Disaggregated Memory: Instead of binding memory tightly to individual GPUs, memory pools are shared over high-speed interconnects such as CXL (Compute Express Link) [6]. This eliminates redundancy and allows an agent to quickly migrate context data between compute nodes.
This technological foundation is critical for systems like XBOW, which run thousands of focused agents in parallel, with a central coordinator managing the global context and distributing specific knowledge fragments to short-lived "solvers" [11].
5. Context Management in STM: Strategies Against Information Loss
An LLM's context window is analogous to the documents laid out on a desk: you can only work with what is directly in view [12]. Once the desk is full, old documents must be cleared away or summarized. In practice, three primary patterns for this management have emerged in agentic systems.
5.1 Sliding Window and Truncation
The simplest method is the sliding window. The oldest tokens are automatically discarded once the limit is reached [17]. While technically trivial, this often leads to the catastrophic loss of system instructions or initial goals in long-lived tasks. A refined variant is "observation masking", in which unimportant details are replaced with placeholders to preserve the structural integrity of the interaction history while saving tokens [13]. Studies show that this masking is often more efficient and cost-effective than complex summarization, as it improves precision on long-horizon tasks [13].
```python
# Example: Sliding-window context with a fixed token budget.
# The system prompt is always protected; oldest messages
# are dropped first when the limit is exceeded.

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters ≈ 1 token (GPT rule of thumb)."""
    return max(1, len(text) // 4)

def sliding_window_context(
    system_prompt: str,
    history: list[dict],
    max_tokens: int = 4096,
) -> list[dict]:
    """
    Returns a trimmed context window.
    - system_prompt is always protected.
    - Oldest messages are removed until the budget fits.
    """
    system_tokens = estimate_tokens(system_prompt)
    budget = max_tokens - system_tokens

    # Build from the back (newest first)
    trimmed: list[dict] = []
    used = 0
    for msg in reversed(history):
        tokens = estimate_tokens(msg["content"])
        if used + tokens > budget:
            break  # Budget exhausted - drop older messages
        trimmed.insert(0, msg)
        used += tokens

    return [{"role": "system", "content": system_prompt}] + trimmed

# --- Demo ---
system = "You are a pentest agent. Target: 10.0.1.0/24."
msgs = [
    {"role": "user", "content": "Start an Nmap scan."},
    {"role": "assistant", "content": "Nmap running... Port 80 open on 10.0.1.5."},
    {"role": "user", "content": "Check for known CVEs for Apache."},
    {"role": "assistant", "content": "CVE-2024-XXXX found. Exploit available."},
    {"role": "user", "content": "Execute the exploit."},
]

# Deliberately tight budget: the three oldest messages are dropped.
context = sliding_window_context(system, msgs, max_tokens=30)
for m in context:
    print(f"[{m['role']}] {m['content'][:60]}")
```

5.2 Recursive Summarization and Semantic Compression
In recursive summarization, a portion of the context is condensed by a model whenever a threshold (e.g. 75%) is reached [14]. This process produces a hierarchical structure of summaries:
- Level 0: Raw interaction (e.g. 10,000 tokens).
- Level 1: Summaries of blocks of 2,000 tokens each.
- Level 2: Summary of the Level-1 summaries [15].
```python
# Example: Recursive summarization.
# Simulates the two-level condensation process without a real LLM call.

def summarize(text: str) -> str:
    """
    Placeholder for a real LLM call, e.g.:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}]
        )
        return response.choices[0].message.content
    """
    # Simplified demo: truncates to first 80 characters
    return text[:80] + "…" if len(text) > 80 else text

def recursive_summarize(
    tokens: list[str],
    block_size: int = 3,
    threshold: float = 0.75,  # in a live agent, compression would trigger once the
                              # context is ~75% full; unused in this offline demo
) -> str:
    # Level 1: Summarize block by block
    level1 = []
    for i in range(0, len(tokens), block_size):
        block = " ".join(tokens[i : i + block_size])
        level1.append(summarize(block))
        print(f"  [L1-Block {i // block_size + 1}] {level1[-1]}")

    # Level 2: Merge all L1 summaries into a meta-summary
    meta = summarize(" | ".join(level1))
    print(f"\n  [L2-Meta] {meta}")
    return meta

# --- Demo ---
raw_chunks = [
    "Nmap scan started. Target: 10.0.1.0/24.",
    "Port 80 open on 10.0.1.5. Apache 2.4 detected.",
    "CVE-2024-XXXX affects Apache 2.4.50. CVSS 9.8.",
    "Exploit module loaded. Payload: reverse_shell.",
    "Shell obtained on 10.0.1.5. User: www-data.",
    "Privilege escalation attempt via sudo -l started.",
]

print("=== Recursive Summarization ===")
result = recursive_summarize(raw_chunks, block_size=2)
```

"Recursive Semantic Compression" (RSC) goes one step further by attempting to preserve meaning (intent) rather than merely shortening words [16]. This is particularly valuable for coding agents such as Claude Code or GitHub Copilot, which do not need an exact reconstruction of every character but must understand the functional logic of a code module [17]. A significant drawback, however, is "context rot" - through repeated summarization, fine nuances can be lost, potentially leading to hallucinations [14].
6. Virtual Context Management: The Operating System Paradigm
One of the most influential concepts for agentic memory architectures is "virtual context management", prominently introduced by MemGPT (now Letta) [10]. The core idea is to program the LLM to actively manage its own context, analogous to an operating system kernel managing RAM and disk [10].
6.1 Paging, Swapping, and Interrupts
MemGPT divides memory into several logical areas within the physical context window:
- System Instructions: Fixed rules that are never deleted.
- Working Context: A writable area for current facts about the user or the task.
- FIFO Queue: A buffer for the ongoing conversation history [10].
When space in the FIFO queue runs low, the system sends a "memory pressure" signal to the agent [10]. The agent can then decide to move important information via function calls (edit_memory, archive_memory) into the working context or into an external long-term archive (archival storage) [10]. Information that is no longer actively in the window but may be needed later is kept in a "recall storage" [10].
This paradigm enables "unlimited" contexts: the agent "sees" only a fraction of its full history, but has the tools to deliberately search its own past (paging in) or shed unnecessary ballast (swapping out) [10]. A feedback loop supports this, in which the system alerts the agent via interrupts when capacity limits are reached, making the agent the curator of its own memory [10].
```python
# Example: Minimal MemGPT-like memory management.
# Demonstrates the core operations edit_memory, archive_memory, and recall.

from collections import deque

CONTEXT_TOKEN_LIMIT = 100  # deliberately small so the demo triggers memory pressure

class MemGPTAgent:
    """Simplified replica of the MemGPT memory model."""

    def __init__(self):
        self.system_instructions = "You are a pentest agent. Target: 10.0.1.0/24."
        self.working_context: dict[str, str] = {}  # writable KV area
        self.fifo_queue: deque[str] = deque()      # ongoing conversation
        self.archival_storage: list[str] = []      # long-term archive
        self.recall_storage: list[str] = []        # recall buffer

    def _token_estimate(self) -> int:
        all_text = self.system_instructions
        all_text += " ".join(self.working_context.values())
        all_text += " ".join(self.fifo_queue)
        return len(all_text) // 4

    def add_observation(self, text: str):
        self.fifo_queue.append(text)
        print(f"[FIFO] +Observation | Tokens ~{self._token_estimate()}")
        if self._token_estimate() > CONTEXT_TOKEN_LIMIT * 0.75:
            self._handle_memory_pressure()

    def edit_memory(self, key: str, value: str):
        """Writes a fact into the working context (persistent in window)."""
        self.working_context[key] = value
        print(f"[edit_memory] {key!r} = {value!r}")

    def archive_memory(self, text: str):
        """Moves a text entry into the long-term archive."""
        self.archival_storage.append(text)
        print(f"[archive_memory] Archived: {text[:60]!r}")

    def recall(self, query: str) -> list[str]:
        """Simple keyword search in the archive (placeholder for vector search)."""
        results = [e for e in self.archival_storage if query.lower() in e.lower()]
        self.recall_storage = results
        return results

    def _handle_memory_pressure(self):
        """Interrupt: evict oldest FIFO entries into the archive."""
        print("[⚠ Memory Pressure] Eviction started…")
        while self.fifo_queue and self._token_estimate() > CONTEXT_TOKEN_LIMIT * 0.5:
            oldest = self.fifo_queue.popleft()
            self.archive_memory(oldest)

# --- Demo ---
agent = MemGPTAgent()
agent.edit_memory("target", "10.0.1.5")
agent.edit_memory("open_ports", "80, 443")

for i in range(8):
    agent.add_observation(f"Step {i}: Testing endpoint /api/v{i} for SQLi patterns.")

results = agent.recall("SQLi")
print(f"\nRecall 'SQLi': {len(results)} result(s) found.")
```

7. Long-Term Memory (LTM): Structured Persistence Architectures
While STM ensures immediate action capability, LTM must organize accumulated knowledge across the entire project or user lifecycle. Two technological approaches compete here, increasingly merging in hybrid forms.
7.1 Vector Databases and RAG: The World of Similarity
Vector databases such as Pinecone or Milvus store data as high-dimensional embeddings [18]. Retrieval is performed via mathematical similarity (typically cosine similarity). This approach is excellent for unstructured data and scales effortlessly to millions of entries [19]. However, vector databases hit limits in agentic systems. Because they store information in isolated chunks, relational context is often lost [18]. An agent may find a document that "sounds similar" but fail to recognize that this document was invalidated by a more recent decision two hours ago [1]. Moreover, restricting results to top-K entries means answers are often incomplete when information is spread across many documents [19].
```python
# Example: Cosine similarity search without external dependencies
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(v: list[float]) -> float:
    return math.sqrt(sum(x ** 2 for x in v))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(θ) = (A · B) / (|A| × |B|)"""
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom else 0.0

def top_k_retrieval(
    query_vec: list[float],
    memory: list[dict],
    k: int = 3,
) -> list[dict]:
    """Returns the k most similar entries."""
    scored = [
        {**entry, "score": cosine_similarity(query_vec, entry["vector"])}
        for entry in memory
    ]
    return sorted(scored, key=lambda x: x["score"], reverse=True)[:k]

# --- Demo (simplified 4-dim dummy vectors) ---
memory_store = [
    {"text": "Port 80 on 10.0.1.5 is open.", "vector": [0.9, 0.1, 0.0, 0.3]},
    {"text": "CVE-2024-XXXX affects Apache 2.4.", "vector": [0.8, 0.4, 0.1, 0.2]},
    {"text": "Privilege escalation via sudo possible.", "vector": [0.1, 0.9, 0.5, 0.0]},
    {"text": "HTTP uses port 80 by default.", "vector": [0.85, 0.2, 0.0, 0.4]},
]

query = [0.88, 0.15, 0.0, 0.35]  # similar to port-80 topics
results = top_k_retrieval(query, memory_store, k=2)

print("Top-2 retrieval results:")
for r in results:
    print(f"  Score {r['score']:.3f} | {r['text']}")
```

7.2 Knowledge Graphs and GraphRAG: The World of Relations
Knowledge graphs (KG) model data as nodes and edges, where edges explicitly define the semantic relationship between entities (e.g. "Server A" → "VULNERABLE_TO" → "CVE-2024-XXXX") [18]. This enables complex queries impossible with vectors: "Show me all pentest findings for servers that are in the same subnet group as the system compromised yesterday" [20].
| Feature | Vector-Based Memories | Graph-Based Memories |
|---|---|---|
| Retrieval Mechanism | Semantic similarity | Structural traversal |
| Transparency | Low ("black box" scores) | High (auditable paths) |
| Challenge | Context loss, redundancy | High schema-construction overhead |
| Strength | Search across large text volumes | Logical inference, multi-hop reasoning |
| Example Tool | Pinecone, FAISS [19] | Neo4j, AWS Neptune [20] |
In long-lived systems such as PentestGPT, memory is often organized as a "Pentesting Task Tree" (PTT) - a specialized graph structure that maps phases such as reconnaissance, scanning, and exploitation as hierarchical tasks [21]. The agent uses this graph to maintain global strategies and avoid redundant steps by annotating each node with attributes about status and findings [21].
```python
# Example: Simple knowledge graph with networkx
# pip install networkx
import networkx as nx

kg = nx.DiGraph()

# Nodes
kg.add_node("Server_A", type="host", ip="10.0.1.5")
kg.add_node("Server_B", type="host", ip="10.0.1.6")
kg.add_node("Apache_2.4", type="software")
kg.add_node("CVE-2024-XXXX", type="vulnerability", cvss=9.8)
kg.add_node("Subnet_10.0.1", type="subnet", cidr="10.0.1.0/24")

# Edges (relations)
kg.add_edge("Server_A", "Apache_2.4", relation="RUNS")
kg.add_edge("Apache_2.4", "CVE-2024-XXXX", relation="AFFECTED_BY")
kg.add_edge("Server_A", "Subnet_10.0.1", relation="IN_SUBNET")
kg.add_edge("Server_B", "Subnet_10.0.1", relation="IN_SUBNET")

def servers_in_same_subnet_as(graph: nx.DiGraph, host: str) -> list[str]:
    """Multi-hop: all servers in the same subnet as host."""
    subnets = [n for _, n, d in graph.out_edges(host, data=True) if d["relation"] == "IN_SUBNET"]
    peers = []
    for subnet in subnets:
        for src, _, d in graph.in_edges(subnet, data=True):
            if d["relation"] == "IN_SUBNET" and src != host:
                peers.append(src)
    return peers

def vulnerabilities_of(graph: nx.DiGraph, host: str) -> list[str]:
    """Two-hop path: Host → Software → CVE."""
    vulns = []
    for _, sw, d1 in graph.out_edges(host, data=True):
        if d1["relation"] == "RUNS":
            for _, cve, d2 in graph.out_edges(sw, data=True):
                if d2["relation"] == "AFFECTED_BY":
                    vulns.append(cve)
    return vulns

print("Servers in the same subnet as Server_A:", servers_in_same_subnet_as(kg, "Server_A"))
print("CVEs of Server_A:", vulnerabilities_of(kg, "Server_A"))
```

8. Temporal Intelligence: Bi-Temporal Graphs and State Versioning
A long-lived agent must not only know what it knows, but also when it learned it and whether it is still current. In dynamic environments, such as those encountered with XBOW or automated IT operations, facts change constantly [37].
8.1 The Concept of Bi-Temporality
Bi-temporal data models, as used in the Graphiti framework, track two time axes for each piece of information [23]:
- Valid Time: The period during which a fact was true in the real world (e.g. "Port 80 was open from 08:00 to 12:00") [24].
- Transaction Time (System Time): The point in time at which the system learned this fact [24].
This allows the agent to perform "point-in-time" queries: "Which vulnerabilities were known last Tuesday before the patch management script ran?" [23]. Without this temporal dimension, an agent would be confronted with contradictory information from different points in time, leading to logical errors. Frameworks like Graphiti resolve this through a three-stage pipeline: temporal classification (Atemporal, Static, Dynamic), event extraction, and validity checking [25]. Contradictions are detected and outdated entries are tagged with an invalidated_by link instead of being deleted [25].
```python
# Example: Bi-temporal data storage in Python
from datetime import datetime
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class BiTemporalFact:
    content: str
    valid_from: datetime
    valid_to: Optional[datetime]
    transaction_time: datetime = field(default_factory=datetime.utcnow)
    fact_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])
    invalidated_by: Optional[str] = None

class BiTemporalMemory:
    def __init__(self):
        self._facts: list[BiTemporalFact] = []

    def store(self, content: str, valid_from: datetime,
              valid_to: Optional[datetime] = None) -> BiTemporalFact:
        fact = BiTemporalFact(content=content, valid_from=valid_from, valid_to=valid_to)
        self._facts.append(fact)
        print(f"[STORE #{fact.fact_id}] {content!r}")
        return fact

    def invalidate(self, old_id: str, new_fact: BiTemporalFact):
        """Closes the validity window of the old fact instead of deleting it."""
        for f in self._facts:
            if f.fact_id == old_id:
                f.invalidated_by = new_fact.fact_id
                f.valid_to = f.valid_to or new_fact.valid_from
                print(f"[INVALIDATE #{old_id}] replaced by #{new_fact.fact_id}")

    def query_at(self, point_in_time: datetime) -> list[BiTemporalFact]:
        """Point-in-time query: what was valid at this point in time?
        Invalidated facts keep their historical validity window."""
        return [
            f for f in self._facts
            if f.valid_from <= point_in_time
            and (f.valid_to is None or f.valid_to > point_in_time)
        ]

# --- Demo ---
mem = BiTemporalMemory()
f1 = mem.store("Port 80 on Server_A is open", valid_from=datetime(2026, 2, 18, 8, 0))
f2 = mem.store("Port 80 on Server_A is closed", valid_from=datetime(2026, 2, 18, 12, 0))
mem.invalidate(f1.fact_id, f2)

result_10 = mem.query_at(datetime(2026, 2, 18, 10, 0))
print("\nFacts at 10:00:", [f.content for f in result_10])

result_13 = mem.query_at(datetime(2026, 2, 18, 13, 0))
print("Facts at 13:00:", [f.content for f in result_13])
```

8.2 Epistemic Consistency: Truth Maintenance in Dynamic Environments
A fundamental problem for long-lived agents is belief revision. In dynamic IT infrastructures, today's truth is tomorrow's hallucination [26]. Modern memory systems therefore require a Truth Maintenance System (TMS). Each piece of information in the graph is not only linked temporally but also causally. When the agent receives contradictory information, it must apply a weighting based on Source Credibility:
- System Observation (Hard Fact): Highest priority.
- User Instruction: Medium priority.
- Inference Derivation (Heuristic): Low priority; must be discarded in the event of a contradiction.
Logical correction works via dependency networks: if a fact is removed (e.g. "Server is offline"), all conclusions derived from it (e.g. "Service X is exploitable") must automatically be marked as "invalid" in memory to prevent cascading bad decisions [26].
```python
# Example: Truth Maintenance System (TMS) with cascading invalidation
from enum import IntEnum
from dataclasses import dataclass, field

class Priority(IntEnum):
    HARD_FACT = 3    # System observation
    INSTRUCTION = 2  # User instruction
    HEURISTIC = 1    # Inference derivation

@dataclass
class Belief:
    name: str
    content: str
    priority: Priority
    valid: bool = True
    depends_on: list[str] = field(default_factory=list)

class TruthMaintenanceSystem:
    def __init__(self):
        self._beliefs: dict[str, Belief] = {}

    def add(self, belief: Belief):
        self._beliefs[belief.name] = belief

    def assert_conflict(self, new_fact: Belief):
        self._beliefs[new_fact.name] = new_fact
        print(f"[TMS] New fact: {new_fact.name!r} (Prio {new_fact.priority.name})")
        self._cascade_invalidate(new_fact.name)

    def _cascade_invalidate(self, invalidated_name: str, depth: int = 0):
        for belief in self._beliefs.values():
            if invalidated_name in belief.depends_on and belief.valid:
                belief.valid = False
                indent = "  " * (depth + 1)
                print(f"{indent}↳ [INVALID] {belief.name}: {belief.content}")
                self._cascade_invalidate(belief.name, depth + 1)

    def active_beliefs(self) -> list[Belief]:
        return [b for b in self._beliefs.values() if b.valid]

# --- Demo ---
tms = TruthMaintenanceSystem()
tms.add(Belief("server_online", "Server A is online", Priority.HARD_FACT))
tms.add(Belief("port_open", "Port 80 is open", Priority.INSTRUCTION, depends_on=["server_online"]))
tms.add(Belief("apache_reachable", "Apache service is reachable", Priority.HEURISTIC, depends_on=["server_online", "port_open"]))
tms.add(Belief("cve_exploitable", "CVE-2024-XXXX is exploitable", Priority.HEURISTIC, depends_on=["apache_reachable"]))
tms.add(Belief("privesc_possible", "Privilege escalation possible", Priority.HEURISTIC, depends_on=["cve_exploitable"]))

print(f"Active beliefs before conflict: {len(tms.active_beliefs())}\n")
tms.assert_conflict(Belief("server_online", "Server A is OFFLINE", Priority.HARD_FACT))
print(f"\nActive beliefs after conflict: {len(tms.active_beliefs())}")
```

9. The Mathematics of Forgetting and Remembering: Scoring and Retrieval Logic
A critical problem for long-lived systems is memory "overflow" with irrelevant noise. An agent must prioritize. This is often achieved through a mathematical scoring logic based on three pillars: Recency, Frequency, and Importance.
9.1 The RFM Formula for Agent Memory
Based on marketing models, the relevance score of a memory at time t can be defined as follows [27]:
$$S(m,\, t) = w_r \cdot R(t) + w_f \cdot F + w_i \cdot I$$
Where:
- $R(t)$ represents Recency, often modeled by an exponential decay function: $R(t) = e^{-\lambda \cdot \Delta t}$, where $\lambda$ is the decay rate [28].
- $F$ is the frequency of retrieval - information that is needed frequently receives a higher base score [29].
- $I$ is an importance score assigned by the agent and set at the time of initial storage [14].
The system can also define a threshold $\theta$ (e.g. 0.86) above which information is considered "active" [29]. Thanks to this mathematical foundation, long-lived systems such as the model used by GitHub Copilot can decide which "citations" (references) are still relevant for a code review and which should be discarded due to code changes [30].
```python
# Example: RFM scoring for agent memory entries
#   S(m, t) = w_r * R(t) + w_f * F + w_i * I
#   Recency R(t) = exp(-λ * Δt), where Δt = t - t_last_access

import math
import time
from dataclasses import dataclass, field

@dataclass
class ScoredMemory:
    content: str
    importance: float   # I: manually assigned importance score (0-1)
    frequency: int = 1  # F: number of retrievals
    last_access: float = field(default_factory=time.time)

    def recency(self, now: float, decay_rate: float = 0.001) -> float:
        """R(t) = exp(-λ * Δt) | Δt in seconds"""
        delta_t = now - self.last_access
        return math.exp(-decay_rate * delta_t)

    def score(self, now: float, decay_rate: float = 0.001,
              w_r: float = 0.4, w_f: float = 0.3, w_i: float = 0.3) -> float:
        r = self.recency(now, decay_rate)
        f_norm = min(self.frequency / 10.0, 1.0)
        return w_r * r + w_f * f_norm + w_i * self.importance

    def access(self):
        self.frequency += 1
        self.last_access = time.time()

# --- Demo ---
now = time.time()

memories = [
    ScoredMemory("Port 80 open on 10.0.1.5", importance=0.9, frequency=5, last_access=now - 60),
    ScoredMemory("Standard HTTP protocol knowledge", importance=0.5, frequency=2, last_access=now - 3600),
    ScoredMemory("CVE-2024-XXXX affects Apache 2.4", importance=0.95, frequency=1, last_access=now - 7200),
]

THRESHOLD = 0.30

print(f"{'Score':>6} {'Active':>6}  Content")
print("-" * 60)
for m in sorted(memories, key=lambda x: x.score(now), reverse=True):
    s = m.score(now)
    active = s >= THRESHOLD
    print(f"{s:.3f} {'✅' if active else '❌':>6}  {m.content}")
```

10. Quantifying Memory: Benchmarking Agentic Memory Systems
The efficiency of a memory architecture cannot be measured by the accuracy of the LLM alone. Specific metrics are needed to evaluate "memory hygiene":
| Metric | Definition | Target |
|---|---|---|
| Retrieval Recall @K | Percentage of relevant facts returned among the Top-K results. | > 90% |
| Context Noise Ratio | Ratio of relevant tokens to "noise" in the retrieved context. | Low |
| Memory Access Latency | Time from request to delivery of the hydrated prompt. | < 200ms |
| Knowledge Retention Rate | Ability to preserve information across N interactions without corruption. | High |
Particular attention must be paid to the lost-in-the-middle phenomenon: agents tend to ignore information in the middle of a very long context window. An efficient memory architecture must therefore structure the retrieved context so that the most critical facts are placed at the edges (beginning/end) of the prompt to maximize the model's attention performance [26].
```python
# Example: Retrieval Recall@K, Context Noise Ratio, and a
# visualization of the lost-in-the-middle effect.
import math

def retrieval_recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    top_k = set(retrieved[:k])
    hits = top_k & set(relevant)
    return len(hits) / len(relevant) if relevant else 0.0

def context_noise_ratio(context_chunks: list[dict]) -> float:
    total = len(context_chunks)
    noise = sum(1 for c in context_chunks if not c["relevant"])
    return noise / total if total else 0.0

def lost_in_middle_weights(n_positions: int) -> list[float]:
    """Models the U-shaped attention curve: high at the edges, low in the middle."""
    weights = []
    for i in range(n_positions):
        norm = i / max(n_positions - 1, 1)
        w = 0.5 * (1 - math.cos(math.pi * (2 * norm - 1)))  # 1 at the edges, 0 in the center
        weights.append(round(w, 3))
    return weights

def reorder_for_attention(chunks: list[str]) -> list[str]:
    """Most important chunks at beginning & end, less important in the middle."""
    n = len(chunks)
    mid = n // 2
    return chunks[:mid // 2] + chunks[mid:] + chunks[mid // 2:mid]

# --- Demo ---
retrieved = ["CVE-Apache", "Port-80-Fact", "Subnet-Info", "Noise-A", "Noise-B"]
ground_truth = ["CVE-Apache", "Port-80-Fact", "Subnet-Info"]

print(f"Retrieval Recall@3: {retrieval_recall_at_k(retrieved, ground_truth, k=3):.0%}")
print(f"Retrieval Recall@5: {retrieval_recall_at_k(retrieved, ground_truth, k=5):.0%}")

chunks = [{"text": c, "relevant": c in ground_truth} for c in retrieved]
print(f"Context Noise Ratio: {context_noise_ratio(chunks):.0%}")

weights = lost_in_middle_weights(len(retrieved))
print("\nAttention weights per position (U-curve):")
for pos, (chunk, w) in enumerate(zip(retrieved, weights)):
    bar = "█" * int(w * 20)
    print(f"  Pos {pos}: {w:.3f} {bar:<20} → {chunk}")

print("\nAfter reordering (critical facts at the edges):")
for pos, chunk in enumerate(reorder_for_attention(retrieved)):
    print(f"  Pos {pos}: {chunk}")
```

11. Research and New Horizons: E-mem and Episodic Context Reconstruction
Current research recognizes that the conventional RAG model (extract vectors, inject into the prompt) leads to "destructive de-contextualization" [31]. The project E-mem instead proposes "Episodic Context Reconstruction" [31].
11.1 The Master-Assistant Architecture of E-mem
Instead of compressing information into mathematical points, E-mem preserves uncompressed memory segments (episodes). The system uses a hierarchical architecture [31]:
- Master Agent: Responsible for global planning and strategy.
- Assistant Agents: Small, specialized language models (SLMs), each managing a specific segment of the raw history [31].
When the master asks a question, relevant assistant agents are activated. They perform a local "reconstruction": they read the full context of their segment and derive logical evidence instead of simply delivering text fragments [31]. This process of "re-experiencing" ensures that sequential dependencies essential for complex System-2 reasoning are preserved [31]. Evaluations on benchmarks such as LoCoMo show that E-mem outperforms conventional architectures by up to 8.8% on temporal reasoning while reducing token costs by over 70% [31].
```python
# Example: E-mem master-assistant architecture
from dataclasses import dataclass

@dataclass
class Episode:
    """Uncompressed memory segment of an assistant agent."""
    agent_id: str
    label: str
    raw_history: list[str]

    def reconstruct(self, query: str) -> str | None:
        relevant = [
            e for e in self.raw_history
            if any(w in e.lower() for w in query.lower().split())
        ]
        if not relevant:
            return None
        evidence = " | ".join(relevant)
        return f"[{self.agent_id} / '{self.label}'] {evidence}"

class MasterAgent:
    def __init__(self, episodes: list[Episode]):
        self.episodes = episodes

    def _select_episodes(self, query: str) -> list[Episode]:
        return [
            ep for ep in self.episodes
            if any(w in " ".join(ep.raw_history).lower() for w in query.lower().split())
        ]

    def answer(self, query: str) -> str:
        activated = self._select_episodes(query)
        print(f"[Master] Query: {query!r} → {len(activated)}/{len(self.episodes)} episodes activated")
        proofs = [ep.reconstruct(query) for ep in activated]
        proofs = [p for p in proofs if p]
        for p in proofs:
            print(f"  {p}")
        if not proofs:
            return "No relevant episodes found."
        return f"Final answer ({len(proofs)} episodes): " + " // ".join(proofs)

# --- Demo ---
master = MasterAgent([
    Episode("A1", "Day 1-7", ["Nmap scan started", "Port 80 on Server_A open", "Apache 2.4 detected"]),
    Episode("A2", "Day 8-14", ["CVE-2024-XXXX found", "Exploit launched on Apache", "Shell obtained"]),
    Episode("A3", "Day 15-21", ["Privilege escalation via sudo", "Root access confirmed", "Lateral movement"]),
])

print(master.answer("CVE Apache Exploit"))
print()
print(master.answer("sudo Privilege"))
```

12. Practical Example: The HMLR System (Hierarchical Memory Lookup & Routing)
A concrete example of a modern implementation is the HMLR system (HMLR-Agentic-AI-Memory-System) [32]. It combines short-term and long-term components through a specialized process:
- Bridge Blocks: Information is initially stored as short-term fragments.
- Gardener Function: A background process (run_gardener.py) periodically transfers this data to LTM [32].
- Dossiers: Critical facts are extracted into dossiers that persist across days and topics. This allows the system to reconstruct a "causal chain of events" from the past to the present, as if the information were still in "hot memory" [32].
This system uses a "Governor Agent" as the central control organ that decides whether a memory is relevant to the current query, and a "Context Hydrator" that assembles the final prompt from vector results, SQL facts, and dossiers [32].
```python
# Example: Simplified HMLR architecture
import time
from dataclasses import dataclass, field

@dataclass
class BridgeBlock:
    content: str
    created_at: float = field(default_factory=time.time)
    importance: float = 0.5

@dataclass
class Dossier:
    topic: str
    facts: list[str] = field(default_factory=list)

bridge_blocks: list[BridgeBlock] = []
vector_store: list[dict] = []
dossiers: dict[str, Dossier] = {}

def run_gardener(importance_threshold: float = 0.7):
    """Background process: promotes important bridge blocks into dossiers,
    archives all blocks as vectors."""
    global bridge_blocks
    promoted, archived = 0, 0
    for block in bridge_blocks:
        if block.importance >= importance_threshold:
            topic = block.content.split()[0]
            if topic not in dossiers:
                dossiers[topic] = Dossier(topic=topic)
            dossiers[topic].facts.append(block.content)
            promoted += 1
        vector_store.append({"text": block.content, "score": block.importance})
        archived += 1
    bridge_blocks = []
    print(f"[Gardener] {promoted} dossier entries, {archived} vectors stored.")

def governor(query: str, candidate: str) -> bool:
    """Governor agent: decides whether a memory is relevant to the query."""
    keywords = set(query.lower().split())
    entry_words = set(candidate.lower().split())
    return len(keywords & entry_words) >= 1

def hydrate_context(query: str) -> str:
    """Context hydrator: assembles the final prompt from dossiers and vectors."""
    parts = []
    for dossier in dossiers.values():
        for fact in dossier.facts:
            if governor(query, fact):
                parts.append(f"[DOSSIER] {fact}")
    for entry in sorted(vector_store, key=lambda x: x["score"], reverse=True)[:3]:
        if governor(query, entry["text"]):
            parts.append(f"[VECTOR] {entry['text']}")
    hydrated = "\n".join(parts) if parts else "(no relevant context)"
    return f"=== Hydrated context for: '{query}' ===\n{hydrated}"

# --- Demo ---
bridge_blocks += [
    BridgeBlock("CVE-2024-XXXX confirmed on Apache_2.4", importance=0.95),
    BridgeBlock("Port 80 open on Server_A", importance=0.85),
    BridgeBlock("Nmap scan completed", importance=0.4),
]
run_gardener()
print()
print(hydrate_context("CVE Apache"))
```

13. The Dark Side of Persistence: Privacy and the "Right to Be Forgotten"
In an enterprise environment, an infinite memory is not only a technological blessing but also a regulatory risk. When agents store interactions over months, they inevitably accumulate personally identifiable information (PII) and business-critical secrets. The problem: conventional vector databases are not optimized for "deleting" individual facts. Because embeddings are relationally positioned in a high-dimensional space, simply removing a record often leaves "shadows" in the neighboring vectors [34].
Long-lived architectures must therefore implement mechanisms for Selective Deletion. This requires a metadata layer that assigns each memory a time-to-live (TTL) or an ownership tag. In multi-tenant systems, cross-context isolation must also be ensured: an agent must not blindly reuse private knowledge acquired in session A with user X in session B with user Y. The challenge of the future lies in "Machine Unlearning" - the ability of an agent to actively forget specific concepts without sacrificing the global coherence of its world knowledge [34].
```python
# Example: Selective deletion with TTL, ownership tags, and cross-context isolation.
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrivacyAwareMemory:
    content: str
    owner_id: str
    session_id: str
    ttl_seconds: Optional[float] = None
    created_at: float = field(default_factory=time.time)
    is_private: bool = True

    @property
    def is_expired(self) -> bool:
        if self.ttl_seconds is None:
            return False
        return (time.time() - self.created_at) > self.ttl_seconds

class PrivacyMemoryStore:
    def __init__(self):
        self._store: list[PrivacyAwareMemory] = []

    def add(self, mem: PrivacyAwareMemory):
        self._store.append(mem)

    def purge_expired(self) -> int:
        before = len(self._store)
        self._store = [m for m in self._store if not m.is_expired]
        removed = before - len(self._store)
        if removed:
            print(f"[TTL-Purge] {removed} expired entries deleted.")
        return removed

    def delete_by_owner(self, owner_id: str) -> int:
        before = len(self._store)
        self._store = [m for m in self._store if m.owner_id != owner_id]
        removed = before - len(self._store)
        print(f"[GDPR Delete] {removed} entries of '{owner_id}' removed.")
        return removed

    def retrieve(self, owner_id: str) -> list[PrivacyAwareMemory]:
        """Cross-context isolation: own & public facts allowed."""
        self.purge_expired()
        return [
            m for m in self._store
            if m.owner_id == owner_id or not m.is_private
        ]

# --- Demo ---
store = PrivacyMemoryStore()
store.add(PrivacyAwareMemory("API key of user_x: sk-abc123",
                             owner_id="user_x", session_id="sess_1", ttl_seconds=5, is_private=True))
store.add(PrivacyAwareMemory("HTTP uses port 80 by default",
                             owner_id="system", session_id="sess_0", is_private=False))
store.add(PrivacyAwareMemory("Pentest report of user_y: Root access obtained",
                             owner_id="user_y", session_id="sess_2", is_private=True))

print("Retrieval for user_x (cross-context check):")
for m in store.retrieve("user_x"):
    print(f"  [{m.owner_id}] {m.content}")

print("\nAfter 6 seconds (TTL expired):")
time.sleep(6)
store.purge_expired()

print("\nGDPR deletion for user_y:")
store.delete_by_owner("user_y")
print(f"Remaining entries total: {len(store._store)}")
```

14. Curated Memetics: Human-in-the-Loop in Memory Management
Despite advanced algorithms like RFM scoring, the risk of "memory pollution" - the gradual poisoning of memory with irrelevant or incorrect information - remains. Autonomous agents therefore need an interface for human supervision: the Memory Editing Interface [35].
In professional implementations, administrators can "pin" critical facts in so-called Dossiers. This information is immune to automatic decay and summarization. At the same time, a memory "audit log" allows tracing why an agent made a particular decision. This form of Explainable Memory is essential for debugging. If a pentest agent incorrectly spares a system, the security analyst must be able to manually correct or delete the underlying (incorrect) memory in the PTT (task tree) to get the agent back on track [35].
```python
# Example: Memory Editing Interface with pin, audit log, and manual correction.
import itertools
import random
import time
from dataclasses import dataclass, field

_next_id = itertools.count(1)  # unique, monotonically increasing memory IDs

@dataclass
class ManagedMemory:
    content: str
    pinned: bool = False
    valid: bool = True
    mem_id: int = field(default_factory=lambda: next(_next_id))

class MemoryEditingInterface:
    def __init__(self):
        self._memories: dict[int, ManagedMemory] = {}
        self._audit_log: list[dict] = []

    def _log(self, action: str, mem_id: int, detail: str = ""):
        entry = {"ts": time.strftime("%H:%M:%S"), "action": action, "mem_id": mem_id, "detail": detail}
        self._audit_log.append(entry)
        print(f"[AUDIT {entry['ts']}] {action:8} | id={mem_id} | {detail}")

    def add(self, content: str) -> ManagedMemory:
        mem = ManagedMemory(content=content)
        self._memories[mem.mem_id] = mem
        self._log("ADD", mem.mem_id, content[:60])
        return mem

    def pin(self, mem_id: int):
        mem = self._memories[mem_id]
        mem.pinned = True
        self._log("PIN", mem_id, mem.content[:60])

    def correct(self, mem_id: int, new_content: str):
        old = self._memories[mem_id].content
        self._memories[mem_id].content = new_content
        self._log("CORRECT", mem_id, f"{old[:35]!r} -> {new_content[:35]!r}")

    def delete(self, mem_id: int):
        content = self._memories.pop(mem_id).content
        self._log("DELETE", mem_id, content[:60])

    def decay_pass(self, survival_prob: float = 0.5):
        """Simulated decay: unpinned memories survive only with survival_prob."""
        removed = 0
        for mid, mem in list(self._memories.items()):
            if mem.pinned:
                continue
            if random.random() > survival_prob:
                del self._memories[mid]
                self._log("DECAY", mid, mem.content[:40])
                removed += 1
        print(f"[Decay] {removed} entries removed. Pinned entries: safe.")

    def show_audit(self):
        print("\n=== Audit Log (Explainable Memory) ===")
        for e in self._audit_log:
            print(f"  [{e['ts']}] {e['action']:8} | id={e['mem_id']} | {e['detail']}")

# --- Demo ---
mei = MemoryEditingInterface()
m1 = mei.add("Server_A is reachable via port 80")
m2 = mei.add("CVE-2024-XXXX is NOT exploitable (false heuristic)")
m3 = mei.add("Pentest policy: no active exploit without written authorization")

mei.pin(m3.mem_id)
mei.correct(m2.mem_id, "CVE-2024-XXXX IS exploitable (confirmed by scan)")

print("\n--- Decay Pass ---")
mei.decay_pass(survival_prob=0.3)
mei.show_audit()
```

15. Case Studies: XBOW, PentestGPT, and CAI in Practice
To illustrate the relevance of these architectures, it is worth looking at the long-lived systems mentioned at the start.
15.1 XBOW: Parallel Agents and a Central Coordinator
XBOW uses an architecture that separates exploration from verification [11]. A persistent coordinator maintains the global view of the attack target (the memory center), while thousands of short-lived, specialized agents ("Solvers") are spawned for narrowly scoped tasks [11]. These solvers receive only the context necessary for their mission and are destroyed upon completion to prevent "bias" accumulation or context collapse [11]. Results are validated by the coordinator and integrated into the global knowledge base. Here, memory is modularized to ensure scalability [37].
```python
# Example: XBOW-like coordinator-solver architecture.
import uuid
from dataclasses import dataclass
from typing import Callable

@dataclass
class SolverContext:
    task: str
    target: str
    known_facts: list[str]

@dataclass
class SolverResult:
    solver_id: str
    task: str
    findings: list[str]
    success: bool

def spawn_solver(context: SolverContext, work_fn: Callable[[SolverContext], list[str]]) -> SolverResult:
    solver_id = str(uuid.uuid4())[:8]
    print(f"  [Solver {solver_id}] SPAWN   → {context.task!r}")
    findings = work_fn(context)
    print(f"  [Solver {solver_id}] DESTROY → {len(findings)} findings")
    return SolverResult(solver_id=solver_id, task=context.task, findings=findings, success=bool(findings))

class XBOWCoordinator:
    def __init__(self, target: str):
        self.target = target
        self.ptt: dict[str, list[str]] = {}
        self.global_facts: list[str] = []

    def _validate_and_integrate(self, result: SolverResult):
        if result.success:
            self.ptt.setdefault(result.task, []).extend(result.findings)
            self.global_facts.extend(result.findings)
            print(f"  [Coordinator] {len(result.findings)} findings integrated.\n")
        else:
            print(f"  [Coordinator] No findings - solver result discarded.\n")

    def run_mission(self, tasks: list[tuple[str, Callable]]):
        print(f"=== Mission: {self.target} ===\n")
        for task_name, work_fn in tasks:
            ctx = SolverContext(task=task_name, target=self.target, known_facts=list(self.global_facts))
            result = spawn_solver(ctx, work_fn)
            self._validate_and_integrate(result)
        print("=== Final Pentesting Task Tree ===")
        for phase, findings in self.ptt.items():
            print(f"  [{phase}]")
            for f in findings:
                print(f"    - {f}")

def recon(ctx: SolverContext) -> list[str]:
    return [f"{ctx.target} reachable", "Reverse DNS: server-a.internal"]

def port_scan(ctx: SolverContext) -> list[str]:
    return ["Port 80 open (HTTP)", "Port 443 open (HTTPS)", "Port 22 open (SSH)"]

def exploit_attempt(ctx: SolverContext) -> list[str]:
    if any("Port 80" in f for f in ctx.known_facts):
        return ["CVE-2024-XXXX exploited", "Shell obtained as www-data"]
    return []

coordinator = XBOWCoordinator(target="10.0.1.5")
coordinator.run_mission([
    ("Reconnaissance", recon),
    ("Port Scanning", port_scan),
    ("Exploitation", exploit_attempt),
])
```

15.2 PentestGPT: The Task Tree as Structural Memory
PentestGPT addresses context loss through the Pentesting Task Tree (PTT). This tree structures the entire attack process from reconnaissance to escalation [21]. Each node in the tree stores not only text but also attributes such as tool outputs and priorities. The reasoning module continuously updates this tree, enabling the agent to switch between tasks without losing thread [21]. The separation into reasoning, generation, and parsing modules ensures that the memory (PTT) stays clean while the "dirty" work of command execution happens in isolated steps [21].
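A minimal sketch of how such a PTT node might be modeled (the field names and traversal logic are illustrative assumptions, not taken from the PentestGPT codebase):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PTTNode:
    """One task in a Pentesting Task Tree (illustrative structure)."""
    name: str                          # e.g. "Port Scanning"
    status: str = "pending"            # pending | in_progress | done | failed
    priority: int = 1
    tool_output: Optional[str] = None  # raw tool result attached to the node
    children: list["PTTNode"] = field(default_factory=list)

    def add_subtask(self, child: "PTTNode") -> "PTTNode":
        self.children.append(child)
        return child

    def next_open_task(self) -> Optional["PTTNode"]:
        """Depth-first search for the next unfinished leaf task."""
        if self.status in ("pending", "in_progress") and not self.children:
            return self
        for c in self.children:
            found = c.next_open_task()
            if found:
                return found
        return None

# --- Demo ---
root = PTTNode("Engagement 10.0.1.0/24", status="in_progress")
root.add_subtask(PTTNode("Reconnaissance", status="done", tool_output="Host 10.0.1.5 up"))
root.add_subtask(PTTNode("Port Scanning", priority=2))
print("Next open task:", root.next_open_task().name)  # → Port Scanning
```

Because the tree itself is the memory, the agent can resume at the next open node after any interruption instead of replaying the full interaction history.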
15.3 CAI and GitHub Copilot: Agentic Memory in the SDLC
GitHub Copilot has implemented a memory system specifically designed for the software development lifecycle (SDLC) [30]. Memories here are backed by citations that point to specific code locations. When an agent retrieves a memory, it first checks whether the code at that location still matches the memory [30]. If not, the memory is updated or discarded. This form of "validated memory" is essential for maintaining consistency in long-lived projects. Measurements show that this system increases the rate of successfully merged pull requests by 7% [30].
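The validation idea can be sketched roughly as follows; the data structure and hash-based check are illustrative assumptions, since Copilot's internal mechanism is not public at this level of detail:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class CitedMemory:
    """A memory entry backed by a citation into the codebase (illustrative)."""
    insight: str
    file_path: str
    line_range: tuple[int, int]
    content_hash: str  # hash of the cited lines at storage time

def hash_lines(source: str, start: int, end: int) -> str:
    snippet = "\n".join(source.splitlines()[start - 1 : end])
    return hashlib.sha256(snippet.encode()).hexdigest()

def validate(memory: CitedMemory, current_source: str) -> bool:
    """True if the cited code still matches the stored memory."""
    return hash_lines(current_source, *memory.line_range) == memory.content_hash

# --- Demo ---
old_code = "def auth(user):\n    return user.token is not None\n"
mem = CitedMemory("auth() only checks token presence", "auth.py", (1, 2),
                  hash_lines(old_code, 1, 2))

new_code = "def auth(user):\n    return verify_jwt(user.token)\n"
print("Before refactor:", validate(mem, old_code))   # True
print("After refactor:", validate(mem, new_code))    # False → update or discard
```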
16. The Economics of Memory: Cost-Benefit Analysis of Background Curation
An often-overlooked aspect is that memory is not "free". Every "Gardener" process, every recursive summarization, and every extraction of knowledge graphs consumes inference tokens. In a long-lived system, the costs of memory maintenance can account for up to 30% of total operational costs [36].
The return on investment (ROI) manifests in long-term efficiency, however: an agent with excellent memory requires significantly fewer interaction loops per task, because it avoids repetitive errors and immediately "recognizes" complex relationships rather than re-deriving them each time. The architectural choice of a more expensive but more precise graph memory typically pays for itself in productive environments within a few days through saved time and higher success rates on autonomous missions [36].
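A toy calculation illustrates this trade-off; every figure below is a hypothetical placeholder, not a measured value:

```python
# Toy cost model: does background curation pay for itself?
# All numbers are hypothetical placeholders, not measured values.

TOKEN_PRICE = 3e-6                   # $ per token (blended, assumed)
CURATION_TOKENS_PER_DAY = 2_000_000  # gardener + summarization overhead
LOOPS_WITHOUT_MEMORY = 12            # interaction loops per task, no LTM
LOOPS_WITH_MEMORY = 7                # loops per task with curated LTM
TOKENS_PER_LOOP = 15_000
TASKS_PER_DAY = 50

curation_cost = CURATION_TOKENS_PER_DAY * TOKEN_PRICE
saved_loops = (LOOPS_WITHOUT_MEMORY - LOOPS_WITH_MEMORY) * TASKS_PER_DAY
savings = saved_loops * TOKENS_PER_LOOP * TOKEN_PRICE

print(f"Daily curation cost:            ${curation_cost:.2f}")
print(f"Daily savings from fewer loops: ${savings:.2f}")
print(f"Net benefit:                    ${savings - curation_cost:.2f}/day")
```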
17. Conclusion: The Strategic Importance of Memory Architecture
The development of long-lived agentic systems clearly shows that memory management is the next major hurdle after optimizing raw compute [6]. Anyone building agents today must design them like operating systems that intelligently manage memory hierarchies, temporal sequences, and semantic relations [9].
Effective architectures often use hybrid approaches:
- STM is optimized through dynamic paging strategies (MemGPT) and efficient KV-cache management (PagedAttention) [33].
- LTM combines the speed of vector databases with the logical depth of knowledge graphs and the temporal precision of bi-temporal models [22].
- Mathematical filters ensure that only the relevant information stays in focus while noise is systematically forgotten [28].
For organizations, this means that memory can no longer be viewed merely as "storage" but as an active layer of governance and identity for the agent [9]. An agent that remembers is an agent that learns, corrects errors, and ultimately achieves genuine autonomy [2]. In the upcoming installments of this series we will see how this persistence is used to optimize contexts over the course of days and to ensure the economical operation of such highly complex systems.
Further Articles
- AI-Agents 01 - Beyond Automation: Designing Cognitive Architectures for AI-Agents
- AI-Agents 02 - The Architectural Spectrum of Agentic Systems
- AI-Agents 03 - Self-Reflection in Agentic Systems
- AI-Agents 04 - Architectural Persistence: Efficient Management of Short-term and Long-term Memory in Long-lived Agentic Systems
References
[1] "PentestGPT Alternatives: From Chatbots to Autonomous Agents (2026 Edition)," Penligent, Feb. 18, 2026. [Online]. Available: https://www.penligent.ai/hackinglabs/pentestgpt-alternatives-from-chatbots-to-autonomous-agents-2026-edition/
[2] "The real promise of agentic memory is continuous self-evolving," Reddit r/AI_Agents, Feb. 18, 2026. [Online]. Available: https://www.reddit.com/r/AI_Agents/comments/1q4lmfe/the_real_promise_of_agentic_memory_is_continuous/
[3] "Long-term memory in agentic systems: Building context-aware agents," Moxo, Feb. 18, 2026. [Online]. Available: https://www.moxo.com/blog/agentic-ai-memory
[4] Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens," Technical Report, updated Feb. 2025.
[5] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," Transactions of the Association for Computational Linguistics, vol. 12, pp. 157-173, 2024.
[6] "How agentic AI can strain modern memory hierarchies," The Register, Jan. 28, 2026. [Online]. Available: https://www.theregister.com/2026/01/28/how_agentic_ai_strains_modern_memory_heirarchies/
[7] "TsinghuaC3I/Awesome-Memory-for-Agents," GitHub, Feb. 18, 2026. [Online]. Available: https://github.com/TsinghuaC3I/Awesome-Memory-for-Agents
[8] A. Lucek, "agentic-memory: Implementing cognitive architecture and psychological memory concepts into Agentic LLM Systems," GitHub, Feb. 18, 2026. [Online]. Available: https://github.com/ALucek/agentic-memory
[9] "Memory for AI Agents: A New Paradigm of Context Engineering," The New Stack, Feb. 18, 2026. [Online]. Available: https://thenewstack.io/memory-for-ai-agents-a-new-paradigm-of-context-engineering/
[10] "MemGPT," research.memgpt.ai, Feb. 18, 2026. [Online]. Available: https://research.memgpt.ai/
[11] "Autonomous Offensive Security Platform," XBOW, Feb. 18, 2026. [Online]. Available: https://xbow.com/platform
[12] "7 Operating System Concepts Every LLM Engineer Should Understand," Medium, Feb. 18, 2026. [Online]. Available: https://medium.com/wix-engineering/7-operating-system-concepts-every-llm-engineer-should-understand-84ddf0cfb89a
[13] "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents," JetBrains Research, Dec. 2025. [Online]. Available: https://blog.jetbrains.com/research/2025/12/efficient-context-management/
[14] "Context Window Overflow in 2026: Fix LLM Errors Fast," Redis, Feb. 18, 2026. [Online]. Available: https://redis.io/blog/context-window-overflow/
[15] "Enhancing LLM Context with Recursive Summarization Using Python," GitHub, Feb. 18, 2026. [Online]. Available: https://github.com/xbeat/Machine-Learning/blob/main/Enhancing%20LLM%20Context%20with%20Recursive%20Summarization%20Using%20Python.md
[16] "Recursive Semantic Compression (RSC)," Scribd, Feb. 18, 2026. [Online]. Available: https://www.scribd.com/document/947818914/Recursive-Semantic-Compression-RSC
[17] H. Soul, "Context Management for Agentic AI: A Comprehensive Guide," Medium, Feb. 18, 2026. [Online]. Available: https://medium.com/@hungry.soul/context-management-a-practical-guide-for-agentic-ai-74562a33b2a5
[18] "Vector database vs. graph database: Knowledge Graph impact," Writer Engineering, Feb. 18, 2026. [Online]. Available: https://writer.com/engineering/vector-database-vs-graph-database/
[19] "Vector Databases vs. Knowledge Graphs for RAG," Paragon Blog, Feb. 18, 2026. [Online]. Available: https://www.useparagon.com/blog/vector-database-vs-knowledge-graphs-for-rag
[20] "Knowledge Graph vs. Vector Database for Grounding Your LLM," Neo4j, Feb. 18, 2026. [Online]. Available: https://neo4j.com/blog/genai/knowledge-graph-vs-vectordb-for-retrieval-augmented-generation/
[21] "PentestGPT: Automated LLM Pen Testing," Emergent Mind, Feb. 18, 2026. [Online]. Available: https://www.emergentmind.com/topics/pentestgpt
[22] "Vector Databases vs Knowledge Graphs: Which One Fits Your AI Stack?" Medium, Feb. 18, 2026. [Online]. Available: https://medium.com/@nitink4107/vector-databases-vs-knowledge-graphs-which-one-fits-your-ai-stack-816951bf2b15
[23] "Graphiti: Giving AI a Real Memory - A Story of Temporal Knowledge Graphs," Presidio, Feb. 18, 2026. [Online]. Available: https://www.presidio.com/technical-blog/graphiti-giving-ai-a-real-memory-a-story-of-temporal-knowledge-graphs/
[24] "Bitemporal Property Graphs to Organize Evolving Systems," Oracle Labs, Feb. 18, 2026.
[25] "Temporal Agents with Knowledge Graphs," OpenAI for Developers, Feb. 18, 2026. [Online]. Available: https://developers.openai.com/cookbook/examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents/
[26] N. Schick and J. Lehmann, "Truth Maintenance Systems for Retrieval-Augmented Generation: Managing Belief Revision in Dynamic Environments," Journal of Artificial Intelligence Research, vol. 78, pp. 445-469, Oct. 2024.
[27] "RFM scores - Recency, frequency, monetary," Amperity Docs, Feb. 18, 2026. [Online]. Available: https://docs.amperity.com/user/rfm.html
[28] "Memory Mechanisms in LLM Agents," Emergent Mind, Feb. 18, 2026. [Online]. Available: https://www.emergentmind.com/topics/memory-mechanisms-in-llm-based-agents
[29] "Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents," arXiv, Apr. 2024. [Online]. Available: https://arxiv.org/html/2404.00573v1
[30] "Building an agentic memory system for GitHub Copilot," GitHub Blog, Feb. 18, 2026. [Online]. Available: https://github.blog/ai-and-ml/github-copilot/building-an-agentic-memory-system-for-github-copilot/
[31] "E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory," arXiv, Jan. 2026. [Online]. Available: https://arxiv.org/html/2601.21714v1
[32] S. V. Dev, "HMLR-Agentic-AI-Memory-System," GitHub, Feb. 18, 2026. [Online]. Available: https://github.com/Sean-V-Dev/HMLR-Agentic-AI-Memory-System
[33] "Paged Attention from First Principles: A View Inside vLLM," Hamzas Blog, Feb. 18, 2026. [Online]. Available: https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/
[34] Y. Zhang et al., "Machine Unlearning in Generative AI: Challenges and Opportunities for the Right to be Forgotten," IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 2, pp. 112-128, Jan. 2025.
[35] M. Rossi, "Human-in-the-Loop Memory Curation: Interfaces for Explainable AI Agents," in Proc. 2025 CHI Conference on Human Factors in Computing Systems, 2025, pp. 1-15.
[36] R. Jain and S. Gupta, "The Economics of Agentic Workflows: Token Consumption and ROI in Autonomous Systems," Enterprise AI Quarterly, vol. 4, no. 1, pp. 22-35, Feb. 2026.
[37] "XBOW | Autonomous Offensive Security Platform," XBOW, Feb. 18, 2026. [Online]. Available: https://xbow.com/