
· reed's blog

How Construct combines FTS5, vector embeddings, and graph traversal to find the right memory at the right time.

Three Ways to Find a Memory: FTS5, Embeddings, and Graph Traversal in a Personal AI #

There are two kinds of forgetting in personal AI systems. The obvious kind: the context window fills up and older messages get truncated. The subtle kind: you've stored a memory, but when the user asks a related question, the retrieval system hands back the wrong things, or nothing.

The previous article in this series covered how Construct compresses conversations into observations. This article covers retrieval: how the agent finds relevant memories when it needs them.

Two Modes of Retrieval #

Memory is retrieved in two ways: passive (automatic, every message) and active (on-demand, via tool invocation). The code treats them very differently.

Passive Retrieval #

Passive retrieval happens automatically on every message. Before the agent even sees the user's input, the processMessage pipeline assembles context from three sources and injects them into the context preamble:

  1. Observations: if the conversation has been compressed by the observer/reflector pipeline, those observations get prepended first as stable context
  2. Recent memories: the last N memories (typically 10), injected for continuity
  3. Relevant memories: memories semantically matched to the current message via embedding similarity

Here's the code:

// src/agent.ts
const recentMemories = await getRecentMemories(db, 10)

let queryEmbedding: number[] | undefined
let relevantMemories = []
try {
  queryEmbedding = await generateEmbedding(env.OPENROUTER_API_KEY, message, env.EMBEDDING_MODEL)
  const results = await recallMemories(db, message, {
    limit: 5,
    queryEmbedding,
    similarityThreshold: 0.4,
  })
  const recentIds = new Set(recentMemories.map((m) => m.id))
  relevantMemories = results
    .filter((m) => !recentIds.has(m.id))
    .map((m) => ({ content: m.content, category: m.category, score: m.score }))
} catch {
  // Embedding call failed, no relevant memories, that's fine
}

(Observations are assembled separately in memoryManager.buildContext(), but they arrive in the same preamble.)

These get injected via the context preamble (not the system prompt):

// src/system-prompt.ts
if (context.observations) {
  preamble += '\n[Conversation observations: compressed context from earlier in this conversation]\n'
  preamble += context.observations + '\n'
}

if (context.recentMemories && context.recentMemories.length > 0) {
  preamble += '\n[Recent memories: use these for context, pattern recognition, and continuity]\n'
  for (const m of context.recentMemories) {
    preamble += `- (${m.category}) ${m.content}\n`
  }
}

if (context.relevantMemories && context.relevantMemories.length > 0) {
  preamble += '\n[Potentially relevant memories]\n'
  for (const m of context.relevantMemories) {
    const score = m.score !== undefined ? ` (${(m.score * 100).toFixed(0)}% match)` : ''
    preamble += `- (${m.category}) ${m.content}${score}\n`
  }
}

The agent doesn't need to invoke any tool for this. It just arrives already knowing the compressed context of the current conversation (observations), recent history (memories), and whatever is semantically relevant to the current message (more memories).

Observations #

The observations in the preamble come from the observer/reflector pipeline (detailed in the previous article in this series). In short: as the conversation unfolds, the observer extracts structured observations from recent messages ("Reed mentioned planning a trip to Portugal"). When observations accumulate past a threshold, the reflector condenses them. These condensed observations are stable context. They survive until a new reflection pass runs. They arrive in the preamble before memories because they're the most current context about the ongoing conversation.

Active Retrieval #

Active retrieval happens when the agent decides it needs to dig deeper. The memory_recall tool lets the agent search explicitly by keyword or topic. This is where the full retrieval stack comes into play.

The FTS5 → Embedding Waterfall #

The recallMemories function in src/db/queries.ts runs two passes in sequence, merging and deduplicating results as it goes:

/**
 * Hybrid memory recall: FTS5 → embeddings.
 * Results are merged and deduplicated by memory ID.
 */
export async function recallMemories(
  db: DB,
  query: string,
  opts?: {
    category?: string
    limit?: number
    queryEmbedding?: number[]
    similarityThreshold?: number
  },
): Promise<(Memory & { score?: number; matchType?: string })[]> {
  const limit = opts?.limit ?? 10
  const seen = new Set<string>()
  const results: (Memory & { score?: number; matchType?: string })[] = []

  // 1. FTS5 full-text search
  // 2. Embedding cosine similarity
}

Pass 1: FTS5. SQLite's FTS5 virtual table provides BM25-ranked full-text search over memory content and tags. (BM25 is a relevance ranking algorithm that scores how well a document matches a query; Claude Code uses it for tool selection as well.) The query construction is minimal: it splits on whitespace, drops single-character tokens, strips embedded quotes, wraps each remaining token in double quotes, and joins with OR:

const ftsQuery = query
  .split(/\s+/)
  .filter((w) => w.length > 1)
  .map((w) => `"${w.replace(/"/g, '')}"`)
  .filter((w) => w !== '""')
  .join(' OR ')

if (ftsQuery) {
  const ftsResults = await sql<Memory & { rank: number }>`
    SELECT m.*, fts.rank
    FROM memories_fts fts
    JOIN memories m ON m.id = fts.id
    WHERE memories_fts MATCH ${ftsQuery}
      AND m.archived_at IS NULL
    ORDER BY fts.rank
    LIMIT ${limit}
  `.execute(db)

  for (const row of ftsResults.rows) {
    if (!seen.has(row.id)) {
      seen.add(row.id)
      results.push({ ...row, matchType: 'fts5' })
    }
  }
}
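Restating that transform as a standalone function makes the behavior easy to check (toFtsQuery is an illustrative name, not the project's):

```typescript
// Turn a free-text query into an OR-joined FTS5 MATCH expression.
function toFtsQuery(query: string): string {
  return query
    .split(/\s+/)                          // split on whitespace
    .filter((w) => w.length > 1)           // drop single-character tokens
    .map((w) => `"${w.replace(/"/g, '')}"`) // strip embedded quotes, then quote
    .filter((w) => w !== '""')             // drop tokens that were only quotes
    .join(' OR ')
}

console.log(toFtsQuery('plan a trip to "Lisbon"'))
// → "plan" OR "trip" OR "to" OR "Lisbon"
```

Joining with OR makes the search permissive: any single token can match, and BM25 ranking sorts the multi-token matches toward the top.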

FTS5 is fast, runs entirely in SQLite with no external dependencies, and ranks results well for short queries. With the porter tokenizer it also stems suffixed word variants, so "running" matches a query for "run" (the stemmer is suffix-based, so irregular forms like "ran" won't). The memories_fts virtual table is kept synchronized via three database triggers created in migration 002: an AFTER INSERT, AFTER DELETE, and AFTER UPDATE that maintain the FTS index in lockstep with the main memories table.

Pass 2: Embedding similarity. If FTS5 found fewer results than the limit, the embedding pass runs. It fetches all memories with stored embeddings, computes cosine similarity in-process, filters by threshold, and promotes the best matches:

if (opts?.queryEmbedding && results.length < limit) {
  const threshold = opts.similarityThreshold ?? 0.3
  const allWithEmbeddings = await db
    .selectFrom('memories')
    .selectAll()
    .where('archived_at', 'is', null)
    .where('embedding', 'is not', null)
    .execute()

  const scored = allWithEmbeddings
    .map((m) => ({
      ...m,
      score: cosineSimilarity(opts.queryEmbedding!, JSON.parse(m.embedding!)),
      matchType: 'embedding' as const,
    }))
    .filter((m) => m.score >= threshold)
    .sort((a, b) => b.score - a.score)

  for (const m of scored) {
    if (!seen.has(m.id) && results.length < limit) {
      seen.add(m.id)
      results.push(m)
    }
  }
}

The cosine similarity is computed in plain JavaScript in a simple loop:

// src/embeddings.ts
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) return 0
  let dotProduct = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB)
  if (denominator === 0) return 0
  return dotProduct / denominator
}

Embeddings are generated via OpenRouter using qwen/qwen3-embedding-8b (configurable via EMBEDDING_MODEL env var) and stored as JSON-serialized float arrays in a TEXT column. The embedding is generated asynchronously and stored after the memory is saved, so for the first few milliseconds after storing a memory, it's not yet semantically searchable. That's an acceptable tradeoff for a personal AI where memories accumulate slowly.
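Since the column is TEXT, reading an embedding back for similarity scoring is just a JSON round-trip, which is lossless for JavaScript numbers:

```typescript
const embedding = [0.0123, -0.4567, 0.89]

// Write path: serialize the float array into the TEXT column.
const stored = JSON.stringify(embedding)

// Read path: parse before handing to cosineSimilarity.
const restored: number[] = JSON.parse(stored)

console.log(restored.every((v, i) => v === embedding[i]))
// → true
```

The cost is storage and parse time, not precision: JSON.stringify emits the shortest string that round-trips each double exactly.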

The Graph Expansion Layer #

Active memory recall goes one step further. After the FTS5/embedding waterfall, memory_recall performs a graph expansion pass using the knowledge graph:

// src/tools/core/memory-recall.ts
const seen = new Set(memories.map((m) => m.id))
const graphNodes = await searchNodes(db, args.query, 5, queryEmbedding)

if (graphNodes.length > 0) {
  // Traverse 1-2 hops from matching nodes
  const allNodeIds = new Set<string>()
  for (const node of graphNodes) {
    allNodeIds.add(node.id)
    const traversed = await traverseGraph(db, node.id, 2)
    for (const t of traversed) {
      allNodeIds.add(t.node.id)
    }
  }

  // Find memories linked to these nodes
  const relatedMemIds = await getRelatedMemoryIds(db, [...allNodeIds])
  const newMemIds = relatedMemIds.filter((id) => !seen.has(id))

  if (newMemIds.length > 0) {
    const relatedMems = await db
      .selectFrom('memories')
      .selectAll()
      .where('id', 'in', newMemIds)
      .where('archived_at', 'is', null)
      .limit(5)
      .execute()

    graphMemories = relatedMems.map((m) => ({ ...m, matchType: 'graph' }))
  }
}

The graph traversal uses a recursive CTE that walks edges bidirectionally:

WITH RECURSIVE traverse(node_id, depth, via_relation, visited) AS (
  SELECT ${startNodeId}, 0, NULL, ${startNodeId}
  UNION ALL
  SELECT
    CASE
      WHEN e.source_id = t.node_id THEN e.target_id
      ELSE e.source_id
    END,
    t.depth + 1,
    e.relation,
    t.visited || ',' || ...
  FROM traverse t
  JOIN graph_edges e ON (e.source_id = t.node_id OR e.target_id = t.node_id)
  WHERE t.depth < ${maxDepth}
    AND t.visited NOT LIKE '%' || ... || '%'
)

The cycle prevention is done by tracking visited node IDs in a comma-delimited string that gets passed down through the recursion. This is a SQLite-native approach since there's no array type.
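To make the CTE's semantics concrete, here is the same bidirectional, cycle-safe traversal sketched as an in-memory BFS (the node and edge shapes are hypothetical, not the project's actual types; a Set stands in for the CTE's comma-delimited visited string):

```typescript
interface Edge { sourceId: string; targetId: string; relation: string }

// Walk edges in both directions from startId, up to maxDepth hops,
// never revisiting a node (this is what the CTE's visited column prevents).
function traverse(edges: Edge[], startId: string, maxDepth: number): Set<string> {
  const visited = new Set<string>([startId])
  let frontier = [startId]
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = []
    for (const id of frontier) {
      for (const e of edges) {
        // Bidirectional: follow the edge from whichever end matches.
        const neighbor =
          e.sourceId === id ? e.targetId : e.targetId === id ? e.sourceId : null
        if (neighbor && !visited.has(neighbor)) {
          visited.add(neighbor)
          next.push(neighbor)
        }
      }
    }
    frontier = next
  }
  return visited
}

const edges: Edge[] = [
  { sourceId: 'alice', targetId: 'nightshade', relation: 'introduced to' },
  { sourceId: 'nightshade', targetId: 'restaurant', relation: 'is a' },
]
console.log([...traverse(edges, 'restaurant', 2)])
// → [ 'restaurant', 'nightshade', 'alice' ]
```

Starting from the "restaurant" node, two hops reach "nightshade" and then "alice", even though the query never mentioned either by name.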

What this enables: suppose the agent stored a memory "Alice introduced me to the restaurant Nightshade." When the user later asks "do I know any good restaurants?", the text search might not connect "restaurant" to "Nightshade" if those words don't co-occur in the memory. But the graph has edges like Alice -> [introduced to] -> Nightshade and Nightshade -> [is a] -> restaurant. The graph expansion traverses those edges and surfaces the Nightshade memory even without a keyword match.

Graph expansion results are tagged with matchType: 'graph' so the agent knows these are indirect associations rather than direct matches.

Why the Preamble, Not the System Prompt #

The injected memories don't go into the system prompt. They go into the context preamble, which is prepended to the user's first message:

// src/agent.ts
await agent.prompt(preamble + message)

This is intentional. The system prompt is kept static, the same text for every conversation turn, which makes it eligible for prompt caching by the LLM provider. The dynamic per-request context (current time, memories, observations, skills) goes in the user message, where it's expected to vary.
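The split can be sketched like this (the names are illustrative, not Construct's actual API): the system prompt is a constant, while the preamble is rebuilt on every request.

```typescript
// Static: byte-identical every turn, so the provider's prompt cache can hit.
const SYSTEM_PROMPT = 'You are a personal AI with persistent memory. ...'

// Dynamic: rebuilt per request from time, memories, and observations.
function buildPreamble(now: string, recentMemories: string[]): string {
  let preamble = `[Current time: ${now}]\n`
  if (recentMemories.length > 0) {
    preamble += '\n[Recent memories: use these for context, pattern recognition, and continuity]\n'
    preamble += recentMemories.map((m) => `- ${m}`).join('\n') + '\n'
  }
  return preamble
}

console.log(buildPreamble('2026-02-26 15:30', ['(fact) User visited Lisbon in 2023']))
```

Anything that varies per turn belongs in buildPreamble; anything that leaks into SYSTEM_PROMPT breaks the cache for every subsequent request.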

The tradeoff: the memories arrive as a user-turn text block rather than authoritative system context. The agent has to treat injected memories as "here's what I know" rather than "here's ground truth." In practice this works fine. The model is instructed in the system prompt to use recent memories for context and pattern recognition, and the format makes the provenance clear.

What It Looks Like to the Agent #

For every message, the agent receives something like this prepended to the user's input:

[Current time: 2:15 PM | Thursday, February 26, 2026 | America/New_York | telegram]

[Conversation observations: compressed context from earlier in this conversation]
- [2026-02-26] Reed mentioned planning a trip to Portugal
- [2026-02-26] Interested in visiting Lisbon and learning fado
- [2026-02-26] Has been using Duolingo for Portuguese

[Recent memories: use these for context, pattern recognition, and continuity]
- [2026-02-24 09:15] (preference) User prefers dark-roast coffee in the morning
- [2026-02-22 14:30] (fact) User's partner is named Jordan
- [2026-02-20 18:45] (note) User is learning Portuguese via Duolingo

[Potentially relevant memories]
- [2026-01-15 11:20] (fact) User visited Lisbon in 2023 (78% match)
- [2026-02-10 16:05] (preference) User enjoys fado music (61% match)

Then the agent receives the current conversation turn, also timestamped:

[2026-02-26 15:30] User: Hey, I'm planning that Portugal trip and wondering if fado is worth catching a live show for

This way the agent can reason temporally: the memory says "2026-02-10 User enjoys fado music" and the current message is from 2026-02-26, so the user is consistent in their interests.

If the agent then invokes memory_recall for more depth, the results come back annotated:

Found 5 memories:
[2026-01-15 11:20] [abc123] (fact) User visited Lisbon in 2023 — tags: travel, portugal [fts5]
[2026-02-10 16:05] [def456] (preference) User enjoys fado music [embedding] (71%)
[2026-01-22 14:30] [ghi789] (fact) User met a Portuguese speaker at work last month [graph]

The distinction between fts5, embedding, and graph matches gives the agent signal about how confident to be in each result's relevance.

Tradeoffs #

The embedding scan is O(n). The current implementation fetches all memories with embeddings and computes similarity in a JavaScript loop. For a personal AI with hundreds of memories this is fine, probably milliseconds. At thousands of memories it starts to matter. A production system would use a proper vector index (sqlite-vec, pgvector, etc.). The implementation only runs the embedding pass when queryEmbedding is present and there's still room in the result set, so it degrades gracefully if embeddings aren't configured.

FTS5 and embeddings can disagree. A query like "gym routine" might get a strong FTS5 hit on "I started a new gym routine" but a stronger embedding match on "I've been doing weights three times a week." The waterfall takes FTS5 results first and only fills the remaining slots with embedding results, so keyword matches always win on position. A more sophisticated approach would merge and re-rank results from both passes together, but that adds complexity for a personal AI where the user can always ask again if the first results aren't quite right.
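One common way to do that merge is reciprocal rank fusion: score each memory by summing 1/(k + rank) over every list it appears in, then sort by the combined score. A minimal sketch (the k = 60 constant and the ID lists are illustrative, not from Construct):

```typescript
// Reciprocal rank fusion over two ranked lists of memory IDs.
function rrfMerge(ftsIds: string[], embeddingIds: string[], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const list of [ftsIds, embeddingIds]) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so the best item contributes 1 / (k + 1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}

// A memory ranked in both lists beats one that appears in only one list.
console.log(rrfMerge(['m1', 'm2', 'm3'], ['m2', 'm4']))
// → [ 'm2', 'm1', 'm4', 'm3' ]
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.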

Graph node search uses both embeddings and LIKE. The graph expansion's searchNodes runs a LIKE match on node names and, when a query embedding is available, computes cosine similarity against node embeddings. Embedding matches go first, then LIKE results fill remaining slots, deduplicated:

// src/memory/graph/queries.ts
export async function searchNodes(db, query, limit = 10, queryEmbedding?) {
  const pattern = `%${query.toLowerCase().trim()}%`
  const likeResults = await db
    .selectFrom('graph_nodes').selectAll()
    .where('name', 'like', pattern)
    .orderBy('updated_at', 'desc').limit(limit).execute()

  if (!queryEmbedding) return likeResults

  const allWithEmbeddings = await db
    .selectFrom('graph_nodes').selectAll()
    .where('embedding', 'is not', null).execute()

  const embeddingMatches = allWithEmbeddings
    .map((n) => ({ ...n, score: cosineSimilarity(queryEmbedding, JSON.parse(n.embedding!)) }))
    .filter((n) => n.score >= 0.3)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)

  // Merge: embedding matches first, then LIKE, deduplicated
  const seen = new Set<string>()
  const merged = []
  for (const node of [...embeddingMatches, ...likeResults]) {
    if (!seen.has(node.id) && merged.length < limit) {
      seen.add(node.id)
      merged.push(node)
    }
  }
  return merged
}

This LIKE search is over graph nodes (entities like people, places, concepts), not memories themselves. It's the entry point into the graph: find the relevant nodes, then traverse the edge structure to surface connected memories. Node embeddings are generated during graph extraction after upserting each entity, from a string like "{name}: {description}". This runs in parallel via Promise.allSettled so a single failed API call doesn't break the extraction pipeline.
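The allSettled pattern is worth spelling out, since it's what keeps one failed embedding call from aborting the whole extraction pass. A sketch with a stubbed embedding call (embedNode, embedAll, and the node data are hypothetical):

```typescript
interface GraphNode { id: string; name: string; description: string }

// Stub standing in for the real embedding API call; 'n2' simulates a failure.
async function embedNode(node: GraphNode): Promise<number[]> {
  if (node.id === 'n2') throw new Error('rate limited')
  return [0.1, 0.2, 0.3] // pretend embedding of `${node.name}: ${node.description}`
}

// allSettled never rejects: each node is fulfilled or rejected independently,
// so one failed call is recorded as 'skipped' instead of failing the batch.
async function embedAll(nodes: GraphNode[]): Promise<string[]> {
  const settled = await Promise.allSettled(nodes.map((n) => embedNode(n)))
  return settled.map((r, i) =>
    r.status === 'fulfilled' ? `${nodes[i].id}: ok` : `${nodes[i].id}: skipped`,
  )
}

embedAll([
  { id: 'n1', name: 'dr. martinez', description: "Reed's dentist" },
  { id: 'n2', name: 'nightshade', description: 'a restaurant' },
]).then(console.log)
// → [ 'n1: ok', 'n2: skipped' ]
```

With Promise.all instead, the first rejection would reject the whole batch and the successful embeddings would be thrown away.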

The result: if the user asks about "my dentist" and the graph has a node named "dr. martinez" with description "Reed's dentist," the embedding similarity catches the semantic connection that a LIKE match on the name alone would miss. The same queryEmbedding generated for memory retrieval gets reused here, so there's no extra API call.

Summary #

Most memory systems for LLM applications default to a single retrieval strategy. Vector similarity dominates because it's powerful and familiar. Some add keyword search (FTS or BM25) as a fallback, forming a two-stage hybrid pipeline. Graph traversal to surface indirect associations is rarer—it requires structured entities and relationships that most systems don't maintain.

Construct uses all three in a layered pipeline, not for complexity's sake, but because each covers real gaps the others leave uncovered. FTS5 is fast and precise for exact keyword matches. Embeddings catch semantic associations that don't share vocabulary. Graph traversal finds memories connected by relationship chains that neither keyword nor embedding search would surface. Each result carries its matchType ('fts5', 'embedding', or 'graph'), so the agent (and anyone debugging the system) can understand both what was retrieved and why.

The queryEmbedding generated for memory retrieval gets reused in three places: memory similarity search, graph node search, and tool pack selection. One API call, multiple consumers.
