My Copilot Aha Moment — And What It Taught Me About Memento
Watching Copilot navigate an unknown codebase live, I realised Memento's agent is doing the exact same thing — just through your memories.
It happened while I was using GitHub Copilot inside VS Code. I had opened a codebase I'd never touched before — a side project someone shared with me — and I asked Copilot to explain what a particular function did. Nothing unusual.
But instead of just summarising, Copilot started working. It read the file. Traced an import. Pulled up a type definition. Checked a config. And then it answered — with full context, citing the exact lines. I hadn't told it where to look. It just knew.
That was the aha moment. Copilot wasn't magic — it was methodical. It used tools, gathered evidence, and reasoned its way to an answer. And I suddenly saw Memento in a completely new light.
The pattern underneath the magic
After that Copilot session I wanted to understand the mechanics properly — not just observe the output. The challenge: Copilot itself isn't open source. So I went looking for something that was.
That led me to OpenCode — an open-source AI coding agent with a cleanly readable implementation. Because the code is public, I could actually trace exactly how the agent loop worked, not just guess from the outside. The loop is straightforward but remarkably effective:
1. Understand the query — What does the user actually want?
2. Select a tool — Read a file? Search a symbol? Run a test? Pick the best next signal.
3. Execute and observe — Run the tool, receive the result.
4. Update the mental model — What does this tell me? What's the next gap?
5. Repeat or answer — Keep going until there's enough evidence to respond confidently.
The agent doesn't know the codebase in advance. It earns its understanding, tool call by tool call.
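The five steps can be sketched in a few lines. This is a deliberately minimal illustration, not OpenCode's actual code (which is TypeScript): the "LLM's choices" are pre-scripted as a plan, and the only tool is a file reader over an in-memory repo.

```python
# Minimal agent-loop sketch. The `plan` list stands in for the LLM choosing
# tools; a real agent decides the next call from the evidence gathered so far.
REPO = {
    "main.py": "from util import helper\nhelper()",
    "util.py": "def helper(): ...",
}

def read_file(path):
    return REPO.get(path, "<not found>")

TOOLS = {"read_file": read_file}

def run_agent(query, plan):
    evidence = []                        # the agent's growing mental model
    for tool_name, arg in plan:          # 2. select a tool
        result = TOOLS[tool_name](arg)   # 3. execute and observe
        evidence.append((arg, result))   # 4. update the mental model
    # 5. answer once there is enough evidence, citing what was read
    return {"query": query, "evidence": evidence}

answer = run_agent("what does main.py do?",
                   [("read_file", "main.py"), ("read_file", "util.py")])
```

The point is the shape, not the contents: every answer is assembled from evidence the loop explicitly fetched.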
How this connects to Memento
Reading through OpenCode's source, one session stuck with me. The agent was asked to find why a function was broken in an unfamiliar repo. It searched, found three candidates, read the most relevant one, traced a dependency, surfaced the bug. Five tool calls. Zero prior knowledge.
I closed the file and thought about a typical Memento session for the first time through this lens.
A user asks: "What was that article about Rust error handling I read last month?"
Memento needs to run a semantic search on memory entries, filter by approximate timeframe, score and re-rank candidates, pull the most relevant text snippets, and synthesize an answer.
Same loop. Different memory. A coding agent's memory is the codebase — files, symbols, tests, configs. Memento's memory is your past — screenshots, OCR text, timestamps, window titles, URLs. Change the tools, change the data, and the underlying architecture is identical.
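That Memento retrieval pass — semantic search, timeframe filter, re-rank, top snippets — can be sketched over toy in-memory "memories". The similarity function here is a dummy keyword-overlap score standing in for Memento's real vector embeddings.

```python
# Sketch of the retrieval steps: filter by timeframe, score, re-rank, take top snippet.
from datetime import datetime

MEMORIES = [
    {"text": "Article on Rust error handling with Result and the ? operator",
     "ts": datetime(2024, 5, 10)},
    {"text": "Shopping list for the weekend", "ts": datetime(2024, 5, 11)},
    {"text": "Blog post about Go generics", "ts": datetime(2024, 2, 1)},
]

def score(query, text):
    # Dummy relevance: fraction of query words present in the memory text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def retrieve(query, since):
    candidates = [m for m in MEMORIES if m["ts"] >= since]   # timeframe filter
    ranked = sorted(candidates, key=lambda m: score(query, m["text"]),
                    reverse=True)                            # score and re-rank
    return ranked[:1]                                        # top snippet

hits = retrieve("rust error handling", since=datetime(2024, 5, 1))
```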
What I learned reading OpenCode's source
The reason I went to OpenCode specifically was simple: it's open source. The whole agent loop — tool dispatch, context assembly, result trimming, LLM formatting — is right there to read. I spent several evenings going through it, and four things stood out clearly.
Tool definitions are explicit contracts
Each tool has a precise schema — inputs, outputs, expected side effects. The agent never guesses what a tool does. This keeps the loop predictable and debuggable.
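Such a contract can look like the sketch below. The field names (`name`, `description`, `input_schema`, `run`) are my own illustration, not OpenCode's actual schema, but the idea is the same: the tool declares its inputs up front, so the loop never guesses.

```python
# A tool as an explicit contract: name, JSON-Schema-style input declaration,
# and a callable. Field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict
    run: Callable[..., str]

read_file = ToolSpec(
    name="read_file",
    description="Return the contents of the file at the given path.",
    input_schema={"type": "object",
                  "properties": {"path": {"type": "string"}},
                  "required": ["path"]},
    run=lambda path: f"<contents of {path}>",
)
```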
Context management is aggressive
Tool results are trimmed, ranked, and sometimes summarized before going back into context. Token budgets are a first-class concern, not an afterthought.
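A minimal version of that trimming step might look like this — token counting is approximated by whitespace splitting here, where a real system would use the model's tokenizer:

```python
# Trim a tool result to a token budget before it re-enters the context,
# keeping the start and marking the cut explicitly.
def trim_to_budget(text: str, budget: int) -> str:
    tokens = text.split()
    if len(tokens) <= budget:
        return text
    head = tokens[: budget - 1]          # leave room for the marker
    return " ".join(head) + " …[trimmed]"

long_result = "word " * 500
trimmed = trim_to_budget(long_result, budget=50)
```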
The agent loop is intentionally shallow
Surprisingly few abstractions. The core loop is tight and explicit: call tools → collect results → format context → call LLM → parse → repeat. No magic orchestration. Just disciplined structure.
Evidence is always traceable
Every claim the agent makes points back to something it actually read. The provenance chain is maintained throughout; hallucination isn't tolerated.
What transferred to Memento
Studying OpenCode changed how I think about Memento's retrieval architecture. Several patterns transferred almost directly.
Parallel retrieval, not sequential
Don't run searches one at a time. Run multiple targeted queries simultaneously — by time window, by app context, by semantic meaning — then merge and rank. This is how coding agents fan out exploratory tool calls before converging on an answer.
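Here is what that fan-out could look like with `asyncio`. The three search functions are stand-ins for time-window, app-context, and semantic queries against Memento's store; the merge keeps the best score per memory entry.

```python
# Fan-out: run several targeted searches concurrently, then merge and rank.
import asyncio

async def search_by_time(q):  return [("mem-1", 0.4)]
async def search_by_app(q):   return [("mem-2", 0.7)]
async def search_semantic(q): return [("mem-1", 0.9), ("mem-3", 0.3)]

async def fan_out(query):
    results = await asyncio.gather(
        search_by_time(query), search_by_app(query), search_semantic(query))
    merged = {}
    for hits in results:                  # merge: keep best score per entry
        for mem_id, s in hits:
            merged[mem_id] = max(s, merged.get(mem_id, 0.0))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = asyncio.run(fan_out("rust error handling"))
```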
Evidence anchoring over fluency
Memento should resist the urge to give fluent answers at the cost of accuracy. Every answer should be traceable back to a specific memory entry. If it can't be sourced, it shouldn't be said.
Context curation is the real work
The LLM doesn't see everything. Memento must make sharp decisions about which memory snippets are actually useful for a given query. Studying how coding agents trim and prioritize tool results gave me a lot to work with here.
Short tight loops beat long planning chains
Trying to plan out a full retrieval strategy upfront leads to brittle behavior. Better to run one retrieval step, observe the results, and decide the next step dynamically — exactly like a coding agent navigating an unknown file system.
Memento’s actual architecture
After that deep-dive into OpenCode I came back to Memento with fresh eyes and was honestly surprised by how much of this pattern was already there. Let me walk through what the agent actually does.
Chat Context Manager
Runs on every request first. Applies a sliding-window summarization to keep the full chat history under 1,500 tokens — older turns are compressed by a FREE-tier LLM, the two most recent pairs are kept verbatim. Downstream nodes never see unbounded history.
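The sliding window is simple to sketch: keep the two most recent exchanges verbatim and fold everything older into one summary turn. The summarizer here is a stub; the real node calls a FREE-tier LLM and enforces the token budget.

```python
# Sliding-window history compression: last `keep_pairs` user/assistant pairs
# stay verbatim; older turns collapse into a single (stubbed) summary turn.
def compress_history(turns, keep_pairs=2):
    keep = turns[-2 * keep_pairs:]
    older = turns[: len(turns) - len(keep)]
    if not older:
        return turns
    summary = "SUMMARY: " + " | ".join(t["text"] for t in older)  # stub LLM call
    return [{"role": "system", "text": summary}] + keep

turns = [{"role": r, "text": f"turn {i}"}
         for i, r in enumerate(["user", "assistant"] * 5)]
out = compress_history(turns)
```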
Classifier & Router
Single FREE-tier LLM call that rewrites the query (resolves pronouns like “it” and relative times like “last week” to ISO timestamps), then routes: chat-only, search, or mixed. If the query is too vague it emits a clarification question and stops.
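A toy version of that node, with the LLM call replaced by hard-coded rules, shows the two jobs — rewrite, then route. The "last week" handling and keyword-based routing are stand-ins for what the real model does.

```python
# Router sketch: resolve a relative time to ISO, then pick a route.
from datetime import date, timedelta

def rewrite_and_route(query: str, today: date):
    if "last week" in query:
        # Resolve to the Monday of the previous week (illustrative rule).
        start = today - timedelta(days=today.weekday() + 7)
        query = query.replace("last week", f"from {start.isoformat()}")
    needs_memory = any(w in query.lower() for w in ("what was", "find", "show"))
    route = "search" if needs_memory else "chat"
    return query, route

q, route = rewrite_and_route("what was I reading last week", date(2024, 5, 15))
```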
Planner Node
For complex queries a PREMIUM-tier LLM produces a validated intent DAG — each step has a goal, a skill hint, and declared dependencies. Cycles are detected and rejected; the planner retries up to twice on validation failure.
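Cycle detection on the intent DAG is plain graph code. The sketch below represents each step only by its dependencies (the goal and skill hint are omitted) and uses a standard three-color DFS:

```python
# Validate an intent DAG: reject plans whose dependency graph has a cycle.
def has_cycle(steps):
    """steps: {step_id: [dep_ids]} → True if a dependency cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in steps}
    def visit(s):
        color[s] = GRAY                      # on the current DFS path
        for d in steps[s]:
            if color[d] == GRAY or (color[d] == WHITE and visit(d)):
                return True                  # back edge → cycle
        color[s] = BLACK                     # fully explored
        return False
    return any(color[s] == WHITE and visit(s) for s in steps)

good = {"a": [], "b": ["a"], "c": ["a", "b"]}
bad  = {"a": ["c"], "b": ["a"], "c": ["b"]}
```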
Scheduler
Pure code, no LLM. Topological-sorts the DAG into execution levels. Steps with no shared dependency land in the same level and run in parallel (max concurrency 3).
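The level-wise topological sort is the kind of pure-code node that needs no LLM at all. This sketch groups every step whose dependencies are already satisfied into one level; steps within a level are the ones that can run in parallel (capped at 3 in Memento):

```python
# Topological sort into execution levels; steps in the same level share no
# unmet dependency and can run concurrently.
def schedule_levels(steps):
    """steps: {step_id: [dep_ids]} → list of levels (lists of step ids)."""
    remaining = {s: set(d) for s, d in steps.items()}
    levels, done = [], set()
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle detected")   # nothing runnable left
        levels.append(ready)
        done |= set(ready)
        for s in ready:
            del remaining[s]
    return levels

levels = schedule_levels({"a": [], "b": [], "c": ["a"], "d": ["a", "b"]})
```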
Executor + ReAct Loop
Each step runs its own ReAct loop: Think → Act (tool call) → Observe → repeat. Tools include SQL against local SQLite, semantic vector search, hybrid FTS+vector search, web search, and readMore to fetch full chunk text only when needed (preview-first to protect the context window). Global caps: 12 LLM calls, 30 s wall-clock.
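The budget-capped loop looks roughly like this — `think` below is a deterministic stand-in for the real LLM call, and the cap constants mirror the limits described above:

```python
# ReAct step with global caps: stop after 12 LLM calls or 30 s wall-clock.
import time

MAX_LLM_CALLS, MAX_SECONDS = 12, 30

def react_step(goal, think, tools):
    start, calls, observations = time.monotonic(), 0, []
    while calls < MAX_LLM_CALLS and time.monotonic() - start < MAX_SECONDS:
        calls += 1
        action = think(goal, observations)              # Think
        if action["type"] == "answer":
            return action["text"], calls
        result = tools[action["tool"]](action["arg"])   # Act
        observations.append(result)                     # Observe
    return "gave up: budget exhausted", calls

def scripted_think(goal, obs):          # deterministic stand-in for the LLM
    if len(obs) < 2:
        return {"type": "tool", "tool": "search", "arg": goal}
    return {"type": "answer", "text": f"answer based on {len(obs)} observations"}

answer, calls = react_step("rust article", scripted_think,
                           {"search": lambda q: f"hit for {q}"})
```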
Final Answer Node
Collects all StepResult objects, fetches full evidence text, and streams a cited answer via SSE. Memory citations use [[chunk_N]], web citations use [[web_N]]. Suggested follow-up questions are parsed from the response and sent to the frontend alongside resolved source metadata.
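Extracting those `[[chunk_N]]` and `[[web_N]]` markers so the frontend can attach source metadata is a small parsing job:

```python
# Parse [[chunk_N]] / [[web_N]] citation markers out of a generated answer.
import re

CITATION = re.compile(r"\[\[(chunk|web)_(\d+)\]\]")

def extract_citations(answer: str):
    return [(kind, int(n)) for kind, n in CITATION.findall(answer)]

text = ("Rust uses Result [[chunk_2]] and the ? operator [[chunk_5]], "
        "see also [[web_1]].")
cites = extract_citations(text)
```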
The data layer underneath is a local SQLite database with FTS5 for keyword search and 384-dim vector embeddings for semantic search — both served by a Rust daemon running on-device. Nothing leaves the machine unless the user triggers a web search.
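The FTS5 half of that hybrid store is easy to try with Python's stdlib `sqlite3`, assuming an SQLite build with FTS5 compiled in (the bundled one normally is). The vector half is omitted here; ranking uses FTS5's built-in `bm25()` only, where more-negative scores mean better matches.

```python
# Keyword side of the hybrid store: an FTS5 virtual table queried with MATCH,
# ordered by bm25 rank (ascending = most relevant first).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
db.executemany("INSERT INTO chunks(text) VALUES (?)", [
    ("article about rust error handling",),
    ("meeting notes about the roadmap",),
])
rows = db.execute(
    "SELECT text, bm25(chunks) FROM chunks "
    "WHERE chunks MATCH ? ORDER BY bm25(chunks)",
    ("rust",),
).fetchall()
```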
What needs to get better
The architecture is solid but there are places where it needs to grow. This is where I’d genuinely love input from people who’ve built similar systems.
Temporal reasoning at query time
The classifier rewrites relative times to ISO, but deeper queries like “what was I researching the week before that meeting” need the agent to reason across time spans, not just convert a timestamp. How should this layer work?
Inter-step memory in multi-turn conversations
The chat context manager handles summarization well for back-and-forth chat, but follow-ups that reference earlier search results within the same session — “what about the other thing you found” — are still fragile. Better session-level state management would help.
Smarter evidence ranking before the final node
Each ReAct step does preview-first retrieval which protects the context window. But the final node currently uses all collected evidence without re-ranking across steps. A cross-step relevance pass before synthesis could sharpen answers significantly.
If any of these map to problems you’ve solved — in a different product, a research project, or another open-source agent — I’d love to hear how you approached it. Open an issue or reach out directly.
Closing thought
The aha moment with Copilot didn’t just change how I think about coding agents — it changed how I think about Memento. OpenCode gave me a foundation and pointed my thinking in the right direction. Memento still needs better architectural design and sharper engineering practices — and that's exactly what I'm working toward.
If you’ve built retrieval systems, agent pipelines, or anything in this neighbourhood and have thoughts on the open questions above, suggestions are very welcome. Open an issue, drop a comment, or just tell me where you’d take it next.