For a road designed at 80 km/h, the manual gives a minimum radius of 250 meters. I had an agent that knew that number cold. It still got the radius wrong, because the number was never the answer.
250 meters is the floor only when the straight before the curve is short. Let the preceding straight run long and the operating speed climbs past the posted one, and the floor climbs with it. Tighten the next curve and the consistency check can push it higher again. The radius a curve actually needs is the output of several coupled variables and a feedback loop, not a row in a table. The table is true. It is also a trap, because it reads like the end of the question when it is the beginning.
This is the thing I got wrong about giving an agent domain knowledge, and it took building a few specialized ones to see it: the hard part isn’t getting the manual into the model. It’s keeping the structure the manual’s prose was carrying.
What “load the manual” usually means
The reflex, when you want an agent to be good at something specific, is to feed it the source material. PDF in, knowledge base out. Chunk the document, embed the chunks, retrieve the top few into context at question time. That’s the RAG playbook, and for a lot of tasks it’s exactly right.
It also quietly throws away the most valuable thing in a technical manual: not the values, but the way the values condition each other. A standard isn’t a list of numbers. It’s a system where the minimum radius depends on the operating speed, the operating speed depends on how long the preceding tangent is, the superelevation you choose changes the radius you can use, and the visibility requirement can override all of it. Flatten that into retrievable passages and the model gets the sentences back — but it answers as if the table were the truth: it hands you 250 and stops.
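To make that coupling concrete, here is a minimal sketch using the standard point-mass relation R = V² / (127 (e + f)), with V in km/h and R in meters. The superelevation and side-friction figures, and the 90 km/h operating speed, are my illustrative assumptions rather than values from the manual; they were chosen so that 80 km/h lands near the 250 m table entry.

```python
def min_radius_m(speed_kmh: float, superelevation: float, side_friction: float) -> float:
    """Point-mass minimum radius in meters: R = V^2 / (127 * (e + f))."""
    return speed_kmh ** 2 / (127.0 * (superelevation + side_friction))


# Assumed inputs, for illustration only (not taken from the manual).
E_MAX = 0.06    # maximum superelevation, as a fraction
F_SIDE = 0.14   # side friction factor, held constant to keep the sketch small

table_floor = min_radius_m(80, E_MAX, F_SIDE)         # the table's case
after_long_tangent = min_radius_m(90, E_MAX, F_SIDE)  # operating speed has crept up
steeper_bank = min_radius_m(80, 0.08, F_SIDE)         # more superelevation

print(round(table_floor), round(after_long_tangent), round(steeper_bank))
```

Under these assumptions the same curve needs roughly 252 m, 319 m, or 229 m depending on what surrounds it. Real side friction also falls as speed rises, so holding it constant understates the second figure.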
I don’t think this is a retrieval problem you fix with better embeddings. The passages were never the point. The edges were.
A criterion is a node, not a paragraph
So we stopped converting manuals into text and started converting them into graphs. One criterion per node. Each node is small and declares the same things:
- What it determines — one line.
- The rule or formula, with its variables named.
- What it depends on: every input, and why. Not “speed” but “the operating speed the vehicle actually carries into the curve, which can be higher than the posted one.”
- What it conditions downstream — what this value triggers or constrains elsewhere.
- Why it is not a single number — the obligatory section. If you can read a value off the node without understanding what moves it, the node is malformed.
- The values, each tagged with its source and whether it has been checked against the original.
That “why it is not a single number” section is the whole discipline. It is the part a plain-text dump erases, and the part an expert actually carries in their head. Writing it down is what turns a reference into something an agent can reason with instead of recite from.
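One way such a node could be written down, as a sketch rather than our actual schema: the class names, field names, and the example rule string are hypothetical, chosen to mirror the six items in the list. The `problems` method encodes the rule that a node without its "why it is not a single number" section is malformed.

```python
from dataclasses import dataclass, field


@dataclass
class Value:
    label: str
    amount: float
    unit: str
    source: str = ""        # where in the original the number was read
    verified: bool = False  # True only after checking against the original


@dataclass
class CriterionNode:
    node_id: str
    determines: str                                  # one line
    rule: str                                        # rule or formula, variables named
    depends_on: dict = field(default_factory=dict)   # input -> why it matters
    conditions: list = field(default_factory=list)   # downstream node ids
    why_not_single_number: str = ""                  # the obligatory section
    values: list = field(default_factory=list)

    def problems(self):
        """Return the reasons this node is malformed (empty list if none)."""
        found = []
        if not self.why_not_single_number.strip():
            found.append("missing 'why it is not a single number'")
        if not self.depends_on:
            found.append("declares no inputs")
        for value in self.values:
            if not value.source:
                found.append(f"value '{value.label}' has no source")
        return found


# Illustrative node; the rule string and source label are placeholders.
min_radius = CriterionNode(
    node_id="min_radius",
    determines="Smallest radius a curve may use.",
    rule="R >= V^2 / (127 * (e + f)), V = operating speed in km/h",
    depends_on={
        "operating_speed": "the speed actually carried into the curve",
        "superelevation": "changes the radius you can use",
    },
    conditions=["consistency_check"],
    why_not_single_number="The floor moves with tangent length and neighbors.",
    values=[Value("design speed 80 km/h", 250.0, "m", source="manual table")],
)

bare = CriterionNode("bare", "A value copied from a table.", "R = 250")
print(min_radius.problems())
print(bare.problems())
```

The second node is what a flat dump tends to produce: a value with nothing declared about what moves it, which the check rejects.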
Building one
The process settled into five passes, and the order matters.
First, map the graph before writing anything. Read the manual to find which criteria it covers and how they wire together — the nodes and the edges — not to transcribe it. The output of this pass is a dependency list and, tellingly, the cycles. When you find a loop — a curve’s speed sets the consistency check, which sets the next curve’s speed, which sets the next radius — you have proof the knowledge can’t be stored flat.
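Finding those loops can be mechanical once the dependency list exists. A minimal sketch, assuming the first pass yields a plain mapping from each criterion to the criteria it depends on; the node names are a hypothetical simplification of the speed, consistency, and radius loop, and the search is an ordinary depth-first traversal, not anything specific to our tooling.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of node names, or None."""
    visiting, done = set(), set()
    path = []

    def visit(node):
        visiting.add(node)
        path.append(node)
        for dep in depends_on.get(node, ()):
            if dep in visiting:
                # Found a back edge: slice the current path into the loop.
                return path[path.index(dep):] + [dep]
            if dep not in done:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in depends_on:
        if start not in done:
            found = visit(start)
            if found:
                return found
    return None


# Hypothetical dependency list: each key depends on the names in its list.
depends_on = {
    "operating_speed": ["radius", "tangent_length"],
    "consistency_check": ["operating_speed"],
    "radius": ["consistency_check"],
    "visibility": ["radius"],
}

cycle = find_cycle(depends_on)
print(" -> ".join(cycle) if cycle else "no cycle")
```

A non-empty result is the signal to prototype those nodes first, since they are the ones a flat store cannot represent.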
Second, write two or three nodes by hand and stop. The node schema defines the entire knowledge base; getting it wrong and finding out after forty nodes is expensive. So you prototype the hardest few — the ones in the cycle — confirm the shape, then scale.
Third, populate by theme, in parallel. One worker per thematic block (speeds, plan, profile, visibility, cross-section), each handed the spec the first pass already produced. They write the nodes; they don’t re-derive the criteria.
Fourth — and this is the pass that separates a nice diagram from something you can sign — reconcile the numbers against the original PDF. Converting a manual to markdown wrecks its tables: cells collapse, columns merge, a value silently becomes the wrong value. The markdown is fine for the logic of what-depends-on-what. The numbers have to be read back off the source, page by page, each one tagged with where it came from. Until a value is reconciled, it is marked unverified and the agent is told not to trust it. No source, no number — a guess is worse than a gap.
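The "no source, no number" rule is easy to enforce in code. A minimal sketch, assuming each value carries a source tag and a verified flag; the names and the mangled example cell are hypothetical, not taken from our data.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaggedValue:
    name: str
    amount: float
    unit: str
    source: Optional[str] = None  # where in the original it was read
    verified: bool = False


def reconcile(value, amount_from_source, source):
    """Return a copy carrying the number read off the original, marked verified."""
    return replace(value, amount=amount_from_source, source=source, verified=True)


def lookup(values, name):
    """Return a value only if it is verified and sourced; otherwise None."""
    value = values.get(name)
    if value is None or not value.verified or not value.source:
        return None
    return value


# A cell as it might come out of a markdown conversion: a digit lost, no source.
values = {"min_radius_80": TaggedValue("min_radius_80", 25.0, "m")}
before = lookup(values, "min_radius_80")

values["min_radius_80"] = reconcile(
    values["min_radius_80"], 250.0, "original PDF (table and page go here)"
)
after = lookup(values, "min_radius_80")
print(before, after.amount)
```

Returning `None` for an unreconciled value keeps the gap visible to the agent instead of letting a plausible wrong number through.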
Fifth, dogfood against a real case. Point the finished graph at an actual project and make it produce verdicts, each citing the node it used. This does two things: it proves the graph works, and it corrects the graph, because the real case always finds something the manual’s prose left implicit.
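A sketch of what "verdicts, each citing the node it used" can look like in practice. The types, the single check, and the curve data are all made up for illustration; the point is only that every verdict carries the id of the criterion behind it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    node_id: str                    # the criterion node this check encodes
    passes: Callable[[dict], bool]
    reason: str


@dataclass(frozen=True)
class Verdict:
    element: str
    node_id: str
    status: str                     # "ok" or "review"
    reason: str


def run_checks(elements, checks):
    """Apply every check to every element; each verdict cites its node."""
    return [
        Verdict(name, check.node_id, "ok" if check.passes(props) else "review", check.reason)
        for name, props in elements.items()
        for check in checks
    ]


# Made-up curves: radius provided vs. the radius this case is assumed to need.
curves = {
    "C1": {"radius_m": 300.0, "required_radius_m": 250.0},
    "C2": {"radius_m": 260.0, "required_radius_m": 320.0},
}
checks = [
    Check(
        node_id="min_radius",
        passes=lambda c: c["radius_m"] >= c["required_radius_m"],
        reason="radius must meet the floor for the speed actually carried in",
    )
]

verdicts = run_checks(curves, checks)
for v in verdicts:
    print(f"{v.element}: {v.status} [{v.node_id}] {v.reason}")
```

Because each verdict names its node, a wrong verdict points straight at the rule to fix, which is how the real case ends up correcting the graph.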
What the graph caught that a flat file wouldn’t
I want to be precise here, because this is easy to oversell. I don’t have an A/B benchmark — no “flat scored X, graph scored Y” across a hundred cases. What I have is a handful of real ones where the structure paid for itself.
On EUCLIDES, our road-geometry agent, the dogfood ran the curves of an actual alignment. The flat reading passed all of them. The graph flagged several for review, and in the process found that one of its own rules was too strict: the manual settles a certain operating-speed case by a ratio of radii, not by the blunt rule we had first encoded. Applying the manual’s rule exonerated a curve we had wrongly marked. The graph didn’t just check the project. It checked itself.
On APELES, our brand agent — same method, completely different domain — the graph caught that a one-color logo we were about to approve collapsed at favicon size. A decision made “by eye” would have shipped it. Following the edges — drop the color, so distinctiveness now rests on the shape, so test the shape at small size — surfaced the failure before it left the building.
The pattern repeats. The discipline of declaring every edge acts as a consistency check on the knowledge itself. Gaps that hide comfortably in prose can’t hide in a graph, because a missing edge is visible.
One method, a bench of specialists
The useful surprise was that the method didn’t care about the domain. The same five passes that built EUCLIDES built POSEIDON from the drainage manual, ICARO from the civil-aviation standard, APELES from a shelf of design books, CADMO from a stack of typography and accessibility standards, THEMIS from the civil code. Each is its own graph with its own sources; each became its own specialized agent, narrow on purpose, that reasons from criteria instead of from vibes. (MARCO, the agent I’ve written about that places culverts in our road-design software, sits next to these: MARCO drives the software, EUCLIDES is the criteria it answers to.)
| Agent | Domain | Source standards | Nodes |
|---|---|---|---|
| EUCLIDES | Road geometry | MOPC (Paraguay) + AASHTO | ~35 |
| POSEIDON | Hydrology & drainage | MOPC drainage manual | ~23 |
| ICARO | Aerodrome design | DINAC R14 + ICAO Annex 14 | ~24 |
| APELES | Brand & identity | Wheeler · Chaves · Costa · Müller · Neumeier | ~38 |
| CADMO | Document production | WCAG 2.2 · Butterick · house standard | ~35 |
| THEMIS | Legal & contracts | Paraguayan Civil Code | ~39 |
None of these is a general assistant told to “act like a road engineer.” Each is a body of structured criteria with a narrow mandate and a hard edge — EUCLIDES doesn’t do hydrology, APELES doesn’t do contracts — and the value lives in the graph, not in the prompt.
When flat is fine
This is more work than chunk-and-embed, and it isn’t always worth it. If the agent’s job is to find and quote passages — search a contract for a clause, pull a definition — retrieval over flat text is the right tool and a graph is overhead. The graph earns its cost when the questions are about judgment: when the right answer depends on conditions, when criteria pull against each other, when “the common answer” and “the correct answer for this case” are different things. That is exactly where a model left to its own statistics hands you the average answer with confidence. The graph is how you make it reason about the case in front of it instead.
The manual already knew all of this. It was in the prose, in the cross-references, in the “except when” clauses everyone skims. Building the graph was mostly an act of refusing to throw that away.