We’re building infrastructure to generate LLM tools from OpenAPI specs. A typical enterprise API spec weighs in at several megabytes—hundreds of operations, thousands of schemas, deeply entangled references. An LLM can’t consume that. We need to extract just the schemas each operation actually uses.
This is tree shaking: follow references, keep what’s reachable, discard the rest. The algorithm is trivial. The semantics are not.
The deceptive simplicity
The core algorithm is simple: mark entry points, trace references, emit what’s reachable. Same as any “dead code elimination” algorithm.
But schemas have their own quirks. References are strings—JSON Pointers or URIs—not language-level imports. Schemas compose through allOf, anyOf, oneOf, if/then/else. Keywords like properties and items contain subschemas implicitly. The reference graph can have cycles. And the same schema might be referenced from dozens of places.
A naive approach—traverse, copy reachable nodes—produces duplicate schemas, broken references, lost structure. The semantics need more care than the algorithm suggests.
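To make the trivial part concrete, here is a minimal sketch of the mark phase, assuming local JSON Pointer references like #/components/schemas/User. The names and signatures are illustrative, not our actual implementation; it just scans for $ref strings and follows them until the reachable set stops growing. The sections below deal with everything this version gets wrong.

```typescript
// Illustrative mark phase: collect every $ref reachable from an entry point.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function collectRefs(node: Json, out: Set<string>): void {
  if (Array.isArray(node)) {
    node.forEach((item) => collectRefs(item, out));
  } else if (node && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      if (key === "$ref" && typeof value === "string") out.add(value);
      else collectRefs(value, out);
    }
  }
}

function reachableFrom(entry: Json, resolve: (ref: string) => Json): Set<string> {
  const seen = new Set<string>();
  const queue: Json[] = [entry];
  while (queue.length > 0) {
    const refs = new Set<string>();
    collectRefs(queue.pop()!, refs);
    for (const ref of refs) {
      if (!seen.has(ref)) {
        seen.add(ref);            // mark
        queue.push(resolve(ref)); // trace transitively
      }
    }
  }
  return seen;
}
```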
Reference relocation
When you extract a schema, its $ref targets might not exist in the output. A reference to #/components/schemas/User assumes that path exists. In the extracted output, it might live at #/$defs/User instead.
So you relocate references. Every $ref in the output points to where the schema actually landed, not where it originally lived.
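A sketch of that rewriting step, reusing the Json type from the earlier sketch; the relocation policy is whatever the tree shaker decides, shown here as a callback.

```typescript
// Walk a schema and rewrite every $ref through a relocation function,
// e.g. "#/components/schemas/User" -> "#/$defs/User".
function rewriteRefs(node: Json, relocate: (ref: string) => string): Json {
  if (Array.isArray(node)) return node.map((item) => rewriteRefs(item, relocate));
  if (node && typeof node === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, value] of Object.entries(node)) {
      out[key] =
        key === "$ref" && typeof value === "string"
          ? relocate(value)
          : rewriteRefs(value, relocate);
    }
    return out;
  }
  return node;
}
```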
But relocation creates new problems.
Name preservation
Schema names aren’t arbitrary. User, PaymentIntent, Repository—these carry semantic meaning. An LLM reading the extracted schema benefits from meaningful names. Generating #/$defs/Schema_47 loses information.
So you preserve names when relocating. Extract the final segment of the original pointer. #/components/schemas/PaymentIntent becomes #/$defs/PaymentIntent.
But what happens when two different source schemas have the same name? An API might define Error in multiple places with different structures. You need disambiguation without destroying semantics.
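One possible scheme, sketched below: keep the last pointer segment as the name and add a numeric suffix only on collision. The suffix convention is our illustration, not a claim about how any particular tool disambiguates.

```typescript
// Derive a definitions name from the original pointer, preserving the
// human-readable segment and disambiguating collisions with a suffix.
function assignName(originalPointer: string, taken: Set<string>): string {
  const segments = originalPointer.split("/");
  // Unescape JSON Pointer tokens: "~1" -> "/", then "~0" -> "~".
  const base = segments[segments.length - 1].replace(/~1/g, "/").replace(/~0/g, "~");
  let name = base;
  for (let i = 2; taken.has(name); i++) {
    name = `${base}_${i}`; // a second, structurally different "Error" becomes "Error_2"
  }
  taken.add(name);
  return name;
}
```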
Referential identity
Here’s the subtle one. If schema A references #/components/schemas/User and schema B also references #/components/schemas/User, both references must resolve to the same extracted schema. Not two copies—the same one.
This matters for more than just output size. Schemas can be recursive. A User might have a manager property that references User. If you don’t preserve identity, you either infinite-loop or break the cycle incorrectly.
The tree shaker maintains a map from original schema identity to relocated schema. When you encounter a reference you’ve seen before, you emit a reference to the already-relocated target rather than re-extracting.
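Sketched in the same hypothetical style, the identity map is keyed by the schema's original location, so every reference to the same source schema resolves to one relocated entry.

```typescript
// One entry per original schema location; all references to that location
// share it, so recursive schemas stay recursive instead of being copied.
interface Relocation {
  name: string;        // e.g. "User"
  newPointer: string;  // e.g. "#/$defs/User"
  schema?: Json;       // filled in once this schema's body has been extracted
  visiting?: boolean;  // true while the body is being walked (used for cycles below)
}

const relocations = new Map<string, Relocation>();

function relocationFor(originalPointer: string, taken: Set<string>): Relocation {
  let entry = relocations.get(originalPointer);
  if (!entry) {
    const name = assignName(originalPointer, taken); // naming sketch above
    entry = { name, newPointer: `#/$defs/${name}` };
    relocations.set(originalPointer, entry); // registered before extraction begins
  }
  return entry;
}
```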
Cycle handling
Recursive schemas are everywhere. A TreeNode contains children: TreeNode[]. A Comment has replies: Comment[]. An Organization has parent: Organization.
Naive traversal loops forever. You need to detect cycles during traversal and handle them during output. The identity map solves this: when you encounter a schema you’re already processing, you know you’ve hit a cycle. Emit a reference to the in-progress relocation target and move on.
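Putting the earlier sketches together, cycle handling falls out of registering the relocation before walking the schema's body. Again, this is hypothetical code, not our internals.

```typescript
// Cycle-aware extraction: a recursive reference (e.g. User.manager -> User)
// finds an entry that is already registered or in progress and simply gets
// the relocated pointer back, instead of descending forever.
function extractSchema(
  originalPointer: string,
  resolve: (ptr: string) => Json, // look a pointer up in the source spec
  taken: Set<string>
): string {
  const entry = relocationFor(originalPointer, taken);
  if (entry.schema === undefined && !entry.visiting) {
    entry.visiting = true;
    entry.schema = rewriteRefs(resolve(originalPointer), (ref) =>
      extractSchema(ref, resolve, taken) // nested refs are extracted and relocated in turn
    );
    entry.visiting = false;
  }
  return entry.newPointer;
}
```

Every reachable schema ends up in the relocation map, ready to be emitted into the output's definitions container.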
Keywords as traversal guides
Last month we built a keyword-centric schema processor. Each keyword knows how to parse, validate, traverse, and transform itself. Tree shaking is where that architecture pays off.
A properties keyword knows its values are subschemas—traverse them. A const keyword knows its value is data—don’t traverse it. A $ref keyword knows to follow the reference. An examples keyword knows its values are instances, not schemas—skip them even though they might look like objects with type and properties.
The tree shaker doesn’t need to know what keywords exist. It asks each keyword: “Which of your values are schemas I should follow?” Keywords answer according to their semantics. Add a new keyword, define its traversal behavior, and tree shaking handles it automatically.
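A hedged sketch of that contract; the real interface in our keyword processor may look different. Each keyword exposes which of its values are subschemas, and the tree shaker just asks.

```typescript
// Each keyword reports the subschemas the tree shaker should follow;
// the shaker itself never hard-codes keyword names.
interface KeywordBehavior {
  subschemas(value: Json): Json[];
}

const behaviors: Record<string, KeywordBehavior> = {
  properties: { subschemas: (v) => Object.values(v as { [key: string]: Json }) },
  items:      { subschemas: (v) => [v] },
  allOf:      { subschemas: (v) => v as Json[] },
  const:      { subschemas: () => [] }, // the value is data, never traversed
  examples:   { subschemas: () => [] }, // instances, not schemas, even when they look like schemas
  // $ref is handled by following the reference rather than returning a subschema
};
```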
Schema Entry Point
↓
Keywords identify subschemas
↓
Follow references transitively
↓
Relocate with identity preservation
↓
Output minimal subgraph

This is why we built transformation infrastructure before validation. Validation is important—we validate every parameter against its schema before execution. But validation is a consumer of schema structure. Tree shaking is a producer. The same keyword-centric architecture serves both, but transformation is the harder problem.
Dialect boundaries
An OpenAPI 3.0 spec uses a JSON Schema dialect that predates $defs. Extracted schemas should use definitions for compatibility. An OpenAPI 3.1 spec uses Draft 2020-12 with $defs. The tree shaker respects these boundaries, relocating to the appropriate container based on the source dialect.
This is another place where the keyword architecture helps. Dialect configuration tells the tree shaker which definition container keyword is active. No special cases, no version checks scattered through the code.
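A minimal sketch of that configuration, assuming only the two OpenAPI dialects discussed here; the config shape is hypothetical, but the container keywords are standard JSON Schema.

```typescript
// The dialect decides where relocated schemas live; nothing else changes.
type Dialect = "openapi-3.0" | "openapi-3.1";

function definitionsContainer(dialect: Dialect): "definitions" | "$defs" {
  // OpenAPI 3.0 predates $defs, so extracted schemas go under `definitions`;
  // OpenAPI 3.1 uses JSON Schema Draft 2020-12, which uses `$defs`.
  return dialect === "openapi-3.0" ? "definitions" : "$defs";
}

// A relocated pointer then becomes `#/${definitionsContainer(dialect)}/${name}`.
```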
The output
Given an 8MB OpenAPI spec and a single operation, tree shaking produces a ~5KB self-contained schema. Everything that operation needs to validate its parameters and responses. Nothing it doesn’t.
{ "type": "object", "properties": { "customer": { "$ref": "#/$defs/Customer" }, "items": { "type": "array", "items": { "$ref": "#/$defs/LineItem" } } }, "$defs": { "Customer": { ... }, "LineItem": { ... }, "Product": { ... } }}References resolve. Names are meaningful. Cycles work. The schema validates correctly against its original dialect.
Why this matters
We’re not building a schema library. We’re building compiler infrastructure for API specifications.
A library parses and validates. A compiler parses, analyzes, transforms, and generates. Tree shaking is one transformation. Dialect conversion is another. Type generation is another. Each builds on the same keyword-centric foundation.
The goal is to make schemas a solved problem. Parse any dialect, transform as needed, extract what you need, generate what you want. The infrastructure handles the complexity so the rest of the system can treat schemas as reliable, well-understood data.
We’re going to need this for LLM tool generation—transpiling schemas into whatever format each model expects. But the infrastructure is general. Once you can rigorously transform schemas, you stop worrying about schemas.
