A Working Monograph · Code → Tokens → Trees → Reasoning

Feeding C to a Reasoning Machine

How to take a raw .c file and turn it into something a large language model can actually think about — explained plainly, with the alternatives laid side by side.

01 The problem in one breath

An LLM reads text. Source code is text. So why isn't this trivial?

Because a C file is not really a sequence of characters — it only looks like one. Underneath the text lives structure: functions, types, scopes, who-calls-whom, which header pulls in which definition. If you paste the raw file into a model, you hand it the words but throw away the grammar. For one small file that's fine; the model will infer the rest. For anything real, three problems bite at once.

So "parsing C for an LLM" really means a small pipeline: understand the structure, slice along its natural seams, keep the relationships, and feed the model only the relevant pieces — within budget. The middle of this page builds that pipeline one layer at a time and lists every common method so you can pick. The final sections then turn to AUTOSAR — where embedded C stops being plain C and quietly breaks half the assumptions this pipeline rests on.

02 A ladder of representations

From "just the text" up to "a queryable graph." Each rung adds meaning — and cost.

Think of it as a ladder. The higher you climb, the more the machine understands rather than merely reads — but every rung costs more tooling and time. Most real systems stop somewhere in the middle and combine a couple of rungs.

The representation ladder
Rung 0: Raw text
Rung 1: Tokens
Rung 2: Syntax tree
Rung 3: Semantic model
Rung 4: Code graph

03 Step one: chopping into tokens

Lexing — the simplest structural step, and the foundation of everything above it.

Before any tree exists, a lexer (or tokenizer) sweeps the characters and groups them into meaningful atoms: keywords, identifiers, numbers, string literals, operators, punctuation. Whitespace and comments are usually tagged or dropped. It answers "what are the words?" but not "how do they fit together?"

You rarely build a lexer by hand for LLM work — the parser below does it for you — but it's worth seeing, because tokens are also what the model's own tokenizer (and your token-budget math) operate on.

// input
int sum(int a, int b) { return a + b; }

// tokens (kind : text)
keyword:int  ident:sum  punct:(  keyword:int  ident:a  punct:,
keyword:int  ident:b  punct:)  punct:{  keyword:return
ident:a  op:+  ident:b  punct:;  punct:}
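
To make the lexing step concrete, here is a minimal sketch of a regex-based lexer in Python. It is an illustration only: the token kinds, the `KEYWORDS` set, and the `lex` function are hypothetical names chosen for this example, and the patterns cover just a small slice of C (no character literals, no preprocessor lines, only a few operators).

```python
import re

# Token kinds, tried in order; an earlier pattern wins at a given position.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("string",  r'"(?:\\.|[^"\\])*"'),
    ("number",  r"\d+(?:\.\d+)?"),
    ("ident",   r"[A-Za-z_]\w*"),
    ("op",      r"==|!=|<=|>=|&&|\|\||->|[+\-*/%<>=!&|?:.]"),
    ("punct",   r"[(){}\[\];,]"),
    ("space",   r"\s+"),
]
# A deliberately small keyword set; real C has many more.
KEYWORDS = {"int", "float", "char", "void", "return", "if", "else",
            "for", "while", "struct"}
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC), re.DOTALL)


def lex(src):
    """Return a list of (kind, text) pairs, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character {src[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("space", "comment"):
            continue
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for kind, text in lex("int sum(int a, int b) { return a + b; }"):
        print(f"{kind}:{text}", end="  ")
```

Run on the input above, it yields the same sequence of kinds and texts as the hand-written token listing. In practice the parser's built-in lexer does this job; the sketch is only here to show how little "structure" a token stream carries.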

04 Step two: building the tree

This is the heart of it. A parser turns the flat token stream into a tree that mirrors the code's grammar.

A syntax tree captures nesting: a function contains a body, which contains statements, which contain expressions. Two flavors exist. A Concrete Syntax Tree (CST) keeps every detail including punctuation and exact positions — great for tooling that must map back to the source. An Abstract Syntax Tree (AST) drops the noise and keeps the meaning — cleaner to reason over. People often say "AST" loosely for both.

Watch the same snippet move up the ladder:

From text to tree, in four views: source, tokens, syntax tree, semantic tree
float clamp(float x, float lo, float hi) {
  if (x < lo) return lo;
  return x > hi ? hi : x;
}
kw:float  id:clamp  (  kw:float id:x , kw:float id:lo , kw:float id:hi )
{  kw:if  (  id:x  <  id:lo  )  kw:return id:lo ;
kw:return id:x > id:hi ? id:hi : id:x ;  }
function_definition
├─ type: float
├─ declarator: function_declarator
│  ├─ name: clamp
│  └─ parameters: x:float, lo:float, hi:float
└─ body: compound_statement
   ├─ if_statement
   │  ├─ condition: binary_expr (x < lo)
   │  └─ then: return lo
   └─ return_statement
      └─ conditional_expr (x>hi ? hi : x)
// a semantic parser also resolves types & symbols
FunctionDecl clamp 'float (float, float, float)'
├─ ParmVarDecl x 'float'
├─ ParmVarDecl lo 'float'
├─ ParmVarDecl hi 'float'
└─ CompoundStmt
   ├─ IfStmt
   │  └─ BinaryOperator '<' 'int'
   │     ├─ DeclRefExpr x → ParmVarDecl x
   │     └─ DeclRefExpr lo → ParmVarDecl lo
   └─ ReturnStmt → ConditionalOperator 'float'

Notice the jump from view three to four: the syntax tree only knows shapes, while the semantic tree has resolved that the x in the comparison is the same variable as the parameter x, and that the whole conditional expression has type float. That resolution is exactly what separates the two main tools below, Tree-sitter (syntax only) and libclang (semantic).

The three tools you'll actually reach for

Tree-sitter is the workhorse of modern code tooling. It's a fast, incremental parser that builds a CST, tolerates broken or half-written code, and has a ready-made C grammar. It does not run the preprocessor or resolve types — it parses the text as written, which is usually what you want for chunking and search.

# pip install tree-sitter tree-sitter-c
from tree_sitter import Language, Parser
import tree_sitter_c as tsc

C = Language(tsc.language())
parser = Parser(C)

src = b"float clamp(float x){ return x; }"
tree = parser.parse(src)
print(tree.root_node)        # walk .children, .type, .text
# query with S-expressions to grab every function:
q = C.query("(function_definition) @fn")

Reach for it when: you want fast, language-agnostic, error-tolerant structure for chunking, symbol extraction, or repo maps. It's what Aider, many IDEs, and many code-RAG pipelines use under the hood.

libclang exposes the real Clang C/C++ compiler frontend through Python (clang.cindex). Because it is a compiler, it runs the preprocessor, resolves #includes, expands macros, and assigns types. That's the semantic tree from view four. The price: it needs the right include paths and compile flags to work, ideally from a compile_commands.json.

# needs libclang installed; pip install libclang
from clang import cindex
idx = cindex.Index.create()
tu = idx.parse("motor.c", args=["-I./inc", "-std=c11"])

def walk(node):
    if node.kind == cindex.CursorKind.FUNCTION_DECL:
        print(node.spelling, node.result_type.spelling)
    for ch in node.get_children(): walk(ch)
walk(tu.cursor)

Reach for it when: you need ground truth — exact types, resolved macros, real call targets, cross-file symbols. Heavier to set up, unbeatable on accuracy.

pycparser is a pure-Python C99 parser — no compiler dependency, easy to embed, gives you a clean AST you can walk with a visitor. The catch: it cannot handle the preprocessor itself, so you must feed it already-preprocessed code (run gcc -E first, or use its fake-headers trick).

# pip install pycparser ; preprocess first!
# gcc -E -I./inc motor.c > motor.i
from pycparser import parse_file, c_ast

ast = parse_file("motor.i", use_cpp=False)

class FuncVisitor(c_ast.NodeVisitor):
    def visit_FuncDef(self, node):
        print(node.decl.name)
FuncVisitor().visit(ast)

Reach for it when: you want a lightweight, dependency-free AST for clean preprocessed C and full control of the walk — common in research and small tools.

05 The C-only landmine: the preprocessor

This is the trap that surprises people coming from Python or JavaScript tooling.

In C, what's on disk is not what compiles. The preprocessor runs first — splicing in headers, expanding macros, and deleting whole branches behind #ifdef. So a function call like MAX(a, b) might be a macro that vanishes, and a type might only exist after a header is pasted in.

Gotcha: Tree-sitter parses the unexpanded text, so macros and conditional code can confuse it. pycparser does not handle preprocessor directives at all, so it only sees what you preprocessed for it. libclang expands everything correctly, but only if you give it the same include paths and flags the build uses.
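
To see why "what's on disk is not what compiles", here is a toy sketch that keeps only the active branches of `#ifdef` / `#ifndef` / `#else` / `#endif` for a given set of defined names. It is not a preprocessor: it ignores `#if`, `#elif`, `defined(...)`, macro expansion, and includes, and the names `strip_inactive` and `USE_FPU` are hypothetical. For real work, use the actual preprocessor (for example `gcc -E`) or libclang.

```python
SRC = """\
#ifdef USE_FPU
float scale(float x) { return x * 0.5f; }
#else
int scale(int x) { return x / 2; }
#endif
"""


def strip_inactive(src, defined):
    """Keep only lines that sit in an active #ifdef/#ifndef/#else branch."""
    out = []
    stack = []      # one (parent_active, branch_condition) entry per open conditional
    active = True   # is the current line inside an active branch?
    for line in src.splitlines():
        s = line.strip()
        if s.startswith("#ifdef") or s.startswith("#ifndef"):
            directive, name = s.split()[:2]
            cond = (name in defined) if directive == "#ifdef" else (name not in defined)
            stack.append((active, cond))
            active = active and cond
        elif s.startswith("#else"):
            parent_active, cond = stack[-1]
            stack[-1] = (parent_active, not cond)
            active = parent_active and not cond
        elif s.startswith("#endif"):
            parent_active, _ = stack.pop()
            active = parent_active
        elif active:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(strip_inactive(SRC, {"USE_FPU"}))   # the float variant survives
    print(strip_inactive(SRC, set()))         # the int variant survives
```

The same file yields two different functions named `scale` depending on one flag, which is why a parser that sees the raw text sees both and a compiler sees exactly one.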

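The practical resolution depends on your goal. For chunking and search, parsing the text as written with Tree-sitter is usually enough. For exact types, expanded macros, and real call targets, use libclang with the build's include paths and flags, ideally from a compile_commands.json. For pycparser, preprocess first with gcc -E.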

06 Step three: capturing relationships

A tree describes one file. Reasoning over a system needs the wiring between files.

Once you have trees, you extract the connective tissue: which functions exist, where they're defined, and who references whom. These artifacts are what let a model answer "if I change this, what breaks?"

Why this matters for LLMs: a model reasons far better when, alongside the function you're asking about, you also hand it the signatures of its callers and callees. The call graph is how you decide which extra pieces to include without dumping the whole repo.
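
As a rough illustration of using a call graph to pick context, the sketch below assumes you already extracted a "function calls these functions" map (with any of the parsers above) and derives the one-hop neighbourhood of a target. The `CALLS` data is made up for the example: `foc_update` and its callees come from the snippet later on this page, while `control_loop` and `read_currents` are hypothetical.

```python
from collections import defaultdict

# Hypothetical extraction result: function name -> names of functions it calls.
CALLS = {
    "control_loop": ["read_currents", "foc_update"],
    "foc_update": ["clarke_transform", "park_transform", "pi_step"],
    "pi_step": [],
}


def invert(calls):
    """Build the reverse index: function name -> names of its callers."""
    callers = defaultdict(list)
    for fn, callees in calls.items():
        for callee in callees:
            callers[callee].append(fn)
    return dict(callers)


def neighbours(fn, calls):
    """Direct callers and callees of fn: the extra signatures worth including."""
    callers = invert(calls)
    return {
        "callers": sorted(callers.get(fn, [])),
        "callees": sorted(calls.get(fn, [])),
    }


if __name__ == "__main__":
    print(neighbours("foc_update", CALLS))
```

For a question about `foc_update`, this selects one caller and three callees, so the prompt can carry five signatures instead of the whole repository.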

07 Step four: slicing for the context window

The single most important practical decision: where do you cut?

You can't show the model everything, so you split the code into chunks that fit the window. The naive way (cut every N lines) is a quiet disaster: it slices functions in half, so neither chunk is a valid, self-contained unit. The fix is structure-aware chunking: cut along the tree's natural seams, so each chunk is a whole function, struct, or declaration.

Same file, two ways to cut it
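
The contrast can be sketched in a few lines of Python. This is a simplified stand-in for tree-based chunking: it tracks brace depth instead of walking a syntax tree, so braces inside strings, comments, or macros would fool it. The function names `chunk_by_lines` and `chunk_by_top_level` are hypothetical.

```python
SRC = """\
typedef struct { float kp, ki, i; } PI;

float pi_step(PI *c, float err) {
    c->i += err * c->ki;
    return c->kp * err + c->i;
}

float clamp(float x, float lo, float hi) {
    if (x < lo) return lo;
    return x > hi ? hi : x;
}
"""


def chunk_by_lines(src, n):
    """Naive chunking: cut every n lines, wherever that happens to land."""
    lines = src.splitlines()
    return ["\n".join(lines[i:i + n]) for i in range(0, len(lines), n)]


def chunk_by_top_level(src):
    """Brace-depth heuristic: close a chunk when a top-level item ends."""
    chunks, current, depth = [], [], 0
    for line in src.splitlines():
        if not line.strip() and not current:
            continue  # skip blank lines between top-level items
        current.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0 and line.rstrip().endswith(("}", ";")):
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    print(chunk_by_lines(SRC, 4)[0])       # ends in the middle of pi_step
    print("-----")
    for chunk in chunk_by_top_level(SRC):  # typedef, pi_step, clamp
        print(chunk)
        print("-----")
```

With a 4-line window the first naive chunk contains the typedef plus the first half of `pi_step`; the structure-aware pass returns three whole units.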

You don't have to build this yourself. Two popular libraries do structure-aware splitting for C out of the box:

# LangChain — code-aware separators per language
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter, Language)
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.C, chunk_size=1200, chunk_overlap=150)

# LlamaIndex — true tree-sitter chunking
from llama_index.core.node_parser import CodeSplitter
splitter = CodeSplitter(language="c", chunk_lines=40,
                        chunk_lines_overlap=10, max_chars=1500)

Always attach metadata: whatever cuts you make, tag each chunk with its file path, line range, function name, and signature. That metadata is what lets the model (and you) trace an answer back to the exact place in the source, and it gives retrieval extra fields to filter and rank on.

08 Step five: showing only what matters

For anything bigger than the window, you need to retrieve the relevant chunks per question.

Now you have a pile of clean, tagged chunks. For a real repo there are thousands. The job at query time is to surface the handful that actually bear on the question. Three families of methods, often combined:

Semantic retrieval (RAG)

Convert each chunk into a vector with a code-aware embedding model, store the vectors, and at query time fetch the chunks whose vectors sit nearest the question. This is classic retrieval-augmented generation. It's great at "find code that does X" even when the wording differs. The cAST paper reports that AST-aligned chunks improve code retrieval and downstream generation compared with fixed-size, line-based chunks.
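
The retrieval mechanics can be shown without any model at all. In the sketch below a bag-of-words count vector stands in for a real embedding model (a deliberate simplification, labeled as such in the code), and the chunks, their comments, and the `top_k` helper are hypothetical. Only the shape of the computation carries over: embed, compare by cosine similarity, keep the nearest.

```python
import math
import re
from collections import Counter

# Hypothetical tagged chunks; the text is what would be embedded.
CHUNKS = [
    {"fn": "clamp",
     "text": "float clamp(float x, float lo, float hi) /* limit x to the range lo..hi */"},
    {"fn": "pi_step",
     "text": "float pi_step(PI *c, float err) /* integrate error and return controller output */"},
    {"fn": "sum",
     "text": "int sum(int a, int b) /* add two integers */"},
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    best = top_k("limit x to the range lo hi", CHUNKS, k=1)
    print(best[0]["fn"])
```

A real embedding model would also match paraphrases that share no words with the chunk, which this word-count version cannot do.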

Structural retrieval (graphs)

Walk the call/include graph instead of (or alongside) vectors: given the target function, pull its direct callers and callees. Encoding the codebase as a graph and traversing it — sometimes called GraphRAG — captures relationships that pure text similarity misses.

The repo-map trick — a worked example

The cleanest real-world synthesis is Aider's repository map. It parses every file with Tree-sitter to extract definitions and references, builds a dependency graph from them (per Aider's documentation, source files are the nodes and edges connect files that depend on each other), then ranks the graph with a PageRank-style algorithm, the same idea Google used for web pages. Code that many important parts of the repository reference scores high. It then emits just the top-ranked signatures, trimmed to fit the token budget. The model gets a compact map of the whole codebase's skeleton without ever seeing every line.
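
The ranking idea can be sketched with textbook PageRank by power iteration. This is not Aider's implementation: the graph here is a small hypothetical call graph over function names, the signatures for `control_loop` and the budget are invented, and token cost is approximated by counting whitespace-separated words.

```python
# Hypothetical reference graph: name -> names it references.
EDGES = {
    "control_loop": ["foc_update", "read_currents"],
    "foc_update": ["clarke_transform", "park_transform", "pi_step"],
    "speed_loop": ["pi_step"],
}

SIGNATURES = {
    "pi_step": "float pi_step(PI *c, float err);",
    "foc_update": "void foc_update(MotorState *m, float i_a, float i_b);",
    "control_loop": "void control_loop(void);",
}


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict-of-lists graph."""
    nodes = set(edges) | {dst for targets in edges.values() for dst in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = edges.get(src) or list(nodes)
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank


def repo_map(edges, signatures, budget_tokens):
    """Emit top-ranked signatures until the (crudely estimated) budget is spent."""
    rank = pagerank(edges)
    lines, used = [], 0
    for name in sorted(signatures, key=lambda n: rank.get(n, 0.0), reverse=True):
        cost = len(signatures[name].split())  # rough token estimate
        if used + cost > budget_tokens:
            break
        lines.append(signatures[name])
        used += cost
    return lines


if __name__ == "__main__":
    for line in repo_map(EDGES, SIGNATURES, budget_tokens=12):
        print(line)
```

Here `pi_step` ranks first because two different callers reference it, so its signature is the first to make it into the map.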

The unifying idea: importance is relational. The best context-selection methods (PageRank repo maps, graph traversal, embedding similarity) all answer the same question: given limited room, which pieces of this codebase most help reason about the thing being asked?

09 How to actually hand it to the model

A subtle point that trips people up: don't over-engineer the format.

It's tempting to dump the full AST as JSON and feed that to the model. Usually a mistake — AST dumps are enormously verbose and burn tokens, and modern models reason on source text quite well already. The structure you worked so hard to extract is best used to decide what to include and how to label it, not to replace the source.

A reliable format for each retrieved piece:

// file: src/foc.c  ·  lines 88–121  ·  fn: foc_update
// calls: clarke_transform, park_transform, pi_step
void foc_update(MotorState *m, float i_a, float i_b) {
    // ... the real, unmodified source ...
}

That is: the original source, preceded by a short metadata header and a one-line summary of its dependencies. Keep comments and the real identifier names — they carry most of the semantic signal the model uses. Only fall back to richer encodings (typed signatures, flow facts) when a task genuinely needs them.
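
A small sketch of producing that format from a tagged chunk follows. The `Chunk` dataclass and `render` function are hypothetical names for this example; the header layout simply mirrors the one shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One retrieved piece of source plus the metadata attached at chunk time."""
    path: str
    start: int
    end: int
    fn: str
    source: str
    calls: list = field(default_factory=list)


def render(chunk):
    """Prefix the unmodified source with a short metadata header."""
    lines = [f"// file: {chunk.path}  ·  lines {chunk.start}–{chunk.end}  ·  fn: {chunk.fn}"]
    if chunk.calls:
        lines.append("// calls: " + ", ".join(chunk.calls))
    lines.append(chunk.source)
    return "\n".join(lines)


if __name__ == "__main__":
    chunk = Chunk(
        path="src/foc.c", start=88, end=121, fn="foc_update",
        source="void foc_update(MotorState *m, float i_a, float i_b) {\n    /* ... */\n}",
        calls=["clarke_transform", "park_transform", "pi_step"],
    )
    print(render(chunk))
```

Keeping the header in C comment syntax means the assembled prompt is still valid C, which tends to be easier for both people and models to read.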

10 The whole pipeline, end to end

This is the recipe many production code-RAG systems follow.

A reference pipeline
Stage 1: Collect
Stage 2: Decide on macros
Stage 3: Parse
Stage 4: Extract
Stage 5: Link
Stage 6: Chunk + tag
Stage 7: Index
Stage 8: Assemble

11 Every method, side by side

The "other available methods" in one matrix, so you can pick by constraint.

Method | What it gives | Resolves macros/types | Effort | Best for
--- | --- | --- | --- | ---
Raw paste | The text | No | Trivial | One small file
Regex / heuristics | Rough function splits | No | Low | Quick hacks; fragile
Tree-sitter | CST, error-tolerant | No | Low–med | Chunking, search, repo maps
libclang / Clang | Typed semantic AST | Yes | High | Exact analysis, call graphs
pycparser | Clean C99 AST | Pre-process first | Medium | Lightweight tooling/research
ctags / cscope / Global | Symbol & xref index | Partial | Low | Fast "where defined/used"
clangd (LSP) | On-demand semantic facts | Yes | Medium | Agents querying live
Doxygen / cflow | Call graphs, docs | Partial | Medium | Visual structure, summaries
Embedding RAG | Semantic chunk search | N/A | Medium | Large repos, fuzzy queries
GraphRAG / repo map | Ranked structural context | Depends | High | Whole-codebase reasoning

12 Practical notes — the things that bite

Hard-won gotchas.


13 When the C is just an artifact: AUTOSAR

AUTOSAR Classic turns C into generated glue around a configuration. The meaning moves out of the file.

Everything above quietly assumes one thing: that the file is the unit of meaning. Read the function, parse the tree, follow the calls, and you understand the code. AUTOSAR Classic breaks that assumption. Here the C is mostly generated from a configuration (ARXML) by tools like DaVinci, EB tresos, or model generators such as TargetLink. The source of truth is the configuration; the C is one of its projections. Hand-written logic survives only inside the runnables of software components — everything around it is machinery.

The cleanest way to feel the difference is to see the same logic in both worlds:

The same function, two worlds
// plain C — self-contained, says what it means
float pi_step(PI *c, float err) {
    c->i += err * c->ki;
    return c->kp * err + c->i;   // caller links directly
}
// AUTOSAR Classic — the same math, as generated glue
#define PiCtrl_START_SEC_CODE
#include "PiCtrl_MemMap.h"        // re-included, no include guard

FUNC(void, PiCtrl_CODE) PiCtrl_Step(void) {     // a runnable
    VAR(float32, AUTOMATIC) err, out;
    (void)Rte_Read_PiCtrl_Err_val(&err);        // input via RTE
    out = c_kp*err + /* ... integral ... */;
    (void)Rte_Write_PiCtrl_Out_val(out);        // output via RTE
}

#define PiCtrl_STOP_SEC_CODE
#include "PiCtrl_MemMap.h"

Four things changed and none are cosmetic: the signature is wrapped in FUNC/VAR macros from the compiler-abstraction layer; the parameters and return value are gone because data flows through the RTE, not the call; the body is bracketed by MemMap includes that place it in a linker section; and — most important — who supplies Err and who consumes Out is nowhere in this file. It lives in the ARXML connectors.

The core inversion: in general C, behavior is in the code and configuration is incidental. In AUTOSAR Classic, behavior is in the configuration and the C is its output. Parse the C alone and you have the verbs without the sentence.

The RTE makes call graphs lie

Components never call each other directly. They read and write through generated Rte_* stubs, and the wiring — which writer feeds which reader, which runnable is triggered by which event — is defined by ARXML connectors and the RTE event mapping. A call graph built from the C shows the stub, not the partner:

SWC: SensorCal                                     SWC: PiCtrl
Rte_Write_…_val ──┐                           ┌──► Rte_Read_…_val
                  │                           │
                  └─── [ ARXML connector ] ───┘
                                ↑
this edge is the real data path, and it exists only in configuration, never as a C call.
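
One way to repair the graph is to add the configuration edges yourself. The sketch below is a loose illustration under strong assumptions: the connector list is a hand-written stand-in for what would be read from ARXML (no ARXML is parsed here), the port name on `SensorCal` is invented, and the regular expression assumes the `Rte_<Read|Write>_<Swc>_<Port>_<Element>` shape from the snippet above with no underscores inside the individual names. Real RTE naming depends on the generator and configuration.

```python
import re

# Hypothetical stand-in for ARXML connectors: (SWC, port) -> (SWC, port).
CONNECTORS = [
    {"provider": ("SensorCal", "Err"), "requester": ("PiCtrl", "Err")},
]

# Hypothetical runnable bodies, reduced to their RTE calls.
SWC_SOURCES = {
    "SensorCal": "(void)Rte_Write_SensorCal_Err_val(err);",
    "PiCtrl": "(void)Rte_Read_PiCtrl_Err_val(&err);\n"
              "(void)Rte_Write_PiCtrl_Out_val(out);",
}

# Assumed naming shape; names containing underscores would need the config to split.
RTE_CALL = re.compile(
    r"\bRte_(Read|Write)_([A-Za-z0-9]+)_([A-Za-z0-9]+)_([A-Za-z0-9]+)\s*\(")


def rte_accesses(swc_sources):
    """Collect (kind, swc, port) triples for every RTE read or write found."""
    found = []
    for src in swc_sources.values():
        for kind, swc, port, _element in RTE_CALL.findall(src):
            found.append((kind, swc, port))
    return found


def data_edges(swc_sources, connectors):
    """Turn connectors into writer -> reader edges backed by calls in the C."""
    accesses = rte_accesses(swc_sources)
    writers = {(swc, port) for kind, swc, port in accesses if kind == "Write"}
    readers = {(swc, port) for kind, swc, port in accesses if kind == "Read"}
    edges = []
    for conn in connectors:
        if conn["provider"] in writers and conn["requester"] in readers:
            edges.append((conn["provider"][0], conn["requester"][0]))
    return edges


if __name__ == "__main__":
    print(data_edges(SWC_SOURCES, CONNECTORS))
```

The resulting `SensorCal -> PiCtrl` edge appears in no C call graph; it exists only once the connector data is joined with the stub calls.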

Gotcha · MemMap: those #include "Xxx_MemMap.h" lines re-include the same header repeatedly with no include guard, toggled by section macros, to place symbols in linker sections. A naive chunker mishandles the repetition, and a semantic parser errors unless the section macros are defined and the MemMap headers are present. It is an idiom you essentially never meet in general C.

A more constrained, more regular language

AUTOSAR C is a disciplined subset: MISRA-C compliance, no dynamic memory in the classic platform, restricted pointers, strict <Module>_<Function> naming, platform types (uint8, sint16, float32, boolean) instead of native ones, and large regions gated on configuration, e.g. #if (PiCtrl_DEV_ERROR_DETECT == STD_ON). The regularity can help a model pattern-match — but the idioms differ sharply from the open-source C it mostly trained on.
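
Because so much code is gated on configuration, it can help to list which switches a file depends on before handing it to a model. The sketch below is a minimal, assumption-laden scan: it only recognizes the `#if (NAME == STD_ON)` / `STD_OFF` shape quoted above, the helper name `config_switches` is hypothetical, and the sample source is invented for the example.

```python
import re

# Invented sample in the style described above.
SRC = """\
#if (PiCtrl_DEV_ERROR_DETECT == STD_ON)
    /* development error checks */
#endif
#if (PiCtrl_VERSION_INFO_API == STD_ON)
    /* version info API */
#endif
#if (PiCtrl_DEV_ERROR_DETECT == STD_ON)
    /* more checks */
#endif
"""

# Matches only the simple "#if (NAME == STD_ON|STD_OFF)" form.
SWITCH = re.compile(
    r"^\s*#\s*if\s*\(?\s*(\w+)\s*==\s*(STD_ON|STD_OFF)\s*\)?", re.MULTILINE)


def config_switches(src):
    """Return the distinct (switch, compared value) pairs gating code in src."""
    return sorted(set(SWITCH.findall(src)))


if __name__ == "__main__":
    for name, value in config_switches(SRC):
        print(f"{name} == {value}")
```

Attaching this list to a chunk's metadata tells the model which variant it is reading, without expanding every branch.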

Classic vs Adaptive: everything here is Classic AUTOSAR (C, statically configured, for ECUs). Adaptive AUTOSAR is a different beast: C++14/17 on POSIX, service-oriented over SOME/IP and ara::com, dynamically deployed. It is closer to general C++ tooling, with its own challenges.

14 Why AUTOSAR resists reasoning

Each peculiarity above maps to a concrete bottleneck for an LLM; there are seven.

The seven bottlenecks
B1: Meaning in ARXML
B2: Graphs that lie
B3: Boilerplate bloat
B4: Preprocess dilemma
B5: Many artifacts
B6: Thin priors
B7: Variant-dependent

What actually helps

The fixes all follow from one move: stop treating the C as the whole input.

The mental shift: for general C the file is the unit of meaning. For AUTOSAR the configuration is the unit of meaning, and the C is one of its projections. Build the pipeline around the config (fuse ARXML with the code) and the C falls into place.

15 References & further reading

Primary docs first, then the research and worked examples cited above.

  1. Tree-sitter — official docs & the "Using Parsers" guide. tree-sitter.github.io; Python bindings: github.com/tree-sitter/py-tree-sitter (current API: Language(tsc.language()), Parser(LANG)).
  2. Clang / libclang — the LibClang C interface and Python clang.cindex bindings. clang.llvm.org/docs/Tooling.html.
  3. pycparser — pure-Python C99 parser by Eli Bendersky. github.com/eliben/pycparser (see the notes on preprocessing and fake headers).
  4. Bear — generates compile_commands.json so semantic parsers see your real build. github.com/rizsotto/Bear.
  5. clangd — Language Server providing semantic queries over C/C++. clangd.llvm.org.
  6. universal-ctags, cscope, GNU Global — symbol and cross-reference indexers. ctags.io · gnu.org/software/global.
  7. Doxygen & cflow — documentation and call-graph extraction. doxygen.nl · gnu.org/software/cflow.
  8. LangChain text splitters: RecursiveCharacterTextSplitter.from_language(Language.C). docs.langchain.com.
  9. LlamaIndex CodeSplitter — Tree-sitter-backed code chunking. developers.llamaindex.ai.
  10. Aider repository map — Tree-sitter + PageRank-ranked, token-budgeted context. aider.chat/docs/repomap.html.
  11. cAST: Structural Chunking via Abstract Syntax Tree (2025) — evidence that AST-aware chunks improve code retrieval & RAG. arXiv:2506.15655.
  12. AUTOSAR Classic Platform specifications — the Software Specifications (SWS) for the RTE, Compiler Abstraction, Platform Types, and Memory Mapping define the Rte_* APIs, FUNC/P2VAR macros, platform types, and the MemMap idiom discussed above. autosar.org/standards/classic-platform.
  13. MISRA C — the coding-standard subset AUTOSAR C is written against. misra.org.uk.
  14. ISO 26262 — road-vehicle functional safety; the ASIL context that makes provenance and traceability non-optional for any reasoning over this code. iso.org · ISO 26262.

Library APIs and grammar versions move quickly — treat the code snippets as current-as-of-2026 patterns and check each project's docs before pinning versions.