LLM Interpretability

This is not a finished research project yet. It is the research direction I keep returning to: how mathematical reasoning might appear inside language models, what kind of evidence could localize part of that process, and how to avoid turning suggestive behavior into overconfident claims.

The question I keep asking

When a language model solves a mathematical problem, the visible answer is only the final surface. The question I want to keep sharpening is what kinds of internal evidence could distinguish among three things: memorized form, brittle pattern use, and a reusable reasoning procedure that survives intervention.

Why I care

Interpretability is not only a safety word. For mathematical reasoning, it is also a way to protect intellectual honesty: if a model appears to reason, we should ask where the relevant computation might be represented, when it breaks, and which interventions change the result rather than merely changing the explanation.

What I want to do next

Build small mathematical tasks where the rule is explicit and failure modes are legible.
Learn the probing, intervention, and representation-analysis tools needed to produce mechanism-level evidence.
Treat negative results as evidence about what a method cannot see, not as empty failure.
Write explanations that clearly separate observation, inference, and speculation.
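
As a rough illustration of the first item above, here is a minimal sketch of a small task with an explicit rule and legible failure modes. It assumes modular addition as the rule, which is only an illustrative choice and not a committed design. All names (Example, make_modular_addition_task, error_breakdown, brittle_predictor) are hypothetical, and the predictor is a stand-in for a model, not a real one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    a: int
    b: int
    modulus: int
    answer: int   # the explicit rule: (a + b) % modulus
    split: str    # "train" or "held_out"
    wraps: bool   # True when a + b >= modulus, i.e. the sum wraps around


def make_modular_addition_task(modulus=7, held_out_fraction=0.25, seed=0):
    """Enumerate every (a, b) pair and hold out a random fraction of them."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(modulus) for b in range(modulus)]
    rng.shuffle(pairs)
    n_held_out = int(len(pairs) * held_out_fraction)
    held_out = set(pairs[:n_held_out])
    examples = []
    for a, b in sorted(pairs):
        split = "held_out" if (a, b) in held_out else "train"
        examples.append(
            Example(a, b, modulus, (a + b) % modulus, split, wraps=(a + b >= modulus))
        )
    return examples


def error_breakdown(examples, predict):
    """Return {(split, wraps): (n_wrong, n_total)} so errors stay attributable."""
    counts = {}
    for ex in examples:
        key = (ex.split, ex.wraps)
        wrong, total = counts.get(key, (0, 0))
        counts[key] = (wrong + int(predict(ex.a, ex.b) != ex.answer), total + 1)
    return counts


def brittle_predictor(a, b):
    """Hypothetical stand-in for a model that learned 'add' but not 'wrap'."""
    return a + b


examples = make_modular_addition_task()
breakdown = error_breakdown(examples, brittle_predictor)
for key in sorted(breakdown):
    print(key, breakdown[key])
```

The point of tagging each example with split and wraps is that a failure can be read off by category: a predictor that only fails on wrapping sums is showing brittle pattern use, while one that only fails on held-out pairs looks more like memorization. This is an observation about behavior only; it says nothing yet about internal mechanism.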
