LLM Interpretability
This is not a finished research project yet. It is the research direction I keep returning to: how mathematical reasoning might appear inside language models, what kind of evidence could localize part of that process, and how to avoid turning suggestive behavior into overconfident claims.
The question I keep asking
When a language model solves a mathematical problem, the visible answer is only the surface of the computation. The question I want to keep sharpening is what kinds of internal evidence could distinguish among a memorized form, brittle pattern matching, and a reusable reasoning procedure that survives intervention.
Why I care
Interpretability is not only a safety concern. For mathematical reasoning, it is also a way to protect intellectual honesty: if a model appears to reason, we should ask where the relevant computation might be represented, under what conditions it breaks, and which interventions change the result rather than merely changing the explanation.
What I want to do next
- Build small mathematical tasks where the rule is explicit and failure modes are legible.
- Learn the probe, intervention, and representation-analysis tools needed for mechanism-level evidence.
- Treat negative results as evidence about what a method cannot see, not as empty failure.
- Write explanations that clearly separate observation, inference, and speculation.
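
A minimal sketch of what the first item could look like, assuming modular addition as the task. That choice is illustrative, not a committed design, and every name below (`make_modular_addition_task`, `make_memorizer`, `rule_follower`, `accuracy`) is hypothetical. The two stand-in "models" are plain Python functions, not language models. They only show how a held-out split makes the gap between memorized form and an explicit rule visible in the behavior.

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def make_modular_addition_task(
    modulus: int = 7, held_out_fraction: float = 0.25, seed: int = 0
) -> Dict[str, List[Example]]:
    """Enumerate every (a, b) pair for (a + b) mod modulus and split it.

    Each pair appears exactly once, so no held-out prompt is ever seen in train.
    """
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(modulus) for b in range(modulus)]
    rng.shuffle(pairs)
    n_held_out = int(len(pairs) * held_out_fraction)

    def render(a: int, b: int) -> Example:
        return (f"{a} + {b} mod {modulus} =", str((a + b) % modulus))

    return {
        "held_out": [render(a, b) for a, b in pairs[:n_held_out]],
        "train": [render(a, b) for a, b in pairs[n_held_out:]],
    }


def make_memorizer(train: List[Example]) -> Callable[[str], str]:
    """Stand-in for pure memorization: a lookup table over seen prompts."""
    table = dict(train)
    return lambda prompt: table.get(prompt, "?")


def rule_follower(prompt: str) -> str:
    """Stand-in for a reusable procedure: parse the prompt and apply the rule."""
    parts = prompt.split()  # e.g. ["3", "+", "4", "mod", "7", "="]
    a, b, modulus = int(parts[0]), int(parts[2]), int(parts[4])
    return str((a + b) % modulus)


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the prediction matches the expected answer."""
    return sum(predict(p) == answer for p, answer in examples) / len(examples)


if __name__ == "__main__":
    task = make_modular_addition_task()
    memorizer = make_memorizer(task["train"])
    for name, model in [("memorizer", memorizer), ("rule", rule_follower)]:
        print(name, accuracy(model, task["train"]), accuracy(model, task["held_out"]))
```

The memorizer matches the rule-follower on the training split and fails on every held-out pair. That is a behavioral observation only: it says nothing about internal representation, which is where probes and interventions would have to come in.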
