Most people look at Claude Code usage and ask one question: how many tokens did I use? That is not the wrong question. It is just too small. Once you inspect the transcript format, it becomes obvious that Claude Code exposes a much richer behavioral surface than a single usage total. The interesting part is not raw volume. It is pattern.
Token counts are only the first layer
Assistant records include usage fields such as:
- input_tokens
- output_tokens
- cache_read_input_tokens
- cache_creation_input_tokens
This is already more useful than a single total because it lets you separate:
- fresh prompt processing
- output generation
- cached context reuse
- cache construction overhead
That matters because cost and behavior are not the same thing. A session with massive cache reads can look "huge" in total token terms while still being relatively efficient. A session with repeated fresh input and little cache reuse may indicate a very different workflow shape.
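Separating those four components is a few lines of parsing once you have the JSONL. This sketch assumes each assistant record nests its usage object under message.usage, which is one plausible layout; check it against real transcript files before relying on it:

```python
import json
from collections import Counter

USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_read_input_tokens",
    "cache_creation_input_tokens",
)

def usage_breakdown(jsonl_lines):
    """Sum the four usage fields across assistant records.

    The message.usage nesting is an assumption for illustration;
    adjust the lookup to match the real record schema.
    """
    totals = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue
        usage = record.get("message", {}).get("usage", {})
        for field in USAGE_FIELDS:
            totals[field] += usage.get(field, 0)
    return dict(totals)
```

Keeping the four components separate, rather than summing them on ingest, is what makes the composition analysis below possible at all.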
A simple example of why totals mislead
Imagine two sessions that both report roughly 20 million total processed tokens (output and cache-creation tokens making up the remainder beyond the two columns shown):
| Session | Fresh input | Cache reads | Likely interpretation |
|---|---|---|---|
| A | 15M | 1M | Repeatedly reprocessing large context |
| B | 1.5M | 17M | Reusing context effectively in a long session |
Those sessions are not equivalent. If you only look at the total, they look the same. If you look at the composition, they describe different workflow shapes.
Cache reads are one of the most interesting signals
The reverse-engineered transcript data shows that cache_read_input_tokens can dwarf raw input_tokens in long sessions. This should change how you interpret usage immediately. If you are only staring at totals, you miss the point. High usage does not necessarily mean waste. It can mean the agent is reusing context effectively. This is one reason the framing "just reduce tokens" is too shallow. Token counts are just the surface. The real value is knowing where they came from and what kind of work they represent.
This is one of the clearest places where Vibenalytics benefited from transcript parsing directly. The transcript path gives us exact raw token breakdowns and request-level data. The hook path gives us speed and real-time capture. That split lets the product answer two different questions:
- what is happening right now?
- what exactly happened, with accurate token accounting?
If you collapse those into one pipeline, you usually end up sacrificing either freshness or correctness.
That tradeoff is also why we built a dedicated correctness benchmark against ccusage. There is no official gold-standard reference for Claude Code token accounting, so we had to validate the system ourselves.
The benchmark does a few important things:
- reimplements FNV-1a hashing in Python to match the Rust implementation exactly
- joins projects by hash instead of fuzzy naming
- reads the actual cwd from transcript content because Claude Code's encoded folder names are lossy
- categorizes mismatches into output-only differences versus broader grouping/counting differences
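The FNV-1a reimplementation is the most mechanical of those steps. A minimal 64-bit version using the standard offset basis and prime (the same parameters the common Rust `fnv` crate uses) looks like this; what exactly gets hashed for project identity is not shown here and would have to match the real implementation:

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a over raw bytes.

    Python integers are unbounded, so the result is masked to 64 bits
    after each multiply to match Rust's wrapping arithmetic exactly.
    """
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF
    return h
```

Joining on a hash like this avoids the fuzzy-name matching problem entirely: two tools agree on a project's identity if and only if they hashed the same bytes.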
The most interesting mismatch pattern was streaming output accounting. If one tool takes the last output_tokens value and another takes the max across streamed chunks, they will agree most of the time and diverge exactly when retry or reset behavior shows up mid-stream. That is a good example of how easy it is to get plausible-looking numbers from a subtly wrong model.
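The divergence is easy to demonstrate. In this sketch, last-value and max-value accounting agree on a monotonically growing stream and split the moment a mid-stream reset appears; the chunk values are invented for illustration:

```python
def output_tokens_last(chunks):
    """One convention: trust the final streamed output_tokens value."""
    return chunks[-1] if chunks else 0

def output_tokens_max(chunks):
    """Another convention: take the max seen across streamed chunks."""
    return max(chunks, default=0)

# A normal stream grows monotonically, so both conventions agree.
steady = [120, 480, 910]

# A mid-stream retry/reset makes the counter drop, and they diverge.
reset = [120, 480, 910, 35, 220]
```

Both functions produce plausible numbers on the reset stream, which is exactly why the discrepancy only surfaces when two independently built tools are compared against each other.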
If you want the structural reason these accounting differences happen, Inside Claude Code Transcripts: Record Types, Trees, and Turn Reconstruction breaks down how one visible "turn" becomes many records.
Turn duration is an underrated productivity metric
Claude Code writes system records with wall-clock turn duration and message counts. That lets you ask much better questions than "How expensive was this session?"
For example:
- Which turns took the longest?
- Which prompts triggered large multi-step tool loops?
- Where did a session slow down?
- Are hooks adding latency after every turn?
One expensive fast turn and one expensive slow turn are not the same workflow problem. The first may be normal heavy generation. The second may be friction.
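Answering "which turns took the longest" is a simple ranking job once the system records are parsed. The field names here (type, durationMs, uuid) are assumptions for illustration and should be mapped onto the real record schema:

```python
import json

def slowest_turns(jsonl_lines, top_n=3):
    """Rank turns by wall-clock duration from system records.

    Assumes duration lives in a top-level durationMs field on
    system-type records; adjust to the actual transcript layout.
    """
    turns = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("type") == "system" and "durationMs" in record:
            turns.append((record["durationMs"], record.get("uuid")))
    return sorted(turns, reverse=True)[:top_n]
```

Joining the slowest turns back to their prompts and tool calls is what turns this from a latency table into an answer about where the friction actually is.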
Hooks are measurable operational overhead
Stop hook summaries reveal:
- how many hooks ran
- which commands executed
- how long they took
- whether any errors occurred
- whether continuation was prevented
That is valuable because hooks are part of the user experience whether people notice them or not. If a hook stack is adding several seconds after every turn, that is no longer an implementation detail. That is workflow latency.
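As a sketch, per-turn hook overhead can be totaled from Stop hook summaries. The summary shape used here (a list of {command, durationMs, ok} entries per turn) is hypothetical, standing in for whatever the real summaries contain:

```python
def hook_latency_per_turn(stop_summaries):
    """Aggregate hook latency and errors for each turn.

    stop_summaries: one list of hook result dicts per turn.
    The {command, durationMs, ok} shape is illustrative only.
    """
    report = []
    for summary in stop_summaries:
        total_ms = sum(h.get("durationMs", 0) for h in summary)
        errors = [h["command"] for h in summary if not h.get("ok", True)]
        report.append({"total_ms": total_ms, "errors": errors})
    return report
```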
This matters operationally too. Vibenalytics itself runs on the hook path, which means hook overhead is not somebody else's problem. It is part of the product surface. That is one reason the CLI is aggressive about stripping content locally and keeping the event payloads lightweight. Real analytics only works if the measurement system does not become the bottleneck.
It is also the reason we can make a stronger privacy claim than "we promise not to store your content." The content stripping happens before anything hits disk or the backend. The backend physically cannot store prompts or code because it never receives them in the first place.
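A minimal version of that strip-before-persist step might look like the following; the set of content-bearing keys is illustrative, not Vibenalytics' actual list:

```python
# Hypothetical set of keys that carry user content rather than metadata.
SENSITIVE_KEYS = {"content", "text", "prompt", "code"}

def strip_content(event):
    """Recursively drop content-bearing fields from an event so only
    metadata (token counts, timestamps, tool names) survives.

    Runs before anything is written to disk or sent to a backend,
    so the stripped fields never leave the machine.
    """
    if isinstance(event, dict):
        return {k: strip_content(v) for k, v in event.items()
                if k not in SENSITIVE_KEYS}
    if isinstance(event, list):
        return [strip_content(v) for v in event]
    return event
```

Because the filter runs recursively, content nested inside tool results is removed just as reliably as top-level prompt text.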
Tool usage tells you how Claude actually works in your projects
Transcript data lets you move past generic AI usage questions and into operational ones:
- Is Claude mostly reading or editing?
- Is it spending more time in shell tools than file tools?
- Are certain projects search-heavy?
- Which sessions depend on delegation or compaction most often?
This is where project attribution becomes crucial. A raw total across all work is not very useful. You need to know which repository, workflow, or session generated the behavior. Without attribution, you are still flying blind.
Subagents can completely distort your picture if you ignore them
One of the easiest mistakes in transcript analytics is to analyze only the main session file. That misses sidechain work in subagents/*.jsonl, and in some cases that can mean missing a large share of actual token usage.
This is why any serious metric stack needs to account for:
- main transcript usage
- subagent usage
- tool result structure
- compaction events
Otherwise the numbers look precise while still being incomplete.
For the storage mechanics behind that, see How Claude Code Stores Subagents and Large Tool Results.
That is not a hypothetical risk. One of the hardest bugs we hit was subagent usage disappearing from the backend because sync happened before the subagent finished writing. The numbers still looked clean. They were just wrong. That is the uncomfortable truth of analytics engineering: clean charts are easy. Correct ingestion is the hard part.
The kinds of dashboards this data can power
Once the transcript is parsed correctly, you can build much more than a cost counter:
- project-level usage trends
- per-session drill-downs
- hourly activity heatmaps
- tool mix over time
- compaction frequency trends
- subagent share of total work
- request-level token composition
That is the difference between usage reporting and workflow observability.
Compaction events reveal context pressure
When Claude Code triggers marble-origami compaction, the transcript records it explicitly. That gives you a direct signal that a session hit meaningful context pressure.
Over time, that opens up some genuinely useful analytics:
- Which projects compact most often?
- Which workflows tend to push the context window?
- Do long investigative sessions compact more than implementation sessions?
- Are compactions clustered around certain tools or prompt types?
This is much more actionable than a generic "large session" label.
Compaction is a good example of a metric that becomes valuable only after you have already done the ingestion work correctly. Vibenalytics tracks compaction both through the PreCompact hook and through transcript parsing of compact_boundary messages. That makes compaction visible as its own event type rather than a vague side effect of "long session" behavior.
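Detecting those boundaries in a parsed transcript is straightforward. This sketch matches records by a compact_boundary subtype, per the description above; the exact nesting of that field is an assumption:

```python
import json

def compaction_events(jsonl_lines):
    """Collect timestamps of explicit compaction boundaries.

    Assumes compact_boundary appears as a top-level subtype field;
    adjust to the real record shape if it is nested differently.
    """
    events = []
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("subtype") == "compact_boundary":
            events.append(record.get("timestamp"))
    return events
```

Counting these per project or per session type is what turns "this session was long" into "this workflow reliably hits context pressure."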
And if you want the transcript-level explanation of what a compaction boundary actually means, marble-origami: How Claude Code Handles Context Compaction goes into that in detail.
The most useful metrics are usually cross-cutting
The best insights come from combining signals, not staring at one field. A few examples:
High cache read + low output
Likely a broad-context session with relatively small responses. This can indicate navigation, review, or context-heavy debugging.
High output + high edit/tool activity
More likely implementation work with direct code generation and file changes.
Long turn duration + many hooks
Possible post-turn automation overhead rather than model latency alone.
Frequent subagent use + low main-session output
An orchestration-heavy workflow where the parent agent delegates aggressively.
That is where transcript analytics starts becoming workflow analytics.
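Those cross-cutting patterns can be reduced to a rough classifier. The thresholds below are invented for illustration; the point is that the label comes from combinations of fields, never from any single one:

```python
def workflow_shape(metrics):
    """Map combined session signals to a rough workflow label.

    metrics: dict with cache_read, output (token counts),
    edit_tool_calls (count), and subagent_share (0..1).
    All thresholds are illustrative, not tuned values.
    """
    if metrics["cache_read"] > 10 * metrics["output"]:
        return "context-heavy review/navigation"
    if metrics["edit_tool_calls"] > 20 and metrics["output"] > 100_000:
        return "implementation-heavy"
    if metrics["subagent_share"] > 0.5:
        return "orchestration-heavy"
    return "mixed"
```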
The 30-day limit is the hard ceiling on local insight
There is one problem you cannot solve with a clever parser. Claude Code's local transcript history is usually retained for only about 30 days before cleanup removes older session files and related artifacts, so even if the transcript exposes excellent metrics, local storage still behaves like a short-lived observation window.
That means:
- trends disappear
- quarter-over-quarter comparisons disappear
- cross-project history gets fragmented
- long-term workflow changes become hard to prove
This is why local transcript analysis is useful but incomplete. It gives you snapshots. It does not give you a durable timeline.
That is also why Vibenalytics treats backend persistence as part of the analytics model, not an implementation afterthought. Claude Code's local files are the source material. They are not the final history system.
That is the same argument expanded in Claude Code's 30-Day Memory Problem.

This is the dashboard layer Claude Code is missing
Vibenalytics turns raw transcript metadata into project views, session drill-downs, cost context, and long-term trends so you can see how AI work actually happens over time.