
How Scene-Based Context Compression Works

April 8, 2026

Every AI model has a context window: a fixed amount of text it can see at once. Claude's is a million tokens. Gemini's is a million. GPT-4's is 128K. These are big numbers. They are not big enough.

A single scene of interactive fiction generates a few thousand tokens per exchange. Ten exchanges and you're at 30-50K tokens. Add the system prompt, character descriptions, and world context, and you're burning through that window fast. And that's one scene. A real story has dozens.

Most AI writing tools handle this by silently truncating the oldest messages. Your story loses its beginning, then its middle, one message at a time. The AI doesn't know it's forgotten anything. It just starts contradicting itself.
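To make the failure mode concrete, here is a minimal sketch of oldest-first truncation. It is not any particular tool's code: the `Message` type, the `truncate_oldest` function, and the four-characters-per-token estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: assume about four characters per token.
    return max(1, len(text) // 4)


def truncate_oldest(messages: list[Message], budget: int) -> list[Message]:
    """Keep the newest messages that fit in the budget and drop the rest."""
    kept: list[Message] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message.text)
        if used + cost > budget:
            break  # everything older than this point is silently discarded
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Nothing in the returned list records that older messages ever existed, which is why the model has no way to notice the gap.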

Why truncation kills fiction

Truncation works fine for chatbots. "What's the return policy?" doesn't depend on what you asked twenty messages ago. Fiction is the opposite — later scenes only make sense because of earlier ones. A character's reaction in scene twelve is driven by what happened in scene three. Drop scene three and the model generates plausible text, but not correct text. The specificity evaporates. Tension that was building across chapters just... isn't there anymore.

Naive summarization — "summarize everything so far" — is almost as bad. It produces flat plot recaps that strip out exactly what fiction needs: tone, subtext, unresolved tensions, the specific detail that a character's hands were shaking when they denied knowing about the letter.

Scenes, not messages

Underfiction treats the scene — not the message — as the fundamental unit of narrative. A scene has a setting, a cast, a tone, and a bounded arc. This mirrors how novels actually work. Chapters and scenes, not an unbounded stream of dialogue.

When you're writing a scene, the model sees a hierarchical context: the current scene at full resolution, summaries of all previous scenes, character descriptions, and world rules. Recent content gets maximum detail. Older content is preserved at lower resolution. Nothing is dropped — it's compressed.
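As a rough sketch of what that hierarchy could look like in code (the `Scene` type, the `build_context` function, and the section labels are hypothetical, not Underfiction's actual implementation):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scene:
    title: str
    turns: list[str] = field(default_factory=list)  # full-resolution exchanges
    summary: Optional[str] = None  # filled in once the scene has ended


def build_context(
    world_rules: str,
    character_cards: list[str],
    past_scenes: list[Scene],
    current: Scene,
) -> str:
    """Assemble a prompt: stable material, then summaries, then the live scene."""
    parts = ["WORLD RULES\n" + world_rules]
    parts.append("CHARACTERS\n" + "\n".join(character_cards))
    # Ended scenes contribute only their summaries (lower resolution).
    recap = [f"{s.title}: {s.summary}" for s in past_scenes if s.summary]
    parts.append("STORY SO FAR\n" + "\n".join(recap))
    # The current scene is included verbatim (full resolution).
    parts.append(f"CURRENT SCENE: {current.title}\n" + "\n".join(current.turns))
    return "\n\n".join(parts)
```

The ordering is a design choice in this sketch: material that rarely changes goes first, and the text the model must continue goes last.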

How the compression pipeline works

When you end a scene, the system generates a structured summary. Not a plot recap — a narrative distillation. The summary prompt is designed to capture what changed: relationship shifts, unresolved tensions, key decisions, emotional trajectories. "They argued" is a bad summary. "She told him she knew about the letter, and he denied it — but she saw his hands shake" preserves the thread the next scene needs.
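One way to picture a structured summary is as a small record with a slot for each kind of change. The sketch below is illustrative only: the `SceneSummary` fields mirror the categories named above, while the field names, the `render` format, and the `SUMMARY_PROMPT` wording are assumptions rather than the production prompt.

```python
from dataclasses import dataclass, field


@dataclass
class SceneSummary:
    relationship_shifts: list[str] = field(default_factory=list)
    unresolved_tensions: list[str] = field(default_factory=list)
    key_decisions: list[str] = field(default_factory=list)
    emotional_trajectories: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the summary into the text carried into later scenes."""
        sections = [
            ("Relationship shifts", self.relationship_shifts),
            ("Unresolved tensions", self.unresolved_tensions),
            ("Key decisions", self.key_decisions),
            ("Emotional trajectories", self.emotional_trajectories),
        ]
        lines: list[str] = []
        for label, items in sections:
            if items:  # skip empty sections to save tokens
                lines.append(label + ":")
                lines.extend("- " + item for item in items)
        return "\n".join(lines)


# Hypothetical prompt template; {scene_text} is filled with the ended scene.
SUMMARY_PROMPT = (
    "Summarize the scene below for a future writer, not as a plot recap. "
    "Record what changed: relationship shifts, unresolved tensions, "
    "key decisions, and emotional trajectories. Keep concrete details "
    "that later scenes may depend on.\n\nSCENE:\n{scene_text}"
)
```

Separate slots make it harder for a summary to collapse into "they argued": an empty "Unresolved tensions" list is visible in a way that a vague sentence is not.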

Summaries are generated by Claude Sonnet, which is capable enough to understand narrative structure and fast enough not to feel like a loading screen. We cover the cost. You don't pay for compression, only for the prose you read.

Mid-scene compression

Not every scene is short. A complex negotiation or a slowly building confrontation can run long. When the uncompressed history in a scene crosses 6,000 tokens (roughly 15-20 exchanges), background compression kicks in. The oldest turns get summarized and replaced. The four most recent turns are always preserved at full resolution — they're your immediate narrative context.

This happens transparently. You don't manage context. You don't choose what to cut. The system compresses so that the model always sees the most relevant version of the story so far.
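The mid-scene rule can be sketched in a few lines. The 6,000-token threshold and the four preserved turns come from the description above; everything else (the `compress_scene` name, the running `recap` string, the caller-supplied `summarize` function, and the character-based token estimate) is an assumption made for illustration.

```python
from typing import Callable

COMPRESS_THRESHOLD_TOKENS = 6_000  # threshold stated in the post
KEEP_RECENT_TURNS = 4              # most recent turns kept at full resolution


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: assume about four characters per token.
    return max(1, len(text) // 4)


def compress_scene(
    recap: str,
    turns: list[str],
    summarize: Callable[[str, list[str]], str],
) -> tuple[str, list[str]]:
    """Fold the oldest turns into the recap once raw history passes the threshold."""
    raw_tokens = sum(estimate_tokens(t) for t in turns)
    if raw_tokens <= COMPRESS_THRESHOLD_TOKENS or len(turns) <= KEEP_RECENT_TURNS:
        return recap, turns  # nothing to compress yet
    old = turns[:-KEEP_RECENT_TURNS]
    recent = turns[-KEEP_RECENT_TURNS:]
    # The summarizer receives the existing recap so earlier detail carries forward.
    return summarize(recap, old), recent
```

In this sketch `summarize` would be a call to the summary model. Passing the previous recap back in is what keeps repeated compressions from discarding what earlier passes preserved.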

The math on a 30-scene story

Say you write a 30-scene story. Each scene averages 8K tokens of full prose. Without compression, that's 240K tokens — exceeding Claude's effective attention range (the model technically has a 1M window, but attention quality degrades well before that).

With scene compression, the current scene is ~8K at full resolution. The 29 previous scene summaries are maybe 300-500 tokens each, so ~12K total. Character cards and world context add another 2-4K. Total context: roughly 25K tokens. The model is working with a focused, relevant summary of the entire story, plus full detail on what's happening right now.
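The arithmetic is easy to check. This snippet uses only the figures quoted above and computes the low and high ends of the range, without settling on a midpoint; the variable names are illustrative.

```python
SCENES = 30
TOKENS_PER_SCENE = 8_000
SUMMARY_TOKENS = (300, 500)        # per ended scene: low and high estimate
CARDS_AND_WORLD = (2_000, 4_000)   # character cards plus world context

# Without compression, every scene stays in context at full length.
uncompressed = SCENES * TOKENS_PER_SCENE

# With compression: one full scene, 29 summaries, and the fixed material.
previous = SCENES - 1
compressed_low = TOKENS_PER_SCENE + previous * SUMMARY_TOKENS[0] + CARDS_AND_WORLD[0]
compressed_high = TOKENS_PER_SCENE + previous * SUMMARY_TOKENS[1] + CARDS_AND_WORLD[1]

print(f"uncompressed: {uncompressed:,} tokens")                             # 240,000
print(f"compressed:   {compressed_low:,} to {compressed_high:,} tokens")   # 18,700 to 26,500
```

The "roughly 25K" figure sits toward the upper end of that range, and even the high end is about a ninth of the uncompressed total.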

Characters maintain consistent personalities across chapters. Plot threads planted in scene two can pay off in scene twenty-eight. The story builds rather than drifts.

What you lose

Compression is lossy. No summary perfectly captures every nuance. A throwaway gesture, a minor aside you intended to be significant later — these can be lost if they weren't prominent in the scene. If a detail matters to you, make it matter in the text, and it'll survive summarization.

There's also a brief pause at scene boundaries while the summary generates. In practice, it feels like a natural chapter break. Barely noticeable.

These tradeoffs are real. They're dramatically better than the alternative, which is your story quietly losing coherence while the context window fills up and old content vanishes without a trace.

New accounts start with 500 free credits.


Frequently asked questions

How do AI models handle long conversations?

Most platforms silently truncate the oldest messages when the context fills up. The AI gradually forgets earlier content, leading to contradictions and lost plot threads. Scene-based compression preserves older content as summaries instead of dropping it.

What is a context window?

The maximum text an AI model can process at once, measured in tokens (word fragments; one token is roughly three-quarters of an English word). Claude and Gemini offer ~1M tokens; GPT-4 has 128K. But attention quality degrades before the limit, so bigger windows help but don't solve the problem alone.

Why does AI forget earlier conversation?

AI models have no persistent memory. They only see text currently in the context window. When a conversation overflows, most platforms truncate. The AI literally can't see what happened earlier.

How does Underfiction handle long stories?

Scene-based architecture. Each scene has its own context. Ended scenes are summarized (character states, relationship shifts, unresolved tensions) and the summary carries forward. Long scenes auto-compress in the background. Stories run 50+ scenes without losing coherence.