Implementing Microsoft’s OpenMementos: Trace Analysis and Context Compression for LLMs
These articles are AI-generated summaries. Please check the original sources for full details.
A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation
Microsoft’s OpenMementos dataset structures reasoning traces using blocks and memento summaries to optimize long-form reasoning. This implementation demonstrates how to achieve the ~6× trace-level token compression reported in the original research paper.
Why This Matters
While ideal LLM reasoning models require extensive context to maintain logic, the technical reality of inference costs and context window limits necessitates aggressive compression. OpenMementos addresses this by pairing detailed reasoning blocks with concise summaries, allowing models to retain essential logical steps without the overhead of full raw traces, which is critical for scaling long-form reasoning tasks across math, code, and science domains.
Key Insights
- The OpenMementos dataset utilizes a special-token schema including <|block_start|> and <|summary_start|> to segment long-form reasoning.
- Trace-level token compression of ~6× is achievable by representing historical reasoning blocks as condensed mementos.
- Inference-time context reduction is simulated by retaining only the last K blocks while compressing previous steps into summaries.
- Data preparation for Supervised Fine-Tuning (SFT) involves mapping raw streamed rows into structured message formats using Hugging Face Datasets.
- Qualitative analysis of reasoning organization differs significantly between domains like math and code, requiring per-domain median tracking.
Working Examples
Regex-based parser for extracting reasoning blocks and memento summaries from OpenMementos responses.
import re
BLOCK_RE = re.compile(r"<|block_start|>(.*?)<|block_end|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<|summary_start|>(.*?)<|summary_end|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
def parse_memento(response: str):
blocks = [m.strip() for m in BLOCK_RE.findall(response)]
summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
think_m = THINK_RE.search(response)
final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
return {"blocks": blocks, "summaries": summaries, "reasoning": (think_m.group(1) if think_m else ""), "final_answer": final_ans}
Simulation of inference-time compression by replacing older reasoning blocks with summaries.
def compress_trace(response: str, keep_last_k: int = 1) -> str:
blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
if not blocks or len(blocks) != len(summaries): return response
out, n = ["<think>"], len(blocks)
for i, (b, s) in enumerate(zip(blocks, summaries)):
if i >= n - keep_last_k:
out.append(f"<|block_start|>{b}<|block_end|>")
out.append(f"<|summary_start|>{s}<|summary_end|>")
else:
out.append(f"<|summary_start|>{s}<|summary_end|>")
out.append("</think>")
out.append(response.split("</think>")[-1])
return "\n".join(out)
Practical Applications
- Context Window Optimization: Systems can replace historical reasoning chains with mementos to fit more complex problems into fixed context limits.
- SFT Preparation Pitfall: Failing to align block and summary counts during parsing can lead to broken reasoning traces in training data.
- Domain-Specific Analysis: Using median character and word ratios to adjust summarization density for different reasoning tasks like coding vs. scientific inquiry.
References:
Continue reading
Next article
GitNexus: The Open-Source Knowledge Graph Engine for MCP-Native AI Coding
Related Content
Mastering OpenMythos: Implementing Recurrent-Depth Transformers with MLA and MoE
OpenMythos enables deeper reasoning via recurrent computation, allowing Multi-Head Latent Attention (MLA) to achieve significantly smaller KV-cache footprints than GQA.
Implementing Semantic Discussion Clustering Using TF-IDF Instead of Vector Embeddings
Developer Mervin builds a cost-effective discussion monitor using TF-IDF and cosine similarity to avoid expensive OpenAI embedding and vector database costs.
Optimizing LLM Inference: How TurboQuant Achieves 6x KV Cache Compression
TurboQuant achieves a 6x reduction in KV cache memory, shrinking a 1GB context to 150MB to enable higher concurrency and longer context windows for LLMs.