Implementing Microsoft’s OpenMementos: Trace Analysis and Context Compression for LLMs

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

Microsoft’s OpenMementos dataset structures reasoning traces using blocks and memento summaries to optimize long-form reasoning. This implementation demonstrates how to achieve the ~6× trace-level token compression reported in the original research paper.

Why This Matters

While ideal LLM reasoning models require extensive context to maintain logic, the technical reality of inference costs and context window limits necessitates aggressive compression. OpenMementos addresses this by pairing detailed reasoning blocks with concise summaries, allowing models to retain essential logical steps without the overhead of full raw traces, which is critical for scaling long-form reasoning tasks across math, code, and science domains.

Key Insights

The OpenMementos dataset utilizes a special-token schema including <|block_start|> and <|summary_start|> to segment long-form reasoning.
Trace-level token compression of ~6× is achievable by representing historical reasoning blocks as condensed mementos.
Inference-time context reduction is simulated by retaining only the last K blocks while compressing previous steps into summaries.
Data preparation for Supervised Fine-Tuning (SFT) involves mapping raw streamed rows into structured message formats using Hugging Face Datasets.
Qualitative analysis of reasoning organization differs significantly between domains like math and code, requiring per-domain median tracking.

Working Examples

Regex-based parser for extracting reasoning blocks and memento summaries from OpenMementos responses.

import re
BLOCK_RE = re.compile(r"<|block_start|>(.*?)<|block_end|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<|summary_start|>(.*?)<|summary_end|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str):
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think_m = THINK_RE.search(response)
    final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries, "reasoning": (think_m.group(1) if think_m else ""), "final_answer": final_ans}

Simulation of inference-time compression by replacing older reasoning blocks with summaries.

def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries): return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

Practical Applications

Context Window Optimization: Systems can replace historical reasoning chains with mementos to fit more complex problems into fixed context limits.
SFT Preparation Pitfall: Failing to align block and summary counts during parsing can lead to broken reasoning traces in training data.
Domain-Specific Analysis: Using median character and word ratios to adjust summarization density for different reasoning tasks like coding vs. scientific inquiry.

References:

https://www.marktechpost.com/2026/04/24/a-coding-implementation-on-microsofts-openmementos-with-trace-structure-analysis-context-compression-and-fine-tuning-data-preparation/

On This Page

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

Unified Access to 50+ Chinese LLMs via OpenAI-Compatible API

Mastering OpenMythos: Implementing Recurrent-Depth Transformers with MLA and MoE

Optimizing LLM Inference: How TurboQuant Achieves 6x KV Cache Compression