Skip to main content

On This Page

Implementing Microsoft’s OpenMementos: Trace Analysis and Context Compression for LLMs

3 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

Microsoft’s OpenMementos dataset structures reasoning traces using blocks and memento summaries to optimize long-form reasoning. This implementation demonstrates how to achieve the ~6× trace-level token compression reported in the original research paper.

Why This Matters

While ideal LLM reasoning models require extensive context to maintain logic, the technical reality of inference costs and context window limits necessitates aggressive compression. OpenMementos addresses this by pairing detailed reasoning blocks with concise summaries, allowing models to retain essential logical steps without the overhead of full raw traces, which is critical for scaling long-form reasoning tasks across math, code, and science domains.

Key Insights

  • The OpenMementos dataset utilizes a special-token schema including <|block_start|> and <|summary_start|> to segment long-form reasoning.
  • Trace-level token compression of ~6× is achievable by representing historical reasoning blocks as condensed mementos.
  • Inference-time context reduction is simulated by retaining only the last K blocks while compressing previous steps into summaries.
  • Data preparation for Supervised Fine-Tuning (SFT) involves mapping raw streamed rows into structured message formats using Hugging Face Datasets.
  • Qualitative analysis of reasoning organization differs significantly between domains like math and code, requiring per-domain median tracking.

Working Examples

Regex-based parser for extracting reasoning blocks and memento summaries from OpenMementos responses.

import re
BLOCK_RE = re.compile(r"<|block_start|>(.*?)<|block_end|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<|summary_start|>(.*?)<|summary_end|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str):
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think_m = THINK_RE.search(response)
    final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries, "reasoning": (think_m.group(1) if think_m else ""), "final_answer": final_ans}

Simulation of inference-time compression by replacing older reasoning blocks with summaries.

def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries): return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

Practical Applications

  • Context Window Optimization: Systems can replace historical reasoning chains with mementos to fit more complex problems into fixed context limits.
  • SFT Preparation Pitfall: Failing to align block and summary counts during parsing can lead to broken reasoning traces in training data.
  • Domain-Specific Analysis: Using median character and word ratios to adjust summarization density for different reasoning tasks like coding vs. scientific inquiry.

References:

Continue reading

Next article

GitNexus: The Open-Source Knowledge Graph Engine for MCP-Native AI Coding

Related Content