Implementing End-to-End Markdown Support in a Layered RAG Stack

Adding Markdown Support End-to-End (Part 7)

Engineer Josh Blair implemented Markdown support within the Sift RAG stack. The process required synchronized updates across C# APIs, Python extraction handlers, React frontends, and database constraints.

Why This Matters

Adding a file type seems trivial, but in a layered architecture every layer is an independent point of failure. A mismatch in any single layer, such as a Content-Type header that differs between the frontend upload and the S3 presigned URL, results in a 403 Forbidden error. Reliability therefore depends on keeping the contracts between layers strictly in sync, not on browser hints.

Key Insights

- Browser MIME types are unreliable; Chrome may report an empty string for .md files, necessitating extension-based mapping for S3 PUT requests (Blair, 2026).
- Layered validation prevents silent failures by enforcing constraints at the API (extension check), S3 (Content-Type match), and Database (CHECK constraint) levels.
- Preserving original file extensions like .md over generic .txt enables future targeted optimizations, such as splitting chunks on heading boundaries rather than character windows.
- Database schema migrations for CHECK constraints on Aurora Serverless v2 can be executed nearly instantly via the RDS Data API using boto3.
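
To make the heading-boundary idea concrete, here is a minimal sketch of what such a splitter could look like. It is not taken from the Sift codebase: the function name `split_on_headings` and the regular expression are assumptions for illustration, and the sketch deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style headings ("# Title" through "###### Title") at line start.
# Assumption: a heading is 1-6 '#' characters, then spaces/tabs, then text.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into chunks that each start at a heading.

    Hypothetical helper: text before the first heading becomes its own chunk.
    """
    starts = [m.start() for m in HEADING_RE.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]  # drop empty chunks
```

A splitter like this is only possible if the pipeline still knows the file was Markdown, which is the reason for keeping the original extension instead of collapsing everything to .txt.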

Working Examples

C# service layer mapping extensions to S3 content types.

private static readonly Dictionary<string, string> ContentTypes = new()
{
    ["pdf"] = "application/pdf",
    ["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ["csv"] = "text/csv",
    ["txt"] = "text/plain",
    ["md"] = "text/markdown", // added
};

Python extraction handler dispatching logic for Markdown and text files.

if ext == "pdf":
    text, page_count = _extract_pdf(content)
elif ext == "docx":
    text, page_count = _extract_docx(content)
elif ext == "csv":
    text, page_count = _extract_csv(content)
elif ext in ("txt", "md"):
    text = content.decode("utf-8", errors="replace")
    page_count = 1
else:
    raise ValueError(f"Unsupported file type: {ext}")
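
The text branch above can be exercised on its own. The following is a small sketch that isolates it in a hypothetical function, `extract_text_file` (not a name from the original handler), to show what `errors="replace"` does: bytes that are not valid UTF-8 become the Unicode replacement character instead of raising an exception.

```python
def extract_text_file(content: bytes, ext: str) -> tuple[str, int]:
    """Decode a .txt or .md upload; hypothetical stand-alone version of the branch."""
    if ext not in ("txt", "md"):
        raise ValueError(f"Unsupported file type: {ext}")
    # Invalid UTF-8 bytes are replaced with U+FFFD rather than aborting extraction.
    text = content.decode("utf-8", errors="replace")
    page_count = 1  # plain-text formats are treated as a single page
    return text, page_count
```

The trade-off is that a badly encoded file still produces searchable text, at the cost of a few replacement characters in the output.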

Frontend fix to derive MIME type from filename instead of trusting browser file.type.

const MIME_MAP: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  csv: "text/csv",
  txt: "text/plain",
  md: "text/markdown",
};

function getMimeType(filename: string): string {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return MIME_MAP[ext] ?? "application/octet-stream";
}

await axios.put(uploadUrl, file, {
  headers: { "Content-Type": getMimeType(file.name) },
});

SQL migration to update the database CHECK constraint for supported file types.

ALTER TABLE documents
    DROP CONSTRAINT documents_file_type_check,
    ADD CONSTRAINT documents_file_type_check
    CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'));
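
One way to send this statement through the RDS Data API with boto3 is sketched below. The helper names `run_migration` and `make_client` are hypothetical, and the ARNs and database name are placeholders the caller must supply. The sketch also assumes the Data API is enabled on the cluster. The `rds-data` client and its `execute_statement` parameters (`resourceArn`, `secretArn`, `database`, `sql`) are part of boto3.

```python
# The migration from the article, sent as a single statement.
MIGRATION_SQL = """
ALTER TABLE documents
    DROP CONSTRAINT documents_file_type_check,
    ADD CONSTRAINT documents_file_type_check
    CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'))
"""


def run_migration(client, cluster_arn: str, secret_arn: str, database: str) -> dict:
    """Execute the ALTER TABLE via an rds-data client (hypothetical helper)."""
    return client.execute_statement(
        resourceArn=cluster_arn,  # ARN of the Aurora cluster
        secretArn=secret_arn,     # Secrets Manager ARN holding DB credentials
        database=database,
        sql=MIGRATION_SQL,
    )


def make_client():
    """Create the real client; imported lazily so the sketch loads without boto3."""
    import boto3

    return boto3.client("rds-data")
```

A call would look like `run_migration(make_client(), cluster_arn, secret_arn, database)`, with all three arguments taken from your own environment.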

Practical Applications

- Use Case: The Sift RAG system stores the specific ‘.md’ label in the database so that heading-aware chunking strategies can be added later. Pitfall: Relabeling all text formats as ‘.txt’ at upload time discards structural information and prevents targeted embedding improvements.
- Use Case: When an AWS S3 presigned URL is signed with a Content-Type, the upload request must send exactly the same Content-Type header. Pitfall: Relying on the browser's file.type results in intermittent 403 errors, because browsers and operating systems report Markdown MIME types inconsistently.
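
The Content-Type contract can be illustrated from the signing side. The article's service layer is C#; the Python sketch below is an assumption-laden equivalent, not the author's code. The names `content_type_for` and `presign_upload` are hypothetical. `generate_presigned_url` with `Params` and `ExpiresIn` is the real boto3 S3 client method.

```python
# Same extension-to-MIME mapping as the C# and frontend layers.
CONTENT_TYPES = {
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "csv": "text/csv",
    "txt": "text/plain",
    "md": "text/markdown",
}


def content_type_for(filename: str) -> str:
    """Derive the MIME type from the extension, never from browser hints."""
    _, dot, ext = filename.rpartition(".")
    ext = ext.lower() if dot else ""
    if ext not in CONTENT_TYPES:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return CONTENT_TYPES[ext]


def presign_upload(s3_client, bucket: str, key: str, filename: str,
                   expires_in: int = 300) -> tuple[str, str]:
    """Return (url, content_type); the uploader must send that exact header."""
    content_type = content_type_for(filename)
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_in,
    )
    return url, content_type
```

Returning the content type alongside the URL is one possible design choice: the frontend can echo it verbatim in the PUT request, so both sides of the contract come from a single lookup.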

References:

Blair, J. (2026). Adding Markdown Support End-to-End (Part 7). https://dev.to/josh_blair/adding-markdown-support-end-to-end-part-7-24g1
