Implementing End-to-End Markdown Support in a Layered RAG Stack
Adding Markdown Support End-to-End (Part 7)
Engineer Josh Blair implemented Markdown support within the Sift RAG stack. The process required synchronized updates across C# APIs, Python extraction handlers, React frontends, and database constraints.
Why This Matters
Adding a file type seems trivial, but a layered architecture creates multiple independent points of failure. A mismatch in any single layer breaks the upload: for example, if the Content-Type header sent by the frontend differs from the one the S3 presigned URL was signed with, S3 returns a 403 Forbidden error. Reliability therefore depends on keeping the contract synchronized across every layer rather than on browser hints.
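One way to make the contract synchronization concrete is a small drift check that compares the set of accepted extensions in each layer. The following is a minimal sketch, not part of the original implementation: the function name `find_contract_drift`, the layer names, and the hard-coded sets are all hypothetical, and in practice each set would be read from the layer it describes.

```python
def find_contract_drift(layers: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per layer, the extensions that another layer accepts but this one lacks."""
    # The union of every layer's extensions is the contract all layers should share.
    all_exts = set().union(*layers.values())
    return {
        name: all_exts - exts
        for name, exts in layers.items()
        if all_exts - exts
    }


# Hypothetical snapshot: every layer was updated except the database constraint.
layers = {
    "api": {"pdf", "docx", "csv", "txt", "md"},
    "extractor": {"pdf", "docx", "csv", "txt", "md"},
    "frontend": {"pdf", "docx", "csv", "txt", "md"},
    "db_check": {"pdf", "docx", "csv", "txt"},
}
drift = find_contract_drift(layers)  # reports that "db_check" is missing "md"
```

An empty result means all layers agree; any non-empty entry names the layer that still needs updating.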
Key Insights
- Browser MIME types are unreliable; Chrome may report an empty string for .md files, necessitating extension-based mapping for S3 PUT requests (Blair, 2026).
- Layered validation prevents silent failures by enforcing constraints at the API (extension check), S3 (Content-Type match), and Database (CHECK constraint) levels.
- Preserving original file extensions like .md over generic .txt enables future targeted optimizations, such as splitting chunks on heading boundaries rather than character windows.
- Database schema migrations for CHECK constraints on Aurora Serverless v2 can be executed nearly instantly via the RDS Data API using boto3.
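The heading-boundary chunking mentioned above is described only as a future optimization, so the following is an illustrative sketch rather than the Sift implementation. It assumes ATX-style headings (`#` through `######`), and the name `split_on_headings` is hypothetical.

```python
import re

# Matches ATX headings such as "# Title" or "### Details" at the start of a line.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_on_headings(markdown_text: str) -> list[str]:
    """Split Markdown into chunks that each begin at a heading.

    Any text before the first heading becomes its own chunk.
    """
    starts = [m.start() for m in HEADING_RE.finditer(markdown_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown_text)]
    chunks = [markdown_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Drop empty chunks (for example, when the input is empty or whitespace only).
    return [c for c in chunks if c]
```

This sketch does not track fenced code blocks, so a `# comment` line inside a code sample would be treated as a heading; a production splitter would need to handle that case.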
Working Examples
C# service layer mapping extensions to S3 content types.
private static readonly Dictionary<string, string> ContentTypes = new()
{
    ["pdf"] = "application/pdf",
    ["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ["csv"] = "text/csv",
    ["txt"] = "text/plain",
    ["md"] = "text/markdown", // added
};
Python extraction handler dispatching logic for Markdown and text files.
if ext == "pdf":
    text, page_count = _extract_pdf(content)
elif ext == "docx":
    text, page_count = _extract_docx(content)
elif ext == "csv":
    text, page_count = _extract_csv(content)
elif ext in ("txt", "md"):
    # Plain text and Markdown need no parsing; undecodable bytes become U+FFFD.
    text = content.decode("utf-8", errors="replace")
    page_count = 1
else:
    raise ValueError(f"Unsupported file type: {ext}")
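The `errors="replace"` argument in the text branch means a file that is not valid UTF-8 does not raise an exception; each invalid byte sequence is replaced with U+FFFD instead. A minimal sketch of that behavior follows, where the helper name `decode_text` and the sample inputs are hypothetical.

```python
def decode_text(content: bytes) -> str:
    """Decode uploaded bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return content.decode("utf-8", errors="replace")


# A Markdown heading encoded two ways: valid UTF-8 and Latin-1 (not valid UTF-8).
utf8_bytes = "# Título\n".encode("utf-8")
latin1_bytes = "# Título\n".encode("latin-1")

clean = decode_text(utf8_bytes)      # round-trips unchanged
damaged = decode_text(latin1_bytes)  # the accented character becomes U+FFFD
```

The trade-off is that extraction never fails on a mis-encoded upload, but the damage is silent unless something downstream checks for the replacement character.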
Frontend fix to derive MIME type from filename instead of trusting browser file.type.
const MIME_MAP: Record<string, string> = {
  pdf: "application/pdf",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  csv: "text/csv",
  txt: "text/plain",
  md: "text/markdown",
};

function getMimeType(filename: string): string {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return MIME_MAP[ext] ?? "application/octet-stream";
}

await axios.put(uploadUrl, file, {
  headers: { "Content-Type": getMimeType(file.name) },
});
SQL migration to update the database CHECK constraint for supported file types.
ALTER TABLE documents
    DROP CONSTRAINT documents_file_type_check,
    ADD CONSTRAINT documents_file_type_check
        CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'));
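The summary notes that this migration can be run through the RDS Data API using boto3 but does not show the call. The sketch below uses the `execute_statement` operation of the boto3 `rds-data` client. The function names and the ARN and database arguments are hypothetical placeholders, and the client is passed in so the function can be exercised without AWS access.

```python
MIGRATION_SQL = """
ALTER TABLE documents
    DROP CONSTRAINT documents_file_type_check,
    ADD CONSTRAINT documents_file_type_check
        CHECK (file_type IN ('pdf', 'csv', 'docx', 'txt', 'md'))
"""


def make_client():
    """Create an RDS Data API client (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the module has no hard dependency

    return boto3.client("rds-data")


def run_migration(client, cluster_arn: str, secret_arn: str, database: str) -> dict:
    """Send the ALTER TABLE statement through the Data API and return the response."""
    return client.execute_statement(
        resourceArn=cluster_arn,  # ARN of the Aurora cluster (placeholder)
        secretArn=secret_arn,     # ARN of the Secrets Manager secret (placeholder)
        database=database,
        sql=MIGRATION_SQL,
    )
```

A call would look like `run_migration(make_client(), cluster_arn, secret_arn, database)` with values from your own environment. Dropping and re-adding the constraint in one `ALTER TABLE` statement keeps the change to a single Data API call.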
Practical Applications
- Use Case: Sift RAG system uses specific ‘.md’ labels in the database to allow for future heading-aware chunking strategies. Pitfall: Renaming all text formats to ‘.txt’ at upload time leads to loss of structural information and prevents targeted embedding improvements.
- Use Case: AWS S3 Presigned URLs require exact Content-Type header matches. Pitfall: Relying on browser file.type results in intermittent 403 errors due to inconsistent OS reporting of Markdown MIME types.
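To illustrate why the Content-Type must match exactly, the sketch below signs a presigned PUT URL with a content type derived from the file extension and returns that same value for the uploader to send. The original service layer is written in C#, so this Python version is an illustration only. It assumes `s3_client` is a boto3 S3 client, and the names `content_type_for` and `presign_upload` are hypothetical.

```python
CONTENT_TYPES = {
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "csv": "text/csv",
    "txt": "text/plain",
    "md": "text/markdown",
}


def content_type_for(filename: str) -> str:
    """Derive the content type from the extension, never from a browser hint."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return CONTENT_TYPES.get(ext, "application/octet-stream")


def presign_upload(s3_client, bucket: str, key: str, filename: str,
                   expires_in: int = 300) -> dict:
    """Return a presigned PUT URL plus the Content-Type the uploader must send."""
    content_type = content_type_for(filename)
    url = s3_client.generate_presigned_url(
        "put_object",
        # ContentType becomes part of the signed request, so the PUT must send
        # exactly this header value or S3 rejects it.
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_in,
    )
    return {"url": url, "content_type": content_type}
```

Returning the content type alongside the URL is one way to keep both sides in step: the frontend can send the header it was given instead of computing its own.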