Pushing Large Files to GitHub: A Technical Deep Dive (For Educational Purposes)
This article is for educational purposes only. Using GitHub/GitLab as a general-purpose file storage backend may violate their Acceptable Use Policies. The techniques described here should only be used in self-hosted Git instances within organizations that explicitly permit such usage. The author and publisher are not responsible for any account restrictions or service disruptions. Always review and comply with your Git provider's Terms of Service.
You’ve just trained a machine learning model, exported a massive dataset, or compiled a binary that’s north of 100MB. Now you want to push it to GitHub. The push is rejected with an error about file size limits. Your first instinct might be to find a workaround.
Stop. This article will show you how, but also why you shouldn’t.
Why Would Anyone Want to Store Large Files in Git?
The use case is surprisingly common:
- ML practitioners who want to version models alongside code
- Data scientists sharing datasets with collaborators
- Game developers with asset files that exceed 100MB
- Researchers distributing large binary outputs
The appeal is obvious: keep everything in one repository, use the same PR workflow, and leverage GitHub’s web interface for access control.
The Problem: Git Wasn’t Built for This
Git is a version control system, not a storage engine. Here’s why large files are fundamentally incompatible with Git’s design:
1. Every Clone Downloads the Entire History
When you git clone a repository, you download every version of every file by default. A 50MB model file modified 10 times can take up to roughly 500MB in your .git folder, because successive binary versions rarely compress well against each other. For collaborators with slow connections, this is a nightmare.
2. Git’s Delta Compression Fails on Binary Files
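Git stores every version of a file as a complete blob and only later delta-compresses similar blobs when it packs them. This works beautifully for text but poorly for compressed or serialized binaries (model checkpoints, archives, media), where a small logical change can rewrite most of the bytes. In practice, each new version of a 100MB binary adds close to 100MB to the repository.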
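To see why a tiny change costs a whole new object, it helps to look at how Git names blobs. The sketch below reproduces the blob object ID used in SHA-1 repositories (SHA-1 over a "blob <size>" header, a NUL byte, and the content) in plain Python. The helper name git_blob_id is chosen here for illustration, and the example covers object identity only, not how packfiles later try to delta-compress similar blobs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# A stand-in for a large binary (1 MiB here to keep the demo fast)
original = bytes(1024 * 1024)
modified = b"\x01" + original[1:]  # change a single byte

# Different IDs mean Git treats these as two unrelated, full-size objects
print(git_blob_id(original) == git_blob_id(modified))  # False
```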
3. GitHub’s Hard Limits
GitHub enforces strict limits:
- 50MB warning - GitHub warns but accepts the push
- 100MB rejection - GitHub blocks the push entirely
- Repository size - Repositories over 5GB trigger warnings, 100GB+ risk account restrictions
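A quick local check can catch oversized files before GitHub does. The sketch below walks a directory and groups files by the two per-file thresholds listed above. It assumes the limits are measured in MiB, and classify_size and scan are illustrative names, not part of any Git tooling.

```python
from pathlib import Path
from typing import Dict, List

WARN_BYTES = 50 * 1024 * 1024     # assumed warning threshold
REJECT_BYTES = 100 * 1024 * 1024  # assumed hard limit


def classify_size(size_bytes: int) -> str:
    """Return 'ok', 'warn', or 'reject' for a single file size."""
    if size_bytes > REJECT_BYTES:
        return "reject"
    if size_bytes > WARN_BYTES:
        return "warn"
    return "ok"


def scan(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by status, skipping Git's own metadata."""
    report: Dict[str, List[Path]] = {"ok": [], "warn": [], "reject": []}
    for path in root.rglob("*"):
        if path.is_file() and ".git" not in path.parts:
            report[classify_size(path.stat().st_size)].append(path)
    return report
```

Calling scan(Path(".")) from the repository root and printing the "warn" and "reject" lists gives a rough preview of what a push would trip over.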
What You Should Actually Use
Before we dive into the “how,” here are the proper solutions:
Git LFS (Large File Storage)
# One-time setup: enable Git LFS hooks (requires the git-lfs package to be installed)
git lfs install
# Track large files
git lfs track "*.pkl"
git lfs track "*.h5"
git lfs track "models/*"
# Add and commit normally
git add .gitattributes
git commit -m "Track model files with LFS"
Pros:
- Transparent workflow (feels like regular Git)
- First 1GB of storage is free
- Designed for this exact use case
Cons:
- Costs money beyond free tier ($5/month per 50GB)
- Requires Git LFS installation on all machines
Cloud Storage + Metadata
# Store in S3/GCS, track URL in Git
import boto3
# Upload to S3
s3 = boto3.client('s3')
s3.upload_file('model.pkl', 'my-bucket', 'models/v1.pkl')
# Store metadata in Git
metadata = {
    "model_version": "v1",
    "s3_uri": "s3://my-bucket/models/v1.pkl",
    "sha256": "abc123...",
    "size_mb": 250
}
This is what production systems actually use.
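The snippet above leaves the hash as a placeholder and never writes the metadata to disk. Here is a minimal sketch of the Git-tracked half, assuming the upload itself has already happened: write_pointer is a hypothetical helper that hashes the local file and writes a small JSON pointer you can commit in place of the binary. The ".meta.json" suffix is an arbitrary choice.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def write_pointer(local_file: Path, s3_uri: str, version: str) -> Path:
    """Write <name>.meta.json next to the file; commit this instead of the binary."""
    metadata = {
        "model_version": version,
        "s3_uri": s3_uri,
        "sha256": sha256_of(local_file),
        "size_bytes": local_file.stat().st_size,
    }
    pointer = local_file.with_name(local_file.name + ".meta.json")
    pointer.write_text(json.dumps(metadata, indent=2) + "\n")
    return pointer
```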
Pros:
- Built for ML workflows
- Works with any cloud storage
- Versioning built-in
Cons:
- Another tool to learn
- Requires cloud storage setup
The Educational Hack: Chunking + GitHub API
⚠️ WARNING: This is for educational purposes only. Using GitHub as a general-purpose storage backend violates GitHub’s Acceptable Use Policy. Abuse can result in account suspension.
That said, understanding how to work around Git’s limitations teaches valuable lessons about API design, chunking strategies, and distributed systems.
Requirements
Before you start, you need:
- GitHub Personal Access Token with repo scope
  - Go to Settings → Developer settings → Personal access tokens → Tokens (classic)
  - Generate new token with repo permissions (full control of private repositories)
  - Save it securely (you’ll never see it again)
- A GitHub repository where you have write access
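Rather than pasting the token into source code, it is safer to read it from the environment. This is a small sketch, assuming the token is exported as GITHUB_TOKEN (a common convention, not a requirement). build_headers is an illustrative helper that produces the same request headers the uploader in this article sends.

```python
import os
from typing import Dict, Mapping


def build_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build GitHub API headers from an environment-style mapping."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set GITHUB_TOKEN before running the uploader")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
```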
The Strategy: Chunk, Upload, Reassemble
Here’s the workflow in plain terms:
Saving a large file:
- Take your 150MB file
- Split it into 20MB pieces (chunk_000, chunk_001, chunk_002, etc.)
- Upload each chunk to GitHub one at a time using the API
- Create a manifest.json file that lists where each chunk is stored and in what order
- Upload the manifest to GitHub
Getting the file back:
- Download the manifest.json
- Download all the chunks
- Put the chunks back together in the right order
- Verify the file matches the original using a hash check
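Both halves of the workflow can be tried locally before any network code is involved. The sketch below splits bytes into fixed-size chunks, records a manifest, and reassembles with size and hash checks. The manifest fields mirror the ones used in this article, the function names are illustrative, and the tiny chunk size is only for demonstration.

```python
import hashlib
from typing import Dict, List


def split_bytes(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def make_manifest(name: str, data: bytes, chunks: List[bytes]) -> Dict:
    return {
        "file_name": name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "chunk_count": len(chunks),
    }


def reassemble(chunks: List[bytes], manifest: Dict) -> bytes:
    """Join chunks in order and verify size and hash against the manifest."""
    data = b"".join(chunks)
    if len(data) != manifest["size_bytes"]:
        raise ValueError("size mismatch")
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("hash mismatch")
    return data


payload = bytes(range(256)) * 10     # 2,560 bytes of sample data
chunks = split_bytes(payload, 1000)  # pieces of 1000, 1000, and 560 bytes
manifest = make_manifest("sample.bin", payload, chunks)
assert reassemble(chunks, manifest) == payload
print(manifest["chunk_count"])  # 3
```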
Implementation: The Uploader
import os
import base64
import hashlib
import json
import requests
from pathlib import Path
from typing import List, Dict


class GitHubChunkedUploader:
    """
    Uploads large files to GitHub by splitting into chunks.

    WARNING: Educational purposes only. Not for production use.
    May violate GitHub's Acceptable Use Policy if abused.
    """

    CHUNK_SIZE = 20 * 1024 * 1024  # 20MB chunks

    def __init__(self, token: str, repo: str, branch: str = "main"):
        """
        Initialize uploader.

        Args:
            token: GitHub personal access token (repo scope)
            repo: Repository in format "username/repo"
            branch: Target branch (default: main)
        """
        self.token = token
        self.repo = repo
        self.branch = branch
        self.base_url = f"https://api.github.com/repos/{repo}/contents"
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28"
        }

    def calculate_sha256(self, filepath: Path) -> str:
        """Calculate SHA-256 hash of file for integrity verification."""
        sha256_hash = hashlib.sha256()
        with open(filepath, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()

    def split_file(self, filepath: Path) -> List[bytes]:
        """
        Split file into chunks.

        Returns:
            List of byte chunks, each ≤ CHUNK_SIZE
        """
        chunks = []
        with open(filepath, "rb") as f:
            while True:
                chunk = f.read(self.CHUNK_SIZE)
                if not chunk:
                    break
                chunks.append(chunk)
        # For simplicity the chunks are kept in memory; for very large files
        # they should be written to a temporary directory instead
        return chunks

    def upload_chunk(
        self,
        chunk_data: bytes,
        remote_path: str,
        commit_message: str
    ) -> str:
        """
        Upload a single chunk to GitHub.

        Args:
            chunk_data: Raw bytes to upload
            remote_path: Path in repository (e.g., "chunks/file_000")
            commit_message: Git commit message

        Returns:
            Download URL for the uploaded chunk
        """
        # Encode chunk as base64 (GitHub API requirement)
        content_encoded = base64.b64encode(chunk_data).decode()
        payload = {
            "message": commit_message,
            "branch": self.branch,
            "content": content_encoded
        }
        url = f"{self.base_url}/{remote_path}"
        response = requests.put(url, headers=self.headers, json=payload)
        if response.status_code not in (200, 201):
            raise Exception(f"Upload failed: {response.status_code} - {response.text}")
        # Return the raw download URL
        return response.json()["content"]["download_url"]

    def upload_large_file(
        self,
        filepath: Path,
        remote_dir: str = "large_files"
    ) -> Dict:
        """
        Upload a large file by chunking.

        Args:
            filepath: Local file path
            remote_dir: Directory in repo to store chunks

        Returns:
            Manifest dict with chunk URLs and metadata
        """
        if not filepath.exists():
            raise FileNotFoundError(f"File not found: {filepath}")
        file_size = filepath.stat().st_size
        file_name = filepath.name
        print(f"Uploading {file_name} ({file_size / (1024**2):.2f} MB)")

        # Check if chunking is needed
        if file_size <= self.CHUNK_SIZE:
            print(" File is small enough, uploading directly...")
            with open(filepath, "rb") as f:
                chunk_data = f.read()
            url = self.upload_chunk(
                chunk_data,
                f"{remote_dir}/{file_name}",
                f"Upload {file_name}"
            )
            return {
                "file_name": file_name,
                "size_bytes": file_size,
                "sha256": self.calculate_sha256(filepath),
                "chunks": [url],
                "chunk_count": 1
            }

        # Split into chunks
        print(" Splitting into chunks...")
        chunks = self.split_file(filepath)
        print(f" Created {len(chunks)} chunks")

        # Upload chunks sequentially (parallel would cause conflicts!)
        chunk_urls = []
        for i, chunk_data in enumerate(chunks):
            chunk_name = f"{file_name}.chunk_{i:03d}"
            print(f" Uploading chunk {i+1}/{len(chunks)} ({len(chunk_data) / (1024**2):.2f} MB)")
            url = self.upload_chunk(
                chunk_data,
                f"{remote_dir}/{file_name}_chunks/{chunk_name}",
                f"Upload {chunk_name}"
            )
            chunk_urls.append(url)

        # Create manifest
        manifest = {
            "file_name": file_name,
            "size_bytes": file_size,
            "sha256": self.calculate_sha256(filepath),
            "chunks": chunk_urls,
            "chunk_count": len(chunks)
        }

        # Upload manifest
        print(" Uploading manifest...")
        manifest_json = json.dumps(manifest, indent=2)
        manifest_url = self.upload_chunk(
            manifest_json.encode(),
            f"{remote_dir}/{file_name}.manifest.json",
            f"Upload manifest for {file_name}"
        )
        manifest["manifest_url"] = manifest_url
        print(f"Upload complete! Manifest: {manifest_url}")
        return manifest


# Example usage
if __name__ == "__main__":
    # Initialize uploader
    uploader = GitHubChunkedUploader(
        token="ghp_your_token_here",  # Replace with your token
        repo="username/repo",  # Replace with your repo
        branch="main"
    )
    # Upload a large file
    manifest = uploader.upload_large_file(Path("large_model.pkl"))
    # Save manifest locally for later retrieval
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
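One cost hidden in upload_chunk is base64 encoding, which turns every 3 bytes of input into 4 characters of output. The arithmetic below estimates the encoded size of a full 20MB chunk. It ignores the small JSON wrapper around the content, and base64_length is an illustrative helper.

```python
import base64
import math

CHUNK_SIZE = 20 * 1024 * 1024  # same value as the uploader's CHUNK_SIZE


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


encoded = base64_length(CHUNK_SIZE)
print(f"{encoded / (1024 ** 2):.2f} MB on the wire")  # 26.67 MB on the wire

# Sanity check against the real encoder on a small input
assert base64_length(1000) == len(base64.b64encode(bytes(1000)))
```

So each 20MB chunk costs roughly a third more in request size than its raw size suggests.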
Implementation: The Downloader
import requests
import hashlib
from pathlib import Path
from typing import Dict


class GitHubChunkedDownloader:
    """
    Downloads and reassembles chunked files from GitHub.

    For better performance, consider using async/await with aiohttp.
    This implementation uses synchronous requests for simplicity.
    """

    @staticmethod
    def download_chunk(url: str) -> bytes:
        """Download a single chunk from GitHub."""
        response = requests.get(url)
        response.raise_for_status()
        return response.content

    @staticmethod
    def verify_sha256(filepath: Path, expected_hash: str) -> bool:
        """Verify downloaded file matches expected SHA-256 hash."""
        sha256_hash = hashlib.sha256()
        with open(filepath, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest() == expected_hash

    def download_large_file(
        self,
        manifest: Dict,
        output_path: Path
    ) -> None:
        """
        Download and reassemble a chunked file.

        Args:
            manifest: Manifest dict from upload (with chunk URLs)
            output_path: Where to save the reassembled file
        """
        file_name = manifest["file_name"]
        chunk_urls = manifest["chunks"]
        expected_size = manifest["size_bytes"]
        expected_hash = manifest["sha256"]
        print(f"Downloading {file_name} ({expected_size / (1024**2):.2f} MB)")
        print(f" {len(chunk_urls)} chunks to download")

        # Download chunks sequentially
        # NOTE: Async would be much faster here (aiohttp + asyncio.gather)
        chunks = []
        for i, url in enumerate(chunk_urls):
            print(f" Downloading chunk {i+1}/{len(chunk_urls)}...")
            chunk_data = self.download_chunk(url)
            chunks.append(chunk_data)

        # Reassemble
        print(" Reassembling file...")
        with open(output_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)

        # Verify integrity
        print(" Verifying integrity...")
        if not self.verify_sha256(output_path, expected_hash):
            raise Exception("SHA-256 hash mismatch! File may be corrupted.")
        actual_size = output_path.stat().st_size
        if actual_size != expected_size:
            raise Exception(f"Size mismatch! Expected {expected_size}, got {actual_size}")
        print(f"Download complete and verified: {output_path}")


# Example usage
if __name__ == "__main__":
    import json
    # Load manifest from upload
    with open("manifest.json", "r") as f:
        manifest = json.load(f)
    # Download and reassemble
    downloader = GitHubChunkedDownloader()
    downloader.download_large_file(manifest, Path("downloaded_model.pkl"))
Why Sequential Uploads, Not Parallel?
You might think: “20MB chunks × 8 threads = 8x faster uploads!” Unfortunately, no.
Git repositories are stateful. Each commit depends on the previous commit’s SHA. When you upload chunk_000, GitHub creates commit abc123. When you upload chunk_001, it needs to reference abc123 as the parent.
If you upload in parallel:
Thread 1: Upload chunk_000 → commit abc123
Thread 2: Upload chunk_001 → expects parent abc123 (not available yet!)
Thread 3: Upload chunk_002 → expects parent def456 (where is it?)
Result: the requests race to move the same branch, and GitHub rejects the losers (typically with a 409 Conflict), leaving you with failed uploads to detect and retry.
Sequential uploads ensure clean history. Boring, but correct.
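The failure mode can be modeled without touching GitHub. The toy Branch class below is a simplified stand-in for a branch ref: a commit is accepted only if it names the current head as its parent, which is roughly the check that makes concurrent writes to one branch lose. It is an illustration under that assumption, not GitHub's implementation.

```python
import hashlib
from typing import List, Optional


class Conflict(Exception):
    pass


class Branch:
    """Toy branch ref: accepts a commit only if its parent is the current head."""

    def __init__(self) -> None:
        self.head: Optional[str] = None
        self.log: List[str] = []

    def commit(self, parent: Optional[str], payload: str) -> str:
        if parent != self.head:
            raise Conflict(f"expected parent {self.head}, got {parent}")
        sha = hashlib.sha1(f"{parent}:{payload}".encode()).hexdigest()[:7]
        self.head = sha
        self.log.append(payload)
        return sha


branch = Branch()

# Sequential: each upload reads the head left by the previous one
for name in ("chunk_000", "chunk_001"):
    branch.commit(branch.head, name)

# "Parallel": two uploads read the same head, then both try to commit
stale_head = branch.head
branch.commit(stale_head, "chunk_002")      # first writer wins
try:
    branch.commit(stale_head, "chunk_003")  # second writer is based on an old head
except Conflict as err:
    print("rejected:", err)

print(branch.log)  # ['chunk_000', 'chunk_001', 'chunk_002']
```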
Async Downloads: The Faster Alternative
Downloading, however, is perfectly parallelizable:
import asyncio
import aiohttp
from pathlib import Path
from typing import List, Dict


async def download_chunk_async(session: aiohttp.ClientSession, url: str) -> bytes:
    """Download a chunk asynchronously."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.read()


async def download_large_file_async(manifest: Dict, output_path: Path):
    ...
    # Same manifest field the synchronous downloader reads
    chunk_urls = manifest["chunks"]
    async with aiohttp.ClientSession() as session:
        # Download all chunks concurrently
        tasks = [download_chunk_async(session, url) for url in chunk_urls]
        chunks = await asyncio.gather(*tasks)
    # Reassemble (order preserved by asyncio.gather)
    print(" Reassembling...")
    with open(output_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    ...


# Usage
if __name__ == "__main__":
    import json
    with open("manifest.json", "r") as f:
        manifest = json.load(f)
    asyncio.run(download_large_file_async(manifest, Path("model.pkl")))
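This requests every chunk concurrently (all 8 for the 150MB example; aiohttp’s default connector allows up to 100 simultaneous connections), saturating your bandwidth. The trade-off is that every chunk is held in memory until reassembly. On a 1Gbps connection, I’ve seen 5-7x speedups compared to sequential downloads.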
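Launching every request at once is fine for a handful of chunks, but for larger files you may want a cap. Here is a sketch of bounded concurrency with asyncio.Semaphore. fake_fetch stands in for the real HTTP call so the example runs offline, gather_bounded is an illustrative name, and the limit of 8 is an arbitrary choice.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrent: int = 8,
) -> List[bytes]:
    """Fetch all URLs with at most max_concurrent in flight; results keep input order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> bytes:
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


in_flight = 0
peak = 0


async def fake_fetch(url: str) -> bytes:
    """Offline stand-in for an HTTP GET: sleeps briefly, returns the URL as bytes."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return url.encode()


urls = [f"chunk_{i:03d}" for i in range(20)]
chunks = asyncio.run(gather_bounded(urls, fake_fetch, max_concurrent=8))
assert peak <= 8  # the semaphore never lets more than 8 fetches overlap
```

To use it with the real downloader, pass a function that wraps download_chunk_async with an open session in place of fake_fetch.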
Scaling Up: Multiple Repositories
The single-repo approach has problems:
- Repository bloat - Hundreds of chunks in one repo is ugly
- GitHub API rate limits - 5,000 requests/hour per token
- Sequential upload bottleneck - Large files take forever
Solution: Shard across multiple repositories.
class MultiRepoUploader:
    """Upload chunks across multiple repositories for parallelism."""

    def __init__(self, token: str, repos: List[str]):
        self.uploaders = [
            GitHubChunkedUploader(token, repo) for repo in repos
        ]

    def upload_large_file_sharded(self, filepath: Path) -> Dict:
        """Split chunks across repositories for parallel uploads."""
        # split_file is defined on GitHubChunkedUploader, so borrow it from one
        chunks = self.uploaders[0].split_file(filepath)
        num_repos = len(self.uploaders)
        # Distribute chunks across repos
        chunk_urls = []
        for i, chunk_data in enumerate(chunks):
            repo_idx = i % num_repos  # Round-robin distribution
            uploader = self.uploaders[repo_idx]
            # Uploads to different repos don't conflict, so this loop can be
            # parallelized with one worker per repo; it is sequential as written
            url = uploader.upload_chunk(chunk_data, f"chunk_{i:03d}", f"Upload chunk {i}")
            chunk_urls.append(url)
        return {"chunks": chunk_urls, "file_name": filepath.name}
Benefits:
- Upload to 4 repos in parallel = 4x throughput
- Distribute load across repositories
Drawbacks:
- API rate limits still apply: they are counted per account, not per repository
- Managing multiple repos is annoying
- Still doesn’t solve the fundamental “Git isn’t storage” problem
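To actually get parallelism out of sharding, each repository needs its own worker that uploads its share of chunks one at a time. This is a sketch using ThreadPoolExecutor, with the upload call injected as a function so the example runs without network access. upload_fn(repo, index, data) is a hypothetical signature, not part of the classes above, and repository names are assumed to be unique.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

UploadFn = Callable[[str, int, bytes], str]  # (repo, chunk index, data) -> URL


def upload_sharded(chunks: List[bytes], repos: List[str], upload_fn: UploadFn) -> List[str]:
    """Upload chunks round-robin: parallel across repos, sequential within each repo."""
    # Group chunk indices by target repo
    plan: Dict[str, List[int]] = {repo: [] for repo in repos}
    for i in range(len(chunks)):
        plan[repos[i % len(repos)]].append(i)

    def worker(repo: str) -> List[Tuple[int, str]]:
        # One thread per repo, so commits to a given branch never overlap
        return [(i, upload_fn(repo, i, chunks[i])) for i in plan[repo]]

    urls: List[str] = [""] * len(chunks)
    with ThreadPoolExecutor(max_workers=len(repos)) as pool:
        for results in pool.map(worker, repos):
            for i, url in results:
                urls[i] = url
    return urls  # in original chunk order


def fake_upload(repo: str, index: int, data: bytes) -> str:
    """Offline stand-in for GitHubChunkedUploader.upload_chunk."""
    return f"https://example.invalid/{repo}/chunk_{index:03d}"


urls = upload_sharded([b"a", b"b", b"c", b"d", b"e"], ["user/r0", "user/r1"], fake_upload)
print(urls[2])  # https://example.invalid/user/r0/chunk_002
```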
Why This Is Still a Terrible Idea
Let’s be brutally honest about why using Git as a storage backend is bad engineering:
1. GitHub Will Notice (and May Ban You)
From GitHub’s Acceptable Use Policy:
“GitHub’s file storage is not intended to be used as a general-purpose file storage platform… Accounts in violation may have access restricted or terminated.”
If you upload hundreds of gigabytes, GitHub’s abuse detection will flag your account. At best, they’ll throttle your API access. At worst, permanent ban.
2. Performance Degrades Over Time
Each chunk upload creates a new commit. After 1,000 uploads, a full git clone downloads 1,000 commits along with every chunk blob they introduced, so cloning and fetching become very slow.
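3. No Meaningful Deduplication
Git only deduplicates byte-identical blobs. Modify one byte? The entire file is stored again. With fixed-size chunks, an insertion near the start of the file shifts every later chunk, so none of them match the previous upload. You’ll burn through GitHub’s storage limits fast.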
4. API Rate Limits Kill You
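GitHub allows 5,000 API requests/hour with authentication. A 1GB file at 20MB per chunk is about 52 chunks, so roughly 53 API calls once you include the manifest. Upload 100 such files and you’ve blown through the hourly budget and must wait for it to reset.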
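The numbers are easy to check. The sketch below counts API calls per file for the chunking scheme in this article (one call per chunk, plus one for the manifest) and how many 1GB files fit in an hourly budget of 5,000 requests. It ignores retries and any secondary rate limits, and api_calls_for is an illustrative helper.

```python
import math

CHUNK_SIZE = 20 * 1024 * 1024   # 20MB chunks, as in the uploader
HOURLY_LIMIT = 5000             # authenticated requests per hour


def api_calls_for(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """One request per chunk, plus one for the manifest when the file is chunked."""
    if file_size <= chunk_size:
        return 1  # small files are uploaded directly, with no manifest
    return math.ceil(file_size / chunk_size) + 1


one_gb = 1024 ** 3
calls = api_calls_for(one_gb)
print(calls)                  # 53
print(HOURLY_LIMIT // calls)  # 94
```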
5. It’s Just Wrong
Git is version control. S3 is storage. Using a screwdriver as a hammer might work, but why?
When This Approach Is Actually Allowed
This technique is appropriate in specific organizational contexts:
Self-Hosted Git Instances: If your organization runs its own GitLab, Gitea, or GitHub Enterprise instance on their own infrastructure, and IT explicitly permits using it for file storage, then this approach is fair game.
Requirements for legitimate use:
- Self-hosted Git server (not github.com, not gitlab.com)
- Organization owns and operates the infrastructure
- IT policy explicitly allows file storage usage
- You have written approval from infrastructure team
- Storage quotas and limits are clearly defined
Example scenario: Your company runs GitLab on internal servers with 10TB of storage allocated for engineering artifacts. IT has approved using it for ML model storage as part of your CI/CD pipeline. In this case, the chunking technique is a valid engineering solution.
Still not recommended for:
- Public GitHub (github.com)
- Public GitLab (gitlab.com)
- Any hosted Git service you don’t control
- Circumventing organizational policies
Key Takeaways
- Git isn’t storage - It’s version control.
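- Use Git LFS - It is designed for exactly this use case.
- Cloud storage exists - Put the file in S3 or GCS and track its URI and hash in Git.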
- The hack works - But it’s educational, not production-ready.
- GitHub will notice - Abuse leads to account restrictions.
- Async downloads are fast - Sequential uploads are mandatory.
- Multiple repos help - But don’t solve the core problem.
Final Thoughts
If you learned something from this article, great. If you’re tempted to use this in production, please reconsider. Your future self (and your GitHub account) will thank you.
The right tool for the job isn’t always the one you’re already using. Sometimes, it’s worth paying 5 bucks a month for Git LFS or setting up S3. Engineering isn’t about clever hacks; it’s about sustainable systems.