Pushing Large Files to GitHub: A Technical Deep Dive (For Educational Purposes)

This article is for educational purposes only. Using GitHub/GitLab as a general-purpose file storage backend may violate their Acceptable Use Policies. The techniques described here should only be used in self-hosted Git instances within organizations that explicitly permit such usage. The author and publisher are not responsible for any account restrictions or service disruptions. Always review and comply with your Git provider's Terms of Service.

You’ve just trained a machine learning model, exported a massive dataset, or compiled a binary that’s north of 100MB. Now you want to push it to GitHub. GitHub politely refuses with an error about file size limits. Your first instinct might be to find a workaround.

Stop. This article will show you how, but also why you shouldn’t.

Why Would Anyone Want to Store Large Files in Git?

The use case is surprisingly common:

  • ML practitioners who want to version models alongside code
  • Data scientists sharing datasets with collaborators
  • Game developers with asset files that exceed 100MB
  • Researchers distributing large binary outputs

The appeal is obvious: keep everything in one repository, use the same PR workflow, and leverage GitHub’s web interface for access control.

The Problem: Git Wasn’t Built for This

Git is a version control system, not a storage engine. Here’s why large files are fundamentally incompatible with Git’s design:

1. Every Clone Downloads the Entire History

When you git clone a repository, you download every version of every file by default. A 50MB model file modified 10 times can add up to roughly 500MB in your .git folder when the versions don’t compress well against each other. For collaborators with slow connections, this is a nightmare.

2. Git’s Delta Compression Fails on Binary Files

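Git shrinks history by storing similar objects as deltas when it packs a repository. This works beautifully for text but poorly for most large binaries: formats that are already compressed or serialized (model checkpoints, archives, media) tend to change throughout the file after even a small edit, so there is little for a delta to reuse. In the worst case, each new version of a 100MB binary costs close to another 100MB.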
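
To make the storage cost concrete, here is a minimal sketch of how Git names a blob, assuming the default SHA-1 object format: the ID is a hash over a short header plus the full content. The helper name `git_blob_id` is made up for this illustration.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob (SHA-1 object format)."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

original = bytes(1024 * 1024)        # 1 MiB of zeros stands in for a large binary
modified = b"\x01" + original[1:]    # same size, only the first byte changed

# The IDs differ, so Git treats the edited file as a brand-new object.
print(git_blob_id(original) == git_blob_id(modified))  # False
```

Because the ID covers the whole content, any edit produces a new object. Whether that object later shrinks inside a packfile depends on how well the two versions delta against each other.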

3. GitHub’s Hard Limits

GitHub enforces strict limits:

  • 50MB warning - GitHub warns but allows the push
  • 100MB rejection - GitHub refuses the push entirely
  • Repository size - Repositories over 5GB trigger warnings, 100GB+ risk account restrictions
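
If you want to catch oversized files before a push fails, a small local scan is enough. This is a sketch only: `classify_files` is a hypothetical helper, and the two thresholds are simply the 50MB and 100MB figures quoted above, passed as parameters so you can adjust them.

```python
from pathlib import Path
from typing import Dict, List

WARN_BYTES = 50 * 1024 * 1024     # warning threshold quoted above
REJECT_BYTES = 100 * 1024 * 1024  # rejection threshold quoted above

def classify_files(
    root: Path,
    warn_bytes: int = WARN_BYTES,
    reject_bytes: int = REJECT_BYTES,
) -> Dict[str, List[Path]]:
    """Group files under root by which size threshold they cross."""
    result: Dict[str, List[Path]] = {"warn": [], "reject": []}
    for path in sorted(root.rglob("*")):
        # Skip Git's own object store and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > reject_bytes:
            result["reject"].append(path)
        elif size > warn_bytes:
            result["warn"].append(path)
    return result
```

Calling `classify_files(Path("."))` from the repository root returns the files that would trigger a warning and the ones that would be rejected outright.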

What You Should Actually Use

Before we dive into the “how,” here are the proper solutions:

Git LFS (Large File Storage)

# One-time setup: enable the Git LFS hooks (the git-lfs package must already be installed)
git lfs install

# Track large files
git lfs track "*.pkl"
git lfs track "*.h5"
git lfs track "models/*"

# Add and commit normally
git add .gitattributes
git commit -m "Track model files with LFS"

Pros:

  • Transparent workflow (feels like regular Git)
  • First 1GB of storage is free
  • Designed for this exact use case

Cons:

  • Costs money beyond free tier ($5/month per 50GB)
  • Requires Git LFS installation on all machines
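
It helps to see what actually lands in the repository for an LFS-tracked file: a small text pointer, while the real bytes live on the LFS server. The sketch below builds such a pointer following the layout in the Git LFS pointer specification (a version line, a SHA-256 object ID, and a size). The function name `lfs_pointer` is hypothetical.

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

# A 5 MiB payload is represented in Git by a pointer of roughly 130 bytes.
pointer = lfs_pointer(bytes(5 * 1024 * 1024))
print(len(pointer))
```

This is why clones stay small with LFS: history contains only pointers, and the large objects are fetched on checkout.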

Cloud Storage + Metadata

# Store in S3/GCS, track URL in Git
import boto3

# Upload to S3
s3 = boto3.client('s3')
s3.upload_file('model.pkl', 'my-bucket', 'models/v1.pkl')

# Store metadata in Git
metadata = {
    "model_version": "v1",
    "s3_uri": "s3://my-bucket/models/v1.pkl",
    "sha256": "abc123...",
    "size_mb": 250
}

This is what production systems actually use.
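
The metadata above uses a placeholder hash. Here is a sketch of filling it in from the local file before upload, so collaborators can verify what they download. `build_metadata` and its parameters are assumptions for illustration, and the upload call itself is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict

def build_metadata(filepath: Path, bucket: str, key: str, version: str) -> Dict:
    """Describe a local file so the small JSON record can be committed to Git."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Hash in 1 MiB blocks so large files never need to fit in memory.
        for block in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(block)
    return {
        "model_version": version,
        "s3_uri": f"s3://{bucket}/{key}",
        "sha256": sha256.hexdigest(),
        "size_mb": round(filepath.stat().st_size / (1024 ** 2), 2),
    }
```

Writing the returned dict to a JSON file next to your code gives Git something small and diffable to version.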

Pros:

  • Built for ML workflows
  • Works with any cloud storage
  • Versioning built-in

Cons:

  • Another tool to learn
  • Requires cloud storage setup

The Educational Hack: Chunking + GitHub API

⚠️ WARNING: This is for educational purposes only. Using GitHub as a general-purpose storage backend violates GitHub’s Acceptable Use Policy. Abuse can result in account suspension.

That said, understanding how to work around Git’s limitations teaches valuable lessons about API design, chunking strategies, and distributed systems.

Requirements

Before you start, you need:

  1. GitHub Personal Access Token with repo scope

    • Go to Settings → Developer settings → Personal access tokens → Tokens (classic)
    • Generate new token with repo permissions (full control of private repositories)
    • Save it securely (you’ll never see it again)
  2. A GitHub repository where you have write access

The Strategy: Chunk, Upload, Reassemble

Here’s the workflow in plain terms:

Saving a large file:

  1. Take your 150MB file
  2. Split it into 20MB pieces (chunk_000, chunk_001, chunk_002, etc.)
  3. Upload each chunk to GitHub one at a time using the API
  4. Create a manifest.json file that lists where each chunk is stored and in what order
  5. Upload the manifest to GitHub

Getting the file back:

  1. Download the manifest.json
  2. Download all the chunks
  3. Put the chunks back together in the right order
  4. Verify the file matches the original using a hash check
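
Before touching any API, the whole round trip can be sketched offline. In this minimal version a plain dict stands in for the remote repository, and the names `split_bytes`, `build_manifest`, and `reassemble` are made up for the example.

```python
import hashlib
from typing import Dict, List, Tuple

def split_bytes(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def build_manifest(name: str, data: bytes, chunk_size: int) -> Tuple[Dict, Dict[str, bytes]]:
    """Return (manifest, store), where store maps chunk names to chunk bytes."""
    chunks = split_bytes(data, chunk_size)
    store = {f"{name}.chunk_{i:03d}": chunk for i, chunk in enumerate(chunks)}
    manifest = {
        "file_name": name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "chunks": list(store),  # dicts keep insertion order, so this is chunk order
        "chunk_count": len(chunks),
    }
    return manifest, store

def reassemble(manifest: Dict, store: Dict[str, bytes]) -> bytes:
    """Join chunks in manifest order and verify the result against the hash."""
    data = b"".join(store[name] for name in manifest["chunks"])
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("hash mismatch after reassembly")
    return data
```

The manifest is the only thing that records order, which is why it is uploaded last and downloaded first.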

Implementation: The Uploader

import os
import base64
import hashlib
import json
import requests
from pathlib import Path
from typing import List, Dict

class GitHubChunkedUploader:
    """
    Uploads large files to GitHub by splitting into chunks.
    
    WARNING: Educational purposes only. Not for production use.
    May violate GitHub's Acceptable Use Policy if abused.
    """
    
    CHUNK_SIZE = 20 * 1024 * 1024  # 20MB chunks
    
    def __init__(self, token: str, repo: str, branch: str = "main"):
        """
        Initialize uploader.
        
        Args:
            token: GitHub personal access token (repo scope)
            repo: Repository in format "username/repo"
            branch: Target branch (default: main)
        """
        self.token = token
        self.repo = repo
        self.branch = branch
        self.base_url = f"https://api.github.com/repos/{repo}/contents"
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28"
        }
    
    def calculate_sha256(self, filepath: Path) -> str:
        """Calculate SHA-256 hash of file for integrity verification."""
        sha256_hash = hashlib.sha256()
        with open(filepath, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()
    
    def split_file(self, filepath: Path) -> List[bytes]:
        """
        Split file into chunks.
        
        Returns:
            List of byte chunks, each ≤ CHUNK_SIZE
        """
        chunks = []
        with open(filepath, "rb") as f:
            while True:
                chunk = f.read(self.CHUNK_SIZE)
                if not chunk:
                    break
                chunks.append(chunk)

        # For simplicity the chunks are kept in memory; for very large files they should be written to a temp directory or streamed instead
        return chunks
    
    def upload_chunk(
        self, 
        chunk_data: bytes, 
        remote_path: str, 
        commit_message: str
    ) -> str:
        """
        Upload a single chunk to GitHub.
        
        Args:
            chunk_data: Raw bytes to upload
            remote_path: Path in repository (e.g., "chunks/file_000")
            commit_message: Git commit message
            
        Returns:
            Download URL for the uploaded chunk
        """
        # Encode chunk as base64 (GitHub API requirement)
        content_encoded = base64.b64encode(chunk_data).decode()
        
        payload = {
            "message": commit_message,
            "branch": self.branch,
            "content": content_encoded
        }
        
        url = f"{self.base_url}/{remote_path}"
        response = requests.put(url, headers=self.headers, json=payload)
        
        if response.status_code not in (200, 201):
            raise Exception(f"Upload failed: {response.status_code} - {response.text}")
        
        # Return the raw download URL
        return response.json()["content"]["download_url"]
    
    def upload_large_file(
        self, 
        filepath: Path, 
        remote_dir: str = "large_files"
    ) -> Dict:
        """
        Upload a large file by chunking.
        
        Args:
            filepath: Local file path
            remote_dir: Directory in repo to store chunks
            
        Returns:
            Manifest dict with chunk URLs and metadata
        """
        if not filepath.exists():
            raise FileNotFoundError(f"File not found: {filepath}")
        
        file_size = filepath.stat().st_size
        file_name = filepath.name
        
        print(f"Uploading {file_name} ({file_size / (1024**2):.2f} MB)")
        
        # Check if chunking is needed
        if file_size <= self.CHUNK_SIZE:
            print("   File is small enough, uploading directly...")
            with open(filepath, "rb") as f:
                chunk_data = f.read()
            
            url = self.upload_chunk(
                chunk_data,
                f"{remote_dir}/{file_name}",
                f"Upload {file_name}"
            )
            
            return {
                "file_name": file_name,
                "size_bytes": file_size,
                "sha256": self.calculate_sha256(filepath),
                "chunks": [url],
                "chunk_count": 1
            }
        
        # Split into chunks
        print("   Splitting into chunks...")
        chunks = self.split_file(filepath)
        print(f"   Created {len(chunks)} chunks")
        
        # Upload chunks sequentially (parallel would cause conflicts!)
        chunk_urls = []
        for i, chunk_data in enumerate(chunks):
            chunk_name = f"{file_name}.chunk_{i:03d}"
            print(f"   Uploading chunk {i+1}/{len(chunks)} ({len(chunk_data) / (1024**2):.2f} MB)")
            
            url = self.upload_chunk(
                chunk_data,
                f"{remote_dir}/{file_name}_chunks/{chunk_name}",
                f"Upload {chunk_name}"
            )
            chunk_urls.append(url)
        
        # Create manifest
        manifest = {
            "file_name": file_name,
            "size_bytes": file_size,
            "sha256": self.calculate_sha256(filepath),
            "chunks": chunk_urls,
            "chunk_count": len(chunks)
        }
        
        # Upload manifest
        print("   Uploading manifest...")
        manifest_json = json.dumps(manifest, indent=2)
        manifest_url = self.upload_chunk(
            manifest_json.encode(),
            f"{remote_dir}/{file_name}.manifest.json",
            f"Upload manifest for {file_name}"
        )
        manifest["manifest_url"] = manifest_url
        
        print(f"Upload complete! Manifest: {manifest_url}")
        return manifest

# Example usage
if __name__ == "__main__":
    # Initialize uploader
    uploader = GitHubChunkedUploader(
        token="ghp_your_token_here",  # Replace with your token
        repo="username/repo",          # Replace with your repo
        branch="main"
    )
    
    # Upload a large file
    manifest = uploader.upload_large_file(Path("large_model.pkl"))
    
    # Save manifest locally for later retrieval
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
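
The uploader above holds every chunk in memory at once. One possible alternative, sketched here under the assumption that you only need one chunk at a time, is a generator that reads lazily. `iter_chunks` is a hypothetical helper, not part of the class above.

```python
from pathlib import Path
from typing import Iterator, Tuple

def iter_chunks(
    filepath: Path, chunk_size: int = 20 * 1024 * 1024
) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, chunk) pairs, reading one chunk from disk at a time."""
    with open(filepath, "rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield index, chunk
            index += 1
```

Looping with `for i, chunk_data in iter_chunks(path):` keeps peak memory near one chunk, at the cost of not knowing the chunk count until the loop finishes.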

Implementation: The Downloader

import requests
import hashlib
from pathlib import Path
from typing import Dict

class GitHubChunkedDownloader:
    """
    Downloads and reassembles chunked files from GitHub.
    
    For better performance, consider using async/await with aiohttp.
    This implementation uses synchronous requests for simplicity.
    """
    
    @staticmethod
    def download_chunk(url: str) -> bytes:
        """Download a single chunk from GitHub."""
        response = requests.get(url)
        response.raise_for_status()
        return response.content
    
    @staticmethod
    def verify_sha256(filepath: Path, expected_hash: str) -> bool:
        """Verify downloaded file matches expected SHA-256 hash."""
        sha256_hash = hashlib.sha256()
        with open(filepath, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest() == expected_hash
    
    def download_large_file(
        self, 
        manifest: Dict, 
        output_path: Path
    ) -> None:
        """
        Download and reassemble a chunked file.
        
        Args:
            manifest: Manifest dict from upload (with chunk URLs)
            output_path: Where to save the reassembled file
        """
        file_name = manifest["file_name"]
        chunk_urls = manifest["chunks"]
        expected_size = manifest["size_bytes"]
        expected_hash = manifest["sha256"]
        
        print(f"Downloading {file_name} ({expected_size / (1024**2):.2f} MB)")
        print(f"   {len(chunk_urls)} chunks to download")
        
        # Download chunks sequentially
        # NOTE: Async would be much faster here (aiohttp + asyncio.gather)
        chunks = []
        for i, url in enumerate(chunk_urls):
            print(f"   Downloading chunk {i+1}/{len(chunk_urls)}...")
            chunk_data = self.download_chunk(url)
            chunks.append(chunk_data)
        
        # Reassemble
        print("   Reassembling file...")
        with open(output_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        
        # Verify integrity
        print("   Verifying integrity...")
        if not self.verify_sha256(output_path, expected_hash):
            raise Exception("SHA-256 hash mismatch! File may be corrupted.")
        
        actual_size = output_path.stat().st_size
        if actual_size != expected_size:
            raise Exception(f"Size mismatch! Expected {expected_size}, got {actual_size}")
        
        print(f"Download complete and verified: {output_path}")

# Example usage
if __name__ == "__main__":
    import json
    
    # Load manifest from upload
    with open("manifest.json", "r") as f:
        manifest = json.load(f)
    
    # Download and reassemble
    downloader = GitHubChunkedDownloader()
    downloader.download_large_file(manifest, Path("downloaded_model.pkl"))
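
The downloader collects all chunks in a list before writing. A lower-memory variant writes each chunk as it arrives and hashes incrementally. In this sketch the `fetch` callable is an assumption: pass something like `GitHubChunkedDownloader.download_chunk`, or a fake for testing.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict

def download_streaming(
    manifest: Dict, output_path: Path, fetch: Callable[[str], bytes]
) -> None:
    """Write chunks to disk as they arrive and verify size and hash at the end."""
    sha256 = hashlib.sha256()
    written = 0
    with open(output_path, "wb") as out:
        for url in manifest["chunks"]:
            chunk = fetch(url)
            out.write(chunk)
            sha256.update(chunk)
            written += len(chunk)
    if written != manifest["size_bytes"] or sha256.hexdigest() != manifest["sha256"]:
        output_path.unlink()  # do not leave a corrupt file behind
        raise ValueError("Reassembled file does not match the manifest")
```

Hashing while writing also avoids re-reading the finished file from disk just to verify it.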

Why Sequential Uploads, Not Parallel?

You might think: “20MB chunks × 8 threads = 8x faster uploads!” Unfortunately, no.

Git repositories are stateful. Each commit depends on the previous commit’s SHA. When you upload chunk_000, GitHub creates commit abc123. When you upload chunk_001, it needs to reference abc123 as the parent.

If you upload in parallel:

Thread 1: Upload chunk_000 → commit abc123
Thread 2: Upload chunk_001 → expects parent abc123 (not available yet!)
Thread 3: Upload chunk_002 → expects parent def456 (where is it?)

Result: the API rejects the out-of-date requests with 409 Conflict errors, uploads fail, and chunks go missing unless you retry them.

Sequential uploads ensure clean history. Boring, but correct.
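
The conflict is easier to see in a toy model. The class below is not GitHub’s implementation, just a sketch of a branch that accepts a commit only if it was built on the current head. `BranchRef` is a made-up name.

```python
class BranchRef:
    """Toy branch: a commit is accepted only if its parent is the current head."""

    def __init__(self) -> None:
        self.head = "commit-0"
        self._counter = 0

    def commit(self, parent: str, label: str) -> str:
        if parent != self.head:
            raise RuntimeError(
                f"conflict: {label} was built on {parent}, but head is {self.head}"
            )
        self._counter += 1
        self.head = f"commit-{self._counter}"
        return self.head

branch = BranchRef()
parent_seen_by_a = branch.head  # two writers read the head at the same moment
parent_seen_by_b = branch.head

branch.commit(parent_seen_by_a, "chunk_000")      # first writer wins
try:
    branch.commit(parent_seen_by_b, "chunk_001")  # second writer's parent is stale
except RuntimeError as err:
    print(err)
```

Uploading one chunk at a time means every request sees the head left by the previous one, so the stale-parent case never arises.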

Async Downloads: The Faster Alternative

Downloading, however, is perfectly parallelizable:

import asyncio
import aiohttp
from pathlib import Path
from typing import List, Dict

async def download_chunk_async(session: aiohttp.ClientSession, url: str) -> bytes:
    """Download a chunk asynchronously."""
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.read()

async def download_large_file_async(manifest: Dict, output_path: Path):
    chunk_urls = manifest["chunks"]  # ordered chunk URLs recorded at upload time
    ...

    async with aiohttp.ClientSession() as session:
        # Download all chunks concurrently
        tasks = [download_chunk_async(session, url) for url in chunk_urls]
        chunks = await asyncio.gather(*tasks)
    
    # Reassemble (order preserved by asyncio.gather)
    print("   Reassembling...")
    with open(output_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    
    ...

# Usage
if __name__ == "__main__":
    import json
    
    with open("manifest.json", "r") as f:
        manifest = json.load(f)
    
    asyncio.run(download_large_file_async(manifest, Path("model.pkl")))

This requests all chunks concurrently (8 for the 150MB example), saturating your bandwidth. On a 1Gbps connection, I’ve seen 5-7x speedups compared to sequential downloads.
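
For files with many chunks you may want to cap how many requests are in flight at once. Here is a minimal sketch using asyncio.Semaphore. The `fetch` coroutine is an assumption (for example, a wrapper around `download_chunk_async` with the session already bound), and `gather_bounded` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, List

async def gather_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[bytes]],
    limit: int = 8,
) -> List[bytes]:
    """Fetch all URLs with at most `limit` requests running at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url: str) -> bytes:
        async with semaphore:  # wait here while `limit` fetches are already active
            return await fetch(url)

    # gather returns results in the order of its arguments, so chunk order is kept.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

The limit of 8 is arbitrary. Lower it if the server starts throttling, and note that all downloaded chunks still sit in memory until they are written out.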

Scaling Up: Multiple Repositories

The single-repo approach has problems:

  1. Repository bloat - Hundreds of chunks in one repo is ugly
  2. GitHub API rate limits - 5,000 requests/hour per token
  3. Sequential upload bottleneck - Large files take forever

Solution: Shard across multiple repositories.

class MultiRepoUploader:
    """Upload chunks across multiple repositories for parallelism."""
    
    def __init__(self, token: str, repos: List[str]):
        self.uploaders = [
            GitHubChunkedUploader(token, repo) for repo in repos
        ]
    
    def upload_large_file_sharded(self, filepath: Path) -> Dict:
        """Split chunks across repositories for parallel uploads."""
        chunks = self.uploaders[0].split_file(filepath)  # split_file is defined on GitHubChunkedUploader
        num_repos = len(self.uploaders)
        
        # Distribute chunks across repos
        chunk_urls = []
        for i, chunk_data in enumerate(chunks):
            repo_idx = i % num_repos  # Round-robin distribution
            uploader = self.uploaders[repo_idx]
            
            # Sequential here for simplicity; chunks bound for different repos could be uploaded in parallel
            url = uploader.upload_chunk(chunk_data, f"chunk_{i:03d}", f"Upload chunk {i}")
            chunk_urls.append(url)
        
        return {"chunks": chunk_urls, "file_name": filepath.name}

Benefits:

  • Upload to 4 repos in parallel = up to 4x throughput
  • Distribute load across repositories

Drawbacks:

  • Managing multiple repos is annoying
  • Doesn’t raise the API rate limit, which counts against the token’s account rather than the repository
  • Still doesn’t solve the fundamental “Git isn’t storage” problem
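
The class above still uploads one chunk at a time. One way to actually get parallelism, sketched here under the assumption that each repository must still receive its chunks sequentially, is one worker thread per repository. `upload_sharded` and the `upload` callable are hypothetical; in practice `upload` would wrap `GitHubChunkedUploader.upload_chunk`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def upload_sharded(
    chunks: List[bytes],
    repos: List[str],
    upload: Callable[[str, int, bytes], str],
) -> List[str]:
    """Upload chunks round-robin: parallel across repos, sequential within each."""
    if not repos:
        raise ValueError("at least one repository is required")

    # Chunk i is assigned to repos[i % len(repos)].
    per_repo: Dict[str, List[int]] = {repo: [] for repo in repos}
    for i in range(len(chunks)):
        per_repo[repos[i % len(repos)]].append(i)

    urls: List[str] = [""] * len(chunks)

    def worker(repo: str) -> None:
        for i in per_repo[repo]:  # one commit at a time per repository
            urls[i] = upload(repo, i, chunks[i])

    with ThreadPoolExecutor(max_workers=len(repos)) as pool:
        list(pool.map(worker, repos))  # consuming the results re-raises worker errors
    return urls
```

Each slot in `urls` is written by exactly one worker, so the returned list stays in chunk order regardless of which repository finishes first.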

Why This Is Still a Terrible Idea

Let’s be brutally honest about why using Git as a storage backend is bad engineering:

1. GitHub Will Notice (and May Ban You)

From GitHub’s Acceptable Use Policy:

“GitHub’s file storage is not intended to be used as a general-purpose file storage platform… Accounts in violation may have access restricted or terminated.”

If you upload hundreds of gigabytes, GitHub’s abuse detection will flag your account. At best, they’ll throttle your API access. At worst, permanent ban.

2. Performance Degrades Over Time

Each uploaded chunk creates a new commit. After 1,000 commits, git clone has to fetch 1,000 commits plus every chunk blob they introduced, so cloning and fetching become very slow.

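3. Almost No Deduplication

Git only deduplicates byte-identical objects. Re-upload exactly the same chunk and it is stored once, but insert a single byte near the start of a file and every fixed-size chunk after it shifts, so the entire file is stored again. You’ll burn through GitHub’s storage limits fast.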
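
A small experiment shows how fixed-size chunking reacts to edits. This sketch hashes each chunk before and after a change and counts how many differ. The helper names are made up, and the sizes are tiny so it runs instantly.

```python
import hashlib
from itertools import zip_longest
from typing import List

def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    """Hash each fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def changed_chunks(old: bytes, new: bytes, chunk_size: int) -> int:
    """Count chunk positions whose content differs between old and new."""
    pairs = zip_longest(chunk_hashes(old, chunk_size), chunk_hashes(new, chunk_size))
    return sum(1 for a, b in pairs if a != b)

base = bytes(i % 251 for i in range(10_000))  # 10 chunks of 1,000 bytes
overwritten = b"\xff" + base[1:]              # one byte replaced in place
inserted = b"\xff" + base                     # one byte inserted at the front

print(changed_chunks(base, overwritten, 1000))  # 1
print(changed_chunks(base, inserted, 1000))     # 11
```

An in-place overwrite touches one chunk, while an insertion shifts every boundary, so all ten chunks change and an eleventh appears.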

4. API Rate Limits Kill You

GitHub allows 5,000 API requests/hour with authentication. A 1GB file is about 52 chunks of 20MB, so roughly 53 API calls including the manifest. Upload 100 files of that size and you’re rate-limited until the hourly window resets.
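
The arithmetic is easy to check. This sketch assumes the uploader’s behavior from earlier (one call for a file that fits in a single chunk, otherwise one call per chunk plus one for the manifest) and the 5,000 requests/hour figure quoted above. `api_calls_for` is a hypothetical helper.

```python
CHUNK_BYTES = 20 * 1024 * 1024  # chunk size used by the uploader
HOURLY_LIMIT = 5000             # authenticated requests per hour, as quoted above

def api_calls_for(file_bytes: int) -> int:
    """Number of upload requests needed for one file of the given size."""
    if file_bytes <= CHUNK_BYTES:
        return 1  # small files are uploaded directly, with no manifest
    chunks = (file_bytes + CHUNK_BYTES - 1) // CHUNK_BYTES  # ceiling division
    return chunks + 1  # one extra call for the manifest

calls = api_calls_for(1024 ** 3)    # a 1 GiB file
print(calls)                        # 53
print(HOURLY_LIMIT // calls)        # 94 such files fit in one hour
```

Downloads from raw URLs and any retries are not counted here, so treat the result as a lower bound on traffic.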

5. It’s Just Wrong

Git is version control. S3 is storage. Using a screwdriver as a hammer might work, but why?

When This Approach Is Actually Allowed

This technique is appropriate in specific organizational contexts:

Self-Hosted Git Instances: If your organization runs its own GitLab, Gitea, or GitHub Enterprise instance on its own infrastructure, and IT explicitly permits using it for file storage, then this approach is fair game.

Requirements for legitimate use:

  • Self-hosted Git server (not github.com, not gitlab.com)
  • Organization owns and operates the infrastructure
  • IT policy explicitly allows file storage usage
  • You have written approval from infrastructure team
  • Storage quotas and limits are clearly defined

Example scenario: Your company runs GitLab on internal servers with 10TB of storage allocated for engineering artifacts. IT has approved using it for ML model storage as part of your CI/CD pipeline. In this case, the chunking technique is a valid engineering solution.

Still not recommended for:

  • Public GitHub (github.com)
  • Public GitLab (gitlab.com)
  • Any hosted Git service you don’t control
  • Circumventing organizational policies

Key Takeaways

  1. Git isn’t storage - It’s version control.
  2. Use Git LFS - It’s designed for this exact use case.
  3. Cloud storage exists - Put the file in S3 or GCS and track its URI and hash in Git.
  4. The hack works - But it’s educational, not production-ready.
  5. GitHub will notice - Abuse leads to account restrictions.
  6. Async downloads are fast - Sequential uploads are mandatory.
  7. Multiple repos help - But don’t solve the core problem.

Final Thoughts

If you learned something from this article, great. If you’re tempted to use this in production, please reconsider. Your future self (and your GitHub account) will thank you.

The right tool for the job isn’t always the one you’re already using. Sometimes, it’s worth paying 5 bucks a month for Git LFS or setting up S3. Engineering isn’t about clever hacks, it’s about sustainable systems.
