Build an End-to-End Single Cell RNA Sequencing Pipeline with Scanpy
These articles are AI-generated summaries. Please check the original sources for full details.
A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation
This technical guide demonstrates a modular workflow for analyzing single-cell transcriptomic data using the Scanpy library. It utilizes the PBMC 3k dataset to perform high-resolution clustering and automated cell-type inference through marker gene scoring.
Why This Matters
In single-cell analysis, raw count data carries significant technical noise, such as cells dominated by mitochondrial reads and batch effects, which can lead to false biological conclusions if it is not filtered or regressed out. Rather than assuming clean expression profiles, this pipeline filters cells (keeping those with under 10% mitochondrial counts) and normalizes each cell to a target sum of 1e4 counts, which keeps the results reproducible and comparable across cells.
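To make the normalization step concrete, here is a minimal sketch in plain Python of the arithmetic behind a 1e4 target sum followed by a log(1 + x) transform. The function name and the two toy cells are hypothetical, and this illustrates the idea only; it is not how Scanpy implements it internally.

```python
import math

def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to scale
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]

# Two hypothetical cells with the same composition but a 10x difference in depth.
shallow_cell = [5, 0, 15, 30]    # 50 counts in total
deep_cell = [50, 0, 150, 300]    # 500 counts in total

shallow_norm = normalize_and_log(shallow_cell)
deep_norm = normalize_and_log(deep_cell)

# After normalization the two cells have matching profiles.
print(shallow_norm)
print(deep_norm)
```

Because both cells are rescaled to the same total, the difference in sequencing depth no longer separates them, which is the point of normalizing before clustering.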
Key Insights
- Quality control filtering: Cells with fewer than 200 detected genes, or with 10% or more of their counts from mitochondrial genes, are excluded to remove low-quality cells.
- Dimensionality reduction: PCA is used to capture major variance, followed by UMAP for 2D visualization of complex neighborhood graphs.
- Clustering algorithms: The Leiden algorithm at a resolution of 0.6 partitions the neighborhood graph into clusters, which are then labeled as populations such as B cells, T cells, and Monocytes.
- Marker gene discovery: Wilcoxon rank-sum tests identify cluster-specific genes like MS4A1 for B cells and LYZ for Monocytes.
- Automated annotation: Scoring cells against reference marker sets like NKG7 and GNLY allows for precise NK cell identification.
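The marker-based annotation idea from the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the per-cluster expression values are invented toy numbers, the helper names are hypothetical, and the score is a plain average of marker expression. Scanpy's `sc.tl.score_genes` additionally subtracts the average expression of a reference gene set, which this sketch omits.

```python
# Marker sets taken from the genes named in this summary.
MARKER_SETS = {
    "B cells": ["MS4A1"],
    "Monocytes": ["LYZ"],
    "NK cells": ["NKG7", "GNLY"],
}

def score_cluster(mean_expr, markers):
    """Average expression of the marker genes that are present in the cluster."""
    values = [mean_expr[g] for g in markers if g in mean_expr]
    return sum(values) / len(values) if values else 0.0

def annotate(cluster_means, marker_sets):
    """Assign each cluster the cell type whose marker set scores highest."""
    labels = {}
    for cluster, mean_expr in cluster_means.items():
        scores = {
            cell_type: score_cluster(mean_expr, genes)
            for cell_type, genes in marker_sets.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels

# Hypothetical per-cluster mean log-expression values (not real PBMC 3k output).
toy_cluster_means = {
    "0": {"MS4A1": 2.1, "LYZ": 0.3, "NKG7": 0.1, "GNLY": 0.0},
    "1": {"MS4A1": 0.1, "LYZ": 3.4, "NKG7": 0.2, "GNLY": 0.1},
    "2": {"MS4A1": 0.0, "LYZ": 0.4, "NKG7": 2.8, "GNLY": 2.5},
}

toy_labels = annotate(toy_cluster_means, MARKER_SETS)
print(toy_labels)
```

Picking the single highest score is the simplest possible rule; a real pipeline would likely also check that the winning score clears some margin before trusting the label.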
Working Examples
Initial quality control, filtering, and normalization of the PBMC 3k dataset.
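import scanpy as sc
# Load the PBMC 3k dataset (downloaded on first use) and make gene names unique
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
# Flag mitochondrial genes and compute per-cell QC metrics
adata.var['mt'] = adata.var_names.str.upper().str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Keep cells with at least 200 detected genes and under 10% mitochondrial counts
adata = adata[adata.obs['n_genes_by_counts'] >= 200].copy()
adata = adata[adata.obs['pct_counts_mt'] < 10].copy()
# Normalize each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Select highly variable genes and restrict the matrix to them
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var['highly_variable']].copy()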
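For readers who want to see what the two QC thresholds measure, the following plain-Python sketch recomputes the number of detected genes and the mitochondrial percentage for a single cell and applies the same cutoffs. The helper names and the toy cells are hypothetical; only the thresholds (200 genes, 10%) and the `MT-` prefix come from the code above.

```python
def qc_metrics(cell_counts):
    """Return (n_genes, pct_mt) for one cell given a gene -> count mapping."""
    total = sum(cell_counts.values())
    n_genes = sum(1 for c in cell_counts.values() if c > 0)
    mt_total = sum(
        c for gene, c in cell_counts.items() if gene.upper().startswith("MT-")
    )
    pct_mt = 100.0 * mt_total / total if total else 0.0
    return n_genes, pct_mt

def passes_qc(cell_counts, min_genes=200, max_pct_mt=10.0):
    """Apply the same cutoffs as the pipeline: >= 200 genes and < 10% mitochondrial."""
    n_genes, pct_mt = qc_metrics(cell_counts)
    return n_genes >= min_genes and pct_mt < max_pct_mt

# Hypothetical cells: placeholder gene names plus one mitochondrial gene.
healthy_cell = {f"GENE{i}": 1 for i in range(250)}
healthy_cell["MT-CO1"] = 10      # 10 of 260 counts, about 3.8%

stressed_cell = {f"GENE{i}": 1 for i in range(250)}
stressed_cell["MT-CO1"] = 50     # 50 of 300 counts, about 16.7%

sparse_cell = {f"GENE{i}": 3 for i in range(50)}  # only 50 detected genes

for name, cell in [("healthy", healthy_cell), ("stressed", stressed_cell), ("sparse", sparse_cell)]:
    print(name, qc_metrics(cell), passes_qc(cell))
```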
Dimensionality reduction, clustering, and marker gene identification.
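# Regress out sequencing depth and mitochondrial fraction, then scale each gene (values clipped at 10)
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
# Reduce to principal components, then build a k-nearest-neighbor graph on the first 30 PCs
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30)
# Embed the graph in 2D with UMAP and cluster it with Leiden
sc.tl.umap(adata, min_dist=0.35)
sc.tl.leiden(adata, resolution=0.6)
# Rank marker genes for each cluster against all other cells (Wilcoxon rank-sum test);
# no .raw was stored above, so the test runs on the scaled matrix of highly variable genes
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')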
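The rank-sum idea behind the marker test can be illustrated with a small plain-Python sketch. It computes only the Mann-Whitney U statistic for one gene, comparing cells inside a cluster with all other cells, and normalizes it to the fraction of pairs in which the cluster cell has the higher value. The function names and expression values are hypothetical; Scanpy also reports scores, p-values, and adjusted p-values, none of which are computed here.

```python
def rank_sum_u(group, rest):
    """Mann-Whitney U for `group` versus `rest`; tied values share their average rank."""
    pooled = sorted((value, idx) for idx, value in enumerate(group + rest))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n = len(group)
    return sum(ranks[:n]) - n * (n + 1) / 2

def separation(group, rest):
    """U scaled to [0, 1]: 1.0 means every cluster cell outranks every other cell."""
    return rank_sum_u(group, rest) / (len(group) * len(rest))

# Hypothetical log-expression of one gene inside a cluster and in all other cells.
in_cluster = [3.1, 2.8, 3.5]
other_cells = [0.0, 0.2, 0.0, 0.1]

print(rank_sum_u(in_cluster, other_cells))
print(separation(in_cluster, other_cells))
```

A gene whose separation is close to 1.0 for a cluster is a strong marker candidate for it, while a value near 0.5 means the gene does not distinguish the cluster from the remaining cells.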
Practical Applications
- Use case: Immune profiling of PBMC samples to identify rare cell populations, such as dendritic cells, using markers like FCER1A. Pitfall: Filtering out genes detected in only a few cells too aggressively can remove the signals that distinguish rare cell populations.
- Use case: Drug response studies where Leiden clustering resolution is adjusted to identify treatment-sensitive sub-populations. Pitfall: Failing to regress out technical confounders like mitochondrial percentage can lead to clustering based on artifacts.