Build an End-to-End Single Cell RNA Sequencing Pipeline with Scanpy

A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation

This technical guide demonstrates a modular workflow for analyzing single-cell transcriptomic data with the Scanpy library. It uses the PBMC 3k dataset to perform Leiden clustering and automated cell-type inference through marker gene scoring.

Why This Matters

Raw single-cell sequencing data carries significant technical noise, such as low-quality cells with a high fraction of mitochondrial reads and cell-to-cell differences in sequencing depth, which can lead to false biological conclusions if left uncorrected. This pipeline filters cells (at least 200 detected genes and under 10% mitochondrial counts), normalizes every cell to a target sum of 1e4 counts, and regresses out total counts and mitochondrial percentage before clustering, so that results stay reproducible for immunological research.

Key Insights

Quality control filtering: Cells with fewer than 200 detected genes or with 10% or more mitochondrial counts are excluded, since both usually indicate low-quality or dying cells.
Dimensionality reduction: PCA captures the main axes of variance; a neighborhood graph built on the first 30 components is then embedded in 2D with UMAP for visualization.
Clustering algorithms: The Leiden algorithm at a resolution of 0.6 partitions the neighborhood graph into numbered clusters, which correspond to populations such as B cells, T cells, and Monocytes once annotated.
Marker gene discovery: Wilcoxon rank-sum tests compare each cluster against all other cells and identify cluster-specific genes like MS4A1 for B cells and LYZ for Monocytes.
Automated annotation: Scoring cells against reference marker sets, such as NKG7 and GNLY for NK cells, assigns each cell or cluster its most likely cell type.
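
The guide describes marker-based annotation but does not show it, so here is a minimal NumPy sketch of the idea. It is a simplified stand-in, not Scanpy's method: each cell's score for a cell type is the mean expression of that type's markers minus the cell's mean over all genes, whereas Scanpy's `sc.tl.score_genes` subtracts the mean of a sampled control gene set. The function name `annotate_by_markers`, the marker dictionary layout, the ACTB column, and the toy expression values are assumptions made for illustration; only the marker genes themselves come from the guide.

```python
import numpy as np

# Marker sets built from the genes named in the guide; the grouping is illustrative.
MARKERS = {
    "B cell": ["MS4A1"],
    "Monocyte": ["LYZ"],
    "NK cell": ["NKG7", "GNLY"],
}

def annotate_by_markers(expr, gene_names, markers):
    """Label each cell with the marker set that scores highest.

    expr is a cells x genes matrix of log-normalized expression.
    Returns (labels, scores), where scores has one column per cell type.
    """
    gene_index = {gene: i for i, gene in enumerate(gene_names)}
    background = expr.mean(axis=1)  # per-cell mean over all genes
    cell_types = list(markers)
    scores = np.full((expr.shape[0], len(cell_types)), -np.inf)
    for j, cell_type in enumerate(cell_types):
        cols = [gene_index[g] for g in markers[cell_type] if g in gene_index]
        if cols:  # marker sets with no measured genes keep a score of -inf
            scores[:, j] = expr[:, cols].mean(axis=1) - background
    labels = np.array(cell_types)[scores.argmax(axis=1)]
    return labels, scores

# Toy log-normalized values for three cells (made up for illustration).
genes = ["MS4A1", "LYZ", "NKG7", "GNLY", "ACTB"]
expr = np.array([
    [3.0, 0.1, 0.0, 0.0, 2.0],
    [0.0, 4.0, 0.2, 0.0, 2.1],
    [0.0, 0.3, 3.5, 2.8, 1.9],
])
labels, scores = annotate_by_markers(expr, genes, MARKERS)
print(labels)
```

In practice the per-cell scores are often averaged within each Leiden cluster, so that every cluster receives a single label.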

Working Examples

Initial quality control, filtering, and normalization of the PBMC 3k dataset.

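import scanpy as sc

# Load the PBMC 3k dataset and flag mitochondrial genes
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
adata.var['mt'] = adata.var_names.str.upper().str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Keep cells with at least 200 detected genes and under 10% mitochondrial counts
adata = adata[adata.obs['n_genes_by_counts'] >= 200].copy()
adata = adata[adata.obs['pct_counts_mt'] < 10].copy()
# Normalize each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Keep the log-normalized values of all genes so marker tests do not run on scaled data
adata.raw = adata
# Restrict the working matrix to highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var['highly_variable']].copy()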
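
To make the arithmetic behind the filtering and normalization steps concrete, the following NumPy sketch reproduces it on a tiny count matrix. The counts and the detected-gene threshold of 2 are made up so the example fits on screen (the guide uses 200); the 10% mitochondrial cutoff and the 1e4 target sum are the guide's values. It illustrates the calculations only and is not how Scanpy implements them internally.

```python
import numpy as np

# Toy count matrix: 4 cells x 5 genes (values are made up for illustration).
gene_names = ["MT-CO1", "MT-ND1", "LYZ", "MS4A1", "NKG7"]
counts = np.array([
    [1.0, 1.0, 40.0, 0.0, 8.0],    # healthy-looking cell
    [30.0, 20.0, 10.0, 5.0, 5.0],  # mostly mitochondrial reads
    [0.0, 2.0, 0.0, 60.0, 18.0],   # healthy-looking cell
    [0.0, 0.0, 3.0, 0.0, 0.0],     # only one gene detected
])

# Per-cell QC metrics, analogous to n_genes_by_counts and pct_counts_mt.
is_mt = np.array([name.upper().startswith("MT-") for name in gene_names])
total_counts = counts.sum(axis=1)
n_genes = (counts > 0).sum(axis=1)
pct_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

MIN_GENES = 2  # toy stand-in for the guide's threshold of 200
keep = (n_genes >= MIN_GENES) & (pct_mt < 10.0)
filtered = counts[keep]

# Scale every remaining cell to 10,000 total counts, then apply log(1 + x).
size_factors = 1e4 / filtered.sum(axis=1, keepdims=True)
log_norm = np.log1p(filtered * size_factors)

print("pct_mt:", np.round(pct_mt, 1))
print("kept cells:", np.flatnonzero(keep))
```

After normalization every retained cell sums to the same total before the log transform, so remaining differences between cells reflect relative expression instead of sequencing depth.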

Dimensionality reduction, clustering, and marker gene identification.

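# Regress out sequencing depth and mitochondrial fraction, then scale each gene
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)
# PCA, then a k-nearest-neighbor graph on the first 30 components
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30)
# 2D UMAP embedding and Leiden clustering, both computed from the neighbor graph
sc.tl.umap(adata, min_dist=0.35)
sc.tl.leiden(adata, resolution=0.6)
# Rank marker genes for each Leiden cluster with the Wilcoxon rank-sum test
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')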
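
The marker ranking in the last step rests on the Wilcoxon rank-sum statistic. As a rough sketch of what that statistic measures for a single gene, the code below ranks the pooled expression values of one cluster and of all remaining cells, then converts the cluster's rank sum into a z-score using the normal approximation. The helper names `average_ranks` and `rank_sum_z` and the toy values are hypothetical, the variance omits the tie correction, and Scanpy's own implementation may differ in such details.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank in each tie group
    mean_rank = last_rank - (counts - 1) / 2.0  # midpoint of each tie group
    return mean_rank[inverse]

def rank_sum_z(in_cluster, rest):
    """z-score of the cluster's rank sum (normal approximation, no tie correction)."""
    n1, n2 = len(in_cluster), len(rest)
    ranks = average_ranks(np.concatenate([in_cluster, rest]))
    rank_sum = ranks[:n1].sum()
    expected = n1 * (n1 + n2 + 1) / 2.0
    std = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (rank_sum - expected) / std

# Toy expression of one gene: three cells in the cluster, four cells elsewhere.
cluster_expr = np.array([5.0, 6.0, 7.0])
rest_expr = np.array([1.0, 2.0, 3.0, 4.0])
z = rank_sum_z(cluster_expr, rest_expr)
print(round(z, 3))
```

A large positive z-score means the gene is consistently higher inside the cluster than outside it, which is what makes a gene a useful marker candidate.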

Practical Applications

Use case: Immune profiling of PBMC samples to identify rare cell populations like dendritic cells using markers like FCER1A. Pitfall: Filtering out genes that are detected in only a few cells too aggressively can remove biologically relevant signals from rare cell types.
Use case: Drug response studies where Leiden clustering resolution is adjusted to identify treatment-sensitive sub-populations. Pitfall: Failing to regress out technical confounders like mitochondrial percentage can lead to clustering based on artifacts.
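
The gene over-filtering pitfall can be checked before any filter is applied. The sketch below, on simulated counts, tallies how many cells express each gene and shows which genes survive two different minimum-cell thresholds, comparable to the `min_cells` argument of `sc.pp.filter_genes`. The thresholds, the gene NOISE1, and the count values are assumptions chosen for illustration; FCER1A is the dendritic cell marker named above.

```python
import numpy as np

# Simulated counts for 1,000 cells and 3 genes (made up for illustration).
gene_names = np.array(["ACTB", "FCER1A", "NOISE1"])
counts = np.zeros((1000, 3))
counts[:, 0] = 5.0    # housekeeping gene detected in every cell
counts[:8, 1] = 3.0   # rare-population marker detected in 8 cells
counts[0, 2] = 1.0    # gene detected in a single cell

cells_per_gene = (counts > 0).sum(axis=0)

# Compare a lenient and a strict minimum-cell threshold.
kept_by_threshold = {}
for min_cells in (3, 20):
    kept_by_threshold[min_cells] = gene_names[cells_per_gene >= min_cells].tolist()
    print(min_cells, kept_by_threshold[min_cells])
```

With the stricter threshold the rare marker is dropped along with the noise gene, so checking known markers against the chosen cutoff before filtering is a cheap safeguard.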

References:

https://www.marktechpost.com/2026/03/08/a-coding-guide-to-build-a-complete-single-cell-rna-sequencing-analysis-pipeline-using-scanpy-for-clustering-visualization-and-cell-type-annotation/
