Build an End-to-End Single Cell RNA Sequencing Pipeline with Scanpy

These articles are AI-generated summaries. Please check the original sources for full details.

A Coding Guide to Build a Complete Single Cell RNA Sequencing Analysis Pipeline Using Scanpy for Clustering Visualization and Cell Type Annotation

This technical guide demonstrates a modular workflow for analyzing single-cell transcriptomic data with the Scanpy library. Using the PBMC 3k dataset, it covers quality control, normalization, Leiden clustering, marker gene discovery, and cell-type annotation through marker gene scoring.

Why This Matters

Raw single-cell sequencing data contains significant technical noise, such as mitochondrial contamination and batch effects. Left uncorrected, this noise can drive clustering and lead to false biological conclusions. Rather than assuming clean expression profiles, the pipeline filters low-quality cells (keeping only cells with less than 10% mitochondrial counts), normalizes each cell to a total of 1e4 counts, and regresses out total counts and mitochondrial percentage before clustering, so that downstream immunological analysis rests on reproducible preprocessing.

Key Insights

  • Quality control filtering: Cells with fewer than 200 detected genes, or with 10% or more of their counts from mitochondrial genes, are excluded.
  • Dimensionality reduction: PCA captures the major axes of variance; UMAP then embeds the neighborhood graph built on the top principal components into 2D for visualization.
  • Clustering algorithms: The Leiden algorithm at a resolution of 0.6 partitions the neighborhood graph into clusters, which are later labeled as cell types such as B cells, T cells, and Monocytes.
  • Marker gene discovery: Wilcoxon rank-sum tests compare each cluster against the remaining cells to find cluster-specific genes, such as MS4A1 for B cells and LYZ for Monocytes.
  • Automated annotation: Scoring cells against reference marker sets, such as NKG7 and GNLY for NK cells, assigns each cluster its most likely cell type.
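
The marker-scoring idea in the last bullet can be sketched in plain Python on a single toy cell. This is a simplified stand-in, not the guide's code: it scores a marker set as its mean expression minus the cell's mean over all genes, whereas Scanpy's `sc.tl.score_genes` subtracts the average of a sampled control gene set. The `MARKER_SETS` dictionary, the helper names, and the expression values are assumptions made for illustration; only the gene symbols come from the text.

```python
from statistics import mean

# Hypothetical marker sets built from the genes named in the text.
MARKER_SETS = {
    "B cell": ["MS4A1"],
    "Monocyte": ["LYZ"],
    "NK cell": ["NKG7", "GNLY"],
}

def score_cell(expression, markers):
    """Mean expression of the marker genes minus the mean over all genes."""
    present = [gene for gene in markers if gene in expression]
    if not present:
        return float("-inf")  # no marker measured: this label can never win
    background = mean(expression.values())
    return mean(expression[gene] for gene in present) - background

def annotate(expression, marker_sets=MARKER_SETS):
    """Return the label whose marker set scores highest for this cell."""
    return max(marker_sets, key=lambda label: score_cell(expression, marker_sets[label]))

# Toy log-normalized expression for one cell (made-up values).
nk_like_cell = {"MS4A1": 0.0, "LYZ": 0.2, "NKG7": 3.1, "GNLY": 2.7, "ACTB": 1.0}
label = annotate(nk_like_cell)
```

In a real pipeline the same comparison would typically be made per cluster rather than per cell, for example by averaging each score within a Leiden cluster before picking the top label.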

Working Examples

Initial quality control, filtering, and normalization of the PBMC 3k dataset.

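import scanpy as sc

# Load the PBMC 3k dataset and make sure gene names are unique.
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()

# Flag mitochondrial genes and compute per-cell QC metrics
# (adds n_genes_by_counts, total_counts, pct_counts_mt to adata.obs).
adata.var['mt'] = adata.var_names.str.upper().str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Keep cells with at least 200 detected genes and under 10% mitochondrial counts.
adata = adata[adata.obs['n_genes_by_counts'] >= 200].copy()
adata = adata[adata.obs['pct_counts_mt'] < 10].copy()

# Scale every cell to 10,000 total counts, then log-transform.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Select highly variable genes; all later steps see only this subset.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var['highly_variable']].copy()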
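
To make the arithmetic behind the filtering and normalization steps concrete, here is a minimal sketch in plain Python on a three-gene toy matrix. It is not Scanpy's implementation; the helper names (`qc_pass`, `normalize_and_log`), the toy counts, and the lowered `min_genes=2` threshold are assumptions chosen so the example stays small.

```python
import math

def qc_pass(row, mito_idx, min_genes=200, max_pct_mt=10.0):
    """True if the cell has enough detected genes and a low mitochondrial share."""
    total = sum(row)
    n_genes = sum(1 for count in row if count > 0)
    pct_mt = 100.0 * sum(row[i] for i in mito_idx) / total if total else 0.0
    return n_genes >= min_genes and pct_mt < max_pct_mt

def normalize_and_log(row, target_sum=1e4):
    """Scale the cell to `target_sum` total counts, then apply log(1 + x)."""
    total = sum(row)
    scale = target_sum / total if total else 0.0
    return [math.log1p(count * scale) for count in row]

genes = ["ACTB", "LYZ", "MT-CO1"]
mito_idx = [i for i, gene in enumerate(genes) if gene.upper().startswith("MT-")]

toy_counts = [
    [90, 10, 0],   # 2 genes detected, 0% mitochondrial: kept
    [5, 5, 10],    # 50% mitochondrial: dropped
    [40, 0, 0],    # only 1 gene detected: dropped
]

# min_genes is lowered to 2 only because the toy matrix has three genes.
kept = [row for row in toy_counts if qc_pass(row, mito_idx, min_genes=2)]
normalized = [normalize_and_log(row) for row in kept]
```

After normalization, undoing the log with `math.expm1` and summing a row gives back the target of 10,000, which is what makes expression values comparable across cells with different sequencing depth.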

Dimensionality reduction, clustering, and marker gene identification.

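# Continues from the previous block (uses the same `sc` import and `adata`).
# Remove variation explained by sequencing depth and mitochondrial share,
# then scale each gene to unit variance and clip values at 10.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
sc.pp.scale(adata, max_value=10)

# PCA, then a 12-nearest-neighbor graph on the first 30 components.
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30)

# 2D embedding, Leiden clustering, and per-cluster marker ranking.
sc.tl.umap(adata, min_dist=0.35)
sc.tl.leiden(adata, resolution=0.6)
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')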
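
For intuition about what the Wilcoxon ranking measures, the following plain-Python sketch computes a rank-sum z-score for one gene, comparing cells inside a cluster with all other cells. It uses the normal approximation without tie correction; Scanpy's own implementation is vectorized and may differ in details. The function names and the expression values are made up for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sum_z(in_cluster, rest):
    """z-score of the rank sum of `in_cluster` under the null of no difference."""
    n1, n2 = len(in_cluster), len(rest)
    ranks = average_ranks(list(in_cluster) + list(rest))
    rank_sum = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum - expected) / sd

# Toy log-expression of one gene (think MS4A1) inside and outside a cluster.
in_cluster = [3.0, 2.5, 2.8]
rest = [0.0, 0.1, 0.4, 0.2]
z = rank_sum_z(in_cluster, rest)
```

A large positive z-score means the gene ranks consistently higher inside the cluster, which is the property that puts a gene near the top of that cluster's marker list.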

Practical Applications

  • Use case: Immune profiling of PBMC samples to identify rare cell populations, such as dendritic cells marked by FCER1A. Pitfall: Filtering out genes detected in only a few cells too aggressively can remove the signal from biologically relevant rare populations.
  • Use case: Drug response studies where Leiden clustering resolution is adjusted to identify treatment-sensitive sub-populations. Pitfall: Failing to regress out technical confounders like mitochondrial percentage can lead to clustering based on artifacts.
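
The first pitfall above can be illustrated with a toy gene filter. The code in this summary does not show a gene-level filter, so this sketch assumes one that keeps genes detected in at least `min_cells` cells, which is the behavior of the `min_cells` option in Scanpy's `sc.pp.filter_genes`. The helper name and the counts are hypothetical.

```python
def genes_detected_in_min_cells(counts_by_gene, min_cells):
    """Keep genes with a nonzero count in at least `min_cells` cells."""
    return [
        gene
        for gene, counts in counts_by_gene.items()
        if sum(1 for count in counts if count > 0) >= min_cells
    ]

# Toy counts across 10 cells (made-up values).
toy_counts = {
    "LYZ":    [5, 0, 7, 3, 0, 9, 4, 0, 6, 2],  # detected in 7 cells
    "MS4A1":  [0, 4, 0, 0, 6, 0, 0, 3, 0, 0],  # detected in 3 cells
    "FCER1A": [0, 0, 0, 0, 0, 0, 0, 0, 8, 5],  # detected in 2 cells (rare population)
}

lenient = genes_detected_in_min_cells(toy_counts, min_cells=2)
strict = genes_detected_in_min_cells(toy_counts, min_cells=3)
```

Raising the threshold from 2 to 3 drops FCER1A even though it is strongly expressed in the two cells that carry it, which is how a rare dendritic-cell signal can disappear before clustering.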
