Single-Cell RNA-seq QC

Name: Single-Cell RNA-seq QC
Author: Anthropic

Automated quality control for single-cell RNA-seq data with filtering and visualization

#single-cell #rna-seq #quality-control #bioinformatics

Cells with high mitochondrial content or low gene counts distort downstream analysis if they are not filtered early. This playbook automates single-cell RNA-seq quality control: computing per-cell metrics, filtering low-quality cells, and generating QC visualizations before you proceed to clustering.

Who it's for: bioinformaticians running QC on 10X Chromium single-cell datasets before analysis; wet-lab biologists validating the sequencing quality of their single-cell experiments; genomics core facility staff processing scRNA-seq data for multiple research groups; immunologists ensuring clean cell populations before differential expression analysis; and computational biology students learning standard scRNA-seq QC workflows.

Example

"Run QC on our 10X scRNA-seq dataset and filter low-quality cells" → scRNA-seq QC pipeline: per-cell metric calculation (gene counts, UMI counts, mitochondrial percentage), QC visualizations for threshold selection, adaptive MAD-based filtering with distribution-derived thresholds, gene filtering, and export of a clean, filtered dataset ready for normalization and clustering

CLAUDE.md Template

# Single-Cell RNA-seq Quality Control

Automated QC workflow for single-cell RNA-seq data following scverse best practices.

## When to Use This Playbook

Use when you need to:
- Perform quality control on single-cell RNA-seq data
- Filter low-quality cells or assess data quality
- Generate QC visualizations and metrics
- Follow scverse/scanpy best practices
- Apply MAD-based (median absolute deviation) filtering or outlier detection

**Supported input formats:**
- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)

**Default recommendation**: Use Approach 1 (complete pipeline) unless you need custom or non-standard filtering logic.

## Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)

For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:

```bash
python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```

The script automatically detects the file format and loads it appropriately.
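
As a rough illustration of extension-based format detection, the sketch below chooses a loader from the file suffix. The helper names `detect_format` and `load_counts` are hypothetical and are not taken from `qc_analysis.py`, whose actual detection logic is not shown in this playbook. `anndata.read_h5ad` and `scanpy.read_10x_h5` are the standard readers for the two supported formats.

```python
from pathlib import Path


def detect_format(path):
    """Classify an input file by extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"
    if suffix == ".h5":
        return "10x_h5"
    raise ValueError(f"Unsupported input format: {suffix or path}")


def load_counts(path):
    """Load an AnnData object with the reader matching the detected format."""
    fmt = detect_format(path)
    if fmt == "h5ad":
        import anndata as ad  # imported lazily so detect_format has no heavy dependencies
        return ad.read_h5ad(path)
    import scanpy as sc
    adata = sc.read_10x_h5(path)
    adata.var_names_make_unique()  # 10X gene symbols can contain duplicates
    return adata
```

Calling `load_counts('raw_feature_bc_matrix.h5')` would route through the 10X reader, while any other extension raises a `ValueError` instead of guessing.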

**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- You want the "just works" solution

**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy

**Parameters:**

Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species

Use `--help` to see current default values.
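
For batch processing, one option is to assemble the command line in Python and run it once per input file. The sketch below is an assumption-laden example: `build_qc_command` and `run_batch` are hypothetical helpers, the `qc_out` directory and the value `10` are illustrative only, and it assumes `--mt-threshold` takes a percentage. Only the flags listed above are used.

```python
import subprocess
from pathlib import Path


def build_qc_command(input_path, output_root, mt_threshold=None):
    """Build the argument list for one qc_analysis.py run (hypothetical helper)."""
    input_path = Path(input_path)
    # One results directory per input, mirroring the default naming scheme.
    output_dir = Path(output_root) / f"{input_path.stem}_qc_results"
    cmd = [
        "python3", "scripts/qc_analysis.py", str(input_path),
        "--output-dir", str(output_dir),
    ]
    if mt_threshold is not None:
        cmd += ["--mt-threshold", str(mt_threshold)]
    return cmd


def run_batch(input_dir, output_root, mt_threshold=None):
    """Run the QC script on every .h5ad file in input_dir, stopping on the first failure."""
    for path in sorted(Path(input_dir).glob("*.h5ad")):
        subprocess.run(build_qc_command(path, output_root, mt_threshold), check=True)
```

For example, `run_batch('data', 'qc_out', mt_threshold=10)` would process each `.h5ad` file under `data/` in turn.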

**Outputs:**

All files are saved to the `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved

### Workflow Steps

The pipeline performs the following steps:

1. **Calculate QC metrics** - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
2. **Apply MAD-based filtering** - Permissive outlier detection using MAD thresholds for counts/genes/MT%
3. **Filter genes** - Remove genes detected in few cells
4. **Generate visualizations** - Comprehensive before/after plots with threshold overlays
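
To make steps 1 and 3 concrete, here is a minimal NumPy sketch on a made-up count matrix. It is not the pipeline's code: the matrix, the gene names, and `min_cells = 2` are illustrative assumptions, and it uses a plain `MT-` prefix match (human-style symbols) to flag mitochondrial genes.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 5 genes (columns). Values are made up.
gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E"])
counts = np.array([
    [5, 5, 40, 50, 0],
    [40, 40, 10, 10, 0],
    [0, 2, 30, 60, 8],
    [1, 0, 0, 99, 0],
])

# Step 1: per-cell QC metrics.
total_counts = counts.sum(axis=1)               # count depth per cell
n_genes = (counts > 0).sum(axis=1)              # genes detected per cell
is_mt = np.char.startswith(gene_names, "MT-")   # assumed human-style prefix
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Step 3: keep genes detected in at least `min_cells` cells.
min_cells = 2  # illustrative value, not the script default
cells_per_gene = (counts > 0).sum(axis=0)
keep_genes = cells_per_gene >= min_cells

print(pct_counts_mt)            # mitochondrial percentage per cell
print(gene_names[keep_genes])   # genes that survive the detection filter
```

In this toy data the second cell has 80% mitochondrial counts, which is the kind of cell the filtering step is meant to flag, and `CD3E` is dropped because it is detected in only one cell.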

## Approach 2: Modular Building Blocks (For Custom Workflows)

For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:

```python
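import sys
sys.path.insert(0, 'scripts')  # assumes you run from the project root; adjust as needed

import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)  # annotates adata with QC metrics in place
# ... custom analysis logic here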
```

**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support

**Available utility functions:**

From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics

From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots
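
For intuition about what a MAD-based outlier mask represents, the sketch below flags values that lie more than `n_mads` median absolute deviations from the median. It is a standalone illustration, not the implementation of `detect_outliers_mad`: the function name `mad_outlier_mask` is hypothetical, it uses the raw (unscaled) MAD, and the real helper may transform the metric (for example with a log) before thresholding.

```python
import numpy as np


def mad_outlier_mask(values, n_mads):
    """Return True where a value lies more than n_mads MADs from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # raw MAD, no normal-consistency scaling
    return np.abs(values - median) > n_mads * mad


# Toy per-cell count depths: five similar cells and one extreme cell.
total_counts = np.array([10, 11, 12, 13, 14, 100])
mask = mad_outlier_mask(total_counts, n_mads=5)
print(mask)
```

Here the median is 12.5 and the MAD is 1.5, so with `n_mads=5` only values outside the range 5 to 20 are flagged, which singles out the last cell. A larger `n_mads` gives more permissive filtering.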

**Example custom workflows:**

**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
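import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary
from qc_plotting import plot_qc_distributions

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')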
```

**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
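import anndata as ad
from qc_core import calculate_qc_metrics, apply_hard_threshold, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Flag cells above 10% mitochondrial counts, then keep only the rest
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')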
```

**Example 3: Different thresholds for different subsets**
```python
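import anndata as ad
import numpy as np
from qc_core import calculate_qc_metrics, apply_hard_threshold, filter_cells

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = (adata.obs['cell_type'] == 'neuron').to_numpy()
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')

# Merge the per-subset masks into one full-length mask, then keep passing cells
high_mt = np.zeros(adata.n_obs, dtype=bool)
high_mt[neurons] = np.asarray(neuron_qc)
high_mt[other_cells] = np.asarray(other_qc)
adata_filtered = filter_cells(adata, ~high_mt)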
```

## Best Practices

1. **Be permissive with filtering** - Default thresholds intentionally retain most cells to avoid losing rare populations
2. **Inspect visualizations** - Always review before/after plots to ensure filtering makes biological sense
3. **Consider dataset-specific factors** - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
4. **Check gene annotations** - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
5. **Iterate if needed** - QC parameters may need adjustment based on the specific experiment or tissue type
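
One quick way to act on the gene-annotation check is to count how many gene names match each candidate prefix before running QC. The helper below is a hypothetical sketch, not part of `qc_core.py`. Whether `--mt-pattern` and `mt_pattern` expect a plain prefix or a regular expression is not stated here, so confirm with `--help` before passing the result along.

```python
def guess_mt_prefix(gene_names, candidates=("MT-", "mt-")):
    """Return the candidate prefix that matches the most gene names, or None."""
    hits = {
        prefix: sum(name.startswith(prefix) for name in gene_names)
        for prefix in candidates
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


human_genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
mouse_genes = ["mt-Co1", "mt-Nd1", "Actb", "Gapdh"]
print(guess_mt_prefix(human_genes))
print(guess_mt_prefix(mouse_genes))
```

A `None` result means no gene matched either prefix, which usually indicates that the gene names are identifiers (for example Ensembl IDs) rather than symbols, so the mitochondrial percentage would be computed as zero for every cell.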

## Next Steps After QC

Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation

README.md

What This Does

This playbook automates quality control for single-cell RNA-seq data following scverse best practices. It calculates QC metrics (count depth, gene detection, mitochondrial content), applies MAD-based outlier filtering, and generates comprehensive before/after visualizations -- giving you a clean, filtered dataset ready for downstream analysis like clustering and cell type annotation.

Quick Start

Step 1: Download the Template

Click Download above to get the CLAUDE.md file.

Step 2: Set Up Your Project

Create a project folder with your data and the template:

my-scrna-project/
  CLAUDE.md
  data/
    sample.h5ad    # or raw_feature_bc_matrix.h5

Step 3: Start Working

claude

Say: "Run QC on my single-cell RNA-seq data in data/sample.h5ad"

Two Approaches

Approach 1: Complete Pipeline (Recommended)

A one-command solution for standard QC workflows. Handles metrics calculation, MAD-based filtering, gene filtering, and visualization generation automatically.

Approach 2: Modular Building Blocks

For custom workflows where you need different thresholds for cell subsets, conditional filtering logic, or partial execution (metrics only, no filtering).

Output Files

The pipeline generates these files in the results directory:

qc_metrics_before_filtering.png -- Pre-filtering visualizations
qc_filtering_thresholds.png -- MAD-based threshold overlays
qc_metrics_after_filtering.png -- Post-filtering quality metrics
*_filtered.h5ad -- Clean dataset ready for downstream analysis
*_with_qc.h5ad -- Original data with QC annotations preserved

Tips

Be permissive with filtering -- default thresholds retain most cells to avoid losing rare populations
Always review before/after plots to ensure filtering makes biological sense
Some tissues naturally have higher mitochondrial content (neurons, cardiomyocytes) -- adjust thresholds accordingly
Mitochondrial gene prefixes vary by species: mt- for mouse, MT- for human
QC parameters may need adjustment based on the specific experiment or tissue type

Supported Formats

Format	Source	Extension
AnnData	scanpy/Python workflows	`.h5ad`
10X Genomics	Cell Ranger output	`.h5`

Example Prompts

"Run QC on my single-cell data in sample.h5ad with default settings."
"Analyze the quality of my 10X Genomics h5 file and filter low-quality cells."
"Calculate QC metrics only -- don't filter yet. I want to inspect the distributions first."
"Run QC with a stricter mitochondrial threshold of 10% instead of the default."
"Apply different QC thresholds for neurons vs other cell types in my dataset."