Single-Cell RNA-seq QC
Automated quality control for single-cell RNA-seq data with filtering and visualization
Download this file and place it in your project folder to get started.
# Single-Cell RNA-seq Quality Control
Automated QC workflow for single-cell RNA-seq data following scverse best practices.
## When to Use This Playbook
Use when you need to:
- Perform quality control on single-cell RNA-seq data
- Filter low-quality cells or assess data quality
- Generate QC visualizations and metrics
- Follow scverse/scanpy best practices
- Apply MAD-based filtering or outlier detection
**Supported input formats:**
- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)
**Default recommendation**: Use Approach 1 (complete pipeline) unless you have specific custom requirements or explicitly need non-standard filtering logic.
## Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)
For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:
```bash
python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```
The script detects the file format automatically and loads it with the matching reader.
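Format detection of this kind can be sketched as a dispatch on the file extension. This is a minimal illustration, not the actual logic inside `scripts/qc_analysis.py`: the function names `detect_format` and `load_adata` are hypothetical, and the sketch assumes scanpy's `read_h5ad` and `read_10x_h5` readers are the ones you want.

```python
from pathlib import Path


def detect_format(path):
    """Map a file extension to one of the two supported input formats."""
    suffix = Path(path).suffix.lower()
    if suffix == ".h5ad":
        return "h5ad"
    if suffix == ".h5":
        return "10x_h5"
    raise ValueError(f"Unsupported input format: {path}")


def load_adata(path):
    """Load either supported format into an AnnData object."""
    import scanpy as sc  # imported lazily so detect_format works without scanpy

    if detect_format(path) == "h5ad":
        return sc.read_h5ad(path)
    adata = sc.read_10x_h5(path)
    adata.var_names_make_unique()  # 10X gene symbols can contain duplicates
    return adata


print(detect_format("input.h5ad"))
print(detect_format("raw_feature_bc_matrix.h5"))
```

Raising on unknown extensions, rather than guessing, keeps a mistyped path from being silently read with the wrong loader.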
**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- You want the "just works" solution
**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy
**Parameters:**
Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Minimum number of cells in which a gene must be detected to be kept
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species
Use `--help` to see current default values.
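For batch processing, the documented flags can be assembled programmatically. The sketch below is an assumption-laden illustration: `build_qc_command` and `run_batch` are hypothetical helpers, the per-sample output folder layout is made up for the example, and the value `10` for `--mt-threshold` is illustrative, not the script's default.

```python
from pathlib import Path
import subprocess

SCRIPT = "scripts/qc_analysis.py"  # path taken from the usage example above


def build_qc_command(input_path, output_root=None, mt_threshold=None):
    """Build the argument list for one QC run without executing it."""
    input_path = Path(input_path)
    cmd = ["python3", SCRIPT, str(input_path)]
    if output_root is not None:
        # Hypothetical layout: one results folder per input under output_root
        out_dir = Path(output_root) / f"{input_path.stem}_qc_results"
        cmd += ["--output-dir", str(out_dir)]
    if mt_threshold is not None:
        cmd += ["--mt-threshold", str(mt_threshold)]
    return cmd


def run_batch(data_dir, output_root, mt_threshold=None):
    """Run the QC script once per .h5ad file, stopping on the first failure."""
    for path in sorted(Path(data_dir).glob("*.h5ad")):
        subprocess.run(build_qc_command(path, output_root, mt_threshold), check=True)


# Inspect the command before running anything
print(build_qc_command("data/sample.h5ad", "results", 10))
```

Calling `run_batch("data", "results", mt_threshold=10)` would then process every `.h5ad` file in `data/` with the same settings.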
**Outputs:**
All files are saved to the `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved
### Workflow Steps
The pipeline performs the following steps:
1. **Calculate QC metrics** - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
2. **Apply MAD-based filtering** - Permissive outlier detection using MAD thresholds for counts/genes/MT%
3. **Filter genes** - Remove genes detected in fewer cells than the `--min-cells` threshold
4. **Generate visualizations** - Comprehensive before/after plots with threshold overlays
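The MAD-based filtering in step 2 can be illustrated with a short standalone sketch. It flags a value as an outlier when it lies more than `n_mads` median absolute deviations from the median. The function name `mad_outliers`, the toy counts, and the choice of 5 MADs are assumptions for illustration; the pipeline's own `detect_outliers_mad` may differ (for example, by working on log-transformed metrics).

```python
from statistics import median


def mad_outliers(values, n_mads):
    """Return one boolean per value: True if it is a MAD-based outlier."""
    center = median(values)
    # Median absolute deviation: the median distance from the median
    mad = median(abs(v - center) for v in values)
    lower = center - n_mads * mad
    upper = center + n_mads * mad
    return [v < lower or v > upper for v in values]


# Toy total counts per cell: one near-empty droplet, one very deep cell
total_counts = [1200, 1500, 1350, 1400, 1300, 90, 20000]
flags = mad_outliers(total_counts, n_mads=5)
print(flags)  # [False, False, False, False, False, True, True]
```

Here the median is 1350 and the MAD is 150, so the accepted range is 600 to 2100. Because both statistics are medians, the two extreme cells barely shift the thresholds, which is why this approach is more robust than mean and standard deviation cutoffs.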
## Approach 2: Modular Building Blocks (For Custom Workflows)
For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:
```python
import sys
sys.path.insert(0, 'scripts')  # assumes you run from the folder that contains scripts/
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)  # annotates adata with QC metrics
# ... custom analysis logic here
```
**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support
**Available utility functions:**
From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics
From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots
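Several of these functions return boolean masks, which are meant to be combined before filtering. The sketch below shows that idea in plain Python. It is not the `qc_core` implementation: `hard_threshold_mask` and `combine_masks` are hypothetical stand-ins, and the cutoffs (10% MT, 200 genes) are illustrative values only.

```python
import operator

# Comparison operators selectable by their string form
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def hard_threshold_mask(values, threshold, op=">"):
    """Return True for each value that satisfies `value <op> threshold`."""
    compare = _OPS[op]
    return [compare(v, threshold) for v in values]


def combine_masks(*masks):
    """Flag a cell if any of the individual masks flags it (logical OR)."""
    return [any(flags) for flags in zip(*masks)]


pct_mt = [2.5, 12.0, 7.9, 30.1]
n_genes = [2100, 1800, 150, 2500]

high_mt = hard_threshold_mask(pct_mt, 10, ">")       # cells 2 and 4
low_genes = hard_threshold_mask(n_genes, 200, "<")   # cell 3
remove = combine_masks(high_mt, low_genes)
keep = [not flagged for flagged in remove]
print(keep)  # [True, False, False, False]
```

Keeping each criterion as its own mask makes it easy to report how many cells each filter removes before the masks are merged.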
**Example custom workflows:**
**Example 1: Only calculate metrics and visualize, don't filter yet**
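```python
import anndata as ad
from qc_core import calculate_qc_metrics, print_qc_summary  # scripts/ must be on sys.path
from qc_plotting import plot_qc_distributions
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```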
**Example 2: Apply only MT% filtering, keep other metrics permissive**
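```python
import anndata as ad
from qc_core import calculate_qc_metrics, apply_hard_threshold, filter_cells  # scripts/ must be on sys.path
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Flag only high-MT% cells (mask is True where pct_counts_mt > 10)
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
# Keep the cells that were not flagged
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```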
**Example 3: Different thresholds for different subsets**
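```python
import anndata as ad
from qc_core import calculate_qc_metrics, apply_hard_threshold, filter_cells  # scripts/ must be on sys.path
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Apply type-specific QC (assumes cell_type metadata exists)
neurons = (adata.obs['cell_type'] == 'neuron').to_numpy()
other_cells = ~neurons
# Neurons tolerate higher MT%, other cells use a stricter threshold;
# both masks are computed over all cells so they stay aligned
neuron_qc = apply_hard_threshold(adata, 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata, 'pct_counts_mt', 8, operator='>')
# Use each threshold only for its own subset, then drop the flagged cells
high_mt = (neurons & neuron_qc) | (other_cells & other_qc)
adata_filtered = filter_cells(adata, ~high_mt)
```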
## Best Practices
1. **Be permissive with filtering** - Default thresholds intentionally retain most cells to avoid losing rare populations
2. **Inspect visualizations** - Always review before/after plots to ensure filtering makes biological sense
3. **Consider dataset-specific factors** - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
4. **Check gene annotations** - Mitochondrial gene prefixes vary by species (`mt-` for mouse, `MT-` for human)
5. **Iterate if needed** - QC parameters may need adjustment based on the specific experiment or tissue type
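The gene-annotation point is easy to get wrong silently, so a small sketch may help. It shows how a prefix match flags mitochondrial genes and how the wrong prefix yields 0% mitochondrial content rather than an error. The helper names `flag_genes` and `pct_counts` and the toy counts are hypothetical; the pipeline's `--mt-pattern` handling may work differently.

```python
def flag_genes(gene_names, prefix):
    """Return True for each gene whose name starts with the given prefix."""
    return [name.startswith(prefix) for name in gene_names]


def pct_counts(counts, flags):
    """Percentage of a cell's counts that come from flagged genes."""
    total = sum(counts)
    if total == 0:
        return 0.0
    flagged = sum(c for c, is_flagged in zip(counts, flags) if is_flagged)
    return 100.0 * flagged / total


mouse_genes = ["mt-Co1", "mt-Nd1", "Actb", "Gapdh"]
cell_counts = [30, 20, 100, 50]  # toy counts for one cell

wrong = flag_genes(mouse_genes, "MT-")  # human prefix on mouse data
right = flag_genes(mouse_genes, "mt-")  # mouse prefix

print(pct_counts(cell_counts, wrong))  # 0.0
print(pct_counts(cell_counts, right))  # 25.0
```

A mitochondrial percentage of exactly zero across all cells is therefore a strong hint that the pattern does not match the species' gene naming.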
## Next Steps After QC
Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation
## What This Does
This playbook automates quality control for single-cell RNA-seq data following scverse best practices. It calculates QC metrics (count depth, gene detection, mitochondrial content), applies MAD-based outlier filtering, and generates comprehensive before/after visualizations -- giving you a clean, filtered dataset ready for downstream analysis like clustering and cell type annotation.
## Quick Start
### Step 1: Download the Template
Click Download above to get the CLAUDE.md file.
### Step 2: Set Up Your Project
Create a project folder with your data and the template:

    my-scrna-project/
        CLAUDE.md
        data/
            sample.h5ad    # or raw_feature_bc_matrix.h5

### Step 3: Start Working
Run `claude` in the project folder, then say: "Run QC on my single-cell RNA-seq data in data/sample.h5ad"
## Two Approaches
### Approach 1: Complete Pipeline (Recommended)
A one-command solution for standard QC workflows. Handles metrics calculation, MAD-based filtering, gene filtering, and visualization generation automatically.
### Approach 2: Modular Building Blocks
For custom workflows where you need different thresholds for cell subsets, conditional filtering logic, or partial execution (metrics only, no filtering).
## Output Files
The pipeline generates these files in the results directory:
- `qc_metrics_before_filtering.png` -- Pre-filtering visualizations
- `qc_filtering_thresholds.png` -- MAD-based threshold overlays
- `qc_metrics_after_filtering.png` -- Post-filtering quality metrics
- `*_filtered.h5ad` -- Clean dataset ready for downstream analysis
- `*_with_qc.h5ad` -- Original data with QC annotations preserved
## Tips
- Be permissive with filtering -- default thresholds retain most cells to avoid losing rare populations
- Always review before/after plots to ensure filtering makes biological sense
- Some tissues naturally have higher mitochondrial content (neurons, cardiomyocytes) -- adjust thresholds accordingly
- Mitochondrial gene prefixes vary by species: `mt-` for mouse, `MT-` for human
- QC parameters may need adjustment based on the specific experiment or tissue type
## Supported Formats
| Format | Source | Extension |
|---|---|---|
| AnnData | scanpy/Python workflows | .h5ad |
| 10X Genomics | Cell Ranger output | .h5 |
## Example Prompts
- "Run QC on my single-cell data in sample.h5ad with default settings."
- "Analyze the quality of my 10X Genomics h5 file and filter low-quality cells."
- "Calculate QC metrics only -- don't filter yet. I want to inspect the distributions first."
- "Run QC with a stricter mitochondrial threshold of 10% instead of the default."
- "Apply different QC thresholds for neurons vs other cell types in my dataset."