
Scientific Scvelo

Name: Scientific Scvelo
Author: K-Dense AI

RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for tra...


Your scRNA-seq UMAP shows clusters but you can't tell which direction cells are differentiating — are they progressing or regressing? scVelo infers RNA velocity from spliced/unspliced mRNA ratios, revealing cell state transitions and trajectory directions directly from your single-cell data.

Who it's for: developmental biologists studying cell differentiation trajectories from scRNA-seq data, stem cell researchers inferring lineage commitment directions in progenitor populations, cancer biologists analyzing tumor cell state transitions and dedifferentiation, single-cell bioinformaticians adding velocity analysis to standard scanpy workflows, systems biologists identifying driver genes of cell state changes

Example

"Run RNA velocity analysis on our scRNA-seq data to infer differentiation trajectories" → scVelo pipeline: spliced/unspliced count matrix preparation, velocity estimation with stochastic or dynamical model, velocity stream visualization on UMAP embedding, latent time computation along differentiation trajectory, and driver gene identification for cell state transitions

CLAUDE.md Template


# scVelo — RNA Velocity Analysis

## Overview

scVelo is a widely used Python package for RNA velocity analysis of single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing: the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundance, compared with the gene's steady-state ratio, indicates whether a gene is being upregulated or downregulated in each cell. This allows developmental trajectories to be reconstructed and cell fate decisions to be identified without time-course data.
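
To make the idea concrete, here is a minimal sketch of the simplest steady-state variant for a single gene. It fits a slope `gamma` through the origin on (spliced, unspliced) pairs and treats the residual `u - gamma * s` as the velocity. The function names and toy numbers are illustrative assumptions, not scVelo's API; scVelo's own estimators work on kNN-smoothed moments and use more elaborate fitting.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin: unspliced ~ gamma * spliced."""
    num = sum(s * u for s, u in zip(spliced, unspliced))
    den = sum(s * s for s in spliced)
    if den == 0:
        raise ValueError("spliced counts are all zero; cannot fit a slope")
    return num / den


def steady_state_velocity(spliced, unspliced):
    """Per-cell velocity for one gene: unspliced minus its steady-state expectation."""
    gamma = fit_gamma(spliced, unspliced)
    return [u - gamma * s for s, u in zip(spliced, unspliced)]


# Toy data for one gene in four cells (made-up numbers).
spliced = [1.0, 2.0, 3.0, 4.0]
unspliced = [0.5, 1.5, 1.0, 2.0]

velocities = steady_state_velocity(spliced, unspliced)
for s, u, v in zip(spliced, unspliced, velocities):
    trend = "induction" if v > 0 else "repression"
    print(f"s={s:.1f} u={u:.1f} velocity={v:+.2f} ({trend})")
```

Cells above the fitted line carry more unspliced mRNA than expected at steady state (positive velocity, gene being induced); cells below it carry less (negative velocity, gene being repressed).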

**Installation:** `pip install scvelo`

**Key resources:**
- Documentation: https://scvelo.readthedocs.io/
- GitHub: https://github.com/theislab/scvelo
- Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759

## When to Use This Skill

Use scVelo when:

- **Trajectory inference from snapshot data**: Determine which direction cells are differentiating
- **Cell fate prediction**: Identify progenitor cells and their downstream fates
- **Driver gene identification**: Find genes whose dynamics best explain observed trajectories
- **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
- **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics
- **Complement to Scanpy**: Add directional information to UMAP embeddings

## Prerequisites

scVelo requires count matrices for both **unspliced** and **spliced** RNA. These are generated by:
1. **STARsolo** (`--soloFeatures Gene Velocyto`) or **kallisto|bustools** (`kb count --workflow lamanno`)
2. **velocyto** CLI: `velocyto run10x` / `velocyto run`
3. **alevin-fry** / **simpleaf** with spliced/unspliced output

Data is stored in an `AnnData` object with `layers["spliced"]` and `layers["unspliced"]`.
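
Before running anything expensive, it can help to confirm that both layers are present. The helper below is a hypothetical convenience function (not part of scVelo) that relies only on `adata.layers` behaving like a mapping, so it is shown here with a stand-in object instead of a real `AnnData`.

```python
from types import SimpleNamespace


def check_velocity_layers(adata, required=("spliced", "unspliced")):
    """Return the list of required layer names missing from adata.layers."""
    return [name for name in required if name not in adata.layers]


# Stand-in object for demonstration; a real AnnData exposes .layers the same way.
fake_adata = SimpleNamespace(layers={"spliced": [[1, 0], [2, 3]]})
missing = check_velocity_layers(fake_adata)
print("missing layers:", missing)
```

An empty list means both layers are available; anything else points back to the quantification step.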

## Standard RNA Velocity Workflow

### 1. Setup and Data Loading

```python
import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt

# Configure settings
scv.settings.verbosity = 3       # Show computation steps
scv.settings.presenter_view = True
scv.settings.set_figure_params('scvelo')

# Load data (AnnData with spliced/unspliced layers)
# Option A: Load from loom (velocyto output)
adata = scv.read("cellranger_output.loom", cache=True)

# Option B: Merge velocyto loom with Scanpy-processed AnnData
adata_processed = sc.read_h5ad("processed.h5ad")  # Has UMAP, clusters
adata_velocity = scv.read("velocyto.loom")
adata = scv.utils.merge(adata_processed, adata_velocity)

# Verify layers
print(adata)
# obs × var: N × G
# layers: 'spliced', 'unspliced' (required)
# obsm['X_umap'] (required for visualization)
```

### 2. Preprocessing

```python
# Filter and normalize (follows Scanpy conventions)
scv.pp.filter_and_normalize(
    adata,
    min_shared_counts=20,   # Minimum counts in spliced+unspliced
    n_top_genes=2000        # Top highly variable genes
)

# Compute first- and second-order moments (kNN-smoothed means and uncentered variances)
# The moments are averaged over the kNN graph, so build the graph first
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30)
scv.pp.moments(
    adata,
    n_pcs=30,
    n_neighbors=30
)
```
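
Conceptually, a first-order moment is a neighborhood average: each cell's value is replaced by the mean over itself and its nearest neighbors. The sketch below assumes a precomputed neighbor list and a single gene; `first_order_moments` is a hypothetical name used only for illustration, and the real implementation operates on sparse connectivity matrices.

```python
def first_order_moments(values, neighbors):
    """Average each cell's value with the values of its neighbors.

    values:    expression of one gene, one entry per cell
    neighbors: neighbors[i] lists the indices of cell i's neighbors (excluding i)
    """
    smoothed = []
    for i, neigh in enumerate(neighbors):
        group = [i] + list(neigh)  # the cell itself plus its neighbors
        smoothed.append(sum(values[j] for j in group) / len(group))
    return smoothed


# Toy example: four cells, two neighbors each (made-up numbers).
spliced = [0.0, 4.0, 2.0, 6.0]
neighbors = [[1, 2], [0, 2], [1, 3], [1, 2]]
print(first_order_moments(spliced, neighbors))  # [2.0, 2.0, 4.0, 4.0]
```

The smoothing removes much of the dropout noise in raw counts, which is why the neighbor count has such a visible effect on the resulting velocities.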

### 3. Velocity Estimation — Stochastic Model

The stochastic model is fast and suitable for exploratory analysis:

```python
# Stochastic velocity (faster, less accurate)
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

# Visualize
scv.pl.velocity_embedding_stream(
    adata,
    basis='umap',
    color='leiden',
    title="RNA Velocity (Stochastic)"
)
```

### 4. Velocity Estimation — Dynamical Model (Recommended)

The dynamical model fits the full splicing kinetics and is more accurate:

```python
# Recover dynamics (computationally intensive; ~10-30 min for 10K cells)
scv.tl.recover_dynamics(adata, n_jobs=4)

# Compute velocity from dynamical model
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```
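
The splicing kinetics referred to here are the rate equations du/dt = α − βu and ds/dt = βu − γs, with transcription rate α, splicing rate β, and degradation rate γ. The sketch below evaluates their closed-form solution for constant rates within a single transcriptional state. It is only an illustration of the model under those assumptions; the function names are hypothetical, and the fitting procedure scVelo uses to recover the rates and per-cell times is not shown.

```python
import math


def splicing_state(t, alpha, beta, gamma, u0=0.0, s0=0.0):
    """Closed-form (u, s) at time t for constant rates, starting from (u0, s0)."""
    if beta == gamma:
        raise ValueError("this closed form assumes beta != gamma")
    eb = math.exp(-beta * t)
    eg = math.exp(-gamma * t)
    u = u0 * eb + alpha / beta * (1.0 - eb)
    s = (s0 * eg
         + alpha / gamma * (1.0 - eg)
         + (alpha - beta * u0) / (gamma - beta) * (eg - eb))
    return u, s


def spliced_velocity(u, s, beta, gamma):
    """RNA velocity of the spliced species: ds/dt = beta * u - gamma * s."""
    return beta * u - gamma * s


# Induction from an empty state with made-up rates.
alpha, beta, gamma = 2.0, 1.0, 0.5
for t in (0.0, 1.0, 5.0, 50.0):
    u, s = splicing_state(t, alpha, beta, gamma)
    v = spliced_velocity(u, s, beta, gamma)
    print(f"t={t:5.1f}  u={u:.3f}  s={s:.3f}  ds/dt={v:.3f}")
```

During induction the velocity is positive and decays toward zero as the gene approaches its steady state (u = α/β, s = α/γ).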

### 5. Latent Time

The dynamical model enables computation of a shared latent time (pseudotime):

```python
# Compute latent time
scv.tl.latent_time(adata)

# Visualize latent time on UMAP
scv.pl.scatter(
    adata,
    color='latent_time',
    color_map='gnuplot',
    size=80,
    title='Latent time'
)

# Select the genes with the highest fit likelihood and plot them with cells ordered by latent time
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(
    adata,
    var_names=top_genes,
    sortby='latent_time',
    col_color='leiden',
    n_convolve=100
)
```

### 6. Driver Gene Analysis

```python
# Rank genes by cluster-specific differential velocity (t-test per cluster)
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))

# Speed and coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(
    adata,
    c=['velocity_length', 'velocity_confidence'],
    cmap='coolwarm',
    perc=[5, 95]
)

# Phase portraits for specific genes (example mouse gene symbols; replace with genes present in adata.var_names)
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'],
               ncols=3, figsize=(16, 4))
```
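
As a rough mental model of the two per-cell scores plotted above: speed is the length of a cell's velocity vector, and coherence measures how well that vector agrees in direction with the velocities of neighboring cells. The sketch below is a simplified illustration under those assumptions (for instance, it skips any centering of the vectors), and the function names are hypothetical, not scVelo's.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector has zero length."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def velocity_length(velocities):
    """Speed per cell: the norm of its velocity vector."""
    return [norm(v) for v in velocities]


def velocity_coherence(velocities, neighbors):
    """Mean cosine similarity with neighbors' velocities, clipped at zero."""
    scores = []
    for i, neigh in enumerate(neighbors):
        sims = [cosine(velocities[i], velocities[j]) for j in neigh]
        scores.append(max(0.0, sum(sims) / len(sims)) if sims else 0.0)
    return scores


# Toy example: three cells, two genes (made-up numbers).
velocities = [[1.0, 0.0], [2.0, 0.0], [0.0, -3.0]]
neighbors = [[1], [0], [0, 1]]
print(velocity_length(velocities))
print(velocity_coherence(velocities, neighbors))
```

Cells 0 and 1 point the same way and score high; cell 2 moves orthogonally to both of its neighbors and scores zero, which is the kind of region worth inspecting on the embedding.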

### 7. Velocity Arrows and Pseudotime

```python
# Arrow plot on UMAP
scv.pl.velocity_embedding(
    adata,
    arrow_length=3,
    arrow_size=2,
    color='leiden',
    basis='umap'
)

# Stream plot (cleaner visualization)
scv.pl.velocity_embedding_stream(
    adata,
    basis='umap',
    color='leiden',
    smooth=0.8,
    min_mass=4
)

# Velocity pseudotime (alternative to latent time)
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
```

### 8. PAGA Trajectory Graph

```python
# PAGA graph with velocity-informed transitions
scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')

# Plot PAGA with velocity
scv.pl.paga(
    adata,
    basis='umap',
    size=50,
    alpha=0.1,
    min_edge_width=2,
    node_size_scale=1.5
)
```

## Complete Workflow Script

```python
import scvelo as scv
import scanpy as sc

def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4):
    """
    Complete RNA velocity workflow.

    Args:
        adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm,
            and 'leiden' clusters in obs (used by rank_velocity_genes in dynamical mode)
        n_top_genes: Number of top HVGs for velocity
        mode: 'stochastic' (fast) or 'dynamical' (accurate)
        n_jobs: Parallel jobs for dynamical model

    Returns:
        Processed AnnData with velocity information
    """
    scv.settings.verbosity = 2

    # 1. Preprocessing
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)

    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=30)

    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

    # 2. Velocity estimation
    if mode == 'dynamical':
        scv.tl.recover_dynamics(adata, n_jobs=n_jobs)

    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)

    # 3. Downstream analyses
    if mode == 'dynamical':
        scv.tl.latent_time(adata)
        scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)

    scv.tl.velocity_confidence(adata)
    scv.tl.velocity_pseudotime(adata)

    return adata
```

## Key Output Fields in AnnData

After running the workflow, the following fields are added (the `fit_*` fields and `latent_time` only with the dynamical model):

| Location | Key | Description |
|----------|-----|-------------|
| `adata.layers` | `velocity` | RNA velocity per gene per cell |
| `adata.layers` | `fit_t` | Fitted latent time per gene per cell |
| `adata.obsm` | `velocity_umap` | 2D velocity vectors projected onto UMAP (added by `scv.tl.velocity_embedding` or the embedding plots) |
| `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity |
| `adata.obs` | `latent_time` | Latent time from dynamical model |
| `adata.obs` | `velocity_length` | Speed of each cell |
| `adata.obs` | `velocity_confidence` | Confidence score per cell |
| `adata.var` | `fit_likelihood` | Gene-level model fit quality |
| `adata.var` | `fit_alpha` | Transcription rate |
| `adata.var` | `fit_beta` | Splicing rate |
| `adata.var` | `fit_gamma` | Degradation rate |
| `adata.uns` | `velocity_graph` | Cell-cell graph of cosine similarities between velocities and cell-to-cell displacements; basis for transition probabilities |
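
To illustrate how a velocity graph relates to cell-cell transitions, the sketch below scores each neighbor by the cosine similarity between a cell's velocity and the displacement toward that neighbor, then converts the scores into probabilities with an exponential kernel. This is a simplified picture: the function names and the `scale` kernel width are assumptions made for illustration, and scVelo's own computation differs in its details.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def transition_probabilities(positions, velocities, neighbors, scale=10.0):
    """For each cell, map neighbor index -> probability of moving toward it."""
    probs = []
    for i, neigh in enumerate(neighbors):
        weights = {}
        for j in neigh:
            displacement = [xj - xi for xi, xj in zip(positions[i], positions[j])]
            # Neighbors lying in the direction of the velocity get large weights.
            weights[j] = math.exp(scale * cosine(velocities[i], displacement))
        total = sum(weights.values())
        probs.append({j: w / total for j, w in weights.items()})
    return probs


# Toy example: three cells on a line, all moving to the right (made-up numbers).
positions = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
velocities = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
neighbors = [[1], [0, 2], [1]]

for i, row in enumerate(transition_probabilities(positions, velocities, neighbors)):
    print(i, {j: round(p, 3) for j, p in row.items()})
```

The middle cell assigns almost all of its probability to the cell ahead of it, which is the behavior that stream plots and velocity pseudotime build on.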

## Velocity Models Comparison

| Model | Speed | Accuracy | When to Use |
|-------|-------|----------|-------------|
| `stochastic` | Fast | Moderate | Exploratory; large datasets |
| `deterministic` | Fast | Moderate | Steady-state model (as in velocyto); baseline comparison |
| `dynamical` | Slow | High | Publication-quality; identifies driver genes |

## Best Practices

- **Start with stochastic mode** for exploration; switch to dynamical for final analysis
- **Ensure adequate unspliced read coverage**: Short reads (< 100 bp) may capture too few intronic sequences
- **Aim for at least ~2,000 cells** (rule of thumb): Velocity estimates are noisy with fewer cells
- **Velocity should be coherent**: Arrows should follow known biology; random-looking directions indicate a problem
- **Neighborhood size matters** (`n_neighbors`): Too few neighbors → noisy velocity; too many → oversmoothed
- **Sanity check**: In root cells (progenitors), genes about to be upregulated should show high unspliced/spliced ratios
- **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes
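
Some of these checks can be automated before any velocity is computed. The sketch below is a hypothetical pre-flight helper (not part of scVelo) that takes total counts per layer and the cell count. The `min_cells` default mirrors the rule of thumb above, and the "more unspliced than spliced" test is only a heuristic that can misfire, for example on single-nucleus data. On a real `AnnData`, `scv.pl.proportions(adata)` gives a visual view of the spliced/unspliced proportions.

```python
def preflight(spliced_total, unspliced_total, n_cells, min_cells=2000):
    """Return a list of human-readable warnings (empty if nothing looks off)."""
    total = spliced_total + unspliced_total
    if total == 0:
        return ["no counts found in either layer"]

    warnings = []
    unspliced_fraction = unspliced_total / total
    if unspliced_fraction > 0.5:
        # Heuristic only: spliced counts usually dominate in whole-cell data.
        warnings.append("unspliced exceeds spliced counts; layers may be swapped")
    if n_cells < min_cells:
        warnings.append(f"only {n_cells} cells; velocity estimates may be noisy")
    return warnings


# Made-up totals for two hypothetical datasets.
print(preflight(spliced_total=900, unspliced_total=100, n_cells=5000))
print(preflight(spliced_total=100, unspliced_total=900, n_cells=500))
```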

## Troubleshooting

| Problem | Solution |
|---------|---------|
| Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` |
| Very few velocity genes | Lower `min_shared_counts`; check sequencing depth |
| Random-looking arrows | Try different `n_neighbors` or velocity model |
| Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` |
| Negative velocity everywhere | Check that spliced/unspliced layers are not swapped |

## Additional Resources

- **scVelo documentation**: https://scvelo.readthedocs.io/
- **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/
- **GitHub**: https://github.com/theislab/scvelo
- **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
- **velocyto** (preprocessing): http://velocyto.org/
- **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
- **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/

README.md

What This Does

scVelo is a widely used Python package for RNA velocity analysis of single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing: the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundance, compared with the gene's steady-state ratio, indicates whether a gene is being upregulated or downregulated in each cell. This allows developmental trajectories to be reconstructed and cell fate decisions to be identified without time-course data.

Installation: pip install scvelo

Key resources:

Documentation: https://scvelo.readthedocs.io/
GitHub: https://github.com/theislab/scvelo
Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759

Quick Start

Step 1: Create a Project Folder

mkdir -p ~/Projects/scvelo

Step 2: Download the Template

Click Download above, then:

mv ~/Downloads/CLAUDE.md ~/Projects/scvelo/

Step 3: Start Claude Code

cd ~/Projects/scvelo
claude
