# nf-core Pipeline Deployment
Run nf-core bioinformatics pipelines on local or public sequencing data.
**Target users:** Bench scientists and researchers without specialized bioinformatics training who need to run large-scale omics analyses—differential expression, variant calling, or chromatin accessibility analysis.
## Workflow Checklist
```
- [ ] Step 0: Acquire data (if from GEO/SRA)
- [ ] Step 1: Environment check (MUST pass)
- [ ] Step 2: Select pipeline (confirm with user)
- [ ] Step 3: Run test profile (MUST pass)
- [ ] Step 4: Create samplesheet
- [ ] Step 5: Configure & run (confirm genome with user)
- [ ] Step 6: Verify outputs
```
---
## Step 0: Acquire Data (GEO/SRA Only)
**Skip this step if user has local FASTQ files.**
For public datasets, fetch from GEO/SRA first.
**Quick start:**
```bash
# 1. Get study info
python scripts/sra_geo_fetch.py info GSE110004
# 2. Download (interactive mode)
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i
# 3. Generate samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 --fastq-dir ./fastq -o samplesheet.csv
```
**DECISION POINT:** After fetching study info, confirm with user:
- Which sample subset to download (if multiple data types)
- Suggested genome and pipeline
Then continue to Step 1.
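Before fetching, it can help to sanity-check that the identifier looks like a GEO or SRA accession. A minimal sketch (the regex is an assumption covering common GEO, SRA, and BioProject prefixes, not the validation `scripts/sra_geo_fetch.py` actually performs):

```python
import re

# Rough accession pattern: GEO series/datasets (GSE/GDS), SRA records
# (SRP/SRR/SRX/SRS), and BioProjects (PRJNA/PRJEB/PRJDB). Illustrative only.
ACCESSION_RE = re.compile(r"(GSE|GDS)\d+|SR[PRXS]\d+|PRJ(NA|EB|DB)\d+")

def looks_like_accession(acc: str) -> bool:
    """True if the string matches a common GEO/SRA accession shape."""
    return bool(ACCESSION_RE.fullmatch(acc))
```

Catching a malformed identifier up front avoids a confusing failure partway through a large download.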
---
## Step 1: Environment Check
**Run this first. Pipelines will fail if the environment check does not pass.**
```bash
python scripts/check_environment.py
```
All critical checks must pass. If any fail, provide fix instructions:
### Docker issues
| Problem | Fix |
|---------|-----|
| Not installed | Install from https://docs.docker.com/get-docker/ |
| Permission denied | `sudo usermod -aG docker $USER` then re-login |
| Daemon not running | `sudo systemctl start docker` |
### Nextflow issues
| Problem | Fix |
|---------|-----|
| Not installed | `curl -s https://get.nextflow.io \| bash && mv nextflow ~/bin/` |
| Version < 23.04 | `nextflow self-update` |
### Java issues
| Problem | Fix |
|---------|-----|
| Not installed / < 11 | `sudo apt install openjdk-11-jdk` |
**Do not proceed until all checks pass.** For HPC/Singularity, consult your system administrator.
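The bundled `check_environment.py` handles this step, but the core idea is simple: confirm each required tool is on `PATH` before anything else runs. A minimal sketch (tool names only; the real script also checks versions and whether the Docker daemon is running):

```python
import shutil

REQUIRED_TOOLS = ("docker", "nextflow", "java")

def check_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

missing = [tool for tool, found in check_tools().items() if not found]
if missing:
    print("Missing tools:", ", ".join(missing), "(see the fix tables above)")
```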
---
## Step 2: Select Pipeline
**DECISION POINT: Confirm with user before proceeding.**
| Data Type | Pipeline | Version | Goal |
|-----------|----------|---------|------|
| RNA-seq | `rnaseq` | 3.22.2 | Gene expression |
| WGS/WES | `sarek` | 3.7.1 | Variant calling |
| ATAC-seq | `atacseq` | 2.1.2 | Chromatin accessibility |
Auto-detect from data:
```bash
python scripts/detect_data_type.py /path/to/data
```
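`detect_data_type.py` does the real detection; as a rough illustration of the idea, filename keywords alone can suggest a pipeline. A sketch (the keyword lists are assumptions and far cruder than the script):

```python
def suggest_pipeline(filenames):
    """Guess an nf-core pipeline from filename keywords (illustrative only)."""
    joined = " ".join(name.lower() for name in filenames)
    if "atac" in joined:
        return "atacseq"
    if any(key in joined for key in ("tumor", "normal", "wgs", "wes")):
        return "sarek"
    return "rnaseq"  # default assumption: expression data
```

Whatever the detection suggests, confirm the choice with the user before proceeding (the decision point above).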
---
## Step 3: Run Test Profile
**Validates environment with small data. MUST pass before real data.**
```bash
nextflow run nf-core/<pipeline> -r <version> -profile test,docker --outdir test_output
```
| Pipeline | Command |
|----------|---------|
| rnaseq | `nextflow run nf-core/rnaseq -r 3.22.2 -profile test,docker --outdir test_rnaseq` |
| sarek | `nextflow run nf-core/sarek -r 3.7.1 -profile test,docker --outdir test_sarek` |
| atacseq | `nextflow run nf-core/atacseq -r 2.1.2 -profile test,docker --outdir test_atacseq` |
Verify:
```bash
ls test_output/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
```
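The same two checks can be scripted so a failed test run is caught immediately. A minimal sketch, assuming the output directory and log paths used above:

```python
from pathlib import Path

def run_completed(outdir="test_output", log=".nextflow.log"):
    """True if the MultiQC report exists and the Nextflow log records success."""
    report = Path(outdir) / "multiqc" / "multiqc_report.html"
    log_path = Path(log)
    return (report.is_file()
            and log_path.is_file()
            and "Pipeline completed successfully" in log_path.read_text())
```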
---
## Step 4: Create Samplesheet
### Generate automatically
```bash
python scripts/generate_samplesheet.py /path/to/data <pipeline> -o samplesheet.csv
```
The assistant will:
- Discover FASTQ/BAM/CRAM files
- Pair R1/R2 reads
- Infer sample metadata
- Validate before writing
**For sarek:** The assistant prompts for tumor/normal status if not auto-detected.
### Validate existing samplesheet
```bash
python scripts/generate_samplesheet.py --validate samplesheet.csv <pipeline>
```
### Samplesheet formats
**rnaseq:**
```csv
sample,fastq_1,fastq_2,strandedness
SAMPLE1,/abs/path/R1.fq.gz,/abs/path/R2.fq.gz,auto
```
**sarek:**
```csv
patient,sample,lane,fastq_1,fastq_2,status
patient1,tumor,L001,/abs/path/tumor_R1.fq.gz,/abs/path/tumor_R2.fq.gz,1
patient1,normal,L001,/abs/path/normal_R1.fq.gz,/abs/path/normal_R2.fq.gz,0
```
**atacseq:**
```csv
sample,fastq_1,fastq_2,replicate
CONTROL,/abs/path/ctrl_R1.fq.gz,/abs/path/ctrl_R2.fq.gz,1
```
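`generate_samplesheet.py` covers this step, but the R1/R2 pairing logic is worth seeing in miniature. A sketch for the rnaseq format, assuming files follow an `_R1`/`_R2` naming convention (the real script handles more layouts and validates paths before writing):

```python
import csv
import re
from pathlib import Path

def rnaseq_rows(fastq_dir):
    """Pair *_R1*/*_R2* FASTQ files and build rnaseq samplesheet rows."""
    rows = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1*")):
        r2 = r1.with_name(r1.name.replace("_R1", "_R2", 1))
        sample = re.split(r"_R1", r1.name, maxsplit=1)[0]
        rows.append({
            "sample": sample,
            "fastq_1": str(r1.resolve()),
            "fastq_2": str(r2.resolve()) if r2.exists() else "",  # blank = single-end
            "strandedness": "auto",
        })
    return rows

def write_samplesheet(rows, out_csv="samplesheet.csv"):
    """Write rows in the rnaseq samplesheet format shown above."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["sample", "fastq_1", "fastq_2", "strandedness"])
        writer.writeheader()
        writer.writerows(rows)
```

Note the absolute paths from `resolve()`: nf-core pipelines are happiest with absolute paths in samplesheets, as the format examples above show.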
---
## Step 5: Configure & Run
### 5a. Check genome availability
```bash
python scripts/manage_genomes.py check <genome>
# If not installed:
python scripts/manage_genomes.py download <genome>
```
Common genomes: GRCh38 (human), GRCh37 (legacy), GRCm39 (mouse), R64-1-1 (yeast), BDGP6 (fly)
### 5b. Decision points
**DECISION POINT: Confirm with user:**
1. **Genome:** Which reference to use
2. **Pipeline-specific options:**
- **rnaseq:** aligner (star_salmon recommended, hisat2 for low memory)
- **sarek:** tools (haplotypecaller for germline, mutect2 for somatic)
- **atacseq:** read_length (50, 75, 100, or 150)
### 5c. Run pipeline
```bash
nextflow run nf-core/<pipeline> \
-r <version> \
-profile docker \
--input samplesheet.csv \
--outdir results \
--genome <genome> \
-resume
```
**Key flags:**
- `-r`: Pin version
- `-profile docker`: Use Docker (or `singularity` for HPC)
- `--genome`: iGenomes key
- `-resume`: Continue from checkpoint
**Resource limits (if needed):**
```bash
--max_cpus 8 --max_memory '32.GB' --max_time '24.h'
```
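For reproducibility, the full command can be assembled programmatically and logged before launch. A sketch that builds the Step 5c invocation as an argv list (the flag set mirrors the command above; pipeline-specific options such as `--aligner` or `--tools` would be passed via `extra`):

```python
def nextflow_cmd(pipeline, version, samplesheet, outdir, genome,
                 profile="docker", extra=()):
    """Build the nextflow invocation as an argv list for subprocess.run."""
    return ["nextflow", "run", f"nf-core/{pipeline}",
            "-r", version,
            "-profile", profile,
            "--input", samplesheet,
            "--outdir", outdir,
            "--genome", genome,
            "-resume", *extra]
```

For example, `subprocess.run(nextflow_cmd("rnaseq", "3.22.2", "samplesheet.csv", "results", "GRCh38"))` launches the rnaseq run shown above.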
---
## Step 6: Verify Outputs
### Check completion
```bash
ls results/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
```
### Key outputs by pipeline
**rnaseq:**
- `results/star_salmon/salmon.merged.gene_counts.tsv` - Gene counts
- `results/star_salmon/salmon.merged.gene_tpm.tsv` - TPM values
**sarek:**
- `results/variant_calling/*/` - VCF files
- `results/preprocessing/recalibrated/` - BAM files
**atacseq:**
- `results/macs2/narrowPeak/` - Peak calls
- `results/bwa/mergedLibrary/bigwig/` - Coverage tracks
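As a quick sanity check on rnaseq output, the merged gene counts table can be loaded with the standard library alone. A sketch, assuming a tab-separated layout with `gene_id` and `gene_name` columns followed by one column per sample (check the header of your file; column names can vary between pipeline versions):

```python
import csv

def load_gene_counts(path):
    """Read a salmon.merged.gene_counts.tsv-style table into
    {gene_id: {sample: count}} using only the standard library."""
    meta_cols = {"gene_id", "gene_name"}
    counts = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row["gene_id"]] = {col: float(val)
                                      for col, val in row.items()
                                      if col not in meta_cols}
    return counts
```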
---
## Quick Reference
### Resume failed run
Repeat the original command unchanged (same samplesheet, genome, and flags) with `-resume`; Nextflow reuses cached results up to the point of failure:
```bash
nextflow run nf-core/<pipeline> -r <version> -profile docker \
  --input samplesheet.csv --outdir results --genome <genome> -resume
```
---
## Disclaimer
This playbook is provided as a prototype example demonstrating how to integrate nf-core bioinformatics pipelines into Claude Code for automated analysis workflows. The current implementation supports three pipelines (rnaseq, sarek, and atacseq), serving as a foundation that enables the community to expand support to the full set of nf-core pipelines.
It is intended for educational and research purposes and should not be considered production-ready without appropriate validation for your specific use case. Users are responsible for ensuring their computing environment meets pipeline requirements and for verifying analysis results.
Anthropic does not guarantee the accuracy of bioinformatics outputs, and users should follow standard practices for validating computational analyses. This integration is not officially endorsed by or affiliated with the nf-core community.
## Attribution
When publishing results, cite the appropriate pipeline. Citations are available in each nf-core repository's CITATIONS.md file (e.g., https://github.com/nf-core/rnaseq/blob/3.22.2/CITATIONS.md).
## Licenses
- **nf-core pipelines:** MIT License (https://nf-co.re/about)
- **Nextflow:** Apache License, Version 2.0 (https://www.nextflow.io/about-us.html)
- **NCBI SRA Toolkit:** Public Domain (https://github.com/ncbi/sra-tools/blob/master/LICENSE)
---
## What This Does
This playbook helps bench scientists and researchers run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data without specialized bioinformatics training. It handles everything from acquiring public datasets from GEO/SRA, through environment setup, samplesheet creation, pipeline execution, and output verification, giving you a structured, step-by-step workflow for differential expression, variant calling, or chromatin accessibility analysis.
## Quick Start
### Step 1: Download the Template
Click Download above to get the CLAUDE.md file.
### Step 2: Set Up Your Project
Create a project folder and place the template inside:
```
my-bioinformatics-project/
├── CLAUDE.md
└── fastq/    # Your FASTQ files (or will be downloaded)
```
### Step 3: Start Working
```bash
claude
```
Say: "I have RNA-seq FASTQ files in the fastq/ directory. Help me run the nf-core rnaseq pipeline with the GRCh38 genome."
## Workflow Overview
The playbook follows a structured 7-step checklist:
0. **Acquire Data:** Download from GEO/SRA if using public datasets (skip for local files)
1. **Environment Check:** Verify Docker, Nextflow, and Java are properly installed
2. **Select Pipeline:** Choose rnaseq, sarek, or atacseq based on your data type
3. **Run Test Profile:** Validate your environment with a small test dataset
4. **Create Samplesheet:** Auto-generate or validate your sample metadata CSV
5. **Configure & Run:** Set genome, aligner, and resource limits, then execute
6. **Verify Outputs:** Check for successful completion and locate key result files
## Supported Pipelines
| Data Type | Pipeline | Version | Goal |
|-----------|----------|---------|------|
| RNA-seq | `rnaseq` | 3.22.2 | Gene expression |
| WGS/WES | `sarek` | 3.7.1 | Variant calling |
| ATAC-seq | `atacseq` | 2.1.2 | Chromatin accessibility |
## Tips
- Always run the test profile before processing real data to catch environment issues early
- Use `-resume` to continue from checkpoints if a run fails partway through
- For HPC environments, swap `-profile docker` for `-profile singularity`
- Set resource limits (`--max_cpus`, `--max_memory`, `--max_time`) to match your machine
## Key Outputs
- **RNA-seq:** Gene count matrices and TPM values in `results/star_salmon/`
- **WGS/WES (sarek):** VCF variant files in `results/variant_calling/`
- **ATAC-seq:** Peak calls in `results/macs2/narrowPeak/` and coverage tracks in `results/bwa/mergedLibrary/bigwig/`

All pipelines produce a MultiQC report at `results/multiqc/multiqc_report.html`.
## Example Prompts
- "I have RNA-seq FASTQ files in ./data/. Run the nf-core rnaseq pipeline with GRCh38."
- "Download GSE110004 from GEO and run differential expression analysis."
- "Check my environment and run the sarek test profile for variant calling."
- "Create a samplesheet from FASTQ files in ./fastq/ for the atacseq pipeline."
- "Resume my failed rnaseq pipeline run."