class: center, middle, inverse, title-slide

.title[
# Nextflow and nf-core Pipelines
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-03-02
]

---

## What is Nextflow?

Nextflow is a workflow engine that simplifies writing complex pipelines.

* Separates the analysis logic (the code) from the execution environment (HPC, Cloud, or Local).

* Native support for **Docker**, **Singularity**, and **Conda**.

* If a pipeline crashes, rerun it with the `-resume` flag: completed tasks are read from the cache, and only the unfinished steps are executed.
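
The separation between analysis logic and execution environment is visible on the command line: the pipeline name stays the same and only the `-profile` value changes. A minimal sketch, assuming the `nf-core/rnaseq` pipeline and its bundled `test` profile (a small built-in dataset); the output directory names are hypothetical. The loop only collects and prints the commands so they can be inspected before anything is launched.

```bash
# Same pipeline, three software environments: only -profile differs.
cmds=()
for engine in docker singularity conda; do
    cmds+=("nextflow run nf-core/rnaseq -profile test,${engine} --outdir results_${engine}")
done
printf '%s\n' "${cmds[@]}"   # one command per line; copy the one that fits your system
```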

---
## The nf-core Community

You rarely need to write a pipeline from scratch. **nf-core** is a community effort to collect a curated set of high-quality, peer-reviewed Nextflow pipelines.

**nf-core standards include:**

* Continuous Integration (CI) testing.
* Mandatory documentation.
* Use of containers (Docker/Singularity).
* Standardized output (MultiQC reports).

.small[ Browse 144 pipelines at: https://nf-co.re/pipelines ]

<!--
## Popular Pipelines: At a Glance

| Pipeline | Application | Primary Tools |
| --- | --- | --- |
| **rnaseq** | Bulk RNA-seq | STAR, Salmon, DESeq2 |
| **scrnaseq** | Single-cell RNA-seq | Alevin-fry, CellRanger, Starsolo |
| **sarek** | Variant Calling | GATK4, FreeBayes, Strelka2 |
| **chipseq** | ChIP-seq / ATAC-seq | BWA, MACS2, DeepTools |
| **mag** | Metagenomics | MEGAHIT, SPAdes, QUAST |
-->

---
## nf-core/rnaseq

The most used pipeline for quantifying gene expression. It handles everything from raw reads to a gene count matrix.

**Key Features:**

* Supports both **Genomic** (STAR) and **Transcriptomic** (Salmon/Kallisto) mapping.

* Extensive QC: FastQC, RSeQC, Qualimap, and dupRadar.

* Automatic generation of a **MultiQC** summary report.

.small[https://nf-co.re/rnaseq ]

---
## The Input: Samplesheet (RNA-seq)

Instead of using complex file patterns in the command line, nf-core uses a **CSV samplesheet**. This ensures that metadata (like strandedness) is explicitly defined for every file.

```csv
sample,fastq_1,fastq_2,strandedness
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,auto
TREATED_REP1,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,auto
```

* **sample:** Unique identifier for the replicate.
* **fastq_1 / 2:** Full paths to the raw sequencing files.
* **strandedness:** Can be `forward`, `reverse`, `unstranded`, or `auto` (the pipeline infers it from the data, as in the example above).

.small[  If your data is single-end, simply leave the `fastq_2` column empty. ]
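
A samplesheet like the one above can be assembled with a short shell function instead of being typed by hand. This is a minimal sketch, assuming files that follow the Illumina `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming used in the example; the function name `make_samplesheet` and the choice of the file prefix as the sample name are hypothetical and chosen for illustration only.

```bash
# Write an nf-core/rnaseq-style samplesheet for the FASTQ files in a directory.
# Usage: make_samplesheet <fastq_dir> <output_csv>
make_samplesheet() {
    local fastq_dir out r1 r2 sample
    fastq_dir="$(cd "$1" && pwd)" || return 1    # absolute path to the directory
    out="$2"
    echo "sample,fastq_1,fastq_2,strandedness" > "$out"
    for r1 in "$fastq_dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                  # no matching files: header only
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"   # expected mate file
        sample="$(basename "$r1" _R1_001.fastq.gz)"   # file prefix as sample name
        [ -e "$r2" ] || r2=""                     # single-end: leave fastq_2 empty
        echo "${sample},${r1},${r2},auto" >> "$out"
    done
}
```

Calling `make_samplesheet ./fastq samplesheet.csv` writes one row per R1 file. The generated sample names (e.g., `AEG588A1_S1_L002`) usually still need to be edited into meaningful labels such as `CONTROL_REP1`.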

---
## nf-core/scrnaseq

Designed for single-cell transcriptomics. It is highly flexible to accommodate various technologies (10x Genomics, Smart-seq, etc.).

**Available Aligners/Quantifiers:**

* **Alevin-fry:** Extremely fast and memory-efficient.
* **STARsolo:** Standard for 10x data.
* **Kallisto-bustools:** Fast pseudo-alignment.
* **Cell Ranger:** Standard 10x processing engine.

It produces standardized objects (SingleCellExperiment or AnnData) ready for downstream analysis in R or Python.

.small[ https://nf-co.re/scrnaseq ]

---
## nf-core/chipseq

Used for analyzing **ChIP-seq** data to identify DNA-protein binding sites. Regions of open chromatin (**ATAC-seq**) are handled by the sister pipeline `nf-core/atacseq`.

* Uses **BWA** to align reads to the reference genome.
* Utilizes **MACS2** to identify enriched regions (peaks).
* Automatically creates a set of common peaks across replicates for differential analysis.

.small[ **Note:** For ChIP-seq, the samplesheet must also define the "control" (Input) sample for each treatment replicate to allow for accurate background subtraction.

If control is not available, use the `atacseq` pipeline.
]

.small[ https://nf-co.re/chipseq ]

---
## nf-core/sarek

A comprehensive pipeline for **Germline** or **Somatic** variant calling (mapping, QC, and calling).

* **Input:** FASTQ or BAM files.
* **Scope:** Supports Whole Genome (WGS) and Whole Exome (WES) sequencing.
* **Annotation:** Automatically annotates variants using VEP or SnpEff.

.small[ https://nf-co.re/sarek ]

---
## Basic Execution Syntax

To run an nf-core pipeline, you need the pipeline name, a samplesheet, an output directory, and (for most pipelines) a reference genome.

```bash
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir ./results \
    --genome GRCh38 \
    -profile docker
```

* `-profile`: Tells Nextflow whether to use `docker`, `singularity`, or `conda`.
* `-resume`: Use this to restart a failed or modified run without re-calculating finished steps.
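
Resuming uses the same command as the original run with `-resume` appended. A sketch of that pattern, reusing the parameters from the example above; the command is held in an array and printed, and the final line is left commented out so nothing launches by accident.

```bash
# Original command, stored once so the resumed run is guaranteed to match it.
run_cmd=(nextflow run nf-core/rnaseq
         --input samplesheet.csv
         --outdir ./results
         --genome GRCh38
         -profile docker)

# Single-dash options (-profile, -resume) belong to Nextflow itself;
# double-dash options (--input, --outdir, --genome) are pipeline parameters.
resume_cmd=("${run_cmd[@]}" -resume)
echo "${resume_cmd[*]}"
# "${resume_cmd[@]}"   # uncomment to launch the resumed run
```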

<!--
## Handling Configuration (Profiles)

On an HPC like Athena, you shouldn't run heavy computations on the login node. Nextflow uses **profiles** to manage where jobs run.

```nextflow
// Example my_hpc.config
process {
    executor = 'slurm'
    queue    = 'high_mem'
}
```

**Running with a specific config:**

```bash
nextflow run nf-core/rnaseq -c my_hpc.config -profile singularity
```

## Monitoring: Nextflow Tower

For large-scale projects, you can monitor your pipelines in real-time using a web interface.

* **Visualization:** See which processes are running, pending, or failed.
* **Resource Logging:** Track CPU and RAM usage for every single task.
* **Collaboration:** Share run results and logs with teammates easily.

.small[ Access the open-source version at: https://seqera.io/tower/ ]
-->

---
## Best Practices

* **Use a specific version:** Never run the "latest" in production. Use `-r` (e.g., `-r 3.12.0`).

* **Clean up:** Nextflow creates a `work/` directory that can grow to terabytes. Once the pipeline completes successfully, run `nextflow clean` or delete the directory; after that, `-resume` can no longer reuse the cached results.

* **Check MultiQC:** Always review the MultiQC report first to catch batch effects or library prep issues.
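
The clean-up step can be previewed before anything is deleted. A minimal sketch, assuming the default `work/` location (or the `NXF_WORK` environment variable, if it is set) and that the commands are run from the directory where the pipeline was launched.

```bash
workdir="${NXF_WORK:-work}"

# How much space do the intermediate files take?
if [ -d "$workdir" ]; then
    du -sh "$workdir"
fi

# List what would be removed for the most recent run (-n = dry run, nothing is deleted).
if command -v nextflow >/dev/null 2>&1; then
    nextflow clean -n || echo "nextflow clean could not run in this directory"
fi

# Once the preview looks right, delete for real:
#   nextflow clean -f
```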

- nf-core Slack, https://nf-co.re/join
- Nextflow Documentation, https://www.nextflow.io/docs/latest/index.html