Genomic data formats

class: center, middle, inverse, title-slide

.title[
# Genomic data formats
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-02-09
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## The reference genome as a coordinate system

The reference genome serves as a universal **linear map**, allowing us to pinpoint any biological feature by its exact address. Just as a GPS uses latitude and longitude, bioinformatics uses **Genomic Coordinates** to define specific locations.

A genomic region is uniquely identified by three primary variables:

1. **Chromosome:** The specific scaffold or linkage group (e.g., *chr17*).
2. **Start Position:** The exact nucleotide where the feature begins.
3. **End Position:** The exact nucleotide where the feature concludes.

---
## The reference genome as a coordinate system

By combining these into a **Region ID** (e.g., `chr17:43044295-43125483`), we create a standardized language that allows different tools and researchers to reference the same sequence regardless of the file format used.

---
## Binary and "plain text" files

**"Plain text" ("flat") file formats** - Information often structured into lines and columns

* Human-readable, processable with simple command-line tools
* Space is often saved through archiving (e.g., zip) and human-readable information coding (e.g., flags)

**Binary file formats** - Not human-readable (not intended for direct reading)

* Require special software for processing (programs intended for plain text processing, such as `wc` or `grep`, will not work properly)
* Efficient storage with significant reduction in file size compared to plain text counterparts (e.g., 75% space saved)

---
## Some common file formats

* **FASTA** - Simple collections of named DNA/protein sequences (text)
* **FASTQ** - Extension of the FASTA format; contains additional quality information. Widely used for storing unaligned sequencing reads (text)
* **SAM/BAM** - Alignments of sequencing reads to a reference genome (text/binary)
* **BED** - Region-based genome annotation information (e.g., a list of genes and their genomic locations). Used for visualization or simple enumeration (text)
* **GFF/GTF** - Gene-centric annotations (text)
* **(big)WIG** - Continuous signal representation (text, binary)
* **VCF** - Variant Call Format, used to store information about genomic variants (text)

---
## FASTA format

* Developed in the 1980s for FASTP/FASTA software packages for sequence similarity searching

* Convenient format for storing and transferring simple sequence collections

* Intended for both nucleotide and protein sequences (FAST-A stands for "Fast-All")

* Common output format of sequence databases and a standard input format for various tools

---
## FASTA format

Each sequence entry consists of:

* *A single line* with sequence metadata, starting with the "greater-than" symbol (">")
* *Any number of lines* representing the actual sequence

.small[
```
>SRR493818.1.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:960:20843 length=54
NGTAAATCCCGTTACCTCTCTGAGCCTCGGTNNCCNNNNNTGTAAAAAGGNNNN
>SRR493818.2.1 HWUSI-EAS519_0001_FC61TCNAAXX:2:1:961:7550 length=54
NACACTACAATGTAAAAGCTTGGCCTAACTTNNTTNNNNNGGCTGTTATTNNNN
```
]

---
## FASTA format

The nucleic acid codes that can be found in a FASTA file:

```
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          - gap of indeterminate length
```

.small[ http://zhanglab.ccmb.med.umich.edu/FASTA/ ]

---
## The Human Genome if FASTA format

The human reference genome is maintained by the **Genome Reference Consortium (GRC)**, but most bioinformaticians download "flavored" versions from one of three major resource hubs. Choosing a provider determines your chromosome naming convention (`chr1` vs `1`) and available metadata.

**1. UCSC Genome Browser**

UCSC provides "full" genome builds and is the favorite for visualization and simple command-line retrieval.

* **Naming:** Uses the `chr` prefix (e.g., `chr1`, `chrM`).
* **Best for:** Quick downloads of specific chromosomes or integrated tracks (RepeatMasker, conservation scores).
* **Format:** Typically provides `2bit` (binary) and standard FASTA.

.small[
UCSC Downloads: [hgdownload.soe.ucsc.edu/downloads.html](https://hgdownload.soe.ucsc.edu/downloads.html)
]

---
## The Human Genome if FASTA format

**2. Ensembl (EBI)**

Ensembl focuses on high-quality annotation and is the standard for European research and many automated pipelines.

* **Naming:** Uses "bare" numbers (e.g., `1`, `MT`).
* **Best for:** Downloading genomes bundled with consistent GTF/GFF gene sets.
* **Versions:** Provides "Top-level" (all sequences) and "Primary Assembly" (no haplotypes—usually best for mapping).

.small[
Ensembl FTP: [ensembl.org/info/data/ftp/index.html](https://www.ensembl.org/info/data/ftp/index.html)
]

---
## The Human Genome if FASTA format

**3. NCBI (GenBank/RefSeq)**

The official repository for the GRC assemblies.

* **Naming:** Often uses Accession numbers (e.g., `NC_000001.11`) or bare numbers.
* **Best for:** Finding the most authoritative and "raw" version of the assembly (GRCh38.p14).

**Best practices:** If you map your reads to a UCSC FASTA, you *must* use a UCSC-compatible annotation (GTF). Mixing a UCSC genome (`chr1`) with an Ensembl annotation (`1`) is the most common cause of "zero reads mapped" errors.

---
## FASTA format

* Human genome data is available here: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/
* Get chromosome Y sequence:

.small[
```bash
$URL=[http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz$](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chrY.fa.gz$) wget $URL; mv chrY.fa.gz hg38.chrY.fa.gz
# or
$ curl $URL > hg38.chrY.fa.gz
# Ungzip
$ gzip -d hg38.chrY.fa.gz
# Peek into the file
$head hg38.chrY.fa$ tail hg38.chrY.fa
# How many lines
$ wc -l hg38.chrY.fa
# Look somewhere in the middle
$ head -n 10000 hg38.chrY.fa | tail
```
]

---
## Orienting in a FASTA file

* Do not run:

```bash
$ # grep > hg38.chrY.fa
```

* What will this command do?

```bash
$ grep ">" hg38.chrY.fa
```

* What will this command do?

```bash
$ grep -v ">" hg38.chrY.fa
```

---
## chrY summary statistics

* How long is chrY?

```bash
$ grep -v ">" hg38.chrY.fa | grep -o "[ATCGatcg]" | wc -l
26415043
```

* How many adenosines are there?

```bash
$ grep -v ">" hg38.chrY.fa | grep -o -i "A" | wc -l
7886192
```

---
## Low-complexity regions

* Approximately 45% of the human genome is recognized as being derived from transposable elements

* The majority are non-long terminal repeat (LTR) retrotransposons, such as LINE-1 (L1), Alu, and SVA elements

* Other low-complexity biases: CG- and AT-rich DNA

* Represented by lower-case letters

```
GATCAAAGTGTCATACAGTAACAGCCCAGACAGACGATAGGTATGGCAGa
aaagaaaaaaactaaaaaaaaaaaaaaaaaaaaaaaTCGCATGGGAAGTT
TCCCCGCCTCCTCTTTGGCCATTCTGTGCCCGGAGATCAAAGTTCTCATT
```

---
## The conversion of sequencing signal to a nucleotide sequence

Nearly all sequencing technologies produce sequencing reads in **FASTQ** format.

Four lines: **Sequence ID** **Sequence** **Separator** **Quality scores** ]

`@SEQ_ID`  
`GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT`  
`+`  
`!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65`

---
## Sequence ID

---
## FASTQ quality scores (Phred scores)

**PHRED Base quality (Q)** – integer value derived from the estimated probability (P) of the corresponding base being determined incorrectly. (rounded to the nearest integer)

* The *Ph* in Phred comes from Phil Green, the inventor of the encoding.

* P might be computed in different ways depending on the sequencing platform, but the meaning of Q remains the same.

.small[ Peter J. A. Cock et al., "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants" Nucleic Acids Res. 2010 https://doi.org/10.1093/nar/gkp1137   
http://www.gs.washington.edu/faculty/green.htm ]

---
## FASTQ quality scores (Phred scores)

**PHRED Base quality (Q)** – integer value derived from the estimated probability (P) of the corresponding base being determined incorrectly.

* A higher quality score is better (>=20 is considered "good").

---
## ASCII table

**ASCII** (American Standard Code for Information Interchange) is a character-encoding standard that assigns a unique numerical value to letters, digits, and symbols.
<img src="img/ascii-chars-table-landscape.png" alt="" width="800px" style="display: block; margin: auto;" />
.small[ http://www.asciicharstable.com/_site_media/ascii/ascii-chars-table-landscape.pdf ]

---
## ASCII table

In sequencing, it is used to represent **Phred quality scores** as single characters (like `!`, `@`, or `h`), allowing the quality of each base to be stored efficiently in a one-to-one mapping within a single line of a **FASTQ** file.
<img src="img/ascii-chars-table-landscape.png" alt="" width="800px" style="display: block; margin: auto;" />
.small[ http://www.asciicharstable.com/_site_media/ascii/ascii-chars-table-landscape.pdf ]

---
## Older FASTQ encoding schemes

Older FASTQ versions used different encoding schemes for PHRED quality scores.

* Phred + 33 is the current standard.

---
## Standard Sequencing Data Processing

Nearly all high-throughput sequencing pipelines follow a universal workflow to move from raw FASTQ output to biological insights.

**Step 1: Quality Control (QC)**

Before processing, we must assess the raw "unprocessed" data.

* **Goal:** Identify sequencing errors, adapter contamination, or poor-quality cycles.
* **Key Tool:** **FastQC** is the industry standard. It provides reports on base quality scores, GC content, and overrepresented sequences.
* **Output:** A report (HTML) highlighting potential issues that require troubleshooting.

---
## Standard Sequencing Data Processing

**Step 2: Adapter Trimming & Cleaning**

Raw reads often contain synthetic sequences (adapters) used to attach DNA to the sequencer flow cell.

* **Goal:** Remove adapters and low-quality bases from the ends of reads (e.g., Phred score < 20).
* **Key Tools:** **Cutadapt**, **Trimmomatic**, or **fastp**.
* **Impact:** Prevents "false mismatches" during alignment caused by non-genomic sequences.

---
## Standard Sequencing Data Processing

**Step 3: Genomic Alignment (Mapping)**

The cleaned reads are aligned to a reference genome to determine their origin.

* **Goal:** Find the most likely genomic coordinate for every read.
* **Key Tools:** **BWA-MEM** or **Bowtie2** (for DNA-Seq/ChIP-Seq).
* **STAR** or **HISAT2** (for RNA-Seq, accounting for "split" reads across introns).

* **Output:** A **SAM** or **BAM** file.

---
## FASTQC - quality control

.pull-left[
* Java program with HTML output

* Essential for assessing raw data quality

* Helps identify adapter contamination and base quality drops
]
.pull-right[
<img src="img/fastqc.png" alt="" width="600px" style="display: block; margin: auto;" />
]

.small[
* http://www.bioinformatics.babraham.ac.uk/projects/fastqc/   
* Video tutorial: https://www.youtube.com/watch?v=bz93ReOv87Y 11m
]

<!--
fastx toolkit: manipulating FASTQ and FASTA files

* The **FASTX-Toolkit** is a collection of command-line utilities designed for the preprocessing and manipulation of short-read FASTA and FASTQ files. 
* It allows researchers to perform essential tasks such as quality filtering, trimming, adapter clipping, and format conversion to improve downstream mapping and alignment results.

.small[ http://hannonlab.cshl.edu/fastx_toolkit/ ]

fastx toolkit: usage examples

```bash
fastx_quality_stats -h
fastx_quality_stats -Q 0 -i file.fastq -o fastx_report.txt
head fastx_report.txt
fastx_trimmer -h
fastx_trimmer -f 1 -l 100 -i file.fastq -o file_trimmed.fastq
```
-->

---
## The Sequence Alignment/Map format (SAM)

* Intended for storing read alignments against reference sequences.
* Features a binary version with excellent software support (BAM format).

The SAM format consists of two sections:

**Header section**

* Describes the data source, reference sequence, alignment method, etc.

**Alignment section**

* Describes the read, its quality, and the nature of its alignment to a specific genomic region.

.small[ SAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf ]

---
## BAM File Format: Sorting and Indexing

To enable efficient data retrieval, BAM files undergo a two-step optimization process: **Sorting** and **Indexing**.

**1. Sorting (Required first)**

Before a BAM file can be indexed, it must be sorted. The choice of sorting method depends on the downstream analysis:

* **Coordinate Sort:** Alignments are ordered by their position on the chromosome (e.g., chr1, chr2). This is the standard for most tools like variant callers and genome browsers.
* **Name Sort:** Alignments are ordered by the read ID. This is typically used when you need to keep paired-end reads together (e.g., for certain alignment or quantification tools).

---
## BAM File Format: Sorting and Indexing

To enable efficient data retrieval, BAM files undergo a two-step optimization process: **Sorting** and **Indexing**.

**2. Indexing (`.bai`)**

Once sorted, an index file is generated (usually with a `.bai` extension) which acts as a "Table of Contents" for the binary data.

* Allows software to **jump directly** to a specific genomic region.
* Enables rapid visualization in tools like IGV without loading the entire multi-gigabyte file into memory.

---
## SAM example

---
## SAM/BAM header section

* Describes data source, reference sequence, and programs used.
* Each section begins with ‘@’ followed by a two-letter record type code and various tags.

.small[
```
@HD The header line
  VN: format version
  SO: Sorting order of alignments
@SQ Reference sequence dictionary
  SN: reference sequence name
  LN: reference sequence length
@RG Read group
  ID: read group identifier
  SM: sample name
@PG Program
  PN: program name
  VN: program version
```
]

---
## SAM/BAM alignment section

.small[ https://samtools.github.io/hts-specs/SAMv1.pdf ]

---
## Samtools: The Swiss Army Knife of Alignments

**samtools** is the industry-standard suite of programs for interacting with high-throughput sequencing data. It provides the essential interface between human-readable **SAM** files and machine-efficient **BAM/CRAM** files.

.pull-left[

* **Convert:** Seamlessly switch between SAM, BAM, and CRAM formats.
* **Manipulate:** Sort, merge, and index alignment files.
* **Filter:** Extract specific reads based on genomic coordinates or bitwise flags.
* **Analyze:** Generate depth statistics and identify variants (mpileup).
]

.pull-right[

* **Efficiency:** Built in C for high-speed processing of massive datasets.
* **Standardization:** Originally developed by Heng Li at the Sanger Institute; it defines how the community handles alignment data.
* **Integration:** Acts as the foundation for almost all bioinformatics pipelines.
]

.small[
Documentation: [https://www.htslib.org/doc/samtools.html](https://www.htslib.org/doc/samtools.html)
]

---
## SAMTOOLS Commands: Editing & Viewing

**Indexing**
- `index` - index alignment
- `faidx` - index/extract FASTA
- `dict` - create a sequence dictionary file

**Editing**
- `fixmate` - fix mate information
- `reheader` - replace BAM header
- `rmdup` - remove PCR duplicates

**Viewing**
- `flags` - explain BAM flags
- `tview` - text alignment viewer
- `view` - SAM<->BAM<->CRAM conversion

---
## SAMTOOLS Commands: Files & Stats

**File operations**
- `cat` - concatenate BAMs
- `merge` - merge sorted alignments
- `mpileup` - multi-way pileup
- `sort` - sort alignment file
- `fastq/fasta` - convert BAM back to raw formats
    
**Statistics**
- `bedcov` - read depth per BED region
- `flagstat` - simple summary stats
- `idxstats` - BAM index stats
- `stats` - comprehensive stats

---
## Samtools Cheatsheet: Essential Commands

| Category | Command | Description |
| --- | --- | --- |
| **Viewing** | `view` | Convert between binary (BAM) and text (SAM) |
| **Ordering** | `sort` | Sort alignments by coordinate or read name |
| **Access** | `index` | Create `.bai` index for random access |
| **Subsetting** | `view [region]` | Extract reads from a specific genomic window |
| **Summary** | `flagstat` | Provide a simple count of alignment types |
| **Metrics** | `stats` | Generate comprehensive alignment statistics |
| **Coverage** | `depth` | Compute the sequencing depth for every base |
| **Sequencing** | `fastq` / `fasta` | Convert BAM back into raw sequence formats |

.small[ https://github.com/samtools/samtools ]

---
## Using SAM flags to filter subsets of reads

* 12 bitwise flags describe the alignment.

* "Bitwise" means values are powers of two (representing a single bit in a binary number).

* All combinations can be represented as a number from 1 to 2048 ().

* You can specify "required" (`-f`) or "filter" (`-F`) flags in the `samtools view` command.

---
## Understanding Bitwise FLAGs

* Unambiguous: there is exactly one possible decomposition for each FLAG sum.

* Each value contributing to the sum represents one fulfilled condition.

.small[
https://broadinstitute.github.io/picard/explain-flags.html  
https://samtools.github.io/hts-specs/SAMv1.pdf
]

---
## SAM – bitwise FLAG (example)

---
## SAM – bitwise FLAG decomposition

**FLAG 83** - In binary: FLAG 1010011 is a sum of 1000000, 10000, 10, and 1.

* In decimal: 83 = 64 + 16 + 2 + 1.
* Meaning:

---
## CIGAR string

* A sequence of base lengths and "operations" indicating alignment status (match, mismatch, deletion, insertion, intron).
* Example: `81M859N19M`
* Interpretation: 81 bases align, 859 bases are skipped (intron), followed by 19 aligned bases.

.small[ https://samtools.github.io/hts-specs/SAMv1.pdf ]

---
## Other Tools for SAM/BAM Files

* **sambamba** – A high-performance alternative to `samtools`. Written in D, it is highly parallelized and significantly faster for indexing and marking duplicates on multi-core systems.

* **Mosdepth** – The current standard for rapid BAM/CRAM depth calculation. It is many times faster than `samtools depth` and works exceptionally well for both whole-genome and exome data.

* **Picard Tools** – A Java-based suite essential for advanced manipulation. It is the "gold standard" for validating SAM file structures and is a core component of the GATK best practices pipeline.

.small[
**Sambamba:** [https://lomereiter.github.io/sambamba/](https://lomereiter.github.io/sambamba/)  
**Mosdepth:** [https://github.com/brentp/mosdepth](https://github.com/brentp/mosdepth)  
**Picard:** [https://broadinstitute.github.io/picard/](https://broadinstitute.github.io/picard/)  
]

---
## BED format

* Text-based, tab-separated list of genomic regions.

* Each region is defined by a reference sequence and start/end positions.

* Optional properties: strand, name, score, color.

* Designed for visualizing annotations in IGV or the UCSC Genome Browser.

.small[ http://genome.ucsc.edu/FAQ/FAQformat.html#format1 ]

---
## BED format: Columns

3 mandatory columns:

* **chrom**: chromosome.
* **chromStart**: first base of the region (0-based counting).
* **chromEnd**: first base after the region.
* `[chromStart, chromEnd)` notation allows for simple length calculation (`End - Start`).

.small[
```text
chr1  115263684 115263685 rs10489525  0 +
chr12 97434219  97434220  rs6538761   0 +
chr14 102360744 102360745 rs7142002   0 +
```
]

---
## BED format – optional fields

* 9 additional fields with binding order.
* All regions in a file must contain the same optional fields.

**Important fields:** - **name**: region identifier.

* **score**: value between 0 and 1000 (often used for grayscale shading).
* **strand**: "+" or "-" (defines directionality).

.small[ https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7 ]

---
## bedtools: The Swiss Army Knife for Genome Arithmetic

**bedtools** allows for lightning-fast comparison and manipulation of genomic features across multiple formats (BED, GFF, VCF, BAM). It is built on the principle of "Genome Arithmetic"—treating genomic intervals as mathematical sets.
.small[
.pull-left[

**Key Commands**

* **`intersect`**: Find overlapping regions between two files (the "workhorse").
* **`merge`**: Combine overlapping or nearby intervals into a single record.
* **`subtract`**: Remove regions in file A that overlap with file B.
* **`slop`**: Increase the size of intervals (e.g., adding 1kb flanking regions).
* **`getfasta`**: Extract DNA sequences from a FASTA file using BED coordinates.
]

.pull-right[

**Advanced Utilities**

* **`coverage`**: Summarize the "depth" of features in A that overlap with B.
* **`closest`**: Find the nearest feature in B for every feature in A (even if they don't overlap).
* **`genomecov`**: Calculate coverage across the entire genome.
* **`bamtobed`**: Convert BAM alignments into BED intervals for easier processing.
]
]

---
## bedtools: The Swiss Army Knife for Genome Arithmetic

**The Power of Piping**

Almost all commands support `stdin` and `stdout`, allowing you to chain operations:

```bash
bedtools intersect -a A.bed -b B.bed | bedtools merge -i stdin > result.bed
```

.small[ Documentation: [https://bedtools.readthedocs.io/](https://bedtools.readthedocs.io/) ]

---
## BEDOPS: High-Performance Genomic Feature Operations

**BEDOPS** is a highly scalable suite of tools optimized for performance and "embarrassingly easy" parallelization. It excels at processing large-scale genomic data by relying on strictly sorted BED input.

.small[
.pull-left[
**Core Set Operations**

* **`bedops`**: The primary tool for Boolean set operations (union, intersection, difference, complement). It can handle many input files simultaneously.
* **`bedextract`**: Performs lightning-fast retrieval of specific features (e.g., by chromosome) from sorted BED files.
* **`closest-features`**: Efficiently finds the nearest upstream/downstream features between two sets.
]
.pull-right[
**Mapping & Statistics**

* **`bedmap`**: Maps "source" data (like signal or scores) onto "target" regions. It can compute statistical summaries (mean, max, count) during the mapping process.
* **`sort-bed`**: The critical preprocessing tool. BEDOPS requires files to be lexicographically sorted to achieve its high speeds.
]
]

.small[ Documentation: [https://bedops.readthedocs.io/](https://bedops.readthedocs.io/) ]

---
## GFF/GTF: Genomic Annotation Formats

These formats are used to describe the "features" of a genome, such as where genes, exons, and introns are located. While similar, they serve slightly different roles in bioinformatics pipelines.

* **GFF (General Feature Format)**: A flexible, hierarchical format used to describe any genomic feature. It is often used for general genome annotations and can represent complex relationships between biological elements (e.g., a gene "owning" multiple transcript variants).
* **GTF (Gene Transfer Format)**: A specialized and more restrictive version of GFF (specifically GFF version 2.2). It is the **industry standard** for defining gene models because it enforces strict rules on how exons, start codons, and stop codons are linked to specific Gene and Transcript IDs.

---
## GFF/GTF: Genomic Annotation Formats

| Feature | GFF (v3) | GTF (v2) |
| --- | --- | --- |
| **Hierarchy** | Uses `ID` and `Parent` tags | Uses `gene_id` and `transcript_id` |
| **Strictness** | Very flexible | Highly standardized for gene sets |
| **Common Use** | General annotation / Sequence Ontology | Transcriptomics / RNA-Seq |

.small[

* GFF specs: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
* GTF specs: https://www.gencodegenes.org/pages/data_format.html
]

---
## GFF / GTF structure

Tab-delimited text file with 9 columns:

1. seqid (chromosome / contig)
2. source
3. feature type
4. start (1-based)
5. end (1-based, inclusive)
6. score
7. strand
8. phase (CDS only)
9. attributes (key–value pairs)

.small[ `wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz` ]
.small[ GTF gene annotations: http://www.ensembl.org/info/data/ftp/index.html ]

---
## GTF Gene Annotations: GENCODE and Ensembl

Accurate gene annotation is the foundation of transcriptomics.

For human and mouse genomes, **GENCODE** and **Ensembl** are the primary providers of the GTF files that define where genes and transcripts begin and end.

---
## GTF Gene Annotations: GENCODE

**GENCODE** is the official gene set for the ENCODE project and is widely considered the "gold standard" for human and mouse research.

* **Curation:** Combines manual curation (from the **Havana** team) with automated computational pipelines.

* **Comprehensive:** It provides the most detailed mapping of **lncRNAs**, antisense transcripts, and pseudogenes.

* **Community Standard:** Used by major projects like **TCGA** (The Cancer Genome Atlas) and **GTEx** (Genotype-Tissue Expression) to ensure consistent results across the field.

---
## GTF Gene Annotations: Ensembl

**Ensembl** provides annotations for thousands of species using an automated annotation pipeline, supplemented by manual curation for key organisms.

* **Human/Mouse Synchronization:** Since Ensembl release 79 (GENCODE 22), the two sets are functionally identical for human and mouse.

* **Chromosome Naming:** A critical difference is the coordinate naming. Ensembl GTFs use **bare numbers** (`1`, `2`, `X`), while GENCODE typically uses the **chr prefix** (`chr1`, `chr2`, `chrX`) to match UCSC standards.

---
## GTF Gene Annotations: Best practices

* **Check Your FASTA:** Your GTF must match your reference genome's chromosome naming (e.g., if your FASTA has `chr1`, your GTF must have `chr1`).

* **Version Control:** Always record the version number (e.g., GENCODE v44). Gene boundaries and IDs change between releases, which can significantly impact your read counting and differential expression results.

.small[
Download GENCODE: [https://www.gencodegenes.org/](https://www.gencodegenes.org/)

Download Ensembl: [http://www.ensembl.org/info/data/ftp/index.html](http://www.ensembl.org/info/data/ftp/index.html)
]

---
## WIG (Wiggle) Format: Representing Continuous Signal

The **Wiggle (WIG)** format is designed for displaying continuous-valued data, such as GC content, probability scores, or sequencing depth, as a track in a genome browser. It is specifically optimized for dense, "signal-like" data.

.small[
.pull-left[
**1. variableStep** Used for data with **irregular intervals** between data points. It is more efficient when data is sparse.

* Requires a `chrom` definition.
* Each line lists a `position` and a `value`.

```text
variableStep chrom=chr1
41196312 10.5
41196350 12.8
```
]
.pull-right[
**2. fixedStep** Used for data with **regular, constant intervals**. It is more compact because the positions are calculated automatically.

* Defines a `start` and a constant `step` size.
* Only the `values` are listed.

```text
fixedStep chrom=chr1 start=41196312 step=20
10.5
12.8
```
]
]
<br><br><br><br><br><br><br><br>
.small[ UCSC WIG Documentation: [https://genome.ucsc.edu/goldenpath/help/wiggle.html](https://genome.ucsc.edu/goldenpath/help/wiggle.html) ]

---
## Working with WIG format

**Optimization: BigWig**

Because WIG files are plain text and can become enormous, they are usually converted into **BigWig** format—an indexed, binary version that allows for near-instantaneous zooming and scrolling in genome browsers (like IGV).

* **WiggleTools**: performs math (mean, sum, etc.) on genomic signal files.

.small[
https://genome.ucsc.edu/goldenpath/help/wiggle.html
https://github.com/Ensembl/WiggleTools
]

---
## VCF format

* Variant Call Format (VCF) is a flexible and extendable format for variation data.

---
## VCF Column Definitions: The "Fixed" Fields

The **Variant Call Format (VCF)** is a tab-delimited text file containing meta-information lines, a header line, and then data lines. Each data line represents a single genomic variant with eight mandatory fixed columns.

.small[
| Column | Name | Description |
| --- | --- | --- |
| **CHROM** | Chromosome | The identifier of the reference sequence (e.g., `chr1` or `1`). |
| **POS** | Position | The **1-based** coordinate of the first base in the variant. |
| **ID** | Identifier | Unique semi-colon separated identifiers (e.g., dbSNP `rs` numbers). |
| **REF** | Reference | The reference allele (the string of bases found in the reference genome). |
| **ALT** | Alternate | Comma-separated list of non-reference alleles (the variants found). |
| **QUAL** | Quality | Phred-scaled probability that the variant is incorrect (higher is better). |
| **FILTER** | Filter | `PASS` if all quality filters are met; otherwise, lists specific failure codes. |
| **INFO** | Information | A "key=value" field for extended annotations (e.g., allele frequency, depth). |
]

.small[
VCF v4.2 Specification: [https://samtools.github.io/hts-specs/VCFv4.2.pdf](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
]

---
## VCF Format Application

Beyond these eight, VCF files often contain a **FORMAT** column and one or more **SAMPLE** columns, which hold the genotype data (`GT`), genotype quality (`GQ`), and read depth (`DP`) for individual subjects.

.small[ Petr Danecek, 1000 Genomes Project Analysis Group, et al. "The variant call format and VCFtools", Bioinformatics, Volume 27, Issue 15, August 2011, Pages 2156–2158, https://doi.org/10.1093/bioinformatics/btr330 ]

---
## Mutation Annotation Format (MAF): Cancer Variant Reporting

The **Mutation Annotation Format (MAF)** is a tab-delimited file specifically designed to aggregate somatic mutations across many samples.

While VCF files are usually "per-sample" or "per-cohort," MAF files are the standard for large-scale cancer studies like **The Cancer Genome Atlas (TCGA)**.

---
## Mutation Annotation Format (MAF): Cancer Variant Reporting

* **Protected MAFs**: These are raw files containing all sequenced mutations, including both **somatic** (cancer-specific) and **germline** (inherited) variants. Because they contain germline data, access is usually restricted to protect patient privacy.

* **Somatic (Public) MAFs**: These are filtered files intended for general research. They retain only high-confidence somatic mutations, typically restricted to **protein-coding regions** or splice sites, and are cleaned of sequencing errors and germline "noise."

---
## Mutation Annotation Format (MAF): Cancer Variant Reporting

* **Project-Wide Integration:** MAF files make it easy to see mutation patterns (e.g., "Which genes are most frequently mutated in this lung cancer cohort?") across hundreds of patients.

* **Rich Annotation:** Each variant is linked to its functional effect (e.g., Missense, Nonsense, Frame_Shift) and its specific gene symbol (HUGO).

.small[
GDC MAF Documentation: [https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/)  
https://github.com/mskcc/vcf2maf
]

---
## Indexing for rapid data access

To handle massive genomic datasets efficiently, almost every format requires an **index file**.

* **Fasta Index (`.fai`)**: Allows tools to jump to specific chromosomal coordinates instantly.
* **BAM Index (`.bai`)**: Enables random access to alignments in specific regions (e.g., for viewing a single gene in IGV).
* **Tabix (`.tbi`)**: Generic index for compressed, sorted tab-delimited files like BED, GFF, or VCF.

*Requirement:* Files must be **sorted** by coordinate before they can be indexed.

---
## Genomic Coordinate Systems

Bioinformatics uses two distinct ways of counting positions. Understanding the difference is critical to avoid "off-by-one" errors.

.small[ **Note:** Bioconductor's `GenomicRanges` (GRanges) standardizes data into **1-based** coordinates internally, regardless of the input format. ]

---
## Formats use different chromosome naming conventions

* The UCSC and Ensembl databases name their chromosomes differently.

* By convention, UCSC chromosomes start with **chr** (e.g., `chr5`).

* Ensembl chromosome names usually consist of just the number (e.g., `5`).

* Mixing these conventions in a single analysis pipeline will cause tools to fail!

**Bioconductor packages like `GenomeInfoDb` help translate between these styles.**

---
## IGV: Visualizing all genomics formats

The **Integrative Genomics Viewer (IGV)** is the industry-standard, high-performance visualization tool designed for the interactive exploration of large, integrated genomic datasets. It acts as the visual "hub" that can render almost every format we discussed.

* **SAM/BAM/CRAM**: Visualizes individual sequencing reads, showing mismatches, indels, and structural variations (requires a `.bai` or `.crai` index).
* **BED/GFF/GTF**: Displays genomic features like genes, exons, and custom regions of interest.
* **VCF**: Visualizes variant calls, highlighting genotypes across multiple samples.
* **WIG/BigWig**: Renders continuous signal data as bar charts or line graphs (ideal for ChIP-seq or RNA-seq coverage).
* **FASTA**: Provides the underlying nucleotide sequence for the reference genome.

---
## IGV: Visualizing all genomics formats

* **Speed**: Uses indexed file formats to load only the data for the specific window you are viewing, making it possible to navigate multi-gigabyte files instantly.

* **Integration**: Allows you to load data from your **local machine** or directly from **remote URLs** (like the UCSC or Ensembl servers).

* **Verification**: It is the final "sanity check" for any bioinformatics pipeline—nothing beats looking at the raw alignments to confirm a variant call.

.small[ Download IGV: [software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/) ]

---
## Genome Assembly Versions

A **reference genome** is a representative digital sequence of an organism's DNA, serving as a standard "map" for alignment and comparison. Acts as a common coordinate system so researchers can say "Gene X is at chr1:100-200" and others know exactly where to look.

* **Assembly Versions:** Genomes are periodically updated (e.g., Human `hg19` → `hg38`). New versions fix sequence gaps, correct errors, and improve accuracy.

---
## Liftover: Converting Between Assembly Versions

**Liftover** is the process and the tool for remapping genomic coordinates from one assembly version to another (e.g., mapping an old GWAS study in `hg19` to the current `hg38`).

* **Chain Files:** Specialized mapping files that describe how segments of the old genome align to the new one.

* **Bioconductor Implementation:** The `rtracklayer` package provides the `liftOver()` function.

* **Potential Outcomes:**
    * **Successful Map:** The range moves to the new position.
    * **Split Map:** One range in the old build maps to multiple regions in the new one (returns a `GRangesList`).
    * **Unmapped:** The sequence was deleted or significantly changed in the new build.

.small[ https://www.youtube.com/watch?v=942WLiy3eb4 1h 7m ]

---
## Liftover: Converting Between Assembly Versions

```r
library(rtracklayer)
# 1. Import the chain file
chain <- import.chain("hg19ToHg38.over.chain")

# 2. Perform the liftover
hg38_coords <- liftOver(hg19_granges, chain)

# 3. Handle the output (returns a GRangesList)
hg38_final <- unlist(hg38_coords)
```

.small[ UCSC liftOver web tool: https://genome.ucsc.edu/cgi-bin/hgLiftOver  
[Liftover Analysis with rtracklayer](https://www.youtube.com/watch?v=942WLiy3eb4) video 1h 6m]