Early Sequencing Technologies

class: center, middle, inverse, title-slide

.title[
# Early Sequencing Technologies
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-01-14
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Why sequence a reference genome?

- Determine the "complete" sequence of a human haploid genome.

- Identify the sequence and location of every protein-coding gene.

- Use as a "map" with which to track the location and frequency of genetic variation in the human genome.

- Unravel the genetic architecture of inherited and somatic human diseases.

- Understand genome and species evolution.

---
## DNA sequencing: Maxam-Gilbert, Sanger

.pull-left[
1) Sanger: sequencing by synthesis (Maxam-Gilbert instead relies on chemical degradation)

2) Radioactively labeled primers hybridize to the template DNA

3) Polymerase + normal dNTPs + ddNTPs (dideoxynucleotide chain terminators) at low concentration

4) One gel lane per base (one ddNTP per reaction); read the ladder visually

.small[https://en.wikipedia.org/wiki/Maxam%E2%80%93Gilbert_sequencing]
.small[https://www.youtube.com/watch?v=bEFLBf5WEtc]
]

.pull-right[
<img src="img/sanger_sequencing1.jpg" width="200px" style="display: block; margin: auto;" />
]
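
The four-lane ladder can be mimicked with a minimal Python sketch. It assumes an idealized reaction in which every position yields a terminated fragment, ignores the primer length, and takes the template written in the order the polymerase traverses it. The function names `sanger_ladder` and `read_ladder` are hypothetical, chosen only for this illustration.

```python
def sanger_ladder(template: str) -> dict[str, list[int]]:
    """Return, for each ddNTP lane, the fragment lengths that terminate there.

    Assumption: `template` is written in the order the polymerase reads it,
    so the synthesized strand is its base-by-base complement.
    """
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, template_base in enumerate(template, start=1):
        # A fragment of this length ends with the ddNTP complementary
        # to the template base at that position.
        lanes[complement[template_base]].append(length)
    return lanes


def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Read the gel from shortest to longest fragment across all four lanes."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_ladder("TACG")
print(lanes)               # lane -> fragment lengths
print(read_ladder(lanes))  # sequence of the synthesized strand
```

Sorting bands by fragment length is the computational equivalent of reading the gel from bottom to top.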

---
## Shotgun genome sequencing milestones

- **1977**: Bacteriophage `\(\Phi X 174\)` (5.4 kb) - first DNA genome sequenced

- **1995**: _H. influenzae_ (1.8 Mb) - first free-living organism

- **1996**: _S. cerevisiae_ (yeast, 12 Mb) - first eukaryote

- **1998**: _C. elegans_ (worm, 97 Mb) - first multicellular animal

- **2000**: _Drosophila melanogaster_ (fruit fly, 165 Mb)

- **2000**: _Arabidopsis thaliana_ (mustard plant, 125 Mb) - first plant

- **2001**: _Homo sapiens_ (human, 3 Gb) - draft sequence

- **2004**: Human genome - "finished" sequence (92% complete)

- **2022**: Human genome - truly complete (T2T-CHM13, all gaps closed)

.small[Nurk S, et al. The complete sequence of a human genome. Science. 2022. https://doi.org/10.1126/science.abj6987]

---
class: middle,center

# Sequencing at scale

---
## Sequencing in a nutshell

- Cut the long DNA into smaller segments (several hundreds to several thousand bases).

- Sequence each segment: start from one end and sequence along the chain, base by base.

- Each read ends after a limited length because signal quality degrades and base calls become too noisy.

- Sequencing yields many sequence pieces of varying length: typically about 500-1,000 bases from Sanger and tens to a few hundred bases from short-read NGS.

- The sequence pieces are called "reads" for NGS data.
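
As a rough illustration of the steps above, the following Python sketch samples fixed-length reads from random positions of a toy genome and computes the mean coverage as (number of reads × read length) / genome length. The function names, the uniform sampling, and the error-free reads are simplifying assumptions, not a model of any real instrument.

```python
import random


def shotgun_reads(genome: str, n_reads: int, read_len: int, seed: int = 0) -> list[str]:
    """Sample error-free reads of fixed length from uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def mean_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of reads covering each base: N * L / G."""
    return n_reads * read_len / genome_len


# Toy example: a random 1,000-base "genome" sequenced with 100 reads of 50 bases.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = shotgun_reads(genome, n_reads=100, read_len=50)
print(mean_coverage(len(reads), 50, len(genome)))  # 100 * 50 / 1000 = 5.0
```

Mean coverage is only an average: because start positions are random, some bases are covered more often and some may not be covered at all.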

---
## Shotgun genome sequencing (Sanger, 1979)

.pull-left[
1) Fragment the genome

2) Clone 2-10 kb fragments into plasmids (much larger inserts, roughly 100-200 kb, go into bacterial artificial chromosome (BAC) clones); pick lots of colonies; purify DNA from each

3) Use a primer complementary to the plasmid (vector) sequence to read into the cloned genomic DNA

4) Assemble the genome from overlapping "reads"
]
.pull-right[
<img src="img/shotgun.png" width="800px" style="display: block; margin: auto;" />
]
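
Step 4 can be illustrated with a toy greedy overlap assembler in Python. This is a minimal sketch that assumes error-free reads from a single strand and no repeats; production assemblers use overlap or de Bruijn graphs and handle errors, repeats, and both strands. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

The greedy choice is easy to follow but can misassemble repeated sequence, which is one reason real genomes need longer reads, paired ends, or mapped clones.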

---
## How to sequence a human genome: Lee Hood

.pull-left[
- Lee Hood and colleagues developed the first automated DNA sequencer (1986-1987)
- Replaced radioactive labels with four fluorescent dyes (one per base)
- Automated fluorescence detection during electrophoresis replaced manual gel reading (capillary instruments followed in the 1990s)
- Computer-based base calling eliminated human interpretation
]
.pull-right[
<img src="img/lee_hood_sequencing.png" width="400px" style="display: block; margin: auto;" />

- Increased throughput from ~700 bases/day to thousands of bases/day per machine
- Made the Human Genome Project feasible
]

.small[ Hood LE, Hunkapiller MW, Smith LM. Automated DNA sequencing and analysis of the human genome. Genomics. 1987 https://doi.org/10.1016/0888-7543(87)90046-2 ]

---
## Massively Parallel DNA sequencing instruments

- All MPS platforms require a library obtained either by amplification or ligation with custom linkers (adapters)

- Each library fragment is amplified on a solid surface (either a bead or a flat _Si_-derived surface) with covalently attached adapters that hybridize to the library adapters

- Direct step-by-step detection of the nucleotide base incorporated by each amplified library fragment set

- Hundreds of thousands to hundreds of millions of reactions detected per instrument run = "massively parallel sequencing"

- A "digital" read type that enables direct quantitative comparisons

- Shorter read lengths than capillary sequencers

---
## Library Construction for MPS

.pull-left[
- Shear high molecular weight DNA with sonication

- Enzymatic treatments to blunt ends

- Ligate synthetic DNA adapters (each with a DNA barcode), PCR amplify

- Quantitate library

- Proceed to WGS, or do exome or specific gene hybrid capture
]
.pull-right[
<img src="img/library.png" width="400px" style="display: block; margin: auto;" />
]

---
## PCR-related Problems in MPS

- PCR is an effective vehicle for amplifying DNA, however...

- In MPS library construction, PCR can introduce preferential amplification ("jackpotting") of certain fragments

- The result is duplicate reads whose alignments have identical start and end coordinates

- Need to "de-duplicate" after alignment and keep only one read pair from each duplicate set

- Low input DNA amounts favor jackpotting due to lack of complexity in the fragment population
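
De-duplication by alignment coordinates can be sketched in Python as follows. The `AlignedPair` record and its fields are hypothetical stand-ins for aligned read pairs, and keeping the pair with the highest mean base quality is one common convention assumed here. Production tools such as Picard MarkDuplicates or samtools markdup work on BAM files and apply more detailed rules.

```python
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """Hypothetical minimal record for an aligned read pair."""
    name: str
    chrom: str
    start: int        # leftmost aligned position of the pair
    end: int          # rightmost aligned position of the pair
    strand: str
    mean_qual: float  # mean base quality, used to choose the representative


def deduplicate(pairs: list[AlignedPair]) -> list[AlignedPair]:
    """Keep one pair per (chrom, start, end, strand), preferring higher quality."""
    best: dict[tuple[str, int, int, str], AlignedPair] = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.strand)
        if key not in best or pair.mean_qual > best[key].mean_qual:
            best[key] = pair
    return list(best.values())


pairs = [
    AlignedPair("a", "chr1", 100, 350, "+", 30.0),
    AlignedPair("b", "chr1", 100, 350, "+", 35.0),  # duplicate of "a", higher quality
    AlignedPair("c", "chr1", 120, 350, "+", 20.0),  # different start: not a duplicate
]
print([p.name for p in deduplicate(pairs)])
```

Two independent fragments can share coordinates by chance, so coordinate-based de-duplication may also discard some genuine reads, especially at very high depth.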

---
## PCR-related Problems in MPS

- PCR also introduces false positive artifacts due to substitution errors by the polymerase

- If a substitution occurs in an early PCR cycle, the error is copied into many fragments and can appear to be a true variant

- If a substitution occurs in a later cycle, the error is typically drowned out by correctly copied fragments in the cluster

- Cluster formation is itself a type of PCR ("bridge amplification")

- It amplifies fragments with very high or very low G+C content less efficiently

- The result is reduced coverage at these loci
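
The effect of error timing can be made concrete with a small Python sketch. It assumes an idealized PCR with perfect doubling each cycle and counts whole molecules rather than individual strands. Under that assumption, an error introduced in one new copy at cycle k ends up in 1 / (N0 × 2^k) of the final molecules, where N0 is the number of starting templates. The second helper computes the G+C fraction of a fragment, the quantity behind the amplification bias. Both function names are hypothetical.

```python
def error_fraction(error_cycle: int, n_templates: int = 1) -> float:
    """Fraction of final molecules carrying an error that arose at `error_cycle`.

    Idealized model: perfect doubling every cycle, molecules counted whole.
    """
    if error_cycle < 1 or n_templates < 1:
        raise ValueError("error_cycle and n_templates must be >= 1")
    return 1.0 / (n_templates * 2 ** error_cycle)


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


print(error_fraction(1))                   # error in the first cycle
print(error_fraction(10))                  # error in the tenth cycle
print(error_fraction(1, n_templates=100))  # early error, but 100 starting templates
print(gc_content("GGCCAT"))
```

The third call shows why low-input libraries are more vulnerable: with few starting templates, an early error occupies a much larger share of the final pool.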

---
## Hybrid Capture

- Selectively enrich specific genomic regions from a whole genome library

- Biotinylated DNA/RNA probes hybridize to target sequences

- Streptavidin magnetic beads capture probe-bound fragments

- Wash away non-target DNA, elute enriched library for sequencing

---
## Hybrid Capture

- **Exome sequencing**: Probes target ~1-2% of genome containing all annotated exons
    - "Exome" = all exons of protein-coding genes in the reference genome
    - Covers regions estimated to harbor ~85% of known disease-causing mutations at a fraction of WGS cost

- **Custom panels**: Probes designed for specific clinical targets
    - Cancer gene panels, cardiovascular panels, etc.
    - Enables deeper coverage of clinically relevant loci

**Advantages:** Reduces sequencing cost while increasing coverage depth in regions of interest

**Disadvantages:** Capture efficiency is typically 50-70%; not all target regions are captured equally
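
The cost and depth trade-off can be estimated with a short Python sketch. It assumes mean depth = total sequenced bases × on-target fraction / target size and ignores uneven capture across targets. All numbers below (90 Gb and 6 Gb of sequence, a 36 Mb exome target, 60% on-target) are illustrative assumptions, not measurements.

```python
def mean_target_depth(total_bases: float, target_size: float, on_target_fraction: float) -> float:
    """Mean depth over the target, assuming on-target reads are spread evenly."""
    if target_size <= 0 or not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("target_size must be > 0 and on_target_fraction within [0, 1]")
    return total_bases * on_target_fraction / target_size


# Whole genome: 90 Gb of sequence over a 3 Gb genome, every read counts as on target.
wgs_depth = mean_target_depth(90e9, 3e9, 1.0)

# Exome: 6 Gb of sequence, 36 Mb target (about 1.2% of the genome), 60% of reads on target.
exome_depth = mean_target_depth(6e9, 36e6, 0.6)

print(round(wgs_depth), round(exome_depth))
```

With these assumed values, the exome run uses 15 times less sequence than the whole genome run yet reaches roughly three times the mean depth over its targets.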