Alignment Introduction

class: center, middle, inverse, title-slide

.title[
# Alignment Introduction
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-02-11
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## Protein sequencing

.pull-left[

* Fred Sanger and colleagues sequenced insulin, the first complete protein sequence (1945–1955).
* Established that every protein has a characteristic primary structure.
* Protein sequencing established the concept of a biological sequence as an analyzable object.
* This directly motivated early computational approaches to sequence comparison.

]

.pull-right[
<img src="img/sanger.png" alt="" width="300px" style="display: block; margin: auto;" />
.small[ Frederick Sanger. 1958 – first Nobel Prize. https://www.nobelprize.org/uploads/2018/06/sanger-lecture.pdf ]
]

---
## 1960 — the dawn of computational biology

* Rapid growth of amino-acid sequence collections in the 1960s
* Computational methods became essential to analyze similarity and infer function
* Academic computing resources made large-scale analysis feasible

.small[ Hagen, J. The origins of bioinformatics. Nat Rev Genet 1, 231–236 (2000). https://doi.org/10.1038/35042090 ]

---
## Pioneer of computational biology — Margaret Dayhoff

.pull-left[

* Trained in mathematics and quantum chemistry
* Associate director, National Biomedical Research Foundation
* Developed FORTRAN programs to assemble protein sequences from overlaps
* Created the first protein sequence databases and family classifications
* Introduced PAM matrices for protein alignment (still used)
  ]

.pull-right[
<img src="img/Margaret_Oakley_Dayhoff_cropped.jpg" alt="" width="300px" style="display: block; margin: auto;" />
]

???

* PAM matrices explicitly model evolutionary change.
* Contrast PAM with later BLOSUM matrices in protein alignment.

---
## The genomics workflow

* Alignment is an early and critical computational step.
* Errors at this stage propagate downstream.

---
## Alignment — definition and goals

**Alignment** is the process of determining how or where a read sequence matches a reference sequence.

* Human genome: ~3.2 billion bases

* Modern sequencers: hundreds of millions to billions of short reads

* Goal: assign each read to its most likely genomic origin

* Read alignment is different from classical global/local sequence alignment.

---
## Alignment — visual example

---
## Sequencing coverage

* Coverage = average number of reads covering each genomic base

* Coverage = (reads × read length) / genome size.

Example:
If the genome is 100 Mbp, sequencing 1M × 100 bp reads gives ~1× coverage.

* Coverage is an expectation; local coverage varies substantially.

---
## Library complexity

* Library complexity `$C$` = number of unique molecules present in the sequencing library
* Low complexity → many duplicates, limited new information

<img src="img/library_complexity.png" alt="" width="800px" style="display: block; margin: auto;" />
???

* PCR duplication reduces effective sequencing depth.
* Motivates duplicate marking and complexity estimation.

---

## Modeling approach

- Assume we have `$C$` unique molecules in the library and we obtain `$N$` sequencing reads

- The probability distribution of the number of times we sequence a particular molecule is binomial (individual success probability `$p=1/C$`, `$N$` trials in total)

- Assume Poisson sampling as a tractable approximation (rate `$\lambda = N/C$`)

- Finally, truncate the Poisson process: we only see events that happened between `$L$` (the minimum number of times a molecule must be observed to appear in the data) and `$R$` (the maximum number of times a molecule is allowed or observed) times. We don’t know how many molecules were observed `$0$` times.

---
## Poisson distribution — refresher

- The probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

- Resembles a normal distribution, but over the positive values, and with only a single parameter.

Key properties

- The standard deviation is the square root of the mean.
- For mean > 5, well approximated by a normal distribution

`$$P(k)=\frac{\lambda^k}{k!}e^{-\lambda}$$`

---
## Poisson discribution

``` r
x <- seq(0, 25, 1)
y <- dpois(x, 10)
plot(x, y)
```

![](01a_Alignment_intro_xaringan_files/figure-html/unnamed-chunk-7-1.png)

---
## Poisson discribution

``` r
y5  <- dpois(x, 5); y10 <- dpois(x, 10); y15 <- dpois(x, 15)
plot(x, y5, col = 1); lines(x, y10); lines(x, y15)
```

![](01a_Alignment_intro_xaringan_files/figure-html/unnamed-chunk-8-1.png)

---
## Estimating library complexity with a Poisson model

- For Poisson sampling, we can write the (truncated) distribution over `$x_i$` the times we sequence the `$i^{th}$` molecule as:
`$$Pr(x_i|\lambda)=\frac{1}{K_{L,R}(\lambda)}*\frac{\lambda^{x_i}}{x_i!}e^{-\lambda}$$`
`$$K_{L,R}(\lambda)=\sum_{x=L}^RPr(x_i|\lambda)$$`
`$K_{L,R}(\lambda)$` - Normalization constant, accounts for the fact that we can only observe molecules sequenced between `$L$` and `$R$` times. It's the sum of probabilities for all observable outcomes, ensuring the truncated distribution sums to 1. The probability is `$0$` if `$x_i$` is less than `$L$` or greater than `$R$`.

- We can estimate the maximum likelihood rate parameter `$\lambda$` from a vector of observations `$x$`

---
## Maximum likelihood library size

- `$K_{L,R}(\lambda)$` represents the *fraction* of molecules we expect to observe (those with counts in `$[L,R]$`). 

- `$M$` unique sequences observed.
- Dividing our observed count `$M$` by this fraction gives us the total maximum likelihood library size `$\hat{C}$`:

`$$\hat{C}=\frac{M}{K_{L,R}(\lambda)}$$`

- Approximate solution when `$L=1$` and `$R=\infty$`

`$$\hat{C}=\frac{M}{1-Poisson(0,\lambda)}$$`

---
## Limitations of Poisson model

* Real data: variance > mean (overdispersion)
* Poisson underestimates complexity from subsampled data
* Empirical evidence against pure Poisson sampling.
* Motivates Gamma–Poisson (NB) model.

---
## Gamma–Poisson intuition

.pull-left[

- Gamma distributed sampling rates describe the entire population (library preparation)

- Poisson sampling to form a smaller sample (sequencing)

- Negative binomial distribution characterizes the resulting occurrence histogram
  ]

.pull-right[

<img src="img/gamma_poisson_negative_binomial.png" alt="" width="300px" style="display: block; margin: auto;" />
]

???

* Gamma captures library preparation variability.
* NB naturally models overdispersion.

---
## The Biological Intuition

Think of the NB model as arising from biological heterogeneity:

- Different molecules have different "capture efficiencies" - some are easier to sequence than others due to GC content, secondary structure, PCR bias, etc.

- The heterogeneity creates overdispersion: Instead of all molecules having the same rate `$\lambda$` (Poisson), each molecule `$i$` has its own rate `$\lambda_i$` drawn from a Gamma distribution

- `$k$` controls this heterogeneity:
  - Small k → huge variation in capture efficiencies (some molecules sequenced 100x, others never)
  - Large k → all molecules roughly equally likely to be sequenced (approaches Poisson)

---
## The gamma distribution is a "conjugate prior" for the Poission distribution

.pull-left[
`$$Poisson(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}$$`

`$$Gamma(x,\alpha,\beta)=\frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$$`
]
.pull-right[
`$Poisson(x; \lambda)$` - Individual molecule's sequencing counts

`$Gamma(x; \alpha, \beta)$` - Distribution of sequencing rates across molecules

`$NB(y; \alpha, \beta)$` - The resulting distribution of observed counts


]

`$$NB(y;\alpha,\beta)=\int_0^\infty Poisson(y;x) \, Gamma(x;\alpha,\beta) \, dx$$`
???

* Integration yields NB closed form.
* Hierarchical interpretation is key.

---
## Negative Binomial model for sequence occurrences

- `$C$` - library complexity (latent, fit to observed data)
- `$N$` - number of reads
- `$M$` - total number of unique sequences
- `$\lambda=N/C$`
- `$k$` - dispersion (latent, fit to observed data)

`$$Pr(x_i,\lambda,k)=NB(x_i|\lambda,k)=NB(x_i|n,p)$$`
Reparameterize for classical NB interpretation  
- `$p=\lambda/(\lambda + 1/k)$` - "success probability" of each trial, rescales `$\lambda$` into a probability
- `$n=1/k$` - number of "successes", inversely related to dispersion

???

* NB generalizes Poisson.
* Widely used in RNA-seq and genomics.

---
## Simulation results show that the Gamma Poisson works well for non-uniform libraries

- True library complexity `$C$`: 1M unique molecules

- Vary `$k$` (controls sampling rate variance)

- Given 100K reads (`$\lambda=0.1$`), assess estimates from both models

| k    | `$\hat{C}$` using Poisson | `$\hat{C}$` using Gamma-Poisson |
|------|:-------:|:-------------:|
| 0.1  | 0.93M   | 0.96M         |
| 1    | 0.52M   | 1.01M         |
| 10   | 0.12M   | 1.10M         |
| 20   | 0.07M   | 0.68M         |

* Poisson collapses under heterogeneity. GP/NB remains stable.

---
## Negative Binomial Library Complexity better model 150 '1000 Genome' Datasets

<img src="img/library_complexity2.png" alt="" width="800px" style="display: block; margin: auto;" />
- Data are "overdispersed" (variance greater than mean)

* Overdispersion is the rule, not the exception.

---
## Marginal value of additional sequencing

- `$C$` – library complexity (latent – estimated) 
- `$N$` – number of reads
- `$M$` – number of unique sequences

`$M$` can be estimated by `$(1 - NegativeBinomial(0 | \lambda, k)) * C$`

- Assume we have `$r$` more reads `$s = (N + r) / N$`

- Replace `$\lambda$` by `$s * \lambda$` to estimate `$M'$` achieved with `$r$` more reads

---
## Marginal value of additional sequencing

.pull-left[
- Setup: Library with `$C = 10^6$` (1 million) unique molecules (dashed line)
- X-axis: Sequencing reads (total number `$N$`)
- Y-axis: Observed molecules (`$M$`, the number of unique sequences)
- Blue (k=0.005) & Green (k=0.050) - Very low `$k$` = HIGH heterogeneity, helps observing unique molecules
- Red (k=0.500) & Cyan (k=1.000) - Moderate/low heterogeneity
]
.pull-right[
<img src="img/marginal_sequencing_value.png" alt="" width="500px" style="display: block; margin: auto;" />
- Diminishing returns guide experimental design
]

???

even at 5 million reads, you've only seen ~92% of the library