Burrows–Wheeler Transform (BWT) for Genomic Alignment

class: center, middle, inverse, title-slide

.title[
# Burrows–Wheeler Transform (BWT) for Genomic Alignment
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-02-18
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

<!--
# Outline

* Intuition: why BWT helps with compression & search

* Constructing the BWT (rotations / suffix array view)

* Example: `banana$` (step-by-step)

* LF-mapping and inverse BWT (reconstruct text)

* FM-index: C array + Occ (rank) — backward search for exact matching

* Complexity, implementation notes, and role in aligners (BWA, Bowtie)

* References / further reading

* Presenter note: Tell students this is one of the foundational ideas used in modern short-read aligners (indexing reference, fast searches).
* Presenter note: Emphasize difference between BWT (a transform) and FM-index (a search index built on BWT).
-->

# Intuition — what BWT does

* BWT permutes characters of a string and sorts them - equal characters cluster together (good for compression).

* Clustering also makes **rank (count) queries** well behaved and cheap with supporting data structures.

* Importantly: BWT itself is reversible (no information lost) if we keep an end-of-string marker `$`.

Key property:

* BWT enables **backward search** (search the pattern from right → left) using only rank/occurrence queries and a small array `C`.

???

* Presenter note: Use the phrase "cluster similar contexts" — BWT sorts rotations so characters preceding similar contexts end up together.

---
# Two equivalent constructions

1. **Rotation table + sort**

* List all rotations of the string `T` (append special terminal `$`).
   * Sort rotations lexicographically.
   * BWT is the *last column* of this sorted table.

2. **Suffix array (SA) viewpoint**

* Build suffix array: sorted list of starting positions of suffixes.
   * BWT[i] = character preceding suffix SA[i] (character at index `SA[i]-1`, with wrap to last char for SA element 0).

---
# Example: start text `banana$`

We will compute BWT for `banana$` step by step and then use it for exact pattern search.

Text `T = "banana$"` (note `$` is lexicographically smallest and unique).

???

* Presenter note: Remind students `$` is an end marker not present elsewhere; it simplifies uniqueness and reversibility.

---
# Example: sort rotations (lexicographically)

Sorted rotations (lexicographic order):

Take the **last column** of this table → that is the BWT.

???

* Presenter note: If students ask "why last column?" — explain rotations shift preceding characters to last column, clustering contexts that precede similar suffixes.

---
# BWT result (banana$)

Sorted rotations (with indexes):

| Rotation | Last character |
| -------- | :---------: |
| $banana  | a         |
| a$banan  | n         |
| ana$ban  | n         |
| anana$b  | b         |
| banana$  | $         |
| na$bana  | a         |
| nana$ba  | a         |

---
# BWT result (banana$)

The suffix array (SA) of `banana$` is the starting indices of the sorted rotations:

The string `banana$` is indexed starting from **0**:
.small[
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Character** | b | a | n | a | n | a | $ |
]
* **SA[0] = 6**: The first sorted rotation starts with the character at index **6** (`$`).
* **SA[1] = 5**: The second sorted rotation starts with the character at index **5** (`a`).
* **SA[4] = 0**: The fifth sorted rotation starts with the character at index **0** (`b`).

The BWT Connection

The BWT is the "character preceding the suffix." If a suffix starts at index , the character preceding it is at index .

* For **SA[1] = 5** (starts with `a`), the BWT character is `banana$[5-1]`, which is `banana$[4] = 'n'`.
* For **SA[4] = 0** (starts with `b`), the BWT character wraps around to the end: `banana$[6] = '$'`.

This is why the **last column** of your sorted rotations table and the **character before the suffix** always yield the same result: **"annb$aa"**.

???

* Presenter note: Check students can verify counts of characters (a:3, n:2, b:1, $:1) in both T and BWT.

---
# LF-mapping (Last → First)

Two important observations:

* The **first column** of the sorted rotation table is just the characters of T sorted (i.e., all `$` then `a` characters then `b` then `n` characters).

* Each row corresponds to a rotation; the last column character precedes the first column character in the original string.

---
# LF-mapping (Last → First)

* If `i` is a row index in the sorted table, then `LF(i) = C[ BWT[i] ] + Occ(BWT[i], i)`

* `LF(i)` maps the last-column position `i` to the corresponding first-column position which represents the previous character in the original text.

Definitions:

* `C[c]` = total number of characters in T strictly smaller than `c` (prefix sums over character counts).
* `Occ(c, i)` = number of occurrences of `c` in BWT[0..i-1] (a rank operation).

???

* Presenter note: LF is the basis for inverse BWT and for stepping backward through text: follow LF repeatedly to reconstruct the text backward.

---
# Step 1 — Start character in the sorted rotations

* Row index **i = 5**
* Character **BWT[5] = 'a'** (this is the second `'a'` in the BWT string `annb$aa`)

`$LF(i) = C[BWT[i]] + Occ(BWT[i], i)$`

| i | First (F) | Rotation   | Last (L = BWT) |
|:-:|:---------:|:-----------|:--------------:|
| 0 | $ | $banana  | a |
| 1 | a | a$banan  | n |
| 2 | a | ana$ban  | n |
| 3 | a | anana$b  | b |
| 4 | b | banana$  | $ |
| 5 | n | na$bana  | **a** |
| 6 | n | nana$ba  | a |

---
# Step 2 — Compute C[c] (C[a] in our case)

* `C[c]` = total number of characters in T strictly smaller than `c`

| Character | Count |
| --------- | ----- |
| $         | 1     |
| a         | 3     |
| b         | 1     |
| n         | 2     |

Sorted alphabet: `$ < a < b < n`

So prefix sums:

```
C['$'] = 0
*C['a'] = 1
C['b'] = 1 + 3 = 4
C['n'] = 1 + 3 + 1 = 5
```

For our case:

```
C['a'] = 1
```

---
# Step 3 — Compute Occ('a', i)

* `Occ(c, i)` = number of occurrences of `c` in BWT[0..i-1]

```
BWT = a n n b $ a a
Index = 0 1 2 3 4 5 6
```

For `i = 5` we look at `BWT[0..4] = a n n b $`.

Number of `'a'` in that prefix: `Occ('a', 5) = 1`

---
# Step 4 — Apply LF formula

`$LF(5) = C['a'] + Occ('a', 5)$`

`$LF(5) = 1 + 1 = 2$`

---
# Interpretation

```
LF(5) = 2
```

* Row 5 (character `'a'` in last column)

* Maps to row 2 in the first column

* That row represents the suffix that is preceded by this `'a'` in the original text

```markdown
Row 5  (L = a)
   ↓ LF
Row 2  (F = a)
```

This is the **Last → First** correspondence:
the 2nd `'a'` in the last column corresponds to the 2nd `'a'` in the first column.

---
# LF ntuition

* The first column (F) is just all characters sorted.

* Characters in L are grouped in the same relative order.

* `Occ(c,i)` determines *which copy* of character `c` we are referring to.

* `C[c]` tells us where the block of `c` begins in F.

* Adding them jumps to the matching row in F.

---
# Inverse BWT via LF mapping

To reconstruct `T` from BWT `L`:

1. Start from row `r` that contains `$` in L (unique).

2. Repeatedly apply ` r ← LF(r)`; collecting characters `L[r]` gives text in reverse order.

3. Stop after `n` steps.

This recovers the original `T`.

???

* Presenter note: If students want code, show simple O(n) algorithm using LF with precomputed Occ and C arrays.

---
# FM-index — BWT plus rank support

FM-index stores:

* BWT (string `L`)
* `C` array (prefix counts)
* A data structure to answer `Occ(c, i)` (rank queries) quickly (often a sampled checkpoint + bitmaps or wavelet trees).

Key property:

* **Backward search**: find all positions where pattern `P` occurs in `T` in O(|P|) time (rank ops O(1) each), independent of |T|.

---

# FM-index — BWT plus rank support

Backward search recurrence (pattern processed right → left):

```
sp = 0; ep = n-1   # current SA interval (inclusive)
for i from m-1 downto 0:
    c = P[i]
    sp = C[c] + Occ(c, sp)
    ep = C[c] + Occ(c, ep+1) - 1
    if sp > ep: no match
```

Final interval `[sp, ep]` gives SA indices (SA[sp..ep]) of all matches.

???

* Presenter note: Explain meaning of `[sp, ep]`: it is the contiguous block in the suffix array corresponding to suffixes that start with the current pattern suffix.

---                                                                                                   
# Worked backward search example (`banana$`)

We have:

* `T = "banana$"` (n = 7)
* `BWT = "annb$aa"`
* Counts: `$`:1, `a`:3, `b`:1, `n`:2
* `C` values (chars sorted `$` < `a` < `b` < `n`):

```
  C['$'] = 0
  C['a'] = 1   (# of chars < 'a' is just '$')
  C['b'] = 1+3 = 4
  C['n'] = 1+3+1 = 5
  ```

We want to find pattern `P = "ana"` (exact occurrences).

???

* Presenter note: We'll compute Occ as prefix table; show small table to make Occ lookups concrete.

---

# Occurrence table (prefix counts for BWT)

Build `Occ(c, i)` as counts in prefixes of BWT (index i from 0..n):

```
BWT indices:   0 1 2 3 4 5 6
BWT chars:     a n n b $ a a
i (prefix):    0 1 2 3 4 5 6 7  (prefix length)
Occ('$', i):   0 0 0 0 0 1 1 1
Occ('a', i):   0 1 1 1 1 1 2 3
Occ('b', i):   0 0 0 0 1 1 1 1
Occ('n', i):   0 0 1 2 2 2 2 2
```

Interpretation: `Occ('a', 1) = 1` because BWT[0] = 'a'.

???

* Presenter note: These small tables are why BWT + rank is practical: Occ lookups are fast with precomputation and sampling.

---

# Backward search steps for `P = "ana"`

Initialize `sp = 0`, `ep = 6` (full SA range).

Process pattern right → left (`'a'`, `'n'`, `'a'`):

1. Character `a`:

```
   sp = C['a'] + Occ('a', sp)   = 1 + Occ('a',0) = 1
   ep = C['a'] + Occ('a', ep+1) - 1 = 1 + Occ('a',7) - 1 = 1 + 3 -1 = 3
   ```

Interval: `[1, 3]`

---

# Backward search steps for `P = "ana"`

Initialize `sp = 0`, `ep = 6` (full SA range).

Process pattern right → left (`'a'`, `'n'`, `'a'`):

2. Character `n`:

```
   sp = C['n'] + Occ('n', 1) = 5 + 0 = 5
   ep = C['n'] + Occ('n', 3+1) - 1 = 5 + Occ('n',4) -1 = 5 + 2 -1 = 6
   ```

Interval: `[5, 6]`

---

# Backward search steps for `P = "ana"`

Initialize `sp = 0`, `ep = 6` (full SA range).

Process pattern right → left (`'a'`, `'n'`, `'a'`):

3. Character `a`:

```
   sp = C['a'] + Occ('a', 5) = 1 + 1 = 2
   ep = C['a'] + Occ('a', 6+1) - 1 = 1 + 3 -1 = 3
   ```

Final interval: `[2, 3]`

Result: SA[2..3] correspond to starting positions `{3,1}` in `T` → matches at positions 1 and 3 (0-based) → `'ana'` occurs at `T[1:4]` and `T[3:6]`.

???

* Presenter note: Highlight that number of Occ/C lookups equals pattern length; search cost O(m) ignoring rank implementation cost.

<!--
# LF mapping & inverse (concrete on `banana$`)

* Example LF formula checked on index `i=0` where `BWT[0]='a'`:

```
  LF(0) = C['a'] + Occ('a', 0) = 1 + 0 = 1
  ```
* Repeated LF iterations move backward through text; starting from row with `$` in BWT (row where BWT char is `$`) reconstructs original string.

* Presenter note: If time allows, walk full inverse iteratively to reconstruct `banana$`.

# Complexity & implementation notes

* Construction:
  * Naive rotations: O(n^2 log n) (impractical).
  * Practical: build suffix array in O(n log n) or O(n); derive BWT from SA in O(n).

* Search:
  * Backward search (exact): O(m) rank queries.
  * With proper rank data structures (bitvectors + sampled checkpoints, or wavelet trees), `Occ` is O(1) or O(log σ).

# Complexity & implementation notes

* Space:
  * FM-index is compact — near entropy of text; stores sampled SA values for locating hits (not just counting).

* Real aligners:
  * Use seeded/backtracking strategies for approximate matching (allow mismatches/indels) using FM-index; then extend with local alignment (e.g. Smith–Waterman).

* Presenter note: Mention practical libraries: SDSL, SeqAn; aligners: BWA, Bowtie, Bowtie2, minimap2 (different tradeoffs). No need for exhaustive list.
-->

---
# How BWT/FM-index gets used in aligners

* Index the **reference genome** once (BWT + FM-index + sampled SA).
* For each read:
  * Use backward search to find exact matches (seed).
  * If no exact seed, backtrack allowing mismatches (bounded) or use spaced seeds.
  * When candidate locations are found, refine with alignment (banded dynamic programming / Smith–Waterman).

* Advantages:
  * Low memory footprint for large references.
  * Fast searches for short patterns (reads).

???

* Presenter note: Optionally show BWA workflow diagram: index → seed → extend → output SAM.

<!--

# Practical caveats

* Exact matching via FM-index is fast, but approximate search requires backtracking which can be expensive if allowed mismatches > few.

* Handling indels commonly requires additional logic (seed + extension, or graph indexes).

* Choice of sampling rate for SA affects memory/time tradeoffs (locating hits from SA interval may need repeated LF steps).

???

* Presenter note: Emphasize engineering tradeoffs crucial to aligner performance.

# Short code snippet (conceptual) — backward search in pseudo-R

```r
# pseudo-code; not optimized
backward_search <- function(bwt, C, occ, pattern) {
  sp <- 0; ep <- length(bwt) - 1
  for (i in nchar(pattern):1) {
    c <- substr(pattern, i, i)
    sp <- C[c] + occ(c, sp)
    ep <- C[c] + occ(c, ep+1) - 1
    if (sp > ep) return(integer(0))
  }
  return(SA[sp:(ep)])  # SA must be accessible or sampled and reconstructed
}
```

???

* Presenter note: Make clear `occ(c,i)` returns prefix counts and `SA` locating requires additional structures; real code uses bitvectors/wavelet trees.

# Further reading

* Original BWT paper and descriptions (Burrows & Wheeler, 1994) — concept & compression.
* Manzini, Ferragina — FM-index papers / tutorials.
* Pachter & Salzberg tutorials and aligner papers (BWA, Bowtie) for practical implementations.

???

* Presenter note: Suggest students read implementation details in BWA or Bowtie2 papers and try small exercises implementing BWT and backward search.

# In-class exercise (optional)

* Compute BWT for `GATTACA$` by hand (rotations → sort → last column).
* Build `C` and `Occ` and search pattern `TA` using backward search.

???

* Presenter note: Provide solution later or use it as a quick group activity.

-->