class: center, middle, inverse, title-slide .title[ # Burrows–Wheeler Transform (BWT) for Genomic Alignment ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2026-02-18 ] --- <!-- HTML style block --> <style> .large { font-size: 130%; } .small { font-size: 70%; } .tiny { font-size: 40%; } </style> <!-- # Outline * Intuition: why BWT helps with compression & search * Constructing the BWT (rotations / suffix array view) * Example: `banana$` (step-by-step) * LF-mapping and inverse BWT (reconstruct text) * FM-index: C array + Occ (rank) — backward search for exact matching * Complexity, implementation notes, and role in aligners (BWA, Bowtie) * References / further reading * Presenter note: Tell students this is one of the foundational ideas used in modern short-read aligners (indexing reference, fast searches). * Presenter note: Emphasize difference between BWT (a transform) and FM-index (a search index built on BWT). --> # Intuition — what BWT does * BWT permutes characters of a string and sorts them - equal characters cluster together (good for compression). * Clustering also makes **rank (count) queries** well behaved and cheap with supporting data structures. * Importantly: BWT itself is reversible (no information lost) if we keep an end-of-string marker `$`. Key property: * BWT enables **backward search** (search the pattern from right → left) using only rank/occurrence queries and a small array `C`. ??? * Presenter note: Use the phrase "cluster similar contexts" — BWT sorts rotations so characters preceding similar contexts end up together. --- # Two equivalent constructions 1. **Rotation table + sort** * List all rotations of the string `T` (append special terminal `$`). * Sort rotations lexicographically. * BWT is the *last column* of this sorted table. 2. **Suffix array (SA) viewpoint** * Build suffix array: sorted list of starting positions of suffixes. * BWT[i] = character preceding suffix SA[i] (character at index `SA[i]-1`, with wrap to last char for SA element 0). <!-- Both produce the same BWT output. --> <!-- * Presenter note: For long genomes we avoid naive rotation table — use suffix arrays or suffix automata. Practical construction algorithms exist in O(n) or O(n log n). --> --- # Example: start text `banana$` We will compute BWT for `banana$` step by step and then use it for exact pattern search. Text `T = "banana$"` (note `$` is lexicographically smallest and unique). | Rotations| | :--- | | `banana$` | | `anana$b` | | `nana$ba` | | `ana$ban` | | `na$bana` | | `a$banan` | | `$banana` | ??? * Presenter note: Remind students `$` is an end marker not present elsewhere; it simplifies uniqueness and reversibility. --- # Example: sort rotations (lexicographically) Sorted rotations (lexicographic order): | Sorted rotations | | :--- | | `$banana` | | `a$banan` | | `ana$ban` | | `anana$b` | | `banana$` | | `na$bana` | | `nana$ba` | Take the **last column** of this table → that is the BWT. ??? * Presenter note: If students ask "why last column?" — explain rotations shift preceding characters to last column, clustering contexts that precede similar suffixes. --- # BWT result (banana$) Sorted rotations (with indexes): | Rotation | Last character | | -------- | :---------: | | $banana | a | | a$banan | n | | ana$ban | n | | anana$b | b | | banana$ | $ | | na$bana | a | | nana$ba | a | --- # BWT result (banana$) The suffix array (SA) of `banana$` is the starting indices of the sorted rotations: .small[ | Index (SA) | Sorted Rotation | Suffix starting at index... | | :--- | :--- | :--- | | **6** | `$banana` | `$` (the 6th index of "banana$") | | **5** | `a$banan` | `a$` (the 5th index) | | **3** | `ana$ban` | `ana$` (the 3rd index) | | **1** | `anana$b` | `anana$` (the 1st index) | | **0** | `banana$` | `banana$` (the 0th index) | | **4** | `na$bana` | `na$` (the 4th index) | | **2** | `nana$ba` | `nana$` (the 2nd index) | ] The string `banana$` is indexed starting from **0**: .small[ | Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | | --- | --- | --- | --- | --- | --- | --- | --- | | **Character** | b | a | n | a | n | a | $ | ] * **SA[0] = 6**: The first sorted rotation starts with the character at index **6** (`$`). * **SA[1] = 5**: The second sorted rotation starts with the character at index **5** (`a`). * **SA[4] = 0**: The fifth sorted rotation starts with the character at index **0** (`b`). The BWT Connection The BWT is the "character preceding the suffix." If a suffix starts at index , the character preceding it is at index . * For **SA[1] = 5** (starts with `a`), the BWT character is `banana$[5-1]`, which is `banana$[4] = 'n'`. * For **SA[4] = 0** (starts with `b`), the BWT character wraps around to the end: `banana$[6] = '$'`. This is why the **last column** of your sorted rotations table and the **character before the suffix** always yield the same result: **"annb$aa"**. ??? * Presenter note: Check students can verify counts of characters (a:3, n:2, b:1, $:1) in both T and BWT. --- # LF-mapping (Last → First) Two important observations: * The **first column** of the sorted rotation table is just the characters of T sorted (i.e., all `$` then `a` characters then `b` then `n` characters). * Each row corresponds to a rotation; the last column character precedes the first column character in the original string. --- # LF-mapping (Last → First) * If `i` is a row index in the sorted table, then `LF(i) = C[ BWT[i] ] + Occ(BWT[i], i)` * `LF(i)` maps the last-column position `i` to the corresponding first-column position which represents the previous character in the original text. Definitions: * `C[c]` = total number of characters in T strictly smaller than `c` (prefix sums over character counts). * `Occ(c, i)` = number of occurrences of `c` in BWT[0..i-1] (a rank operation). ??? * Presenter note: LF is the basis for inverse BWT and for stepping backward through text: follow LF repeatedly to reconstruct the text backward. --- # Step 1 — Start character in the sorted rotations * Row index **i = 5** * Character **BWT[5] = 'a'** (this is the second `'a'` in the BWT string `annb$aa`) `\(LF(i) = C[BWT[i]] + Occ(BWT[i], i)\)` | i | First (F) | Rotation | Last (L = BWT) | |:-:|:---------:|:-----------|:--------------:| | 0 | $ | $banana | a | | 1 | a | a$banan | n | | 2 | a | ana$ban | n | | 3 | a | anana$b | b | | 4 | b | banana$ | $ | | 5 | n | na$bana | **a** | | 6 | n | nana$ba | a | --- # Step 2 — Compute C[c] (C[a] in our case) * `C[c]` = total number of characters in T strictly smaller than `c` | Character | Count | | --------- | ----- | | $ | 1 | | a | 3 | | b | 1 | | n | 2 | Sorted alphabet: `$ < a < b < n` So prefix sums: ``` C['$'] = 0 *C['a'] = 1 C['b'] = 1 + 3 = 4 C['n'] = 1 + 3 + 1 = 5 ``` For our case: ``` C['a'] = 1 ``` --- # Step 3 — Compute Occ('a', i) * `Occ(c, i)` = number of occurrences of `c` in BWT[0..i-1] ``` BWT = a n n b $ a a Index = 0 1 2 3 4 5 6 ``` For `i = 5` we look at `BWT[0..4] = a n n b $`. Number of `'a'` in that prefix: `Occ('a', 5) = 1` --- # Step 4 — Apply LF formula `\(LF(5) = C['a'] + Occ('a', 5)\)` `\(LF(5) = 1 + 1 = 2\)` --- # Interpretation ``` LF(5) = 2 ``` * Row 5 (character `'a'` in last column) * Maps to row 2 in the first column * That row represents the suffix that is preceded by this `'a'` in the original text ```markdown Row 5 (L = a) ↓ LF Row 2 (F = a) ``` This is the **Last → First** correspondence: the 2nd `'a'` in the last column corresponds to the 2nd `'a'` in the first column. --- # LF ntuition * The first column (F) is just all characters sorted. * Characters in L are grouped in the same relative order. * `Occ(c,i)` determines *which copy* of character `c` we are referring to. * `C[c]` tells us where the block of `c` begins in F. * Adding them jumps to the matching row in F. --- # Inverse BWT via LF mapping To reconstruct `T` from BWT `L`: 1. Start from row `r` that contains `$` in L (unique). 2. Repeatedly apply ` r ← LF(r)`; collecting characters `L[r]` gives text in reverse order. 3. Stop after `n` steps. This recovers the original `T`. ??? * Presenter note: If students want code, show simple O(n) algorithm using LF with precomputed Occ and C arrays. --- # FM-index — BWT plus rank support FM-index stores: * BWT (string `L`) * `C` array (prefix counts) * A data structure to answer `Occ(c, i)` (rank queries) quickly (often a sampled checkpoint + bitmaps or wavelet trees). Key property: * **Backward search**: find all positions where pattern `P` occurs in `T` in O(|P|) time (rank ops O(1) each), independent of |T|. --- # FM-index — BWT plus rank support Backward search recurrence (pattern processed right → left): ``` sp = 0; ep = n-1 # current SA interval (inclusive) for i from m-1 downto 0: c = P[i] sp = C[c] + Occ(c, sp) ep = C[c] + Occ(c, ep+1) - 1 if sp > ep: no match ``` Final interval `[sp, ep]` gives SA indices (SA[sp..ep]) of all matches. ??? * Presenter note: Explain meaning of `[sp, ep]`: it is the contiguous block in the suffix array corresponding to suffixes that start with the current pattern suffix. --- # Worked backward search example (`banana$`) We have: * `T = "banana$"` (n = 7) * `BWT = "annb$aa"` * Counts: `$`:1, `a`:3, `b`:1, `n`:2 * `C` values (chars sorted `$` < `a` < `b` < `n`): ``` C['$'] = 0 C['a'] = 1 (# of chars < 'a' is just '$') C['b'] = 1+3 = 4 C['n'] = 1+3+1 = 5 ``` We want to find pattern `P = "ana"` (exact occurrences). ??? * Presenter note: We'll compute Occ as prefix table; show small table to make Occ lookups concrete. --- # Occurrence table (prefix counts for BWT) Build `Occ(c, i)` as counts in prefixes of BWT (index i from 0..n): ``` BWT indices: 0 1 2 3 4 5 6 BWT chars: a n n b $ a a i (prefix): 0 1 2 3 4 5 6 7 (prefix length) Occ('$', i): 0 0 0 0 0 1 1 1 Occ('a', i): 0 1 1 1 1 1 2 3 Occ('b', i): 0 0 0 0 1 1 1 1 Occ('n', i): 0 0 1 2 2 2 2 2 ``` Interpretation: `Occ('a', 1) = 1` because BWT[0] = 'a'. ??? * Presenter note: These small tables are why BWT + rank is practical: Occ lookups are fast with precomputation and sampling. --- # Backward search steps for `P = "ana"` Initialize `sp = 0`, `ep = 6` (full SA range). Process pattern right → left (`'a'`, `'n'`, `'a'`): 1. Character `a`: ``` sp = C['a'] + Occ('a', sp) = 1 + Occ('a',0) = 1 ep = C['a'] + Occ('a', ep+1) - 1 = 1 + Occ('a',7) - 1 = 1 + 3 -1 = 3 ``` Interval: `[1, 3]` --- # Backward search steps for `P = "ana"` Initialize `sp = 0`, `ep = 6` (full SA range). Process pattern right → left (`'a'`, `'n'`, `'a'`): 2. Character `n`: ``` sp = C['n'] + Occ('n', 1) = 5 + 0 = 5 ep = C['n'] + Occ('n', 3+1) - 1 = 5 + Occ('n',4) -1 = 5 + 2 -1 = 6 ``` Interval: `[5, 6]` --- # Backward search steps for `P = "ana"` Initialize `sp = 0`, `ep = 6` (full SA range). Process pattern right → left (`'a'`, `'n'`, `'a'`): 3. Character `a`: ``` sp = C['a'] + Occ('a', 5) = 1 + 1 = 2 ep = C['a'] + Occ('a', 6+1) - 1 = 1 + 3 -1 = 3 ``` Final interval: `[2, 3]` Result: SA[2..3] correspond to starting positions `{3,1}` in `T` → matches at positions 1 and 3 (0-based) → `'ana'` occurs at `T[1:4]` and `T[3:6]`. ??? * Presenter note: Highlight that number of Occ/C lookups equals pattern length; search cost O(m) ignoring rank implementation cost. <!-- # LF mapping & inverse (concrete on `banana$`) * Example LF formula checked on index `i=0` where `BWT[0]='a'`: ``` LF(0) = C['a'] + Occ('a', 0) = 1 + 0 = 1 ``` * Repeated LF iterations move backward through text; starting from row with `$` in BWT (row where BWT char is `$`) reconstructs original string. * Presenter note: If time allows, walk full inverse iteratively to reconstruct `banana$`. # Complexity & implementation notes * Construction: * Naive rotations: O(n^2 log n) (impractical). * Practical: build suffix array in O(n log n) or O(n); derive BWT from SA in O(n). * Search: * Backward search (exact): O(m) rank queries. * With proper rank data structures (bitvectors + sampled checkpoints, or wavelet trees), `Occ` is O(1) or O(log σ). # Complexity & implementation notes * Space: * FM-index is compact — near entropy of text; stores sampled SA values for locating hits (not just counting). * Real aligners: * Use seeded/backtracking strategies for approximate matching (allow mismatches/indels) using FM-index; then extend with local alignment (e.g. Smith–Waterman). * Presenter note: Mention practical libraries: SDSL, SeqAn; aligners: BWA, Bowtie, Bowtie2, minimap2 (different tradeoffs). No need for exhaustive list. --> --- # How BWT/FM-index gets used in aligners * Index the **reference genome** once (BWT + FM-index + sampled SA). * For each read: * Use backward search to find exact matches (seed). * If no exact seed, backtrack allowing mismatches (bounded) or use spaced seeds. * When candidate locations are found, refine with alignment (banded dynamic programming / Smith–Waterman). * Advantages: * Low memory footprint for large references. * Fast searches for short patterns (reads). ??? * Presenter note: Optionally show BWA workflow diagram: index → seed → extend → output SAM. <!-- # Practical caveats * Exact matching via FM-index is fast, but approximate search requires backtracking which can be expensive if allowed mismatches > few. * Handling indels commonly requires additional logic (seed + extension, or graph indexes). * Choice of sampling rate for SA affects memory/time tradeoffs (locating hits from SA interval may need repeated LF steps). ??? * Presenter note: Emphasize engineering tradeoffs crucial to aligner performance. # Short code snippet (conceptual) — backward search in pseudo-R ```r # pseudo-code; not optimized backward_search <- function(bwt, C, occ, pattern) { sp <- 0; ep <- length(bwt) - 1 for (i in nchar(pattern):1) { c <- substr(pattern, i, i) sp <- C[c] + occ(c, sp) ep <- C[c] + occ(c, ep+1) - 1 if (sp > ep) return(integer(0)) } return(SA[sp:(ep)]) # SA must be accessible or sampled and reconstructed } ``` ??? * Presenter note: Make clear `occ(c,i)` returns prefix counts and `SA` locating requires additional structures; real code uses bitvectors/wavelet trees. # Further reading * Original BWT paper and descriptions (Burrows & Wheeler, 1994) — concept & compression. * Manzini, Ferragina — FM-index papers / tutorials. * Pachter & Salzberg tutorials and aligner papers (BWA, Bowtie) for practical implementations. ??? * Presenter note: Suggest students read implementation details in BWA or Bowtie2 papers and try small exercises implementing BWT and backward search. # In-class exercise (optional) * Compute BWT for `GATTACA$` by hand (rotations → sort → last column). * Build `C` and `Occ` and search pattern `TA` using backward search. ??? * Presenter note: Provide solution later or use it as a quick group activity. -->