UMAP for Single-Cell RNA-seq

class: center, middle, inverse, title-slide

.title[
# UMAP for Single-Cell RNA-seq
]
.subtitle[
## Modern Dimensionality Reduction via Manifold Learning
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-04-08
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

# UMAP: What's Different?

**UMAP** = Uniform Manifold Approximation and Projection (McInnes et al., 2018)

.pull-left[
**t-SNE recap**:
- Probabilistic framework
- Preserves local neighborhoods
- Heavy-tailed distributions
- Slow: `$O(n^2)$` or `$O(n \log n)$`
- Poor global structure
]

.pull-right[
**UMAP**:
- **Topological framework**
- Preserves local structure + more of the global structure
- Manifold learning theory
- Fast: empirically about `$O(n^{1.14})$`
- Better scalability
]

**Key innovation**: Assumes the data lie on a Riemannian manifold (a smooth manifold whose tangent spaces carry a smoothly varying inner product) and uses tools from algebraic topology to approximate it

???

**Presenter's Notes:**

UMAP takes a fundamentally different mathematical approach than t-SNE, though they solve similar problems.

**Core difference**: 
- t-SNE: "Make nearby points stay nearby using probabilities"
- UMAP: "Preserve the topological structure of the manifold"

**Why topology?**
- Topology studies properties preserved under continuous deformations
- Perfect for dimensionality reduction - we're "bending" high-D space into low-D
- More principled mathematical foundation than t-SNE

**Practical impact**: UMAP is now the default in many tools (Scanpy, Seurat v4+) because it's faster and preserves more structure.

---
# Mathematical Foundation

**1. High-Dimensional Manifold**

Assume the data lie on a Riemannian manifold `$\mathcal{M}$` embedded in `$\mathbb{R}^d$`

**Local metric**: Each point has local distance structure
- Varies by density (like t-SNE's adaptive `$\sigma_i$`)
- Fuzzy simplicial complex represents topology

---
# Mathematical Foundation

**2. Fuzzy Set Representation**

For each point `$x_i$`, compute fuzzy set membership:

`$$w_{ij} = \exp\left(-\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$`

where:
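- `$d(x_i, x_j)$` = distance between points `$x_i$` and `$x_j$` in the high-dimensional space
- `$\rho_i$` = distance from `$x_i$` to its nearest neighbor
- `$\sigma_i$` = bandwidth, chosen so that `$\sum_j w_{ij} = \log_2(k)$` with `$k$` = `n_neighbors`

**Fuzzy union** of all local sets, `$w_{ij} + w_{ji} - w_{ij} w_{ji}$` → high-dimensional topological structure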

???

**Presenter's Notes:**

Don't worry if the topology feels abstract - the practical intuition is more important.

**Intuition**:
- Each point defines a "neighborhood" of nearby points
- These neighborhoods overlap to form a structure
- UMAP tries to preserve this structure in low dimensions

**Key difference from t-SNE**:
- t-SNE: Gaussian similarity → probabilities
- UMAP: Exponential membership → fuzzy topology

**Fuzzy sets**: Instead of "point j is or isn't a neighbor," we have "point j is 0.7 in the neighborhood"

**Parameters encoded**:
- `$\rho_i$`: Distance to closest neighbor (ensures connectivity)
- `$\sigma_i$`: Bandwidth (determined by `n_neighbors` parameter)
- These adapt to local density, like t-SNE's perplexity
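
**Sketch of the membership calculation**: a minimal pure-Python illustration of the formula on the slide, assuming the `$\log_2(k)$` calibration target described in the UMAP paper. The function names (`membership_weights`, `fuzzy_union`) and the toy distances are made up for this example; real implementations add further details (approximate neighbor search, handling of ties).

```python
import math

def membership_weights(dists, n_iter=64):
    """Directed fuzzy memberships w_ij for one point x_i.

    dists: distances from x_i to its k nearest neighbors (k >= 3 assumed).
    Returns (rho, sigma, weights).
    """
    k = len(dists)
    rho = min(dists)           # distance to the nearest neighbor
    target = math.log2(k)      # calibration target for the sum of weights

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in dists)

    lo, hi = 1e-6, 1e3         # bracket for the bisection search on sigma
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if total(mid) > target:
            hi = mid           # weights too large: shrink sigma
        else:
            lo = mid           # weights too small: grow sigma
    sigma = (lo + hi) / 2
    weights = [math.exp(-max(0.0, d - rho) / sigma) for d in dists]
    return rho, sigma, weights

def fuzzy_union(w_ij, w_ji):
    """Symmetrize two directed memberships (probabilistic union)."""
    return w_ij + w_ji - w_ij * w_ji

# Toy example: distances from one cell to its 5 nearest neighbors
rho, sigma, w = membership_weights([0.5, 0.8, 1.0, 1.5, 2.0])
```

The nearest neighbor always receives weight 1 (this is what `$\rho_i$` guarantees), and a larger `n_neighbors` raises the target sum, which widens `$\sigma_i$`.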

---
# Mathematical Foundation

**3. Low-Dimensional Optimization**

Similar fuzzy set in low-dimensional space `$\mathbb{R}^k$`:

`$$v_{ij} = \left(1 + a(||y_i - y_j||^2)^b\right)^{-1}$$`

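For `min_dist = 0.1`: `$a \approx 1.58$`, `$b \approx 0.90$` (fitted from `min_dist`; `$a \approx 1.93$`, `$b \approx 0.79$` corresponds to `min_dist = 0.001`)

**Objective**: Cross-entropy between fuzzy sets (attraction + repulsion terms)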

`$$C = \sum_{ij} w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1-w_{ij})\log\frac{1-w_{ij}}{1-v_{ij}}$$`

**Optimization**: Stochastic gradient descent with negative sampling

???

**Presenter's Notes:**

The low-dimensional similarity function looks different from both t-SNE's t-distribution and the high-D exponential.

**Why this form?**
- Smooth approximation to a curve that equals 1 up to `min_dist` and decays exponentially beyond it
- `$a$` and `$b$` control spread (set by `min_dist` parameter)
- More flexible than t-SNE's fixed `$(1 + d^2)^{-1}$`
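
**Sketch of how `$a$` and `$b$` follow from `min_dist`**: the snippet below fits the low-dimensional curve to a target that is 1 up to `min_dist` and decays exponentially afterwards. It is a rough illustration only: it assumes `spread = 1` and uses a coarse grid search in place of a proper nonlinear least-squares fit, and the function names are invented for this example.

```python
import math

def target_curve(x, min_dist, spread=1.0):
    """Piecewise target: 1 inside min_dist, exponential decay outside."""
    return 1.0 if x < min_dist else math.exp(-(x - min_dist) / spread)

def low_dim_similarity(x, a, b):
    """v = 1 / (1 + a * x^(2b)), where x is the embedding distance."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))

def fit_ab(min_dist, spread=1.0, n_points=300):
    """Coarse grid search for (a, b) minimizing squared error to the target."""
    xs = [3.0 * spread * i / (n_points - 1) for i in range(n_points)]
    ys = [target_curve(x, min_dist, spread) for x in xs]
    best = None
    for i in range(10, 61):        # a from 0.50 to 3.00 in steps of 0.05
        a = i / 20
        for j in range(10, 31):    # b from 0.50 to 1.50 in steps of 0.05
            b = j / 20
            sse = sum((low_dim_similarity(x, a, b) - y) ** 2
                      for x, y in zip(xs, ys))
            if best is None or sse < best[0]:
                best = (sse, a, b)
    return best[1], best[2]

a, b = fit_ab(min_dist=0.1)
```

Rerunning `fit_ab` with a larger `min_dist` flattens the curve near zero, which is why points are allowed to sit further apart in the embedding.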

**Cross-entropy vs KL divergence**:
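- UMAP's cross-entropy has two terms: one for neighbors (`$w_{ij}$`) and one for non-neighbors (`$1 - w_{ij}$`)
- t-SNE's KL divergence has only the neighbor term; its repulsion comes from normalizing over all pairs
- Cross-entropy: balanced attention to attractions and repulsions
- No normalization over all pairs, which suits stochastic optimization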
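
**Sketch of the two cost terms**: a small numeric illustration of the cross-entropy on the slide, evaluated for a single pair. The helper name `fuzzy_cross_entropy` and the example values (0.9, 0.1) are chosen only for illustration.

```python
import math

def fuzzy_cross_entropy(w, v, eps=1e-12):
    """Return the (attraction, repulsion) terms of the cost for one pair."""
    w = min(max(w, eps), 1.0 - eps)   # keep the logarithms finite
    v = min(max(v, eps), 1.0 - eps)
    attraction = w * math.log(w / v)
    repulsion = (1.0 - w) * math.log((1.0 - w) / (1.0 - v))
    return attraction, repulsion

# True neighbors (w high) embedded far apart (v low):
far_neighbors = fuzzy_cross_entropy(0.9, 0.1)

# Non-neighbors (w low) embedded close together (v high):
close_strangers = fuzzy_cross_entropy(0.1, 0.9)

# Embedding agrees with the high-dimensional graph:
matched = fuzzy_cross_entropy(0.5, 0.5)
```

In the first case the attraction term carries the penalty, in the second the repulsion term does, and when `v` equals `w` both terms vanish.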

**Negative sampling**:
- Can't compute all `$n^2$` pairs
- Sample random "non-neighbor" pairs for repulsion
- Dramatically speeds up optimization
- Key to UMAP's `$O(n^{1.14})$` complexity
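
**Sketch of SGD with negative sampling**: a toy optimizer that applies the gradient of the attraction term along each graph edge and the gradient of the repulsion term against a few randomly sampled points. This is a simplified illustration, not the reference implementation: it treats all edges as equally weighted, fixes `A` and `B` at assumed values (roughly those for `min_dist = 0.1`), and all function names are invented here.

```python
import math
import random

A, B = 1.577, 0.895   # assumed curve parameters (roughly min_dist = 0.1)

def clip(x, limit=4.0):
    """Bound each gradient component to keep updates stable."""
    return max(-limit, min(limit, x))

def attract(yi, yj, lr):
    """Move yi toward its graph neighbor yj."""
    d2 = sum((p - q) ** 2 for p, q in zip(yi, yj))
    if d2 == 0.0:
        return list(yi)
    coeff = -2.0 * A * B * d2 ** (B - 1.0) / (1.0 + A * d2 ** B)
    return [p + lr * clip(coeff * (p - q)) for p, q in zip(yi, yj)]

def repel(yi, yk, lr):
    """Push yi away from a randomly sampled point yk."""
    d2 = sum((p - q) ** 2 for p, q in zip(yi, yk))
    coeff = 2.0 * B / ((0.001 + d2) * (1.0 + A * d2 ** B))
    return [p + lr * clip(coeff * (p - q)) for p, q in zip(yi, yk)]

def optimize(edges, n_points, n_epochs=200, n_negative=5, seed=0):
    rng = random.Random(seed)
    y = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n_points)]
    for epoch in range(n_epochs):
        lr = 1.0 - epoch / n_epochs          # linearly decaying step size
        for i, j in edges:
            y[i] = attract(y[i], y[j], lr)
            for _ in range(n_negative):      # negative sampling
                k = rng.randrange(n_points)
                if k != i:
                    y[i] = repel(y[i], y[k], lr)
    return y

# Toy graph: two triangles (0-1-2 and 3-4-5) with no edges between them
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0),
         (3, 4), (4, 3), (4, 5), (5, 4), (3, 5), (5, 3)]
embedding = optimize(edges, n_points=6)
```

Each edge update costs a constant amount of work (one attraction plus `n_negative` repulsions), so an epoch scales with the number of edges rather than with all `$n^2$` pairs.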

**The math is complex, but implementation is straightforward** - most users never need to understand the topology!

---
# Key Hyperparameters

**1. n_neighbors (like perplexity)**

Controls local vs. global structure balance

![](03_UMAP_files/figure-html/unnamed-chunk-1-1.png)

- **Low (5-10)**: Emphasizes fine structure, more clusters
- **Default (15)**: Good balance for most applications
- **High (30-50)**: Emphasizes global structure, fewer clusters

**Rule**: Values somewhat smaller than a typical t-SNE perplexity (e.g., 15 vs. 30) usually work well

???

**Presenter's Notes:**

`n_neighbors` is UMAP's most important parameter, analogous to t-SNE's perplexity.

**Relationship to perplexity**:
- Similar conceptual role
- But UMAP typically uses smaller values (15 vs 30)
- UMAP is less sensitive to this parameter than t-SNE is to perplexity

**Effects**:
- **n_neighbors = 5**: 
  - Very local, picks up fine details
  - May fragment coherent populations
  - Good for detecting rare cell types
  
- **n_neighbors = 15** (default):
  - Works well for most scRNA-seq
  - Balances detail and structure
  
- **n_neighbors = 50**:
  - More global view
  - May merge related populations
  - Good for hierarchical structure

**Practical advice**:
- Start with 15
- Decrease if you suspect rare cell types are being missed
- Increase if structure looks overly fragmented
- Less critical than perplexity in t-SNE (UMAP is more robust)
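
**Sketch of why `n_neighbors` changes fragmentation**: UMAP builds its graph from each point's nearest neighbors, so a small value can leave the graph in many disconnected pieces while a large value links them. The toy example below counts connected components of a k-nearest-neighbor graph on made-up 1-D points; `knn_components` is a hypothetical helper, not part of any UMAP package.

```python
def knn_components(points, k):
    """Count connected components of the undirected k-nearest-neighbor graph."""
    n = len(points)
    parent = list(range(n))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Indices of the other points, sorted by distance to point i
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(points[i] - points[j]))
        for j in order[:k]:
            parent[find(i)] = find(j)   # link i with each of its k neighbors
    return len({find(i) for i in range(n)})

# Two populations (around 0-4 and 10-14), each with two tight sub-groups
points = [0, 1, 3, 4, 10, 11, 13, 14]
counts = {k: knn_components(points, k) for k in (1, 2, 4)}
```

With these points the graph goes from four pieces (k = 1) to two (k = 2) to one (k = 4), mirroring the "fragmented" versus "merged" behavior described above.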

---
# Key Hyperparameters

**2. min_dist**

Controls tightness of clusters in embedding

![](03_UMAP_files/figure-html/unnamed-chunk-2-1.png)

- **Low (0.0-0.05)**: Dense, compact clusters (discrete cell types)
- **Default (0.1)**: Balanced for most cases
- **High (0.3-0.5)**: More dispersed, continuous structure (trajectories)

**Use low min_dist** for discrete cell types, **high for trajectories**

???

**Presenter's Notes:**

`min_dist` controls the minimum distance between points in the embedding - essentially how tightly packed the embedding is.

**Technical meaning**:
- Minimum distance allowed between points in low-D
- Controls the `$a$` and `$b$` parameters in the similarity function
- Affects both cluster tightness and spacing

**Effects**:
- **min_dist = 0.0**: 
  - Very tight clusters
  - Clear separation between groups
  - Good for discrete cell types (immune cell classification)
  - May create artificial gaps
  
- **min_dist = 0.1** (default):
  - Good balance
  - Works for most scRNA-seq applications
  
- **min_dist = 0.5**:
  - Loose, spread out
  - Better reveals continuous structure
  - Good for developmental trajectories
  - Less dramatic cluster separation

**Interaction with n_neighbors**:
- High n_neighbors + high min_dist → very global, smooth
- Low n_neighbors + low min_dist → very local, fragmented

**Recommendation**:
- Use default (0.1) initially
- Adjust based on biology: discrete types → lower, trajectories → higher

---
# UMAP vs. t-SNE

| Feature | t-SNE | UMAP |
|---------|-------|------|
| **Speed** | Slow (`$O(n \log n)$`) | Fast (`$O(n^{1.14})$`) |
| **Scalability** | <50k cells practical | 100k+ cells routine |
| **Global structure** | Poor | Better preserved |
| **Local structure** | Excellent | Very good |
| **Determinism** | Stochastic | Stochastic, but more stable across runs |
| **Cluster separation** | Often clearer | Sometimes more realistic |
| **Distances** | Not meaningful | More meaningful (but still limited) |
| **Theory** | Probabilistic | Topological |
| **Trajectories** | Can break | Better preserved |

???

**Presenter's Notes:**

This is the practical comparison your students need.

**Speed and scalability** - UMAP wins decisively:
- t-SNE: 10k cells = minutes, 50k cells = hours
- UMAP: 10k cells = seconds, 100k cells = minutes
- For modern large datasets, this is critical

**Structure preservation**:
- t-SNE: Excellent local, poor global
- UMAP: Good local, decent global
- UMAP's global structure still not perfect, but much better than t-SNE

**Cluster separation**:
- t-SNE often creates cleaner gaps between clusters
- UMAP may show more continuous structure
- Which is "better" depends on biology - are cell types discrete or continuous?

**Practical experience**:
- UMAP results are more consistent across runs
- Less sensitive to parameter changes
- Easier to get "good" results quickly

**Cultural/field considerations**:
- Immunology community: both widely accepted
- Neuroscience: increasingly UMAP
- Developmental biology: UMAP preferred for trajectories
- Some reviewers still expect t-SNE - know your audience

**My recommendation**: Start with UMAP, compare with t-SNE if needed. If results are similar, use UMAP (faster, better properties). If very different, investigate why.

---
# Implementation Example

``` r
library(Seurat)
library(ggplot2)    # ggtitle()
library(patchwork)  # `|` operator for side-by-side plots

# Standard preprocessing (same as t-SNE)
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj, nfeatures = 2000)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj, npcs = 50)

# UMAP with umap-learn's default parameters
# (Seurat's own defaults are n.neighbors = 30, min.dist = 0.3)
seurat_obj <- RunUMAP(seurat_obj,
                      dims = 1:30,          # Use first 30 PCs
                      n.neighbors = 15,
                      min.dist = 0.1)

# Compare with t-SNE
seurat_obj <- RunTSNE(seurat_obj, dims = 1:30, perplexity = 30)

# Visualize both
p1 <- DimPlot(seurat_obj, reduction = "umap") + ggtitle("UMAP")
p2 <- DimPlot(seurat_obj, reduction = "tsne") + ggtitle("t-SNE")
p1 | p2
```

???

**Presenter's Notes:**

Implementation is straightforward - very similar workflow to t-SNE.

**Key points**:

1. **Same preprocessing**: UMAP uses same QC, normalization, HVG selection, PCA
2. **Runs on PCs**: Like t-SNE, almost always run on PC space, not raw data
3. **Faster**: You'll notice UMAP completes much quicker than t-SNE

**Parameter choices**:
- `dims = 1:30`: Use same PC selection as you would for t-SNE
- `n.neighbors = 15`: Good starting point, roughly equivalent to perplexity ~15-20 (umap-learn's default; Seurat's own default is 30)
- `min.dist = 0.1`: umap-learn's default (Seurat's own default is 0.3)

**Seurat vs Scanpy**:
- Seurat: Uses uwot (R implementation)
- Scanpy: Uses umap-learn (Python implementation)
- Results should be similar but not identical
- Both are well-maintained

**Side-by-side comparison**:
- Always good to generate both UMAP and t-SNE initially
- Check if they show consistent structure
- If very different, investigate parameters and data quality
- Choose based on which better represents known biology

**Modern trend**: Many papers now show only UMAP, especially for large datasets

<!---
# What UMAP Does Better

**1. Preserves Global Structure**
- Distances between clusters more meaningful
- Can infer cluster relationships (with caution)
- Better for hierarchical structure

**2. Trajectories**
- Maintains continuous transitions
- Better for developmental processes
- Doesn't artificially fragment temporal processes

**3. Scalability**
- 100,000+ cells routine
- Modern datasets often this large
- Essential for atlas-scale projects

**4. Speed**
- 10-100× faster than t-SNE
- Enables interactive parameter exploration
- Faster iterations during analysis

**5. Reproducibility**
- More stable across runs
- Less sensitive to initialization
- Easier to reproduce results

???

**Presenter's Notes:**

These are UMAP's genuine advantages - situations where it performs objectively better than t-SNE.

**1. Global structure**:
Example: You have HSCs, progenitors, and mature cells. In UMAP, their relative positions are more meaningful - you can see HSCs → progenitors → mature as a progression. In t-SNE, this relationship is lost.

**2. Trajectories**:
Real case: Developmental time course
- t-SNE might show: Day0 cluster | Day2 cluster | Day4 cluster (disconnected)
- UMAP shows: Smooth progression Day0 → Day2 → Day4
- UMAP doesn't break the trajectory into pieces

**3. Scalability**:
Human Cell Atlas projects have millions of cells. Only UMAP is practical.

**4. Speed example**:
10,000 cells, 30 PCs:
- t-SNE: 5-10 minutes
- UMAP: 10-30 seconds
This matters when you're trying different parameters

**5. Reproducibility**:
- Run t-SNE 5 times with different seeds → can get noticeably different results
- Run UMAP 5 times with different seeds → usually much more consistent
- Better for collaborative projects and publications

**When these matter most**:
- Atlas projects: need speed and scale
- Developmental biology: need trajectory preservation
- Comparative studies: need global structure
- High-throughput screening: need speed for many samples

# What t-SNE Still Does Better

**1. Cluster Separation**
- Often creates clearer visual gaps
- Easier to identify distinct populations
- Better for discrete cell types

**2. Publication Precedent**
- More established in literature
- Reviewers familiar with interpretation
- Some fields expect it

**3. Local Detail**
- Can reveal finer substructure
- May detect rare populations more clearly
- When you care only about local neighborhoods

**When to still use t-SNE**:
- Small-medium datasets (<10,000 cells)
- Need maximum cluster clarity for presentation
- Established pipelines in your lab/field
- Want to show both methods for comparison

**Best practice**: Generate both, see which tells the biological story better

???

**Presenter's Notes:**

t-SNE isn't obsolete - there are still cases where it's the better choice.

**1. Cluster separation**:
t-SNE often creates more dramatic gaps between clusters. For a presentation where you want to show "Look, 5 distinct cell types!" t-SNE may make this more visually striking.

**Example**:
- UMAP might show: clusters with some overlap/continuity
- t-SNE shows: clear islands with water between them
- If clusters truly are discrete, t-SNE's representation may be more intuitive

**2. Publication considerations**:
Some reviewers (especially older generation) are more comfortable with t-SNE:
- It's been around longer (2008 vs 2018)
- More papers have used it
- Some may question UMAP as "too new"

**3. Local detail**:
When you specifically care about "Are these cells neighbors?" and don't care about global relationships, t-SNE's strong local focus is an advantage.

**Real decision tree**:
```
Is dataset >50k cells? → Use UMAP
Are you studying trajectories? → Use UMAP
Do you need it fast? → Use UMAP
Is this exploratory? → Try both
Is field conservative? → Show t-SNE (and UMAP in supplement)
Do results match? → Use whichever looks better for your story
```

**Honest assessment**: UMAP is becoming the default for good reasons, but t-SNE isn't wrong - they're tools with different strengths.
-->

---
# Practical Recommendations

1. **Always run PCA first** (20-50 components), and compare the UMAP embedding with the PCA plot.

2. **Generate UMAP** (default parameters)
   - n_neighbors = 15
   - min_dist = 0.1

3. **Check structure** against known biology

4. **If needed, adjust**:
   - Fragmented? → Increase n_neighbors to 30
   - Over-merged? → Decrease n_neighbors to 10
   - Need tighter clusters? → Decrease min_dist to 0.01
   - Have trajectories? → Increase min_dist to 0.3

5. **Compare with t-SNE** if uncertain

6. **Validate** all findings in PC space

???

**Presenter's Notes:**

A simple, systematic workflow for your analyses.

**Step 1 - PCA**:
This is non-negotiable. Always do PCA first:
- Dimensionality reduction
- Denoising
- Computational efficiency
- Makes UMAP results better

**Step 2 - Default UMAP**:
Start with defaults because:
- They work well for most datasets
- Easier to explain/reproduce
- Only adjust if there's a clear problem

**Step 3 - Biology check**:
Ask: "Do known cell types separate?"
- If yes: parameters are good
- If no: adjust or check preprocessing

**Step 4 - Systematic adjustment**:

Problem: "My T cells are split into many tiny clusters"
→ Increase n_neighbors (15 → 30)

Problem: "My CD4 and CD8 T cells are merged"
→ Decrease n_neighbors (15 → 10)

Problem: "I can't see where one cluster ends and another begins"
→ Decrease min_dist (0.1 → 0.01)

Problem: "My differentiation trajectory looks disconnected"
→ Increase min_dist (0.1 → 0.3)

**Step 5 - t-SNE comparison**:
If UMAP gives unexpected results, compare with t-SNE:
- Similar? → Structure is robust
- Different? → Investigate which is more biologically accurate

**Step 6 - Validation**:
Never forget: dimensionality reduction is for visualization. All biological conclusions must be validated in the original high-dimensional space.

<!---
# Common Pitfalls

**Don't**:

1. **Skip PCA preprocessing** → Same as t-SNE, always use PCA first

2. **Over-interpret distances** → Better than t-SNE, but still limited

3. **Ignore batch effects** → UMAP will show them clearly

4. **Use only UMAP for clustering** → Cluster in PC space, visualize with UMAP

5. **Assume global distances are perfect** → Better ≠ perfect

6. **Compare embeddings across datasets** → Each embedding is independent

**Do**:

1. **Report parameters** (n_neighbors, min_dist, n_pcs)

2. **Set random seed** for reproducibility

3. **Try multiple parameter values** if results seem off

4. **Validate with marker genes** in original expression space

???

**Presenter's Notes:**

Many pitfalls are shared with t-SNE, but some are UMAP-specific.

**Pitfall 1 - Skip PCA**:
I still see papers running UMAP directly on normalized counts. Don't do this. Results will be:
- Dominated by technical noise
- Computationally expensive
- Poor quality

**Pitfall 2 - Over-interpret distances**:
UMAP's global structure is better than t-SNE, but it's not PCA. Don't conclude "Cluster A is 2× more similar to B than to C" from UMAP distances. Better than t-SNE doesn't mean perfect.

**Pitfall 3 - Batch effects**:
UMAP will clearly show batch effects as separate clusters. This is actually helpful (easy to detect), but you must correct them before biological interpretation.

**Pitfall 4 - Clustering**:
```r
# Wrong:
clusters <- kmeans(umap_coords, k=5)  # ✗

# Right:
clusters <- FindClusters(seurat_obj, resolution=0.8)  # ✓ (uses PC space)
```

**Pitfall 5 - Global distances**:
Yes, UMAP preserves more global structure than t-SNE. No, you still can't measure distances quantitatively and claim statistical significance.

**Pitfall 6 - Cross-dataset comparison**:
Can't overlay UMAP from dataset A onto dataset B. Each UMAP is specific to its training data. For integration, use proper integration methods first.

**Remember**: UMAP is a visualization tool with better properties than t-SNE, but it's still a visualization tool.
-->

---
# Advanced: Supervised UMAP

**Standard UMAP**: Unsupervised dimensionality reduction

**Supervised UMAP**: Incorporate label information

``` python
# In Python with umap-learn
import umap

# X: cells x features matrix (e.g., PCs); cell_labels: integer-coded cell types
reducer = umap.UMAP(n_neighbors=15, target_metric='categorical')
embedding = reducer.fit_transform(X, y=cell_labels)
```

**Use cases**:
- Emphasize separation of known cell types
- Focus embedding on specific biological variation
- Guided dimensionality reduction

**Warning**: Can over-emphasize labeled variation, hide novel biology

**Semi-supervised UMAP**: Partial labels (some cells labeled, others not)

???

**Presenter's Notes:**

Supervised UMAP is an advanced feature not available in standard t-SNE.

**What it does**:
- Incorporates known labels into the optimization
- Pulls labeled groups apart
- Makes separation clearer for known categories

**When to use**:
1. **Validation studies**: You know some cell types, want to see if they separate
2. **Annotation transfer**: Label some cells, want to separate groups clearly
3. **Focused analysis**: Have specific hypothesis about certain populations

**Example**:
You've FACS-sorted CD4 and CD8 T cells separately, then profiled. You know the labels. Supervised UMAP will ensure these separate maximally.

**Dangers**:
- Forces separation even if not present in data
- May hide interesting biology (what if CD4 and CD8 overlap in your condition?)
- Can lead to confirmation bias
- Should not be used for discovery

**My recommendation**:
- Run unsupervised UMAP first (standard analysis)
- If needed for visualization, try supervised UMAP
- Never use supervised for discovery
- Always disclose in methods if you use it

**Semi-supervised**:
- Some cells labeled, most unlabeled
- Guides embedding based on known structure
- Interesting for atlas projects where some cells are well-characterized
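
**Sketch of preparing partial labels**: umap-learn's documentation describes marking unlabeled points with `-1` for the semi-supervised case. The helper below (a hypothetical name, `encode_partial_labels`) shows one way to build such a label vector from a list where unlabeled cells are `None`; the cell-type names are invented for the example.

```python
def encode_partial_labels(labels):
    """Map cell-type names to integer codes; None (unlabeled) becomes -1."""
    codes = {}
    encoded = []
    for label in labels:
        if label is None:
            encoded.append(-1)                 # unlabeled cell
        else:
            # Assign the next free integer the first time a label is seen
            encoded.append(codes.setdefault(label, len(codes)))
    return encoded, codes

cell_labels = ["CD4 T", None, "CD8 T", "CD4 T", None]
y, codes = encode_partial_labels(cell_labels)
```

The resulting `y` could then be passed as the `y` argument of `fit_transform`, as in the supervised example on the slide.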

Most users never need supervised UMAP - unsupervised works well.

---

# Summary

.pull-left[
**UMAP advantages**:
- Faster than t-SNE (10-100×)
- Better global structure preservation
- More scalable (100k+ cells)
- Better for trajectories
- More reproducible

**Key parameters**:
- **n_neighbors** (15 default): local vs global balance
- **min_dist** (0.1 default): cluster tightness
]
.pull-right[
**When to use UMAP**:
- Large datasets (>10,000 cells)
- Developmental/trajectory analysis
- Need speed for parameter exploration
- Modern default choice

**When to use t-SNE**:
- Need maximum cluster separation
- Small-medium datasets
- Long-established precedent in the scRNA-seq literature
]

**Remember**: Both are visualization tools. Always validate in expression space.

???

**Presenter's Notes:**

**Bottom line for your research**:

**Default workflow**:
1. Preprocessing + PCA (same for both)
2. Run UMAP with defaults
3. Check against biology
4. Adjust if needed
5. Optionally compare with t-SNE
6. Choose method that best represents your biology

**For your papers**:
- UMAP is now widely accepted
- Can show only UMAP for main figures
- Consider t-SNE in supplement for comparison
- Always report parameters and versions
- Emphasize that clustering/analysis done in PC space

**The field is moving toward UMAP**:
- Scanpy default
- Seurat increasingly UMAP
- Most new papers use UMAP
- Some still show both

**Don't be dogmatic**: 
- Neither method is "correct"
- Both are lossy projections
- Choose based on your question and data
- Be transparent about choices

**Most important**: Understand what each method does, what it preserves, what it distorts, and interpret accordingly.

---
# References

McInnes, L., et al. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. *arXiv*.
https://doi.org/10.48550/arXiv.1802.03426

Becht, E., et al. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. *Nat. Biotechnol.*, *37*(1), 38–44.
https://doi.org/10.1038/nbt.4314

StatQuest explanatory video: https://www.youtube.com/watch?v=NEaUSP4YerM

**Software**:
- **R**: umap, uwot packages, Seurat::RunUMAP
- **Python**: umap-learn, scanpy.tl.umap

**Online resources**:
- UMAP documentation: https://umap-learn.readthedocs.io/
- Interactive examples: https://pair-code.github.io/understanding-umap/