class: center, middle, inverse, title-slide .title[ # t-SNE: Nonlinear Dimensionality Reduction ] .subtitle[ ## t-Distributed Stochastic Neighbor Embedding ] .author[ ### Mikhail Dozmorov ] .institute[ ### Virginia Commonwealth University ] .date[ ### 2026-04-08 ] --- <!-- HTML style block --> <style> .large { font-size: 130%; } .small { font-size: 70%; } .tiny { font-size: 40%; } </style> ## Outline 1. Motivation: Why Dimensionality Reduction? 2. What is t-SNE? 3. The t-SNE Algorithm: Math & Intuition 4. The Perplexity Parameter 5. t-SNE Properties and Characteristics 6. Examples in R 7. Hyperparameters & Practical Considerations 8. Limitations and Common Pitfalls 9. t-SNE vs PCA vs MDS vs UMAP 10. Practical Tips & Best Practices --- ## Motivation: The Curse of Dimensionality **Key Problem**: In high dimensions, all points become equidistant! `$$\lim_{d \to \infty} \frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} = 0$$` **Implications:** 1. Distance-based clustering becomes unreliable 2. Nearest neighbors become less meaningful 3. Volume of space grows exponentially — data becomes increasingly sparse **Solution**: Project to lower dimensions while preserving structure --- ## Motivation: Single-Cell RNA-seq Data **Typical scRNA-seq dataset:** - **Cells**: 10,000 – 100,000+ - **Genes**: ~20,000 measured; typically 2,000 – 5,000 highly variable genes retained for analysis - **Challenges**: Sparse data (60–90% zeros), technical noise, batch effects **The key insight**: Data likely lies on a lower-dimensional manifold embedded in high-dimensional space. Most variation is explained by a smaller number of biological processes. **Goal**: Find a low-dimensional representation that preserves cell-cell similarities, cluster structure, and trajectory relationships. --- ## What is t-SNE?
**t-SNE (t-Distributed Stochastic Neighbor Embedding):** - **Nonlinear** dimensionality reduction technique - Particularly effective for visualizing high-dimensional data in 2D or 3D - Excels at preserving **local structure** and revealing clusters - Widely used in genomics, especially single-cell RNA-seq analysis --- ## t-SNE Key Innovation - Uses different probability distributions in high vs. low dimensional space - Student's t-distribution in low dimensions prevents "crowding problem" - Maintains local neighborhoods while separating distinct clusters **Primary Goal:** Create a low-dimensional map where similar points stay close and dissimilar points stay far apart --- ## t-SNE: Core Idea **Principle**: Preserve **local** neighborhood structure 1. **High-dimensional space**: Define probability distribution over pairs based on similarity - Similar points → High probability of being neighbors 2. **Low-dimensional space**: Define similar probability distribution 3. **Optimization**: Minimize divergence between these distributions **Key Innovation**: Use Student's t-distribution (heavy tails) in low-dimensional space --- ## Stochastic Neighbor Embedding (SNE) **SNE is the predecessor to t-SNE** SNE minimizes the **Kullback-Leibler (KL) divergence** between: - `\(p_{ij}\)`: scaled similarities of points `\(i\)` and `\(j\)` in **high-dimensional** space - `\(q_{ij}\)`: scaled similarities of points `\(i\)` and `\(j\)` in **low-dimensional** space `$$KL(P||Q) = \sum_{i \ne j}p_{ij}\log\frac{p_{ij}}{q_{ij}}$$` --- ## Stochastic Neighbor Embedding (SNE) **How SNE computes similarities:** - Uses a **Gaussian kernel** in both high and low dimensional space: `$$\exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)$$` - `\(\sigma\)`: length scale parameter accounting for kernel width **Problem with SNE:** Crowding problem in low dimensions --- ## From SNE to t-SNE **The Crowding Problem:** - In high dimensions, many points can be equidistant from a central point - In 2D/3D, 
there's not enough space to accommodate all these distances - Result: Points are forced too close together, obscuring structure **t-SNE's Solution:** - Use a **t-distribution** (heavy-tailed) in low-dimensional space instead of Gaussian - Keep Gaussian kernel in high-dimensional space --- ## From SNE to t-SNE **Benefits of t-distribution:** - Allows moderate distances in high-D to become larger distances in low-D - Better separates dissimilar points - Maintains local neighborhoods more effectively - Penalizes wrong embeddings of dissimilar points - Especially suitable for representing **clustered data** and **complex structures** --- ## t-SNE Algorithm Overview **Step-by-step process:** **1. Compute pairwise distances** in high-dimensional space **2. Transform to similarity matrix** using varying Gaussian kernel - Similarity between `\(X_i\)` and `\(X_j\)` represents joint probability that `\(X_i\)` chooses `\(X_j\)` as neighbor (or vice versa). - Based on Euclidean distance and local density --- ## t-SNE Algorithm Overview **3. Create random low-dimensional mapping** (initial configuration) **4. Compute pairwise similarities** in low-dimensional space - Uses **Student's t-distribution** (not Gaussian!) **5. Optimize using gradient descent** - Minimize KL divergence between high-D and low-D distributions - Iteratively adjust point positions --- ## Mathematical Details: High-Dimensional Similarities For points `\(\mathbf{x}_i\)` and `\(\mathbf{x}_j\)` in high-dimensional space, define conditional probability: `$$p_{j|i} = \frac{\exp(-||\mathbf{x}_i - \mathbf{x}_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||\mathbf{x}_i - \mathbf{x}_k||^2 / 2\sigma_i^2)}$$` **Symmetrized joint probability:** `$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$` **Interpretation**: `\(p_{ij}\)` = probability that `\(x_i\)` would pick `\(x_j\)` as its neighbor. 
The bandwidth `\(\sigma_i\)` is **adaptive** — smaller in dense regions, larger in sparse regions — controlled by the perplexity parameter. --- ## Mathematical Details: Low-Dimensional Similarities For low-dimensional points `\(\mathbf{y}_i\)` and `\(\mathbf{y}_j\)`, use a **Student's t-distribution with 1 degree of freedom**: `$$q_{ij} = \frac{(1 + ||\mathbf{y}_i - \mathbf{y}_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||\mathbf{y}_k - \mathbf{y}_l||^2)^{-1}}$$` **Why t-distribution (df=1)?** - Heavy tails → moderate distances in high-D can become larger distances in low-D - Solves the crowding problem - `\((1 + d^2)^{-1}\)` decays slower than Gaussian `\(\exp(-d^2)\)`, giving more "breathing room" --- ## Mathematical Details: Optimization **Objective**: Minimize Kullback-Leibler divergence `$$C = KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$` **Gradient** (used in gradient descent): `$$\frac{\delta C}{\delta \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})(\mathbf{y}_i - \mathbf{y}_j)(1 + ||\mathbf{y}_i - \mathbf{y}_j||^2)^{-1}$$` - If `\(p_{ij} > q_{ij}\)`: points should be closer → **attractive force** - If `\(p_{ij} < q_{ij}\)`: points should be farther → **repulsive force** **Optimization**: Gradient descent with momentum --- ## The Perplexity Parameter **Perplexity:** The main tuning parameter for t-SNE - Determines the **neighborhood size** of the kernels (the adaptive `\(\sigma_i\)`) - Balances attention between local and global structure - Roughly interpreted as number of "effective nearest neighbors" `$$\text{Perplexity}(P_i) = 2^{H(P_i)}$$` where `\(H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}\)` is Shannon entropy **Typical values:** - Range: 5 to 50, Common default: 30 - Small perplexity (5–10): Focus on very local structure - Large perplexity (30–50): Consider broader neighborhoods --- ## The Perplexity Parameter **Effect on `\(\sigma_i\)`:** - Binary search for each point finds `\(\sigma_i\)` that achieves target perplexity - Dense regions → smaller 
`\(\sigma_i\)` - Sparse regions → larger `\(\sigma_i\)` **Too Low (5–10):** May break apart coherent clusters; sensitive to local noise **Appropriate (20–50):** Balances local and global structure; smooth transitions **Too High (>100):** May merge distinct populations; PCA-like behavior - Different perplexity values can give very different results - Recommended: Try multiple values (e.g., 5, 30, 50) - Should be smaller than the number of data points - Upper bound: `Rtsne` requires perplexity `\(\le (n-1)/3\)`; for very large datasets, perplexity `\(\approx n/100\)` is a common heuristic (Kobak et al. 2019) --- ## Spring Analogy: How t-SNE Works **Physical interpretation of the optimization:** Each pair of points `\(Y_i\)` and `\(Y_j\)` is connected by a **spring**: - **Attractive force:** When similarity in projection < similarity in high-D space - Spring pulls points closer together - **Repulsive force:** When similarity in projection > similarity in high-D space - Spring pushes points farther apart --- ## Spring Analogy: How t-SNE Works **Gradient descent:** - Reduces each point's springs into a single force vector - Moves points to minimize total KL divergence **Heavy-tailed t-distribution advantage:** - Exerts **stronger force** when pushing distant points further apart - Alleviates the crowding problem - Creates clearer separation between clusters --- ## t-SNE Properties and Characteristics **Strengths:** - Excellent at revealing **local structure** and clusters - Handles nonlinear relationships - Creates visually interpretable 2D/3D maps - Robust to noise in high-dimensional data **Important Properties:** - **Axes have no inherent meaning** (arbitrary units) - **Distances between clusters** are not meaningful - **Cluster sizes** do not necessarily reflect true sizes - **Stochastic:** Different runs produce different results (set seed!)
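The stochasticity is easy to demonstrate with `Rtsne` (the package used in the examples that follow): coordinates differ between seeds even though the cluster structure should be similar. A sketch:

``` r
# Two runs with different seeds give different coordinates
library(Rtsne)
data(iris)
X <- as.matrix(unique(iris[, 1:4]))   # Rtsne disallows duplicate rows

set.seed(1); Y1 <- Rtsne(X, perplexity = 30)$Y
set.seed(2); Y2 <- Rtsne(X, perplexity = 30)$Y

max(abs(Y1 - Y2))  # nonzero: embeddings differ run to run
```

If two seeds disagree on cluster membership (not just orientation or position), increase `max_iter` or revisit the preprocessing.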
**Key Insight:** t-SNE is primarily a **visualization tool**, not for quantitative distance analysis --- ## Example 1: Iris Dataset Basic t-SNE example with the iris dataset ``` r library(Rtsne) library(ggplot2) # Load iris data data(iris) iris_data <- as.matrix(iris[, 1:4]) # Add small amount of noise to break ties iris_data <- iris_data + matrix(rnorm(nrow(iris_data) * ncol(iris_data), mean = 0, sd = 0.0001), nrow = nrow(iris_data)) # Run t-SNE set.seed(42) # For reproducibility tsne_result <- Rtsne(iris_data, dims = 2, perplexity = 30, verbose = FALSE, max_iter = 500) # Create data frame for plotting tsne_df <- data.frame( tSNE1 = tsne_result$Y[, 1], tSNE2 = tsne_result$Y[, 2], Species = iris$Species ) ``` --- ## Example 1: Visualization .pull-left[ ``` r # Plot t-SNE result ggplot(tsne_df, aes(x = tSNE1, y = tSNE2, color = Species)) + geom_point(size = 3, alpha = 0.7) + labs(title = "t-SNE of Iris Dataset", x = "t-SNE Dimension 1", y = "t-SNE Dimension 2") + theme_minimal() + theme(legend.position = "right") ``` <img src="02_tSNE_xaringan_files/figure-html/unnamed-chunk-2-1.png" alt="" style="display: block; margin: auto auto auto 0;" /> ] .pull-right[**Observation:** - Clear separation of species - Setosa well separated from others - Some overlap between Versicolor and Virginica (biological reality!)] --- ## Example 2: Effect of Perplexity Comparing different perplexity values ``` r set.seed(42) # Try different perplexity values perplexities <- c(5, 15, 20, 40) tsne_list <- lapply(perplexities, function(perp) { result <- Rtsne(iris_data, dims = 2, perplexity = perp, verbose = FALSE, max_iter = 500) data.frame( tSNE1 = result$Y[, 1], tSNE2 = result$Y[, 2], Species = iris$Species, Perplexity = paste("Perplexity =", perp) ) }) # Combine all results tsne_combined <- do.call(rbind, tsne_list) ``` --- ## Example 2: Perplexity Comparison .pull-left[ ``` r # Plot all perplexity values ggplot(tsne_combined, aes(x = tSNE1, y = tSNE2, color = Species)) + geom_point(size 
= 2, alpha = 0.7) + facet_wrap(~ Perplexity, scales = "free", ncol = 2) + labs(title = "Effect of Perplexity on t-SNE Results") + theme_minimal() ``` <img src="02_tSNE_xaringan_files/figure-html/unnamed-chunk-4-1.png" alt="" style="display: block; margin: auto;" /> ] .pull-right[ **Key Observations:** - Low perplexity: Focuses on very local structure, may fragment clusters - High perplexity: Broader view, smoother boundaries - Optimal perplexity depends on data structure and goals ] --- ## Example 3: Simulated Gene Expression More complex example with simulated gene expression data ``` r # Simulate gene expression data with 4 cell types set.seed(123) n_cells <- 200 n_genes <- 500 # Create 4 distinct cell types expr_data <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes) # Add cell type-specific expression patterns expr_data[1:50, 1:100] <- expr_data[1:50, 1:100] + 3 expr_data[51:100, 101:200] <- expr_data[51:100, 101:200] + 3 expr_data[101:150, 201:300] <- expr_data[101:150, 201:300] + 3 expr_data[151:200, 301:400] <- expr_data[151:200, 301:400] + 3 cell_types <- rep(c("Type1", "Type2", "Type3", "Type4"), each = 50) ``` --- ## Example 3: t-SNE on Gene Expression .pull-left[ ``` r # Run t-SNE set.seed(42) tsne_expr <- Rtsne(expr_data, dims = 2, perplexity = 30, verbose = FALSE, max_iter = 1000) # Create plot data tsne_expr_df <- data.frame( tSNE1 = tsne_expr$Y[, 1], tSNE2 = tsne_expr$Y[, 2], CellType = cell_types ) ``` **Result:** Clear separation of 4 cell type clusters ] .pull-right[ ``` r # Plot ggplot(tsne_expr_df, aes(x = tSNE1, y = tSNE2, color = CellType)) + geom_point(size = 2.5, alpha = 0.7) + labs(title = "t-SNE of Simulated Gene Expression Data", subtitle = "4 distinct cell types") + theme_minimal() ``` <img src="02_tSNE_xaringan_files/figure-html/unnamed-chunk-7-1.png" alt="" style="display: block; margin: auto;" /> ] --- ## scRNA-seq Pipeline: Where Does t-SNE Fit? 
``` Raw Counts ↓ Quality Control (filter low-quality cells/genes) ↓ Normalization (e.g., log(CPM+1), scran, etc.) ↓ Feature Selection (highly variable genes) ↓ Scaling/Centering ↓ PCA (typically 20–50 PCs) ↓ *** t-SNE *** ↓ Visualization + Clustering ``` **Key Point**: t-SNE typically operates on **PCA-reduced data**, NOT raw gene expression! Running t-SNE on 20,000 genes directly is computationally expensive and dominated by noise. --- ## t-SNE Parameters in R **Key parameters in `Rtsne()`:** ``` r Rtsne(X, dims = 2, # Number of output dimensions perplexity = 30, # Neighborhood size (5-50) theta = 0.5, # Speed/accuracy tradeoff (0-1) max_iter = 1000, # Number of iterations pca = TRUE, # Whether an initial PCA step should be performed pca_center = TRUE, # Center data before PCA pca_scale = FALSE, # Scale data before PCA normalize = TRUE, # Normalize input Y_init = NULL, # Matrix, initial locations of the objects verbose = TRUE) # Print progress ``` **theta parameter:** - 0: Exact t-SNE (slow, accurate) - 0.5: Barnes-Hut approximation (default, fast) - Higher values: Faster but less accurate --- ## Hyperparameters: Number of Iterations & Early Exaggeration **Typical**: 1000 iterations; check convergence by monitoring KL divergence **Early exaggeration phase** (first ~250 iterations): - `\(p_{ij}\)` values are multiplied by a factor (default ×12) - Helps form tight clusters and prevents poor local minima - Exaggeration is then removed for fine-tuning ``` r # Monitor convergence tsne_result <- Rtsne(pca_data, perplexity = 30, max_iter = 1000, verbose = TRUE) # Prints KL divergence ``` **Warning**: Different runs give different results (non-convex optimization!) — always set a seed. 
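--- ## Monitoring Convergence: Toy Example The KL-divergence monitoring above can be made concrete with a didactic base-R gradient-descent loop (fixed `\(\sigma\)`, no momentum or early exaggeration; a sketch of the mechanics, not the real Barnes-Hut optimizer):

``` r
# Minimal t-SNE-style gradient descent on toy data: two well-separated clusters
set.seed(42)
n <- 30
X <- rbind(matrix(rnorm(15 * 5), ncol = 5),            # cluster 1
           matrix(rnorm(15 * 5, mean = 4), ncol = 5))  # cluster 2

# High-D joint probabilities with fixed sigma = 1 (real t-SNE adapts sigma_i)
D2 <- as.matrix(dist(X))^2
P <- exp(-D2 / 2); diag(P) <- 0
P <- P / rowSums(P)
P <- pmax((P + t(P)) / (2 * n), 1e-12)

Y   <- matrix(rnorm(n * 2, sd = 1e-4), ncol = 2)  # random initialization
eta <- n / 12                                     # small step for this tiny n
kl  <- numeric(200)
for (it in seq_along(kl)) {
  W <- 1 / (1 + as.matrix(dist(Y))^2); diag(W) <- 0  # t-kernel weights
  Q <- pmax(W / sum(W), 1e-12)                       # low-D probabilities
  kl[it] <- sum(P * log(P / Q))                      # objective to minimize
  G <- 4 * (P - Q) * W                               # gradient coefficients
  Y <- Y - eta * ((diag(rowSums(G)) - G) %*% Y)      # gradient step
}
kl[1]; kl[length(kl)]  # KL divergence decreases as the map forms
```

Plot `kl` to see the characteristic fast early drop followed by slow refinement; in `Rtsne`, `verbose = TRUE` prints the same quantity during optimization.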
--- ## Hyperparameters: Initialization **Options:** - **Random** (default): Points start at random positions in low-dimensional space - **PCA**: Initialize with first 2–3 PCs - Often faster convergence, more stable results - Can bias toward linear (PCA) structure **Recommendation:** - Use PCA initialization for large datasets (>10,000 cells) - Use random for smaller datasets - Compare both if concerned about bias --- ## Hyperparameters: Learning Rate & Momentum **Learning rate** (`\(\eta\)`): - Default: `\(\eta = 200\)` (or `\(n/12\)` for large datasets) - Too low → slow convergence; too high → unstable results **Momentum:** - Standard optimization technique; helps escape poor local minima - Default settings usually work well **Practical advice**: Usually don't need to adjust these unless working with very large datasets (>100,000 cells) or using specialized implementations (FIt-SNE, openTSNE). --- ## Advanced: PCA + t-SNE Pipeline For very high-dimensional data (e.g., 20,000 genes): ``` r # Step 1: PCA for initial dimensionality reduction pca_result <- prcomp(expr_data, center = TRUE, scale. = FALSE) # Use first 50 PCs (capturing most variance) pca_data <- pca_result$x[, 1:50] # Step 2: t-SNE on PCA-reduced data set.seed(42) tsne_result <- Rtsne(pca_data, dims = 2, perplexity = 30, pca = FALSE, # Already did PCA max_iter = 1000) # This is much faster and often gives better results! ``` **Benefits:** - Removes noise from minor components - Speeds up computation dramatically (`\(O(n^2 d)\)` where `\(d\)` is now 50, not 20,000 — a 400× speedup) - Often improves cluster separation --- ## Critical Limitations: Distances Not Meaningful **IMPORTANT**: t-SNE preserves local neighborhoods, NOT global distances! ❌ **Don't conclude**: "Cluster A is closer to B than C, indicating more similarity" ✅ **Do conclude**: "Red and blue are distinct clusters" **Why this happens**: The heavy-tailed t-distribution allows moderate distances in high-D to expand greatly in low-D.
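A quick numeric check (base R) of why moderate distances stretch: at the same distance the t-kernel retains far more similarity than the Gaussian, so matching a small `\(p_{ij}\)` with `\(q_{ij}\)` forces the points apart. The ratio is already about 11× at `\(d = 2\)`:

``` r
d     <- c(0.5, 1, 2, 5, 10)  # candidate pairwise distances
gauss <- exp(-d^2)            # Gaussian kernel (high-D side)
tkern <- 1 / (1 + d^2)        # t-kernel with df = 1 (low-D side)
tkern / gauss                 # ratio grows explosively with distance
```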
Two clusters separated by similar distances in high-D may end up at very different distances in t-SNE space. **Practical implication**: Don't use t-SNE to infer relationships *between* clusters. For that, use original high-dimensional distances, PCA, or other methods that preserve global structure. --- ## Critical Limitations: Cluster Sizes & Non-Determinism **Cluster sizes are not meaningful!** - Dense clusters may get "expanded" to preserve internal structure - Sparse clusters may get "compacted" - Cannot use t-SNE coordinates to compare cluster sizes or proportions - Always count cells in original data **Non-deterministic results:** - Different seeds → different embeddings - What *should* be consistent: number of distinct clusters, which cells cluster together, overall topology - What varies: exact positions, orientation, spacing between clusters - **Never do statistical tests on t-SNE coordinates** — always go back to the original space <!--- ## Critical Limitations: The Crowding Problem **Cannot perfectly represent high-D geometry in 2D/3D** - In n-dimensional space, you can have n+1 equidistant points - In 2D, you can only have 3 equidistant points (equilateral triangle) - If a cell has 50 equally similar neighbors in 20,000-D space, they simply cannot all be equidistant in 2D **t-SNE's choice**: Prioritize local structure, allow global distortion - Within-cluster structure is preserved well - Between-cluster relationships are not reliable - Use PCA or UMAP if global distances matter ## Common Pitfalls and Solutions **Problem 1: "Crowded" or unclear clusters** - **Solution:** Decrease perplexity or increase iterations **Problem 2: Overly fragmented clusters** - **Solution:** Increase perplexity **Problem 3: Different results each run** - **Solution:** Set random seed, or run multiple times and check consistency ## Common Pitfalls and Solutions **Problem 4: Very slow for large datasets** - **Solution:** Use Barnes-Hut approximation (theta = 0.5), reduce 
dimensions with PCA first, or use FIt-SNE for >50,000 cells **Problem 5: Interpreting cluster distances** - **Solution:** Don't! Focus on local structure and cluster membership only **Problem 6: Small dataset (< 100 points)** - **Solution:** Consider using PCA or MDS instead --> --- ## Acceleration Methods **Problem**: Standard t-SNE is `\(O(n^2)\)` — very slow for large `\(n\)` **Solutions:** 1. **Barnes-Hut t-SNE** (default in Rtsne): - Approximates far-field interactions using quadtree/octree - Complexity: `\(O(n \log n)\)` - Handles 10,000–50,000 cells comfortably 2. **FIt-SNE** (Fast Interpolation-based t-SNE): - Interpolates gradients on a grid - Complexity: `\(O(n)\)` for fixed perplexity - Can handle 1M+ cells --- ## Acceleration Methods **Problem**: Standard t-SNE is `\(O(n^2)\)` — very slow for large `\(n\)` **Solutions:** 3. **Subsampling**: Run t-SNE on a subset, project remaining cells **Practical recommendation**: <10k cells → standard; 10k–50k → Barnes-Hut; 50k–500k → FIt-SNE or UMAP; >500k → UMAP. --- ## t-SNE for Single-Cell Analysis **ViSNE Application (2013):** - Enabled visualization of high-dimensional single-cell data - Revealed phenotypic heterogeneity in leukemia <!-- - Revolutionized flow cytometry and mass cytometry analysis --> <!-- **Modern Single-Cell RNA-seq:** --> <!-- - t-SNE became standard for visualizing cell types --> <!-- - Helps identify novel cell populations --> <!-- - Reveals developmental trajectories --> <!-- - Now often complemented by UMAP --> **Key Use Cases:** - Cell type identification - Quality control visualization - Batch effect assessment - Trajectory inference (with caution) .small[Amir, Ea., Davis, K., Tadmor, M. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat Biotechnol 31, 545–552 (2013). 
https://doi.org/10.1038/nbt.2594] --- ## t-SNE vs PCA vs MDS **Comparison:** | Method | Type | Preserves | Best For | |:-------|:-----|:----------|:---------| | **PCA** | Linear | Global variance | Overall data structure, feature importance | | **MDS** | Can be linear/nonlinear | Pairwise distances | Distance relationships, custom metrics | | **t-SNE** | Nonlinear | Local neighborhoods | Clusters, complex structures, visualization | <!--- ## t-SNE vs PCA vs MDS **When to use t-SNE:** - Discovering clusters in high-dimensional data - Visualizing complex, nonlinear structures - Single-cell genomics data - When local structure is more important than global distances **When NOT to use t-SNE:** - Need to interpret axes - Need precise distance measurements - Small datasets (< 100 points) - Need deterministic results ## t-SNE vs UMAP **UMAP (Uniform Manifold Approximation and Projection):** - Newer method, increasingly popular in single-cell analysis - Often faster than t-SNE - Better preserves global structure - More consistent across runs .small[ | Feature | t-SNE | UMAP | |:--------|:------|:-----| | Speed | Slower (`\(O(n \log n)\)`) | Faster | | Global structure | Poor | Better | | Local structure | Excellent | Excellent | | Determinism | Low (stochastic) | Higher | | Parameters | Perplexity | n_neighbors, min_dist | | Scalability | <50k cells | 100k+ cells | ] **Current trend:** Many researchers use both and compare results --> --- ## Practical Tips for t-SNE 1. **Always set a seed** for reproducibility (`set.seed()`) 2. **Try multiple perplexity values** (e.g., 5, 30, 50) 3. **Run sufficient iterations** (at least 1000 for complex data; check KL convergence) 4. 
**Don't over-interpret:** - Cluster distances are not meaningful - Cluster sizes may be distorted - Axes have no inherent meaning --- ## Practical Tips for t-SNE 5) **Preprocessing matters:** - Scale/normalize data appropriately - Consider feature selection for very high-dimensional data - Correct batch effects before running t-SNE 6) **Combine with other methods:** - Use PCA for initial dimensionality reduction (e.g., to 50 dims) - Validate clusters with other methods (graph-based clustering in PC space) - Use t-SNE for visualization; do statistical analysis in original space 7) **Sanity checks:** Compare with PCA; overlay known biology (e.g., marker genes); check if batches mix; confirm clustering done in PC space matches visual clusters <!--- ## Best Practices: Reporting **Essential information for reproducibility:** 1. **Preprocessing**: Normalization method, number of HVGs selected, number of PCs used, batch correction method (if any) 2. **t-SNE parameters**: Perplexity, number of iterations, random seed, software/package version 3. **Interpretation caveats**: Note that distances are not meaningful; report clustering done in PC space; describe marker gene validation **Example good methods statement:** > "We visualized cells using t-SNE (perplexity=30, seed=42, Rtsne v0.16) on the first 30 principal components. Clusters were identified using Louvain clustering (resolution=0.8) on PC-space distances. Cell types were annotated based on marker gene expression." --> --- ## Summary **1. t-SNE is nonlinear** dimensionality reduction for visualization **2. Heavy-tailed t-distribution** in low-D space prevents crowding **3. Perplexity** is the key parameter (try multiple values!) **4. Local structure** is preserved, but global distances are not meaningful **5. Always set random seed** for reproducibility --- ## Summary **6. Excellent for discovering clusters** in high-dimensional data **7. Standard tool** for single-cell genomics **8. 
Combine with other methods** (PCA, clustering, UMAP) for robust analysis **9. Don't over-interpret** distances between clusters or cluster sizes **10. Computationally intensive** - consider PCA pre-processing for large datasets, and FIt-SNE or UMAP for very large datasets --- ## Resources **R Packages:** - `Rtsne`: Main R implementation - `tsne`: Alternative implementation - `umap`: For UMAP as comparison **Online Resources:** - https://distill.pub/2016/misread-tsne/: "How to Use t-SNE Effectively" (interactive!) - Seurat tutorials: https://satijalab.org/seurat/ - OSCA: http://bioconductor.org/books/OSCA/ --- ## References .small[ van der Maaten, L., et al. (2008). Visualizing data using t-SNE. *JMLR*, *9*, 2579–2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html van der Maaten, L. Visualizing Data Using t-SNE (Google Tech Talk). https://www.youtube.com/watch?v=RJVL80Gg3lA Amir, E. D., et al. (2013). viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. *Nat. Biotechnol.*, *31*(6), 545–552. https://doi.org/10.1038/nbt.2594 Kobak, D., et al. (2019). The art of using t-SNE for single-cell transcriptomics. *Nat. Commun.*, *10*, 5416. https://doi.org/10.1038/s41467-019-13056-x Wattenberg, M., et al. (2016). How to use t-SNE effectively. *Distill*, *1*(10). https://doi.org/10.23915/distill.00002 Luecken, M. D., et al. (2019). Current best practices in single-cell RNA-seq analysis: A tutorial. *Mol. Syst. Biol.*, *15*(6), e8746. https://doi.org/10.15252/msb.20188746 Linderman, G. C., et al. (2019). Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. *Nat. Methods*, *16*(3), 243–245. https://doi.org/10.1038/s41592-018-0308-4 <!-- Chari, T., et al. (2023). The specious art of single-cell genomics. *PLOS Comput. Biol.*, *19*(8), e1011288. --> <!-- https://doi.org/10.1371/journal.pcbi.1011288 --> ]