Data representation in R / Bioconductor

class: center, middle, inverse, title-slide

.title[
# Data representation in R / Bioconductor
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-03-18
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

## SummarizedExperiment: The Core Data Container

The `SummarizedExperiment` class is the standard Bioconductor container for keeping high-throughput data aligned with its metadata. It synchronizes three main components:
.pull-left[
* **Assays:** A list of matrix-like objects with identical dimensions (e.g., `counts`, `logcounts`) where **rows are features** (genes, proteins) and **columns are samples**.
* **colData:** A `DataFrame` of sample-level metadata (e.g., treatment, age, clinical data), with one row per assay column.
* **rowData:** A `DataFrame` of feature-level metadata (e.g., gene symbols, GC content), with one row per assay row.
]
.pull-right[
<img src="img/SummarizedExperiment1.png" alt="" width="500px" style="display: block; margin: auto;" />
]
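
---
## SummarizedExperiment: A Minimal Sketch

The three components above can be assembled as follows. This is a small sketch with made-up data: the gene names, sample names, and the `treatment` and `symbol` columns are illustrative assumptions, not part of any real dataset.

```r
library(SummarizedExperiment)

# Toy data (invented for illustration): 4 genes x 3 samples
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Sample-level metadata: one row per column of `counts`
sample_info <- DataFrame(treatment = c("ctrl", "drug", "drug"),
                         row.names = colnames(counts))

# Feature-level metadata: one row per row of `counts`
gene_info <- DataFrame(symbol = c("A", "B", "C", "D"),
                       row.names = rownames(counts))

se <- SummarizedExperiment(assays  = list(counts = counts),
                           colData = sample_info,
                           rowData = gene_info)

# A second assay must have the same dimensions as the first
assay(se, "logcounts") <- log2(counts + 1)

dim(se)
assayNames(se)
colData(se)
rowData(se)
```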

---
## SummarizedExperiment: The Core Data Container

Because assays, `colData`, and `rowData` live in a single `SummarizedExperiment` object, they stay aligned through every operation.
.pull-left[
**The "Locking" Rule:**

Subsetting a `SummarizedExperiment` by sample or feature automatically subsets all associated assays and metadata, preventing data misalignment.
]
.pull-right[
<img src="img/SummarizedExperiment1.png" alt="" width="500px" style="display: block; margin: auto;" />
]
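
---
## The "Locking" Rule: A Minimal Sketch

A short illustration of coordinated subsetting, using invented toy data (the names and the `treatment` column are assumptions made for this example). One bracket call subsets the assay, `colData`, and `rowData` together.

```r
library(SummarizedExperiment)

# Toy object (invented for illustration): 4 genes x 3 samples
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))
se <- SummarizedExperiment(
  assays  = list(counts = counts),
  colData = DataFrame(treatment = c("ctrl", "drug", "drug"),
                      row.names = colnames(counts)),
  rowData = DataFrame(symbol = c("A", "B", "C", "D"),
                      row.names = rownames(counts))
)

# Keep the first two genes and only the treated samples in one step;
# `se$treatment` is shorthand for colData(se)$treatment
se_sub <- se[1:2, se$treatment == "drug"]

assay(se_sub, "counts")   # 2 x 2 matrix
colData(se_sub)           # metadata for the two retained samples
rowData(se_sub)           # metadata for the two retained genes
```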

---
## RangedSummarizedExperiment

A `RangedSummarizedExperiment` is a specialized version of the container where feature metadata is grounded in physical genomic locations.

* **`rowRanges()`:** Holds a `GRanges` or `GRangesList` object with one entry per row; `rowData` is stored as the metadata columns of these ranges.

* **Spatial Context:** Each row is now linked to a specific chromosome, start/end position, and strand.

* **Integration:** Allows for immediate spatial queries, such as "Find all features overlapping a specific SNP" or "Extract promoter sequences for these rows."

.small[ https://bioconductor.org/packages/SummarizedExperiment/ ]
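
---
## RangedSummarizedExperiment: A Minimal Sketch

The overlap query described above can be sketched as follows. All coordinates, feature names, and the SNP position are made up for illustration; real analyses would take ranges from an annotation source.

```r
library(SummarizedExperiment)

counts <- matrix(1:6, nrow = 3,
                 dimnames = list(paste0("feat", 1:3), c("s1", "s2")))

# Invented coordinates for three features; `symbol` becomes a metadata column
ranges <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
                  ranges   = IRanges(start = c(100, 500, 100), width = 100),
                  strand   = c("+", "-", "+"),
                  symbol   = c("A", "B", "C"))
names(ranges) <- rownames(counts)

# Supplying rowRanges yields a RangedSummarizedExperiment
rse <- SummarizedExperiment(assays = list(counts = counts),
                            rowRanges = ranges)

rowData(rse)$symbol   # metadata columns of the ranges

# Hypothetical SNP at chr1:520; keep only the rows that overlap it
snp  <- GRanges("chr1", IRanges(start = 520, width = 1))
hits <- subsetByOverlaps(rse, snp)
rownames(hits)

# Promoter windows for every row (window sizes chosen arbitrarily)
promoters(rowRanges(rse), upstream = 50, downstream = 10)
```

Extracting the promoter *sequences* would additionally require a genome sequence object, which is outside the scope of this sketch.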

---
## SingleCellExperiment: Specialized for scRNA-seq

The `SingleCellExperiment` (SCE) class extends `RangedSummarizedExperiment` (and therefore `SummarizedExperiment`), adding specialized "slots" for results typical of single-cell analysis, where data are sparse and high-dimensional.

* **Reduced Dimensionality (`reducedDims`):** A dedicated slot to store low-dimensional embeddings like **PCA**, **t-SNE**, and **UMAP**. This keeps coordinates synchronized with the main expression data.

* **Alternative Experiments (`altExps`):** Allows you to store data from different "modalities" (e.g., CITE-seq protein counts or CRISPR tags) for the exact same cells.

* **Size Factors (`sizeFactors`):** Stores the per-cell scaling factors used to normalize for differences in sequencing depth between individual cells.

.small[ https://bioconductor.org/packages/SingleCellExperiment ]
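
---
## SingleCellExperiment: A Minimal Sketch

A small sketch of the three additions, using simulated data. The "PCA" coordinates here are random placeholders rather than the output of a real dimensionality reduction, the `ADT` protein matrix is invented, and the library-size size factors are one simple choice among several normalization approaches.

```r
library(SingleCellExperiment)

set.seed(1)
# Simulated counts: 10 genes x 5 cells
counts <- matrix(rpois(50, lambda = 5), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("cell", 1:5)))
sce <- SingleCellExperiment(assays = list(counts = counts))

# Placeholder embedding: one row per cell (random numbers, not a real PCA)
pca <- matrix(rnorm(10), nrow = 5,
              dimnames = list(colnames(counts), c("PC1", "PC2")))
reducedDim(sce, "PCA") <- pca

# A second modality measured on the same cells (invented protein counts)
adt <- matrix(rpois(15, lambda = 20), nrow = 3,
              dimnames = list(paste0("protein", 1:3), colnames(counts)))
altExp(sce, "ADT") <- SummarizedExperiment(assays = list(counts = adt))

# Simple library-size factors, scaled to have mean 1
lib <- colSums(counts)
sizeFactors(sce) <- lib / mean(lib)

# Subsetting cells carries the embedding, altExp, and size factors along
sce_sub <- sce[, 1:3]
dim(reducedDim(sce_sub, "PCA"))
ncol(altExp(sce_sub, "ADT"))
sizeFactors(sce_sub)
```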

---
## MultiAssayExperiment: Multi-Omics Integration

The `MultiAssayExperiment` (MAE) is designed to manage complex datasets where multiple types of biological assays (e.g., RNA-seq, Methylation, Proteomics) are performed on the same set of patients or biological samples.

* **Unified Interface:** Provides a single object to store diverse data types that may have different dimensions (e.g., 20,000 genes vs. 500,000 methylation sites).

* **Sample Map (`sampleMap`):** An internal table that links each column of each assay to a patient in `colData`, so the object remains consistent even if some patients are missing data for certain experiments.

* **Coordinated Subsetting:** Much like `SummarizedExperiment`, you can subset an MAE by patient ID or clinical characteristic (e.g., "Stage IV Cancer"), and it will automatically update all underlying experiments.
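
---
## MultiAssayExperiment: A Minimal Sketch

The sample map and coordinated subsetting can be sketched with a toy example. Everything here is invented for illustration: three patients, an RNA-seq matrix for all of them, and a methylation matrix in which patient `P2` is missing.

```r
library(MultiAssayExperiment)

# Invented clinical table: one row per patient
patients <- DataFrame(stage = c("II", "IV", "IV"),
                      row.names = c("P1", "P2", "P3"))

# Two assays with different features and different sample coverage
rna <- matrix(1:12, nrow = 4,
              dimnames = list(paste0("gene", 1:4),
                              c("P1_rna", "P2_rna", "P3_rna")))
meth <- matrix((1:6) / 10, nrow = 3,
               dimnames = list(paste0("cpg", 1:3),
                               c("P1_meth", "P3_meth")))

# Sample map: which assay column (colname) belongs to which patient (primary)
sample_map <- DataFrame(
  assay   = factor(c("rna", "rna", "rna", "meth", "meth"),
                   levels = c("rna", "meth")),
  primary = c("P1", "P2", "P3", "P1", "P3"),
  colname = c("P1_rna", "P2_rna", "P3_rna", "P1_meth", "P3_meth")
)

mae <- MultiAssayExperiment(
  experiments = ExperimentList(rna = rna, meth = meth),
  colData     = patients,
  sampleMap   = sample_map
)

# Subset by a clinical characteristic; every experiment is updated
stage4 <- mae[, mae$stage == "IV", ]
colnames(stage4)
```

After subsetting, the `rna` experiment keeps the columns for `P2` and `P3`, while `meth` keeps only `P3`, because `P2` has no methylation data.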

---
## MultiAssayExperiment Details

<img src="img/multiassayexperiment1.png" alt="" width="900px" style="display: block; margin: auto;" />
.small[ https://bioconductor.org/packages/MultiAssayExperiment ]