---
title: "Bioconductor Annotation Resources"
author: "Mikhail Dozmorov"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    theme: flatly
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
# Global chunk options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Introduction

In this demonstration, we will explore several powerful Bioconductor packages used for genomic annotation:
* **OrgDb**: Gene-centric annotations for a specific organism.
* **TxDb**: Transcript-centric annotations (genomic coordinates).
* **biomaRt**: R interface to query Ensembl and other BioMart databases.
* **AnnotationHub & ExperimentHub**: Cloud-based repositories for standard annotations and curated experimental datasets.

## Setup and Libraries

First, we load the necessary packages. If you don't have these installed, you will need to use `BiocManager::install()`.

```{r load-libraries}
library(org.Hs.eg.db)                         # Human gene annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene)    # Human transcript models (hg38)
library(biomaRt)                              # Web-based database queries
# library(KEGGREST)                             # API for KEGG pathways
library(AnnotationHub)                        # Hub for annotation objects
library(GenomicRanges)                        # Core structure for genomic intervals
library(ExperimentHub)                        # Hub for experimental datasets
library(SummarizedExperiment)                 # Core container for assay data
library(dplyr)                                # Data manipulation
```

## OrgDb: Organism-Level Annotations

`OrgDb` packages provide gene-centric annotations (like mapping Ensembl IDs to Gene Symbols) for specific organisms. The `org.Hs.eg.db` package is specifically for *Homo sapiens*.

```{r orgdb-demo}
# View package documentation
# ?org.Hs.eg.db

# See what types of data (columns) we can extract from this database
columns(org.Hs.eg.db)

# Map between different identifier types
# Here, we map HGNC Gene Symbols (BRCA1, BRCA2) to their corresponding Ensembl IDs.
mapIds(
  x = org.Hs.eg.db, 
  keys = c("BRCA1", "BRCA2"), # The IDs we are querying
  column = "ENSEMBL",         # The type of ID we want back
  keytype = "SYMBOL",         # The type of ID we are passing in
  multiVals = "first"         # What to do if there are multiple matches
)
```

## TxDb: Transcript-Level Annotations

`TxDb` packages store genomic coordinates for transcripts, exons, and CDS (coding sequences). Here we use the UCSC hg38 genome build.

```{r txdb-demo}
# Extract all exons from the database as a GRanges object
all_exons <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
head(all_exons, 3)

# Group exons by their corresponding gene
# This returns a GRangesList where each list element is a gene, 
# and the GRanges inside are the exons for that gene.
exons_by_gene <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by = "gene")
head(exons_by_gene, 2)
```

## biomaRt: Querying Web Databases

`biomaRt` allows us to query major databases (like Ensembl) directly from R. It works using three main concepts:
1. **Mart**: The database you want to connect to.
2. **Filters**: The input criteria (e.g., "Find genes on chromosome 21").
3. **Attributes**: The information you want retrieved (e.g., "Give me their gene symbols").

```{r biomart-discovery, eval=FALSE}
# 1. Discover available marts and datasets
head(listMarts(), 7)                               # List available databases
head(listDatasets(useMart("ensembl")), 7)          # List datasets within Ensembl

# 2. Connect to the fully specified mart (Human Ensembl)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
```

Now let's perform some queries using our `ensembl` connection.

```{r biomart-queries, eval=FALSE}
# Query 1: Find Gene symbols associated with a specific GO term (MAP kinase activity)
# Note: GO:0004707 is MAP kinase activity; GO:0004704 is a related parent term.
go_query <- getBM(
  attributes = c('entrezgene_id', 'hgnc_symbol'), # Note: 'entrezgene' is deprecated in newer versions
  filters = 'go', 
  values = 'GO:0004707',
  mart = ensembl
)
head(go_query, 5)

# ---------------------------------------------------------
# Exploring available filters and attributes
# ---------------------------------------------------------

# Look at available filters and search for ones related to "entrez"
head(listFilters(ensembl), 7)
all_filters <- listFilters(ensembl)
all_filters[grep("entrez", all_filters[, "description"], ignore.case = TRUE), ]

# Look at the specific options for a particular filter (e.g., chromosome_name)
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) # Truncated for easy viewing

# Look at available attributes (what we can ask the database to return)
head(listAttributes(ensembl), 3)

# ---------------------------------------------------------
# Assemble and execute a custom query
# ---------------------------------------------------------
myAttributes <- c("ensembl_gene_id", "chromosome_name")
myValues <- c("21", "22") # Look on chromosomes 21 and 22

res <- getBM(
  attributes = myAttributes, 
  filters = myFilter,
  values = myValues, 
  mart = ensembl
)

# View a summary of the results
table(res$chromosome_name)
```

## AnnotationHub: Cloud-Based Annotations

`AnnotationHub` is a massive cloud repository of standard annotation files (BED, GTF, GRanges, etc.). Instead of downloading and parsing flat files, you retrieve them as ready-to-use R objects.

```{r annotationhub-init}
# Initialize the hub
ah <- AnnotationHub()

# See who provides data to the hub
head(unique(ah$dataprovider))

# Subset the hub to only include Human data to make it manageable
ah <- subset(ah, species == "Homo sapiens")
ah
```

Let's look for specific data, like UCSC tracks for hg19.

```{r annotationhub-query}
# Query the hub for specific terms
ah_query <- query(ah, c("ucsc", "hg19"))

# Quick look at the metadata fields available
head(ah_query$title, 3)
head(ah_query$tags, 2)

# See the most common file types available in our subset
ah_query$description %>% table() %>% sort(decreasing = TRUE) %>% head()
```

Let's retrieve an actual dataset from the hub. *Note: In practice, it's safer to extract objects using their AH ID (e.g., "AH5012") rather than their index number, as hub contents can update.*

```{r annotationhub-extract}
# Extracting a specific item (CpG Islands) using its index from our query
ah_query$title[75]
cpg_granges <- ah_query[[75]] 
summary(width(cpg_granges))

# reduce() merges overlapping genomic ranges
summary(width(reduce(cpg_granges)))

# Extracting an item using its exact AH ID (Chromosome Bands)
ah_query$title[ah_query$ah_id == "AH5012"]
chromosome_bands <- ah_query[["AH5012"]]
head(chromosome_bands, 3)

# View the metadata attached to this object
metadata(chromosome_bands)
```

## ExperimentHub: Curated Datasets

While `AnnotationHub` provides reference annotations, `ExperimentHub` provides curated, analysis-ready datasets from publications and large consortia (like TCGA).

```{r experimenthub}
# Initialize the hub
eh <- ExperimentHub()

# Query for datasets related to The Cancer Genome Atlas (TCGA)
tcga_data <- query(eh, "TCGA")

# View the first few matching datasets
head(tcga_data, 5)
```
