
class: center, middle, inverse, title-slide

.title[
# Genomic Resources
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 2026-03-04
]

---

<style>
.large { font-size: 130%; }
.small { font-size: 70%; }
.tiny { font-size: 40%; }
</style>

<!--## Key Resource Categories

1. **Cancer Genomics** - TCGA, cBioPortal, COSMIC, ICGC/PCAWG

2. **Variant Interpretation** - ClinVar, COSMIC

3. **Gene & Genome Browsers** - UCSC, Ensembl, NCBI Gene

4. **Bulk Expression Data** - GTEx, GEO, SRA, ArrayExpress

5. **Single-Cell Expression** - CELLxGENE, Human Cell Atlas, Tumor Atlas

6. **Functional Genomics** - DepMap, ENCODE, Roadmap

7. **Protein & Pathways** - Human Protein Atlas, STRING, Reactome-->

## TCGA: The Cancer Genome Atlas

- **33 cancer types**, >11,000 patients

- Multi-omic data: genomics, transcriptomics, epigenomics, proteomics

- Gold standard for cancer genomics research

- **GDC Data Portal**: https://portal.gdc.cancer.gov

- **Broad Firehose**: https://gdac.broadinstitute.org

- R/Bioconductor packages for programmatic access (e.g., `TCGAbiolinks`)

---
## cBioPortal

* Rich set of tools for visualization, analysis and download of large-scale cancer genomics data sets.
  * Mutations (OncoPrint display)
  * Mutual exclusivity of genetic events (log-odds ratio)
  * Correlations among genetic events (boxplots)
  * Survival (Kaplan-Meier plots)

* The Onco Query Language (OQL) to fine-tune queries

.small[
http://www.cbioportal.org/index.do

Gao, Jianjiong, Bülent Arman Aksoy, Ugur Dogrusoz, Gideon Dresdner, Benjamin Gross, S. Onur Sumer, Yichao Sun, et al. “Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the CBioPortal.” Science Signaling 6, no. 269 (April 2, 2013): pl1. https://doi.org/10.1126/scisignal.2004088.
]
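---
## cBioPortal: Mutual Exclusivity Sketch

The log-odds ratio listed among the cBioPortal tools can be illustrated with a small calculation. This is a minimal sketch, assuming a 2x2 table of samples altered in gene A and/or gene B; the function name, the 0.5 pseudocount, and all counts are hypothetical, and this is not cBioPortal's own code.

```python
import math


def log2_odds_ratio(both, a_only, b_only, neither, pseudocount=0.5):
    """Log2 odds ratio for co-alteration of two genes.

    The pseudocount (an assumption of this sketch) avoids division
    by zero when one cell of the 2x2 table is empty.
    """
    odds = ((both + pseudocount) * (neither + pseudocount)) / (
        (a_only + pseudocount) * (b_only + pseudocount)
    )
    return math.log2(odds)


# Hypothetical counts: 100 samples, alterations rarely overlap.
exclusive = log2_odds_ratio(both=1, a_only=30, b_only=25, neither=44)
# Hypothetical counts: 100 samples, alterations usually found together.
co_occurring = log2_odds_ratio(both=30, a_only=3, b_only=2, neither=65)
print(f"exclusive pair: {exclusive:.2f}; co-occurring pair: {co_occurring:.2f}")
```

Negative values point toward mutual exclusivity, positive values toward co-occurrence.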

---
## cBioPortal: Programmatic Access

- **Web REST API:** Standardized Swagger/OpenAPI interface allowing access to all studies, clinical data, and molecular profiles.

- **R Clients:**
    - `cBioPortalData`: Modern Bioconductor package that returns data as **MultiAssayExperiment** objects for high-level integration.
    - `cbioportalR`: A tidyverse-compatible wrapper designed for easy data retrieval and clinical research workflows.

- **Python Clients:**
    - `pyBioPortal`: Streamlines data acquisition directly into **Pandas DataFrames** for rapid analysis.
    - `cbio_py`: A simple wrapper for basic API interactions and dictionary-based data retrieval.

.small[ https://docs.cbioportal.org/web-api-and-clients/ ]
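---
## cBioPortal: REST API Sketch

The REST interface can also be called without a client package. This is a minimal sketch using only the Python standard library; it assumes the public instance at `www.cbioportal.org`, the `/studies` endpoint, and the `keyword`, `projection`, and `pageSize` query parameters as described in the API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.cbioportal.org/api"  # public instance (assumed)


def studies_url(keyword, page_size=10):
    """Build the URL that lists studies matching a keyword."""
    query = urlencode(
        {"keyword": keyword, "projection": "SUMMARY", "pageSize": page_size}
    )
    return f"{BASE_URL}/studies?{query}"


def fetch_studies(keyword, page_size=10):
    """Download matching studies as a list of dicts (needs network access)."""
    with urlopen(studies_url(keyword, page_size), timeout=30) as response:
        return json.load(response)


# Only the URL is built here; call fetch_studies("breast") to run the query.
print(studies_url("breast"))
```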

---
## Therapeutically Applicable Research To Generate Effective Treatments (TARGET)

- A comprehensive genomic approach to determine molecular changes that drive childhood cancers.

- Includes extensive data on pediatric cancers such as Acute Myeloid Leukemia (AML) and Neuroblastoma.

.small[ https://ocg.cancer.gov/programs/target ]

---
## Cancer Cell Line Encyclopedia (CCLE)

- Provides genome-wide information on ~1,000 cancer cell lines under baseline conditions.

- Includes deep pharmacologic response profiles (IC50, the half-maximal inhibitory concentration).

- Features comprehensive mutation status analysis for identifying drug targets.

.small[ https://portals.broadinstitute.org/ccle ]
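---
## CCLE: IC50 Sketch

To make the IC50 definition concrete, the sketch below estimates the concentration at which viability drops to 50%. It assumes simple log-linear interpolation between measured doses, which is a simplification and not the curve-fitting procedure used by CCLE; the function name and the dose-response values are hypothetical.

```python
import math


def ic50_by_interpolation(doses, viability):
    """Estimate IC50 by interpolating on log10(dose).

    doses: increasing concentrations (e.g., in uM)
    viability: surviving fraction at each dose (1.0 = untreated)
    Returns None when viability never crosses 50%.
    """
    points = list(zip(doses, viability))
    for (d_lo, v_lo), (d_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            # Fraction of the way from v_lo down to 0.5
            frac = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(d_lo) + frac * (
                math.log10(d_hi) - math.log10(d_lo)
            )
            return 10 ** log_ic50
    return None


# Hypothetical 5-point dose-response curve (doses in uM).
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
viability = [0.98, 0.90, 0.60, 0.20, 0.05]
print(ic50_by_interpolation(doses, viability))
```

A lower IC50 indicates a more sensitive cell line; a line whose viability never reaches 50% has no IC50 within the tested range.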

---
## Stand Up To Cancer (SU2C)

- Detailed molecular profiling of 50 breast cancer cell lines.

- Incorporates drug sensitivity metrics (GI50) for 77 diverse therapeutic compounds.

.small[ http://www.standuptocancer.org/ ]

---
## NCI's Genomic Data Commons (GDC)

* Launched on June 6, 2016

* Provides standardized genomic and clinical data from
  * **The Cancer Genome Atlas (TCGA)**
  * **Therapeutically Applicable Research To Generate Effective Treatments (TARGET)**
  * **Cancer Cell Line Encyclopedia (CCLE)**
  * **Stand Up To Cancer (SU2C)** 
  * Many more

.small[https://gdc.cancer.gov/]

---
## Accessing GDC

* The GDC Application Programming Interface (API)

* `GenomicDataCommons` - GDC access in R

.small[ https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#api-endpoints ]

.small[ https://bioconductor.org/packages/GenomicDataCommons/ ]

* `GDCRNATools` - downloading, organizing, and integrative analysis of RNA data in the GDC

.small[ https://github.com/rli012/GDCRNATools ]

.small[ Li, Ruidong, Han Qu, Shibo Wang, Julong Wei, Le Zhang, Renyuan Ma, Jianming Lu, Jianguo Zhu, Wei-De Zhong, and Zhenyu Jia. “GDCRNATools: An R/Bioconductor Package for Integrative Analysis of LncRNA, MiRNA, and MRNA Data in GDC,” December 11, 2017. https://doi.org/10.1101/229799. ]
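---
## Accessing GDC: API Sketch

A GDC API query is a URL carrying a JSON-encoded filter. This is a minimal sketch using only the Python standard library; it assumes the `/files` endpoint at `api.gdc.cancer.gov` and the field names `cases.project.project_id`, `data_type`, and `access` from the API documentation. The function names and the example project are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GDC_API = "https://api.gdc.cancer.gov"


def _is_in(field, values):
    """One GDC filter clause: field must match any of the values."""
    return {"op": "in", "content": {"field": field, "value": values}}


def project_files_url(project_id, data_type, size=5):
    """Build a /files query for open-access files of one data type."""
    filters = {
        "op": "and",
        "content": [
            _is_in("cases.project.project_id", [project_id]),
            _is_in("data_type", [data_type]),
            _is_in("access", ["open"]),
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": size,
    }
    return f"{GDC_API}/files?{urlencode(params)}"


def fetch_hits(url):
    """Run the query and return matching records (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)["data"]["hits"]


url = project_files_url("TCGA-BRCA", "Gene Expression Quantification")
print(url)
```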

---
## LINCS: Library of Integrated Network-based Cellular Signatures

- An NIH Common Fund program that generates a massive library of molecular "signatures" (patterns).

- Measures cellular responses to chemical, genetic, and environmental stressors across multiple cell types.

- Integrates diverse data types including transcriptomics, proteomics, and high-content imaging.

.small[ https://lincsproject.org/ ]

---
## Connectivity Map (CMap)

- A biological resource for finding connections between genes, drugs, and diseases via gene expression.

- Utilizes the **L1000 assay**, a high-throughput technology measuring ~1,000 landmark genes to infer the full transcriptome.

- Contains over **1.3 million perturbational profiles**, significantly expanded from its original 2006 pilot.

- Enables researchers to identify drugs that can mimic or reverse a specific disease state.

.small[ https://www.broadinstitute.org/connectivity-map-cmap ]

---
## CLUE: CMap & LINCS Unified Environment

- The modern, cloud-based central hub to analyze CMap and LINCS data.

- Provides user-friendly applications like **Query** for pattern matching and **Morpheus** for heatmap visualization.

.small[
.pull-left[
**Core Analysis Tools**

* **Touchstone:** Benchmark dataset of well-studied perturbagens used to assess and explore signature connectivity.

* **Query:** Matches your gene expression signature against the entire CMap library to find positive or negative connections.

* **Proteomics Query:** Connects protein sets of interest to the **Touchstone-P** reference (P100 and GCP proteomics data).
]
.pull-right[
**Visualization & Discovery**

* **Morpheus:** Interactive matrix visualization tool for manipulating, annotating, and exploring heatmaps of genomic datasets.

* **Repurposing:** A dedicated portal to explore ~5,000 clinical drugs and tool compounds for new therapeutic opportunities.

https://clue.io 
]
]

---
## DepMap: Cancer Dependency Map

Mapping Genetic Vulnerabilities in Cancer

* **Gene Essentiality:** Identifies critical survival genes via **CRISPR screens** in >1,000 cell lines.

* High-throughput drug response data (PRISM, CTD²).

* Correlates genetic alterations directly with drug sensitivity.

.small[ https://depmap.org/portal ]

---
## COSMIC: Catalogue Of Somatic Mutations In Cancer

- World's largest database of **somatic mutations**

- Curated from literature + systematic screens

- \>1,000 cancer genomes

- **Cancer Gene Census** - curated cancer drivers
- **Mutation signatures** - mutational processes
- **Resistance mutations** - therapy resistance
- **Structural variants**
- **Fusion genes** - gene fusions in cancer

.small[ https://cancer.sanger.ac.uk/cosmic ]

---
## ClinVar: Clinical Variant Interpretation

- **Germline variant** clinical significance database

- Aggregates interpretations from clinical labs

- Links variants to diseases & phenotypes

- **Classification Levels**
  - Pathogenic / Likely pathogenic
  - Uncertain significance (VUS)
  - Likely benign / Benign
  - Conflicting interpretations flagged

.small[ https://www.ncbi.nlm.nih.gov/clinvar ]

---
## CIViC: Clinical Interpretation of Variants in Cancer

An open-access, collaborative platform for the clinical interpretation of cancer variants.

- Links specific genomic alterations to therapeutic, prognostic, diagnostic, and predisposing clinical actions.

- Peer-reviewed summaries of the clinical significance of mutations, curated from medical literature.

- Designed to bridge the gap between raw genomic data and actionable bedside decisions in oncology.

.small[ https://civicdb.org ]

---
## IntOGen: Catalog of Cancer Driver Mutations

A comprehensive resource for identifying **cancer driver genes** across 28,000+ tumors and 66 cancer types.

- Combines multiple computational methods to detect signals of **positive selection** in cancer genomes.

- Provides biological and clinical context for identified drivers to support drug discovery and research.

.small[ https://www.intogen.org

Gonzalez-Perez, Abel, Christian Perez-Llamas, Jordi Deu-Pons, David Tamborero, Michael P Schroeder, Alba Jene-Sanz, Alberto Santos, and Nuria Lopez-Bigas. “IntOGen-Mutations Identifies Cancer Drivers across Tumor Types.” Nature Methods 10, no. 11 (September 15, 2013): 1081–82. https://doi.org/10.1038/nmeth.2642. ]

---
## AACR Project GENIE

An international pan-cancer registry of real-world clinical-grade genomic data.

- Aggregates data from over **150,000 sequenced tumors** across 19 world-leading cancer centers.

- Links CLIA-certified genomic sequencing with clinical outcomes, treatment history, and patient demographics.

- Primary public access is provided through **cBioPortal** for rapid visualization and query.

.small[ https://www.aacr.org/professionals/research/aacr-project-genie ]

---
## OncoKB: Precision Oncology Knowledge Base

High-quality, expert-reviewed database detailing the biological and clinical effects of somatic mutations.

- Categorizes variants by FDA-approved labels and professional guidelines for targeted therapies.

- Maps genomic alterations to specific drugs, providing therapeutic recommendations for clinical decision support.

- Frequently paired with cBioPortal for real-time clinical annotation of cancer genomic profiles.

.small[ https://www.oncokb.org ]

---
## PharmGKB: Pharmacogenomics Knowledgebase

A comprehensive resource for how genetic variation impacts drug efficacy, dosage, and toxicity.

- Provides annotated **CPIC** (Clinical Pharmacogenetics Implementation Consortium) guidelines and regulatory drug labels (FDA, EMA).

- Variants are assigned a **Level of Evidence (1-4)** based on the strength of the association in the literature.

- Includes curated diagrams of pharmacokinetics (PK) and pharmacodynamics (PD) for specific drugs.

.small[ https://www.pharmgkb.org ]

---
class: middle,center

# General genomic resources

---
## GTEx: Genotype-Tissue Expression

The Gold Standard for Baseline Expression

* Bulk RNA-seq of **54 non-diseased tissues**.

* Data from **~1,000 donors** and **>17,000 samples**.

* Leading resource for **eQTL** (expression Quantitative Trait Loci) mapping.

.small[ https://gtexportal.org ]

---
## GEO: Gene Expression Omnibus

The Global Repository for Functional Genomics

* Access to **3+ million samples** from thousands of independent studies.

* Archives both legacy **Microarray** and modern **Next-Gen Sequencing** (RNA-seq).

* Full experimental metadata alongside raw and processed data files.

.small[ https://www.ncbi.nlm.nih.gov/geo ]
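---
## GEO: Programmatic Search Sketch

GEO metadata can be searched through the NCBI E-utilities. This is a minimal sketch that only builds an ESearch URL; it assumes the Entrez database name `gds` (GEO DataSets) and the `[Title]`, `[Organism]`, and `[Entry Type]` field tags. The function name and the search term are illustrative.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def geo_search_url(term, max_records=20):
    """Build an ESearch URL against GEO DataSets (Entrez database 'gds')."""
    params = {
        "db": "gds",
        "term": term,
        "retmax": max_records,
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: human series (GSE) with "breast cancer" in the title.
term = '"breast cancer"[Title] AND "Homo sapiens"[Organism] AND gse[Entry Type]'
print(geo_search_url(term))
```

The returned identifiers can then be passed to ESummary to retrieve accession numbers and sample counts.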

---
## dbGaP: Database of Genotypes and Phenotypes

- NIH repository for **individual-level raw data**

- Genotype-phenotype associations

- Controlled-access datasets

- Links genomic data to clinical phenotypes

.small[ https://www.ncbi.nlm.nih.gov/gap ]

---
## dbGaP: Database of Genotypes and Phenotypes

**Data Types**

- Genome-wide association studies (GWAS)

- Sequencing data (whole genome, exome)

- Clinical & epidemiological data

- Includes TCGA, TOPMed, and many other studies

**Important:** Requires **data access approval**

---
## Sequence Read Archive (SRA)

- The NCBI database that stores sequence data generated by next-generation sequencing (NGS) technologies
    - Archives raw NGS data for various organisms from RNA-seq, WGS, ChIP-seq, etc. (FASTQ files)
    - Serves as a starting point for “secondary analyses”
    - Provides access to data from human clinical samples to authorized users who agree to the datasets’ privacy and usage mandates

- Search metadata to locate the sequence reads for download and further downstream analyses

.small[ https://www.ncbi.nlm.nih.gov/sra ]

---
## sratoolkit - Getting data from SRA

- `fastq-dump`: Convert SRA data into FASTQ format

- `prefetch`: Allows command-line downloading of SRA, dbGaP, and ADSP data

- `sam-dump`: Convert SRA data to SAM format

- `sra-pileup`: Generate pileup statistics on aligned SRA data

- `vdb-config`: Display and modify VDB configuration information

- `vdb-decrypt`: Decrypt non-SRA dbGaP data ("phenotype data")

.small[ https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit ]  
.small[ SRA Handbook https://www.ncbi.nlm.nih.gov/books/NBK47528/ ]

<!--
## Getting data from SRA

- `.sra` files are NOT FASTQ files - need to further convert them using `sratoolkit`

```
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP101/SRP101962/SRR5346141/SRR5346141.sra
## To split paired-end reads, use -I option
sratoolkit.2.8.1-win64/bin/fastq-dump -I --split-files SRR5346141
```
-->
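---
## sratoolkit: Scripted Download Sketch

The `prefetch` and `fastq-dump` steps are often scripted for many accessions. This is a minimal sketch that assembles the two command lines in Python; the function names and the output directory are hypothetical, and running the commands assumes sratoolkit is installed and on the `PATH`.

```python
import shlex
import subprocess


def sra_commands(accession, outdir="fastq"):
    """Return the prefetch and fastq-dump command lines for one run."""
    return [
        ["prefetch", accession],
        # --split-files writes paired-end mates to separate FASTQ files
        ["fastq-dump", "--split-files", "--outdir", outdir, accession],
    ]


def run_all(commands):
    """Execute the commands in order; stops at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Print the commands instead of running them.
for argv in sra_commands("SRR5346141"):
    print(shlex.join(argv))
```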

---
## ARCHS4

- A web resource that makes the majority of previously published RNA-seq data from human and mouse freely available at the gene/transcript count level in HDF5 format

- All available FASTQ files from RNA-seq experiments were retrieved from GEO and SRA and aligned using a cloud-based infrastructure.

- 1,040,000 mouse and 922,000 human samples

- Gene-centric exploratory analysis of average expression across cell lines and tissues, top co-expressed genes, and predicted biological functions and protein-protein interactions for each gene based on prior knowledge combined with co-expression

.small[ Lachmann, A., Torre, D., Keenan, A.B. et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun 9, 1366 (2018). https://doi.org/10.1038/s41467-018-03751-6  
https://archs4.org/
]

---
## ENCODE Project

- Encyclopedia of **DNA elements**

- Regulatory regions, TFBSs, chromatin state

- \>10,000 experiments across cell types

- Includes cancer cell lines

- ChIP-seq (histone marks, TFs)
- DNase-seq / ATAC-seq
- RNA-seq (incl. long non-coding RNAs)
- 3D genome organization (Hi-C)

.small[ https://www.encodeproject.org ]

---
## Roadmap Epigenomics

Reference Maps of the Human Epigenome

* Features **127 consolidated epigenomes** from primary human tissues and stem cells.

* Defines core **chromatin states** (promoters, enhancers, quiescent) using ChromHMM.

* Unified mapping of **DNA methylation**, **histone marks**, and **chromatin accessibility**.

* Establishes the "normal" landscape to identify epigenetic disruptions in disease.

.small[ http://www.roadmapepigenomics.org ]

---
class: middle, center

# Genome browsers/Visualization

---
## UCSC Genome Browser

The Hub for Genomic Visualization

* Explore hundreds of annotation tracks.

* Seamlessly upload and view **Custom Tracks**.

* Conservation, regulation, and multi-species alignments.

* Real-time access to variation and clinical data.

.small[ https://genome.ucsc.edu ]
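---
## UCSC Genome Browser: Custom Track Sketch

A custom track is a plain-text file with a `track` line followed by records in a supported format. This is a minimal sketch that writes a four-column BED track; the function name, track name, and coordinates are illustrative, and `visibility=2` (full display) is one possible setting.

```python
def bed_custom_track(name, description, intervals):
    """Format intervals as a BED custom track for upload to the browser.

    intervals: (chrom, start, end, label) tuples in BED coordinates,
    i.e., 0-based start and exclusive end.
    """
    lines = [f'track name="{name}" description="{description}" visibility=2']
    for chrom, start, end, label in intervals:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical regions of interest (coordinates are illustrative).
track = bed_custom_track(
    "my_regions",
    "Example regions",
    [
        ("chr17", 7668401, 7687550, "regionA"),
        ("chr13", 32315473, 32400266, "regionB"),
    ],
)
print(track, end="")
```

The resulting text can be pasted into the browser's custom track form or saved to a file and uploaded.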

---
## UCSC Xena Functional Genomics Explorer

* Formerly the UCSC Cancer Genomics Browser, now UCSC Xena

* Includes TCGA, Cancer Cell Line Encyclopedia, the Stand Up To Cancer (SU2C) Breast Cancer data, custom datasets

* A tool to visually explore and analyze cancer genomics data and its associated clinical information.

* Gene- and genome-centric view

* Survival analysis on user-defined subgroups

.small[
https://xenabrowser.net/, https://xenabrowser.net/datapages/, http://xena.ucsc.edu/getting-started/

Cline, Melissa S., Brian Craft, Teresa Swatloski, Mary Goldman, Singer Ma, David Haussler, and Jingchun Zhu. “Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser.” Scientific Reports 3 (October 2, 2013): 2652. https://doi.org/10.1038/srep02652.
]

<!---
## Ensembl

**Comprehensive Genome Annotation**

* **Core Genes:** Gold-standard gene and transcript annotations.

* **Variant Analysis:** Industry-leading **Variant Effect Predictor (VEP)**.

* **Regulation:** Deep epigenomics and regulatory feature mapping.

* **Evolution:** Extensive comparative genomics across lineages.

.small[ https://www.ensembl.org ] -->

---
## WashU Epigenome Browser

Next-Gen Epigenomic Visualization

* Native access to **Roadmap Epigenomics** projects.

* Supports standard UCSC track types and features.

* Specialized for high-resolution large (epi)genomic landscapes.

.small[ https://epigenomegateway.wustl.edu/ ]

---
## Genomic Data Visualization
Integrative Genomics Viewer (IGV)
<img src="img/igv.png" alt="" width="800px" style="display: block; margin: auto;" />
.small[ https://igv.org/doc/desktop/ ]

---
## IGV Features

- Explore large genomic datasets with an intuitive, easy-to-use interface.

- Integrate multiple data types with clinical and other sample information.

- View data from multiple sources:
    - Local, remote, and cloud-based
    - Intelligent remote file handling: no need to download the whole dataset

- Automation of specific tasks using the command-line interface

.small[ https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial ]
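---
## IGV: Batch Script Sketch

Automation in IGV is typically done with a batch script, a text file with one command per line. This is a minimal sketch that generates such a script; it assumes the batch commands `new`, `genome`, `load`, `snapshotDirectory`, `goto`, `snapshot`, and `exit`, and the function name, file names, and locus are placeholders.

```python
def igv_batch_script(genome, tracks, loci, snapshot_dir):
    """Build an IGV batch script that saves one image per locus."""
    lines = ["new", f"genome {genome}"]
    lines += [f"load {path}" for path in tracks]
    lines.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        lines.append(f"goto {locus}")
        # Replace characters that are awkward in file names
        filename = locus.replace(":", "_").replace("-", "_") + ".png"
        lines.append(f"snapshot {filename}")
    lines.append("exit")
    return "\n".join(lines) + "\n"


# Hypothetical inputs: file names and locus are placeholders.
script = igv_batch_script(
    genome="hg38",
    tracks=["sample1.bam", "sample2.bam"],
    loci=["chr17:7668401-7687550"],
    snapshot_dir="snapshots",
)
print(script, end="")
```

The saved script can then be passed to IGV at startup with its batch option, or run from the Tools menu.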

---
## UCSC annotations in IGV

<!--
NCBI Gene Browser

**Integrated Clinical & Literature Hub**

* **Reference Standards:** Curated **RefSeq** gene summaries and annotations.

* **Unified Ecosystem:** Direct links to NCBI databases (GEO, dbGaP, Assembly).

* **Clinical Insights:** Real-time association with **ClinVar** and clinical variants.

* **Evidence-Based:** Integrated access to gene-specific **PubMed** literature.

.small[ https://www.ncbi.nlm.nih.gov/gene ]-->

<!--
Galaxy

- Web-based framework offering a user-friendly interface mapping to most popular bioinformatics tools
    - "Data intensive biology for everyone."

- Allows for reproducible results
    - Steps / parameters kept in history

- Ability to design custom pipelines and import others’
    - All through a user-friendly GUI

- Tailored for small/medium scale projects with not too many samples

.small[ https://usegalaxy.org/ ]
-->

---
class: middle,center

# Single-Cell RNA-seq Resources

<!-- **Limitations of Bulk RNA-seq**
- Averages signal across all cells
- Masks **cellular heterogeneity**
- Cannot identify rare cell populations
- Misses cell-type specific effects

**Single-Cell Advantages**
- Cell-type resolution
- Identify rare cell populations (e.g., cancer stem cells)
- Tumor microenvironment composition
- Cell state transitions
- Track clonal evolution
-->

---
## CELLxGENE Discover

- Largest single-cell data repository
- \>60 million cells from 1,000+ datasets
- Curated & standardized metadata
- Interactive browser (no code needed)
- Developed by Chan Zuckerberg Initiative

- Query gene expression in any cell type
- Compare across tissues/conditions
- Download processed data
- Cell type annotations
- Disease vs healthy comparisons

.small[ https://cellxgene.cziscience.com ]

---
## Human Cell Atlas

- International effort to map **all human cell types**
- Focus on healthy tissues as reference
- Multiple organs & developmental stages
- Standardized protocols & analysis

- Browse & download gene expression matrices
- Cell type annotations
- Spatial transcriptomics (some)

.small[ https://www.humancellatlas.org ]  
.small[ https://data.humancellatlas.org ]

---
## 10x Genomics Datasets

Collection of high-quality, open-access scRNA-seq datasets.

- Widely used by the community to test and validate new bioinformatic algorithms and pipelines.

- Includes data for single-cell ATAC-seq, spatial transcriptomics (Visium), and immune profiling.

- Spans diverse biological contexts, including peripheral blood (PBMC), tumor microenvironments, and embryonic development.

.small[ https://www.10xgenomics.com/datasets ]

---
## Cancer-Specific Single-Cell Atlases

**Single Cell Portal (Broad)**

- \>400 studies including many cancers
- Interactive visualization
- Custom analyses available

.small[https://singlecell.broadinstitute.org]

**Tumor Immune Single-cell Hub**

- Focus on **tumor microenvironment**
- \>2 million cells from cancer patients
- Cell type annotations for TME
- Gene expression in immune/stromal cells

.small[ http://tisch.comp-genomics.org ]

<!--
## CancerSEA
**http://biocc.hrbmu.edu.cn/CancerSEA**

- Cancer cell **functional states**
- 14 functional states (e.g., EMT, stemness, hypoxia)
- Gene signatures for each state
- Compare across cancer types
-->

---
class: middle,center

# Protein/Pathway resources

---
## Human Protein Atlas

- **Protein expression** in normal & cancer tissues

- Immunohistochemistry images

- Single-cell RNA-seq data

- Pathology-based annotations

- **Tissue Atlas** - normal tissues
- **Pathology Atlas** - 17 major cancers
- **Cell Atlas** - single-cell expression
- **Blood Atlas** - protein in blood
- **Brain Atlas** - brain-specific

.small[ https://www.proteinatlas.org ]

---
## STRING: Protein-Protein Interactions

Mapping the Functional Interactome

* Visualizes complex **Protein-Protein Interaction (PPI)** networks.

* Combines confirmed experimental data with high-confidence computational predictions.

* Covers **>14,000 organisms** with a specialized focus on the human proteome.

.small[ https://string-db.org ]
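---
## STRING: API Sketch

STRING networks can also be retrieved programmatically. This is a minimal sketch that builds a request URL; it assumes the `network` method with `tsv` output and the `identifiers`, `species`, `required_score`, and `caller_identity` parameters from the STRING API documentation. The function name, gene list, and caller identity are placeholders.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_network_url(genes, species=9606, required_score=700):
    """Build a STRING 'network' request returning interactions as TSV.

    species: NCBI taxonomy ID (9606 = human)
    required_score: minimum combined score on STRING's 0-1000 scale
    """
    params = {
        # STRING separates multiple names with a carriage return
        "identifiers": "\r".join(genes),
        "species": species,
        "required_score": required_score,
        "caller_identity": "genomic_resources_slides",  # placeholder
    }
    return f"{STRING_API}/tsv/network?{urlencode(params)}"


print(string_network_url(["TP53", "MDM2", "CDKN1A"]))
```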

---
## Reactome: Pathway Database

- Curated **pathway database**

- 2,600+ human pathways

- Peer-reviewed annotations

- Hierarchical pathway organization

- PathwayBrowser (visualization)
- Analysis tools (over-representation)
- Tissue-specific pathway activity

.small[ https://reactome.org ]
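---
## Reactome: Over-Representation Sketch

Over-representation analysis asks whether a gene list overlaps a pathway more than expected by chance. This is a generic sketch of a one-sided hypergeometric test, not Reactome's own implementation; the function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_query, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_universe: genes in the background set
    n_pathway:  background genes annotated to the pathway
    n_query:    genes in the submitted list
    n_overlap:  submitted genes that fall in the pathway
    """
    upper = min(n_pathway, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_query)


# Hypothetical example: 10 of 200 submitted genes hit a 100-gene pathway
# in a 20,000-gene background (about 1 hit expected by chance).
p = enrichment_p_value(n_universe=20000, n_pathway=100, n_query=200, n_overlap=10)
print(f"p = {p:.3g}")
```

In practice the p-values are corrected for multiple testing across all pathways.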

---
## CPTAC: Clinical Proteomic Tumor Analysis Consortium

A comprehensive effort to accelerate the understanding of the molecular basis of cancer through the integration of **proteomics and genomics**.

- Provides high-throughput, standardized quantitative protein expression data (mass spectrometry) alongside matched DNA and RNA sequencing.

- Goes beyond the genome to identify how mutations and copy number alterations manifest as changes in **protein levels and post-translational modifications** (e.g., phosphorylation).

- Links deep molecular phenotypes with detailed clinical and pathological data across a wide range of cancer types.

.small[ https://proteomics.cancer.gov/programs/cptac ]

---
class: middle, center

# References

---
## Large genomics projects and resources

.small[
| Name                                                    | Website                         | Description                                                                                                                                                                      |
|---------------------------------------------------------|---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1000 Genomes Project (1KGP)                             | www.internationalgenome.org     | This project includes whole-genome and exome sequencing data from 2,504 individuals across 26 populations                                                                        |
| Cancer Cell Line Encyclopedia (CCLE)                    | portals.broadinstitute.org/ccle | This resource includes data spanning 1,457 cancer cell lines                                                                                                                     |
| Encyclopedia of DNA Elements (ENCODE)                   | www.encodeproject.org           | The goal of this project is to identify functional elements of the human genome using a gamut of sequencing assays across cell lines and tissues                                 |
| Genome Aggregation Database (gnomAD)                    | gnomad.broadinstitute.org       | This resource entails coverage and allele frequency information from over 120,000 exomes and 15,000 whole genomes                                                                |
| Genotype–Tissue Expression (GTEx) Portal                | gtexportal.org                  | This effort has to date performed RNA sequencing or genotyping of 714 individuals across 53 tissues                                                                              |
| Global Alliance for Genomics and Health (GA4GH)         | genomicsandhealth.org           | This consortium of over 400 institutions aims to standardize secure sharing of genomic and clinical data                                                                         |
]

---
## Large genomics projects and resources

.small[
| Name                                                    | Website                         | Description                                                                                                                                                                      |
|---------------------------------------------------------|---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| International Cancer Genome Consortium (ICGC)           | icgc.org                        | This consortium spans 76 projects, including TCGA                                                                                                                                |
| Million Veteran Program (MVP)                           | www.research.va.gov/mvp         | This US programme aims to collect blood samples and health information from 1 million military veterans                                                                          |
| Model Organism Encyclopedia of DNA Elements (modENCODE) | www.modencode.org               | The goal of this effort is to identify functional elements of the Drosophila melanogaster and Caenorhabditis elegans genomes using a gamut of sequencing assays                  |
| Precision Medicine Initiative (PMI)                     | allofus.nih.gov                 | This US programme aims to collect genetic data from over 1 million individuals                                                                                                   |
| The Cancer Genome Atlas (TCGA)                          | cancergenome.nih.gov            | This resource includes data from 11,350 individuals spanning 33 cancer types                                                                                                     |
| Trans-Omics for Precision Medicine (TOPMed)             | topmed.nhlbi.nih.gov            | The goal of this programme is to build a commons with omics data and associated clinical outcomes data across populations for research on heart, lung, blood and sleep disorders |

Langmead, Ben, and Abhinav Nellore. “Cloud Computing for Genomic Data Analysis and Collaboration.” Nature Reviews Genetics, January 30, 2018. https://doi.org/10.1038/nrg.2017.113. ]

<!--
## A comparison of genomics data types

.small[
| NGS technology                       | Total bases | Compressed bytes | Equivalent size | Core hours to analyse 100 samples | Comments                                                              |
|--------------------------------------|-------------|------------------|-----------------|-----------------------------------|-----------------------------------------------------------------------|
| Single-cell RNA sequencing           | 725 million | 300 MB           | 50 MP3 songs    | 20                                | >100,000 such samples in SRA, >50,000 from humans                     |
| Bulk RNA sequencing                  | 4 billion   | 2 GB             | 2 CD-ROMs       | 100                               | >400,000 such samples in SRA, >100,000 from humans                    |
| Human reference genome (GRCh38)      | 3 billion   | 800 MB           | 1 CD-ROM        | NA                                |                                                                       |
| Whole-exome sequencing               | 9.5 billion | 4.5 GB           | 1 DVD movie     | 4,000                             | \~1,300 human samples from 1000 Genomes Project alone                  |
| Whole-genome sequencing of human DNA | 75 billion  | 25 GB            | 1 Blu-ray movie | 30,000                            | \~18,000 human samples with 30x coverage from the TOPMed project alone |
]
-->