Alignment
Genomic alignment
“Approximate matching” Lecture slides in PDF format
Code
- Naive exact matching, naive_exact.R
- Naive approximate matching, naive_Hamming.R
- Edit distance, recursive implementation, edDistRecursive.R
- Edit distance, dynamic programming implementation, edDistDynamic.R
- Global alignment, dynamic programming implementation, globalAlignment.R
References
Nagarajan, Niranjan, and Mihai Pop. “Sequence Assembly Demystified.” Nature Reviews. Genetics 14, no. 3 (March 2013): 157–67. doi:10.1038/nrg3367. - Gentle introduction into genome assembly. Technologies. Box2: Greedy, overlap-layout-consensus, De Bruijn. Problems
Pevzner, P. A., H. Tang, and M. S. Waterman. “An Eulerian Path Approach to DNA Fragment Assembly.” Proceedings of the National Academy of Sciences 98, no. 17 (August 14, 2001): 9748–53. https://doi.org/10.1073/pnas.171285098. - First de Bruijn graph for genome assembly paper. Idea of breaking reads into fragments. Typical approach reads are vertices connected by edges if they overlap. Hamiltonian path problem - visit each vertex exactly once, NP-complete. de Bruijn graph - overlapping fragments are edges, and the problem is Eulerian path - visit each edge once. Error-correction algorithm.
Primer: Compeau, Phillip E C, Pavel A Pevzner, and Glenn Tesler. “How to Apply de Bruijn Graphs to Genome Assembly.” Nature Biotechnology 29, no. 11 (November 8, 2011): 987–91. doi:10.1038/nbt.2023.
Chaisson, Mark J. P., Richard K. Wilson, and Evan E. Eichler. “Genetic Variation and the de Novo Assembly of Human Genomes.” Nature Reviews Genetics 16, no. 11 (October 7, 2015): 627–40. doi:10.1038/nrg3933. - Genome assembling strategies, problems. OLC, De Bruijn, string graphs. Types of gaps.
Miller, Jason R., Sergey Koren, and Granger Sutton. “Assembly Algorithms for Next-Generation Sequencing Data.” Genomics 95, no. 6 (June 2010): 315–27. doi:10.1016/j.ygeno.2010.03.001. - Assembly tools for overlap/layout/consensus and the de Bruijn graph approaches. de Bruin graph Issues with genome assembly, potential solutions.
String Graph Assembler. Simpson, J. T., and R. Durbin. “Efficient de Novo Assembly of Large Genomes Using Compressed Data Structures.” Genome Research 22, no. 3 (March 1, 2012): 549–56. doi:10.1101/gr.126953.111. - SGA - String Graph Assembler. From an FM-index. Velvet, ABySS, SOAPdenovo de Bruijn graph assemblers. BWA and FM explanation
Koren, Sergey, and Adam M. Phillippy. “One Chromosome, One Contig: Complete Microbial Genomes from Long-Read Sequencing and Assembly.” Current Opinion in Microbiology 23 (February 2015): 110–20. https://doi.org/10.1016/j.mib.2014.11.014. - Genome assembly overview focusing on long reads. Repeats (global and local) are problematic. Details on technologies: PacBio RS, Illumina’s Moleculo, ONT MinION. Assembling approaches: OLC, hierarchical hybrid (long reads correction using another technology) and non-hybrid (self long reads alignment-correction). Assembly augmentation: gap filling, scaffolding, read threading. Table 1 - long read assembly tools and descriptions.
Alignment algorithms and tools
Needleman, S. B., and C. D. Wunsch. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48, no. 3 (March 1970): 443–53.
Smith, T. F., and M. S. Waterman. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147, no. 1 (March 25, 1981): 195–97.
Burrows, Michael, and David J Wheeler. “A Block-Sorting Lossless Data Compression Algorithm,” 1994. - BWT paper
Ferragina, Paolo, and Giovanni Manzini. “Opportunistic Data Structures with Applications.” In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium On, 390–98. IEEE, 2000. https://dl.acm.org/citation.cfm?id=796543. - FM index paper
Li, Heng, and Richard Durbin. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics (Oxford, England) 25, no. 14 (July 15, 2009): 1754–60. https://doi.org/10.1093/bioinformatics/btp324. - BWA first paper
Li, Heng, and Richard Durbin. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics (Oxford, England) 26, no. 5 (March 1, 2010): 589–95. https://doi.org/10.1093/bioinformatics/btp698. - BWA second paper
Li, Heng. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM.” ArXiv Preprint ArXiv:1303.3997, 2013. - BWA-MEM (maximal exact matches). Automatically select end-to-end or gapped alignment.
Langmead, Ben, and Steven L Salzberg. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9, no. 4 (March 4, 2012): 357–59. https://doi.org/10.1038/nmeth.1923. - Bowtie2 gapped alignment
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics (Oxford, England) 29, no. 1 (January 1, 2013): 15–21. https://doi.org/10.1093/bioinformatics/bts635. - STAR gapped aligner. Algorithm, testing on simulated dataset.
Kim, Daehwan, Ben Langmead, and Steven L. Salzberg. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nature Methods 12, no. 4 (April 2015): 357–60. https://doi.org/10.1038/nmeth.3317. - HISAT aligner. Two indexes - one large with many small
The Freiburg RNA tools teaching material on RNA- and DNA-sequencing alignment (Needleman-Wunsch, Smith-Waterman, many more). http://rna.informatik.uni-freiburg.de/Teaching/
Fonseca, Nuno A., Johan Rung, Alvis Brazma, and John C. Marioni. “Tools for Mapping High-Throughput Sequencing Data.” Bioinformatics (Oxford, England) 28, no. 24 (December 15, 2012): 3169–77. https://doi.org/10.1093/bioinformatics/bts605. - Sequencing mappers, lists and characteristics of DNA, RNA, bisulfite aligners. https://www.ebi.ac.uk/~nf/hts_mappers/
Videos
- The Bioinformatics algoriths web site
- “How Do We Assemble Genomes?” - genome assembly algorithms, https://youtu.be/vjB6nhOu3BY?list=PLQ-85lQlPqFNGdaeGpV8dPEeSm3AChb6L
- “How Do We Compare Biological Sequences? (Dynamic Programming)” - global and local alignment using dynamic programming, https://www.youtube.com/watch?list=PLQ-85lQlPqFNmbPEsMoxb5dM5qtRaVShn&v=slUaVeNvuTk
- Ben Langmead’s Slides for Algorithms for DNA Sequencing Coursera class. https://github.com/BenLangmead/ads1-slides.git. The complete class is at http://www.langmead-lab.org/teaching-materials/