“[…] knowledge of sequences could contribute much to our understanding of living matter.” Frederick Sanger, 1980
When we talk about genome assembly, we inevitably need to mention sequencing “generations”. If you are a biologist, you probably already know that sequencing generations are not independent of each other. To get a good assembly, biologists generally combine more than one technology to generate DNA reads. So, before defining assembly and going through some tutorials, let us briefly talk about sequencing generations. Keep in mind, though, that these generations overlap, mainly the second and third ones, which together are also called Next Generation Sequencing.
First Sequencing Generation
In only two decades, modern biology has revolutionized science as a whole, after the Human Genome Project published its first draft in 2001 and the finished sequence shortly after (Consortium 2004). Almost all research areas are being influenced by genetics, including energy, agroindustry, medicine and engineering.
The first completely sequenced genome was that of bacteriophage MS2, an RNA virus, in 1976; the first DNA genome, that of bacteriophage φX174, followed a year later (Sanger et al., 1977). The technology available at that time was the “plus and minus” method (Sanger & Coulson, 1975), an early Sanger method in which deoxyribonucleotides (dNTPs) are used in separate reactions to generate fragments of assorted lengths, later fractionated by gel electrophoresis. This method made it possible to read a whole DNA fragment and marked the beginning of the bioinformatics era.
Frederick Sanger kept improving his technology and, in 1977, created the famous chain-termination or dideoxy technique (Sanger, Nicklen and Coulson, 1977), which is still used in many places today. It uses dideoxynucleotides (ddNTPs), dNTP analogues that lack the 3’ hydroxyl group required for DNA extension during synthesis. Radiolabelled ddNTPs are added to four parallel synthesis reactions, one per base, and the original sequence is then read from an autoradiograph.
Several other changes have improved this technique, such as replacing phospho- or tritium radiolabelling with fluorometric detection and detecting fragments through capillary-based electrophoresis (Heather and Chain 2016). However, the machines could not produce reads much longer than one kilobase, which is why shotgun sequencing was developed later: overlapping DNA fragments are cloned and sequenced separately, and the resulting reads are then assembled into long contiguous sequences (contigs).
In addition, technologies such as the polymerase chain reaction (PCR) (Saiki et al., 1988) and recombinant DNA (Jackson, Symons and Berg, 1972) made it possible to obtain much larger quantities of pure DNA to sequence.
Second Sequencing Generation
New technologies (Shendure & Ji, 2008) appeared based on a luminescent method (Nyrén & Lundin, 1985) for measuring pyrophosphate synthesis (Ronaghi et al., 1998): pyrosequencing, which was licensed to 454 Life Sciences, the first big company to launch a next-generation sequencing (NGS) technology. The great transition came with the mass parallelization of sequencing reactions, which increased the amount of DNA sequenced per run and produced millions of reads 400-500 base pairs (bp) long (Heather & Chain, 2016).
Techniques comparable to 454 emerged in the following years, among them Solexa, later acquired by Illumina, which uses the ‘bridge amplification’ method (Fedurco et al., 2006). Although the first Genome Analyzer (GA) machine (by Solexa) generated only very short reads, about 35 bp long, it could produce paired-end (PE) data (forward and reverse DNA information), improving the accuracy of mapping reads to a reference genome (Heather & Chain, 2016). The second GA version was later replaced by HiSeq/MiSeq, with longer reads of ~150 bp (Quail et al., 2012).
Another influential company was Applied Biosystems (which became Life Technologies after merging with Invitrogen and is now part of Thermo Fisher Scientific), owner of SOLiD (Mckernan et al., 2009), a sequencing system based on ligation and detection. SOLiD was followed by Ion Torrent, a platform in which nucleotide incorporation is detected through the change in pH caused by the release of protons during DNA synthesis (Rothberg et al., 2011); it can generate reads ~200 bp long (Quail et al., 2012). However, interpreting homopolymer sequences is not an easy task with Ion Torrent, because the signal becomes hard to resolve when many identical dNTPs are incorporated at once (Loman et al., 2012).
The cost of sequencing has been dramatically changed by these companies, which revolutionized the complexity of microchips and increased the number of chemical methods available for sequencing (Heather & Chain, 2016). Illumina, though, is considered the most successful sequencing platform, which has given the company a near monopoly (Greenleaf & Sidow, 2014; Heather & Chain, 2016).
Third Sequencing Generation
We are currently living in the era of third-generation DNA sequencing (Schadt, Turner and Kasarskis, 2010; Heather & Chain, 2016), a step towards longer reads, real-time sequencing and new technologies. These technologies can sequence single molecules without the DNA amplification required by all previous sequencers (Heather & Chain, 2016).
The first single molecule sequencing (SMS) machine was commercialized by Helicos BioSciences (Harris et al., 2008). It works much like Illumina's technology but without bridge amplification, which avoids the biases and errors associated with amplified DNA.
Today, one of the best-known third-generation technologies is Single Molecule Real Time (SMRT) sequencing from Pacific Biosciences (PacBio). Despite its cost, PacBio has been used to generate much longer reads, up to 10 kb (Van Dijk et al., 2014), which are necessary to assemble big genomes such as the 32-gigabase-pair axolotl genome, the biggest genome ever assembled at the time of writing (Nowoshilow et al., 2018). Its high base-calling error rate, however, is an issue yet to be settled.
Nanopore technologies have also appeared as a promise for the future of sequencing (Haque et al., 2013). The first nanopore sequencers, GridION and MinION, were developed by Oxford Nanopore Technologies (ONT) (Eisenstein, 2012; Clarke et al., 2009), and the latter was innovative for its size, similar to a USB drive (Loman & Quinlan, 2014). Nanopore sequencers are expected to lead to fast, low-cost and compact machines with long and accurate reads (Heather & Chain, 2016). For the moment, their long reads can be combined with data from current, more accurate technologies (Madoui et al., 2015; Karlsson et al., 2015).
Pre-processing the reads
After sequencing comes the quality control step. It can be done in several ways, including with tools offered by the sequencer manufacturer on the sequencing machine itself. What we need to do is visualize the quality of our reads. Here we are going to use FastQC, the best-known tool for visualizing read quality.
First of all, download FastQC. We’ll use the same dataset for all the tests here, so please download the reference genome and the raw reads from the European Nucleotide Archive (ENA). You can do it manually or over FTP with the wget command. Remember to extract the files afterwards:
gzip -d SRR1816870_subreads.fastq.gz
gzip -d SRR1818128_subreads.fastq.gz
gzip -d SRR572209_1.fastq.gz
gzip -d SRR572209_2.fastq.gz
For each file, type “fastqc read_file.fastq -t number_of_threads” in the terminal. It will generate an HTML file that you can open to detect any issues that need handling.
./fastqc SRR1816870_subreads.fastq -t 8
./fastqc SRR1818128_subreads.fastq -t 8
./fastqc SRR572209_1.fastq -t 8
./fastqc SRR572209_2.fastq -t 8
1. All of it will take some time. It’s okay, we are doing science here 🙂. Anyway, if you have any trouble, you can run the same process on only one or two files, or even take another file from ENA or NCBI.
2. The following video will help you interpret the FastQC HTML report.
FastQC tells us about the quality of our sequencing. Sometimes we need to consider re-sequencing our datasets. However, some issues can be solved with support tools that, basically, trim your data. For example, for the file “SRR572209_1.fastq” we see problems related to “Per tile sequence quality”, “Per base sequence content”, “Sequence Duplication Levels” and “Kmer Content”. All of those problems seem to be located at the beginning and at the end of our reads, so maybe we should eliminate the first and the last bases. Let’s do it.
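The “problems at the ends of the reads” pattern can also be checked by hand. Here is a minimal Python sketch, not part of FastQC, that averages the Phred score at each read position, which is roughly what FastQC's “Per base sequence quality” plot summarizes. It assumes Phred+33 encoding and exactly four lines per record; the function names and the two toy reads are made up for illustration.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_quality_per_position(records, offset=33):
    """Average Phred score at each read position (Phred+33 assumed)."""
    totals, counts = [], []
    for _name, _seq, qual in records:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up reads: '#' encodes Phred 2, 'I' encodes Phred 40.
toy = StringIO("@r1\nACGT\n+\n#III\n@r2\nACGT\n+\n#II#\n")
means = mean_quality_per_position(parse_fastq(toy))
print(means)  # [2.0, 40.0, 40.0, 21.0]
```

On a real file you would pass an opened FASTQ file instead of the toy StringIO. Low averages at the first or last positions are the kind of signal that suggests trimming.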
Download Trimmomatic, extract it, trim the start and end of the reads, and re-run FastQC. The directory where I’m storing my files is /mnt/data-assemblies/guia/. I renamed the paired-end files to R1.fastq and R2.fastq to make it simpler.
java -jar trimmomatic-0.38.jar PE -threads 8 -phred33 /mnt/data-assemblies/guia/R1.fastq /mnt/data-assemblies/guia/R2.fastq /mnt/data-assemblies/guia/pe_1_paired.fastq /mnt/data-assemblies/guia/pe_1_unpaired.fastq /mnt/data-assemblies/guia/pe_2_paired.fastq /mnt/data-assemblies/guia/pe_2_unpaired.fastq LEADING:9 TRAILING:9 MINLEN:50
So here I’m basically trimming the initial and final bases and telling Trimmomatic that it can eliminate reads shorter than 50 bp. Note that LEADING:9 and TRAILING:9 remove bases with a quality score below 9 from the start and end of each read; to cut a fixed number of bases regardless of quality, Trimmomatic offers HEADCROP and CROP instead. Options such as -threads and -phred33 go before the input files. For a complete list of the steps available in Trimmomatic, visit http://www.usadellab.org/cms/?page=trimmomatic.
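To make the effect of these steps concrete: as documented for Trimmomatic, LEADING and TRAILING remove bases whose quality is below the given threshold from the start and end of a read, and MINLEN drops reads that end up too short. The following is a minimal Python sketch of that logic under those assumptions, not Trimmomatic's actual implementation; the function name and the toy read are hypothetical.

```python
def trim_ends(seq, quals, leading=9, trailing=9, minlen=50):
    """Drop bases with quality below the thresholds from both ends of a read.

    Returns the trimmed (seq, quals), or None when the read ends up
    shorter than minlen.
    """
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1  # skip low-quality bases at the 5' end
    end = len(quals)
    while end > start and quals[end - 1] < trailing:
        end -= 1  # skip low-quality bases at the 3' end
    if end - start < minlen:
        return None  # read too short after trimming: discard it
    return seq[start:end], quals[start:end]


# Toy read with poor quality at both ends (Phred scores already decoded).
read = "ACGTACGTAC"
quals = [2, 5, 30, 35, 38, 40, 37, 30, 8, 3]

result = trim_ends(read, quals, minlen=4)
print(result)  # ('GTACGT', [30, 35, 38, 40, 37, 30])

# With the default minlen=50 this short toy read is discarded.
print(trim_ends(read, quals))  # None
```

Because the thresholds act on quality, a read with good bases at its ends keeps them untouched, which is the main difference from cutting a fixed number of positions.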
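We can also use the FASTX-Toolkit. Take a deeper look at these tools, and choose the one you like the most. To give you another example:
./fastx_trimmer -f 9 -l 90 -i /mnt/data-assemblies/guia/R1.fastq -o /mnt/data-assemblies/guia/new_R1.fastq
It does a similar job, but by position: it keeps bases 9 to 90 of every read, whatever their quality. If the tool complains about invalid quality scores, adding -Q33 (for Phred+33 data) usually solves it. Take a look at FastQC again using these new files. Enjoy your time learning it 🙂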
With PacBio it’s more complicated. PacBio generates longer reads, but with lower quality. We can still analyze them with FastQC, but then we need to consider other tools to process our reads; PacBio recommends some of them. For now, let’s use the reads without any change.
When we work with Illumina, we also come across mate-pair reads. Although we are not going to use them here, let us define single, paired-end and mate-pair reads.
According to Illumina, paired-end sequencing is a strategy for sequencing both ends of a DNA/RNA fragment (200–800 bp), one read in the forward direction and one in the reverse direction. It facilitates not only the detection of genomic rearrangements and repetitive sequence elements, but also the detection of gene fusions and novel transcripts. Single-read sequencing, as the name suggests, sequences only one end (Illumina Inc., 2018). However, it is not widely used anymore.
Mate-pair sequencing, on the other hand, generates long-insert paired-end reads, with inserts longer than 800 bp. This strategy relies on biotin, which is attached to the fragments before they are circularized (figure 3). The circularized DNA is fragmented, enriched and ligated to adapters. Thus, the final fragment contains the two ends of the original longer fragment (ecSeq Bioinformatics, 2018).
Mate-pair reads can then link paired-end reads across great distances (figure 4), since the length of the original long fragment is known. This helps to reveal long repetitive regions and to resolve problems that arise when assembling paired-end reads.
So, as you noticed, we have reads from Illumina as well as reads from PacBio. Let us make our guide more interesting, though. We are going to assemble the PacBio reads using Canu, the Illumina reads using SPAdes and then the PacBio and Illumina reads together using SPAdes again. At the end we’ll be able to compare all three strategies. However, before going further, I’ll briefly present some assembly approaches.
After sequencing, genome assembly is needed, even when using Sanger platforms. Distinguishing between de novo and mapping approaches is very important for selecting the best assembly algorithm.
De novo genome assembly aims to reconstruct DNA or RNA molecules for which no previously sequenced reference genome exists (an NP-hard problem). The mapping/comparative (re-sequencing) approach, on the other hand, uses a sequenced genome from the same or a related species as a guide during assembly (alignment), which makes it much easier than de novo genome assembly (Pop 2009; Miller, Koren and Sutton, 2010).
Assembling genomes became a much harder problem with NGS, a challenge created by millions of short reads. Many algorithms and tools were developed to improve de novo genome assembly, in terms of both the quality of the assembled genomes and computational efficiency. The main algorithms are greedy, Overlap-Layout-Consensus (OLC), de Bruijn graph (DBG) and string graph, which are summarized below.
Greedy Assembly Algorithm
Like any other greedy algorithm, the greedy assembly algorithm always selects the best option at each operation, according to an ordering. In this case, a basic operation means: “given any read or contig, add one more read or contig. Each operation uses the next highest-scoring overlap to make the next join” (Miller, Koren and Sutton, 2010). Contigs, and later scaffolds, are thus made as large as possible.
The greedy approach was widely used for assembling Sanger data, in assemblers such as phrap, TIGR Assembler and CAP3. However, recent software platforms have used different greedy strategies (Pop 2009). Greedy algorithms may also operate on OLC and DBG graphs (Chen et al., 2017).
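The greedy idea fits in a few lines of Python. This is only a toy sketch under simplifying assumptions (error-free reads, one strand, exact suffix-prefix matches as the overlap score); real greedy assemblers such as phrap or CAP3 are far more elaborate. The function names and reads are invented for the example.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the highest-scoring overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: what remains are separate contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    return seqs


reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTGCAATC']
```

Each pass takes the best join available at that moment and never revisits it, which is exactly why repeats can mislead a greedy assembler.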
Contig derives from the word contiguous; it means a set of overlapping DNA fragments that together produce a consensus region of DNA (Staden, 1980).
Scaffold is a series of contigs separated by gaps of known length.
Overlap-Layout-Consensus (OLC) algorithm
As in the greedy algorithm, OLC starts from a list of the highest-scoring overlaps for each read (Staden, 1980). The list is used to create an overlap graph, in which each read corresponds to a node and each edge represents an overlap between the two reads it connects (Figure 1).
A layout step is responsible for identifying paths through the graph in order to generate genome fragments, or contigs. The ideal path would traverse each node in the graph exactly once, reconstructing the whole genome (Pop, 2009). Finding this path, known as a Hamiltonian path, is an NP-hard problem. The overlap step has time complexity O(n²) (Chen et al., 2017).
The consensus step is the final stage, in which reads overlapping the same genome positions are used to identify the correct bases, detect polymorphisms and generate sequence quality values (Li et al., 2004).
However, in the whole genome shotgun (WGS) technique, the ideal path does not exist. The strategy is to assemble contigs while trying to remove gaps and errors, resolve repeats in the genome and resolve forks. According to Pop (2009), a fork means “a read A that overlaps two other reads, B and C; however, B and C do not overlap each other. Such a situation often represents the boundary between a repeat and the genomic regions adjacent to the copies of this repeat throughout the genome; however, forks can also be caused by sequencing errors”.
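To see an overlap graph and a fork in practice, here is a small Python sketch. It assumes exact suffix-prefix matches on a single strand, compares all ordered pairs of reads (hence the quadratic cost), and uses three invented reads in which the first one overlaps two reads that do not overlap each other. All function names are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b stores the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):  # all ordered pairs: O(n^2) comparisons
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


def find_forks(graph):
    """Reads with two successors that do not overlap each other."""
    forks = []
    for read, successors in graph.items():
        for b, c in permutations(successors, 2):
            if c not in graph[b] and b not in graph[c]:
                forks.append(read)
                break  # report each fork read once
    return forks


reads = ["TTACGGA", "CGGATTT", "CGGACCC"]
graph = build_overlap_graph(reads)
print(graph)
# {'TTACGGA': {'CGGATTT': 4, 'CGGACCC': 4}, 'CGGATTT': {}, 'CGGACCC': {}}
print(find_forks(graph))  # ['TTACGGA']
```

The layout step would have to decide which branch to follow after TTACGGA, and without more information (longer reads, paired ends) the contig simply stops there.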
de Bruijn graph algorithm
The de Bruijn graph is another graph approach, widely used for short reads, that implements a k-mer strategy (Idury & Waterman, 1995). In a de Bruijn graph, k-mers are nodes, and exact overlaps of length k – 1 between two adjacent nodes are edges (Pop, 2009); each repeat appears only once in the graph, with links to its different start and end positions (Zerbino & Birney, 2008).
Here, a path through the graph is found with an Eulerian path algorithm (O(n)), which traverses every edge in the graph exactly once. There are several efficient algorithms for finding an Eulerian path, although a graph may contain an exponential number of Eulerian paths (Pop, 2009). In addition, finding a Hamiltonian path may be reduced to finding an Eulerian path in a (k-1)-mer DBG (Chen et al., 2017).
A drawback of the k-mer approach is the loss of information, namely the “long-range connectivity information implied by each read” (Pop, 2009). To incorporate read information, Pevzner and colleagues (2001) created a variation of the Eulerian path problem, called the Eulerian superpath problem. This superpath is built from the sub-paths that correspond to the given reads.
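A toy version of the de Bruijn approach can be sketched in Python. In this sketch each distinct k-mer is an edge joining its two (k-1)-mers, which is the formulation in which an Eulerian path spells the sequence; that is an assumption of the example and differs by one in k from the node-centric wording above. The reads are error-free, single-stranded and repeat-free, and the path is found with Hierholzer's algorithm. Function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Edge-centric graph: each distinct k-mer links its prefix and suffix (k-1)-mers."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if out_deg[u] - in_deg[u] == 1), next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers into the sequence it spells."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGC", "GGCGT", "CGTGC", "TGCAA"]
graph = de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)  # ATGGCGTGCAA
```

With repeats longer than k-1 the graph gains branches and more than one Eulerian path becomes possible, which is where the ambiguity discussed above comes from.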
String graph algorithm
The string graph approach was first presented explicitly in the Euler algorithms. Myers (2005), however, introduced this graph as a new concept that does not rely on k-mers, in order to obtain a more efficient algorithm (O(n)) that scales to mammalian genomes.
The String Graph Assembler (SGA) combines a compressed data structure, the Ferragina-Manzini (FM) index, with a collection of assembly algorithms (Simpson & Durbin, 2010; Chen et al., 2017). The graph is built from pairwise overlaps between reads, with transitive edges removed. Just as in a de Bruijn graph, repeats are collapsed into a single unit, but without the need to generate k-mers from the reads. The reads are error-corrected before assembly; an FM-index is then built to compute the string graph, from which the contigs are assembled (Chen et al., 2017).
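Removing transitive edges is the step that turns an overlap graph into a string graph. The Python sketch below shows the idea in its most naive form: if read r1 overlaps r2 and r3, and r2 also overlaps r3, the edge r1 -> r3 carries no extra information and is dropped. This is an illustration only, not the linear-time procedure of Myers (2005) nor SGA's FM-index based implementation; the read names are placeholders.

```python
def remove_transitive_edges(graph):
    """Drop a -> c whenever some b gives a -> b and b -> c (naive version)."""
    reduced = {node: set(successors) for node, successors in graph.items()}
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                if c in successors:
                    reduced[a].discard(c)  # a -> c is implied by a -> b -> c
    return reduced


# Overlap graph as adjacency sets: r1 overlaps r2 and r3, r2 overlaps r3.
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
string_graph = remove_transitive_edges(overlaps)
print(string_graph)  # {'r1': {'r2'}, 'r2': {'r3'}, 'r3': set()}
```

After the reduction the three reads form a simple chain, which can be collapsed into a single contig.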
Let’s use Canu to assemble the PacBio data. First, rename your files to something simpler. Here, we keep only the accession number. After installing Canu, type:
./canu -p Pfermentans -d /mnt/data-assemblies/bacteria/Pfermentans genomeSize=5.03362m -pacbio-raw SRR1818128.fastq SRR1816870.fastq
As you see, -p sets the prefix used to name the output files (here, the name of the organism), -d the directory where you want to store the results, genomeSize the estimated length of the genome, and -pacbio-raw the raw PacBio reads.
And now SPAdes for the Illumina reads, and for PacBio + Illumina, respectively. SPAdes automatically picks the k lengths to use, according to some heuristics. You’ll see it’s very simple. To know more, type “./spades.py -h”.
./spades.py -1 SRR572209_1.fastq -2 SRR572209_2.fastq -o /dir_out
./spades.py -1 SRR572209_1.fastq -2 SRR572209_2.fastq --pacbio SRR1818128.fastq --pacbio SRR1816870.fastq -o /dir_out
So, the assembly process looks simpler than it is. Actually, the big problem comes next.
The Genomics field is still somewhat recent and presents a large number of practical issues to tackle. Assembly quality is one of these urgent issues that have emerged, particularly in personalized medicine scenarios. At this point, many metrics and strategies have been proposed in order to evaluate an assembly.
Taking into account the proposed evaluation metrics (summarized in the figure below), we now face the problem of choosing those that give us a better understanding of assembly quality. In many contexts, this choice will directly influence which assemblies are selected and, consequently, how the genome assembly can be applied.
So you will probably spend the majority of your time looking for a good-quality assembly. You will need to come back to the assembly step many times, trying new parameters and assemblers. It is certainly much easier when you already have good reads, so also consider generating the best reads possible.
Here we are going to use QUAST to generate many quality metrics. Download QUAST or use it on the web: http://quast.bioinf.spbau.ru/. If you use the command line:
./quast.py --gene-finding --rna-finding -R GCF_000271665.2_ASM27166v2_genomic.fna.gz --est-ref-size 5033620 --pacbio SRR1816870.fastq --pacbio SRR1818128.fastq Pfermentans.unitigs.fasta --threads 8 -o /quast
You’ll do the same for the three assemblies. The command above generates quality metrics for Canu’s assembly; we used the unitigs file. --gene-finding and --rna-finding estimate the number of genes and RNAs. -R indicates a reference genome; here we used the one from ENA, GCF_000271665. If you do not have a reference genome, QUAST will generate only the metrics that do not depend on one. --est-ref-size gives the estimated length of the reference genome. And if you give QUAST the raw reads, it will return more quality measures.
Now take some time to interpret the metrics from the three assemblies.
Considering a list of contigs sorted by decreasing length, Nx (e.g. N50, N90) is the length of the shortest contig in the smallest set of longest contigs whose lengths add up to at least x% of the total assembly length. NGx is computed in the same way, but against the length of the original genome rather than the total assembly length. NAx does the same job as Nx but on a list of aligned contigs, in which contigs containing misassemblies are broken into new contigs at the misassembly points [Gurevich et al. 2013].
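These definitions are easy to check with a few lines of Python. The sketch below is not how QUAST computes its report; it is a minimal illustration with made-up contig lengths. The helper name nx is hypothetical, and it also returns Lx, the number of contigs needed to reach the threshold (L50 when x is 50).

```python
def nx(lengths, x=50, genome_size=None):
    """Return (Nx, Lx). With genome_size set, the result is (NGx, LGx) instead."""
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * x / 100
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None  # the assembly never reaches x% of the genome size


# Made-up contig lengths; the assembly totals 300 bases.
contigs = [80, 70, 50, 40, 30, 20, 10]

n50, l50 = nx(contigs)
print(n50, l50)  # 70 2

n90, l90 = nx(contigs, x=90)
print(n90, l90)  # 30 5

# NG50 against an assumed genome size of 400 bases.
ng50, lg50 = nx(contigs, genome_size=400)
print(ng50, lg50)  # 50 3
```

Note how NG50 is lower than N50 here: the assembly covers only part of the assumed genome, so more and shorter contigs are needed to reach half of it.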
In terms of number of contigs, the best assembly is the one made with Canu, with 4 contigs, and the worst is the one from SPAdes using only Illumina. Canu also shows the best N50 and L50. However, when we look at Reference Mapped, for example, Canu shows 69.91%, while SPAdes has more than 99%. You will probably agree with me that the assembly using both PacBio and Illumina in SPAdes looks like the best option here.
Did you enjoy it?
There’s a plethora of tools available on the internet to improve your assembly. You should consider looking around and applying some of them. I would like to mention GapFiller, CISA and KmerGenie here.
GapFiller, as its name suggests, is a tool that seeks to close the gaps inside the assembly, following a few strategies. CISA, like many others, generates hybrid assemblies by combining the outputs of several distinct assemblers. And KmerGenie (Chikhi & Medvedev, 2013) suggests the best k value for your data.
There’s so much more to learn and apply, but it’s not the focus here. As a beginner, you should first learn the basics.
It’s always challenging to speak about the future. Maybe everything we have talked about here will become unnecessary in the years to come, given a technology capable of sequencing a whole DNA molecule at once. What we do know is that, for the moment, we need to assemble DNA reads.
However, we can extrapolate from what we know about the third sequencing generation. Nanopore, for example, sequencing giant reads in the middle of nowhere, is a great hint of what we can expect.
We are moving towards a more inclusive, cheaper and quicker science. Our job now is to produce good-quality assemblies in order to better understand and manage the molecule of life.
CHEN, Qingfeng et al. Recent advances in sequence assembly: principles and applications. Briefings In Functional Genomics, [s.l.], v. 16, n. 6, p.361-378, 26 abr. 2017. Oxford University Press (OUP). http://dx.doi.org/10.1093/bfgp/elx006.
CHIKHI, R.; MEDVEDEV, P.. Informed and automated k-mer size selection for genome assembly. Bioinformatics, [s.l.], v. 30, n. 1, p.31-37, 3 jun. 2013. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btt310.
CLARKE, James et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotechnology, [s.l.], v. 4, n. 4, p.265-270, 22 fev. 2009. Springer Nature. http://dx.doi.org/10.1038/nnano.2009.12.
COMMINS, Jennifer; TOFT, Christina; FARES, Mario A.. Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects. Biological Procedures Online, [s.l.], v. 11, n. 1, p.52-78, 11 mar. 2009. Springer Nature. http://dx.doi.org/10.1007/s12575-009-9004-1.
CONSORTIUM, International Human Genome Sequencing. Finishing the euchromatic sequence of the human genome. Nature, [s.l.], v. 431, n. 7011, p.931-945, 21 out. 2004. Springer Nature. http://dx.doi.org/10.1038/nature03001.
EISENSTEIN, Michael. Oxford Nanopore announcement sets sequencing sector abuzz. Nature Biotechnology, [s.l.], v. 30, n. 4, p.295-296, abr. 2012. Springer Nature. http://dx.doi.org/10.1038/nbt0412-295.
ECSEQ BIOINFORMATICS. What is mate pair sequencing for? Available at: . Accessed on: 9 Mar. 2018.
FEDURCO, M. et al. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Research, [s.l.], v. 34, n. 3, 6 fev. 2006. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/gnj023.
GREENLEAF, William J; SIDOW, Arend. The future of sequencing: convergence of intelligent design and market Darwinism. Genome Biology, [s.l.], v. 15, n. 3, p.303-310, 2014. Springer Nature. http://dx.doi.org/10.1186/gb4168.
HAQUE, Farzin et al. Solid-state and biological nanopore for real-time sensing of single chemical and sequencing of DNA. Nano Today, [s.l.], v. 8, n. 1, p.56-74, fev. 2013. Elsevier BV. http://dx.doi.org/10.1016/j.nantod.2012.12.008.
HARRIS, T. D. et al. Single-Molecule DNA Sequencing of a Viral Genome. Science, [s.l.], v. 320, n. 5872, p.106-109, 4 abr. 2008. American Association for the Advancement of Science (AAAS). http://dx.doi.org/10.1126/science.115042
HEATHER, James M.; CHAIN, Benjamin. The sequence of sequencers: The history of sequencing DNA. Genomics, [s.l.], v. 107, n. 1, p.1-8, jan. 2016. Elsevier BV. http://dx.doi.org/10.1016/j.ygeno.2015.11.003.
IDURY, Ramana M.; WATERMAN, Michael S.. A New Algorithm for DNA Sequence Assembly. Journal Of Computational Biology, [s.l.], v. 2, n. 2, p.291-306, jan. 1995. Mary Ann Liebert Inc. http://dx.doi.org/10.1089/cmb.1995.2.291.
ILLUMINA INC. Advantages of paired-end and single-read sequencing: Understand the key differences between these sequencing read types. Available at: . Accessed on: 9 Mar. 2018.
JACKSON, D. A.; SYMONS, R. H.; BERG, P. Biochemical method for inserting new genetic information into DNA of Simian Virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. PMC, USA, v. 10, n. 69, p.2904-2909, oct. 1972.
KARLSSON, E. et al. Scaffolding of a bacterial genome using MinION nanopore sequencing. Scientific Reports, [s.l.], v. 5, n. 1, 7 july 2015. Springer Nature. http://dx.doi.org/10.1038/srep11996.
LI, M. et al. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Research, [s.l.], v. 32, n. 17, p.5183-5191, 23 set. 2004. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/gkh850.
LOMAN, Nicholas J et al. Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, [s.l.], v. 30, n. 5, p.434-439, 22 abr. 2012. Springer Nature. http://dx.doi.org/10.1038/nbt.2198.
LOMAN, N. J.; QUINLAN, A. R.. Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics, [s.l.], v. 30, n. 23, p.3399-3401, 20 ago. 2014. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/btu555.
MADOUI, Mohammed-amin et al. Genome assembly using Nanopore-guided long and error-free DNA reads. Bmc Genomics, [s.l.], v. 16, n. 1, 20 april 2015. Springer Nature. http://dx.doi.org/10.1186/s12864-015-1519-z.
MCKERNAN, K. J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research, [s.l.], v. 19, n. 9, p.1527-1541, 22 jun. 2009. Cold Spring Harbor Laboratory. http://dx.doi.org/10.1101/gr.091868.109.
MILLER, Jason R.; KOREN, Sergey; SUTTON, Granger. Assembly algorithms for next-generation sequencing data. Genomics, [s.l.], v. 95, n. 6, p.315-327, jun. 2010. Elsevier BV. http://dx.doi.org/10.1016/j.ygeno.2010.03.001.
MYERS, E. W.. The fragment assembly string graph. Bioinformatics, [s.l.], v. 21, n. 2, p.79-85, 1 set. 2005. Oxford University Press (OUP). http://dx.doi.org/10.1093/bioinformatics/bti1114.
NOWOSHILOW, Sergej et al. The axolotl genome and the evolution of key tissue formation regulators. Nature, [s.l.], v. 554, n. 7690, p.50-55, 24 jan. 2018. Springer Nature. http://dx.doi.org/10.1038/nature25458.
NYRÉN, Pål; LUNDIN, Arne. Enzymatic method for continuous monitoring of inorganic pyrophosphate synthesis. Analytical Biochemistry, [s.l.], v. 151, n. 2, p.504-509, dez. 1985. Elsevier BV. http://dx.doi.org/10.1016/0003-2697(85)90211-8.
PENG, Yu et al. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. Lecture Notes In Computer Science, [s.l.], p.426-440, 2010. Springer Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-12683-3_28.
PEVZNER, P. A.; TANG, H.; WATERMAN, M. S.. An Eulerian path approach to DNA fragment assembly. Proceedings Of The National Academy Of Sciences, [s.l.], v. 98, n. 17, p.9748-9753, 14 ago. 2001. Proceedings of the National Academy of Sciences. http://dx.doi.org/10.1073/pnas.171285098.
POP, M.. Genome assembly reborn: recent computational challenges. Briefings In Bioinformatics, [s.l.], v. 10, n. 4, p.354-366, 29 maio 2009. Oxford University Press (OUP). http://dx.doi.org/10.1093/bib/bbp026.
QUAIL, Michael et al. A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers. Bmc Genomics, [s.l.], v. 13, n. 1, p.341-354, 2012. Springer Nature. http://dx.doi.org/10.1186/1471-2164-13-341.
RONAGHI, M. et al. DNA SEQUENCING: A Sequencing Method Based on Real-Time Pyrophosphate. Science, [s.l.], v. 281, n. 5375, p.363-365, 17 jul. 1998. American Association for the Advancement of Science (AAAS). http://dx.doi.org/10.1126/science.281.5375.363.
ROTHBERG, Jonathan M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature, [s.l.], v. 475, n. 7356, p.348-352, jul. 2011. Springer Nature. http://dx.doi.org/10.1038/nature1024
SAIKI, R. K. et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, v. 239, n. 4839, p.487-491, jan. 1988.
SANGER, F.; COULSON, A.r.. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal Of Molecular Biology, [s.l.], v. 94, n. 3, p.441-448, maio 1975. Elsevier BV. http://dx.doi.org/10.1016/0022-2836(75)90213-2.
SANGER, F.; NICKLEN, S.; COULSON, A. R. DNA sequencing with chain-terminating inhibitors: (DNA polymerase/nucleotide sequences/bacteriophage 4X174). Proc. Natl. Acad. Sci.: Biochemistry, USA, v. 74, n. 12, p.5463-5467, dez. 1977.
SANGER, F. F.; et al. (1977). “Nucleotide sequence of bacteriophage φX174 DNA”. Nature. 265 (5596): 687–695. doi:10.1038/265687a0. PMID 870828.
SCHADT, E. E.; TURNER, S.; KASARSKIS, A.. A window into third-generation sequencing. Human Molecular Genetics, [s.l.], v. 19, n. 2, p.227-240, 21 set. 2010. Oxford University Press (OUP). http://dx.doi.org/10.1093/hmg/ddq416.
SHENDURE, Jay; JI, Hanlee. Next-generation DNA sequencing. Nature Biotechnology, [s.l.], v. 26, n. 10, p.1135-1145, out. 2008. Springer Nature. http://dx.doi.org/10.1038/nbt1486.
STADEN, R.. A new computer method for the storage and manipulation of DNA gel reading data. Nucleic Acids Research, [s.l.], v. 8, n. 16, p.3673-3694, 1980. Oxford University Press (OUP). http://dx.doi.org/10.1093/nar/8.16.3673.
SNUSTAD, D. Peter; SIMMONS, Michael J.. Fundamentos de Genética. 4. ed. Rio de Janeiro: Guanabara Koogan, 2012. 903 p. Translated by Paulo A. Motta.
TREANGEN, Todd J.; SALZBERG, Steven L.. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, [s.l.], v. 13, n. 1, p.36-46, 29 nov. 2011. Springer Nature. http://dx.doi.org/10.1038/nrg3117
VAN DIJK, Erwin L. et al. Ten years of next-generation sequencing technology. Trends In Genetics, [s.l.], v. 30, n. 9, p.418-426, set. 2014. Elsevier BV. http://dx.doi.org/10.1016/j.tig.2014.07.001.
ZERBINO, D. R.; BIRNEY, E.. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, [s.l.], v. 18, n. 5, p.821-829, 21 fev. 2008. Cold Spring Harbor Laboratory. http://dx.doi.org/10.1101/gr.074492.107.
Copyright © 2020 Guilherme Neumann.