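Collect background information on the species, such as genome size and heterozygosity.
Knowing the genome size matters because it is the yardstick for judging whether the size of the later assembly is right. If the genome is too large (>10 Gb), it exceeds the memory that current de novo assembly software can work with on available machines, so in practical terms the assembly cannot be done.
The genome size of most species can be looked up in the database at http://www.genomesize.com/. If the species is not listed there, consider measuring the genome size experimentally (flow cytometry).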
1.1.1 An example of estimating genome size by flow cytometry:
Yoshida, S., J. K. Ishida, et al. (2010). "A full-length enriched cDNA library and expressed sequence tag analysis of the parasitic weed, Striga hermonthica." BMC Plant Biol 10: 55.
1.1.2 A description of estimating genome size by Feulgen staining:
This book is a classic and highly recommended: Gregory, T. (2005). The evolution of the genome, Academic Press.
1.1.3 Examples of estimating genome size by quantitative PCR:
Wilhelm, J., A. Pingoud, et al. (2003). "Real-time PCR-based method for the estimation of genome sizes." Nucleic Acids Res 31(10): e56.
Jeyaprakash, A. and M. A. Hoy (2009). "The nuclear genome of the phytoseiid Metaseiulus occidentalis (Acari: Phytoseiidae) is among the smallest known in arthropods." Exp Appl Acarol 47(4): 263-273.
1.1.4 An example of estimating genome size from k-mers:
Kim, E. B., X. Fang, et al. (2011). "Genome sequencing reveals insights into physiology and longevity of the naked mole rat." Nature 479(7372): 223-227.
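The k-mer approach can be illustrated with a minimal sketch. It assumes the common rule of thumb that genome size is roughly the total number of k-mers divided by the depth of the main peak in the k-mer depth histogram. The function name, the toy histogram and the `min_depth` error cutoff are all hypothetical choices for illustration, not the procedure used in the paper above.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption; choose it from the valley in your own data).
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Depth of the main peak: the depth at which most distinct k-mers occur.
    peak_depth = max(kept, key=kept.get)
    # Total number of k-mer occurrences in the retained part of the histogram.
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram: a large error spike at depth 1-2 and a main peak at depth 20.
toy = {1: 1_000_000, 2: 50_000, 19: 90, 20: 100, 21: 90}
print(estimate_genome_size(toy))  # 280.0
```

In practice the histogram would come from a k-mer counter, and repeats and heterozygosity distort the peak, so the result is only a rough guide.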
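Heterozygosity affects genome assembly mainly because the two homologous chromosomes cannot be merged: in highly heterozygous regions both haplotypes get assembled separately, so the assembled genome comes out larger than the true genome size.
Heterozygosity is usually checked by testing SSR polymorphism in the progeny of the sequenced parent. If heterozygosity exceeds 0.5%, assembly is considered to be somewhat difficult; above 1%, the genome is very hard to assemble.
Heterozygosity is usually estimated by k-mer analysis; here is an example: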
http://www.nature.com/nature/journal/vaop/ncurrent/full/nature11413.html
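As a rough illustration of the idea behind k-mer based heterozygosity estimates, here is a sketch under a deliberately simplified model, which is an assumption of this example and not the method of the paper linked above. In a diploid genome, a k-mer that overlaps a heterozygous site appears as two distinct k-mers at half depth, while a k-mer with no heterozygous site appears once at full depth. If p is the fraction of genome positions whose k-mer is heterozygous, then p = 1 - (1 - h)^k, which can be solved for the per-base rate h. The function name and inputs are hypothetical; repeats and sequencing errors are ignored.

```python
def estimate_heterozygosity(n_het_distinct, n_hom_distinct, k):
    """Per-base heterozygosity under a simplified diploid k-mer model.

    n_het_distinct: distinct k-mers in the half-depth (heterozygous) peak
    n_hom_distinct: distinct k-mers in the full-depth (homozygous) peak
    k: k-mer length
    """
    if k < 1 or n_het_distinct < 0 or n_hom_distinct < 0:
        raise ValueError("k must be >= 1 and counts must be non-negative")
    # Each heterozygous position contributes two distinct k-mers (one per haplotype).
    het_positions = n_het_distinct / 2
    total_positions = het_positions + n_hom_distinct
    if total_positions == 0:
        raise ValueError("both k-mer counts are zero")
    p = het_positions / total_positions
    # Invert p = 1 - (1 - h)^k.
    return 1 - (1 - p) ** (1 / k)


# Round trip on synthetic numbers: haploid size 1 Mb, h = 1%, k = 17.
G, h, k = 1_000_000, 0.01, 17
p = 1 - (1 - h) ** k
print(estimate_heterozygosity(2 * G * p, G * (1 - p), k))
```

Real histograms need the two peaks to be separated first, which is the hard part and is left out here.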
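Heterozygosity can be reduced by many generations of inbreeding.
High heterozygosity does not mean that nothing can be assembled; it means that the assembled sequence is not suitable for downstream biological analysis, such as copy number or complete gene structures.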
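The rules of thumb in this step (genome size above 10 Gb, heterozygosity above 0.5% or 1%) can be collected into a small pre-sequencing check. This is only a sketch: the thresholds are the ones quoted in the text, while the function name, constant names and messages are hypothetical.

```python
# Thresholds quoted in the text above; treat them as rough guides, not hard limits.
MAX_GENOME_SIZE_GB = 10.0
HET_DIFFICULT = 0.005   # 0.5%
HET_VERY_HARD = 0.01    # 1%


def assess_project(genome_size_gb, heterozygosity):
    """Return a list of warnings for a planned de novo assembly project."""
    notes = []
    if genome_size_gb > MAX_GENOME_SIZE_GB:
        notes.append("genome larger than 10 Gb: likely beyond assembler memory limits")
    if heterozygosity > HET_VERY_HARD:
        notes.append("heterozygosity above 1%: very hard to assemble")
    elif heterozygosity > HET_DIFFICULT:
        notes.append("heterozygosity above 0.5%: assembly will be somewhat difficult")
    return notes


# Example: a 1.2 Gb genome with 0.7% heterozygosity.
for note in assess_project(1.2, 0.007):
    print(note)
```

An empty list means that neither rule of thumb raises a concern; it does not guarantee an easy assembly.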
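As quality requirements for sequencing rise and the related techniques mature, a genetic map is becoming an almost mandatory part of a de novo genome project. For the concepts behind building a genetic map, see this book (The handbook of plant genome mapping: genetic and physical mapping).
This step is also very important.
Once the first step checks out, the species is one for which sequencing can be attempted. Obtaining samples is also a major problem for some species; for certain species, sampling is a challenge in itself.
The samples used for genome sequencing should preferably come from a single individual, which reduces the effect of between-individual heterozygosity on the assembly. Large-insert libraries do not have this requirement.
Sequencing usually uses a gradient of insert sizes: small inserts (200, 500, 800 bp) and large inserts (1 kb, 2 kb, 5 kb, 10 kb, 20 kb, 40 kb). For species with high heterozygosity and many repeats, a fosmid-by-fosmid or fosmid pooling strategy may be needed.
Needless to say, the latter is quite expensive.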
Li, Z., Y. Chen, et al. (2012). "Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph." Brief Funct Genomics 11(1): 25-37.
Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nat Rev Genet 13(1): 36-46.
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Schatz, M. C., J. Witkowski, et al. (2012). "Current challenges in de novo plant genome sequencing and assembly." Genome Biol 13(4): 243
Baker, M. (2012). "De novo genome assembly: what every biologist should know." Nat Methods 9(4): 333-337. (highly recommended)
Compeau, P. E., et al. (2011). "How to apply de Bruijn graphs to genome assembly." Nat Biotechnol 29(11): 987-991.
Birney, E. (2011). "Assemblies: the good, the bad, the ugly." Nat Methods 8(1): 59-60.
Schatz, M. C., et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Res 20(9): 1165-1173.
Kelley, D. R., M. C. Schatz, et al. (2010). "Quake: quality-aware detection and correction of sequencing errors." Genome Biol 11(11): R116.
Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Res 22(3): 557-567.
Zhang, W., et al. (2011). "A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies." PLoS One 6(3): e17915.
Narzisi, G. and B. Mishra (2011). "Comparing de novo genome assembly: the long and short of it." PLoS One 6(4): e19175.
Lin, Y., et al. (2011). "Comparative Studies of de novo Assembly Tools for Next-generation Sequencing Technologies." Bioinformatics.
Hayden, E. C. (2011). "Genome builders face the competition." Nature 471(7339): 425.
Finotello, F., et al. (2011). "Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data." Brief Bioinform.
Earl, D. A., et al. (2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods." Genome Res.
Schatz, M. C., et al. (2011). "Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies." Brief Bioinform.
Riba-Grognuz, O., et al. (2011). "Visualization and quality assessment of de novo genome assemblies." Bioinformatics.
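Personal view:
At present the mainstream software for de novo assembly of large genomes is still ALLPATHS-LG and SOAPdenovo.
ALLPATHS-LG's strengths: the best contiguity and the best accuracy, but it consumes a lot of memory and is not very easy to use.
SOAPdenovo's strengths: it is fast, its memory use is acceptable, and its contiguity is reasonable, but it makes relatively more errors.
Of course, these remarks do not hold in every case; the two may perform differently on different species and different data.
Among assemblers based on the overlap-layout approach, CABOG is the first choice; it is the prototype that was used to assemble the Drosophila genome. In addition, the soon-to-be-released MSR-CA also looks good: it is said to combine the strengths of all the software above and is coming on strong.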
Yandell, M. and D. Ence (2012). "A beginner's guide to eukaryotic genome annotation." Nat Rev Genet 13(5): 329-342.
Nielsen, C. B., M. Cantor, et al. (2010). "Visualizing genomes: techniques and challenges." Nat Methods 7(3 Suppl): S5-S15.
Yang, Z. and B. Rannala (2012). "Molecular phylogenetics: principles and practice." Nat Rev Genet 13(5): 303-314.
Colbourne, J. K., M. E. Pfrender, et al. (2011). "The ecoresponsive genome of Daphnia pulex." Science 331(6017): 555-561.
Kim, E. B., X. Fang, et al. (2011). "Genome sequencing reveals insights into physiology and longevity of the naked mole rat." Nature 479(7372): 223-227.
Grbic, M., T. Van Leeuwen, et al. (2011). "The genome of Tetranychus urticae reveals herbivorous pest adaptations." Nature 479(7374): 487-492.
The content above is reposted from: 測序中國 seq.cn (http://seq.cn/4607-48597)