Genome Assembly
2019.10.29

I. Genome survey

K-mer: a contiguous nucleic acid sequence of length K bp.

Suppose every K-mer in the genome is unique, so that a genome of size G yields about G different K-mers. When a read of length L is generated, the probability that a given K-mer is sequenced is (L-K+1)/G. Because L/G is very small and the number of reads n_r is very large, the number of times a given K-mer is sequenced approximately follows a Poisson distribution. So, with d_k the expected K-mer depth and n_k the total number of K-mers in all reads:

d_k = (L-K+1)/G * n_r
n_k = (L-K+1) * n_r
then G = n_k / d_k

Quality control and filtering

The following reads are filtered out:
- Reads in which N makes up more than 10% of the length.
- Reads from short insert-size libraries with more than 65% of bases at quality ≤ 7, and reads from large insert-size libraries with more than 80% of bases at quality ≤ 7.
- Paired-end reads whose read 1 and read 2 are both completely identical to those of another pair (and thus considered to be products of PCR duplication).

Error correction before assembly

II. SOAPdenovo algorithm

SOAPdenovo was developed to assemble large genomes such as the human genome; it also works well for small genomes such as those of bacteria. It includes five major steps:
1. De Bruijn graph construction
2. Graph simplification to obtain contigs
3. Paired-end read mapping to contigs
4. Scaffold construction
5. Gap filling with paired-end reads

Sequence assembly refers to aligning and merging fragments into a much longer DNA sequence in order to reconstruct the original sequence.

[Figure: overlaps build a contig (Ge + en + no + om + mi + ic + cs gives "Genomics"); paired-end links build a scaffold ("Genome*assembly")]

1. De Bruijn graph construction

Reads:
Read1: AGATCTTGTTATT
Read2: GTTATTGATCTCC

De Bruijn graph construction:
1. Slide along the reads to take K-mers, storing the links between neighboring K-mers.
2. If a K-mer already exists, merge its links with those of the first occurrence.

K-mers (K = 5) taken from the two reads:
AGATC, ATCTT, CTTGT, TTGTT, TGTTA, GTTAT, ATCTC, TCTCC, GATCT, TCTTG, TTATT, TATTG, TTGAT, ATTGA, TGATC

[Figure: De Bruijn graph of the K-mers above]

2. Graph simplification

Contigs: GATCTTGTTATTGATCT, GATCTCC, AGATCT
With the -R parameter set, the repeat is resolved using the reads:
Contig: AGATCTTGTTATTGATCTCC

[Figure: simplified graph with nodes AGATC, GATCT, ATCTTGTTATTGATC and ATCTCC, links numbered 1 to 4, and the paths of Read1 and Read2 through the repeat node GATCT]

3. Paired-end mapping to contigs

4. Construct scaffolds

Note:
1. For mate-pair libraries (≥ 2 kb), the order of the two reads is just the opposite.
2. A reliable link is built between two contigs when the number of supporting paired-end/mate-pair reads is larger than the threshold that has been set.
3. The gap size is estimated from the insert size of each read pair.

5. Gap closure

Get the reads located in the gap and then do a local assembly.
(1) Close the gap using paired-end information (one end mapped on the contig, the other end falling in the gap).
(2) Do a local assembly using the reads that fall in the gap to get a sequence that connects the edges of the two contigs.
Note: Gap closure here also extends the contigs.

[Figure: schematic overview]

III. Evaluation of assembly result

Length: contig (scaffold) N50 size, N90 size, total length, coverage ratio of the genome.
Accuracy: coverage of gene sequences, compared with EST or transcriptome sequences; comparison with a gold standard (such as BAC/fosmid sequences).

Evaluation of gene region coverage
Comparison with the gold standard
Comparative genomic analysis
Accuracy of gene structures
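The genome survey estimate G = n_k / d_k can be tried out on toy data. The sketch below is only an illustration under stated assumptions: the helper names (`tile_reads`, `count_kmers`, `estimate_genome_size`) are hypothetical, the genome is a random 5,000 bp string, the reads are error-free and evenly tiled rather than randomly sampled (so the depth distribution is much narrower than a real Poisson-like one), and d_k is taken as the peak of the K-mer depth histogram.

```python
import random
from collections import Counter


def tile_reads(genome, read_len, step):
    """Cut error-free reads of read_len bp, one every `step` bp (toy sampling)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def count_kmers(reads, k):
    """Count how many times each K-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(reads, k):
    """Return (G estimate, n_k, d_k) using G = n_k / d_k."""
    counts = count_kmers(reads, k)
    n_k = sum(counts.values())                  # total K-mers = (L-K+1) * n_r
    depth_hist = Counter(counts.values())       # depth -> number of distinct K-mers
    d_k = max(depth_hist, key=depth_hist.get)   # assumption: peak depth stands in for d_k
    return n_k / d_k, n_k, d_k


# Toy data (assumption): random genome, reads of L = 100 bp starting every 4 bp.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(5000))
reads = tile_reads(genome, read_len=100, step=4)
g_est, n_k, d_k = estimate_genome_size(reads, k=20)
print(f"n_k={n_k}, d_k={d_k}, estimated G={g_est:.0f} (true G={len(genome)})")
```

On real data, sequencing errors add many low-depth K-mers and repeats add high-depth ones, so the peak would normally be read off the main mode of the histogram rather than the global maximum.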
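The quality control rules from the genome survey section can be written as a small filter. This is a minimal sketch, not the pipeline actually used: the function names are hypothetical, reads are represented as (sequence, quality list) tuples, "low quality" is assumed to mean a base quality of 7 or less, a pair is assumed to be dropped when either of its reads fails, and duplicates are detected by exact sequence identity of both reads.

```python
def n_fraction(seq):
    """Fraction of bases in the read that are N."""
    return seq.upper().count("N") / len(seq)


def low_quality_fraction(quals, cutoff=7):
    """Fraction of bases whose quality is <= cutoff (assumed meaning of 'quality 7')."""
    return sum(q <= cutoff for q in quals) / len(quals)


def pair_passes(read1, read2, max_low_q):
    """Assumption: a pair is dropped if either read fails a rule."""
    for seq, quals in (read1, read2):
        if n_fraction(seq) > 0.10:
            return False
        if low_quality_fraction(quals) > max_low_q:
            return False
    return True


def filter_pairs(pairs, large_insert=False):
    """Apply the N, low-quality and PCR-duplicate filters to read pairs."""
    max_low_q = 0.80 if large_insert else 0.65   # thresholds quoted in the text
    seen = set()
    kept = []
    for read1, read2 in pairs:
        key = (read1[0], read2[0])
        if key in seen:                          # identical read 1 and read 2: duplicate
            continue
        seen.add(key)
        if pair_passes(read1, read2, max_low_q):
            kept.append((read1, read2))
    return kept


# Toy input (assumption): 10 bp reads with Phred-like integer qualities.
good = (("ACGTACGTAC", [30] * 10), ("TTGCATGCAA", [30] * 10))
n_rich = (("ACNNACGTAC", [30] * 10), ("TTGCATGCAT", [30] * 10))
low_q = (("ACGTACGTAA", [5] * 7 + [30] * 3), ("TTGCATGCAG", [30] * 10))
pairs = [good, good, n_rich, low_q]
print(len(filter_pairs(pairs)), len(filter_pairs(pairs, large_insert=True)))
```

The low-quality pair has 70% of bases at quality 5, so it is removed under the 65% short-insert threshold but kept under the 80% large-insert threshold.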
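The De Bruijn graph construction and simplification steps can be reproduced on the two example reads with K = 5. The code below is a simplified sketch and not SOAPdenovo's implementation: `build_de_bruijn` and `compact` are hypothetical names, K-mers are used as nodes, reverse complements are ignored, and simplification only merges non-branching paths (no tip removal, bubble merging, or read-based repeat resolution as with -R).

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Slide along each read, store K-mers as nodes and links between neighbors."""
    succ = defaultdict(set)   # K-mer -> following K-mers
    pred = defaultdict(set)   # K-mer -> preceding K-mers
    nodes = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)                   # an existing K-mer is reused, not duplicated
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)                    # sets merge links of repeated K-mers
            pred[b].add(a)
    return nodes, succ, pred


def compact(nodes, succ, pred):
    """Merge non-branching paths into longer sequences (isolated cycles are skipped)."""
    unitigs = []
    visited = set()
    for node in sorted(nodes):
        # Skip nodes that sit in the middle of a non-branching path.
        if len(pred[node]) == 1:
            p = next(iter(pred[node]))
            if len(succ[p]) == 1:
                continue
        seq, cur = node, node
        visited.add(cur)
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if len(pred[nxt]) != 1 or nxt in visited:
                break                          # branch point or loop: stop extending
            seq += nxt[-1]                     # neighbors overlap by K-1 bases
            cur = nxt
            visited.add(cur)
        unitigs.append(seq)
    return unitigs


reads = ["AGATCTTGTTATT", "GTTATTGATCTCC"]
nodes, succ, pred = build_de_bruijn(reads, k=5)
print(len(nodes), "distinct K-mers")
print(sorted(compact(nodes, succ, pred)))
```

The K-mer GATCT has two predecessors and two successors, so it stays a separate repeat node and the paths on either side of it are merged independently.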
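The N50 and N90 sizes used to evaluate assembly length can be computed as sketched below. The function name `nxx` and the contig lengths are made up for illustration; the definition assumed here is the common one, where Nxx is the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches xx% of the total assembly length.

```python
def nxx(lengths, percent):
    """Return the Nxx size (e.g. percent=50 for N50) of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues at the boundary.
        if running * 100 >= total * percent:
            return length


# Hypothetical contig lengths in bp; total length is 300.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print("total:", sum(contig_lengths))
print("N50:", nxx(contig_lengths, 50))
print("N90:", nxx(contig_lengths, 90))
```

The same function applies to scaffold lengths; only the input list changes.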