Common Biological Databases and Data Formats

Outline
- Background on biological databases
- Common data formats: FASTA, FASTQ, GFF, GenBank
- Common sequence databases: the US National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), DDBJ
- Common gene function databases: Gene Ontology (GO), the Kyoto Encyclopedia of Genes and Genomes (KEGG), the InterPro protein function database
- Common genome databases: the UCSC Genome Browser, the Ensembl genome annotation database

There is a great deal of data, in many formats and spread across many databases. How do we find the database we need? One starting point is the up-to-date list of biological databases published by Nucleic Acids Research.

Common data formats
- FASTA format
- FASTQ format
- GenBank format
- EMBL format
- GFF format

FASTA format
- Each record starts with a description line.
- The ">" character is the separator that introduces the description line.
- Sequence lines are typically 50-100 characters long.
- There is no standard file extension.

FASTQ sequence format
- Similar to the FASTA format.
- Each sequence record normally occupies four lines.
- The sequence and its quality values each take one line (the other two are the "@" header line and the "+" separator line).
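The FASTA and FASTQ layouts described above can be read with a few lines of code. The following Python sketch is only an illustration: the function names `parse_fasta` and `parse_fastq` and the sample records are made up for this example, and it assumes well-formed input with strictly four lines per FASTQ record, which real files do not always guarantee.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_fastq(text):
    """Return (name, sequence, quality) triples, assuming four lines per record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        name, seq, plus, qual = lines[i:i + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {name}")
        records.append((name[1:], seq, qual))
    return records


# Hypothetical sample data, not taken from any database.
fasta_text = """>seq1 example sequence
GATCCTCCAT
ATACAACGGT
>seq2
ACGT
"""

fastq_text = """@read1
GATC
+
IIII
"""

fasta_records = parse_fasta(fasta_text)
fastq_records = parse_fastq(fastq_text)
print(fasta_records)
print(fastq_records)
```

Joining the wrapped sequence lines in `parse_fasta` reflects the point that FASTA line length is only a convention; the sequence itself is the concatenation of all lines under one description line.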
GenBank File Format

LOCUS       SCU49845     5028 bp    DNA     linear   PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
...
FEATURES             Location/Qualifiers
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            687..3158
                     /gene="AXL2"
...
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
...
     4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
//

A GBFF (GenBank flat file) record has three parts:
- The header contains information (descriptors) about the record as a whole.
- The second part contains the features that annotate the record.
- The third part is the nucleotide sequence itself.
Every record in the sequence databases ends with "//" on its last line.
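The three-part layout can be made concrete with a small splitter. This is a minimal Python sketch, not an official parser: the function name `split_gbff` is hypothetical, the sample record is a shortened stand-in for U49845, and the code assumes that the features part starts at the `FEATURES` line, the sequence part at the `ORIGIN` line, and the record ends at `//`.

```python
def split_gbff(record):
    """Split one GenBank flat-file record into header, features, and sequence."""
    header, features, seq_lines = [], [], []
    section = "header"
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself carries no bases
        elif line.startswith("//"):
            break  # end-of-record terminator
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            seq_lines.append(line)
    # Sequence lines mix position numbers, spaces, and bases; keep only letters.
    sequence = "".join(ch for ln in seq_lines for ch in ln if ch.isalpha())
    return header, features, sequence


# Shortened, illustrative record (most lines of the real entry are omitted).
record = """LOCUS       SCU49845     5028 bp    DNA     linear   PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
ORIGIN
        1 gatcctccat atacaacggt
//
"""

header, features, sequence = split_gbff(record)
print(len(header), len(features), sequence)
```

For real work an established library is usually the safer choice; the sketch only shows where the boundaries between the three parts lie.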
GBFF format explained

GBFF: LOCUS
- Every GBFF record begins with the LOCUS line.
- The first item is the LOCUS name (SCU49845). Its only remaining role is to be unique within the database; it no longer carries any real meaning. In most cases the accession number is simply reused to satisfy the requirement for a LOCUS name.
- The second item is the sequence length (5028 bp). A single database record was historically limited to 350 kb. Apart from legacy entries, GenBank now rarely accepts sequences shorter than 50 bp.
- The third item is the molecule type (DNA); the sequence in a record must be of a single molecule type.
- The GenBank division code (PLN), which follows the topology (linear) in this record, consists of three letters. Its role is now limited to a coarse partitioning of the database for download.
- The last item is the date of last modification (21-JUN-1999). Sometimes it is simply the date the data were first made public.

GenBank division codes

Description                                    Code
Primate sequences                              PRI
Rodent sequences                               ROD
Other mammalian sequences                      MAM
Other vertebrate sequences                     VRT
Invertebrate sequences                         INV
Plant, fungal, and algal sequences             PLN
Bacterial sequences                            BCT
Viral sequences                                VRL
Bacteriophage sequences                        PHG
Synthetic sequences                            SYN
Unannotated sequences                          UNA
Expressed sequence tags                        EST
Patent sequences                               PAT
Sequence-tagged sites                          STS
Genome survey sequences                        GSS
High-throughput genomic sequences              HTG
Unfinished high-throughput cDNA sequences      HTC
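The fields of the LOCUS line and the division codes listed above can be pulled apart programmatically. The Python sketch below is an assumption-laden simplification: `parse_locus` and `DIVISIONS` are names invented for this example, and the line is split on whitespace, whereas real GenBank flat files define the LOCUS line by column positions.

```python
# Division codes as listed in the table above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "expressed sequence tags",
    "PAT": "patent sequences",
    "STS": "sequence-tagged sites",
    "GSS": "genome survey sequences",
    "HTG": "high-throughput genomic sequences",
    "HTC": "unfinished high-throughput cDNA sequences",
}


def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-based simplification)."""
    fields = line.split()
    if len(fields) < 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    rest = fields[4:]  # molecule type, optional topology, division, date
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": rest[0],
        "topology": rest[1] if len(rest) == 4 else None,
        "division": rest[-2],
        "division_name": DIVISIONS.get(rest[-2], "unknown division"),
        "date": rest[-1],
    }


locus = parse_locus(
    "LOCUS       SCU49845     5028 bp    DNA     linear   PLN 21-JUN-1999"
)
print(locus)
```

Reading the division and date from the end of the line lets the same function handle LOCUS lines with or without a topology field.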
GBFF: DEFINITION
The line after the LOCUS line is the DEFINITION line. It summarizes the biological meaning of the GenBank record, including the source organism and the name of the gene or protein. For a non-coding region it gives a brief description of the sequence's function; for a coding region it states whether the sequence is partial (partial cds) or complete (complete cds).
GBFF: ACCESSION
- The accession number is the unique identifier of a sequence record. It usually consists of one letter followed by five digits (for example U12345) or two letters followed by six digits (for example AF123456). It is unique within the database and never changes.
- The ACCESSION line sometimes lists several accession numbers, for example because the submitter deposited a new record related to the original one, or because a newly submitted record replaced an older one. The first accession number is called the primary accession number; the others are collectively called secondary accession numbers.
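The two accession patterns and the primary/secondary distinction described above translate directly into a short check. This Python sketch covers only those two patterns; newer accession formats exist and are deliberately not handled. The names `ACCESSION_RE`, `is_classic_accession`, and `split_accessions`, as well as the sample ACCESSION line with a secondary number, are hypothetical.

```python
import re

# One letter + five digits, or two letters + six digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def is_classic_accession(token):
    """True if token matches one of the two patterns described in the text."""
    return bool(ACCESSION_RE.match(token))


def split_accessions(line):
    """Return (primary, [secondary, ...]) from an ACCESSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ACCESSION":
        raise ValueError("not an ACCESSION line")
    return fields[1], fields[2:]


# Illustrative line: the secondary accession here is invented for the example.
primary, secondary = split_accessions("ACCESSION   U49845 AF123456")
print(primary, secondary)
print(is_classic_accession("U49845"), is_classic_accession("U49845.1"))
```

The second `print` shows that a versioned identifier such as `U49845.1` is not itself an accession number; the version suffix belongs to the VERSION line.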
GBFF: VERSION
The VERSION line gives the version identifier, in the form accession.version (here U49845.1). The version identifier designates one single, specific nucleotide sequence in the database. Whenever the sequence data change, even by a single base, the version number is incremented while the accession number stays the same.
The version system runs in parallel with the GI (GenInfo Identifier) system that follows it on the same line; that is, when a sequence changes, it is assigned a new