生物信息学全册配套最完整精品课件(英文版）.ppt_163文库

资源描述

1、自我介绍自我介绍 2000美国密苏里州立大学生物化学系博士美国密苏里州立大学生物化学系博士 20002003美国休斯敦美国休斯敦Lexicon制药公司高级生物信息学科学家制药公司高级生物信息学科学家 2004度中科院百人计划入学者，目前研究方向包括：度中科院百人计划入学者，目前研究方向包括： 1、精子发生、精子发生 2、干细胞自我更新与分化、干细胞自我更新与分化技术专长：分子生物学、干细胞、生物信息学技术专长：分子生物学、干细胞、生物信息学课程描述课程编号：课程编号：511012Y课程属性：课程属性：学科基础课学时学时/学分：学分：40/2 预修课程：预修课程：分子生物学、遗传学、统计

2、学、C语言教学目的和要求：教学目的和要求：生物信息学是利用数学模型和计算机程序对生物学研究中产生的数据进行分析计算并得出结论和产生新的科学假说的一种科研手段。通过本课程的教授，使得学生能够：懂得生物学中有哪些数学问题，数学模型和数学手段；利用数据库技术、计算机编程和网页工具来进行基本的生物信息学分析；掌握核酸和蛋白质序列分析的基本技能；懂得如何从芯片和其他高通量技术产生的数据来构建基因调控网络；本课程的开设要求学生有分子生物学、遗传学、统计学及C语言的基础知识和技能，更重要的是要求学生要努力培养自己利用数学模型和逻辑思维来思考和解决生物学问题。本课程为生物学各专业博士、硕

3、士研究生的学科基础课，同时也可作为数理、计算机等相关学科研究生的选修课。本课程的考核方式为大作业和期末考试，比例为50%：50%。教学大纲第一章第一章生物信息学入门生物信息学入门（9学时）学时） 1. 生物学中的数学问题生物学中的数学问题(computational problems in biology)（3学时，学时， 3月月2日）日）第二章第二章序列和结构序列和结构（15学时）学时） 1. 序列比对（序列比对（sequence alignment）（）（3学时，学时，3月月9日）日）第一章第一章生物信息学入门生物信息学入门（9学时）学时） 2. 数据库原理、数据库

4、原理、PHP编程入门（编程入门（3学时：学时：3学时上机，学时上机，3月月16日）日） 3. R语言和语言和Bioconductor软件包（软件包（3学时：学时：3学时上机，学时上机， 3月月23日）日）第二章第二章序列和结构序列和结构（15学时）学时） 2. 进化树进化树(phylogenetic trees)（1.5学时，学时，3月月30日）日） 3。模式发现（。模式发现（motif discovery）（）（1.5学时，学时，3月月30日）日） 4. RNA二级结构（二级结构（RNA secondary structure）(3学时，学时，4月月6日，王秀杰日，王秀杰) 5. 蛋白

5、质结构分析（蛋白质结构分析（protein structure analysis）（）（6学时，学时，4月月13日，蒋太交）日，蒋太交）第三章第三章从芯片数据到基因调控网络从芯片数据到基因调控网络（15） 3.1 生物芯片设计生物芯片设计(microarray design)（1学时学时, 4月月27日）日） 3.2 表达值计算表达值计算(summation of expression value)（1学时学时, 4月月27日）日） 3.3 归一化归一化(normalization)（1学时学时, 4月月27日）日） 3.4 差异基因的分析差异基因的分析(differential gen

6、e expression)（3学时学时, 5月月4日）日） 3.5 聚类分析聚类分析(clustering)（3学时学时, 5月月11日）日） 3.6 网络入门网络入门(introduction to networks)（3学时学时, 5月月18日）日） 3.7 贝叶斯网络等贝叶斯网络等(Basian networks and others)（3学时学时, 5月月25日）王秀杰）日）王秀杰）参考书教材：教材：本课程以科研文献阅读为主，没有特定教材。主要参考书：主要参考书： 1. 简明生物信息学钟扬，张亮，赵琼主编高等教育出版社 2001 2. 常用生物数据分析软件王俊，丛丽娟，

7、郑洪坤著科学出版社 2008 3. Bioinformatics: sequence and genome analysis David W. Mount New York : Cold Spring Harbor Laboratory, 2004 Features of my lectures Enlightening启发式 Interactive互动式 Interesting有趣的 English（半）英语的第一章生物信息学入门 1. 生物学中的数学问题 (computational problems in biology) Outlines 1. What is bioinform

8、atics? 2. Basic knowledges 3. Mathematical problems in biological researches: From Mendel to nowadays! Bioinformaticswhat is it? What is a triangle? What is human beings? Platos definition What is bioinformatics? Biologysubject Computer-tool MathematicsModel It is what you are doing for solving biol

9、ogical problems using a computer ! Bioinformatics in the Universe Universe (宇宙宇宙=空间空间+时间时间) Human civilization sciencesartsreligions Natural sciencesSocial sciences biologymathematicsphysics biostatisticsbioinformaticsComputational biology Non-human world What do you mean by biology? Taxonomy Physio

10、logy Evolution Cell biology Genetics Molecular biology-DNA, RNA, Protein How about computer? yes no PC, Server Quantum computer, DNA computer Internet TCP/IP Website Electronic business FTP P2P Telnet Hacker PC Apple Unix/Linux Chinese version C, Perl, PHP, JAVA, .NET compiler Database Spread sheet

11、And mathematics? Object（Subject）： Mathematics is the study of quantity （arithmetic，算术）, structure （algebra, 代数）, space （geometry，几何）, and change （calculus , 微积分）. Pure mathmatics vs Applied mathematics Goldbach Conjecture vs Statistics Definition, axiom, statement Reasoning (proof) theorem (truth, k

12、nowledge) How does does mathematics work? Outlines 1. What is bioinformatics? 2. Basic knowledges 3. Mathematical problems in biological researches: From Mendel to nowadays! Definitions, notions, terminology Sets A set is a group of objects. Elements/members A=7, 21, 57 7A,8 Objects, classes, intera

13、ctions Laws of Thought 1.Law of identity: Whatever is, is. 2.Law of noncontradiction: Nothing can both be and not be. 3.Law of excluded middle: Everything must either be or not be. Reasoning, Logic, Argument Reasoning is the cognitive process of looking for reasons, beliefs, conclusions, actions or

14、feelings. Logic is the study of reasoning. An argument is a set of one or more meaningful declarative sentences (or propositions) known as the premises along with another meaningful declarative sentence (or proposition) known as the conclusion. One approach to the study of reasoning is to identify v

15、arious forms of reasoning that may be used to support or justify conclusions. The main division between forms of reasoning that is made in philosophy is between deductive reasoning and inductive reasoning. Formal logic has been described as the science of deduction. The study of inductive reasoning

16、is generally carried out within the field known as informal logic or critical thinking. Deductive reasoning Premise 1: All humans are mortal. Premise 2: Socrates is a human. Conclusion: Socrates is mortal. Inductive reasoning Premise: The sun has risen in the east every morning up until now. Conclus

17、ion: The sun will also rise in the east tomorrow. Statistical inference Statistical inference is the process of making conclusions using data that is subject to random variation, for example, observational errors or sampling variation. Outlines 1. What is bioinformatics? 2. Basic knowledges 3. Mathe

18、matical problems in biological researches: From Mendel to nowadays! Biological Story 1 Medels Laws Medels Law of Segregation The First Law When any individual produces gametes, the copies of a gene separate so that each gamete receives only one copy. Binary phenotype Dominance Gametes Statistics Com

19、bination Medels Law of Independent Assortment The Second Law Alleles of different genes assort independently of one another during gamete formation. Computational Problems Combinatorial principles 组合原理组合原理 Rule of sum （加法原理） Rule of product （乘法原理） More about Mendels Laws Gregor Johann Mendel, a 19th

20、 century Austrian Priest/monk Trained as physicist and majority of his published works related to meteorology. Between 1856 and 1863, he cultivated and tested some 29,000 pea plants. published in 1866. In 1900，re-discovered by three European scientists, Hugo de Vries, Carl Correns, and Erich von Tsc

21、hermak. William Bateson, who coined the term genetics, gene, and allele to describe many of its tenets. Very few true Mendelian characters in nature. R.A. Fisher Thomas Hunt Morgan, Chromosome, classic genetics Biological Story 2 Hardy-Weinberg Law Hardy-Weinberg Law (1908) P(A1) = p, P(A2) = q; Ran

22、dom mating P(A1A1) = p2, P(A1A2) = 2pq, P(A2A2) = q2; Parent generation: P(A1A1) = p11, P(A1A2) = p12, P(A2A2) = p22; p11 + p12 + p22 = 1; P(A1) = p=p11+0.5p12, P(A2) = q=p22+0.5p12; p + q = 1; Mating frequency offspring genotype FatherMotherA1A1A1A2A2A2 A1A1A1A1p11p11100 ? ? Thomas Hunt Morgan Sept

23、ember 25, 1866 December 4, 1945 American evolutionary biologist, geneticist and embryologist Nobel Prize in Physiology or Medicine in 1933 for discoveries relating the role the chromosome plays in heredity 22 books and 370 scientific papers The Division of Biology he established at the California In

24、stitute of Technology has produced seven Nobel Prize winners. Biological Story 3 Linear arrangement of genes First Genetic Map Alfred Sturtevant (1891-1970) Undergraduate 1 mu = 1 cM = 1% = 0.88 Mbp Crossover vs recombination Isnt there a problem using recombination rate as a measure of distance? Ma

25、ximum recombination rate (q) is 0.5 Distance (l) should be defined as the expected number of crossovers. 0.10.2 0.3 ABC D q =0.05 Psp|P01228|FSHB_PIG Follitropin beta chain precursor (Follicle-stimulating hormone beta subunit) (FSH-beta) (FSH-B) Length = 129 Score = 275 bits (703), Expect = 2e-74 Id

26、entities = 120/129 (93%), Positives = 124/129 (96%) Query: 1 MKTLQFFFLFCCWKAICCNSCELTNITIAIEKEECRFCISINTTWCAGYCYTRDLVYKDP 60 MK+LQF FLFCCWKAICCNSCELTNITI +EKEEC FCISINTTWCAGYCYTRDLVYKDP Sbjct: 1 MKSLQFCFLFCCWKAICCNSCELTNITITVEKEECNFCISINTTWCAGYCYTRDLVYKDP 60 Query: 61 ARPKIQKTCTFKELVYETVRVPGCAHHADSL

27、YTYPVATQCHCGKCDSDSTDCTVRGLGPS 120 ARP IQKTCTFKELVYETV+VPGCAHHADSLYTYPVAT+CHCGKCDSDSTDCTVRGLGPS Sbjct: 61 ARPNIQKTCTFKELVYETVKVPGCAHHADSLYTYPVATECHCGKCDSDSTDCTVRGLGPS 120 Query: 121 YCSFGEMKE 129 YCSF EMKE Sbjct: 121 YCSFSEMKE 129 Yes, do a blast search! But, what does all this mean? A sequence is a

28、list of objects in some order. For example, (7, 21, 57) A sequence with k elements is a k-tuple. A sequence is different from a set in that repetition is allowed in the former but not in the later. DNAs and proteins are perfect examples of sequences with different alphabets. G, A, T, C for DNA 20 am

29、ino acids for proteins. Sequence alignment is to check how well two or more sequences match with each other. Similarity means homology, ie, coming from a common source. “Nature is a tinkerer and not an inventor.” -Jacob 1977 A T T C G G C A T TC A G T G C T A G A A T T C G G C A T TG C T A G A Optim

30、al alignment and scoring system. A T T C G G C A T TC A G T G C T A G A A T T C G G C A T T-G C T A G A Global vs local alignment Global alignment - aaagcggaagtcacag |.|.| |.| aaggctgaagt-atag Local alignment - : aaagcggaagtcacag .| . aaggctgaagt-atag Sequence Alignments with Indels Evolution produc

31、es insertions and deletions (indels) In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHNALQRRTVWVNAY MHHALQRRTVWVNAY- MHH-ALQRRTVWVNAY Blosum Score = 2 (end = -6) Score = 79 (gap = -6) An alignment must have equal length aligned sequences So, we must add gaps at the start and the ends C

32、ombinatorially difficult problem to find best indel solution Gap Penalties Gaps are penalised Write wx to indicate the penalty for a gap of length x For example, each gap scores -6, so wx = -6*x One common scheme is Score -12 for opening a gap And -2 for every subsequent gap i.e., wx = -12 - 2*(x-1)

33、 Start and end gap penalties often set to zero But this can leave a doubt About evolutionary conclusions How many ways of intercalating two sequences of lengths n and m? Any pair-wise alignment can be transformed into an intercalated sequence. ! )!( alignments ofnumber total nm mn The brute force ap

34、proach Cm+n m A T T -G G C A T T a ttc g a c -t A a T tT tc G g G a C c A T T t Dot Matrix Representations (Dotplots) To help visualise best alignments Plot where each pair is the same, then draw best line MNALSQLN Nl l Al Ll l Ml Sl Ql Nl l H M NALSQLN Nl l Al Ll Ml Sl Ql Nl l H Getting Alignments

35、from Dotplot Paths M NALSQLN Nl l Al Ll Ml Sl Ql Nl l H Indicates that M matches with a gap Indicates that L matches with a gap Stage 1: Align middle Use triangles To indicate gaps NAL-SQLN NALMSQ-N Stage 2: Sort the ends out MNAL-SQLN- -NALMSQ-NH Dotplots for Real Proteins Need a way to automatical

36、ly find the best path(s) Outline lSequence alignment defined lDynamic programming, fasta, and blast lScoring matrix Dynamic Programming Approach Dynamic Programming: Also known as: Needleman-Wunsch Algorithm Can use it to draw the Dotplot paths From that we can get the alignment Mathematically guara

37、nteed To find the best scoring alignment Given a substitution scheme (scoring scheme, e.g., BLOSUM) And given a gap penalty Overview of Needleman- Wunsch Four Stages 1. Initialise a matrix for the sequences 2. Fill in the entries of that matrix (call these Si,j) At the same time drawing arrows in th

38、e matrix 3. Use the arrows to find the best scoring path(s) 4. Interpret the paths as alignments as before Illustrate with: MNALQM bestSequence s ; Return bestSequence FASTA - Algorithm - Step 1 Find all hot-spots (matching words) / Hot spots are pairs of words of length k (k-tuple) that exactly mat

39、ch K= 1 or 2 for protein K= 4 or 6 for DNA Hot Spots Sequence 1 Sequence 2 FASTA - Algorithm - Step 1 in detail Use look-up Table Query : G A A T T C A G T T A Sequence: G G A T C G A 4,5,9,10T 1,8G 6C 2,3,7,11A LocationQ *A *G *C *T *A *G *G ATTGACTTAAG Look-up Table DotMatrix FASTA - Algorithm - S

40、tep 2 Score the Hot-spot and locate the ten best diagonal run. / There is some scoring system; ex. PAM250 lStep 3 Combine sub-alignments into one alignment with GAP FASTA - Algorithm - One of local alignment GAP FASTA - Algorithm - Step 3 algorithm # Consider weighted direct graph. # Let node be a s

41、ub-alignment found in step 2 # Let u and v be nodes # Edge (u,v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) # Find the maximum weight path One Sequence runs Edge One of Sequence FASTA - Algorithm - Step 3 in detail Sub-alignment Gap -5 -3 GAP Max Weight P

42、ath -3 FASTA - Algorithm - Step 4 Use the dynamic programming in restricted area around the best- score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter FASTA - Summary of Algorithm - 1: Find all hot-spots / Hot spots is pairs of words

43、of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment: Score Each alignment with gap penalty and pick up the best-score alignment 4: Use the dynamic programming in restricted area around the best-score alignment to fin

44、d out the alignment greater than the best-score alignment. FASTA - Complexity - Complexity # Step 1 and 2 / select the best 10 diagonal run Let n be a sequence from DB O(n) because Step 1 just uses look up the table O(n) O(mn) m,n = 100 to 200 FASTA - Complexity - # Step 3 / compute the MAX Weight P

45、ath Let r be the number of sub-alignments. (r = 10) Lets be the number of edges O(r2) = 104 -5 -3 Max Weight Path -3 Positive Weight FASTA - Complexity - # Step 4/ compute partial D.P. Depends on the restricted area filename 只能创建新文件,不能编辑已有文件. 3.将几个文件合并为一个文件。 $cat file1 file2 file3 -bash-3.2$ cat seq

46、.txt query ADRGHLKEWI seq ADRTHLKDWI rm 功能：功能：删除命令中所列出的每个文件。默认时，该命令不删除目录。其中FILE是要删除的文件 OPTION参数： -f:强制删除，不出现确认信息。 -r或-R或-recursive：以递归方式删除目录中内容。 cp seq.txt test.txt ls seq.txt test.txt rm test.txt ls seq.txt find find /opt/www/ -name seq.txt /opt/www/hanclass/bio_stu/seq.txt Outline Operating Syste

47、m and Linux Web server, http, HTML and Apache Database and MySQL Programming and PHP FASTA implementation Web server TCP, IP, HTTP HTML student! /body? http:/210.77.20.246/hanclass/bio_stu/test.php ftp http:/ WinSCP/4.3.2/winscp432setup.exe/downl oad php editor http:/ 14.html http:/ Outline Operatin

48、g System and Linux HTML and Apache Database and MySQL Programming and PHP FASTA implementation Database A database is a system intended to organize, store, and retrieve large amounts of data easily. Database management system (DBMS) consists of software that operates databases, providing storage, ac

49、cess, security, backup and other facilities. Relational databaseER model DBMS：Data Base Management System Microsoft Access Oracle DB2 MySQL 数据库服务器My SQL 什么是MySQL MySQL是一个小型关系型数据库管理系统，开发者为瑞典MySQL AB公司。在2008年1月16号被Sun公司收购。而2009年,SUN又被 Oracle收购. 目前MySQL被广泛地应用在Internet上的中小型网站中。由于其体积小、速度快、总体拥有成本低，尤其是开放源

50、码这一特点，许多中小型网站为了降低网站总体拥有成本而选择了MySQL作为网站数据库。 MySQL是一个多用户、多线程的SQL数据库，是一个客户机/服务器结构的应用，它由一个服务器守护程序mysqld和很多不同的客户程序和库组成。MySQL 主要的目标是快速、稳定和容易使用。 Connect and disconnect 当使用mysql命令来连接MySQL服务器时，通常需要提供一个MySQL用户名和密码。如果MySQL服务器运行在不是用户所登录的计算机上时，还将需要指定主机名。连接MySQL服务器的语句格式如下： mysql -u bio_stu -p mysql QUIT s

展开阅读全文