1、 自我介绍自我介绍 2000美国密苏里州立大学生物化学系博士美国密苏里州立大学生物化学系博士 20002003美国休斯敦美国休斯敦Lexicon制药公司高级生物信息学科学家制药公司高级生物信息学科学家 2004度中科院百人计划入学者,目前研究方向包括:度中科院百人计划入学者,目前研究方向包括: 1、精子发生、精子发生 2、干细胞自我更新与分化、干细胞自我更新与分化 技术专长:分子生物学、干细胞、生物信息学技术专长:分子生物学、干细胞、生物信息学 课程描述 课程编号:课程编号:511012Y课程属性:课程属性:学科基础课学时学时/学分:学分:40/2 预修课程:预修课程:分子生物学、遗传学、统计
2、学、C语言 教学目的和要求:教学目的和要求: 生物信息学是利用数学模型和计算机程序对生物学研究中产生的数据进行分 析计算并得出结论和产生新的科学假说的一种科研手段。通过本课程的教授, 使得学生能够: 懂得生物学中有哪些数学问题,数学模型和数学手段; 利用数据库技术、计算机编程和网页工具来进行基本的生物信息学分析; 掌握核酸和蛋白质序列分析的基本技能; 懂得如何从芯片和其他高通量技术产生的数据来构建基因调控网络; 本课程的开设要求学生有分子生物学、遗传学、统计学及C语言的基础知识 和技能,更重要的是要求学生要努力培养自己利用数学模型和逻辑思维来思 考和解决生物学问题。本课程为生物学各专业博士、硕
3、士研究生的学科基础 课,同时也可作为数理、计算机等相关学科研究生的选修课。本课程的考核 方式为大作业和期末考试,比例为50%:50%。 教学大纲 第一章第一章 生物信息学入门生物信息学入门 (9学时)学时) 1. 生物学中的数学问题生物学中的数学问题(computational problems in biology)(3学时,学时, 3月月2日)日) 第二章第二章 序列和结构序列和结构 (15学时)学时) 1. 序列比对(序列比对(sequence alignment)()(3学时,学时,3月月9日)日) 第一章第一章 生物信息学入门生物信息学入门 (9学时)学时) 2. 数据库原理、数据库
4、原理、PHP编程入门(编程入门(3学时:学时:3学时上机,学时上机,3月月16日)日) 3. R语言和语言和Bioconductor软件包(软件包(3学时:学时:3学时上机,学时上机, 3月月23日)日) 第二章第二章 序列和结构序列和结构 (15学时)学时) 2. 进化树进化树(phylogenetic trees)(1.5学时,学时,3月月30日)日) 3。模式发现(。模式发现(motif discovery)()(1.5学时,学时,3月月30日)日) 4. RNA二级结构(二级结构(RNA secondary structure)(3学时,学时,4月月6日,王秀杰日,王秀杰) 5. 蛋白
5、质结构分析(蛋白质结构分析(protein structure analysis)()(6学时,学时,4月月13日,蒋太交)日,蒋太交) 第三章第三章 从芯片数据到基因调控网络从芯片数据到基因调控网络 (15) 3.1 生物芯片设计生物芯片设计(microarray design)(1学时学时, 4月月27日)日) 3.2 表达值计算表达值计算(summation of expression value)(1学时学时, 4月月27日)日) 3.3 归一化归一化(normalization)(1学时学时, 4月月27日)日) 3.4 差异基因的分析差异基因的分析(differential gen
6、e expression)(3学时学时, 5月月4日)日) 3.5 聚类分析聚类分析(clustering)(3学时学时, 5月月11日)日) 3.6 网络入门网络入门(introduction to networks)(3学时学时, 5月月18日)日) 3.7 贝叶斯网络等贝叶斯网络等(Basian networks and others)(3学时学时, 5月月25日)王秀杰)日)王秀杰) 参考书 教材:教材: 本课程以科研文献阅读为主,没有特定教材。 主要参考书:主要参考书: 1. 简明生物信息学 钟扬, 张亮,赵琼主编 高等教育出版社 2001 2. 常用生物数据分析软件 王俊,丛丽娟,
7、郑洪坤著 科学出版社 2008 3. Bioinformatics: sequence and genome analysis David W. Mount New York : Cold Spring Harbor Laboratory, 2004 Features of my lectures Enlightening启发式 Interactive互动式 Interesting有趣的 English(半)英语的 第一章 生物信息学入门 1. 生物学中的数学问题 (computational problems in biology) Outlines 1. What is bioinform
8、atics? 2. Basic knowledges 3. Mathematical problems in biological researches: From Mendel to nowadays! Bioinformaticswhat is it? What is a triangle? What is human beings? Platos definition What is bioinformatics? Biologysubject Computer-tool MathematicsModel It is what you are doing for solving biol
9、ogical problems using a computer ! Bioinformatics in the Universe Universe (宇宙宇宙=空间空间+时间时间) Human civilization sciencesartsreligions Natural sciencesSocial sciences biologymathematicsphysics biostatisticsbioinformaticsComputational biology Non-human world What do you mean by biology? Taxonomy Physio
10、logy Evolution Cell biology Genetics Molecular biology-DNA, RNA, Protein How about computer? yes no PC, Server Quantum computer, DNA computer Internet TCP/IP Website Electronic business FTP P2P Telnet Hacker PC Apple Unix/Linux Chinese version C, Perl, PHP, JAVA, .NET compiler Database Spread sheet
11、And mathematics? Object(Subject): Mathematics is the study of quantity (arithmetic,算术), structure (algebra, 代数), space (geometry,几何), and change (calculus , 微积分). Pure mathmatics vs Applied mathematics Goldbach Conjecture vs Statistics Definition, axiom, statement Reasoning (proof) theorem (truth, k
12、nowledge) How does does mathematics work? Outlines 1. What is bioinformatics? 2. Basic knowledges 3. Mathematical problems in biological researches: From Mendel to nowadays! Definitions, notions, terminology Sets A set is a group of objects. Elements/members A=7, 21, 57 7A,8 Objects, classes, intera
13、ctions Laws of Thought 1.Law of identity: Whatever is, is. 2.Law of noncontradiction: Nothing can both be and not be. 3.Law of excluded middle: Everything must either be or not be. Reasoning, Logic, Argument Reasoning is the cognitive process of looking for reasons, beliefs, conclusions, actions or
14、feelings. Logic is the study of reasoning. An argument is a set of one or more meaningful declarative sentences (or propositions) known as the premises along with another meaningful declarative sentence (or proposition) known as the conclusion. One approach to the study of reasoning is to identify v
15、arious forms of reasoning that may be used to support or justify conclusions. The main division between forms of reasoning that is made in philosophy is between deductive reasoning and inductive reasoning. Formal logic has been described as the science of deduction. The study of inductive reasoning
16、is generally carried out within the field known as informal logic or critical thinking. Deductive reasoning Premise 1: All humans are mortal. Premise 2: Socrates is a human. Conclusion: Socrates is mortal. Inductive reasoning Premise: The sun has risen in the east every morning up until now. Conclus
17、ion: The sun will also rise in the east tomorrow. Statistical inference Statistical inference is the process of making conclusions using data that is subject to random variation, for example, observational errors or sampling variation. Outlines 1. What is bioinformatics? 2. Basic knowledges 3. Mathe
18、matical problems in biological researches: From Mendel to nowadays! Biological Story 1 Medels Laws Medels Law of Segregation The First Law When any individual produces gametes, the copies of a gene separate so that each gamete receives only one copy. Binary phenotype Dominance Gametes Statistics Com
19、bination Medels Law of Independent Assortment The Second Law Alleles of different genes assort independently of one another during gamete formation. Computational Problems Combinatorial principles 组合原理组合原理 Rule of sum (加法原理) Rule of product (乘法原理) More about Mendels Laws Gregor Johann Mendel, a 19th
20、 century Austrian Priest/monk Trained as physicist and majority of his published works related to meteorology. Between 1856 and 1863, he cultivated and tested some 29,000 pea plants. published in 1866. In 1900,re-discovered by three European scientists, Hugo de Vries, Carl Correns, and Erich von Tsc
21、hermak. William Bateson, who coined the term genetics, gene, and allele to describe many of its tenets. Very few true Mendelian characters in nature. R.A. Fisher Thomas Hunt Morgan, Chromosome, classic genetics Biological Story 2 Hardy-Weinberg Law Hardy-Weinberg Law (1908) P(A1) = p, P(A2) = q; Ran
22、dom mating P(A1A1) = p2, P(A1A2) = 2pq, P(A2A2) = q2; Parent generation: P(A1A1) = p11, P(A1A2) = p12, P(A2A2) = p22; p11 + p12 + p22 = 1; P(A1) = p=p11+0.5p12, P(A2) = q=p22+0.5p12; p + q = 1; Mating frequency offspring genotype FatherMotherA1A1A1A2A2A2 A1A1A1A1p11p11100 ? ? Thomas Hunt Morgan Sept
23、ember 25, 1866 December 4, 1945 American evolutionary biologist, geneticist and embryologist Nobel Prize in Physiology or Medicine in 1933 for discoveries relating the role the chromosome plays in heredity 22 books and 370 scientific papers The Division of Biology he established at the California In
24、stitute of Technology has produced seven Nobel Prize winners. Biological Story 3 Linear arrangement of genes First Genetic Map Alfred Sturtevant (1891-1970) Undergraduate 1 mu = 1 cM = 1% = 0.88 Mbp Crossover vs recombination Isnt there a problem using recombination rate as a measure of distance? Ma
25、ximum recombination rate (q) is 0.5 Distance (l) should be defined as the expected number of crossovers. 0.10.2 0.3 ABC D q =0.05 Psp|P01228|FSHB_PIG Follitropin beta chain precursor (Follicle-stimulating hormone beta subunit) (FSH-beta) (FSH-B) Length = 129 Score = 275 bits (703), Expect = 2e-74 Id
26、entities = 120/129 (93%), Positives = 124/129 (96%) Query: 1 MKTLQFFFLFCCWKAICCNSCELTNITIAIEKEECRFCISINTTWCAGYCYTRDLVYKDP 60 MK+LQF FLFCCWKAICCNSCELTNITI +EKEEC FCISINTTWCAGYCYTRDLVYKDP Sbjct: 1 MKSLQFCFLFCCWKAICCNSCELTNITITVEKEECNFCISINTTWCAGYCYTRDLVYKDP 60 Query: 61 ARPKIQKTCTFKELVYETVRVPGCAHHADSL
27、YTYPVATQCHCGKCDSDSTDCTVRGLGPS 120 ARP IQKTCTFKELVYETV+VPGCAHHADSLYTYPVAT+CHCGKCDSDSTDCTVRGLGPS Sbjct: 61 ARPNIQKTCTFKELVYETVKVPGCAHHADSLYTYPVATECHCGKCDSDSTDCTVRGLGPS 120 Query: 121 YCSFGEMKE 129 YCSF EMKE Sbjct: 121 YCSFSEMKE 129 Yes, do a blast search! But, what does all this mean? A sequence is a
28、list of objects in some order. For example, (7, 21, 57) A sequence with k elements is a k-tuple. A sequence is different from a set in that repetition is allowed in the former but not in the later. DNAs and proteins are perfect examples of sequences with different alphabets. G, A, T, C for DNA 20 am
29、ino acids for proteins. Sequence alignment is to check how well two or more sequences match with each other. Similarity means homology, ie, coming from a common source. “Nature is a tinkerer and not an inventor.” -Jacob 1977 A T T C G G C A T TC A G T G C T A G A A T T C G G C A T TG C T A G A Optim
30、al alignment and scoring system. A T T C G G C A T TC A G T G C T A G A A T T C G G C A T T-G C T A G A Global vs local alignment Global alignment - aaagcggaagtcacag |.|.| |.| aaggctgaagt-atag Local alignment - : aaagcggaagtcacag .| . aaggctgaagt-atag Sequence Alignments with Indels Evolution produc
31、es insertions and deletions (indels) In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHNALQRRTVWVNAY MHHALQRRTVWVNAY- MHH-ALQRRTVWVNAY Blosum Score = 2 (end = -6) Score = 79 (gap = -6) An alignment must have equal length aligned sequences So, we must add gaps at the start and the ends C
32、ombinatorially difficult problem to find best indel solution Gap Penalties Gaps are penalised Write wx to indicate the penalty for a gap of length x For example, each gap scores -6, so wx = -6*x One common scheme is Score -12 for opening a gap And -2 for every subsequent gap i.e., wx = -12 - 2*(x-1)
33、 Start and end gap penalties often set to zero But this can leave a doubt About evolutionary conclusions How many ways of intercalating two sequences of lengths n and m? Any pair-wise alignment can be transformed into an intercalated sequence. ! )!( alignments ofnumber total nm mn The brute force ap
34、proach Cm+n m A T T -G G C A T T a ttc g a c -t A a T tT tc G g G a C c A T T t Dot Matrix Representations (Dotplots) To help visualise best alignments Plot where each pair is the same, then draw best line MNALSQLN Nl l Al Ll l Ml Sl Ql Nl l H M NALSQLN Nl l Al Ll Ml Sl Ql Nl l H Getting Alignments
35、from Dotplot Paths M NALSQLN Nl l Al Ll Ml Sl Ql Nl l H Indicates that M matches with a gap Indicates that L matches with a gap Stage 1: Align middle Use triangles To indicate gaps NAL-SQLN NALMSQ-N Stage 2: Sort the ends out MNAL-SQLN- -NALMSQ-NH Dotplots for Real Proteins Need a way to automatical
36、ly find the best path(s) Outline lSequence alignment defined lDynamic programming, fasta, and blast lScoring matrix Dynamic Programming Approach Dynamic Programming: Also known as: Needleman-Wunsch Algorithm Can use it to draw the Dotplot paths From that we can get the alignment Mathematically guara
37、nteed To find the best scoring alignment Given a substitution scheme (scoring scheme, e.g., BLOSUM) And given a gap penalty Overview of Needleman- Wunsch Four Stages 1. Initialise a matrix for the sequences 2. Fill in the entries of that matrix (call these Si,j) At the same time drawing arrows in th
38、e matrix 3. Use the arrows to find the best scoring path(s) 4. Interpret the paths as alignments as before Illustrate with: MNALQM bestSequence s ; Return bestSequence FASTA - Algorithm - Step 1 Find all hot-spots (matching words) / Hot spots are pairs of words of length k (k-tuple) that exactly mat
39、ch K= 1 or 2 for protein K= 4 or 6 for DNA Hot Spots Sequence 1 Sequence 2 FASTA - Algorithm - Step 1 in detail Use look-up Table Query : G A A T T C A G T T A Sequence: G G A T C G A 4,5,9,10T 1,8G 6C 2,3,7,11A LocationQ *A *G *C *T *A *G *G ATTGACTTAAG Look-up Table DotMatrix FASTA - Algorithm - S
40、tep 2 Score the Hot-spot and locate the ten best diagonal run. / There is some scoring system; ex. PAM250 lStep 3 Combine sub-alignments into one alignment with GAP FASTA - Algorithm - One of local alignment GAP FASTA - Algorithm - Step 3 algorithm # Consider weighted direct graph. # Let node be a s
41、ub-alignment found in step 2 # Let u and v be nodes # Edge (u,v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) # Find the maximum weight path One Sequence runs Edge One of Sequence FASTA - Algorithm - Step 3 in detail Sub-alignment Gap -5 -3 GAP Max Weight P
42、ath -3 FASTA - Algorithm - Step 4 Use the dynamic programming in restricted area around the best- score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter FASTA - Summary of Algorithm - 1: Find all hot-spots / Hot spots is pairs of words
43、of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment: Score Each alignment with gap penalty and pick up the best-score alignment 4: Use the dynamic programming in restricted area around the best-score alignment to fin
44、d out the alignment greater than the best-score alignment. FASTA - Complexity - Complexity # Step 1 and 2 / select the best 10 diagonal run Let n be a sequence from DB O(n) because Step 1 just uses look up the table O(n) O(mn) m,n = 100 to 200 FASTA - Complexity - # Step 3 / compute the MAX Weight P
45、ath Let r be the number of sub-alignments. (r = 10) Lets be the number of edges O(r2) = 104 -5 -3 Max Weight Path -3 Positive Weight FASTA - Complexity - # Step 4/ compute partial D.P. Depends on the restricted area filename 只能创建新文件,不能编辑已有文件. 3.将几个文件合并为一个文件。 $cat file1 file2 file3 -bash-3.2$ cat seq
46、.txt query ADRGHLKEWI seq ADRTHLKDWI rm 功能:功能:删除命令中所列出的每个文件。默认时,该命令不删除目 录。其中FILE是要删除的文件 OPTION参数: -f:强制删除,不出现确认信息。 -r或-R或-recursive:以递归方式删除目录中内容。 cp seq.txt test.txt ls seq.txt test.txt rm test.txt ls seq.txt find find /opt/www/ -name seq.txt /opt/www/hanclass/bio_stu/seq.txt Outline Operating Syste
47、m and Linux Web server, http, HTML and Apache Database and MySQL Programming and PHP FASTA implementation Web server TCP, IP, HTTP HTML student! /body? http:/210.77.20.246/hanclass/bio_stu/test.php ftp http:/ WinSCP/4.3.2/winscp432setup.exe/downl oad php editor http:/ 14.html http:/ Outline Operatin
48、g System and Linux HTML and Apache Database and MySQL Programming and PHP FASTA implementation Database A database is a system intended to organize, store, and retrieve large amounts of data easily. Database management system (DBMS) consists of software that operates databases, providing storage, ac
49、cess, security, backup and other facilities. Relational databaseER model DBMS:Data Base Management System Microsoft Access Oracle DB2 MySQL 数据库服务器My SQL 什么是MySQL MySQL是一个小型关系型数据库管理系统,开发者为瑞典MySQL AB公司。在2008年1月16号被Sun公司收购。而2009年,SUN又被 Oracle收购. 目前MySQL被广泛地应用在Internet上的中小型网站中。 由于其体积小、速度快、总体拥有成本低,尤其是开放源
50、码这一特 点,许多中小型网站为了降低网站总体拥有成本而选择了MySQL作 为网站数据库。 MySQL是一个多用户、多线程的SQL数据库,是一个客户机/服务器结 构的应用,它由一个服务器守护程序mysqld和很多不同的客户程序 和库组成。MySQL 主要的目标是快速、稳定和容易使用。 Connect and disconnect 当使用mysql命令来连接MySQL服务器时, 通常需要提供一个MySQL用户名和密码。 如果MySQL服务器运行在不是用户所登 录的计算机上时,还将需要指定主机名。 连接MySQL服务器的语句格式如下: mysql -u bio_stu -p mysql QUIT s