Python Web Crawler Programming (KC23.pptx)


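2.3.1 Finding HTML Elements with BeautifulSoup

Finding elements in a document is the main way we extract information from a web page. BeautifulSoup provides a family of search methods, and find_all is one of the most commonly used. Its prototype is:

    find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

self shows that find_all is a method of a class. name is the name of the tag to look for; it defaults to None, and if it is not given, elements of every name are searched. attrs is a dictionary of element attributes; it is empty by default, and if it is given, only elements that have the specified attributes are matched. recursive controls whether the search covers the whole subtree below the element; it defaults to True. The remaining parameters (text, limit and kwargs) are more involved and are introduced when they are first used.

find_all returns a list of all the matching elements, and each item in the list is a bs4.element.Tag object.

find_all finds every element node that satisfies the conditions. If only one element node is wanted, use find, whose prototype is:

    find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

It is used in the same way as find_all, except that it returns only the first matching node rather than a list.

The examples in sections 2.3.1 to 2.3.3 all parse the same sample document, so it is shown once here and each example assumes doc has been defined this way. The HTML markup and the link addresses of this document did not survive in the source text (only http:/ remains of each address). The tags below are reconstructed from the explanations and results that follow, and the example.com addresses are placeholders, not the original values.

    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
    </p>
    </body>
    </html>
    '''

Example 2-3-1: Find the <title> element in the document

    from bs4 import BeautifulSoup

    # doc is the sample document defined above
    soup = BeautifulSoup(doc, "lxml")
    tag = soup.find("title")
    print(type(tag), tag)

Program result:

    <class 'bs4.element.Tag'> <title>The Dormouse's story</title>

The <title> element is found, and its type is bs4.element.Tag.

Example 2-3-2: Find all <a> elements in the document

    from bs4 import BeautifulSoup

    # doc is the sample document defined above
    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all("a")
    for tag in tags:
        print(tag)

The program finds 3 <a> elements:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Example 2-3-3: Find the first <a> element in the document

    from bs4 import BeautifulSoup

    # doc is the sample document defined above
    soup = BeautifulSoup(doc, "lxml")
    tag = soup.find("a")
    print(tag)

The program finds the first <a> element:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Example 2-3-4: Find the <p> element with class="title"

    from bs4 import BeautifulSoup

    # doc is the sample document defined above
    soup = BeautifulSoup(doc, "lxml")
    tag = soup.find("p", attrs={"class": "title"})
    print(tag)

The program finds the <p> element with class="title":

    <p class="title"><b>The Dormouse's story</b></p>

For this document, tag = soup.find("p") finds the same element, because it is the first <p> element in the document.

Example 2-3-5: Find the elements with class="sister"

    from bs4 import BeautifulSoup

    # doc is the sample document defined above
    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all(name=None, attrs={"class": "sister"})
    for tag in tags:
        print(tag)

Here name=None means that elements of any name are accepted. The program finds 3 elements:

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

For this document, the statements tags = soup.find_all("a") and tags = soup.find_all("a", attrs={"class": "sister"}) give the same result.

2.3.2 Getting an Element's Attribute Values with BeautifulSoup

Once an element has been found, for example an <a> element, how do we get its attribute values? BeautifulSoup uses tag[attrName] to get the value of the attribute named attrName of the element tag, where tag is a bs4.element.Tag object.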
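The subscript form tag[attrName] can be tried on a small snippet. The following is a minimal sketch, not part of the original slides: the two-link HTML and its example.com address are hypothetical, and "html.parser" is used in place of lxml only so that the sketch runs without an extra install. It also shows has_attr and get, which are useful when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; the example.com address is a placeholder.
html = '''
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
'''

# "html.parser" ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# Subscript access returns the value of the named attribute.
print(first["href"])

# class can hold several values, so Beautiful Soup returns it as a list.
print(first["class"])

# tag.attrs is the dictionary holding all attributes of the tag.
print(first.attrs)

# The second link has no href, so second["href"] would raise KeyError.
# has_attr() tests for the attribute; get() returns None when it is absent.
print(second.has_attr("href"))
print(second.get("href"))
```

Using get is a reasonable choice when crawling real pages, where some links may carry no href at all.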

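Example 2-3-6: Find all hyperlink addresses in the document

    from bs4 import BeautifulSoup

    # doc is the sample document from section 2.3.1
    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all("a")
    for tag in tags:
        print(tag["href"])

Program result (the addresses are truncated to http:/ in the source; the values below are the placeholder addresses used in the sample document):

    http://example.com/elsie
    http://example.com/lacie
    http://example.com/tillie

2.3.3 Getting the Text Contained in an Element with BeautifulSoup

Once an element has been found, for example an <a> element, how do we get the text it contains? BeautifulSoup uses tag.text to get the text contained in the element tag, where tag is a bs4.element.Tag object.

Example 2-3-7: Get the text contained in every hyperlink in the document

    from bs4 import BeautifulSoup

    # doc is the sample document from section 2.3.1
    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all("a")
    for tag in tags:
        print(tag.text)

Program result:

    Elsie
    Lacie
    Tillie

Example 2-3-8: Get the text contained in every <p> element in the document

    from bs4 import BeautifulSoup

    # doc is the sample document from section 2.3.1
    soup = BeautifulSoup(doc, "lxml")
    # find_all is needed here: find would return a single Tag, not a list of <p> elements
    tags = soup.find_all("p")
    for tag in tags:
        print(tag.text)

Program result (the line breaks inside the second paragraph follow the layout of doc):

    The Dormouse's story

    Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.

The value for the second <p> is the combination of all the text nodes in the subtree below that node, including the text inside the three <a> elements.

2.3.4 Advanced Search with BeautifulSoup

find and find_all are usually enough. When they are not, we can write our own filter function and search with it.

Example 2-3-9: Find the <a> element whose href equals a given address

The address is truncated to http:/ in the source, and the end of the program and the markup of doc are missing. The listing below is reconstructed from the explanation that follows it, with placeholder example.com addresses.

    from bs4 import BeautifulSoup

    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie">Elsie</a>
    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tillie">Tillie</a>
    </body>
    </html>
    '''

    def myFilter(tag):
        # Print each tag name to show the order in which tags are visited
        print(tag.name)
        return (tag.name == "a" and tag.has_attr("href")
                and tag["href"] == "http://example.com/lacie")

    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all(myFilter)
    for tag in tags:
        print(tag)

Note: the program defines a filter function myFilter(tag) whose parameter is a tag object. When soup.find_all(myFilter) is called, every tag element is passed to myFilter, and that function decides whether the tag is kept: if myFilter returns True the tag is added to the result set, otherwise it is discarded. When the program runs, we can therefore see the tags html, head, title, body, a, a, a go through myFilter one by one. Only the Lacie node satisfies the condition, so the result is:

    <a href="http://example.com/lacie">Lacie</a>

In the filter, tag.name is the name of the tag, tag.has_attr(attName) tests whether the tag has an attribute named attName, and tag[attName] is the value of that attribute.

Example 2-3-10: Function filters can match more complex conditions; find all <a> nodes whose text ends with "cie"

    from bs4 import BeautifulSoup

    # The example.com addresses are placeholders; the source lost the markup
    doc = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <a href="http://example.com/elsie">Elsie</a>
    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tillie">Tillie</a>
    <a href="http://example.com/tilcie">Tilcie</a>
    </body>
    </html>
    '''

    def endsWith(s, t):
        # True if string s ends with string t
        if len(s) >= len(t):
            return s[len(s) - len(t):] == t
        return False

    def myFilter(tag):
        return (tag.name == "a" and endsWith(tag.text, "cie"))

    soup = BeautifulSoup(doc, "lxml")
    tags = soup.find_all(myFilter)
    for tag in tags:
        print(tag)

Program result:

    <a href="http://example.com/lacie">Lacie</a>
    <a href="http://example.com/tilcie">Tilcie</a>

The program defines a function endsWith(s, t) that tests whether string s ends with string t, returning True if it does and False otherwise. myFilter calls this function to test whether tag.text ends with "cie", so the search returns all <a> nodes whose text ends with "cie".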
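The same kind of filter can be written more compactly. The following is a minimal sketch, not taken from the slides: it assumes a document shaped like the one in Example 2-3-10 (the example.com addresses and the name ends_with_cie are hypothetical), replaces the hand-written endsWith helper with Python's built-in str.endswith, and uses "html.parser" so that lxml is not required. It also shows the limit parameter of find_all, which section 2.3.1 mentions without an example.

```python
from bs4 import BeautifulSoup

# Hypothetical document in the shape of Example 2-3-10; addresses are placeholders.
html = '''
<html><body>
<a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a href="http://example.com/tillie">Tillie</a>
<a href="http://example.com/tilcie">Tilcie</a>
</body></html>
'''

soup = BeautifulSoup(html, "html.parser")

def ends_with_cie(tag):
    # str.endswith performs the same test as the endsWith helper.
    return tag.name == "a" and tag.text.endswith("cie")

# Collect the text of every <a> element accepted by the filter.
names = [tag.text for tag in soup.find_all(ends_with_cie)]
print(names)

# limit stops the search once that many matches have been found.
first_only = soup.find_all(ends_with_cie, limit=1)
print(len(first_only), first_only[0].text)
```

With limit=1 the search returns a one-item list, whereas find would return the Tag itself.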
