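How to Install and Use the BS4 Library in Python

Published: 2021-02-25 17:28:14  Source: 亿速云 (Yisu Cloud)  Views: 263  Author: Leah  Category: Development Technology

This article shows how to install the bs4 (Beautiful Soup 4) library in Python and walks through its basic usage with a small example document.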

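Installing the bs4 library

Much of Python's strength comes from being an open-source language with many developers publishing third-party libraries for it. When we want to implement a particular feature, we can concentrate on that feature and leave the remaining details and groundwork to a library. The bs4 library is one such library, and a powerful helper when writing web scrapers.

Installation is simple: use the pip tool from the command line.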

$ pip install beautifulsoup4

Next, check whether the bs4 library was installed successfully:

$ pip list

If beautifulsoup4 appears in the output, the bs4 library has been installed successfully.
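The same check can also be done from inside Python. The snippet below is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is made up for this illustration and is not part of bs4 or pip.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The package is published as "beautifulsoup4", while the import name is "bs4".
bs4_version = installed_version("beautifulsoup4")
print("beautifulsoup4:", bs4_version or "not installed")
```

Looking up the distribution name rather than importing `bs4` keeps the check from raising an error when the library is absent.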

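Basic usage of the bs4 library

Let's start with a brief look at how bs4 is used, setting aside for now the question of how to fetch pages from the web.

Suppose the HTML we want to scrape is the snippet below. It will be used as the example several times. It is a passage from Alice in Wonderland (referred to from here on as the "Alice document"):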

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>

Now let's parse this HTML with the bs4 library.

# Import BeautifulSoup from the bs4 module
from bs4 import BeautifulSoup

# Make the soup; `html` is a string holding the Alice document
soup = BeautifulSoup(html, 'html.parser')

# Print the result
print(soup.prettify())

# Output:
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

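As you can see, the bs4 library turns the page into a soup object (an instance of BeautifulSoup).

In fact, bs4 is a library for parsing, traversing, and maintaining a "tag tree".

Put simply, bs4 reorganizes the HTML source into a structured form, which makes it easy to work with its nodes, tags, and attributes.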
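To make the idea of a tag tree concrete, here is a minimal sketch that walks the tree recursively and records each tag and text node with its depth. The `walk` helper is a hypothetical name used only for this illustration; `Tag`, `NavigableString`, and `.children` come from bs4 itself.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# A small fragment of the Alice document
html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")


def walk(node, depth=0):
    """Return one indented line per tag or non-empty text node below `node`."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            lines.append("  " * depth + f"<{child.name}>")
            lines.extend(walk(child, depth + 1))
        elif isinstance(child, NavigableString):
            text = child.strip()
            if text:  # skip whitespace-only text nodes
                lines.append("  " * depth + repr(text))
    return lines


tree_lines = walk(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed result corresponds to one level of nesting in the HTML.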

Here are a few simple ways to navigate the structured data.

Compare each result with the HTML document shown at the start.

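# Find the document's title
soup.title
# <title>The Dormouse's story</title>

# The name of the title tag
soup.title.name
# 'title'

# The string inside the title
soup.title.string
# "The Dormouse's story"

# The name of the title's parent node
soup.title.parent.name
# 'head'

# The first paragraph found in the document
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# The class attribute of that p (class is multi-valued, so bs4 returns a list)
soup.p['class']
# ['title']

# The first a tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# All a tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# The tag whose id equals "link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>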
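Attribute access deserves a slightly closer look. The sketch below uses a single link from the Alice document and shows the three common ways to read attributes: the `attrs` dictionary, square-bracket indexing, and `get`, which returns None for a missing attribute instead of raising `KeyError`. The variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

attrs = link.attrs           # all attributes as a dict
href = link["href"]          # a single-valued attribute is a plain string
classes = link["class"]      # class is multi-valued, so it comes back as a list
missing = link.get("title")  # no title attribute on this tag, so None

print(attrs)
print(href, classes, missing)
```

Because `class` can hold several space-separated values in HTML, bs4 always hands it back as a list, even when there is only one value.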

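The examples above show how the bs4 library approaches an HTML source file:

First, it converts the HTML source into a soup object.

Then, content is extracted from that object in specific ways.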
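The two steps can be wrapped in one small function. This is only a sketch: `page_title` is a hypothetical helper written for this article, not something bs4 provides.

```python
from typing import Optional

from bs4 import BeautifulSoup


def page_title(html: str) -> Optional[str]:
    """Parse an HTML string and return the text of its <title>, if any."""
    # Step 1: convert the HTML source into a soup object
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: extract the content we are interested in
    if soup.title is None:
        return None
    return soup.title.string


title = page_title("<html><head><title>The Dormouse's story</title></head></html>")
print(title)
```

Returning None when there is no `<title>` avoids an `AttributeError` on pages that lack one.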

Slightly more advanced usage

Extract the URLs of all <a> tags in the document:

# Note that find_all returns a list-like result that can be iterated over
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
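When some links may lack an `href`, `find_all` can filter them out directly. The sketch below assumes a modified fragment of the Alice document in which the second link (id "no-href") has been invented for illustration and carries no `href`.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a class="sister" id="no-href">Nobody</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
links = {a.get_text(): a["href"] for a in soup.find_all("a", href=True)}

for text, url in links.items():
    print(text, url)
```

Filtering in `find_all` means `a["href"]` can be used safely, since every tag returned is known to have that attribute.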

Extract all the text from the document:

# The get_text method quickly returns all the text in the source document
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
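The raw text keeps the blank lines and line breaks of the source. If a more compact result is wanted, `get_text` accepts `separator` and `strip` arguments, and `stripped_strings` yields the individual pieces. A minimal sketch on a two-paragraph fragment of the Alice document:

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p>\n<p class="story">...</p>'
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and drops the whitespace-only ones;
# separator is placed between the remaining pieces
compact = soup.get_text(separator=" | ", strip=True)

# stripped_strings gives the same pieces one by one
pieces = list(soup.stripped_strings)

print(compact)
print(pieces)
```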

That covers how to install and use the BS4 library in Python.
