
Beautiful Soup

In [1]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
In [2]:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
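# prettify() formats the Beautiful Soup parse tree and returns it as a Unicode string,
# with every XML/HTML tag on its own line.
# It can be called on the BeautifulSoup object or on any of its Tag objects.
print(soup.prettify())
# To get encoded output instead, pass an encoding to prettify(); the result is then bytes
# print(soup.prettify("latin-1"))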
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
In [3]:
# If you only want the resulting string and do not care about formatting,
# call Python's str() on a BeautifulSoup object or a Tag object
str(soup)
Out[3]:
'<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body></html>'
In [4]:
# If you only want the text contained in a tag, call get_text().
# It returns all the text under the tag, including text inside descendant tags.
soup.get_text()
Out[4]:
"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"
In [5]:
print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

In [6]:
# An argument can specify the separator placed between the pieces of text
soup.get_text("|")
Out[6]:
"The Dormouse's story|\n|\n|The Dormouse's story|\n|Once upon a time there were three little sisters; and their names were\n|Elsie|,\n|Lacie| and\n|Tillie|;\nand they lived at the bottom of a well.|\n|...|\n"
In [7]:
# strip=True also removes leading and trailing whitespace from each piece of text
soup.get_text("|", strip=True)
Out[7]:
"The Dormouse's story|The Dormouse's story|Once upon a time there were three little sisters; and their names were|Elsie|,|Lacie|and|Tillie|;\nand they lived at the bottom of a well.|..."
In [8]:
# Alternatively, use the .stripped_strings generator to get the pieces as a list and process them yourself
[text for text in soup.stripped_strings]
Out[8]:
["The Dormouse's story",
 "The Dormouse's story",
 'Once upon a time there were three little sisters; and their names were',
 'Elsie',
 ',',
 'Lacie',
 'and',
 'Tillie',
 ';\nand they lived at the bottom of a well.',
 '...']
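As the output above shows, strip=True and .stripped_strings only trim the ends of each string, so a newline inside a string survives. A minimal sketch of one way to normalize that, assuming plain whitespace collapsing is acceptable for your text (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one string with an internal newline and a double space.
soup = BeautifulSoup(
    "<p>Tillie;\nand they lived  at the bottom</p><p> of a well. </p>",
    "html.parser",
)

# Join the stripped pieces with a single space.
joined = soup.get_text(" ", strip=True)

# Collapse any remaining runs of whitespace (including the internal newline).
normalized = " ".join(joined.split())
print(repr(joined))
print(repr(normalized))
```

The second step is ordinary Python string handling, not a Beautiful Soup feature.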
In [9]:
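# The BeautifulSoup object represents the parsed document as a whole.
# For most purposes you can treat it as a Tag object:
# it supports most of the methods described under navigating and searching the tree.
# Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes.
# It is sometimes convenient to look at its .name anyway,
# so the BeautifulSoup object is given the special .name "[document]"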
soup.name
Out[9]:
'[document]'
In [10]:
# Attribute-style access returns only the first tag with that name.
# To get all the <a> tags, use one of the tree-searching methods, such as find_all()
soup.a
Out[10]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In [11]:
soup.a.attrs
Out[11]:
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
In [12]:
soup.a['href']
Out[12]:
'http://example.com/elsie'
In [13]:
soup.a['class']
Out[13]:
['sister']
In [14]:
soup.a['id']
Out[14]:
'link1'
In [15]:
soup.p
Out[15]:
<p class="title"><b>The Dormouse's story</b></p>
In [16]:
type(soup.p)
Out[16]:
bs4.element.Tag
In [17]:
soup.p.name
Out[17]:
'p'
In [18]:
type(soup.p.name)
Out[18]:
str
In [19]:
# The .parent attribute returns an element's parent node
soup.p.parent
Out[19]:
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
In [20]:
soup.p.parent.name
Out[20]:
'body'
In [21]:
# The .parents attribute iterates over all of an element's ancestors, from the nearest up to the document
for parent in soup.p.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
body
html
[document]
In [22]:
soup.p.b
Out[22]:
<b>The Dormouse's story</b>
In [23]:
soup.p.b.text
Out[23]:
"The Dormouse's story"
In [24]:
type(soup.p.b.text)
Out[24]:
str
In [25]:
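# Beautiful Soup wraps the text inside a tag in a NavigableString object.
# A NavigableString supports most, but not all, of the attributes described under navigating and searching the tree.
# In particular, a string cannot contain anything else (whereas a tag can contain strings and other tags),
# so the Beautiful Soup documentation notes that strings do not support .contents, .string, or a tree-searching find().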
type(soup.p.b.string)
Out[25]:
bs4.element.NavigableString
In [26]:
# If a tag has exactly one child and that child is a NavigableString, .string returns that child
soup.p.b.string
Out[26]:
"The Dormouse's story"
In [27]:
# If a tag has more than one child, it is ambiguous which one .string should refer to, so .string is None
print(soup.body.string)
None
In [28]:
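# If a tag contains several strings, loop over them with .strings.
# repr() takes a single object and returns its printable representation (syntax: repr(obj)),
# so invisible newlines are shown as '\n'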
for string in soup.body.strings:
    print(repr(string))
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
In [29]:
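# .stripped_strings removes the extra whitespace:
# strings consisting entirely of whitespace are skipped, and leading and trailing whitespace is trimmed from the rest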
for string in soup.body.stripped_strings:
    print(repr(string))
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
In [30]:
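# Beautiful Soup defines many search methods; this section focuses on two: find() and find_all().
# find_all() returns a list of every match, while find() returns only the first match directly.
# When nothing matches, find_all() returns an empty list and find() returns None.
# find(name, attrs, recursive, string, **kwargs)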
soup.find(id="link3")
Out[30]:
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
In [31]:
# A notebook cell shows no output for a bare None, so use print() to display it
print(soup.find("nosuchtag"))
None
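Because find() returns None when nothing matches, chaining straight onto its result is a common source of AttributeError. A small sketch of the two no-match cases, using a made-up one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# No match: find_all() gives an empty list, which is safe to loop over.
tables = soup.find_all("table")
for table in tables:
    print(table)  # never runs

# No match: find() gives None, so check before touching the result.
table = soup.find("table")
caption = table.get_text() if table is not None else ""
print(len(tables), table, repr(caption))
```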
In [32]:
# soup.head.title is shorthand for looking tags up by name.
# The shorthand works by repeatedly calling find() on the current tag
soup.head.title
Out[32]:
<title>The Dormouse's story</title>
In [33]:
soup.find("head").find("title")
Out[33]:
<title>The Dormouse's story</title>
In [34]:
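# find_all(name, attrs, recursive, string, **kwargs)
# find_all() looks through all of the current tag's descendants and keeps those that match the filters.
# The name argument restricts the search to tags with that name; text strings are ignored. For example: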
soup.find_all("title")
Out[34]:
[<title>The Dormouse's story</title>]
In [35]:
# A BeautifulSoup object or a Tag object can be called like a function;
# calling it is the same as calling its find_all() method
soup("title")
Out[35]:
[<title>The Dormouse's story</title>]
In [36]:
soup.title.find_all(string=True)
Out[36]:
["The Dormouse's story"]
In [37]:
soup.title(string=True)
Out[37]:
["The Dormouse's story"]
In [38]:
soup.find_all("p", "title")
Out[38]:
[<p class="title"><b>The Dormouse's story</b></p>]
In [39]:
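# Keyword arguments
# Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.
# For example, passing id makes Beautiful Soup filter on each tag's "id" attribute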
soup.find_all(id="link2")
Out[39]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [40]:
# An attribute filter can be a string, a regular expression, a list, or True.
# The example below finds every tag that has an id attribute, whatever its value
soup.find_all(id=True)
Out[40]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [41]:
# If you pass href, Beautiful Soup filters on each tag's "href" attribute
soup.find_all(href=re.compile("elsie"))
Out[41]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [42]:
# Passing several keyword arguments filters on several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')
Out[42]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [43]:
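# Some attributes cannot be used as keyword arguments, such as the HTML5 data-* attributes:
# data_soup.find_all(data-foo="value") is a SyntaxError, because data-foo is not a valid Python identifier.
# Such attributes can still be searched by passing a dictionary as the attrs argument of find_all().
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')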

data_soup.find_all(attrs={"data-foo": "value"})
Out[43]:
[<div data-foo="value">foo!</div>]
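The attrs dictionary is also the way to search on an HTML attribute that is literally called name, since the name keyword of find_all() already means the tag name. A brief sketch; the input element and its attribute values are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical form field with a "name" attribute and a data-* attribute.
form = BeautifulSoup('<input name="email" data-role="contact"/>', "html.parser")

# name="email" is read as "tags called <email>", so nothing is found.
by_keyword = form.find_all(name="email")

# The attrs dictionary searches the HTML attributes instead.
by_attrs = form.find_all(attrs={"name": "email"})
by_data = form.find_all(attrs={"data-role": "contact"})
print(len(by_keyword), len(by_attrs), len(by_data))
```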
In [44]:
# The name of the CSS attribute, class, is a reserved word in Python,
# so search for tags with a given CSS class through the class_ argument
soup.find_all('a', class_="sister")
Out[44]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [45]:
# class_ accepts the same kinds of filter as other arguments: a string, a regular expression, a function, or True
soup.find_all(class_=re.compile("itl"))
Out[45]:
[<p class="title"><b>The Dormouse's story</b></p>]
In [46]:
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
Out[46]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [47]:
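# The class can also be passed through the attrs dictionary, which is equivalent to class_.
# Note: when matching the full value of a class attribute exactly, the class names
# must appear in the same order as in the document, otherwise nothing is found.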
soup.find_all("a", attrs={"class": "sister"})
Out[47]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [48]:
# A tag's class attribute is multi-valued. When searching by CSS class, each of the tag's class names can be matched individually
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
Out[48]:
[<p class="body strikeout"></p>]
In [49]:
css_soup.find_all("p", class_="body")
Out[49]:
[<p class="body strikeout"></p>]
In [50]:
# The class attribute can also be matched against its exact, full string value
css_soup.find_all("p", class_="body strikeout")
Out[50]:
[<p class="body strikeout"></p>]
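The exact-string match is sensitive to the order of the class names. When a tag must carry several classes in any order, a CSS selector is one alternative. A short sketch using the same one-tag document:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Same order as in the document: matches.
exact = css_soup.find_all("p", class_="body strikeout")

# Same classes, different order: no match.
reordered = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes but ignores their order.
selected = css_soup.select("p.strikeout.body")
print(len(exact), len(reordered), len(selected))
```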
In [51]:
# The string argument searches the text strings in the document instead of the tags.
# Like name, string accepts a string, a regular expression, a list, or True
soup.find_all(string="Elsie")
Out[51]:
['Elsie']
In [52]:
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
Out[52]:
['Elsie', 'Lacie', 'Tillie']
In [53]:
soup.find_all(string=re.compile("Dormouse"))
Out[53]:
["The Dormouse's story", "The Dormouse's story"]
In [54]:
soup.find_all("a", string="Elsie")
Out[54]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [55]:
# The limit argument caps the number of results returned
soup.find_all("a", limit=2)
Out[55]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [56]:
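# A Tag has many attributes and methods. Its HTML attributes can be read by treating the tag like a dictionary, or the whole dictionary can be accessed through .attrs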
for link in soup.find_all('a'):
    print(link, type(link))
    print(link.name, type(link.name))
    print(link.text, type(link.text))
    print(link.string,type(link.string),type(str(link.string)))
    print(link.attrs)
    print(link['class'], link.get('class'))
    print(link['href'], link.get('href'))
    print(link['id'], link.get('id'), link.get_attribute_list('id'))
    print('*'*100)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> <class 'bs4.element.Tag'>
a <class 'str'>
Elsie <class 'str'>
Elsie <class 'bs4.element.NavigableString'> <class 'str'>
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
['sister'] ['sister']
http://example.com/elsie http://example.com/elsie
link1 link1 ['link1']
****************************************************************************************************
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <class 'bs4.element.Tag'>
a <class 'str'>
Lacie <class 'str'>
Lacie <class 'bs4.element.NavigableString'> <class 'str'>
{'href': 'http://example.com/lacie', 'class': ['sister'], 'id': 'link2'}
['sister'] ['sister']
http://example.com/lacie http://example.com/lacie
link2 link2 ['link2']
****************************************************************************************************
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> <class 'bs4.element.Tag'>
a <class 'str'>
Tillie <class 'str'>
Tillie <class 'bs4.element.NavigableString'> <class 'str'>
{'href': 'http://example.com/tillie', 'class': ['sister'], 'id': 'link3'}
['sister'] ['sister']
http://example.com/tillie http://example.com/tillie
link3 link3 ['link3']
****************************************************************************************************
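The loop above uses both link['href'] and link.get('href'); the two differ only when the attribute is missing. A minimal sketch with a made-up tag that has no href:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister" id="link1">Elsie</a>', "html.parser")
link = soup.a

# Dictionary-style access raises KeyError when the attribute is absent.
try:
    link["href"]
    missing = False
except KeyError:
    missing = True

# .get() returns None instead, or a default of your choosing.
href = link.get("href")
href_or_default = link.get("href", "#")

# get_attribute_list() always returns a list, even for a single-valued attribute.
ids = link.get_attribute_list("id")
print(missing, href, href_or_default, ids)
```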
In [57]:
# If the filter is a regular expression, Beautiful Soup matches against it using its search() method.
# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
body
b
In [58]:
# Find all tags whose names contain "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
html
title
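Since the pattern is applied with search(), it matches anywhere in the tag name unless it is anchored. A small sketch contrasting an unanchored, a prefix, and a fully anchored pattern on a reduced, made-up document:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><b>x</b></body></html>",
    "html.parser",
)

# Unanchored: any tag name containing "t".
anywhere = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored at the start: names beginning with "b".
prefix = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Anchored at both ends: exactly "b".
exact = [tag.name for tag in soup.find_all(re.compile("^b$"))]
print(anywhere, prefix, exact)
```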
In [59]:
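# If the filter is a list, Beautiful Soup returns everything that matches any item in the list.
# The code below finds all the <a> tags and all the <b> tags in the document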
soup.find_all(["a", "b"])
Out[59]:
[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [60]:
# True matches anything it can: the code below finds every tag in the document, but none of the text strings
for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
In [61]:
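# If none of the other filters fit, define a function that takes a single element as its only argument.
# The function returns True if the element matches, and False otherwise.
# The function below returns True for a tag that has a class attribute but no id attribute.
# Note that the <a> tags in the output appear only inside the matched <p> tag; they have an id, so they are not matched themselves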
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
Out[61]:
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
In [62]:
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
Out[62]:
["The Dormouse's story",
 "The Dormouse's story",
 'Elsie',
 'Lacie',
 'Tillie',
 '...']
In [63]:
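# When a function filters on a specific attribute, its argument is the attribute's value, not the whole tag.
# The example below finds tags whose href attribute does not match the given regular expression
def not_lacie(href):
    return href and not re.compile("lacie").search(href)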
soup.find_all(href=not_lacie)
Out[63]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [64]:
soup.html.find_all("title")
Out[64]:
[<title>The Dormouse's story</title>]
In [65]:
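# The <title> tag is beneath the <html> tag, but it is not a direct child; the <head> tag is.
# When all descendants are searched, Beautiful Soup finds the <title> tag.
# With recursive=False only direct children are considered, so the <title> tag is not found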
soup.html.find_all("title", recursive=False)
Out[65]:
[]
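Combining recursive=False with the True filter is one way to list only a tag's direct child tags. A minimal sketch on a reduced, made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>x</p></body></html>",
    "html.parser",
)

# Direct child tags of <html> only.
direct = [tag.name for tag in soup.html.find_all(True, recursive=False)]

# Default behaviour: every descendant tag.
all_tags = [tag.name for tag in soup.html.find_all(True)]
print(direct, all_tags)
```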
In [66]:
soup.head
Out[66]:
<head><title>The Dormouse's story</title></head>
In [67]:
# A tag's .contents attribute returns its children as a list; strings have no .contents attribute
soup.head.contents
Out[67]:
[<title>The Dormouse's story</title>]
In [68]:
soup.head.contents[0]
Out[68]:
<title>The Dormouse's story</title>
In [69]:
soup.head.contents[0].contents
Out[69]:
["The Dormouse's story"]
In [70]:
# A tag's .children generator lets you loop over its direct children
for child in soup.head.children:
    print(child)
<title>The Dormouse's story</title>
In [71]:
len(list(soup.children))
Out[71]:
1
In [72]:
for child in soup.children:
    print(child)
    print('*'*30)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
******************************
In [73]:
# The .descendants attribute iterates recursively over all of a tag's descendants
for child in soup.head.descendants:
    print(child)
<title>The Dormouse's story</title>
The Dormouse's story
In [74]:
len(list(soup.descendants))
Out[74]:
26
In [75]:
for child in soup.descendants:
    print(child)
    print('~'*100)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<head><title>The Dormouse's story</title></head>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<title>The Dormouse's story</title>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Dormouse's story
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<p class="title"><b>The Dormouse's story</b></p>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<b>The Dormouse's story</b>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Dormouse's story
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once upon a time there were three little sisters; and their names were

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elsie
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lacie
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 and

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tillie
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;
and they lived at the bottom of a well.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<p class="story">...</p>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [76]:
# The BeautifulSoup object itself has children; here the <html> tag is the child of the BeautifulSoup object:
len(soup.contents)
Out[76]:
1
In [77]:
soup.contents[0]
Out[77]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
In [78]:
soup.contents[0].name
Out[78]:
'html'
In [79]:
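# If a <b> tag and a <c> tag are at the same level of the tree, they are called siblings.
# Use .next_sibling and .previous_sibling to move between siblings.
# In real documents, the .next_sibling or .previous_sibling of a tag is usually a string of text or whitespace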
soup.a.next_sibling
Out[79]:
',\n'
In [80]:
# .next_siblings and .previous_siblings iterate over the current node's siblings
for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
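Because the nearest sibling is often a string, find_next_sibling() can be used to jump directly to the next sibling tag. A short sketch on a reduced, made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>',
    "html.parser",
)
first = soup.a

# The immediate sibling is the text between the two tags.
raw = first.next_sibling

# find_next_sibling() skips the text and returns the next matching tag.
next_tag = first.find_next_sibling("a")
print(repr(raw), next_tag)
```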
In [81]:
soup.find('a', id="link3").next_sibling
Out[81]:
';\nand they lived at the bottom of a well.'
In [82]:
# .next_element points to whatever was parsed immediately afterwards (a string or a tag).
# It may be the same as .next_sibling, but usually it is different
soup.find('a', id="link3").next_element
Out[82]:
'Tillie'
In [83]:
# .previous_element points to whatever was parsed immediately before the current element
soup.find('a', id="link3").previous_element
Out[83]:
' and\n'
In [84]:
# The .next_elements and .previous_elements iterators move forward or backward
# through the document in the order it was parsed
for element in soup.find('a', id="link3").next_elements:
    print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
In [85]:
# Beautiful Soup supports most CSS selectors: http://www.w3.org/TR/CSS2/selector.html
# Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax
soup.select("title")
Out[85]:
[<title>The Dormouse's story</title>]
In [86]:
soup.select("p:nth-of-type(3)")
Out[86]:
[<p class="story">...</p>]
In [87]:
# Find tags nested beneath other tags, level by level
soup.select("body a")
Out[87]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [88]:
soup.select("html head title")
Out[88]:
[<title>The Dormouse's story</title>]
In [89]:
# Find tags that are direct children of another tag
soup.select("head > title")
Out[89]:
[<title>The Dormouse's story</title>]
In [90]:
soup.select("p > a")
Out[90]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [91]:
soup.select("p > a:nth-of-type(2)")
Out[91]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [92]:
soup.select("p > #link1")
Out[92]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [93]:
soup.select("body > a")
Out[93]:
[]
In [94]:
# In CSS, # selects by id and . selects by class; a comma combines several selectors
soup.select("#link1,.sister")
Out[94]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [95]:
# Find sibling tags
# All siblings after the tag with id link1 that have class sister
soup.select("#link1 ~ .sister")
Out[95]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [96]:
# Only the sibling tag immediately after the tag with id link1, provided it has class sister
soup.select("#link1 + .sister")
Out[96]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [97]:
# Find tags by CSS class
soup.select(".sister")
Out[97]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [98]:
soup.select("[class~=sister]")
Out[98]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [99]:
# Find tags by id
soup.select("#link1")
Out[99]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [100]:
soup.select("a#link2")
Out[100]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [101]:
# Find tags that match any selector in a comma-separated list
soup.select("#link1,#link2")
Out[101]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
In [102]:
# Test for the existence of an attribute
soup.select('a[href]')
Out[102]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [103]:
# Find tags by attribute value
soup.select('a[href="http://example.com/elsie"]')
Out[103]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [104]:
soup.select('a[href^="http://example.com/"]')
Out[104]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [105]:
soup.select('a[href$="tillie"]')
Out[105]:
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [106]:
soup.select('a[href*=".com/el"]')
Out[106]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [107]:
# select_one() returns only the first element that matches
soup.select_one(".sister")
Out[107]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
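As with find() and find_all(), the two selector methods differ when nothing matches: select() returns an empty list and select_one() returns None. A minimal sketch that also pulls attribute values out of the results; the fragment is made up, and the second link deliberately has no href:

```python
from bs4 import BeautifulSoup

html = (
    '<p><a class="sister" href="http://example.com/elsie">Elsie</a> '
    '<a class="sister">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Requiring [href] in the selector makes a["href"] safe for every result.
hrefs = [a["href"] for a in soup.select("a.sister[href]")]

# No match: select_one() gives None rather than raising.
first_missing = soup.select_one("#no-such-id")
print(hrefs, first_missing)
```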
