Python web scraping basics

I put together a summary of Python web-scraping basics. These notes are based on Morvan Zhou's (莫烦Python) tutorial material.

[TOC]

Web scraping

Understanding web page structure

Open a web page with Python:

from urllib.request import urlopen
# if the page contains Chinese, decode with utf-8
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

Then extract the paragraph with a regular expression:
import re
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL to match across multiple lines
print("\nPage paragraph is: ", res[0])

Parsing web pages with BeautifulSoup

BeautifulSoup basics

Choose the URL to scrape
Open that URL with Python (urlopen, etc.)
Read the page content (with read())
Feed the content into BeautifulSoup
Use BeautifulSoup to pick out tags and their content (instead of regular expressions)
Installation

# Python 3+
pip3 install beautifulsoup4

from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, decode with utf-8
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)

# several parsers are available (https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/); lxml is recommended
soup = BeautifulSoup(html, features='lxml')
print(soup.h1)          # access the <h1> tag directly
print('\n', soup.p)


all_href = soup.find_all('a')                 # find all <a> tags
all_href = [l['href'] for l in all_href]      # extract their href attributes
print('\n', all_href)

Parsing by CSS class with BeautifulSoup

# import the same two modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, decode with utf-8
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
print(html)

soup = BeautifulSoup(html, features='lxml')

# use the class attribute to narrow the search
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())    # use get_text() to get the text content

jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')   # use jan as the parent
for d in d_jan:
    print(d.get_text())
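
BeautifulSoup can also take CSS selectors directly through select(); a minimal sketch doing the same two queries on the soup object above:

month = soup.select('li.month')         # tag plus class, CSS-selector style
for m in month:
    print(m.get_text())

d_jan = soup.select('ul.jan > li')      # <li> children of <ul class="jan">
for d in d_jan:
    print(d.get_text())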

Using BeautifulSoup together with regular expressions

  1. Import the re module
  2. Compile the pattern with re.compile()
  3. Iterate over the output; find_all() returns a list
    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import re

    # if the page contains Chinese, decode with utf-8
    html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
    print(html)

    soup = BeautifulSoup(html, features='lxml')

    img_links = soup.find_all("img", {"src": re.compile('.*?\.jpg')})
    for link in img_links:
        print(link['src'])

    course_links = soup.find_all('a', {'href': re.compile('https://morvan.*')})
    for link in course_links:
        print(link['href'])

Scraping Baidu Baike (important)

https://morvanzhou.github.io/tutorials/data-manipulation/scraping/2-04-practice-baidu-baike/


#Practice: scrape Baidu Baike
#Here we build a scraper that crawls Baidu Baike starting from this page, keeping a history of the pages we have already visited in "his".

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random


base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

#Select the last sub url in "his", print the title and url.

url = base_url + his[-1]

html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(soup.find('h1').get_text(), ' url: ', his[-1])
#网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
#Find all sub_urls on this Baidu Baike item page, randomly select one and append it to "his". If no valid sub link is found, then pop the last url from "his".

# find valid urls
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)
#['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E4%B8%8B%E8%BD%BD%E8%80%85']
#Put everything together and run 20 random iterations. See where we end up.
#his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]

    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), ' url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()

More ways to request and download

Using Requests

Requests: get / post
Let's cover the two important ones, get and post; 95% of the time these are what you use to request a web page.
post:
logging into an account
submitting a search
uploading an image
uploading a file
sending data to the server, etc.
get:
opening a page normally
without sending data to the server

For simply viewing a page, get is enough; you are only sending a get request. With post, you send a personalized request to the server, for example passing your username and password so that it returns an HTML page containing your personal information.

In terms of active vs. passive: post means "send", which is more active, since you shape what the server returns; get means "fetch", which is passive, since you send no personalized information and the server will not return different HTML based on it.
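
A minimal sketch of the contrast (httpbin.org is just an assumed public echo service, not part of the original notes):

import requests

r = requests.get('http://httpbin.org/get', params={'wd': 'python'})    # data shows up in the url
print(r.url)        # http://httpbin.org/get?wd=python

r = requests.post('http://httpbin.org/post', data={'wd': 'python'})    # data goes in the request body
print(r.url)        # http://httpbin.org/post  (no parameters visible in the url)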

Installing Requests

# python 3+
pip3 install requests

requests.get() takes a params argument for url parameters; for sending data you use requests.post() with data.
Use a Session to carry cookies across requests, i.e. session.get() / session.post().
Note the post url: unlike get, the parameters do not show up in the url.


#requests: an alternative to urllib
#requests offers more functionality than urlopen. Use requests.get() in place of urlopen() and pass some parameters to the page; webbrowser then opens the resulting url so you can see the result.

import requests
import webbrowser  # built-in module, opens the browser
param = {"wd": "莫烦Python"}
r = requests.get('http://www.baidu.com/s', params=param)
print(r.url)
webbrowser.open(r.url)
# prints http://www.baidu.com/s?wd=%E8%8E%AB%E7%83%A6Python and returns True

##post
##We test the post function on this page: pass some data to the server, which analyses it and sends a response back accordingly.
data = {'firstname': '莫烦', 'lastname': '周'}
r = requests.post('http://pythonscraping.com/files/processing.php', data=data)
print(r.text)
#Hello there, 莫烦 周!

##upload image
##We still use the post function, this time to upload an image to this page.
file = {'uploadFile': open('./image.png', 'rb')}
r = requests.post('http://pythonscraping.com/files/processing2.php', files=file)
print(r.text)
#The file image.png has been uploaded.
#login

##Use the post method to log in to a website.
payload = {'username': 'Morvan', 'password': 'password'}
r = requests.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())
r = requests.get('http://pythonscraping.com/pages/cookies/profile.php', cookies=r.cookies)
print(r.text)
#{'username': 'Morvan', 'loggedin': '1'}
#Hey Morvan! Looks like you're still logged into the site!


##another, more general way to log in
##Use a Session instead of bare requests calls. It keeps you in a session and keeps track of the cookies.
session = requests.Session()
payload = {'username': 'Morvan', 'password': 'password'}
r = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=payload)
print(r.cookies.get_dict())
r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(r.text)
#{'username': 'Morvan', 'loggedin': '1'}
#Hey Morvan! Looks like you're still logged into the site!

Downloading files

Three ways to download

# to download into a specific folder, first create the folder and set the image url
import os
os.makedirs('./img/', exist_ok=True)
IMAGE_URL = "https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png"

# method 1: urlretrieve
from urllib.request import urlretrieve
urlretrieve(IMAGE_URL, './img/image1.png')

# method 2: requests
import requests
r = requests.get(IMAGE_URL)
with open('./img/image2.png', 'wb') as f:
    f.write(r.content)

# method 3: stream the download in chunks, otherwise the whole file is held in memory first
r = requests.get(IMAGE_URL, stream=True)  # stream loading
with open('./img/image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
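
Another streaming variant that is commonly used (an addition to these notes, not from the original) copies the raw response object straight to the file:

import shutil
r = requests.get(IMAGE_URL, stream=True)
with open('./img/image4.png', 'wb') as f:
    shutil.copyfileobj(r.raw, f)    # stream the raw response body straight into the file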

Downloading images in a loop

Combine the requests fetch/download functions with BeautifulSoup:
find the <img> tags under the "img_list" elements and take their src attributes.

  1. Find where the images sit on the page and analyse the structure
    from bs4 import BeautifulSoup
    import requests
    URL = "http://www.nationalgeographic.com.cn/animals/"

    # find the list of image holders
    html = requests.get(URL).text
    soup = BeautifulSoup(html, 'lxml')
    img_ul = soup.find_all('ul', {"class": "img_list"})

    # create a folder for these pictures
    import os
    os.makedirs('./img/', exist_ok=True)

    # download
    # find all picture urls and download them
    for ul in img_ul:
        imgs = ul.find_all('img')               # find_all() again, within each <ul>
        for img in imgs:
            url = img['src']
            r = requests.get(url, stream=True)  # stream the download
            image_name = url.split('/')[-1]
            with open('./img/%s' % image_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=128):
                    f.write(chunk)
            print('Saved %s' % image_name)

Speeding up the crawler

Distributed crawling with multiprocessing
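
Before the full crawler, a minimal sketch of the multiprocessing pattern it relies on: pool.apply_async() submits a job to a worker process and .get() collects the result.

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(4)                                              # 4 worker processes
    jobs = [pool.apply_async(square, args=(i,)) for i in range(10)]
    print([j.get() for j in jobs])                                 # [0, 1, 4, 9, ...]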

# import modules
import multiprocessing as mp
import time
from urllib.request import urlopen, urljoin
from bs4 import BeautifulSoup
import re
base_url = "http://127.0.0.1:4000/"
# base_url = 'https://morvanzhou.github.io/'
# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

#Create a crawl function to open a url in parallel.
def crawl(url):
    response = urlopen(url)    # fetch the page with urlopen
    time.sleep(0.1)            # slight delay for downloading
    return response.read().decode()

#Create a parse function to find all the results we need, in parallel.
def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url

#Normal way
#Without multiprocessing, to test the speed. First, track which urls we have already seen and which we haven't, using python sets.
unseen = set([base_url,])
seen = set()
count, t1 = 1, time.time()

while len(unseen) != 0:                      # still some urls to visit
    if restricted_crawl and len(seen) > 20:
        break

    print('\nDistributed Crawling...')
    htmls = [crawl(url) for url in unseen]

    print('\nDistributed Parsing...')
    results = [parse(html) for html in htmls]

    print('\nAnalysing...')
    seen.update(unseen)                      # mark the crawled urls as seen
    unseen.clear()                           # nothing left unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)      # collect new urls to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 53 s

#multiprocessing
#Create a process pool and scrape in parallel.
unseen = set([base_url,])
seen = set()

pool = mp.Pool(4)
count, t1 = 1, time.time()
while len(unseen) != 0:                      # still some urls to visit
    if restricted_crawl and len(seen) > 20:
        break
    print('\nDistributed Crawling...')
    crawl_jobs = [pool.apply_async(crawl, args=(url,)) for url in unseen]
    htmls = [j.get() for j in crawl_jobs]                    # request connection

    print('\nDistributed Parsing...')
    parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
    results = [j.get() for j in parse_jobs]                  # parse html

    print('\nAnalysing...')
    seen.update(unseen)                      # mark the crawled urls as seen
    unseen.clear()                           # nothing left unseen

    for title, page_urls, url in results:
        print(count, title, url)
        count += 1
        unseen.update(page_urls - seen)      # collect new urls to crawl
print('Total time: %.1f s' % (time.time()-t1, ))    # 16 s !!!

Asynchronous loading with asyncio

asyncio is essentially single-threaded: it speeds things up by switching between tasks while waiting on I/O (working around the GIL rather than fighting it), and it has been a native library since Python 3.5.
For fetching pages asynchronously, use aiohttp.
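
A warm-up sketch before the full crawler below: fetch a single page with aiohttp (assuming aiohttp is installed, pip3 install aiohttp).

import asyncio
import aiohttp

async def fetch(url):
    # one session per program is the usual pattern; a throwaway one is fine here
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

loop = asyncio.get_event_loop()
html = loop.run_until_complete(fetch("https://morvanzhou.github.io/"))
print(html[:100])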

import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.request import urljoin
import re
import multiprocessing as mp

# base_url = "https://morvanzhou.github.io/"
base_url = "http://127.0.0.1:4000/"

# DON'T OVER CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False


seen = set()
unseen = set([base_url])


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)            # slightly delay for downloading
    return html


async def main(loop):
    pool = mp.Pool(8)                   # slightly affected
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            print('\nAsync Crawling...')
            tasks = [loop.create_task(crawl(url, session)) for url in unseen]
            finished, unfinished = await asyncio.wait(tasks)
            htmls = [f.result() for f in finished]

            print('\nDistributed Parsing...')
            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]

            print('\nAnalysing...')
            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                # print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1

if __name__ == "__main__":
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    # loop.close()
    print("Async total time: ", time.time() - t1)

Advanced scraping

Controlling a browser with Selenium

Installation:
pip3 install selenium
You also need the driver that matches your browser (e.g. chromedriver for Chrome) on your PATH; Firefox has a plugin that can record your browser actions and generate the corresponding code.

import os
os.makedirs('./img/', exist_ok=True)

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()
html = driver.page_source # get html
driver.get_screenshot_as_file("./img/screenshot1.png")
driver.close()
print(html[:200])

# if you don't want the browser window to show, run headless
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless") # define headless

# add the option when creating driver
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

html = driver.page_source # get html
driver.get_screenshot_as_file("./img/screenshot2.png")
driver.close()
print(html[:200])
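
The page_source string can then be handed to BeautifulSoup just like the html strings earlier; a minimal sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('h1').get_text())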

The Scrapy framework (needs further study)

Scrapy is a large, full scraping framework; see:
https://www.jianshu.com/p/a8aad3bf4dc4
https://blog.csdn.net/u012150179/article/details/32343635
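
Scrapy is installed separately, with the usual pip command (same pattern as the other packages above):

# python 3+
pip3 install scrapy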

import scrapy


class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # we don't need these two as scrapy will deal with them automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')     # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # it will filter duplication automatically


# lastly, run this in terminal
# scrapy runspider 5-2-scrapy.py -o res.json