Python Web Scraping Basics
A write-up of Python web scraping basics, based on 周莫凡 (Morvan Zhou)'s Python tutorial materials.
[TOC]
Web scraping
Understanding web page structure
Open a web page with Python:

```python
from urllib.request import urlopen

# if the page contains Chinese, apply decode()
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
```
Then extract the content with a regular expression:

```python
import re

res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)  # re.DOTALL also matches across newlines
print("\nPage paragraph is: ", res[0])
```
Parsing web pages with BeautifulSoup
BeautifulSoup basics
- choose the url to scrape
- open that url with Python (urlopen, etc.)
- read the page content (read())
- feed the content into BeautifulSoup
- use BeautifulSoup to pick out tags, etc. (instead of regular expressions)
A minimal end-to-end sketch follows the install commands below.
Install:

```bash
# Python 3+
pip3 install beautifulsoup4
```

```python
from bs4 import BeautifulSoup
```
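Putting the steps above together, a minimal sketch (my own, reusing the basic-structure.html demo page from the first example):

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

# steps 1-3: choose the url, open it, read the html
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')

# step 4: hand the html to BeautifulSoup
soup = BeautifulSoup(html, features='lxml')

# step 5: select tags instead of writing regular expressions
print(soup.h1)                      # first <h1> tag
print(soup.p)                       # first <p> tag
for link in soup.find_all('a'):     # every <a> tag on the page
    print(link['href'])
```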
Parsing CSS with BeautifulSoup
Import the same two modules as before (urlopen and BeautifulSoup), then select elements by their CSS class.
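A minimal sketch of class-based selection; the list.html URL and the "month" class are assumptions based on the same tutorial site's demo pages, so adjust them to your target page:

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen

# assumed demo page containing <li class="month"> items
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

# select all <li> tags whose class is "month"
months = soup.find_all('li', {"class": "month"})
for m in months:
    print(m.get_text())
```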
Using BeautifulSoup with regular expressions
- import the re module
- compile the pattern with re.compile()
- iterate over the output; the result is a list
```python
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# if the page contains Chinese, apply decode()
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
print(html)

soup = BeautifulSoup(html, features='lxml')

# all <img> tags whose src ends with .jpg
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

# all <a> tags whose href starts with https://morvan
course_links = soup.find_all('a', {'href': re.compile(r'https://morvan.*')})
for link in course_links:
    print(link['href'])
```
Scraping Baidu Baike (important)
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/2-04-practice-baidu-baike/
```python
# Practice: scrape Baidu Baike.
# Build a scraper that crawls Baidu Baike starting from this page, keeping
# a history ("his") of the pages we have already visited.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

# Take the last sub-url in "his", print the page title and url.
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
print(soup.find('h1').get_text(), '    url: ', his[-1])
# 网络爬虫    url:  /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711

# Find all sub-urls on the item page, randomly pick one and append it to "his".
# If no valid sub link is found, pop the last url from "his".
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # no valid sub link found
    his.pop()
print(his)
# ['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E4%B8%8B%E8%BD%BD%E8%80%85']

# Put everything together, run 20 random iterations and see where we end up.
# his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), '    url: ', his[-1])

    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        # no valid sub link found
        his.pop()
```
More ways to request and download
Using Requests
Requests: get / post
Let's cover the two important methods, get and post. 95% of the time, these two are all you need to request a web page.
post:
- logging in to an account
- submitting a search
- uploading images
- uploading files
- sending data to the server, etc.

get:
- opening a web page normally
- nothing is sent to the server

Simply viewing web pages only needs get requests. With post we send a personalized request to the server, for example passing your username and password so that it returns an HTML page containing your personal information.
Put another way, post ("send") is the active side: you influence what the server returns. get ("fetch") is passive: you send no personalized information, so the server does not return different HTML based on it. A small sketch follows the install command below.
Install Requests:

```bash
# Python 3+
pip3 install requests
```
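A small sketch of get vs. post, using httpbin.org as a stand-in test server (an assumption; the original tutorial may use a different site):

```python
import requests

# GET: the query parameters are appended to the URL via `params`
r = requests.get('http://httpbin.org/get',
                 params={'firstname': 'Morvan', 'lastname': 'Zhou'})
print(r.url)             # .../get?firstname=Morvan&lastname=Zhou

# POST: the data travels in the request body via `data`, not in the URL
r = requests.post('http://httpbin.org/post',
                  data={'firstname': 'Morvan', 'lastname': 'Zhou'})
print(r.url)             # .../post  (no parameters visible)
print(r.json()['form'])  # the server still received the form data
```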
- requests.get() takes a params argument for the query parameters; requests.post() sends its payload through data
- use a Session (session.get() / session.post()) to carry cookies from one request to the next
- note the resulting URL: after a post the parameters are not shown in the URL, whereas a get URL does contain them
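And a sketch of cookie handling with a Session, again using httpbin.org as an assumed test endpoint:

```python
import requests

session = requests.Session()

# the server sets a cookie on this response...
session.get('http://httpbin.org/cookies/set/token/12345')

# ...and the Session sends it back automatically on the next request
r = session.get('http://httpbin.org/cookies')
print(r.text)   # {"cookies": {"token": "12345"}}

# a plain requests.get() has no memory of that cookie
r = requests.get('http://httpbin.org/cookies')
print(r.text)   # {"cookies": {}}
```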
Downloading files
Three ways:
```python
# To download into a specific folder, create the folder first,
# then define the image download URL.
import os
os.makedirs('./img/', exist_ok=True)

IMAGE_URL = "https://morvanzhou.github.io/static/img/description/learning_step_flowchart.png"

# Method 1: urlretrieve
from urllib.request import urlretrieve
urlretrieve(IMAGE_URL, './img/image1.png')

# Method 2: requests
import requests
r = requests.get(IMAGE_URL)
with open('./img/image2.png', 'wb') as f:
    f.write(r.content)

# Method 3: stream the download in chunks, so the whole file
# does not have to be held in memory first
r = requests.get(IMAGE_URL, stream=True)  # stream loading
with open('./img/image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)
```
Downloading images in a loop
This combines requests (for fetching pages and downloading files) with BeautifulSoup (for parsing):
- inspect the page to find where the images live
- locate the img tags inside the ul blocks with class img_list and take their src attributes
```python
from bs4 import BeautifulSoup
import requests

URL = "http://www.nationalgeographic.com.cn/animals/"

# find the list of image holders
html = requests.get(URL).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})

# create a folder for the pictures
import os
os.makedirs('./img/', exist_ok=True)

# find all picture urls and download them
for ul in img_ul:
    imgs = ul.find_all('img')
    for img in imgs:
        url = img['src']
        r = requests.get(url, stream=True)          # stream the download
        image_name = url.split('/')[-1]
        with open('./img/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)
```
Speeding up the crawler
Multiprocessing / distributed crawling
Import the needed modules and split the crawling and parsing work across a multiprocessing pool, as sketched below.
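A minimal sketch, written to mirror the crawl/parse split of the asyncio example below; pages are downloaded and parsed in parallel with a multiprocessing.Pool (the URL and pool size here are just illustrative choices):

```python
import multiprocessing as mp
import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

base_url = "https://morvanzhou.github.io/"


def crawl(url):
    # download one page
    return urlopen(url).read().decode()


def parse(html):
    # extract the title, the canonical url and all internal links
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, u['href']) for u in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


if __name__ == '__main__':
    unseen = set([base_url])
    seen = set()
    pool = mp.Pool(4)
    while len(unseen) != 0:
        htmls = pool.map(crawl, list(unseen))       # download in parallel
        results = pool.map(parse, htmls)            # parse in parallel
        seen.update(unseen)
        unseen.clear()
        for title, page_urls, url in results:
            print(title, ' url:', url)
            unseen.update(page_urls - seen)
```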
Asynchronous loading with asyncio
asyncio is essentially single-threaded: instead of spawning threads (which contend for the GIL), it switches between tasks while they wait on I/O. The async/await syntax is built into Python 3.5+.
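A quick illustration of the async/await idea (my own toy example, not from the original notes): two jobs that each "wait" one second run concurrently, so the total time is about one second instead of two:

```python
import asyncio
import time


async def job(t):
    await asyncio.sleep(t)              # yield control to the event loop while waiting
    return 'done after {}s'.format(t)


async def main():
    # schedule both jobs at once and wait for both results
    results = await asyncio.gather(job(1), job(1))
    print(results)


t0 = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
print('total time:', time.time() - t0)
```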
Crawl the pages with aiohttp:
```python
import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import multiprocessing as mp

# base_url = "https://morvanzhou.github.io/"
base_url = "http://127.0.0.1:4000/"

# DON'T OVER-CRAWL THE WEBSITE OR YOU MAY NEVER VISIT AGAIN
if base_url != "http://127.0.0.1:4000/":
    restricted_crawl = True
else:
    restricted_crawl = False

seen = set()
unseen = set([base_url])


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    urls = soup.find_all('a', {"href": re.compile('^/.+?/$')})
    title = soup.find('h1').get_text().strip()
    page_urls = set([urljoin(base_url, url['href']) for url in urls])
    url = soup.find('meta', {'property': "og:url"})['content']
    return title, page_urls, url


async def crawl(url, session):
    r = await session.get(url)
    html = await r.text()
    await asyncio.sleep(0.1)            # slight delay between downloads
    return html


async def main(loop):
    pool = mp.Pool(8)                   # process pool for parsing
    async with aiohttp.ClientSession() as session:
        count = 1
        while len(unseen) != 0:
            print('\nAsync Crawling...')
            tasks = [loop.create_task(crawl(url, session)) for url in unseen]
            finished, unfinished = await asyncio.wait(tasks)
            htmls = [f.result() for f in finished]

            print('\nDistributed Parsing...')
            parse_jobs = [pool.apply_async(parse, args=(html,)) for html in htmls]
            results = [j.get() for j in parse_jobs]

            print('\nAnalysing...')
            seen.update(unseen)
            unseen.clear()
            for title, page_urls, url in results:
                # print(count, title, url)
                unseen.update(page_urls - seen)
                count += 1


if __name__ == "__main__":
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
    # loop.close()
    print("Async total time: ", time.time() - t1)
```
Advanced scraping
Controlling the browser with Selenium
Install:

```bash
pip3 install selenium
```
Firefox has an extension for recording browser actions; the script below drives the browser with Selenium:
```python
import os
os.makedirs('./img/', exist_ok=True)

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

html = driver.page_source                              # get html
driver.get_screenshot_as_file("./img/screenshot1.png")
driver.close()
print(html[:200])

# if you don't want the browser window to show, run headless
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")              # define headless

# add the option when creating the driver
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://morvanzhou.github.io/")
driver.find_element_by_xpath(u"//img[@alt='强化学习 (Reinforcement Learning)']").click()
driver.find_element_by_link_text("About").click()
driver.find_element_by_link_text(u"赞助").click()
driver.find_element_by_link_text(u"教程 ▾").click()
driver.find_element_by_link_text(u"数据处理 ▾").click()
driver.find_element_by_link_text(u"网页爬虫").click()

html = driver.page_source                              # get html
driver.get_screenshot_as_file("./img/screenshot2.png")
driver.close()
print(html[:200])
```
The Scrapy library (worth exploring further)
Scrapy is a full-fledged crawling framework; see:
https://www.jianshu.com/p/a8aad3bf4dc4
https://blog.csdn.net/u012150179/article/details/32343635
```python
import scrapy


class MofanSpider(scrapy.Spider):
    name = "mofan"
    start_urls = [
        'https://morvanzhou.github.io/',
    ]
    # unseen = set()
    # seen = set()      # we don't need these two; scrapy deduplicates automatically

    def parse(self, response):
        yield {     # return some results
            'title': response.css('h1::text').extract_first(default='Missing').strip().replace('"', ""),
            'url': response.url,
        }

        urls = response.css('a::attr(href)').re(r'^/.+?/$')    # find all sub urls
        for url in urls:
            yield response.follow(url, callback=self.parse)     # duplicates are filtered automatically

# lastly, run this in a terminal:
# scrapy runspider 5-2-scrapy.py -o res.json
```