Python Crawler: Scraping Web Page Data

Alright... time to start looking at crawlers for real. I wrote a small page-scraping script this evening, and I'll keep this post updated as I go (this is just a test for now).

Web Page Crawling

Crawling by links

Starting from the entry page, the crawler extracts all the links; the download function supports proxies, a configurable crawl depth, link de-duplication, and so on.
The code is as follows:

import urlparse
import urllib2
import re
import Queue

# download a single page, with optional retries and proxy support
def page_download(url, num_retry=2, user_agent='zhxfei', proxy=None):
    # print 'downloading ', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()      # try to download the page
    except urllib2.URLError as e:               # catch download errors
        print 'Download error!', e.reason
        html = None
        if num_retry > 0:                       # retry while retries remain
            if hasattr(e, 'code') and 500 <= e.code <= 600:
                return page_download(url, num_retry - 1)
    if html is None:
        print '%s Download failed' % url
    else:
        print '%s has Download' % url
    return html

# use a regular expression to extract the links in a page
def get_links_by_html(html):
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# check whether a crawled link belongs to the same site as the seed page
def same_site(url1, url2):
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc

def link_crawler(seed_url, link_regex, max_depth=-1):
    crawl_link_queue = Queue.deque([seed_url])
    seen = {seed_url: 0}    # seen maps downloaded pages to their crawl depth
    depth = 0
    while crawl_link_queue:
        url = crawl_link_queue.pop()
        depth = seen.get(url)
        if seen.get(url) > max_depth:
            continue
        links = []
        html = page_download(url)
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                if same_site(link, seed_url):
                    crawl_link_queue.append(link)
    # print seen.values()
    print '----All Done----', len(seen)
    return seen

if __name__ == '__main__':
    all_links = link_crawler('http://www.zhxfei.com', r'/.*', max_depth=1)

Output:

http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/08/04/lvs/ has Download
...
...
http://www.zhxfei.com/2016/07/22/app-store-审核-IPv6-Olny/#more has Download
http://www.zhxfei.com/archives has Download
http://www.zhxfei.com/2016/07/22/HDFS/#comments has Download
----All Done----

Crawling by sitemap

A sitemap is essentially a map of the site; related to it is robots.txt. Both normally sit in the site's root directory and are provided specifically for spiders, so that the site gets indexed by search engines in a friendlier way, and they define crawling rules for well-behaved crawlers.

So we can also play it this way: pull the URLs out of the XML file and crawl the site directly from those URLs. This is the most convenient approach (even though the site owner may not necessarily want us to do it).

#!/usr/bin/env python
# _*_encoding:utf-8 _*_
# description: this module crawls a site based on its SITEMAP
import re
from download import page_download

def load_crawler(url):
    # download the sitemap
    sitemap = page_download(url)
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:
        page_download(link)
        if link == links[-1]:
            print 'All links downloaded'
    # print links

load_crawler('http://example.webscraping.com/sitemap.xml')

Summary

Alright, the crawler can now fetch web pages, but it doesn't actually do anything with them yet; it just downloads the HTML. So we still need to process the data, that is, extract the information we want from the pages.

Data Extraction

Extracting with lxml

There are three common ways to extract information from a web page:

  • Regular expressions with the re module. This is the fastest option, and by default re caches compiled patterns (the cache can be cleared with re.purge()); it is also the most complex one (unless you are already an old hand at regexes).
  • BeautifulSoup. This is the most approachable choice because it is very easy to work with, but it is slow on large amounts of data, so it is generally not recommended when scraping many pages.
  • lxml. This is the middle ground: still easy to use, but much faster. We will use it here to process the pages we fetch; a comparison of the three approaches is sketched right after this list.
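
To make these trade-offs concrete, here is a minimal sketch that extracts the same 'area' field with all three approaches. It assumes the layout of the example page used later in this post (a td of class w2p_fw inside tr#places_area__row), that the attribute order in the markup matches the regex, and that bs4, lxml and cssselect are installed; page_download is the function defined above, saved as download.py:

import re
import lxml.html
from bs4 import BeautifulSoup
from download import page_download

html = page_download('http://example.webscraping.com/places/view/United-Kingdom-239')

# 1. regular expression: fastest, but tightly coupled to the exact markup
match = re.search(r'<tr id="places_area__row">.*?<td class="w2p_fw">(.*?)</td>', html, re.S)
if match:
    print match.group(1)

# 2. BeautifulSoup: the friendliest API, but the slowest on large volumes
soup = BeautifulSoup(html, 'html.parser')
tr = soup.find('tr', attrs={'id': 'places_area__row'})
print tr.find('td', attrs={'class': 'w2p_fw'}).text

# 3. lxml with a CSS selector: a good balance of speed and simplicity
tree = lxml.html.fromstring(html)
print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()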

lxml can be used in two ways: XPath and cssselect, both of which are fairly simple to use. XPath, much like BeautifulSoup's find and find_all, matches a pattern and describes the position of DOM nodes and data with a chained, path-like structure, while cssselect matches with jQuery-style CSS selectors, which is friendlier to anyone with a front-end background.
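
For reference, the two styles can address the same node. A small sketch against the same example page, where the XPath expression is only a rough equivalent of the CSS selector and assumes the cell's class attribute is exactly w2p_fw:

import lxml.html
from download import page_download

html = page_download('http://example.webscraping.com/places/view/United-Kingdom-239')
tree = lxml.html.fromstring(html)

# CSS selector: the td.w2p_fw cell inside the tr with id places_area__row
print tree.cssselect('tr#places_area__row > td.w2p_fw')[0].text_content()
# roughly equivalent XPath expression for the same cell
print tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0].text_content()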

Let's try a demo first. The page we are about to scrape is http://example.webscraping.com/places/view/United-Kingdom-239

The page contains a <table>, and all the information we want lives in that table inside the body. You can use the browser's developer tools to inspect the elements, or Firebug (a Firefox extension), to examine the DOM structure.

import lxml.html
import cssselect    # lxml's cssselect() requires the cssselect package to be installed
from download import page_download

example_url = 'http://example.webscraping.com/places/view/United-Kingdom-239'

def demo():
    html = page_download(example_url, num_retry=2)
    result = lxml.html.fromstring(html)
    print type(result)
    td = result.cssselect('tr#places_area__row > td.w2p_fw')
    print type(td)
    print len(td)
    css_element = td[0]
    print type(css_element)
    print css_element.text_content()

if __name__ == '__main__':
    demo()

Output:

http://example.webscraping.com/places/view/United-Kingdom-239 has Download
<class 'lxml.html.HtmlElement'>
<type 'list'>
1
<class 'lxml.html.HtmlElement'>
244,820 square kilometres

As you can see, the cssselect query returned a list of length 1 (its length obviously depends on the selector pattern I defined). Each item in the list is an HtmlElement, which has a text_content() method that returns the content of that node, and with that we have the data we wanted.
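
For example, a looser selector returns more elements. A small sketch, assuming (as on the demo page) that every value cell in the table carries the w2p_fw class:

import lxml.html
from download import page_download

html = page_download('http://example.webscraping.com/places/view/United-Kingdom-239')
tree = lxml.html.fromstring(html)

# every value cell in the table, not just the area row
for td in tree.cssselect('table td.w2p_fw'):
    print td.text_content()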

Callback Handling

Next we can add a callback to the crawler above, so that a small operation is performed every time a page is downloaded.
Clearly link_crawler is the function to modify: we pass a reference to the callback through its parameters, so that different pages can be handled by different callbacks, for example:

def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=None):
    ...
        html = page_download(url)      # same as before
        if scrape_callback:
            scrape_callback(url, html)
        links.extend(urlparse.urljoin(seed_url, x) for x in get_links_by_html(html) if re.match(link_regex, x))    # same as before
    ...

Now let's write the callback. Since Python's object-oriented features are quite powerful, we'll implement it as a callback class. Because we need to call an instance of that class, we override its __call__ method so that, whenever the instance is invoked, the extracted data is saved in CSV format (a format WPS can open as a spreadsheet). Of course you could also write it to a database; more on that later.

import csv
import re
import lxml.html

class ScrapeCallback():
    def __init__(self):
        self.writer = csv.writer(open('contries.csv', 'w+'))
        self.rows_name = ('area', 'population', 'iso', 'country', 'capital', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')
        self.writer.writerow(self.rows_name)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            rows = []
            for row in self.rows_name:
                rows.append(tree.cssselect('#places_{}__row > td.w2p_fw'.format(row))[0].text_content())
            self.writer.writerow(rows)

The callback class has three notable members:

  • self.rows_name holds the names of the fields we want to scrape
  • self.writer acts like a file handle for the CSV output
  • self.writer.writerow() is the writer method that writes a row of data into the CSV table

Good, with that our data is persisted to disk.

Then change the definition of link_crawler to: def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=ScrapeCallback()):
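
Note that a default argument like scrape_callback=ScrapeCallback() is evaluated only once, when the function is defined, so the CSV file is created as soon as the module is loaded. An alternative is to pass the callback at the call site instead; a minimal sketch, where the seed URL and link_regex are assumptions matching the example site's /index and /view paths:

if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', r'/(index|view)',
                 max_depth=2, scrape_callback=ScrapeCallback())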

Run it and look at the result:

zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ python crawler.py
http://example.webscraping.com has Download
http://example.webscraping.com/index/1 has Download # /index does not match the /view check in __call__, so no data is extracted for it
http://example.webscraping.com/index/2 has Download
http://example.webscraping.com/index/0 has Download
http://example.webscraping.com/view/Barbados-20 has Download
http://example.webscraping.com/view/Bangladesh-19 has Download
http://example.webscraping.com/view/Bahrain-18 has Download
...
...
http://example.webscraping.com/view/Albania-3 has Download
http://example.webscraping.com/view/Aland-Islands-2 has Download
http://example.webscraping.com/view/Afghanistan-1 has Download
----All Done---- 35
zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ ls
contries.csv crawler.py

Open the csv and you can see that all the data has been saved.
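
A quick way to check the result from Python as well, assuming contries.csv sits in the current working directory:

import csv

with open('contries.csv') as fp:
    for record in csv.reader(fp):
        print record[:4]    # print the first few fields of each row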

Adding a Cache

So far the logic simply downloads the page HTML. Although URLs are de-duplicated within a single run, re-running the script has no way of knowing which pages were already downloaded in previous runs. If we don't want to download them again, we need to store the pages, mainly either by serializing them to disk or by caching them in a database. Before each download we check whether the page has been cached; a page that is already cached does not need to be downloaded again.

To add caching we rework page_download into a new Download class. When an instance of this class is called, it checks the cache and decides whether a download is needed; in other words, the Download class wraps our caching and downloading logic. As follows:

import urlparse
import urllib2
import user_agent    # the user_agent package provides generate_user_agent()

class Download(object):
    '''
    by default the user-agent is chosen at random
    '''
    def __init__(self, num_retry=2, user_agent=user_agent.generate_user_agent(), proxy=None, cached=None):
        self.num_retry = num_retry
        self.user_agent = user_agent
        self.proxy = proxy
        self.cached = cached

    def __call__(self, url):
        result = None
        if self.cached:
            try:
                result = self.cached[url]
            except KeyError:
                print 'the url has not been cached and will be cached now'
                result = self.page_download(url)
                self.cached[url] = result
        else:
            result = self.page_download(url)
        return result

    def page_download(self, url, num_retry=None):
        # print 'downloading ', url
        if num_retry is None:
            num_retry = self.num_retry
        headers = {'User-agent': self.user_agent}
        request = urllib2.Request(url, headers=headers)
        opener = urllib2.build_opener()
        if self.proxy:
            proxy_params = {urlparse.urlparse(url).scheme: self.proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
        try:
            html = opener.open(request).read()      # try to download the page
        except urllib2.URLError as e:               # catch download errors
            print 'Download error!', e.reason
            html = None
            if num_retry > 0:                       # retry while retries remain
                if hasattr(e, 'code') and 500 <= e.code <= 600:
                    return self.page_download(url, num_retry - 1)
        if html is None:
            print '%s has Download error' % url
        else:
            print '%s has Download' % url
        return html

We also need to change a few lines in link_crawler:

def link_crawler(seed_url, link_regex, max_depth=-1, scrape_callback=ScrapeCallback()):
    crawl_link_queue = Queue.deque([seed_url])
    ...
        links = []                                                    # same as before up to here
        do = Download(num_retry=3, cached=disk_cached.DiskCache())    # these two lines are the key change
        html = do(url)
        if scrape_callback:                                           # the rest stays the same
            scrape_callback(url, html)
    ...

This completes our caching framework; what remains is to add a cache class. Since Download.__call__ uses self.cached[url] to look up the cached result for a URL, the cache class has to implement the __getitem__ and __setitem__ methods.
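
Any object with that dict-like interface works. As a sanity check, here is a minimal in-memory cache sketch (the class name MemoryCache is just for illustration; it keeps pages only for the lifetime of the process, unlike the disk cache below):

class MemoryCache(object):
    '''the simplest possible cache: a plain dict held in memory'''
    def __init__(self):
        self._data = {}

    def __getitem__(self, url):
        # Download.__call__ relies on KeyError for cache misses, which dict raises for us
        return self._data[url]

    def __setitem__(self, url, result):
        self._data[url] = result

# usage: do = Download(num_retry=3, cached=MemoryCache())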

Here is the disk cache, which uses pickle for serialization.
Disk Storage

import os
import re
import urlparse
import pickle

class DiskCache(object):
    def __init__(self, cache_dir='/home/zhxfei/crawler_cache/test'):
        self.cache_dir = cache_dir

    def url_to_path(self, url):
        # map a url to a file path under cache_dir
        url_obj = urlparse.urlsplit(url)
        path = url_obj.path
        if not path:
            path = '/index.html'    # site root
        if path.endswith('/'):
            path += 'index.html'
        filename = url_obj.netloc + path + url_obj.query
        filename = re.sub(r'[^/0-9a-zA-Z\-.,;_]', '_', filename)
        return os.path.join(self.cache_dir, filename)

    def __getitem__(self, url):
        filename = self.url_to_path(url)
        if os.path.exists(filename):
            with open(filename, 'rb') as fp:
                return pickle.load(fp)
        else:
            raise KeyError('%s is not in cache' % url)

    def __setitem__(self, url, result):
        filename = self.url_to_path(url)
        folder = os.path.dirname(filename)
        if not os.path.exists(folder):
            os.makedirs(folder)     # create nested directories as needed
        with open(filename, 'wb') as fp:
            fp.write(pickle.dumps(result))

Run the download again afterwards and you can see the cache at work:

zhxfei@zhxfei-HP-ENVY-15-Notebook-PC:~/桌面/py_tran$ time python crawler.py
----All Done---- 35
real 0m0.149s
user 0m0.132s
sys 0m0.016s
