Python Web Crawler: A Simple Scrape of Qiushibaike

I have just started learning Python web crawling, so I wrote a simple Python program that scrapes Qiushibaike.

The steps are as follows. First, look at the Qiushibaike URL: http://www.qiushibaike.com/8hr/page/2/?s=4959489. The number after page is the page number.
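As a small illustration of that URL pattern, the sketch below builds the address for a given page. The helper name `build_page_url` is hypothetical, and carrying over the `s=4959489` query string from the example URL is an assumption; the site may not require it.

```python
# Template taken from the example URL; only the page number varies.
BASE = "http://www.qiushibaike.com/8hr/page/{page}/?s=4959489"

def build_page_url(page):
    """Return the listing URL for a page number (hypothetical helper)."""
    return BASE.format(page=page)

print(build_page_url(2))
```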

Next, build the request. Note that user_agent must be set:


user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request=urllib2.Request(url,headers=headers)
response=urllib2.urlopen(request)
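The snippet above uses Python 2's urllib2. For readers on Python 3, a rough equivalent can be sketched with `urllib.request`, which took over that role. The helper name `make_request` is hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request

def make_request(url):
    """Build a Request carrying the same User-Agent header (hypothetical helper)."""
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    return urllib.request.Request(url, headers=headers)

request = make_request("http://www.qiushibaike.com/8hr/page/2/?s=4959489")
# Request stores header names in capitalized form, hence 'User-agent' here.
print(request.get_header('User-agent'))
# To actually send it: response = urllib.request.urlopen(request)
```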

Then read the returned data:


content=response.read().decode('utf-8')

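Next comes the key step: using a regular expression to match all of the actual content. You can use the browser's Inspect feature to view the page structure and write a corresponding regular expression. For example, for the following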

…

the regular expression used for matching is:



pattern=re.compile('<div class="content">....<span>(.*?)</span>...</div>',re.S)

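(.*?) is a group: the part in parentheses is treated as a unit, and the string it matches is what gets returned. The ? makes the match non-greedy, so it stops at the first closing tag that follows. findall finds every match in the string and returns them as a list.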

The re.S flag makes the dot (.) match any character, including newlines.
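To make the effect of the group and of re.S concrete, here is a minimal sketch run against a made-up HTML fragment. The sample markup is invented for illustration, and the pattern uses `.*?` between the tags instead of the literal dots shown in the article's pattern; both are assumptions, not the real page structure.

```python
import re

# Invented sample; the real page markup may differ.
sample_html = '''<div class="content">
<span>first joke<br>second line</span>
</div>
<div class="content">
<span>second joke</span>
</div>'''

regex = r'<div class="content">.*?<span>(.*?)</span>.*?</div>'

# With re.S the dot also matches the newlines between the tags.
with_s = re.findall(re.compile(regex, re.S), sample_html)
# Without re.S the dot stops at newlines, so nothing matches this sample.
without_s = re.findall(re.compile(regex), sample_html)

print(with_s)
print(without_s)
```

Because the pattern has exactly one group, findall returns just the text captured by that group for each match rather than the whole matched string.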

The complete code is below:


import urllib
import urllib2
import re
import time

page=2
f=open("D:\\qiushi.txt","r+")
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
while page....<span>(.*?)</span>...',re.S)
        items=re.findall(pattern,content)
        f.write((url+"\n").encode('utf-8'))
        for item in items:
            print "------"
            item=item+"\n"
            print item
            f.write("------\n".encode('utf-8'))
            f.write(item.replace('<br>','\n').encode('utf-8'))
    except urllib2.URLError,e:
        if hasattr(e,"code"):
            print e.code
        if hasattr(e,"reason"):
            print e.reason
    finally:
        page+=1
        time.sleep(1)
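The error handling in the code checks for `code` and `reason` with hasattr because an HTTP error carries both, while a plain URL error carries only a reason. A minimal Python 3 sketch of that distinction follows, using `urllib.error` in place of Python 2's urllib2. The helper name `describe_error` and the example URL are hypothetical, and the exceptions are constructed by hand so that no network access is needed.

```python
import urllib.error

def describe_error(e):
    """Collect whichever of code/reason the exception provides (hypothetical helper)."""
    parts = []
    if hasattr(e, "code"):
        parts.append(e.code)
    if hasattr(e, "reason"):
        parts.append(e.reason)
    return parts

# Hand-built exceptions standing in for real failures.
http_err = urllib.error.HTTPError("http://example.invalid/", 404, "Not Found", None, None)
url_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(url_err))
```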

Here the matched content is written to qiushi.txt on the D: drive.
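The writing step can be sketched on its own in Python 3 as follows. The helper name `save_items` is hypothetical, and the sketch writes to a temporary directory instead of the D: drive so it runs on any system. It also opens the file in append mode with an explicit encoding, which is a substitution for the article's "r+" mode, since "r+" requires the file to already exist.

```python
import os
import tempfile

def save_items(path, url, items):
    """Append the page URL and each item to a UTF-8 text file (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")
        for item in items:
            f.write("------\n")
            # Turn HTML line breaks into real newlines, as the article's code does.
            f.write(item.replace("<br>", "\n") + "\n")

out_path = os.path.join(tempfile.mkdtemp(), "qiushi.txt")
save_items(out_path, "http://www.qiushibaike.com/8hr/page/2/", ["line one<br>line two"])

with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```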
