Python 3 Web Scraping: Two Methods (requests/urllib and BeautifulSoup) for Downloading PDFs from a Website

Date: 2020-07-26


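1. Task overview

The task is to download the PDF papers from the latest (2018) proceedings of IJCAI (the International Joint Conference on Artificial Intelligence).

The code uses regular expressions to extract information from the HTML, so the matching rules are summarized briefly below.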

2. Regular expression rules

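\w matches a letter, digit, or underscore.

\W matches any character that is not a letter, digit, or underscore.

\s matches any whitespace character; for ASCII text this is equivalent to [ \t\n\r\f\v].

\S matches any non-whitespace character.

\d matches any digit; for ASCII text this is equivalent to [0-9].

\D matches any non-digit character.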

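\A matches at the start of the string.

\Z matches at the end of the string. In Python's re module it matches only at the very end; in Perl-style flavors it also matches just before a trailing newline.

\z matches only at the very end of the string in Perl-style flavors; in Python, \Z already has this meaning.

\G matches at the position where the previous match ended (Perl-style flavors; Python's re module does not support it).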

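\n matches a newline.

\t matches a tab.

^ matches at the start of the string.

$ matches at the end of the string, or just before a trailing newline.

. matches any character except a newline; when the re.DOTALL flag is set, it matches any character including a newline.

[...] matches any one character from the set: [amk] matches 'a', 'm', or 'k'.

[^...] matches any one character not in the set: [^abc] matches any character other than a, b, or c.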

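* matches zero or more repetitions of the preceding expression (greedy).

+ matches one or more repetitions of the preceding expression (greedy).

? matches zero or one occurrence of the preceding expression (greedy). Placed after another quantifier, as in *?, +?, or {n,m}?, it makes that quantifier non-greedy.

{n} matches exactly n repetitions of the preceding expression.

{n,m} matches from n to m repetitions of the preceding expression (greedy). Write it without a space after the comma; Python treats {n, m} with a space as literal text.

a|b matches a or b.

(...) matches the expression inside the parentheses and captures it as a group.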
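A few of these rules are easy to misread, so here is a small sketch that exercises them with Python's re module. The sample strings are made up for illustration and are not taken from the IJCAI site.

```python
import re

text = "paper 0001.pdf\npaper 0042.pdf"

# \d{4} matches exactly four digits; the parentheses capture the file name.
names = re.findall(r"(\d{4}\.pdf)", text)

# '.' stops at a newline by default, but crosses it when re.DOTALL is set.
default_dot = re.search(r"0001.*0042", text)
dotall_dot = re.search(r"0001.*0042", text, re.DOTALL)

# '+' is greedy; appending '?' makes it match as little as possible.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.+</b>", html)
lazy = re.findall(r"<b>.+?</b>", html)

# '$' also matches just before a trailing newline; \Z in Python does not.
dollar = bool(re.search(r"pdf$", "a.pdf\n"))
end_z = bool(re.search(r"pdf\Z", "a.pdf\n"))

print(names)
print(default_dot is None, dotall_dot is not None)
print(greedy, lazy)
print(dollar, end_z)
```

Raw strings (the r"..." prefix) keep backslashes such as \d from being interpreted by Python before the regex engine sees them.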

3. Implementation

The first method, which uses urllib and a regular expression, is implemented as follows:

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 7 12:32:25 2018

@author: Lenovo
"""
import urllib.request
import re
import os

# Note: this script targets the 2017 proceedings.
url = 'http://www.ijcai.org/proceedings/2017/'

def getHtml(url):
    # Send a browser-like User-Agent and return the raw response bytes.
    request = urllib.request.Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    response = urllib.request.urlopen(request)
    html = response.read()

    return html

html = getHtml(url)

def getPaper(html):
    if not os.path.exists('IJCAI_2017'):  # create the folder only if it does not exist
        os.mkdir('IJCAI_2017')
    os.chdir(os.path.join(os.getcwd(), 'IJCAI_2017'))

    reg = r'href="(\d{4}\.pdf)"'  # regular expression for the PDF links
    papre = re.compile(reg)
    addr_list = re.findall(papre, html.decode('utf-8'))

    num = len(addr_list)
    print('Total papers:', num)

    m = 1
    for paperurl in addr_list:
        fname = '%s.pdf' % m  # local file name for the paper
        paper_url = url + paperurl  # download URL of the paper
        print(paper_url)
        paper = getHtml(paper_url)

        # The with block closes the file, so no explicit f.close() is needed.
        with open(fname, 'wb') as f:
            f.write(paper)

        m += 1

        print('Downloaded')

getPaper(html)
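The link-extraction step can be tried offline before running the full download. The sketch below applies the same pattern to a hand-written HTML fragment; the helper name extract_pdf_urls and the sample markup are assumptions for illustration, and urljoin is used in place of plain string concatenation so the result is correct whether or not the base URL ends with a slash.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the script above: href attributes naming a four-digit PDF.
PDF_LINK = re.compile(r'href="(\d{4}\.pdf)"')

def extract_pdf_urls(html_text, base_url):
    """Return absolute URLs for every four-digit PDF link found in html_text."""
    return [urljoin(base_url, name) for name in PDF_LINK.findall(html_text)]

# Hypothetical fragment standing in for the proceedings page.
sample = (
    '<a href="0001.pdf">PDF</a> '
    '<a href="index.html">Home</a> '
    '<a href="0002.pdf">PDF</a>'
)

urls = extract_pdf_urls(sample, 'http://www.ijcai.org/proceedings/2017/')
print(urls)
```

Checking the extracted list on a saved copy of the page first avoids sending hundreds of requests to the site while the pattern is still being adjusted.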

The second method, which uses requests and BeautifulSoup, is implemented as follows:

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 5 10:41:13 2018

@author: Lenovo
"""
import requests
import os
from bs4 import BeautifulSoup, Comment

url = 'http://www.ijcai.org/proceedings/2018/'
headers = {'Host': 'www.ijcai.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

def find_paper():
    html = requests.get(url, headers=headers).content
    s = BeautifulSoup(html, 'lxml')

    # The data to scrape sits inside HTML comments in the parsed page;
    # collect every comment node into a list.
    # (string= is the current name of the older text= argument.)
    comments = s.find_all(string=lambda text: isinstance(text, Comment))

    # The paper list is comments[2]; parse it again with BeautifulSoup.
    soup = BeautifulSoup(comments[2], 'lxml')

    titles = soup.find_all("div", class_="title")  # class is a Python keyword, hence the trailing '_'
    details = soup.find_all("div", class_="details")

    return titles, details

titles, details = find_paper()

def download_paper():
    if not os.path.exists('IJCAI_2018'):  # create the folder only if it does not exist
        os.mkdir('IJCAI_2018')
    # os.getcwd() returns the current working directory.
    # os.path.join(path1[, path2[, ...]]) joins the parts into one path;
    # parts that come before the last absolute path are discarded.
    # os.chdir(dirname) changes the working directory, like cd in a shell.
    os.chdir(os.path.join(os.getcwd(), 'IJCAI_2018'))

    num = len(titles)
    print('Total papers:', num)

    for i in range(num):
        detail = details[i]

        fname = detail.contents[1].get('href')  # file name of the paper
        detail_url = url + fname  # download URL of the paper

        print(detail_url)
        r = requests.get(detail_url)

        # The with block closes the file, so no explicit f.close() is needed.
        with open(fname, 'wb') as f:
            f.write(r.content)

        print('Downloaded:', titles[i].string)

if __name__ == '__main__':
    download_paper()
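The key step in the second script is that the paper list is hidden inside an HTML comment, so the page has to be parsed twice: once to pull out the comments, and once more to parse the markup inside the right comment. The sketch below illustrates that two-pass idea with the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The class names and the sample page are invented for illustration and do not reproduce the real IJCAI markup.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """First pass: collect the text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

class PdfLinkCollector(HTMLParser):
    """Second pass: collect href values of <a> tags that point to PDFs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href and href.endswith('.pdf'):
                self.links.append(href)

# Hypothetical page: the useful markup is wrapped in the second comment.
page = '''<html><body>
<!-- header -->
<!-- <div class="details"><a href="0001.pdf">PDF</a></div>
<div class="details"><a href="0002.pdf">PDF</a></div> -->
</body></html>'''

first_pass = CommentCollector()
first_pass.feed(page)

second_pass = PdfLinkCollector()
second_pass.feed(first_pass.comments[1])  # index chosen for this sample only

print(len(first_pass.comments), second_pass.links)
```

As in the original script, the comment index is specific to one page layout, so it is worth printing the collected comments and checking which one holds the paper list before hard-coding it.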