Downloading PDFs with a Python 3 Crawler (Part 1)

Date: 2019-12-05

Tags: python3, python, crawler download, pdf. Category: Python

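I have recently been learning to write crawlers in Python and having a great time with it, so I am writing this post to record and share what I have done.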

The following modules need to be installed:

bs4 module (published on PyPI as beautifulsoup4)
requests module

1. Source code

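"""
Purpose: download every PDF linked from the given URL
Usage: pass the URL of the page that links to the PDFs as the argument after the script name
"""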

from bs4 import BeautifulSoup as Soup
import requests
from sys import argv, exit
from urllib.parse import urljoin

try:
    # Read the command-line argument; argv[0] is the script name
    root_url = argv[1]
except IndexError:
    print("please input url behind the script!!")
    exit(1)

# Return a list of all <a> tags on the page
def getTagA(root_url):
    res = requests.get(root_url)
    soup = Soup(res.text, 'html.parser')
    return soup.find_all("a")

# Find the <a> tags that link to a PDF, then download each one
def downPdf(root_url, list_a):
    number = 0
    for name in list_a:
        name02 = name.get("href")
        # Skip <a> tags without an href and keep only hrefs ending in .pdf
        if name02 and name02.lower().endswith(".pdf"):
            # Resolve the href against the page URL; urljoin also handles a
            # page URL that ends in something like xx/index.php
            pdf_url = urljoin(root_url, name02)
            # Use everything after the last "/" as the local file name
            pdf_name = pdf_url[pdf_url.rfind("/") + 1:]
            number += 1
            print("Download the %d pdf immediately!!!" % number, end='  ')
            print(pdf_name + ' downloading.....')
            # The PDF is a binary stream, so set the stream parameter to True
            response = requests.get(pdf_url, stream=True)
            with open(pdf_name, 'wb') as file:
                # Write in chunks instead of one byte at a time
                for data in response.iter_content(chunk_size=8192):
                    file.write(data)

if __name__ == "__main__":
    downPdf(root_url, getTagA(root_url))
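
The link-handling step of a script like this one can be tried without touching the network. The following is only a minimal sketch: the helper name pdf_urls and the example.com addresses are made up for illustration and are not part of the original script. It uses urljoin from the standard library's urllib.parse to turn each href into an absolute URL, so relative links, root-relative links and full URLs are all handled the same way.

```python
from urllib.parse import urljoin


def pdf_urls(page_url, hrefs):
    """Return absolute URLs for every href that ends in .pdf (case-insensitive).

    pdf_urls is a hypothetical helper, not a function from the original script.
    """
    urls = []
    for href in hrefs:
        # tag.get("href") returns None for <a> tags without an href; skip those
        if href and href.lower().endswith(".pdf"):
            urls.append(urljoin(page_url, href))
    return urls


if __name__ == "__main__":
    # Placeholder page URL and hrefs, chosen only to show the different cases
    page = "http://example.com/docs/index.php"
    hrefs = ["a.pdf", "/files/B.PDF", None, "about.html",
             "http://other.example.org/c.pdf"]
    for url in pdf_urls(page, hrefs):
        print(url)
```

Because urljoin resolves a relative href against the directory of the page URL, a page address ending in index.php needs no manual trimming here.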

2. Highlights

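Use str.rfind("S") to get the index of the last occurrence of S in str, that is, the first occurrence counting from the right.
Use str.lower().endswith("s") to check whether str ends with S or s. Because the string is lowercased first, the suffix passed to endswith must itself be lowercase.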
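
The two string methods above can be tried in isolation. This is a small sketch; the helper names parent_url and is_pdf_link and the sample URL are invented for illustration.

```python
def parent_url(url):
    """Return url up to and including its last '/' (hypothetical helper)."""
    # rfind returns the index of the last "/" in url, or -1 if there is none
    return url[:url.rfind("/") + 1]


def is_pdf_link(href):
    """Return True if href ends in .pdf, ignoring case (hypothetical helper)."""
    # Lowercase first so that .PDF and .Pdf match as well
    return href.lower().endswith(".pdf")


if __name__ == "__main__":
    print("a/b/c".rfind("/"))                               # 3
    print(parent_url("http://example.com/docs/index.php"))  # http://example.com/docs/
    print(is_pdf_link("Report.PDF"))                        # True
    print(is_pdf_link("report.pdf.html"))                   # False
```

Note that rfind returns -1 when the character is absent, so parent_url returns an empty string for input without any "/".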
