Python Crawler: Downloading Programming E-books with Multiple Threads

Date: 2019-12-21


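Recently I came across a remarkable website, http://www.allitebooks.com/ , which offers a large number of free e-books on programming and is a boon for technology enthusiasts. Its home page looks like this:
[Screenshot: home page of allitebooks.com]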

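So can we write a crawler in Python to download these e-books automatically? The answer is yes.
In my spare time I wrote a crawler that mainly relies on the urllib.request.urlretrieve() function and multithreading to download these e-books.
The idea is first to save the download URLs of the e-books to a local txt file so they can be reused at any time. The Python code (Ebooks_spider.py) is as follows; as an example, it only processes the 10 e-books on the first page: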

# -*- coding:utf-8 -*-
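# This crawler collects e-book links from http://www.allitebooks.com/
# It writes the download links of the books to a txt file for later reuse
# The site http://www.allitebooks.com/ offers e-books on programming

# Import the required modules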
import urllib.request
from bs4 import BeautifulSoup

# Fetch the HTML source of a web page
def get_content(url):
    # The context manager closes the connection even if read() raises
    with urllib.request.urlopen(url) as html:
        return html.read().decode('utf-8')

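# Store the URLs of all 762 listing pages in a list
base_url = 'http://www.allitebooks.com/'
urls = [base_url]
# Page 1 is base_url itself; pages 2..762 follow the pattern page/<n>/
for i in range(2, 763):
    urls.append(base_url + 'page/%d/' % i)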

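# List of e-books; each element stores the download URL of one book
book_list = []

# Limit the number of pages crawled so the downloads do not fill the disk!
# This example only processes the first page (10 books) as a demonstration
# Change the number in urls[:1] to crawl more pages; the maximum is 762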
for url in urls[:1]:
    try:
        # Collect the links to the book pages listed on this page
        content = get_content(url)
        soup = BeautifulSoup(content, 'lxml')
        book_links = soup.find_all('div', class_="entry-thumbnail hover-thumb")
        book_links = [item('a')[0]['href'] for item in book_links]
        print('\nGet page %d successfully!' % (urls.index(url) + 1))
    except Exception:
        book_links = []
        print('\nGet page %d failed!' % (urls.index(url) + 1))

    # If the book links on this page were fetched successfully
    if len(book_links):
        for book_link in book_links:
            # Visit each book's page to find its download link
            try:
                content = get_content(book_link)
                soup = BeautifulSoup(content, 'lxml')
                # Get the download URL of the book
                link = soup.find('span', class_='download-links')
                book_url = link('a')[0]['href']

                # If the download link was found
                if book_url:
                    # Derive the file name from the last path segment
                    book_name = book_url.split('/')[-1]
                    print('Getting book: %s' % book_name)
                    book_list.append(book_url)
            except Exception as e:
                # Report 1-based page and book numbers together with the error
                print('Get page %d Book %d failed: %s'
                      % (urls.index(url) + 1,
                         book_links.index(book_link) + 1, e))

# Output folder (must already exist)
directory = 'E:\\Ebooks\\'
# Write the download links to a txt file for later reuse
with open(directory+'book.txt', 'w') as f:
    for item in book_list:
        f.write(str(item)+'\n')

print('Finished writing the txt file!')
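
The two string operations in the script above, building the listing-page URLs and deriving a file name from a download URL, can be isolated and tried without touching the network. The following is a minimal sketch: the helper names `build_page_urls` and `book_name_from_url` and the sample download URL are illustrative assumptions, not part of the original script, and the page count of 762 is simply the figure given in the script's comments.

```python
import posixpath
from urllib.parse import unquote, urlsplit

BASE_URL = 'http://www.allitebooks.com/'


def build_page_urls(base_url, last_page):
    """Return the listing-page URLs for pages 1..last_page (hypothetical helper)."""
    # Page 1 is the base URL itself; later pages use the page/<n>/ pattern.
    return [base_url] + [base_url + 'page/%d/' % i
                         for i in range(2, last_page + 1)]


def book_name_from_url(book_url):
    """Derive a file name from a download URL (hypothetical helper)."""
    # urlsplit drops any query string or fragment; unquote decodes %20 etc.
    path = urlsplit(book_url).path
    return unquote(posixpath.basename(path))


if __name__ == '__main__':
    pages = build_page_urls(BASE_URL, 762)
    print(len(pages), pages[0], pages[-1])
    # The URL below is made up for illustration only.
    print(book_name_from_url('http://example.com/files/Learning%20Python.pdf?x=1'))
```

Compared with `url.split('/')[-1]`, this variant also copes with percent-encoded names and trailing query strings, which may or may not occur on the real site.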

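As you can see, the code above only crawls static pages, so it runs very efficiently. Running the program prints a status line for the page, one line per book found, and a final message once the txt file has been written.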

The book.txt file now holds the download URLs of these 10 e-books, one per line.
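
The round trip through book.txt can be sketched on its own. This is only an illustration under assumptions: the helper names `save_links` and `load_links` are hypothetical, the URLs are placeholders, and a temporary folder stands in for `E:\Ebooks\`. Unlike the script above, the reader side also drops blank lines and duplicate links.

```python
import os
import tempfile


def save_links(links, file_path):
    """Write one link per line (hypothetical helper)."""
    with open(file_path, 'w', encoding='utf-8') as f:
        for link in links:
            f.write(link + '\n')


def load_links(file_path):
    """Read links back, dropping blank lines and duplicates, keeping order."""
    with open(file_path, 'r', encoding='utf-8') as f:
        stripped = (line.strip() for line in f)
        # dict.fromkeys keeps the first occurrence of each link, in order.
        return list(dict.fromkeys(s for s in stripped if s))


if __name__ == '__main__':
    # The URLs below are placeholders, not real download links.
    sample = ['http://example.com/a.pdf', 'http://example.com/b.pdf',
              'http://example.com/a.pdf']
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'book.txt')
        save_links(sample, path)
        print(load_links(path))
```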

Next, we read these download links back and use the urllib.request.urlretrieve() function with multithreading to download the e-books. The Python code (download_ebook.py) is as follows:

# -*- coding:utf-8 -*-
# This crawler reads the e-book links saved in the txt file
# and downloads the books with multiple threads

import time
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import urllib.request

# Download a PDF file with urllib.request.urlretrieve()
def download(url):
    # Local file path, named after the last segment of the URL
    book_name = 'E:\\Ebooks\\'+url.split('/')[-1]
    print('Downloading book: %s'%book_name) # download started
    urllib.request.urlretrieve(url, book_name)
    print('Finish downloading book: %s'%book_name) # download finished

def main():
    start_time = time.time() # start time

    file_path = 'E:\\Ebooks\\book.txt' # path of the txt file
    # Read the e-book links from the txt file, skipping blank lines
    with open(file_path, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]

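    # Download the e-books with a thread pool, one thread per link
    # wait() blocks until every download has finished
    # max(1, ...) avoids the ValueError raised for a pool of size 0
    executor = ThreadPoolExecutor(max(1, len(urls)))
    future_tasks = [executor.submit(download, url) for url in urls]
    wait(future_tasks, return_when=ALL_COMPLETED)
    executor.shutdown()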

    # Report the elapsed time
    end_time = time.time()
    print('Total cost time:%s'%(end_time - start_time))

if __name__ == '__main__':
    main()
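
The script above starts one thread per link, and because `result()` is never called on the futures, an exception raised inside a download stays hidden. One possible variation, sketched here under assumptions rather than taken from the original code, caps the pool size and collects per-URL outcomes with `as_completed`. The function name `download_all`, the `max_workers` default of 8, and the injected `fetch` callable are all hypothetical; in the real script `fetch` would be the `download` function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a bounded thread pool.

    Returns (succeeded, failed): a list of URLs and a dict url -> exception.
    """
    succeeded, failed = [], {}
    if not urls:
        return succeeded, failed
    # Never create more threads than there are URLs.
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Map each future back to its URL so failures can be reported.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()  # re-raises any exception from fetch
                succeeded.append(url)
            except Exception as exc:
                failed[url] = exc
    return succeeded, failed


if __name__ == '__main__':
    # Stand-in for the real download function: fails on one made-up name.
    def fake_fetch(url):
        if url.endswith('bad.pdf'):
            raise IOError('simulated failure')

    ok, bad = download_all(['a.pdf', 'bad.pdf', 'c.pdf'], fake_fetch)
    print(sorted(ok), list(bad))
```

Passing the download function in as a parameter keeps the sketch testable without network access; the failed URLs could then be retried or written back to a txt file.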

Running the code above prints a start and a finish message for each book, followed by the total elapsed time.

The downloaded files can then be checked in the folder.
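
The same check can be scripted. Below is a minimal sketch that lists the PDF files in a folder and sums their sizes; the helper name `folder_summary` is an assumption, and the demo uses a temporary folder with dummy files where the setup above would use `E:\Ebooks\`.

```python
from pathlib import Path


def folder_summary(directory, suffix='.pdf'):
    """Return (file names, total size in MB) for files ending in suffix."""
    files = sorted(p for p in Path(directory).iterdir()
                   if p.is_file() and p.suffix.lower() == suffix)
    total_bytes = sum(p.stat().st_size for p in files)
    # 1 MB is taken as 1024 * 1024 bytes here.
    return [p.name for p in files], total_bytes / (1024 * 1024)


if __name__ == '__main__':
    import tempfile
    # Demonstrate on a temporary folder holding two dummy files.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / 'a.pdf').write_bytes(b'x' * 1024)
        (Path(tmp) / 'notes.txt').write_bytes(b'y' * 10)
        names, size_mb = folder_summary(tmp)
        print(names, '%.4f MB' % size_mb)
```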

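All 10 books were downloaded successfully. The run took 327 seconds in total, an average of 32.7 seconds per book, or about half a minute, and the books add up to 87.7 MB, so the approach is quite efficient!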
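
The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the text (10 books, 327 seconds, 87.7 MB). Because the downloads run in parallel, the per-book figure is the elapsed time divided by the number of books, not the time any single download took.

```python
total_seconds = 327   # total elapsed time reported above
book_count = 10       # number of books downloaded
total_mb = 87.7       # combined size of the books in MB

seconds_per_book = total_seconds / book_count
throughput_mb_s = total_mb / total_seconds

print('Average time per book: %.1f s' % seconds_per_book)
print('Overall throughput: %.2f MB/s' % throughput_mb_s)
```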
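How about it? Now that you have seen the interesting things a crawler can do, are you tempted to try? Better to act than just be tempted, as the saying goes.
The code for this post has been uploaded to GitHub at https://github.com/percent4/E... .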

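Note: I now run two WeChat official accounts: 用Python作數學 (Doing Math with Python, WeChat ID: python_math) and 輕鬆學會Python爬蟲 (Learn Python Crawling with Ease, WeChat ID: easy_web_scrape). You are welcome to follow them.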