Downloading O'Reilly free ebooks with Python

1. Open the O'Reilly free ebooks page:

http://www.oreilly.com/programming/free/

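Open the browser's developer tools on the page (Inspect Element) and run the following JavaScript in the console to get the list of book download links: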

$.map($('body > article:nth-child(4) > div > section > div > a'), function(e){return e.href.replace(/free/, "free/files").replace(/csp.*/, "pdf")})

The resulting list looks like this:

["http://www.oreilly.com/programming/free/files/open-source-in-brazil.pdf",
 "http://www.oreilly.com/programming/free/files/ten-steps-to-linux-survival.pdf",
 "http://www.oreilly.com/programming/free/files/open-by-design.pdf",
 "http://www.oreilly.com/programming/free/files/getting-started-with-innersource.pdf",
 "http://www.oreilly.com/programming/free/files/microservices-in-production.pdf",
 "https://info.lightbend.com/COLL-20XX-Developing-Reactive-Microservices_Landing-Page.html?lst=OR",
 "http://www.oreilly.com/programming/free/files/microservices-antipatterns-and-pitfalls.pdf",
 "http://www.oreilly.com/programming/free/files/microservices-vs-service-oriented-architecture.pdf",
 "http://www.oreilly.com/programming/free/files/evolving-architectures-of-fintech.pdf",
 "http://www.oreilly.com/programming/free/files/software-architecture-patterns.pdf",
 "http://www.oreilly.com/programming/free/files/migrating-cloud-native-application-architectures.pdf",
 "http://www.oreilly.com/programming/free/files/reactive-microservices-architecture-orm.pdf"]
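
The same rewriting can also be done in Python, which is handy if the hrefs were collected some other way than through the browser console. The sketch below mirrors the two replace calls from the JS snippet. The function name `to_pdf_url` is hypothetical, and the sample `.csp` href is an assumption about the shape of the links on the page, inferred from the list above rather than copied from the site.

```python
import re

def to_pdf_url(href):
    """Mirror the JS snippet: the first "free" becomes "free/files", then
    "csp" and everything after it becomes "pdf"."""
    # count=1 matches JS String.replace with a non-global regex,
    # which only replaces the first occurrence.
    url = re.sub(r"free", "free/files", href, count=1)
    return re.sub(r"csp.*", "pdf", url, count=1)

# Sample hrefs (assumed shape of the links on the page)
hrefs = [
    "http://www.oreilly.com/programming/free/open-by-design.csp",
    "https://info.lightbend.com/COLL-20XX-Developing-Reactive-Microservices_Landing-Page.html?lst=OR",
]
pdf_urls = [to_pdf_url(h) for h in hrefs]
```

A link that contains neither "free" nor "csp", such as the lightbend.com entry, passes through unchanged, which is why it shows up in the list and has to be skipped later.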

2. Write Python code to perform the download:

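First version: download directly with the urlretrieve function from the urllib library. The list may contain invalid entries (links that do not point to a PDF), so the loop checks each one and skips those.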

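# Python 2 code: uses the print statement and urllib.urlretrieve
import urllib

# Target directory for the downloaded files
path = "G:\\books\\auto_dowloading\\"

def downloading(books):
    for book in books:
        # Skip entries that are not PDF links (e.g. external landing pages)
        if '.pdf' not in book:
            continue
        # The file name is the last segment of the URL
        name = book.split("/")[-1]
        print "downloading %s" % name
        urllib.urlretrieve(book, path + name)
        print "download %s is over!" % name
    print "all job done"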
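
For readers on Python 3, where `urlretrieve` lives in `urllib.request` and `print` is a function, the first version could be sketched roughly as follows. The `dest_dir` and `fetch` parameters and the returned list are assumptions added for illustration, so that the target directory and the download call can be swapped out; they are not part of the original script.

```python
import os
from urllib.request import urlretrieve

def downloading(books, dest_dir, fetch=urlretrieve):
    """Download every PDF link in `books` into `dest_dir`.

    Returns the list of local file paths that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)  # create the target directory if needed
    saved = []
    for book in books:
        # Skip entries that are not PDF links, as in the original loop
        if ".pdf" not in book:
            continue
        name = book.split("/")[-1]  # file name = last URL segment
        target = os.path.join(dest_dir, name)
        print("downloading %s" % name)
        fetch(book, target)
        print("download %s is over!" % name)
        saved.append(target)
    print("all job done")
    return saved
```

Passing the list from step 1 together with a directory such as "G:\\books\\auto_dowloading" would then fetch each PDF in turn.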

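Second version: given the page URL, scrape the addresses of all the books, then pass the list to a process pool that calls the download function on each entry.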

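# Python 2 code: uses the print statement, urllib.urlopen and urllib.urlretrieve
import urllib
import re
from multiprocessing import Pool

# Target directory for the downloaded files
path = "G:\\books_new\\"
# URLs collected by get_booklist() and handed to the pool in the main block
job = []

def get_booklist(url):
    # Fetch the page source
    page = urllib.urlopen(url)
    html = page.read()

    # Collect the links ending in .csp and rewrite them to the PDF location.
    # Only the first "free" and the trailing "csp" are replaced, so a book
    # name that happens to contain either string is left intact.
    tmp = re.findall(r'http://.*?\.csp', html)
    tmp2 = [i.replace('free', 'free/files', 1)[:-3] + 'pdf' for i in tmp]
    job.extend(tmp2)

def download_book(url, path=path):
    # Skip anything that is not a PDF link
    if '.pdf' not in url:
        return
    # The file name is the last segment of the URL
    name = url.split("/")[-1]
    print "downloading %s" % name
    urllib.urlretrieve(url, path + name)
    print "download %s is over!" % name

# The __main__ guard is required for multiprocessing on Windows
if __name__ == '__main__':
    get_booklist('http://www.oreilly.com/programming/free/')
    pool = Pool()
    # One download_book call per URL, spread across the worker processes
    pool.map(download_book, job)
    print('The documents have been downloaded successfully!')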
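
The link-extraction step is the part most likely to break when the page changes, so it can help to isolate it as a pure function that can be tried on a saved copy of the HTML. Below is a minimal Python 3 sketch of the second version along those lines. The names `extract_pdf_urls` and `download_all` are hypothetical. The regex is tightened so that a match cannot run across quotes or whitespace, and duplicate links are dropped; both are assumptions about the page markup rather than something the original code does.

```python
import os
import re
from functools import partial
from multiprocessing import Pool
from urllib.request import urlopen, urlretrieve

def extract_pdf_urls(html):
    """Find the .csp book links in the page source and rewrite them to PDF URLs."""
    links = re.findall(r'http://[^"\'\s]*?\.csp', html)
    urls = []
    for link in links:
        # Rewrite only the directory part and the extension, so a book
        # title containing "free" or "csp" is left alone
        url = link.replace("/free/", "/free/files/", 1)[:-len(".csp")] + ".pdf"
        if url not in urls:  # drop duplicates, keep page order
            urls.append(url)
    return urls

def download_book(url, dest_dir):
    """Download one PDF into dest_dir (runs inside a worker process)."""
    name = url.split("/")[-1]
    print("downloading %s" % name)
    urlretrieve(url, os.path.join(dest_dir, name))
    print("download %s is over!" % name)

def download_all(page_url, dest_dir):
    """Scrape the page, then download all books with a process pool."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    os.makedirs(dest_dir, exist_ok=True)
    with Pool() as pool:
        pool.map(partial(download_book, dest_dir=dest_dir), extract_pdf_urls(html))
```

As in the original script, the call to `download_all('http://www.oreilly.com/programming/free/', 'G:\\books_new')` should sit under an `if __name__ == '__main__':` guard so that worker processes on Windows do not re-run it on import.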