Automatically Downloading ICML Accepted Papers with a Python Script

Date: 2019-11-08

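Recently I needed to download the papers accepted at ICML 2015. One look at the official site showed so many papers that clicking through and downloading them one by one would take forever. So I decided to use a Python script to process the papers page, extract the URL of each paper, and then download them.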

Looking at ICML's Accepted Papers page, its structure turns out to be fairly regular. The fragment containing the information we need looks like this:

<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="details">
    <span class="authors">
            Nihar Shah,
        
            Dengyong Zhou,
        
            Yuval Peres
    </span>
    </p>
    <p class="links">
        [<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>

Once we have extracted the title and the link to each paper, the job is essentially done.

There are generally two ways to extract content from HTML:

Parse the HTML document
Match the content with regular expressions
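
As a rough illustration of the first approach, here is a minimal sketch that uses HTMLParser from the Python 3 standard library (module html.parser) on an abbreviated copy of the fragment shown earlier. The class name PaperLinkParser, its attributes, and the SAMPLE string are made up for this example. It is one possible way to do it, not the code used in this post, and it assumes each title paragraph appears before the links that belong to it.

```python
from html.parser import HTMLParser

# Abbreviated copy of the fragment above (the authors block is left out).
SAMPLE = '''<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">
        [<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>'''


class PaperLinkParser(HTMLParser):
    """Hypothetical helper: collects (title, pdf_href) pairs."""

    def __init__(self):
        super().__init__()
        self.papers = []            # collected (title, pdf_href) tuples
        self._in_title = False      # True while inside <p class="title">
        self._title_parts = []      # text chunks of the current title
        self._current_title = None  # most recently completed title
        self._last_href = None      # href of the <a> tag we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'p' and attrs.get('class') == 'title':
            self._in_title = True
            self._title_parts = []
        elif tag == 'a':
            self._last_href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_title:
            self._in_title = False
            self._current_title = ''.join(self._title_parts).strip()
        elif tag == 'a':
            self._last_href = None

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)
        elif self._last_href and self._current_title and data.strip() == 'pdf':
            # The link whose visible text is "pdf" points at the paper itself.
            self.papers.append((self._current_title, self._last_href))


parser = PaperLinkParser()
parser.feed(SAMPLE)
parser.close()
print(parser.papers)
```

The parser has to carry state between callbacks (which tag it is inside, which title came last), which is part of why this approach can feel heavier than a single pattern for a small job.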

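Parsing the HTML document is slower than matching with regular expressions, but for a small job like mine speed is not the main requirement; what matters most is getting it to work. So I first tried HTMLParser, but it did not seem to be what I wanted and was awkward to use, so I switched to Python's regular expressions instead. To match the content shown above, I wrote the following regular expression, with capture groups for the paper title and the URL.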

<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>
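
To see what the two groups capture, the expression can be tried on the sample fragment from above. This is only a quick check, not part of the original script: the names PAPER_PATTERN and SAMPLE are chosen for the example, SAMPLE is an abbreviated copy of the fragment, and the dot before "pdf" is escaped here so that it matches a literal period only.

```python
import re

PAPER_PATTERN = re.compile(
    r'<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>'
    r'[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>',
    re.IGNORECASE,
)

# Abbreviated copy of the fragment above (the authors block is left out).
SAMPLE = '''<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">
        [<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>'''

match = PAPER_PATTERN.search(SAMPLE)
title = match.group(1)  # group 1: the paper title
pdf = match.group(2)    # group 2: the relative link to the PDF
print('title:', title)
print('pdf:', pdf)
```

Because "." does not match a newline by default, the part after href has to match within a single line, so the "abs" link is skipped and the first link whose text is "pdf" is captured.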

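The overall flow of the Python script is:

Fetch the HTML document to process
Extract the title and URL of each paper
Download the resource at each URL and save it as a PDF named after the title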

The complete code is as follows:

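# -*- coding: utf-8 -*-
import re
import urllib.request

# Index page of the proceedings; the PDF links on it are relative to this URL.
BASE_URL = 'http://jmlr.org/proceedings/papers/v37/'


def getDocument():
    # Fetch the Accepted Papers page and decode it so the regex works on text
    # (assumes the page is UTF-8; undecodable bytes are replaced).
    with urllib.request.urlopen(BASE_URL) as response:
        return response.read().decode('utf-8', 'replace')


def download(url, file):
    """
    download the file

    @parameters
    url:the resource of the file
    file:the name to save the file
    """
    # Titles may contain characters that are invalid in file names; replace them.
    name = re.sub(r'[\\/:*?"<>|]', '_', file)
    with urllib.request.urlopen(url) as f, open(name + '.pdf', 'wb') as output:
        output.write(f.read())


def process(document):
    # Group 1 captures the paper title, group 2 the relative link to the PDF.
    p = re.compile(r'<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>', re.IGNORECASE)
    for i in p.finditer(document):
        title = i.group(1).strip()
        print('title:', title)
        print('url:', BASE_URL + i.group(2))
        print('downloading....')
        download(BASE_URL + i.group(2), title)


if __name__ == '__main__':
    process(getDocument())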

After running the script above, the downloaded papers appear in the corresponding folder, and opening one of them shows that it displays correctly.

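PS: The one shortcoming is that some papers come with supplementary material, as the fragment shows, but I could not get that part of the regular expression to work and did not dig further. If anyone knows how, I would appreciate your advice. Since some papers have a supplement and others do not, my idea is that the link should be matched when it exists and simply skipped when it does not. I am not very familiar with regular expressions, so I will stop here for now and update the post if I find a solution. There is nothing technically difficult here; this is just an everyday note.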
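
One possible way to handle the optional supplementary link, offered as a sketch rather than something verified against the live page: instead of one large expression, first cut out each paper block, then look inside that block for each link by its label. A missing link then simply comes back as None, and a lazy match can no longer run on into the next paper. The function name extract_papers, the pattern names, and the second paper in the sample are made up for illustration, and the sketch assumes a paper block contains no nested div.

```python
import re

# Step 1: one paper block (assumes no nested <div> inside it).
PAPER_RE = re.compile(r'<div[^>]*class="paper"[^>]*>([\s\S]*?)</div>', re.IGNORECASE)
# Step 2: patterns that are applied only inside a single block.
TITLE_RE = re.compile(r'<p[^>]*class="title"[^>]*>([\s\S]*?)</p>', re.IGNORECASE)
LINK_RE = re.compile(
    r'<a[^>]*href="([^"]+)"[^>]*>\s*(abs|pdf|supplementary)\s*</a>', re.IGNORECASE)


def extract_papers(document):
    """Return one dict per paper; 'supplementary' is None when absent."""
    papers = []
    for block in PAPER_RE.finditer(document):
        body = block.group(1)
        title = TITLE_RE.search(body)
        if title is None:
            continue
        # Map each link label (abs, pdf, supplementary) to its href.
        links = {label.lower(): href for href, label in LINK_RE.findall(body)}
        papers.append({
            'title': title.group(1).strip(),
            'pdf': links.get('pdf'),
            'supplementary': links.get('supplementary'),
        })
    return papers


# The second entry is a made-up paper without a supplement.
SAMPLE = '''
<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">
        [<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>
<div class="paper">
    <p class="title">A Made-Up Paper Without a Supplement</p>
    <p class="links">
        [<a href="example15.html">abs</a>]
        [<a href="example15.pdf">pdf</a>]
    </p>
</div>
'''

papers = extract_papers(SAMPLE)
for paper in papers:
    print(paper)
```

With this split, the download loop could fetch the supplementary file only when the 'supplementary' entry is not None.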