Crawler Performance: Node.js vs. Python

The crawl target

Zhongchou.com, projects currently crowdfunding: http://www.zhongchou.com/brow... We'll take this site as the example: crawl every project that is currently crowdfunding, get each project's detail-page URL, and write the URLs to a txt file.

Hands-on comparison

Python, baseline version

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML parser (BeautifulSoup 3)

# request headers
headers = {
   'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language':'zh-CN,zh;q=0.8',
   'Connection':'keep-alive',
   'Host':'www.zhongchou.com',
   'Upgrade-Insecure-Requests':'1',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    items = open('pystandard.txt','a')
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            items.write(href+"\n")
            no +=1
    items.close()
    return no
    
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))

Results over 5 runs:

it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items

Python, multithreaded version

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML parser (BeautifulSoup 3)

# request headers
headers = {
   'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language':'zh-CN,zh;q=0.8',
   'Connection':'keep-alive',
   'Host':'www.zhongchou.com',
   'Upgrade-Insecure-Requests':'1',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

items = open('pymulti.txt','a')
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist):
    # print urllist  #①
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            lock.acquire()
            items.write(href+"\n")
            no +=1
            # print no
            lock.release()
    
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    allthread = 30
    per = allpage // allthread  # pages per thread
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))

Results over 5 runs:

it takes 45.5222291114 Seconds to get 720 items 
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items 
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items

This multithreaded version shows no advantage. Toggling the comment at #① on and off reveals that this so-called multithreading actually runs serially, just like a single thread.

Improved Python: single-threaded

First, let's improve the HTML-parsing step. Analysis shows that

lists = soup.findAll('a',attrs={"class":"siteCardICH3"})

works better than

lists = soup.findAll(attrs={"class":"ssCardItem"})

because it looks for the a elements directly instead of first finding the div and then the a inside it. Five runs after this change show a clear improvement:

it takes 41.0018861912 Seconds to get 720 items 
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items 
it takes 41.295524133 Seconds to get 720 items 
it takes 42.9022894154 Seconds to get 720 items
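
As a sanity check, the two selectors can also be timed in isolation against a saved copy of one listing page. A minimal sketch (sample.html is a hypothetical local copy; the repeat count is arbitrary):

# -*- coding:utf-8 -*-
# time the two findAll calls against one locally saved listing page
import time
from BeautifulSoup import BeautifulSoup

html = open('sample.html').read()   # hypothetical saved listing page
soup = BeautifulSoup(html)

t0 = time.clock()
for _ in range(100):
    soup.findAll(attrs={"class":"ssCardItem"})         # find the div, then take its a
t1 = time.clock()
for _ in range(100):
    soup.findAll('a',attrs={"class":"siteCardICH3"})   # find the a directly
t2 = time.clock()
print('div first: %s s, a directly: %s s' % (t1-t0, t2-t1))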

Multithreaded

Change getItems(urllist) to getItems(urllist,thno), and add print thno," begin at",time.clock() at the start of the function body and print thno," end at",time.clock() at the end.
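
For reference, the instrumented function would look like this (a reconstruction; apart from the two print lines, the body matches the multithreaded listing above, using the improved selector):

def getItems(urllist,thno):
    print thno," begin at",time.clock()
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll('a',attrs={"class":"siteCardICH3"})
        for i in range(len(lists)):
            lock.acquire()
            items.write(lists[i]['href']+"\n")
            no += 1
            lock.release()
    print thno," end at",time.clock()

The result: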

0  begin at 0.00100631078628
0  end at 1.28625832936
1  begin at 1.28703230691
1  end at 2.61739476075
2  begin at 2.61801291642
2  end at 3.92514717937
3  begin at 3.9255829208
3  end at 5.38870235361
4  begin at 5.38921134066
4  end at 6.670658786
5  begin at 6.67125734731
5  end at 8.01520989534
6  begin at 8.01566383155
6  end at 9.42006780585
7  begin at 9.42053340537
7  end at 11.0386755513
8  begin at 11.0391565464
8  end at 12.421359168
9  begin at 12.4218294329
9  end at 13.9932716671
10  begin at 13.9939957256
10  end at 15.3535799145
11  begin at 15.3540870354
11  end at 16.6968289314
12  begin at 16.6972665389
12  end at 17.9798803157
13  begin at 17.9804714125
13  end at 19.326706238
14  begin at 19.3271438455
14  end at 20.8744308886
15  begin at 20.8751017624
15  end at 22.5306500245
16  begin at 22.5311450156
16  end at 23.7781693541
17  begin at 23.7787245279
17  end at 25.1775114499
18  begin at 25.178350742
18  end at 26.5497330734
19  begin at 26.5501776789
19  end at 27.970799259
20  begin at 27.9712727895
20  end at 29.4595075375
21  begin at 29.4599959972
21  end at 30.9507299602
22  begin at 30.9513989679
22  end at 32.2762763982
23  begin at 32.2767182045
23  end at 33.6476256057
24  begin at 33.648137392
24  end at 35.1100517711
25  begin at 35.1104907783
25  end at 36.462657099
26  begin at 36.4632234696
26  end at 37.7908515759
27  begin at 37.7912845182
27  end at 39.4359928956
28  begin at 39.436448698
28  end at 40.9955021593
29  begin at 40.9960871912
29  end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items

Clearly these threads never actually ran concurrently; they executed one after another, which defeats the purpose of multithreading. Where is the problem? It turns out that in my loop,

th.start()
th.join()

the two lines are back to back, so each new thread starts only after the previous one has finished (join() blocks until that thread completes). Change it to:

for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],i))
    ths.append(th)
for th in ths:
    th.start()
for th in ths:
    th.join()

The result:

7  end at 69.1060433231
22  end at 69.2743398214
2  end at 69.5523713152
14  end at 69.6454986837
15  end at 69.8333400981
12  end at 69.9508018062
10  end at 70.2860348602
26  end at 70.3670659719
13  end at 70.3847232972
27  end at 70.3941635841
11  end at 70.5132838156
1  end at 70.7272351926
0  end at 70.9115253609
6  end at 71.0876563409
8  end at 71.1124805398
25  end at 71.1145248855
3  end at 71.4606034226
19  end at 71.6103622486
18  end at 71.6674453096
20  end at 71.725601862
17  end at 71.7778992318
9  end at 71.7847479301
28  end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items

Reflections

The threads above really are concurrent now, yet the total time is far longer than single-threaded... I haven't pinned down the cause. My guess is that BeautifulSoup doesn't cope well with multithreading; a likely factor is CPython's GIL, since parsing HTML is CPU-bound work, so the threads cannot actually parse in parallel while lock traffic and thread switching add overhead. Corrections are welcome. To verify the idea, I'll drop BeautifulSoup and search the raw string instead.
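
Incidentally, another way to probe that hypothesis would be multiprocessing: worker processes each have their own interpreter, so CPU-bound parsing can truly run in parallel. A minimal sketch, not part of the timed experiments (request headers omitted for brevity):

# a sketch: fetch and parse listing pages in worker processes, bypassing the GIL
from multiprocessing import Pool
import requests
from BeautifulSoup import BeautifulSoup

def fetchAndParse(url):
    # download one listing page and return the project hrefs on it
    html = requests.get(url).text.encode('utf8')
    soup = BeautifulSoup(html)
    return [a['href'] for a in soup.findAll('a',attrs={"class":"siteCardICH3"})]

if __name__ == '__main__':
    urls = ['http://www.zhongchou.com/browse/di'] + \
           ['http://www.zhongchou.com/browse/di-p%d' % p for p in range(2, 31)]
    pool = Pool(10)                        # 10 worker processes
    pages = pool.map(fetchAndParse, urls)  # one list of hrefs per page
    print('%d items' % sum(len(p) for p in pages))

First, the string-search change to the single-threaded version: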

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time

# request headers
headers = {
   'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language':'zh-CN,zh;q=0.8',
   'Connection':'keep-alive',
   'Host':'www.zhongchou.com',
   'Upgrade-Insecure-Requests':'1',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    data = set()
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        start = 5000    # start searching well into the page, presumably past the header, to avoid false matches
        while True:
            index = html.find("deal-show", start)
            if index == -1:    # no more occurrences on this page
                break
            # each detail URL is "deal-show/" followed by a 9-digit id
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000    # jump past the current card to avoid duplicate matches
    items = open('pystandard.txt','a')
    items.write("".join(data))
    items.close()
    return len(data)
    
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))

Results over 3 runs:

it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items

Then modify the multithreaded version the same way:

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time,threading

# request headers
header = {
   'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language':'zh-CN,zh;q=0.8',
   'Connection':'keep-alive',
   'Host':'www.zhongchou.com',
   'Upgrade-Insecure-Requests':'1',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

data = set()
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist,thno):
    # print urllist
    # print thno," begin at",time.clock()
    global no,lock,data
    for url in urllist:
        r1 = requests.get(url,headers=header)
        html = r1.text.encode('utf8')
        start = 5000    # start searching past the page header, as in the single-threaded version
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            lock.acquire()
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000    # jump past the current card
            lock.release()
        
    # print thno," end at",time.clock()
    
if __name__ == '__main__':
    start = time.clock()
    allpage = 30   # number of pages
    allthread = 10 # number of threads
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        low = i*allpage/allthread       # integer division: each thread gets allpage/allthread pages
        high = (i+1)*allpage/allthread
        th = threading.Thread(target = getItems,args= (urllist[low:high],i))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open('pymulti.txt','a')
    items.write("".join(data))
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,len(data)))

Three runs, results:

it takes 1.4781525123 Seconds to get 720 items 
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items

Multithreading really is many times faster than a single thread here. And for a simple crawl like this, the built-in string methods are far faster than parsing the HTML with BeautifulSoup.
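
If raw string searching ever feels too brittle, a middle ground would be a precompiled regular expression, still much cheaper than building a parse tree. A minimal sketch (it assumes, as the slicing above does, that the deal-show ids are exactly 9 digits):

# a sketch: extract deal-show URLs with a precompiled regex instead of find()
import re

DEAL_RE = re.compile(r'deal-show/(\d{9})')

def extractUrls(html):
    # findall returns the captured 9-digit ids; rebuild the same URL format as above
    return set('http://www.zhongchou.com/deal-show/%s\n' % i
               for i in DEAL_RE.findall(html))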

Node.js

// npm install request -g   # this doesn't seem to work; instead cd into the code directory and run: npm install --save request
// npm install cheerio -g   # likewise: npm install --save cheerio

var request = require("request");
var cheerio = require('cheerio');
var fs = require('fs');

var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array();
var urldata = "";
var mark = 0;
var no = 0;
for (var i=0; i<allpage; i++) {
    if (i==0) 
        urllist[i] = 'http://www.zhongchou.com/browse/di'
    else
        urllist[i] = 'http://www.zhongchou.com/browse/di-p'+(i+1).toString();
    request(urllist[i],function(error,resp,body){
        if (!error && resp.statusCode==200) {
            getUrl(body);
        }
    });
} 

function getUrl(data) {
    var $ = cheerio.load(data);  // parse the page with cheerio
    var href = $("a.siteCardICH3").toArray();
    for (var i = href.length - 1; i >= 0; i--) {
        // console.log(href[i].attribs["href"]);
        urldata += (href[i].attribs["href"]+"\n");
        no += 1;
    }    
    mark += 1;
    if (mark==allpage) {
        // console.log(urldata);
        fs.writeFile('./nodestandard.txt',urldata,function(err){
                    if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }  
}

Results over 5 runs:

it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items

So with the same parse-the-HTML approach, Node.js beats Python hands down. What about string searching?

// note: for this version urldata must be declared as an array (var urldata = []) rather than a string
function getUrl(data) {
    mark += 1;
    var start = 5000;  // start searching past the page header, as in the Python version
    while (true) {
        var index1 = data.indexOf("deal-show", start);
        if (index1 == -1)     
            break;
        var url = "http://www.zhongchou.com/deal-show/"+data.substring(index1+10,index1+19)+"\n";
        // console.log(url);
        if (urldata.indexOf(url)==-1) {
            urldata.push(url);
        }
        start = index1 + 1000;
    }
    if (mark==allpage) {  // all pages have been handled
        // console.log(urldata);
        no = urldata.length;
        fs.writeFile('./nodestandard.txt',urldata.join(""),function(err){
                    if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }  
}

Results over 5 runs:

it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items

About the same as the parsing approach, then.

Summary: based on what I know and on this experiment, my conclusion is that with multithreading Python's download speed can keep up with Node.js, but at parsing web pages Python is slower than Node.js. After all, JS was born for the web, and a complex crawler can't do everything with raw string searches.
