Notes on My First Python Web Crawler

Date: 2019-11-18


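This program essentially imitates the way a user browses a website.
Starting from the home page, it collects the top-level product categories, then walks down through the subcategories level by level. At the bottom it reaches the product lists, visits each product page, and extracts the useful information from it.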
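
The traversal described above can be sketched as a depth-first walk. This is only an illustration in Python 3, not the original program: the `SITE` dictionary stands in for real pages, and `list_children` and `collect_products` are hypothetical names.

```python
# Hypothetical in-memory stand-in for the site: category pages map to the
# links they contain; anything not listed here is treated as a product page.
SITE = {
    'home': ['tools', 'toys'],
    'tools': ['tools/hammers'],
    'tools/hammers': ['product-1', 'product-2'],
    'toys': ['product-3'],
}

def list_children(page):
    """Return the links found on a category page (empty for a product page)."""
    return SITE.get(page, [])

def collect_products(page, visited=None):
    """Walk the category tree depth-first and return the product pages found."""
    if visited is None:
        visited = set()
    if page in visited:        # guard against categories that link in a loop
        return []
    visited.add(page)
    children = list_children(page)
    if not children:           # no sub-links: treat this page as a product page
        return [page]
    products = []
    for child in children:
        products.extend(collect_products(child, visited))
    return products

print(collect_products('home'))
```

In a real crawler, `list_children` would download the page and parse its links; the `visited` set keeps the walk from fetching the same page twice.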

Below I summarize a few key points so that I can look them up later.

1. How to fetch a web page

import urllib2

# Fetch the body of the page at the given URL
def get_webpage(url):
    headers = {
            'User-Agent' : 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Accept'     : 'text/html',
            'Connection' : 'keep-alive'}
    try:
        request = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(request, timeout=120)
        webpage = response.read()
        response.close()
        return webpage

    #except urllib2.HTTPError as e:
    #    print('HTTPError: ' + str(e.code))
    #except urllib2.URLError as e:
    #    print('URLError: ' + str(e.reason))
    except Exception as e:
        # On any failure, report it; the function then returns None
        print('Exception: ' + str(e))

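The function above uses urllib2.urlopen() to fetch the content at the given URL. You could skip urllib2.Request() and pass the URL straight to urllib2.urlopen(); the Request object is used here so that browser-like headers can be sent, making the request look like one from an ordinary browser.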
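
The listing above targets Python 2. In Python 3, `urllib2` was split into `urllib.request` and `urllib.error`, so a rough equivalent might look like the sketch below. The `build_request` helper is a hypothetical name introduced here to keep the header setup separate from the network call.

```python
import urllib.request

def build_request(url):
    """Build a request carrying the same browser-like headers as above."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Accept': 'text/html',
        'Connection': 'keep-alive',
    }
    return urllib.request.Request(url, None, headers)

def get_webpage(url, timeout=120):
    """Return the raw bytes of the page at url, or None on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as e:   # urllib.error.URLError and socket timeouts derive from OSError
        print('Error: ' + str(e))
        return None
```

Note that `response.read()` returns bytes in Python 3, so the caller has to decode the page with the encoding the site uses before treating it as text.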

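2. Saving the data
Ideally the data is saved as an xls file. If that is not possible, the csw text format will do, and plain txt is also acceptable.
The format is best chosen automatically from the extension of the file name the user enters.

(1) First define the functions save_as_csw(), save_as_txt(), and save_as_xls(), which save the data in csw, txt, and xls format respectively.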

def save_as_csw(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Columns: category, product, price, contact, mobile, company, landline, fax, address, company website, source page
    line_fmt = '"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\t"%s"\n'
    lines = []
    head_line = line_fmt % ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                            'Phone', 'Fax', 'Address', 'Website', 'Source page')
    lines.append(head_line)
    for item in prod_list:
        info = item['detail']
        if info is None:    # skip products whose details are missing
            continue

        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'], info['website'], item['url'])
        lines.append(prod_line)

    # The with block closes the file even if writing fails
    with open(filename, 'w') as wfile:
        wfile.writelines(lines)
    return True

def save_as_txt(prod_list, filename):
    if len(prod_list) == 0:
        return False

    # Columns: category, product, price, contact, mobile, company, landline, fax, address, company website, source page
    line_fmt = '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n'
    lines = []
    head_line = line_fmt % ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                            'Phone', 'Fax', 'Address', 'Website', 'Source page')
    lines.append(head_line)
    for item in prod_list:
        info = item['detail']
        if info is None:    # skip products whose details are missing
            continue

        prod_line = line_fmt % (item['path'], info['name'], info['price'],
                                info['contact'], info['cell-phone'], info['company'],
                                info['tel-phone'], info['fax'], info['address'], info['website'], item['url'])
        lines.append(prod_line)

    # The with block closes the file even if writing fails
    with open(filename, 'w') as wfile:
        wfile.writelines(lines)
    return True

# Save the data to an xls file, one worksheet per top-level category
def save_as_xls(prod_list, filename):
    if len(prod_list) == 0:
        return False

    workbook = xlwt.Workbook(encoding='utf-8')  # the encoding must be given, or saving fails
    curr_category = ''
    worksheet = None
    row_index = 0
    for prod_item in prod_list:
        path = prod_item['path']
        this_category = path.split('/')[0]
        # A category different from the previous product's needs a new worksheet
        if this_category != curr_category:
            worksheet = workbook.add_sheet(this_category)
            curr_category = this_category
            # Write the header row
            header_cells = ('Category', 'Product', 'Price', 'Contact', 'Mobile', 'Company',
                     'Phone', 'Fax', 'Address', 'Website', 'Source page')

            for column_index, cell in enumerate(header_cells):
                worksheet.write(0, column_index, cell)
            # In a new worksheet the data starts on the second row
            row_index = 1

        prod_info = prod_item['detail']
        # Skip products whose details are missing
        if prod_info is None:
            continue

        # Write the product into row row_index of the worksheet
        prod_cells = (path, prod_info['name'], prod_info['price'], prod_info['contact'],
                 prod_info['cell-phone'], prod_info['company'], prod_info['tel-phone'],
                 prod_info['fax'], prod_info['address'], prod_info['website'], prod_item['url'])

        for column_index, cell in enumerate(prod_cells):
            worksheet.write(row_index, column_index, cell)

        row_index += 1
    workbook.save(filename)
    return True
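
One limitation of the hand-built format string in save_as_csw() is that a field containing a double quote or a tab is written unescaped. As a possible alternative, and not what the original program does, the standard `csv` module can take care of quoting. The sketch below is Python 3; `write_products`, `FIELDS`, and the sample record (including its example.com URL) are made up for illustration.

```python
import csv
import io

# Keys of each product's 'detail' dictionary, in output order
FIELDS = ('name', 'price', 'contact', 'cell-phone', 'company',
          'tel-phone', 'fax', 'address', 'website')

def write_products(prod_list, stream):
    """Write one quoted, tab-separated line per product; return the count."""
    writer = csv.writer(stream, delimiter='\t', quoting=csv.QUOTE_ALL,
                        lineterminator='\n')
    written = 0
    for item in prod_list:
        info = item['detail']
        if info is None:        # skip products whose details are missing
            continue
        writer.writerow([item['path']] + [info[key] for key in FIELDS] + [item['url']])
        written += 1
    return written

# Made-up sample record whose name contains a double quote
sample = {'path': 'tools/nails', 'url': 'http://example.com/p/1',
          'detail': dict.fromkeys(FIELDS, '-')}
sample['detail']['name'] = '5" nail'

buffer = io.StringIO()
write_products([sample, {'path': 'x', 'url': 'y', 'detail': None}], buffer)
print(buffer.getvalue())
```

With `csv.QUOTE_ALL`, every field is wrapped in double quotes and an embedded quote is doubled, so a spreadsheet program can still split the line into the right columns.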

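(2) Define a DataSaver class that provides a single entry point for saving, and use case_dict to dispatch to the right function according to the file extension.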

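import os

# Return the text after the last dot of the file name, or None if there is no dot
def get_filename_postfix(filename):
    basename = os.path.basename(filename)
    temp = basename.split('.')
    if len(temp) >= 2:
        return temp[-1]

class DataSaver:
    # Maps a file extension to the function that saves in that format
    case_dict = {'csw':save_as_csw,
                 'txt':save_as_txt}
    if xlwt_enabled:
        case_dict['xls'] = save_as_xls

    def __init__(self):
        # Start empty so save_as() is safe to call before feed()
        self.product_list = None

    # 'Feed' the product list to the DataSaver
    def feed(self, data):
        self.product_list = data

    def save_as(self, filename):
        if self.product_list is None or len(self.product_list) == 0:
            print('Warning: no records, nothing saved')
            return

        print('Saving...')
        while True:
            postfix = get_filename_postfix(filename)
            try:
                if self.case_dict[postfix](self.product_list, filename):
                    print('Saved to: ' + filename)
                else:
                    print('Save failed!')
                break
            except KeyError:
                # Unknown extension: ask for another file name and try again
                print('Warning: the %s file format is not supported.' % (postfix))
                print('Supported file formats: ' + ','.join(self.case_dict.keys()))
                try:
                    filename = raw_input('Enter a new file name: ')
                except KeyboardInterrupt:
                    print('Save cancelled by user')
                    break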
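
The extension lookup itself can be shown in isolation. The Python 3 sketch below uses `os.path.splitext` rather than splitting on dots; the two placeholder savers only return a label, and `SAVERS` and `pick_saver` are hypothetical names, so this is an illustration of the dispatch idea rather than a replacement for DataSaver.

```python
import os

def save_as_txt(rows, filename):
    # Placeholder for this sketch: a real saver would write the file
    return 'txt:' + filename

def save_as_csw(rows, filename):
    # Placeholder for this sketch: a real saver would write the file
    return 'csw:' + filename

# Maps a lower-case extension (without the dot) to its saver
SAVERS = {'txt': save_as_txt, 'csw': save_as_csw}

def pick_saver(filename):
    """Return the saver registered for the file's extension, or None."""
    ext = os.path.splitext(filename)[1].lstrip('.').lower()
    return SAVERS.get(ext)

saver = pick_saver('products.TXT')
print(saver([], 'products.TXT') if saver else 'unsupported format')
```

Returning None for an unknown extension lets the caller decide what to do next, for example prompting for another file name as save_as() does.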

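(3) Without xlwt installed, saving to xls cannot be supported.
The approach here: if import xlwt succeeds, the xls saving function is added to case_dict.
If the requested file format is not supported, the user is prompted for a different file name.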


# Without xlwt installed, saving as xls is unavailable
xlwt_enabled = True
try:
    import xlwt
except ImportError: 
    xlwt_enabled = False

See how DataSaver.save_as() handles the xls extension.

3. Problems encountered and how they were solved

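(1) Python source files could not contain Chinese.
Previously, whenever a Python program contained Chinese characters, no matter where, it would not run. The reason is that the Python 2 interpreter assumes source files are ASCII-encoded by default, so naturally it does not recognize Chinese. The fix is to tell the interpreter explicitly which encoding the file uses: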

#!/usr/bin/env python
#-*- coding=utf-8 -*-

That is all it takes.

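(2) Installing xlwt3 failed.
I downloaded xlwt3 and ran python setup.py install, which failed with an error saying that print() does not support the print("xxxx", file=f) form. I checked, and Python 2.6 does not accept this syntax by default. So I downloaded xlwt-0.7.5.tar.gz instead and installed that, which worked.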

(3) Garbled text on Windows.
I have not solved this problem yet.
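
The article does not say where the garbled text shows up, so the following is only a guess at one common cause: some Windows programs, Excel among them, may read a UTF-8 text file without a byte order mark using the local code page instead. Assuming that is the cause, a Python 3 sketch could write the file with the `utf-8-sig` codec, which prepends the mark. `save_lines` is a hypothetical helper, and the temporary path exists only for the demonstration.

```python
import codecs
import os
import tempfile

def save_lines(lines, filename):
    """Write text lines as UTF-8 with a leading byte order mark."""
    # newline='' keeps the '\n' characters exactly as given
    with open(filename, 'w', encoding='utf-8-sig', newline='') as wfile:
        wfile.writelines(lines)

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_lines(['分類\t商品\n'], path)

with open(path, 'rb') as rfile:
    raw = rfile.read()
print(raw.startswith(codecs.BOM_UTF8))
```

If the garbling instead appears in the console output, the cause is likely different and this sketch would not help.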