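This is the first project I built entirely on my own. I'd call it small to medium in size.
Straight to the point. The requirements were: 1. scrape book information for the relevant tags from Douban; 2. keep books of different categories in separate lists, storing each category's book information in its own list; 3. with no IP proxy pool available, fall back on the crude approach of adding a delay between requests.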
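Requirement 3 (throttling instead of a proxy pool) can be sketched on its own as follows. This is a minimal illustration, not the project code: `page_urls` and `fetch_all` are hypothetical helper names, and `fetch` and `sleep` are passed in as parameters only so the sketch can be run without touching the network.

```python
import random
import time


def page_urls(book_tag, pages=10, page_size=20):
    """Build the paged tag URLs: ?start=0, ?start=20, ..."""
    base = 'https://book.douban.com/tag/'
    return [base + book_tag + '?start=' + str(n * page_size) for n in range(pages)]


def fetch_all(urls, fetch, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing a random 0..max_delay seconds first."""
    results = []
    for url in urls:
        sleep(random.uniform(0, max_delay))  # the "crude" delay between requests
        results.append(fetch(url))
    return results
```

In real use, `fetch` would be something like a function wrapping `requests.get`. A random delay rather than a fixed one makes the request pattern look less mechanical, at the cost of a slower crawl.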
Here is the code:
import random
import re
import time

import requests
import xlwt
from requests.exceptions import RequestException


def do_spider(book_tag_lists):
    # Crawl each tag in turn and collect one book list per tag.
    book_lists = []
    for book_tag in book_tag_lists:
        book_list = book_spider(book_tag)
        book_lists.append(book_list)
    return book_lists


def book_spider(book_tag):
    page_num = 0
    book_list = []
    try_times = 0

    while page_num < 10:  # only the first 10 pages are crawled
        url = 'https://book.douban.com/tag/' + book_tag + '?start=' + str(page_num * 20)
        time.sleep(random.uniform(0, 3))  # random float between 0 and 3
        # time.sleep(random.randint(0, 5))  # random integer between 0 and 5, inclusive

        try:
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
                'Accept-Language': 'zh-CN,zh;q=0.9'
            }
            # With these headers the site treats the request as a browser visit,
            # so the page can be scraped.
            response = requests.get(url, headers=headers)
        except RequestException:
            # Return what was collected so far; returning None would break the Excel step.
            return book_list

        if not response.text:
            # Empty page: retry the same page. The retry counter was meant for
            # crawling every book of a category; it is capped so the loop cannot spin forever.
            try_times += 1
            if try_times > 200:
                break
            continue
        response_text = response.text

        pattern = re.compile('<li.*?<a.*?title="(.*?)".*?<div.*?pub">(.*?)</div>.*?'
                             '"rating_nums">(.*?)</span>.*?</li>', re.S)
        items = re.findall(pattern, response_text)  # extract the fields from the page
        for item in items:
            other = item[1].strip()
            other_2 = other.split('/')  # split the string on '/' into a list
            title = item[0]
            author = other_2[0].strip()
            # The date is the second-to-last part; fall back if there are too few parts.
            date = other_2[-2].strip() if len(other_2) >= 2 else 'N/A'
            rating_nums = item[2]
            print(title, author, date, rating_nums)
            book_list.append([title, author, date, rating_nums])

        try_times = 0
        page_num += 1
        print('Downloading Information From Page_' + str(page_num))
    return book_list


# Write the scraped content to Excel, one sheet per tag.
def print_book_list_excel(book_lists, book_tag_lists):
    wb = xlwt.Workbook(encoding='utf-8')
    for i in range(len(book_tag_lists)):
        ws = wb.add_sheet(book_tag_lists[i])
        ws.write(0, 1, 'Title')
        ws.write(0, 2, 'Author')
        ws.write(0, 3, 'Publication date')
        ws.write(0, 4, 'Rating')
        count = 0
        for bl in book_lists[i]:
            count += 1
            ws.write(count, 0, count)  # row number in the first column
            for j in range(4):
                ws.write(count, j + 1, bl[j])
    wb.save('excel.xls')


def main():
    # Douban tags to crawl; each one becomes a sheet name.
    book_tag_lists = ['漫畫', '三毛', '金庸', '二戰', '小說', '科技']
    book_lists = do_spider(book_tag_lists)  # book_spider is called inside do_spider
    print_book_list_excel(book_lists, book_tag_lists)


if __name__ == '__main__':
    main()
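To see what the regular expression and the `split('/')` step actually do, here is a small offline sketch that runs the same pattern on a hand-written HTML fragment. The fragment is invented for illustration and only mimics the shape the pattern expects; it is not real Douban markup, and `parse_books` is a hypothetical helper name.

```python
import re

# Same pattern as in the crawler: title, publication info, rating.
PATTERN = re.compile('<li.*?<a.*?title="(.*?)".*?<div.*?pub">(.*?)</div>.*?'
                     '"rating_nums">(.*?)</span>.*?</li>', re.S)

# Invented sample markup (an assumption, not copied from the site).
SAMPLE_HTML = '''
<li class="subject-item">
  <div class="info">
    <h2><a href="https://example.com/1" title="Book One">Book One</a></h2>
    <div class="pub">
      Author A / Publisher X / 2012-1 / 30.00
    </div>
    <span class="rating_nums">8.9</span>
  </div>
</li>
<li class="subject-item">
  <div class="info">
    <h2><a href="https://example.com/2" title="Book Two">Book Two</a></h2>
    <div class="pub">
      Author B / Translator T / Publisher Y / 2005-6 / 45.00
    </div>
    <span class="rating_nums">7.5</span>
  </div>
</li>
'''


def parse_books(html):
    """Return [title, author, date, rating] rows extracted from the HTML."""
    rows = []
    for title, pub, rating in PATTERN.findall(html):
        parts = pub.strip().split('/')
        author = parts[0].strip()
        # Counting from the end copes with an optional translator field.
        date = parts[-2].strip() if len(parts) >= 2 else 'N/A'
        rows.append([title, author, date, rating])
    return rows


for row in parse_books(SAMPLE_HTML):
    print(row)
```

The second entry has an extra translator field, which is why the date is taken as the second-to-last part rather than by a fixed index from the front.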
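How the project came together: 1. Drawing on earlier experience, I first scraped 10 pages of books under the 「小說」 (fiction) tag and wrote them to a text file. (I ran into a lot of setbacks along the way, especially with the regular expression.)
2. The contents of the text file did not come out in the order title, author, publication date, rating, but shuffled, because a for loop reading from a dict did not return the fields in a fixed order (dict ordering is only guaranteed from Python 3.7 onward).
So I looked at the code I had downloaded and used book_list.append([title, author, date, rating_nums]) in place of the dict, because a for loop reads a list in order.
3. Next, writing the data to an Excel spreadsheet. I first tried the approach from the downloaded code, but it did not work: the sheet was created with nothing written into it. I looked up some methods online, combined them with the downloaded code, and got it working.
4. Finally, I rewrote the code to create one sheet for each book category.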
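Steps 2 to 4 boil down to keeping each book as a list in a fixed field order and mapping those lists onto sheet cells. The sketch below shows that mapping without xlwt, so it runs with the standard library alone. Plain dictionaries stand in for sheets, which is an assumption made purely for illustration; `layout_cells`, `layout_workbook` and the English `HEADERS` are hypothetical names.

```python
HEADERS = ['Title', 'Author', 'Publication date', 'Rating']


def layout_cells(book_list):
    """Map list rows to {(row, col): value}, mirroring the ws.write(row, col, value) calls."""
    cells = {}
    for col, header in enumerate(HEADERS, start=1):
        cells[(0, col)] = header          # header row, columns 1..4
    for count, book in enumerate(book_list, start=1):
        cells[(count, 0)] = count         # running row number in column 0
        for j in range(4):
            cells[(count, j + 1)] = book[j]  # list order decides the column
    return cells


def layout_workbook(book_lists, book_tag_lists):
    """One 'sheet' per tag, keyed by tag name."""
    return {tag: layout_cells(books)
            for tag, books in zip(book_tag_lists, book_lists)}


workbook = layout_workbook(
    [[['Book One', 'Author A', '2012-1', '8.9']], []],
    ['tag_a', 'tag_b'],
)
print(workbook['tag_a'][(1, 1)])  # the title cell of the first book
```

Because each book is a list, index `j` always refers to the same field, so the columns line up with the headers regardless of how the fields were collected.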