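This post uses Python's BeautifulSoup to build a crawler that scrapes the questions and witty replies ("神回覆") from a Zhihu collection, for use in natural language processing research. I first got into Python crawlers after being inspired by a shared code snippet, 《獲取城市的PM2.5濃度和排名(含單線程和多線程)》 ("Getting a City's PM2.5 Concentration and Ranking (Single-threaded and Multi-threaded)").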
    #!/usr/bin/env python
    # by keyven
    # PM25.py
    import urllib2

    from bs4 import BeautifulSoup


    def getPm25(cityname):
        # Build the city page URL, e.g. http://www.pm25.com/xianggang.html
        site = 'http://www.pm25.com/' + cityname + '.html'
        html = urllib2.urlopen(site)
        soup = BeautifulSoup(html, 'html.parser')

        city = soup.find(class_='bi_loaction_city')  # city name
        aqi = soup.find('a', {'class': 'bi_aqiarea_num'})  # AQI index
        quality = soup.select('.bi_aqiarea_right span')  # air quality level (not used in the output below)
        result = soup.find('div', class_='bi_aqiarea_bottom')  # air quality description

        s = city.text + '\nAQI:' + aqi.text + '\nAir Quality:' + ' ' + result.text
        return s
    #!/usr/bin/env python
    # by keyven
    # main.py
    from PM25 import getPm25

    s = getPm25('xianggang')  # 'xianggang' is the pinyin for Hong Kong
    print s
Result (screenshot):
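Let's start crawling from this URL (http://www.zhihu.com/collection/27109279?page=1). Open the page and look closely at its source, and a structural pattern emerges: each question and its answer sit as two child blocks inside the same parent block. With that in mind, we can start writing the code: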
    #!/usr/bin/env python
    # by keyven
    # main.py
    import urllib2

    from bs4 import BeautifulSoup

    for p in range(1, 5):  # pages 1 to 4 of the collection
        url = "http://www.zhihu.com/collection/27109279?page=" + str(p)
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')

        # Each 'zm-item' is the parent block holding one question and its answer
        allp = soup.find_all(class_='zm-item')
        for each in allp:
            answer = each.find_next(class_='zh-summary summary clearfix')
            if len(answer.text) > 100:
                continue  # answer is too long and may be truncated behind a "show all" link, so skip it
            problem = each.find_next(class_='zm-item-title')
            print problem.text,
            print answer.text
Result (screenshot):
Looking at the output, we notice a small bug. Checking the original page:
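It turns out the cause is that the same question appears with two identical answers. We can handle this with a bit of application-level logic: within each page, keep a set and use it to check whether an answer has already been seen.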
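The set-based check could be sketched roughly as follows. This is only a minimal illustration, written for Python 3 and kept free of network access so it can be run on its own; the helper name `unique_pairs` and the sample data are hypothetical and not part of the original crawler. It assumes the (question, answer) texts for one page have already been extracted.

```python
def unique_pairs(pairs):
    """Yield (problem, answer) pairs, skipping answers already seen.

    `pairs` is assumed to hold the texts extracted from a single page,
    so the set of seen answers starts empty for every page.
    """
    seen = set()
    for problem, answer in pairs:
        key = answer.strip()  # ignore surrounding whitespace when comparing
        if key in seen:
            continue  # the same answer already appeared on this page
        seen.add(key)
        yield problem, answer


# Hypothetical sample data standing in for one scraped page
page_pairs = [
    ("Q1", "A witty reply"),
    ("Q1", "A witty reply"),
    ("Q2", "Another reply"),
]
result = list(unique_pairs(page_pairs))
for problem, answer in result:
    print(problem, answer)
```

In the crawler itself, the same idea would amount to creating an empty set at the top of each page iteration and skipping any answer whose text is already in it.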
如何正確的吐槽 ("How to Roast Properly")