Web Scraping for Beginners, Part 1 - Scraping Basics
1. Differences between urllib and urllib2, with examples
urllib and urllib2 are both modules for making URL requests, but they provide different functionality. The two most notable differences are:
urllib can accept a URL, but it cannot create a Request instance with custom headers; urllib2 can. Conversely, urllib provides urlencode() for building query strings, which urllib2 does not.
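A minimal sketch of the contrast (Python 2; example.com and the shortened header value are placeholders, not part of the original examples):

#coding:utf-8
import urllib
import urllib2

# urllib2 can wrap a URL in a Request object that carries custom headers
headers = {"User-Agent": "Mozilla/5.0"}                 # illustrative value
request = urllib2.Request("http://www.example.com", headers=headers)
response = urllib2.urlopen(request)

# urllib cannot build such a Request, but it does provide urlencode(),
# which urllib2 lacks
print urllib.urlencode({"wd": "test"})                  # prints: wd=test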
URL encoding
https://www.baidu.com/s?wd=%E5%AD%A6%E7%A5%9E
How Python decodes and encodes the character set:
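A small sketch of that round trip, using urllib.quote and urllib.unquote (Python 2; the keyword matches the percent-encoded URL above):

#coding:utf-8
import urllib

word = "学神"                          # a UTF-8 byte string in Python 2
encoded = urllib.quote(word)          # percent-encode the bytes
print encoded                         # %E5%AD%A6%E7%A5%9E

decoded = urllib.unquote(encoded)     # decode back to the original bytes
print decoded                         # 学神

# a unicode object must be encoded to UTF-8 bytes before quoting
print urllib.quote(u"学神".encode("utf-8"))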
2. Submitting a GET request
#coding:utf-8
import urllib    # handles the URL encoding
import urllib2

url = "https://www.baidu.com/s"
word = {"wd": "繁華"}
word = urllib.urlencode(word)    # convert to URL-encoded format (a string)
newurl = url + "?" + word

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
print response.read()
GET request example with a user-supplied keyword
#coding:utf-8
import urllib    # handles the URL encoding
import urllib2

url = "https://www.baidu.com/s"
keyword = raw_input("Enter the keyword to search for: ")
word = {"wd": keyword}
word = urllib.urlencode(word)    # convert to URL-encoded format (a string)
newurl = url + "?" + word

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
print response.read()
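Beyond read(), the response object returned by urlopen() also exposes some metadata; a quick self-contained sketch (getcode, geturl, and info are standard urllib2 response methods, and the query here is just a placeholder):

#coding:utf-8
import urllib2

response = urllib2.urlopen("https://www.baidu.com/s?wd=test")
print response.getcode()    # HTTP status code, e.g. 200
print response.geturl()     # final URL after any redirects
print response.info()       # the response headers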
3. Crawling Tieba pages in bulk
Analyzing the Tieba URL (a short sketch after this list generates these URLs programmatically):

http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F

The pn parameter increases by 50 per page:

pn=0 is page 1:
http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&pn=0
pn=50 is page 2:
http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&pn=50
pn=100 is page 3:
http://tieba.baidu.com/f?kw=%E7%BB%9D%E5%9C%B0%E6%B1%82%E7%94%9F&pn=100
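In other words, pn = (page - 1) * 50. A quick sketch that prints the three URLs above (the keyword is hard-coded for illustration):

#coding:utf-8
import urllib

base = "http://tieba.baidu.com/f?" + urllib.urlencode({"kw": "绝地求生"})
for page in range(1, 4):
    pn = (page - 1) * 50               # page 1 -> 0, page 2 -> 50, page 3 -> 100
    print base + "&pn=" + str(pn)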
Example: crawling Tieba pages in bulk
#coding:utf-8
import urllib
import urllib2

def loadPage(url, filename):
    '''
    Purpose: send a request to the given url and fetch the server's response.
    url: the URL to crawl
    filename: the file name (used only for progress output)
    '''
    print "Downloading " + filename
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    return response.read()

def writeFile(html, filename):
    '''
    Purpose: save the server's response to a file on the local disk.
    html: the server's response
    filename: the local file name
    '''
    print "Saving " + filename
    with open(unicode(filename, 'utf-8'), 'w') as f:
        f.write(html)
    print "-" * 20

def tiebaSpider(url, beginPage, endPage, name):
    '''
    Purpose: build the URL of each page and dispatch the requests.
    url: the first URL to process
    beginPage: the page the crawl starts from
    endPage: the page the crawl stops at
    name: the forum keyword, used in the file names
    '''
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        filename = name + "-page-" + str(page) + ".html"
        # build the full URL; pn grows by 50 for each page
        fullurl = url + "&pn=" + str(pn)
        #print fullurl
        # call loadPage() to send the request and fetch the HTML page
        html = loadPage(fullurl, filename)
        # write the fetched HTML page to a local file
        writeFile(html, filename)

# simulated main function
if __name__ == "__main__":
    kw = raw_input("Enter the Tieba forum to crawl: ")
    beginPage = int(raw_input("Enter the start page: "))
    endPage = int(raw_input("Enter the end page: "))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw": kw})
    # example of the combined url: http://tieba.baidu.com/f?kw=絕地求生
    newurl = url + key
    tiebaSpider(newurl, beginPage, endPage, kw)
4. Installing and using Fiddler
Download Fiddler.
After installing it, configure it as shown in the screenshot below.
5. Analyzing the Youdao Translate POST request
Example: calling the Youdao Translate POST API
#coding:utf-8
import urllib
import urllib2

# URL captured with a packet-sniffing tool; not the URL shown in the browser
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

# the complete headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Origin": "http://fanyi.youdao.com",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}

# user input
key = raw_input("Enter the English word to translate: ")

# the form data sent to the web server
formdata = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "1513519318944",
    "sign": "29e3219b8c4e75f76f6e6aba0bb3c4b5",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false"
}

# urlencode the form data
data = urllib.urlencode(formdata)

# if the data argument of Request() is set, the request is a POST;
# if not, it is a GET
request = urllib2.Request(url, data=data, headers=headers)
print urllib2.urlopen(request).read()
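Because the form data sets doctype=json, the server answers with JSON, which can be parsed with the standard json module. A hedged sketch follows; the errorCode and translateResult layout below reflect what this API commonly returned at the time and are an assumption, not guaranteed:

#coding:utf-8
import json

# `html` stands for the string returned by urllib2.urlopen(request).read();
# a sample payload is used here so the sketch runs on its own
html = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "hi there"}]]}'

result = json.loads(html)
if result.get("errorCode") == 0:
    print result["translateResult"][0][0]["tgt"]    # the translated text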
6. Fetching data loaded via Ajax
Analyzing Douban's Ajax requests
Example: scraping Douban's movie chart data
#coding:utf-8
import urllib2
import urllib

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action="
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}

formdata = {
    "start": "0",     # offset of the first record
    "limit": "20"     # number of records per request
}

data = urllib.urlencode(formdata)
request = urllib2.Request(url, data=data, headers=headers)
print urllib2.urlopen(request).read()
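Since start and limit page through the chart, the same request can be repeated with an increasing start to pull more records. A minimal sketch under the assumption that the endpoint returns a JSON array of movie dicts with a "title" key (true when this was written, but not guaranteed):

#coding:utf-8
import urllib
import urllib2
import json

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action="
headers = {"User-Agent": "Mozilla/5.0"}               # shortened, illustrative value

for start in range(0, 60, 20):                        # three pages of 20 records
    data = urllib.urlencode({"start": str(start), "limit": "20"})
    request = urllib2.Request(url, data=data, headers=headers)
    movies = json.loads(urllib2.urlopen(request).read())
    for movie in movies:
        print movie.get("title")                      # "title" key is assumed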