1. Analyze the HTML of the Zhihu Daily page (http://daily.zhihu.com/)
    <span class="title">這樣一分析,你就明白該在哪裏投廣告了</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9269183" class="link-button"><img src="http://pic2.zhimg.com/05c67496e38f662958a141847a734ffd.jpg" class="preview-image"><span class="title">有些熱鬧的「共享經濟」,恐怕只是一個美好的童話</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9268452" class="link-button"><img src="http://pic3.zhimg.com/da8d8d3bf282170379c51ea0cf1ae4a6.jpg" class="preview-image"><span class="title">讓孩子擁有屬於本身的無聊時光,到底有多重要?</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9269792" class="link-button"><img src="http://pic2.zhimg.com/245f3cf8dd4bfdd1bf0911ba4b486295.jpg" class="preview-image"><span class="title">看不懂,說人話,否則公司就虧大發了</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9269818" class="link-button"><img src="http://pic3.zhimg.com/8b1588e23187bb05d160d599bbfe1752.jpg" class="preview-image"><span class="title">《金剛狼 3》中有哪些隱藏的彩蛋和有趣的細節?</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9266807" class="link-button"><img src="http://pic1.zhimg.com/4cb4d5dec4a68553e41dcbd483010e84.jpg" class="preview-image"><span class="title">瞎扯 · 如何正確地吐槽</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9259222" class="link-button"><img src="http://pic4.zhimg.com/16378a8129349aface9694cd27c71e2f.jpg" class="preview-image"><span class="title">小事 · 愛無能</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9267167" class="link-button"><img src="http://pic1.zhimg.com/fdf5e0ff47de69d615f706559a260168.jpg" class="preview-image">
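Each story's title is in a span tag (class "title"), and its preview image is the img tag (class "preview-image") just before it; both sit inside the story's a tag (class "link-button"):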
    <span class="title">這樣一分析,你就明白該在哪裏投廣告了</span></a></div></div>
    <div class="wrap"><div class="box"><a href="/story/9269183" class="link-button"><img src="http://pic2.zhimg.com/05c67496e38f662958a141847a734ffd.jpg" class="preview-image">
It is straightforward to build a regular expression that matches the markup above and captures the title, the link and the image URL:
    pattern = re.compile(
        u'<span class="title">(.*?)</span>.*?'
        + u'<a href="(.*?)".*?'
        + u'<img src="(.*?)".*?',
        re.S)
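One detail worth checking: in the markup shown earlier, each story's a and img tags come before its span title. A pattern that starts at a title and scans forward therefore appears to pair each title with the link and image of the following story. The sketch below illustrates this on two entries copied from the sample, and also tries an alternative pattern that follows the page order. SAMPLE, ITEM_PATTERN and parse_items are names made up for this illustration and are not part of the original post.

```python
import re

# Two story entries copied from the sample markup above.
SAMPLE = (
    '<div class="wrap"><div class="box">'
    '<a href="/story/9269183" class="link-button">'
    '<img src="http://pic2.zhimg.com/05c67496e38f662958a141847a734ffd.jpg" class="preview-image">'
    '<span class="title">有些熱鬧的「共享經濟」,恐怕只是一個美好的童話</span></a></div></div>'
    '<div class="wrap"><div class="box">'
    '<a href="/story/9268452" class="link-button">'
    '<img src="http://pic3.zhimg.com/da8d8d3bf282170379c51ea0cf1ae4a6.jpg" class="preview-image">'
    '<span class="title">讓孩子擁有屬於本身的無聊時光,到底有多重要?</span></a></div></div>'
)

# The pattern from the post: title first, then the next link and image.
ORIGINAL_PATTERN = re.compile(
    r'<span class="title">(.*?)</span>.*?'
    r'<a href="(.*?)".*?'
    r'<img src="(.*?)".*?',
    re.S,
)

# Hypothetical alternative: link, image, title, in the order they appear.
ITEM_PATTERN = re.compile(
    r'<a href="(.*?)" class="link-button">'
    r'<img src="(.*?)" class="preview-image">'
    r'<span class="title">(.*?)</span>',
    re.S,
)


def parse_items(html):
    """Return one dict per story, with fields taken from the same entry."""
    return [
        {"href": href, "img": img, "title": title}
        for href, img, title in ITEM_PATTERN.findall(html)
    ]


# With the original pattern the first title is paired with the second story.
shifted = ORIGINAL_PATTERN.findall(SAMPLE)
items = parse_items(SAMPLE)

for item in items:
    print(item["href"], item["img"], item["title"])
```

Note that the alternative returns the groups in a different order (link, image, title), so any code that indexes into the tuples would need to be adjusted accordingly.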
2. Download the page and match it against the regular expression
First, build the request headers:
    self.header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'daily.zhihu.com',
        'Referer': 'https://www.baidu.com/link?url=Eh6CKs72Buyf0LEjPd1795QSL8ZK74kwItBvzaybausT6proAZIr3UkkmMPSDfk7&wd=&eqid=d1bd8fd9004118c90000000258bd8149',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36'
    }

Then download the page and use the regular expression to extract the URLs:
    html = self.req.get(url=self.url, headers=self.header)
    pattern = re.compile(
        u'<span class="title">(.*?)</span>.*?'
        + u'<a href="(.*?)".*?'
        + u'<img src="(.*?)".*?',
        re.S)
    T = list()  # worker threads, filled in step 3
    self.l = re.findall(pattern, html.text)  # list of (title, link, image URL) tuples
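The story links captured from the page are relative paths such as /story/9269183, so they need to be joined with the site URL before they can be requested. The following is a minimal sketch of that step together with a slightly more defensive download call. The names fetch_html and absolute_links and the 10-second timeout are assumptions made for this example; session can be any object with a requests-style get() method, such as the requests.Session used in this post.

```python
from urllib.parse import urljoin

BASE_URL = "http://daily.zhihu.com/"


def fetch_html(session, url, headers=None, timeout=10):
    """Download a page and return its text.

    `session` is expected to behave like requests.Session: get() returns a
    response with raise_for_status() and a .text attribute.
    """
    resp = session.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    return resp.text


def absolute_links(hrefs, base=BASE_URL):
    """Turn relative story paths into absolute URLs."""
    return [urljoin(base, href) for href in hrefs]


print(absolute_links(["/story/9269183", "/story/9268452"]))
```

Passing the session in as a parameter keeps the helper easy to try out with a stand-in object, without touching the network.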
3. Download the images with multiple threads
    T = list()
    self.l = re.findall(pattern, html.text)
    for i in self.l:
        self.w.write(str(i) + '\n')  # log each match to the output file
        # i[2] is the image URL; self.n becomes the file name
        t = Thread(target=self.getimg, args=(i[2], self.n))
        T.append(t)
        t.start()
        self.n += 1
    # wait for every download to finish
    for tt in T:
        tt.join()

    def getimg(self, src, n):
        try:
            h = self.req.get(url=src)
            # the with block closes the file even if the write fails
            with open(str(n) + '.jpg', 'wb') as s:
                s.write(h.content)
        except requests.exceptions.MissingSchema:
            print('invalid URL', n)
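Starting one thread per image works for a page of this size, but the number of threads grows with the number of matches. As an alternative sketch (not the approach used in this post), a bounded pool from the standard library's concurrent.futures can cap the concurrency and collect failures. The names save_image and download_all, the fetch callback and the default of 8 workers are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_image(fetch, src, path):
    """Download one image with fetch(src) -> bytes and write it to path."""
    data = fetch(src)
    Path(path).write_bytes(data)
    return path


def download_all(fetch, srcs, out_dir=".", max_workers=8):
    """Download every URL in srcs using at most max_workers threads.

    Returns (saved_paths, failures), where failures is a list of
    (src, exception) pairs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Files are named 0.jpg, 1.jpg, ... like in the original script.
        futures = {
            pool.submit(save_image, fetch, src, out / f"{n}.jpg"): src
            for n, src in enumerate(srcs)
        }
        for future, src in futures.items():
            try:
                saved.append(future.result())
            except Exception as exc:  # keep going if one download fails
                failed.append((src, exc))
    return saved, failed
```

With requests, fetch could be something like `lambda src: session.get(src, timeout=10).content`, where session is a requests.Session.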
4. Result
Full code:
    #!/usr/bin/python3
    # coding: utf8
    import re
    from threading import Thread

    import requests


    class main(object):
        def __init__(self):
            self.url = 'http://daily.zhihu.com/'
            self.l = list()   # regex matches: (title, link, image URL)
            self.n = 0        # running index, used as the image file name
            self.req = requests.Session()
            self.header = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Accept-Encoding': 'gzip, deflate, sdch',
                'Accept-Language': 'zh-CN,zh;q=0.8',
                'Cache-Control': 'max-age=0',
                'Connection': 'keep-alive',
                'Host': 'daily.zhihu.com',
                'Referer': 'https://www.baidu.com/link?url=Eh6CKs72Buyf0LEjPd1795QSL8ZK74kwItBvzaybausT6proAZIr3UkkmMPSDfk7&wd=&eqid=d1bd8fd9004118c90000000258bd8149',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36'
            }

        def getpage(self):
            # download the front page and extract every match
            html = self.req.get(url=self.url, headers=self.header)
            pattern = re.compile(
                u'<span class="title">(.*?)</span>.*?'
                + u'<a href="(.*?)".*?'
                + u'<img src="(.*?)".*?',
                re.S)
            T = list()
            self.l = re.findall(pattern, html.text)
            for i in self.l:
                self.w.write(str(i) + '\n')
                # one thread per image; i[2] is the image URL
                t = Thread(target=self.getimg, args=(i[2], self.n))
                T.append(t)
                t.start()
                self.n += 1
            # wait for every download to finish
            for tt in T:
                tt.join()

        def getimg(self, src, n):
            try:
                h = self.req.get(url=src)
                with open(str(n) + '.jpg', 'wb') as s:
                    s.write(h.content)
            except requests.exceptions.MissingSchema:
                print('invalid URL', n)


    if __name__ == '__main__':
        p = main()
        # explicit encoding so the Chinese titles are written correctly
        p.w = open('zh.txt', 'w', encoding='utf-8')
        p.getpage()
        p.w.close()
Project repository: https://git.oschina.net/nanxun/zhihuribao