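A coroutine is neither a process nor a thread. Its execution is closer to that of a subroutine, or to a function call without a return value.
When a coroutine hits a blocking operation, it switches to another subroutine and resumes where it left off once the blocking ends.
In gevent, context switching is done through yielding.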
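To make the idea of yielding concrete, here is a minimal sketch that uses plain Python generators and a tiny round-robin scheduler. It is only an analogy: gevent itself is built on greenlets rather than generators, and the names `worker` and `run_all` are made up for this illustration.

```python
from collections import deque


def worker(name, steps, log):
    """A toy task that records one line per step, then yields control."""
    for i in range(steps):
        log.append("%s step %d" % (name, i))
        yield  # hand control back to the scheduler


def run_all(tasks):
    """Run the tasks round-robin until every one of them has finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # resume the task until its next yield
        except StopIteration:
            continue            # the task is finished, drop it
        queue.append(task)      # not finished yet, schedule it again


log = []
run_all([worker("A", 2, log), worker("B", 2, log)])
print(log)
```

The two tasks take turns: each runs until it yields, then the other one continues from where it stopped.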
The code uses requests and XPath.
If you are not familiar with XPath --> link
If you are not familiar with requests --> link
monkey.patch_all()
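This modifies existing code dynamically at runtime, without having to change the original source.
Official documentation link --> monkey.patch_all()
A Chinese-language gevent guide is also worth reading --> link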
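The general technique behind this is monkey patching: swapping an attribute of a module at runtime so that existing callers pick up the replacement. The sketch below shows the idea with the standard library only. It is not how gevent is implemented, and `fake_sleep` and `slow_task` are hypothetical names used for illustration.

```python
import time

calls = []                      # records what the replacement was asked to do
original_sleep = time.sleep     # keep a reference so the patch can be undone


def fake_sleep(seconds):
    """Stand-in for time.sleep that records the request instead of blocking."""
    calls.append(seconds)


def slow_task():
    """Existing code that we do not want to edit."""
    time.sleep(5)
    return "done"


time.sleep = fake_sleep         # patch: slow_task now reaches fake_sleep
try:
    result = slow_task()
finally:
    time.sleep = original_sleep  # always restore the original function

print(result, calls)
```

`slow_task` was never modified, yet it no longer blocks. In a similar spirit, `monkey.patch_all()` replaces blocking standard-library functions with cooperative versions.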
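Without further ado, here is the code.
The program checks the domain and deduplicates URLs.
exp_url is defined as a set() so that each URL is stored only once. A list, a dict, or a database would also work.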
exp_url=set()
This is the deduplication part:
if domain in url:
    if url in exp_url:
        return
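The same check can be tried out on its own. This is a small sketch in which `should_visit` is a hypothetical helper that does not appear in the crawler, and the URLs are made-up examples.

```python
def should_visit(url, domain, seen):
    """Return True only for in-domain URLs that have not been seen before."""
    if domain not in url:
        return False    # outside the target domain
    if url in seen:
        return False    # already visited
    seen.add(url)       # remember it so the next call skips it
    return True


seen = set()
first = should_visit("http://www.quanxue.cn/", "quanxue.cn", seen)
second = should_visit("http://www.quanxue.cn/", "quanxue.cn", seen)
other = should_visit("http://example.com/", "quanxue.cn", seen)
print(first, second, other)
```

A set keeps the membership test cheap even when many URLs have been collected, which is the reason for preferring it to a list here.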
Full code
from gevent import monkey
# Patch the standard library before requests is imported, so its sockets are cooperative.
monkey.patch_all()

from urllib.parse import urljoin

import gevent
import requests
from lxml import etree

domain = "quanxue.cn"
exp_url = set()       # URLs that have already been visited
defeated_url = []     # URLs whose request or parsing failed


def requ(url):
    if domain in url:
        if url in exp_url:
            return
        exp_url.add(url)
        print("GET:%s" % url)
        try:
            req = requests.get(url)
            select = etree.HTML(req.content)
            links = select.xpath("//a/@href")
            # Resolve each link against the current page and crawl it in its own greenlet.
            jobs = [gevent.spawn(requ, urljoin(url, link)) for link in links]
            gevent.joinall(jobs)
            print(len(exp_url))
        except Exception:
            print("ERROR")
            defeated_url.append(url)


if __name__ == '__main__':
    try:
        requ("http://www.quanxue.cn")
    except KeyboardInterrupt:
        # Stop early on Ctrl-C; the results are still printed below.
        pass
    finally:
        print(defeated_url)
        print(exp_url)
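Relative href values have to be turned into absolute URLs before they can be requested. One way to do that, shown here as a sketch, is `urljoin` from the standard library. The paths below are invented for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin

# Base URL without a trailing slash: the link is resolved from the site root.
root_link = urljoin("http://www.quanxue.cn", "ct_rujia/index.html")

# Base URL that points at a page: the link replaces the last path segment.
sibling_link = urljoin("http://www.quanxue.cn/a/index.html", "b.html")

# A link that is already absolute is returned unchanged.
absolute_link = urljoin("http://www.quanxue.cn/a/index.html", "https://example.com/x")

print(root_link)
print(sibling_link)
print(absolute_link)
```

Because absolute links pass through unchanged, relative and absolute href values can share a single code path.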