This article comes from the NetEase Cloud Community.
Author: Wang Bei
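Even primary school students are learning Python these days, so as a professional programmer I could not afford to fall behind. I spent the weekend at home learning Python 3. Its basic syntax is fairly simple, and development is more agile than in Java. I will skip the Python 3 basics here and mainly walk through the logic of my small crawler program.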
Module diagram, top to bottom:
As the diagram shows, the whole flow comes down to 5 steps, which involve four Python 3 modules: requests, bs4, re and sqlalchemy.
(1) requests:
A powerful HTTP client library with a rich API. For example, to send a GET request:
    with requests.get(url, params={}, headers={}) as rsp:
        rsp.text  # response body as text
To send a POST request with a JSON body:
    with requests.post(url, json={}, headers={}) as rsp:
        rsp.text  # response body as text
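And so on.
One thing worth explaining is why with ... as is used. The with statement first calls the __enter__() method, and its return value is bound to the name after as; with requests, that is the response object rsp. When the with block finishes, __exit__() is called, and in requests that method closes the response and releases its connection. That is why the program never calls close explicitly. The program later in this article contains an example that shows what with ... as does.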
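To make the order of calls concrete, here is a minimal sketch of the context manager protocol. The class TrackedResource and the function run_demo are hypothetical names made up for this illustration; they are not part of requests.

```python
class TrackedResource:
    """Hypothetical resource that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Runs first; the return value is bound to the name after `as`.
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Runs when the block ends, even if the body raised an exception.
        self.events.append("exit")
        return False  # do not suppress exceptions


def run_demo():
    resource = TrackedResource()
    with resource as r:
        r.events.append("body")
    return resource.events


if __name__ == "__main__":
    print(run_demo())  # ['enter', 'body', 'exit']
```

Because __exit__() also runs when the body raises, cleanup such as closing a response does not need its own try/finally.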
requests has many more powerful features; see https://www.cnblogs.com/lilinwei340/p/6417689.html.
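(2) bs4 BeautifulSoup
Anyone who has used Java knows jsoup, which parses an HTML document into a collection of tags. bs4 works the same way, and its API is broadly similar. For example, to get the 新聞 (News), 地圖 (Map), 視頻 (Video) and 貼吧 (Tieba) entries from the HTML shown below, all we need is: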
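    from bs4 import BeautifulSoup

    # html holds the page source shown below
    soup = BeautifulSoup(html, 'html.parser')
    # Find the div with id "u1", then every <a class="mnav"> inside it
    atags = soup.find('div', {'id': 'u1'}).findChildren('a', {'class': 'mnav'})
    values = []
    for atag in atags:
        values.append(atag.text)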
The program above does what we need. Python has another option for parsing HTML, XPath in the Scrapy framework, which I will cover when I write about Scrapy.
<html> <head> <meta http-equiv=content-type content=text/html;charset=utf-8> <meta http-equiv=X-UA-Compatible content=IE=Edge> <meta content=always name=referrer> <link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css> <title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div > <div > <div > <div id=lg><img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129></div> <form id=form name=f action=//www.baidu.com/s > <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span ><input id=kw name=wd > autocomplete=off autofocus></span><span ><input type=submit id=su value=百度一下 ></span></form> </div> </div> <div id=u1><a href=http://news.baidu.com name=tj_trnews > name=tj_trhao123 > <a href=http://map.baidu.com name=tj_trmap > > href=http://tieba.baidu.com name=tj_trtieba > <noscript><a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login > <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=' + encodeURIComponent(window.location.href + (window.location.search === "" ? "?" : "&") + "bdorz_come=1") + '" name="tj_login" >登陸</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon > </div> </div> <div id=ftCon> <div id=ftConw><p id=lh><a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a></p> <p id=cp>©2017 Baidu <a href=http://www.baidu.com/duty/>使用百度前必讀</a> <a href=http://jianyi.baidu.com/ > src=//www.baidu.com/img/gs.gif></p></div> </div> </div> </body> </html>
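(3) re
The re regular expression module is powerful. Its API includes match, search, findall and sub (which performs replacement), each suited to a different job. For details, see http://www.runoob.com/python3/python3-reg-expressions.html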
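As a rough illustration of how these functions differ, the sketch below runs one pattern through match, search, findall and sub. The sample string and the date pattern are made up for this example.

```python
import re

text = "banner 2018-06-09 updated, next banner 2018-06-10"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(date_pattern, text)  # None, the string starts with "banner"

# search scans the whole string and returns the first match.
first = re.search(date_pattern, text)
first_date = first.group() if first else None  # "2018-06-09"

# findall returns every match; with groups, each item is a tuple of the groups.
all_dates = re.findall(date_pattern, text)

# sub replaces every match; \1 to \3 refer to the captured groups.
slashed = re.sub(date_pattern, r"\1/\2/\3", text)
```

Note that replacement is done with re.sub; the re module has no function called replace.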
(4) sqlalchemy
A database ORM framework for Python. I tried it and found it pleasant to use; it is somewhat like Hibernate in Java, but more flexible.
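With all that said, it is time to post the crawler script itself. Here is the directory layout. As a professional programmer I cannot just write a mess; some architecture is in order, haha.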
------youku_any #package name
--------------datasource.py #manages the data source session
--------------youkubannerdao.py #DAO layer for the Youku banner data the program scrapes
--------------youkuservice.py #the business logic
One more thing: creating the table, which needs little explanation:
    CREATE TABLE `youku_banner` (
      `id` bigint(22) NOT NULL AUTO_INCREMENT,
      `type` int(2) NOT NULL, # Youku banner type: 1 = TV, 2 = movies, 3 = variety shows
      `year` int(4) NOT NULL,
      `month` int(2) NOT NULL,
      `date` int(2) NOT NULL,
      `hour` int(2) NOT NULL,
      `minute` int(2) NOT NULL,
      `img` varchar(255) DEFAULT NULL,
      `title` varchar(255) DEFAULT NULL,
      `url` varchar(255) DEFAULT NULL,
      `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (`id`),
      KEY `idx_uniq` (`year`,`month`,`date`,`hour`) USING BTREE
    ) ENGINE=InnoDB AUTO_INCREMENT=83 DEFAULT CHARSET=utf8mb4
Next comes the implementation:
datasource.py
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    dburl = 'mysql+pymysql://root:123@localhost/youku?charset=utf8'
    # pool_size is set to 100; pooled connections are recycled after 3600 s
    ds = create_engine(dburl, pool_size=100, pool_recycle=3600)

    Session = sessionmaker(bind=ds)
    # session = Session()


    # Session manager class, usable in a with statement
    class SessionManager():
        def __init__(self):
            self.session = Session()

        def __enter__(self):
            return self.session

        # Connections are handled by the pool, so the session is not closed explicitly here
        def __exit__(self, exc_type, exc_val, exc_tb):
            # session.close()
            print('not close')
youkubannerdao.py
    from sqlalchemy import Sequence, Column, Integer, BigInteger, String, TIMESTAMP, text
    from sqlalchemy.ext.declarative import declarative_base
    from youku_any.datasource import SessionManager

    Base = declarative_base()


    # Inherit from the declarative base class Base
    class YoukuBanner(Base):
        # Table name
        __tablename__ = 'youku_banner'

        # Column mappings
        id = Column(BigInteger, Sequence('id'), primary_key=True)
        type = Column(Integer)
        year = Column(Integer)
        month = Column(Integer)
        date = Column(Integer)
        hour = Column(Integer)
        minute = Column(Integer)
        img = Column(String(255))
        title = Column(String(255))
        url = Column(String(255))
        createTime = Column('create_time', TIMESTAMP)

        def add(self):
            # with ... as first runs SessionManager.__enter__(); __exit__() runs when the block ends
            with SessionManager() as session:
                try:
                    session.add(self)
                    session.commit()
                except Exception:
                    session.rollback()

        def addBatch(self, values):
            with SessionManager() as session:
                try:
                    session.add_all(values)
                    session.commit()
                except Exception:
                    session.rollback()

        def select(self, param):
            # Returns a lazy Query; rows are fetched when the caller iterates over it
            with SessionManager() as session:
                return session.query(YoukuBanner).select_from(YoukuBanner).filter(param)

        # The parameter name parma (sic) is kept because callers pass it by keyword
        def remove(self, parma):
            with SessionManager() as session:
                try:
                    session.query(YoukuBanner).filter(parma).delete(synchronize_session='fetch')
                    session.commit()
                except Exception:
                    session.rollback()

        def update(self, param, values):
            with SessionManager() as session:
                try:
                    session.query(YoukuBanner).filter(param).update(values, synchronize_session='fetch')
                    session.commit()
                except Exception:
                    session.rollback()
youkuservice.py
    import requests
    import json
    import re
    import datetime
    from bs4 import BeautifulSoup
    from sqlalchemy import text
    from youku_any.youkubannerdao import YoukuBanner


    def getsoup(url):
        with requests.get(url, params=None, headers=None) as req:
            # Default to utf-8 so that encode is always defined
            encode = 'utf-8'
            if req.encoding != 'utf-8':
                # Prefer the charset declared in the page, then the detected one
                encodings = requests.utils.get_encodings_from_content(req.text)
                if encodings:
                    encode = encodings[0]
                else:
                    encode = req.apparent_encoding
            encode_content = req.content.decode(encode).encode('utf-8')
            soup = BeautifulSoup(encode_content, 'html.parser')
            return soup


    def getbanner(soup):
        # Locate the banner container, then the second script tag inside it
        bannerDivP = soup.find('div', {'id': 'm_86804', 'name': 'm_pos'})
        bannerScript = bannerDivP.findChildren('script', {'type': 'text/javascript'})[1].text
        # The banner data is a JSON array embedded in the script text
        m = re.search(r'\[.*\]', bannerScript)
        banners = json.loads(m.group())
        for banner in banners:
            now = datetime.datetime.now()
            youkubanner = YoukuBanner(type=1, year=now.year, month=now.month, date=now.day,
                                      hour=now.hour, minute=now.minute,
                                      img=banner['img'], title=banner['title'], url=banner['url'])
            youkubanner.add()


    soup = getsoup('http://tv.youku.com/')
    getbanner(soup)

    youkuBanner = YoukuBanner()
    youkuBanner.remove(parma=text('id=67 or id=71'))
    youkuBanner.update(param=text('id=70'), values={'title': YoukuBanner.title + '呼嘯山莊'})

    # Repeated updates and queries, each going through its own with SessionManager() block
    for i in range(0, 10000):
        youkuBanner.update(param=text('id=70'), values={'title': YoukuBanner.title + '呼嘯山莊'})
        bannerList = youkuBanner.select(param=text('id > 66 and id < 77 order by id asc limit 0,7'))
        print("lines--------%d" % i)
        # time.sleep(10)
        for banner in bannerList:
            print(banner.id, banner.minute, banner.img, banner.title)
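The extraction step inside getbanner can be tried on its own without touching the network or the database. The sketch below is only an illustration: script_text is a made-up stand-in for the script body (the real Youku page content is not reproduced here), extract_banners is a hypothetical helper, and the re.DOTALL flag and the None check are additions that the original code does not have.

```python
import json
import re

# Hypothetical script body standing in for the one found on the page.
script_text = """
var bannerData = [{"img": "http://example.com/a.jpg", "title": "Show A", "url": "http://example.com/a"},
                  {"img": "http://example.com/b.jpg", "title": "Show B", "url": "http://example.com/b"}];
"""


def extract_banners(script):
    # Greedy match from the first "[" to the last "]".
    # DOTALL lets "." cross line breaks in case the array spans several lines.
    m = re.search(r"\[.*\]", script, re.DOTALL)
    if m is None:
        return []  # no array found in this script
    return json.loads(m.group())


banners = extract_banners(script_text)
titles = [b["title"] for b in banners]
```

This approach only works when the matched text is valid JSON. If the page ever embeds a JavaScript literal with single quotes or unquoted keys, json.loads will raise an error.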
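That completes a simple crawler script. I am fairly pleased with what two weekend days produced, but this is only the tip of the Python iceberg, and there is plenty more left to explore.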