Let's start with the Zen of Python (hehe).
Analyzing the "Selected Articles" (文章精選) column on the Youth Digest (青年文摘) website: http://www.qnwz.cn/html/221/list_1.html
Page source:
<strong>當前位置:</strong><a href='http://www.qnwz.cn/'>主頁</a>><a href='/html/239/'>《青年文摘·快點》</a>><a href='/html/221/'>文章精選</a>>
</div>
<div class="listbox">
<ul class="e2">
<li>
<a href='/html/221/201603/618083.html' class='preview'><img src='http://www.qnwz.cn///uploads/allimg/160315/1-160315105620961-lp.jpg'/></a>
<a href="/html/221/201603/618083.html" class="title"><b>視野|歪果仁找工做也拼爹?</b></a>
<span class="info">
<small>日期:</small>2016-03-15 10:54:49
<small>好評:</small>0
<small>得分:</small>0
</span>
Every article title and article URL sits inside the div with class=listbox, and the column has 68 pages.
1. So, import requests and BeautifulSoup, two commonly used scraping libraries.
#!/usr/bin/python3
#coding:utf8
import requests
from bs4 import BeautifulSoup
2. Build the addresses of all the list pages (pages 1 to 68).
def geturl(self):
    # range() excludes its stop value, so use 69 to cover pages 1 through 68
    for i in range(1, 69):
        root_url = 'http://www.qnwz.cn/html/221/list_'
        root_url += str(i) + '.html'
        self.l.append(root_url)
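As a standalone illustration of the page-address step, here is a minimal sketch. The names `BASE` and `build_page_urls` are made up for this example and are not part of the original script. The point to watch is that `range()` leaves out its stop value, so covering pages 1 through 68 needs a stop of 69.

```python
BASE = "http://www.qnwz.cn/html/221/list_"  # list-page prefix taken from the article


def build_page_urls(first_page=1, last_page=68):
    """Return the list-page URLs for first_page..last_page, both inclusive."""
    # range() stops one short of its second argument, hence last_page + 1.
    return [f"{BASE}{page}.html" for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    urls = build_page_urls()
    print(len(urls), urls[0], urls[-1])
```

Taking the page bounds as parameters means the function should still work if the column grows past 68 pages.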
3. Download every page address collected above (pages 1 to 68).
text = self.req.get(url=url)
4. Extract the titles and article URLs from the downloaded pages.
def parser(self, r):
    soup = BeautifulSoup(r.content, 'html.parser')
    # Narrow the search to the listbox div(s), then re-parse that fragment
    ur = soup.find_all('div', class_='listbox')
    soup = BeautifulSoup(str(ur), 'html.parser')
    titleurl = soup.find_all('a', class_='title')
    for i in titleurl:
        self.n = self.n + 1
        # get_text() also works when the link wraps its title in nested tags
        s = 'title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href'] + '\n'
        print(s)
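To check the extraction logic offline, the sketch below runs against a trimmed copy of the markup quoted earlier. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it runs without third-party packages, and it uses `urllib.parse.urljoin` instead of string concatenation to build absolute URLs. `TitleLinkParser`, `extract_title_links`, and `SAMPLE_HTML` are names invented for this example. Unlike the method above, it does not restrict the search to the listbox div, and it only matches links whose class attribute is exactly `title`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.qnwz.cn/"  # site root taken from the article

# Trimmed copy of the list markup quoted in the article (image path shortened).
SAMPLE_HTML = """
<div class="listbox">
<ul class="e2">
<li>
<a href='/html/221/201603/618083.html' class='preview'><img src='x.jpg'/></a>
<a href="/html/221/201603/618083.html" class="title"><b>視野|歪果仁找工做也拼爹?</b></a>
</li>
</ul>
</div>
"""


class TitleLinkParser(HTMLParser):
    """Collect (title, absolute_url) pairs from <a class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the title link currently open, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: the class attribute is exactly "title", as in the sample.
        if tag == "a" and attrs.get("class") == "title":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append((title, urljoin(BASE_URL, self._href)))
            self._href = None


def extract_title_links(html):
    """Parse an HTML string and return the collected (title, url) pairs."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    for title, url in extract_title_links(SAMPLE_HTML):
        print("title=" + title + ",url=" + url)
```

`urljoin` resolves the site-relative `href` against the site root, so the result should stay correct even if a link turns out to be an absolute URL already.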
Output:
Full source:
#!/usr/bin/python3
#coding:utf8
import requests
from bs4 import BeautifulSoup


class main(object):
    def __init__(self):
        self.l = list()                  # list-page URLs to fetch
        self.req = requests.Session()    # one session reused for every request
        self.T = []
        self.n = 0                       # running count of articles found
        self.geturl()
        for i in self.l:
            self.gethtml(i)
        print('Total: ' + str(self.n) + ' articles')

    def geturl(self):
        # range() excludes its stop value, so use 69 to cover pages 1 through 68
        for i in range(1, 69):
            root_url = 'http://www.qnwz.cn/html/221/list_'
            root_url += str(i) + '.html'
            self.l.append(root_url)

    def parser(self, r):
        soup = BeautifulSoup(r.content, 'html.parser')
        # Narrow the search to the listbox div(s), then re-parse that fragment
        ur = soup.find_all('div', class_='listbox')
        soup = BeautifulSoup(str(ur), 'html.parser')
        titleurl = soup.find_all('a', class_='title')
        for i in titleurl:
            self.n = self.n + 1
            # get_text() also works when the link wraps its title in nested tags
            s = 'title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href'] + '\n'
            print(s)

    def gethtml(self, url):
        text = self.req.get(url=url)
        self.parser(text)


if __name__ == '__main__':
    main()
My writing isn't great, the code is simple, and the write-up is fairly brief too. If you spot any mistakes, corrections are welcome.