Scraping 999 selected articles from Qingnian Wenzhai (青年文摘) with Python 3

Let's start with the Zen of Python (hehe)

Analyze the selected-articles section of the Qingnian Wenzhai official site: http://www.qnwz.cn/html/221/list_1.html

Page source:

<strong>當前位置:</strong><a href='http://www.qnwz.cn/'>主頁</a>><a href='/html/239/'>《青年文摘·快點》</a>><a href='/html/221/'>文章精選</a>>
  </div>
  <div class="listbox">
  <ul class="e2">
  <li>
  <a href='/html/221/201603/618083.html' class='preview'><img src='http://www.qnwz.cn///uploads/allimg/160315/1-160315105620961-lp.jpg'/></a>
  <a href="/html/221/201603/618083.html" class="title"><b>視野|歪果仁找工作也拼爹?</b></a>
  <span class="info">
  <small>日期:</small>2016-03-15 10:54:49
  <small>好評:</small>0
  <small>得分:</small>0
 

</span>

......

It turns out that every article title and article URL sits inside the div with class=listbox, and that this section has 68 pages.

1. So, import requests and BeautifulSoup, two commonly used scraping libraries

#!/usr/bin/python3
#coding:utf8
import requests
from bs4 import BeautifulSoup

2. Build the addresses of all the pages (pages 1 to 68)

def geturl(self):
    # range() excludes its stop value, so use 69 to include page 68
    for i in range(1,69):
        root_url='http://www.qnwz.cn/html/221/list_'
        root_url+=str(i)+'.html'
        self.l.append(root_url)
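
The URL-building step can also be written as a standalone function, which makes the page range easy to check. This is only a sketch: `build_page_urls` and its parameters are hypothetical names, not part of the original script. The default of 68 pages comes from the count observed above. Because `range()` excludes its stop value, the stop has to be `last_page + 1` for the final page to be included.

```python
def build_page_urls(last_page=68, root='http://www.qnwz.cn/html/221/list_'):
    """Return the list-page URLs for pages 1..last_page (inclusive)."""
    # range() stops one short of its second argument, hence last_page + 1.
    return [f'{root}{page}.html' for page in range(1, last_page + 1)]


urls = build_page_urls()
print(len(urls))   # number of list pages
print(urls[0])     # first list page
print(urls[-1])    # last list page
```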

3. Download every page obtained (pages 1 to 68)

text = self.req.get(url=url)
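
A slightly more defensive download loop might look like the sketch below. `fetch_pages` is a hypothetical helper, not the author's code. It takes the `get` callable as a parameter (for example `requests.Session().get`), so the loop itself has no third-party dependency. The one-second delay and ten-second timeout are assumed values, chosen only to be polite to the server and to avoid hanging forever.

```python
import time


def fetch_pages(urls, get, delay=1.0, timeout=10):
    """Download each URL with `get` and return a dict of {url: response body bytes}."""
    pages = {}
    for index, url in enumerate(urls):
        if index:
            # Pause between requests (not before the first one).
            time.sleep(delay)
        response = get(url, timeout=timeout)
        # Raise on 4xx/5xx instead of silently parsing an error page.
        response.raise_for_status()
        pages[url] = response.content
    return pages


# Usage with a requests session (needs network access):
#   import requests
#   pages = fetch_pages(urls, requests.Session().get)
```

Passing `get` in also means the loop can be exercised offline with a stand-in function that returns canned responses.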

4. Extract the titles and article URLs from the downloaded pages

def parser(self,r):
    soup = BeautifulSoup(r.content, 'html.parser')
    # Select the <a class="title"> links inside <div class="listbox">
    titleurl = soup.select('div.listbox a.title')
    for i in titleurl:
        self.n=self.n+1
        # get_text() also works when the link wraps its text in <b>
        s='title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href']+'\n'
        print(s)
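
To try the extraction step without hitting the site, the same idea can be run against a small HTML string. This is a sketch under assumptions: `extract_articles`, `BASE_URL` and `SAMPLE` are hypothetical names, and the sample markup is a shortened imitation of the page source shown earlier with a placeholder title. It uses a CSS selector and `urllib.parse.urljoin` rather than string concatenation, which is a swapped-in technique and not what the original script does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.qnwz.cn/'


def extract_articles(html, base_url=BASE_URL):
    """Return a list of (title, absolute_url) pairs found on one list page."""
    soup = BeautifulSoup(html, 'html.parser')
    articles = []
    # Only <a class="title"> elements nested inside <div class="listbox">.
    for link in soup.select('div.listbox a.title'):
        title = link.get_text(strip=True)
        articles.append((title, urljoin(base_url, link['href'])))
    return articles


SAMPLE = '''
<div class="listbox">
  <ul class="e2">
    <li>
      <a href="/html/221/201603/618083.html" class="preview"><img src="x.jpg"/></a>
      <a href="/html/221/201603/618083.html" class="title"><b>Sample title</b></a>
    </li>
  </ul>
</div>
<a href="/elsewhere.html" class="title">Outside the listbox</a>
'''

print(extract_articles(SAMPLE))
# [('Sample title', 'http://www.qnwz.cn/html/221/201603/618083.html')]
```

The link outside the listbox is ignored, which is the reason for scoping the selector to `div.listbox`.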

Output:

Full source:

#!/usr/bin/python3
#coding:utf8
import requests
from bs4 import BeautifulSoup
class main(object):
    def __init__(self):
        self.l = list()
        self.req=requests.Session()
        self.T = []
        self.n=0
        self.geturl()
        for i in self.l:
            self.gethtml(i)
        print('Total: ' + str(self.n) + ' articles')
    def geturl(self):
        # range() excludes its stop value, so use 69 to include page 68
        for i in range(1,69):
            root_url='http://www.qnwz.cn/html/221/list_'
            root_url+=str(i)+'.html'
            self.l.append(root_url)
    def parser(self,r):
        soup = BeautifulSoup(r.content, 'html.parser')
        # Select the <a class="title"> links inside <div class="listbox">
        titleurl = soup.select('div.listbox a.title')
        for i in titleurl:
            self.n=self.n+1
            # get_text() also works when the link wraps its text in <b>
            s='title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href']+'\n'
            print(s)

    def gethtml(self,url):
        text = self.req.get(url=url)
        self.parser(text)
if __name__=='__main__':
    main()

My writing isn't great, the code is simple, and the write-up is fairly brief too. If you spot any mistakes, corrections are welcome.
