Team Development Sprint: Day 1

Date: 2020-04-16


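Today is the first day of our team development sprint.

I assigned everyone on the team their tasks for this phase. The tasks I took on myself are:

① Crawl the news from the Tiedao University official website.

② Display the news correctly in the "Tieda News" (鐵大新聞) section, so that each item can be clicked to read the full article.

Today I used a crawler to parse the university's news site: http://xcb.stdu.edu.cn/2009-05-05-02-26-33.html

Inspecting the page showed that every headline sits inside a tr element with either class="sectiontableentry1" or class="sectiontableentry2", so I fetched all rows of both kinds.

I then iterated over the rows and stored each href (the URL of an individual news article).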
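The row-and-href step can be tried offline with a small sketch. The HTML below is a hypothetical stand-in for the list page (the real markup may differ), and `collect_article_links` is a helper name introduced here for illustration. It returns links in document order and uses `urljoin` so that relative hrefs resolve cleanly against the site root.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://xcb.stdu.edu.cn/'

# Hypothetical stand-in for the list page; the real markup may differ.
SAMPLE_LIST_HTML = """
<table>
  <tr class="sectiontableentry1"><td><a href="news/1.html">First</a></td></tr>
  <tr class="sectiontableentry2"><td><a href="news/2.html">Second</a></td></tr>
  <tr class="other"><td><a href="about.html">Not a news row</a></td></tr>
</table>
"""


def collect_article_links(html, base_url=BASE_URL):
    """Return absolute URLs for the first link in each news row."""
    soup = BeautifulSoup(html, 'html.parser')
    # Passing a list to class_ matches rows carrying either class.
    rows = soup.find_all('tr', class_=['sectiontableentry1', 'sectiontableentry2'])
    links = []
    for row in rows:
        anchor = row.find('a', href=True)  # skip rows that have no link
        if anchor is not None:
            links.append(urljoin(base_url, anchor['href']))
    return links


links = collect_article_links(SAMPLE_LIST_HTML)
print(links)
```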

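After analysing an individual article page, I decided to store the following in the database: title, date, click count, body text, and image links.

The title, date, and click count are straightforward.

The hard parts are the body text and the image links.

① Crawling the body text: the body lives in the p tags under div.article-content, so we collect all of those p tags.

At first I assumed the body text was only in the span elements inside each p. After crawling 10 articles I noticed that some text was missing: the content of b tags is part of the body too.

The article is split into p tags, one per paragraph. So I iterate over all the p tags, walk each one's child nodes through its contents, and use each child's name attribute to tell whether it is a "span" or a "b". Once a paragraph is complete, I append it to the list that holds the body text.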

# Excerpt from the full script further down; url and headers are defined there.
r = requests.get(url, headers=headers)
content = r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
# News rows carry one of two classes.
trs = soup.find_all('tr', class_='sectiontableentry1')
trs += soup.find_all('tr', class_='sectiontableentry2')
for tr in trs:
    link = 'http://xcb.stdu.edu.cn/' + tr.a['href']
    r = requests.get(link, headers=headers)
    content = r.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    title = soup.find('h2', class_='contentheading').text.strip()
    title = title.replace(' ', '')
    # Slice off the fixed-length label around the date and the click count.
    date = soup.find('span', class_='createdate').text.strip()[5:]
    click = soup.find('span', class_='hits').text.strip()[5:-1]

    # Body text: one entry per <p>, built from its <span> and <b> children.
    doc = []
    for p in soup.select('div.article-content > p'):
        pp = ''
        for child in p.contents:
            # get_text() also covers a <b> that has no inner <span>.
            if child.name in ('span', 'b'):
                pp += child.get_text()
        doc.append(pp)
    doc = '\n'.join(doc)
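The child-node check can also be tried in isolation. This is a minimal sketch on a hypothetical article fragment (the real pages may nest tags differently), and `extract_body` is a name introduced here for illustration. It keeps text from both span and b children of each paragraph.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical article fragment; the real pages may nest tags differently.
SAMPLE_ARTICLE_HTML = """
<div class="article-content">
  <p><span>First sentence. </span><b><span>Bold part.</span></b></p>
  <p><span>Second paragraph.</span></p>
</div>
"""


def extract_body(html):
    """Join the span/b text of each paragraph, one paragraph per line."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = []
    for p in soup.select('div.article-content > p'):
        parts = []
        for child in p.contents:
            # Only element nodes named span or b contribute text;
            # bare strings and other tags are ignored.
            if isinstance(child, Tag) and child.name in ('span', 'b'):
                parts.append(child.get_text())
        paragraphs.append(''.join(parts))
    return '\n'.join(paragraphs)


body = extract_body(SAMPLE_ARTICLE_HTML)
print(body)
```

If only span children were checked, the first paragraph would lose "Bold part.", which is the kind of gap described above.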

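② Crawling the image links:

Here too I first assumed the images were only in input tags inside span elements. After crawling 10 records and comparing them with the original pages, I found that a few images were missing. Going back to look at those, I discovered that some images are in img tags instead.

So I also crawl the links from the img tags and merge the two sets, separating the image links with spaces.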
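The merge can be sketched on a hypothetical fragment as follows. `collect_image_links` is an illustrative helper, not part of the original script. It keeps only tags that actually have a src attribute, so a plain form input does not raise a KeyError.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://xcb.stdu.edu.cn/'

# Hypothetical fragment: one image as <input>, one as <img>, one unrelated input.
SAMPLE_HTML = """
<div class="article-content">
  <p><span><input type="image" src="images/a.jpg"></span></p>
  <p><img src="images/b.jpg"></p>
  <p><input type="text" name="q"></p>
</div>
"""


def collect_image_links(html, base_url=BASE_URL):
    """Return the absolute image URLs as one space-separated string."""
    soup = BeautifulSoup(html, 'html.parser')
    # src=True keeps only tags that carry a src attribute.
    tags = soup.find_all('input', src=True) + soup.find_all('img', src=True)
    return ' '.join(urljoin(base_url, tag['src']) for tag in tags)


image_links = collect_image_links(SAMPLE_HTML)
print(image_links)
```

Joining once at the end avoids a stray leading space when one of the two sets is empty.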

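With single-page crawling working, I looked at how to crawl 10 pages of the list and found a pattern: the end of the URL changes each time you go to the next page.

Page 1: http://xcb.stdu.edu.cn/2009-05-05-02-26-33.html

Page 2: http://xcb.stdu.edu.cn/2009-05-05-02-26-33.html?start=10

Page 3: http://xcb.stdu.edu.cn/2009-05-05-02-26-33.html?start=20

The number in start goes up by 10 per page, so we can crawl page after page and then store the results in the database.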
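As a small sketch of that pattern, the page URLs can be generated up front. `page_urls` is an illustrative helper. It assumes, as the full script does, that `?start=0` returns the same list as the bare first-page URL.

```python
LIST_URL = 'http://xcb.stdu.edu.cn/2009-05-05-02-26-33.html'


def page_urls(pages, page_size=10):
    """Build the list-page URLs; start advances by page_size per page."""
    return [f'{LIST_URL}?start={page * page_size}' for page in range(pages)]


urls = page_urls(3)
for u in urls:
    print(u)
```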

Database screenshot (image not available):

Python source:

from urllib.parse import urljoin

import pymysql
import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://xcb.stdu.edu.cn/'
LIST_URL = BASE_URL + '2009-05-05-02-26-33.html'
# Request headers: send a browser-like User-Agent.
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
newlist = []

for page in range(10):
    # Each list page shows 10 items, so start advances by 10 per page.
    url = LIST_URL + '?start=' + str(10 * page)
    print('Page ' + str(page + 1))

    r = requests.get(url, headers=headers)
    content = r.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    # News rows carry one of two classes.
    trs = soup.find_all('tr', class_='sectiontableentry1')
    trs += soup.find_all('tr', class_='sectiontableentry2')
    for tr in trs:
        link = urljoin(BASE_URL, tr.a['href'])
        r = requests.get(link, headers=headers)
        content = r.content.decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        title = soup.find('h2', class_='contentheading').text.strip()
        title = title.replace(' ', '')
        # Slice off the fixed-length label around the date and the click count.
        date = soup.find('span', class_='createdate').text.strip()[5:]
        click = soup.find('span', class_='hits').text.strip()[5:-1]

        # Body text: one entry per <p>, built from its <span> and <b> children.
        doc = []
        for p in soup.select('div.article-content > p'):
            pp = ''
            for child in p.contents:
                # get_text() also covers a <b> that has no inner <span>.
                if child.name in ('span', 'b'):
                    pp += child.get_text()
            doc.append(pp)
        doc = '\n'.join(doc)

        # Image links: some images are <input src=...>, others are <img src=...>.
        # src=True skips tags without a src attribute.
        tags = soup.find_all('input', src=True) + soup.find_all('img', src=True)
        img_urls = ' '.join(urljoin(BASE_URL, tag['src']) for tag in tags)

        newlist.append((title, date, click, doc, img_urls))

# Store the results in MySQL.
print(len(newlist), 'articles collected')
db = pymysql.connect(host='localhost', user='root', password='fengge666',
                     database='baixiaosheng', charset='utf8')
cursor = db.cursor()
sql_news = "INSERT INTO tdnews values (%s,%s,%s,%s,%s)"
sql_clean_news = "TRUNCATE TABLE tdnews"

# Clear the old rows first.
try:
    cursor.execute(sql_clean_news)
    db.commit()
except Exception as e:
    print('Clearing the table failed, rolling back:', e)
    db.rollback()

# Insert all collected rows in one batch.
try:
    cursor.executemany(sql_news, newlist)
    db.commit()
except Exception as e:
    print('Insert failed, rolling back:', e)
    db.rollback()
db.close()
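To try the clear-then-bulk-insert step without a MySQL server, here is a rough sketch that swaps in the standard library's sqlite3 module as a stand-in for pymysql. The column names and sample rows are assumptions, since the real tdnews schema is not shown. SQLite uses `?` placeholders where pymysql uses `%s`, and it has no TRUNCATE TABLE, so the sketch uses DELETE instead.

```python
import sqlite3

# Hypothetical rows in the same (title, date, clicks, body, image links) order.
rows = [
    ('Sample title A', '2020-04-15', '12', 'Body A', 'http://example.com/a.jpg'),
    ('Sample title B', '2020-04-16', '34', 'Body B', ''),
]

conn = sqlite3.connect(':memory:')  # in-memory database, nothing touches disk
# Assumed schema; the original post does not show the table definition.
conn.execute(
    'CREATE TABLE tdnews (title TEXT, date TEXT, click TEXT, doc TEXT, url TEXT)'
)

try:
    # "with conn" commits on success and rolls back if an exception is raised,
    # so the old rows are only lost when the new batch goes in.
    with conn:
        conn.execute('DELETE FROM tdnews')
        conn.executemany('INSERT INTO tdnews VALUES (?, ?, ?, ?, ?)', rows)
except sqlite3.Error as exc:
    print('Insert failed, transaction rolled back:', exc)

stored = conn.execute('SELECT title, click FROM tdnews ORDER BY title').fetchall()
print(stored)
conn.close()
```

Putting the delete and the insert in one transaction is a design choice of this sketch. In MySQL, TRUNCATE TABLE commits implicitly, so the original script cannot roll it back.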
