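Python Web Crawler

Part 1:

Analyze the assignment page, scrape the information on the submitted assignments, and generate a list of submissions saved as a CSV file separated by ASCII commas. The file must be named hwlist.csv.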
 
The file contents should look like this:
 
Student ID,Name,Assignment title,Submission time,Assignment URL
20194010101,張三,羊車門作業,2018-11-13 23:47:36.8, http://www.cnblogs.com/sninius/p/12345678.html
20194010102,李四,羊車門,2018-11-14 9:38:27.03, http://www.cnblogs.com/sninius/p/87654321.html
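
One way to produce a file in this format is Python's built-in csv module, which also quotes any field that happens to contain a comma (a blog title, for example). The following is only a sketch: `write_hwlist`, `HEADER`, and the sample row are illustrative names and data, not part of the assignment.

```python
import csv
import io

# Column names for the first line of the file (an English rendering, assumed here).
HEADER = ["Student ID", "Name", "Assignment title", "Submission time", "Assignment URL"]

def write_hwlist(rows, out, header=HEADER):
    """Write the header line followed by one comma-separated line per row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Hypothetical sample row that mirrors the example data.
rows = [
    ("20194010101", "張三", "羊車門作業", "2018-11-13 23:47:36.8",
     "http://www.cnblogs.com/sninius/p/12345678.html"),
]
buf = io.StringIO()
write_hwlist(rows, buf)
print(buf.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with `open("hwlist.csv", "w", newline="", encoding="utf-8")` as `out`.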
 
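*Note 1: If you build a crawler that scrapes the assignments on a schedule, take care not to send requests too frequently.
*Note 2: This part of the assignment uses the following libraries:
(1) requests: third-party library
(2) json: built-in library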
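
For Note 1, a scheduled crawler can enforce a minimum gap between requests. The sketch below shows one way to do that; the `Throttle` class is a hypothetical helper, and the interval is an arbitrary choice because the assignment does not give a number.

```python
import time

class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Demo with a very short interval; a real crawler would use something far larger.
throttle = Throttle(0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()          # the actual request would go right after this call
print("three iterations took %.2f s" % (time.monotonic() - start))
```

In a scheduled crawler, `throttle.wait()` would be called immediately before each `requests.get(...)`.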
 
The page to crawl is https://edu.cnblogs.com/campus/hbu/Python2018Fall/homework/2420, and the fields to extract are: student ID, name, assignment title, submission time, and assignment URL.

 

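Viewing the page source in the browser did not reveal the submission information, so I inspected the page with the developer tools and found the required data among the XHR requests.

The URL that returns the required information is https://edu.cnblogs.com/Homework/GetAnswers?homeworkId=2420&_=1543629375998; what remains is just a matter of writing the code.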

https://edu.cnblogs.com/campus/hbu/Python2018Fall/homework/2420

https://edu.cnblogs.com/Homework/GetAnswers?homeworkId=2420&_=1543629375998

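Comparing the original URL with this one, however, I noticed that both contain "2420", so I guessed that the number at the end of the URL could be used to fetch the submissions for any assignment on cnblogs (博客園). Trying it in code confirmed that "https://edu.cnblogs.com/Homework/GetAnswers?homeworkId=2420" is enough to retrieve the information.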
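
The mapping from the assignment page to the data URL can be sketched as a small function. `answers_url` and `API_ENDPOINT` are names made up for this illustration, and the check that the ID is numeric is an extra assumption. The dropped `_` parameter looks like a millisecond timestamp of the kind AJAX libraries append to avoid cached responses, which would explain why the request works without it, but that reading is a guess.

```python
from urllib.parse import urlencode, urlsplit

API_ENDPOINT = "https://edu.cnblogs.com/Homework/GetAnswers"

def answers_url(homework_page_url):
    """Build the GetAnswers URL from the trailing ID of a homework page URL."""
    path = urlsplit(homework_page_url.strip()).path
    homework_id = path.rstrip("/").split("/")[-1]   # tolerate a trailing slash
    if not homework_id.isdigit():
        raise ValueError("no numeric homework ID at the end of %r" % homework_page_url)
    return API_ENDPOINT + "?" + urlencode({"homeworkId": homework_id})

print(answers_url("https://edu.cnblogs.com/campus/hbu/Python2018Fall/homework/2420"))
```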

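That completes the crawler itself. Since I have recently been learning GUI programming in Python, I also built a simple interface for it.

Enter the link to a cnblogs assignment and click the start button; the scraped information is then shown in the output window below.

Better still, changing only the last four digits is enough to crawl other assignments, as the small experiment in the screenshot above shows.

Finally, I scraped the submissions for the web crawler assignment itself.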

 
# login.py: the form layout, imported by the main script as `from login import Ui_Form`
from PyQt5 import QtCore, QtGui, QtWidgets

class Ui_Form(object):
    def setupUi(self, Form):
        Form.setObjectName("Form")
        Form.resize(1083, 667)
        self.label = QtWidgets.QLabel(Form)
        self.label.setGeometry(QtCore.QRect(110, 50, 91, 41))
        font = QtGui.QFont()
        font.setPointSize(12)
        self.label.setFont(font)
        self.label.setObjectName("label")
        self.lineEdit = QtWidgets.QLineEdit(Form)
        self.lineEdit.setGeometry(QtCore.QRect(210, 60, 441, 31))
        self.lineEdit.setObjectName("lineEdit")
        self.pushButton = QtWidgets.QPushButton(Form)
        self.pushButton.setGeometry(QtCore.QRect(650, 60, 91, 31))
        font = QtGui.QFont()
        font.setPointSize(12)
        self.pushButton.setFont(font)
        self.pushButton.setObjectName("pushButton")
        self.textBrowser = QtWidgets.QTextBrowser(Form)
        self.textBrowser.setGeometry(QtCore.QRect(70, 110, 891, 501))
        self.textBrowser.setObjectName("textBrowser")

        self.retranslateUi(Form)
        QtCore.QMetaObject.connectSlotsByName(Form)

    def retranslateUi(self, Form):
        _translate = QtCore.QCoreApplication.translate
        Form.setWindowTitle(_translate("Form", "Form"))
        self.label.setText(_translate("Form", "博客園連接:"))  # "cnblogs link:"
        self.pushButton.setText(_translate("Form", "開始爬取"))  # "Start crawling"


# Main script: connects the form's button to the crawler
from PyQt5 import QtWidgets
from login import Ui_Form
import requests
import json

class mywindow(QtWidgets.QWidget, Ui_Form):

    def __init__(self):
        super(mywindow, self).__init__()
        self.setupUi(self)
        self.pushButton.clicked.connect(self.fun)

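    def fun(self):
        # The homework ID is the last path segment of the assignment link,
        # e.g. ".../homework/2420" -> "2420" (a trailing slash is tolerated).
        u = self.lineEdit.text().strip().rstrip('/').split('/')[-1]
        url = "https://edu.cnblogs.com/Homework/GetAnswers?homeworkId={}".format(u)
        r = requests.get(url, timeout=10)
        r.encoding = r.apparent_encoding
        jd = json.loads(r.text)['data']
        p = ""
        for i in jd:
            # Column order follows the assignment's header line:
            # student ID, name, title, submission time, URL.
            submitted = str(i['DateAdded']).replace('T', ' ').split('.')[0]
            p += ','.join([str(i['StudentNo']), str(i['RealName']), str(i['Title']),
                           submitted, str(i['Url'])]) + '\n'
        self.textBrowser.setText(p)
        # Raw string, because '\h' is not a valid escape sequence; the explicit
        # encoding keeps names and titles from depending on the platform default.
        with open(r'D:\hwlist.csv', 'w', encoding='utf-8') as f:
            f.write(p)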


if __name__ == "__main__":
    import sys
    app = QtWidgets.QApplication(sys.argv)
    ui = mywindow()
    ui.show()
    sys.exit(app.exec_())
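
Because `fun` combines the network request, the parsing, and the GUI update, it is hard to try out on its own. As a sketch, the parsing step could be pulled into a plain function such as the hypothetical `answers_to_rows` below. The sample payload is invented and only mimics the keys the code above reads (`data`, `StudentNo`, `RealName`, `DateAdded`, `Title`, `Url`); the column order follows the header line in the assignment.

```python
import json

def answers_to_rows(payload_text):
    """Convert the JSON text returned by GetAnswers into a list of row tuples."""
    rows = []
    for item in json.loads(payload_text)["data"]:
        # "2018-11-13T23:47:36.8" -> "2018-11-13 23:47:36"
        submitted = str(item["DateAdded"]).replace("T", " ").split(".")[0]
        rows.append((
            str(item["StudentNo"]),
            str(item["RealName"]),
            str(item["Title"]),
            submitted,
            str(item["Url"]),
        ))
    return rows

# Made-up payload for illustration; the real response may carry more fields.
sample = json.dumps({"data": [{
    "StudentNo": "20194010101",
    "RealName": "張三",
    "Title": "羊車門作業",
    "DateAdded": "2018-11-13T23:47:36.8",
    "Url": "http://www.cnblogs.com/sninius/p/12345678.html",
}]})

for row in answers_to_rows(sample):
    print(",".join(row))
```

With the parsing separated like this, the button handler only needs to fetch the text, call the function, and display or save the result.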