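This article touches on the following topics: Python web scraping, the MySQL database, HTML/CSS/JS basics, selenium and phantomjs basics, the MVC design pattern, ORM (object-relational mapping) frameworks, the django framework (a Python web development framework), the apache server, and basic linux operation (using centos 7 as the example). It is best suited to readers who already have that background.
Disclaimer: this post is purely for technical exchange, and sensitive information has been filtered out, so please bear with me. (I take no responsibility for any problem that arises on the Yangtze University academic affairs website for any reason.)
Approach: there is no data API for the academic affairs system (to protect student information), so the only option is to write a crawler that simulates logging in and then scrapes the data. In case the academic affairs site goes down and the crawler fails, the data is cached so that the next request can be served straight from our own database. What remains for us is to refresh that data on a schedule so it stays in sync with the academic affairs system.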
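The cache-then-sync idea can be sketched in a few lines. This is only an illustration under stated assumptions: `GradeCache`, `max_age_seconds` and the `fetch` callback are hypothetical names, the store is an in-memory dict rather than the mariadb tables built later, and the code is written for Python 3 although the article's own scripts target Python 2.7.

```python
import time


class GradeCache:
    """Keeps the last scraped result per student, with the time it was stored."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # student number -> (stored_at, grades)

    def get(self, sno, fetch, now=None):
        """Return grades for sno, calling fetch(sno) only when the copy is stale.

        If fetch raises (for example the site is down), fall back to the
        stale copy when one exists; otherwise re-raise.
        """
        now = time.time() if now is None else now
        entry = self._entries.get(sno)
        if entry is not None and now - entry[0] < self.max_age_seconds:
            return entry[1]  # fresh enough, no network access needed
        try:
            grades = fetch(sno)
        except Exception:
            if entry is not None:
                return entry[1]  # stale data beats no data
            raise
        self._entries[sno] = (now, grades)
        return grades
```

Passing `now` explicitly keeps the staleness rule easy to test; in real use it would be left out so the wall clock is used.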
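Tech stack: centos 7 + apache2.4 + mariadb5.5 + Python2.7.5 + mod_wsgi 3.4 + django1.11
------------------------------------------------------------------------
I. Python crawler:
1. First, take a look at the login page
Here we use Firefox to capture and analyze the traffic. The login turns out to be a POST carrying 7 parameters, and there is a captcha. There are two ways to deal with it: use deep learning (all the rage right now) to recognize the image, or download the image and let the user type it in. The first is fairly costly, so I'll try it when I'm less busy. Python's Pillow (the maintained fork of PIL) can handle loading and preprocessing the image, and I plan to try TensorFlow for the recognition itself over the summer break. The second is too crude to be worth discussing.
2. There is also a fancier way that lets you skip the captcha altogether. I won't go into the details here; let's simulate the login: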
#coding:utf8
from bs4 import BeautifulSoup
import urllib
import urllib2
import requests
import sys
reload(sys)
sys.setdefaultencoding('gbk')

loginURL = "login URL of the academic affairs site"
cjcxURL = "http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx"

# Fetch the login page first to read the hidden ASP.NET form fields.
html = urllib2.urlopen(loginURL)
soup = BeautifulSoup(html,"lxml")
__VIEWSTATE = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")["value"]

data = {
    "__VIEWSTATE":__VIEWSTATE,
    "__EVENTVALIDATION":__EVENTVALIDATION,
    "txtUid":"your account",
    # NOTE: this value is already percent-encoded GBK; requests encodes form
    # values again, so the server may not receive the button text as captured.
    "btLogin":"%B5%C7%C2%BC",
    "txtPwd":"your password",
    "selKind":"1"
}
header = {
    # "Host":"jwc2.yangtzeu.edu.cn:8080",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0;… Gecko/20100101 Firefox/54.0",
    "Accept":"text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Content-Type":"application/x-www-form-urlencoded",
    # "Content-Length":"644",
    "Referer":"http://jwc2.yangtzeu.edu.cn:8080/login.aspx",
    # "Cookie":"ASP.NET_SessionId=3zjuqi0cnk5514l241csejgx",
    # "Connection":"keep-alive",
    # "Upgrade-Insecure-Requests":"1",
}

UserSession = requests.session()
# Pass data and headers by keyword: the third positional argument of
# Session.post() is json, not headers.
Request = UserSession.post(loginURL,data=data,headers=header)
Response = UserSession.get(cjcxURL,cookies = Request.cookies,headers=header)
soup = BeautifulSoup(Response.content,"lxml")
print soup
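One detail of the login payload is worth a quick check: the `btLogin` value was copied from the captured request already percent-encoded. The sketch below, written for Python 3 with only the standard library, shows what happens when such a value is form-encoded a second time, and one way to avoid it by handing over the raw bytes instead. It assumes the HTTP client form-encodes values the same way `urllib.parse.urlencode` does; `encode_naive` and `encode_raw` are hypothetical helper names.

```python
from urllib.parse import urlencode, unquote_to_bytes

# The btLogin value exactly as it appears in the captured request.
PRE_ENCODED = "%B5%C7%C2%BC"


def encode_naive(value):
    """Form-encode the captured string as-is; every '%' gets escaped again."""
    return urlencode({"btLogin": value})


def encode_raw(value):
    """Undo the percent-encoding first, then form-encode the raw bytes once."""
    raw = unquote_to_bytes(value)
    return urlencode({"btLogin": raw})


print(encode_naive(PRE_ENCODED))  # btLogin=%25B5%25C7%25C2%25BC
print(encode_raw(PRE_ENCODED))    # btLogin=%B5%C7%C2%BC
```

Whether the server actually checks the button value is not something the article establishes, so this is a thing to rule out rather than a confirmed cause of any failure.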
Next, we can see the page that comes back:
Then POST again (this code continues from the block above):
__VIEWSTATE2 = soup.find(id="__VIEWSTATE")["value"]
__EVENTVALIDATION2 = soup.find(id="__EVENTVALIDATION")["value"]

AllcjData = {
    "__EVENTTARGET":"btAllcj",
    "__EVENTARGUMENT":"",
    "__VIEWSTATE":__VIEWSTATE2,
    "__EVENTVALIDATION":__EVENTVALIDATION2,
    "selYear":"2017",
    "selTerm":"1",
    # "Button2":"%B1%D8%D0%DE%BF%CE%B3%C9%BC%A8"
}
AllcjHeader = {
    # "Host":"jwc2.yangtzeu.edu.cn:8080",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0;… Gecko/20100101 Firefox/54.0",
    "Accept":"text/html,application/xhtml+x…lication/xml;q=0.9,*/*;q=0.8",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Content-Type":"application/x-www-form-urlencoded",
    # "Content-Length":"644",
    "Referer":"http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx",
    # "Cookie":,
    "Connection":"keep-alive",
    "Upgrade-Insecure-Requests":"1",
}

# Pass data and headers by keyword: the third positional argument of
# Session.post() is json, not headers.
Request1 = UserSession.post(cjcxURL,data=AllcjData,headers=AllcjHeader)
# Note: this GET requests the page afresh; the result of the POST itself
# is in Request1.content.
Response1 = UserSession.get(cjcxURL,cookies = Request.cookies,headers=AllcjHeader)
soup = BeautifulSoup(Response1.content,"lxml")
print soup
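It doesn't work: the page returned by this GET is still the original page. I can think of two reasons the POST might have failed. One is that asp.net's __VIEWSTATE and __EVENTVALIDATION variables cause the POST to be rejected. The other is that a single form with several buttons relies on js to decide which one was pressed, which defeats the crawler; an ordinary crawler still can't cope with dynamically loaded pages. A third possibility worth ruling out is that the code prints the follow-up GET, which simply fetches the page again, instead of the response to the POST itself.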
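For reference, an ASP.NET postback is an ordinary form POST that echoes the page's hidden fields back and names the control that fired in `__EVENTTARGET`. The sketch below builds such a payload with Python 3's standard library only; `HiddenFieldParser` and `build_postback` are hypothetical names, and nothing here has been checked against the real academic affairs site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback(page_html, event_target, extra=None):
    """Return the form payload that imitates __doPostBack(event_target, '')."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)  # __VIEWSTATE, __EVENTVALIDATION, ...
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    payload.update(extra or {})    # visible fields such as selYear/selTerm
    return payload
```

The payload would be built from the most recently fetched copy of the page, and the HTML to parse afterwards is the body of the POST response rather than a fresh GET.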
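3. Now for something fancier: selenium (a web automation testing tool that can simulate mouse clicks) + phantomjs (a browser with no UI, which usually runs faster than a full chrome or Firefox because nothing is drawn on screen)
Installing selenium: pip install selenium
Installing phantomjs:
(1) Download: http://phantomjs.org/download.html (I downloaded the Linux 64-bit build)
(2) Extract: tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/share/
(3) Install dependencies: yum install fontconfig freetype libfreetype.so.6 libfontconfig.so.1
(4) Set the environment variable: export PATH=$PATH:/usr/share/phantomjs-2.1.1-linux-x86_64/bin
(5) Type phantomjs in the shell; if it drops you into its prompt, the installation succeeded.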
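The PATH step can also be checked from Python before selenium tries to launch the browser. A minimal sketch for Python 3, where `find_on_path` is a hypothetical helper around the standard `shutil.which`:

```python
import shutil


def find_on_path(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = find_on_path("phantomjs")
print(path if path else "phantomjs is not on PATH; re-check the export line")
```

Keep in mind that an `export` typed in one shell session does not carry over to new sessions unless it is also added to a profile file.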
Please ignore my commented-out attempts:
#coding:utf8
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import urllib
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

driver = webdriver.PhantomJS();
driver.get("login URL of the academic affairs site")
driver.find_element_by_name('txtUid').send_keys('your account')
driver.find_element_by_name('txtPwd').send_keys('your password')
driver.find_element_by_id('btLogin').click()
cookie=driver.get_cookies()
driver.get("http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx")
#print driver.page_source
#driver.find_element_by_xpath("//input[@name='btAllcj'][@type='button']")
#js = "document.getElementById('btAllcj').onclick=function(){__doPostBack('btAllcj','')}"
#js = "var ob; ob=document.getElementById('btAllcj');ob.focus();ob.click();)"
#driver.execute_script("document.getElementById('btAllcj').click();")
#time.sleep(2) # pause briefly
#driver.find_element_by_link_text("所有成績").click() # find the button and click it
#time.sleep(2)
#js1 = "document.Form1.__EVENTTARGET.value='btAllcj';"
#js2 = "document.Form1.__EVENTARGUMENT.value='';"
#driver.execute_script(js1)
#driver.execute_script(js2)
#driver.find_element_by_name('__EVENTTARGET').send_keys('btAllcj')
#driver.find_element_by_name('__EVENTARGUMENT').send_keys('')
#js = "var input = document.createElement('input');input.setAttribute('type', 'hidden');input.setAttribute('name', '__EVENTTARGET');input.setAttribute('value', '');document.getElementById('Form1').appendChild(input);var input = document.createElement('input');input.setAttribute('type', 'hidden');input.setAttribute('name', '__EVENTARGUMENT');input.setAttribute('value', '');document.getElementById('Form1').appendChild(input);var theForm = document.forms['Form1'];if (!theForm) { theForm = document.Form1;}function __doPostBack(eventTarget, eventArgument) { if (!theForm.onsubmit || (theForm.onsubmit() != false)) { theForm.__EVENTTARGET.value = eventTarget; theForm.__EVENTARGUMENT.value = eventArgument; theForm.submit(); } }__doPostBack('btAllcj', '')"
#js = "var script = document.createElement('script');script.type = 'text/javascript';script.text='if (!theForm) { theForm = document.Form1;}function __doPostBack(eventTarget, eventArgument) { if (!theForm.onsubmit || (theForm.onsubmit() != false)) { theForm.__EVENTTARGET.value = eventTarget; theForm.__EVENTARGUMENT.value = eventArgument; theForm.submit(); }}';document.body.appendChild(script);"
#driver.execute_script(js)

# Button2 is the "required-course grades" button, a plain submit.
driver.find_element_by_name("Button2").click()
html=driver.page_source
soup = BeautifulSoup(html,"lxml")
print soup

# Print the first three cells of every row of every table on the page.
tables = soup.findAll("table")
for tab in tables:
    for tr in tab.findAll("tr"):
        print "--------------------"
        for td in tr.findAll("td")[0:3]:
            print td.getText()
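The nested loops at the end of the script walk table, then row, then the first three cells. The same traversal can be sketched without BeautifulSoup so that it returns tuples instead of printing. This is a Python 3 illustration using only the standard library; `TableRowParser` and `grade_rows` are hypothetical names, and the assumed column order (course, grade, credit) is taken from how the article later stores the cells.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows made only of <th> header cells stay empty
                self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None


def grade_rows(page_html, columns=3):
    """Return the first `columns` cells of each data row as a tuple."""
    parser = TableRowParser()
    parser.feed(page_html)
    return [row[:columns] for row in parser.rows if len(row) >= columns]
```

Returning tuples rather than printing makes the later step of writing the rows into the database a plain loop over the result.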
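For now this only retrieves the required-course grades, because the "all grades" button is triggered by ASP-generated js rather than a plain submit. I'm still looking for a solution. Now let's move on to the database design.
II. Mariadb student database design. This borrows from the lab exercises in our SQL Server database principles course.
My statements for creating the database: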
create database jwc character set utf8;
use jwc;

create table Student(
    Sno char(9) primary key,
    Sname varchar(20) unique,
    Sdept char(20),
    Spwd char(20)
);

create table Course(
    Cno char(2) primary key,
    Cname varchar(30) unique,
    Credit numeric(2,1)
);

create table SC(
    Sno char(9) not null,
    Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno)
);
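To see what the constraints in this schema are meant to reject, the three tables can be tried out in an in-memory SQLite database from Python 3. SQLite is only a stand-in here, not the article's mariadb: it enforces the `check` clause, whereas MariaDB 5.5 accepts the clause but does not enforce it. `open_demo_db` and `try_insert_grade` are hypothetical helper names.

```python
import sqlite3

SCHEMA = """
create table Student(
    Sno char(9) primary key, Sname varchar(20) unique,
    Sdept char(20), Spwd char(20));
create table Course(
    Cno char(2) primary key, Cname varchar(30) unique, Credit numeric(2,1));
create table SC(
    Sno char(9) not null, Cno char(2) not null,
    Grade int check(Grade>=0 and Grade<=100),
    primary key(Sno,Cno),
    foreign key(Sno) references Student(Sno),
    foreign key(Cno) references Course(Cno));
"""


def open_demo_db():
    """Create the three tables in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def try_insert_grade(conn, sno, cno, grade):
    """Insert one SC row; return False if any constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("insert into SC values(?,?,?)", (sno, cno, grade))
        return True
    except sqlite3.IntegrityError:
        return False
```

With one student and two courses loaded, a valid grade is accepted, while a repeated (Sno, Cno) pair, a grade of 101, and a course number that does not exist are all turned away.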
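III. Setting up the Python web environment (LAMP):
1. Since the HTTP server chosen this time is apache, we need to install mod_wsgi (an apache module that implements WSGI, the Python Web Server Gateway Interface) so that apache and the Python program can talk to each other. With nginx you would install and configure uwsgi instead. Its role is similar to servlets in java and php-fpm in PHP.
Install: yum install mod_wsgi
Configure: vim /etc/httpd/conf/httpd.conf
This configuration cost me a lot of thought and time, and many of the examples online contain mistakes. Here is the django configuration that works for me; help yourself.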
#config python web
LoadModule wsgi_module modules/mod_wsgi.so
<VirtualHost *:8080>
    ServerAdmin root@Vito-Yan
    ServerName www.yuol.online
    ServerAlias yuol.online
    Alias /media/ /var/www/html/jwc/media/
    Alias /static/ /var/www/html/jwc/static/
    <Directory /var/www/html/jwc/static/>
        Require all granted
    </Directory>
    WSGIScriptAlias / /var/www/html/jwc/jwc/wsgi.py
    # DocumentRoot "/var/www/html/jwc/jwc"
    ErrorLog "logs/www.yuol.online-error_log"
    CustomLog "logs/www.yuol.online-access_log" common
    <Directory "/var/www/html/jwc/jwc">
        <Files wsgi.py>
            AllowOverride All
            Options Indexes FollowSymLinks Includes ExecCGI
            Require all granted
        </Files>
    </Directory>
</VirtualHost>
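2. Next, install django: pip install django. Done.
Check the django version: python -m django --version
Official site: https://www.djangoproject.com
Create a project: django-admin.py startproject jwc (I created mine under /var/www/html, apache's document root)
3. apache configuration: I won't paste it again. Change jwc in the config above to jwc2, change the port to 9000, and add Listen 9000. (Why 9000? The first project, jwc, already uses 8080. django's built-in server, started with python manage.py runserver, defaults to port 8000, so 8000 is avoided to prevent a clash. My jsp project's tomcat server uses 9090, so that is best avoided too. That leaves 9000 as a common choice; I didn't want to pick other ports at random.)
4. settings.py configuration:
DEBUG = True  # turn debugging on
ALLOWED_HOSTS = ['192.168.47.128']  # add the host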
5. wsgi.py configuration. I didn't know why this was needed at first; it is simply what makes a django project start under apache. The change puts the project directory on sys.path so that mod_wsgi can import jwc2.settings. If you use django's built-in server, no change is needed.
"""
WSGI config for jwc2 project.

It exposes the WSGI callable as a module-level variable named ``application``.

For more information on this file, see
https://docs.djangoproject.com/en/1.11/howto/deployment/wsgi/
"""
#import os
#from django.core.wsgi import get_wsgi_application
#os.environ.setdefault("DJANGO_SETTINGS_MODULE", "jwc2.settings")
#application = get_wsgi_application()

import os
from os.path import join,dirname,abspath

# The directory that contains the jwc2 package (two levels up from this file).
PROJECT_DIR = dirname(dirname(abspath(__file__)))

import sys
sys.path.insert(0,PROJECT_DIR)

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "jwc2.settings")

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()
And that's it: the Python web environment is now set up.
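What the modified wsgi.py computes can be checked in isolation. The sketch below, for Python 3 on a POSIX system, shows that two `dirname` calls on the path of wsgi.py give the project directory, which is the directory that must be importable for `jwc2.settings` to resolve. `project_dir_for` and `with_project_on_path` are hypothetical names, and the example path is the one implied by the article's layout.

```python
from os.path import abspath, dirname


def project_dir_for(wsgi_file):
    """Directory containing the project package: two levels above wsgi.py."""
    return dirname(dirname(abspath(wsgi_file)))


def with_project_on_path(sys_path, wsgi_file):
    """Return a copy of sys_path with the project directory moved to the front."""
    project_dir = project_dir_for(wsgi_file)
    return [project_dir] + [p for p in sys_path if p != project_dir]


print(project_dir_for("/var/www/html/jwc2/jwc2/wsgi.py"))  # /var/www/html/jwc2
```

The built-in server does not need this because manage.py lives in the project directory, and Python adds the directory of the script being run to sys.path on its own.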
IV. Starting our first django app
1. Create the grade-query app: python manage.py startapp cjcx
2. Add the app to INSTALLED_APPS in settings.py
3. Write the first lines of code in views.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.http import HttpResponse
from django.shortcuts import render

# Create your views here.
def index(request):
    return HttpResponse("Hello,YUOL!")
4. Add the url in urls.py
from django.conf.urls import url
from django.contrib import admin
import cjcx.views as cj

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^cjcx/',cj.index),
]
5. Visiting /cjcx/ now shows: Hello,YUOL!
6. Step 4 above can also be done another way.
Create a urls.py inside the cjcx app:
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^$', views.index),
]
Then modify the urls.py under jwc2 (the project root):
from django.conf.urls import url, include
from django.contrib import admin

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^cjcx/', include('cjcx.urls')),
]
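How the project-level prefix and the app-level patterns combine can be imitated in a few lines of plain Python 3. This is a simplified imitation for illustration, not django's implementation: `resolve`, `_resolve` and the string targets are hypothetical, and the `search/` route is the one the article adds further down.

```python
import re

# App-level patterns (cjcx/urls.py): matched against what is left of the path.
APP_PATTERNS = [
    (r"^$", "index"),
    (r"^search/$", "search_action"),
]

# Project-level patterns (jwc2/urls.py): a list as target stands for include().
PROJECT_PATTERNS = [
    (r"^admin/", "admin"),
    (r"^cjcx/", APP_PATTERNS),
]


def _resolve(rest, patterns):
    for regex, target in patterns:
        match = re.search(regex, rest)
        if not match:
            continue
        if isinstance(target, list):
            # include(): strip the matched prefix, hand the remainder down
            found = _resolve(rest[match.end():], target)
            if found is not None:
                return found
        else:
            return target
    return None


def resolve(path):
    """Map a request path such as '/cjcx/search/' to a view name, or None."""
    rest = path[1:] if path.startswith("/") else path
    return _resolve(rest, PROJECT_PATTERNS)
```

This is why the app's own pattern is just `^$`: by the time it is consulted, the `cjcx/` prefix has already been consumed by the project-level pattern.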
7. Write the front-end page.
Create a templates folder inside the cjcx app to hold our html files (please ignore the template variables for now; I was too lazy to delete them)
<html>
<head>
    <title>YUOL Grade Query System</title>
    <style type="text/css">
        #border {
            margin: 0 auto;
            width: 500px;
            min-height: 500px;
            background-color: #FFFFFF;
            border: 1px solid #000000;
        }
        #button {}
    </style>
</head>
<body style="text-align:center">
<div id="border">
    <h1>YUOL Grade Query System</h1><br/>
    <form action="" method="post">
        Account: <input type="text" id="xuehao" name="Sno" /><br/>
        Password: <input type="password" id="pwd" name="Spwd" /><br/><br/>
        <input type="submit" value="Query" id="submit" /><br/>
        <div style="text-align:left;padding-left:50px;">
            -----------------------------------------------------------<br/>
            Name: {{ student.Sname }}<br/>
            Student number: {{ student.Sno}}<br/>
            Class: {{ student.Sdept }}<br/>
        </div>
        -----------------------------------------------------------<br/>
        <div>
            <br>
            <div style="display:inline-block;width:150px;">
                Course:<br>
                {{ course.Cname }}
            </div>
            <div style="display:inline-block;width:150px;">
                Grade:<br>
                {{ sc.Grade }}
            </div>
            <div style="display:inline-block;width:150px;">
                Credit:<br>
                {{ course.Credit }}
            </div>
        </div>
    </form>
</div>
</body>
</html>
Modify views.py:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.http import HttpResponse
from django.shortcuts import render

# Create your views here.
def index(request):
    return render(request, 'jwcjcx.html')
And then the page looks like this:
8. Design the Models based on the jwc database.
django uses sqlite by default; switch it to mariadb by modifying settings.py
DATABASES = {
    'default': {
        # 'ENGINE': 'django.db.backends.sqlite3',
        # 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'jwc2',
        'USER':'root',
        'PASSWORD':'your password',
        'HOST':'localhost',
        'PORT':'3306',
    }
}
9. Now define the tables in models.py.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.db import models

# Create your models here.
class Student(models.Model):
    Sno=models.CharField(max_length=9,primary_key=True)
    Sname=models.CharField(max_length=20,unique=True)
    Sdept=models.CharField(max_length=20)
    Spwd=models.CharField(max_length=20)

class Course(models.Model):
    Cno=models.CharField(max_length=2,primary_key=True)
    Cname=models.CharField(max_length=30,unique=True)
    Credit=models.DecimalField(max_digits=2, decimal_places=1)

class SC(models.Model):
    Sno=models.CharField(max_length=9)
    Cno=models.CharField(max_length=2)
    Grade=models.IntegerField()

    def __unicode__(self):
        return self.Sno
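This kind of ORM saves you the trouble of writing sql statements: each table is wrapped in a class that inherits from models.Model, and fields are read with plain dot access. Very convenient.
Then generate the migrations for the models: python manage.py makemigrations
Then apply them to the mariadb database: python manage.py migrate
This creates three tables prefixed with cjcx_; the rest are django's default tables and can be ignored. The database itself has to be created beforehand (create database jwc2 charset=utf8;)
10. Use django admin for data management. Admin is genuinely pleasant to use and is one of the django framework's most notable strengths.
Create a user: python manage.py createsuperuser
Then append /admin to the host address to log in. We find that its css and images are missing.
Solution:
Create a folder for static files under jwc2: static
Modify settings.py: add STATIC_ROOT = "/var/www/html/jwc2/static/" as the last line, and set LANGUAGE_CODE = 'zh-Hans' (this switches the admin to Chinese)
Run the command: python manage.py collectstatic
Uncomment the static-file section of the apache config above.
At this point the data tables still don't show up in the admin; admin.py needs to be modified to import the models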
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.contrib import admin
# Explicit relative import; a bare "import models" only works on Python 2.
from . import models

# Register your models here.
admin.site.register(models.Student)
admin.site.register(models.Course)
admin.site.register(models.SC)
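Now the database can be operated on directly from the admin. This is where django really shines.
11. Now for our most important part, the business logic.
Loading the data (the M in MVC, models): here I removed Cno from the Course table and replaced Cno in the SC table with Cname. This differs from the models above; just drop the database and regenerate the tables.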
#encoding=utf-8
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import MySQLdb
import time
import urllib
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

conn= MySQLdb.connect(
    host='localhost',
    port = 3306,
    user='root',
    passwd='your password',
    db ='jwc2',
    charset='utf8'
)
cur = conn.cursor()

driver = webdriver.PhantomJS();
driver.get("login URL of the academic affairs site")
driver.find_element_by_name('txtUid').send_keys('your account')
driver.find_element_by_name('txtPwd').send_keys('your password')
driver.find_element_by_id('btLogin').click()
cookie=driver.get_cookies()
driver.get("http://jwc2.yangtzeu.edu.cn:8080/cjcx.aspx")
driver.find_element_by_name("Button2").click()
html=driver.page_source
driver.quit()  # shut down the phantomjs process once the page is captured
#html = open("btAllcj.html","r")
soup = BeautifulSoup(html,"lxml")

# Student details shown on the page; the password is a fixed placeholder.
Sno = str(soup.find(id="lbXH").getText())
Sname = str(soup.find(id="lbXm").getText())
Sdept = str(soup.find(id="lbBj").getText())
Student = (Sno,Sname,Sdept,'12345678')
sql = "insert into cjcx_student values(%s,%s,%s,%s)"
cur.execute(sql,Student)

# The second table holds the grades; skip its header row.
id = 0
tables = soup.findAll("table")
for tab in tables[1:2]:
    for tr in tab.findAll("tr")[1:]:
        count = 0
        for td in tr.findAll("td"):
            count += 1
            if count==1:
                Cname = td.getText()
            if count==2:
                Grade = td.getText()
                id += 1
                sql = "insert into cjcx_sc values(%s,%s,%s,%s)"
                SC = (id,Sno,Cname,Grade)
                cur.execute(sql,SC)
            if count==3:
                Credit = td.getText()
                sql = "insert into cjcx_course values(%s,%s)"
                Course = (Cname,Credit)
                cur.execute(sql,Course)
conn.commit()
cur.close()
conn.close()
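Because the script uses plain `insert`, running it a second time for the same student would collide with the rows already stored, which gets in the way of the scheduled re-sync described at the start. One way around that is an upsert. The sketch below uses Python 3 with an in-memory SQLite database as a stand-in; it assumes a hypothetical `sc` table keyed on (Sno, Cname), which is not how the django-generated `cjcx_sc` table is keyed, and `sync_grades` is a hypothetical name.

```python
import sqlite3


def open_demo_db():
    """In-memory stand-in for the grades table, keyed on student + course."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table sc("
        "Sno text, Cname text, Grade integer, primary key(Sno, Cname))"
    )
    return conn


def sync_grades(conn, sno, scraped):
    """Write scraped (Cname, Grade) pairs, overwriting rows that already exist.

    Returns the number of rows stored for the student afterwards.
    """
    rows = [(sno, cname, int(grade)) for cname, grade in scraped]
    with conn:  # one transaction: either every row is written or none
        conn.executemany(
            "insert or replace into sc(Sno, Cname, Grade) values(?,?,?)", rows
        )
    count = conn.execute("select count(*) from sc where Sno=?", (sno,))
    return count.fetchone()[0]
```

On mariadb the corresponding statements would be `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`, and they likewise need a primary or unique key on the columns that identify a row.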
Business logic in views.py (the V in MVC, views)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from django.http import HttpResponse
from django.shortcuts import render
from . import models

# Create your views here.
def index(request):
    return render(request, 'jwcjcx.html')

def search_action(request):
    Sno = request.POST['Sno']
    Spwd = request.POST['Spwd']
    # The crawler and data-loading code goes here.
    try:
        student = models.Student.objects.get(Sno=Sno)
    except models.Student.DoesNotExist:
        # Unknown student number: show the empty form again.
        return render(request, 'jwcjcx.html')
    if Spwd==student.Spwd:
        sc = models.SC.objects.filter(Sno=Sno)
        # course = models.Course.objects.filter(Cname=sc.Cname)
        return render(request,'jwcjcx.html',{'student':student, 'sc':sc})
    # Wrong password: a view must still return a response.
    return render(request, 'jwcjcx.html')
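The branches a lookup like this has to cover (unknown student number, wrong password, success) can be pulled out into a plain function and exercised without django. The sketch below is for Python 3; plain dicts and lists stand in for the Student and SC tables, and `lookup_grades` is a hypothetical name.

```python
def lookup_grades(students, sc_rows, sno, password):
    """Return (student, grades) on success, or (None, []) otherwise.

    students: dict mapping Sno -> {"Sname": ..., "Spwd": ...}
    sc_rows:  list of dicts with the keys Sno, Cname and Grade
    """
    student = students.get(sno)
    if student is None:
        return None, []  # unknown student number
    if student["Spwd"] != password:
        return None, []  # wrong password
    grades = [row for row in sc_rows if row["Sno"] == sno]
    return student, grades
```

A view built on top of this would render the empty form when the first element is None, and pass both values to the template otherwise.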
Modify urls.py (the C in MVC, Controller)
urls for the jwc2 project:
from django.conf.urls import url,include
from django.contrib import admin

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^cjcx/',include('cjcx.urls', namespace='cjcx')),
]
urls for the cjcx app:
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^$', views.index),
    url(r'^search/$',views.search_action,name='search_action'),
]
12. Rendering the data on the front end.
<html>
<head>
    <title>YUOL Grade Query System</title>
    <style type="text/css">
        #border {
            margin: 0 auto;
            width: 500px;
            min-height: 500px;
            background-color: #FFFFFF;
            border: 1px solid #000000;
        }
        #button {}
    </style>
</head>
<body style="text-align:center">
<div id="border">
    <h1>YUOL Grade Query System</h1><br/>
    <form action="{% url 'cjcx:search_action' %}" method="post">{% csrf_token %}
        Account: <input type="text" id="xuehao" name="Sno" /><br/>
        Password: <input type="password" id="pwd" name="Spwd" /><br/><br/>
        <input type="submit" value="Query" id="submit" /><br/>
        <div style="text-align:left;padding-left:50px;">
            -----------------------------------------------------------<br/>
            Name: {{ student.Sname }}<br/>
            Student number: {{ student.Sno}}<br/>
            Class: {{ student.Sdept }}<br/>
        </div>
        -----------------------------------------------------------<br/>
        <div>
            <br/>
            <div style="display:inline-block;width:200px;">
                Course:<br/>
                {% for sc in sc %}
                {{ sc.Cname }}<br/>
                -------------------<br/>
                {% endfor %}
            </div>
            <div style="display:inline-block;width:100px;">
                Grade:<br/>
                {% for sc in sc %}
                {{ sc.Grade }}<br/>
                ------<br/>
                {% endfor %}
            </div>
            <div style="display:inline-block;width:150px;">
                Credit:<br/>
                {% for course in course %}
                {{ course.Credit }}
                {% endfor %}
            </div>
        </div>
    </form>
</div>
</body>
</html>
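That's a wrap.
I wrote this over two days and two nights and really couldn't hold out any longer, so the credits column was left unfinished. The crawler is still unstable and there is almost no validation logic; it only implements the basic functionality.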
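The credits column that was left unfinished could be filled in by pairing each grade row with its course's credit before the data reaches the template, so that one loop can print course, grade and credit side by side. A minimal sketch for Python 3, with plain dicts standing in for the SC and Course rows; `build_rows` is a hypothetical name.

```python
def build_rows(sc_rows, courses):
    """Return (Cname, Grade, Credit) tuples, one per grade row.

    Credit is None when no course with that name is stored.
    """
    credit_by_name = {course["Cname"]: course["Credit"] for course in courses}
    return [
        (row["Cname"], row["Grade"], credit_by_name.get(row["Cname"]))
        for row in sc_rows
    ]
```

Passing a single list of rows to the template also keeps the three columns aligned, which separate loops cannot guarantee if a credit is missing.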
Finally, here is a screenshot: