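Background: yesterday a classmate who studies finance asked me to help her scrape data from a website and export it to Excel. A quick look showed 1000+ records, so tallying them by hand was out of the question. I had never done this before, but as a computer science student I put on a brave face and agreed.
My first idea was to send a GET request directly and parse the returned HTML for the information I needed. That does work for sites that do not require login, but it does not work for this one. So the first thing to do is capture the traffic, that is, analyse the POST request the browser sends to the server when the user logs in. Many browsers ship with built-in capture tools, but I still prefer [httpwatch].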
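Capture steps:
1. Install httpwatch
2. Open the site's login page in IE
3. Open httpwatch and click Record to start tracing
4. Enter the account name and password and confirm the login, which yields the following data: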
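Focus on the Url and postdata of the POST request, and on the cookies the server returns.
The cookies carry the login information; to be safe, we can send all 4 cookie values back to the server.
First, here is the C# code for sending the POST request (the goal is to obtain the cookies the server returns):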
    // Requires: using System.IO; using System.Net; using System.Text;
    string Url = "URL";
    // The captured fields are separate key/value pairs; the complete
    // string can be copied directly from the Stream tab in httpwatch.
    string postDataStr = "POST Data";
    // Must be created before the call, otherwise there is nothing to fill
    CookieContainer cookie = new CookieContainer();
    // Log in and collect the cookies
    HttpPost(Url, postDataStr, ref cookie);

    private string HttpPost(string Url, string postDataStr, ref CookieContainer cookie)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        byte[] postData = Encoding.UTF8.GetBytes(postDataStr);
        request.ContentLength = postData.Length;
        // Cookies set by the server are stored in this container
        request.CookieContainer = cookie;
        using (Stream myRequestStream = request.GetRequestStream())
        {
            myRequestStream.Write(postData, 0, postData.Length);
        }
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream myResponseStream = response.GetResponseStream())
        using (StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.UTF8))
        {
            response.Cookies = cookie.GetCookies(response.ResponseUri);
            return myStreamReader.ReadToEnd();
        }
    }
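If copying the raw string from the capture is inconvenient, postDataStr can also be assembled from the individual key/value pairs. The following is a minimal sketch, not the code used for this site: the class name FormBody and the field names in the usage note are hypothetical. It percent-encodes each key and value with Uri.EscapeDataString, which writes a space as %20 rather than +; servers normally accept both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBody
{
    // Joins key/value pairs into a form-encoded string such as "a=1&b=2",
    // percent-encoding every key and value.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }
}
```

For example, passing the pairs ("username", "alice") and ("password", "p@ss word") gives a string that can be handed to HttpPost as postDataStr; the real field names have to be taken from the captured request.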
With the cookies in hand, we can scrape the data we need from the site. The next step is to send a GET request:
    private string HttpGet(string Url, string postDataStr, CookieContainer cookie)
    {
        // A non-empty postDataStr is appended to the URL as the query string
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url + (postDataStr == "" ? "" : "?") + postDataStr);
        request.Method = "GET";
        request.ContentType = "text/html;charset=UTF-8";
        // Send the login cookies along with the request
        request.CookieContainer = cookie;
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream myResponseStream = response.GetResponseStream())
        using (StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.UTF8))
        {
            return myStreamReader.ReadToEnd();
        }
    }
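HttpGet only works if the CookieContainer it receives actually holds the login cookies. As a rough illustration of what the container does, the sketch below loads cookie values by hand (for instance, values copied from the capture) and lets us inspect the Cookie header that would be sent. The class name CookieDemo, the host example.com and the cookie names are placeholders, not values from the real site.

```csharp
using System;
using System.Net;

public static class CookieDemo
{
    // Builds a container for one host from alternating name, value arguments.
    public static CookieContainer FromCaptured(string host, params string[] nameValuePairs)
    {
        var container = new CookieContainer();
        for (int i = 0; i + 1 < nameValuePairs.Length; i += 2)
        {
            // Path "/" makes the cookie apply to every page on the host
            container.Add(new Cookie(nameValuePairs[i], nameValuePairs[i + 1], "/", host));
        }
        return container;
    }
}
```

Calling CookieDemo.FromCaptured("example.com", "sessionid", "abc123", "token", "xyz") and then GetCookieHeader(new Uri("http://example.com/data")) on the result shows the header text for that URL. The same check on the container filled by HttpPost is a quick way to confirm that the login cookies were captured.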
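The server returns HTML, so how do we quickly pull the information we need out of a large amount of HTML? Here we can use NSoup, an efficient and powerful third-party library. (Some people online recommend htmlparser, but in my own comparison htmlparser fell well short of NSoup in both efficiency and conciseness.)
Since there are relatively few NSoup tutorials online, you can also refer to the JSoup tutorial: http://www.open-open.com/jsoup/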
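To give an idea of what parsing with NSoup looks like, here is a small sketch that collects the text of table cells row by row. It assumes the NSoup package is referenced and that its API mirrors JSoup (NSoupClient.Parse, Select, Text), so check the calls against the version you install. The class name TableParser and the selector are illustrative; the real selector depends on the page's markup.

```csharp
using System;
using System.Collections.Generic;
using NSoup;
using NSoup.Nodes;

public static class TableParser
{
    // Returns the text of every <td> cell, one list per matching row.
    public static List<List<string>> ExtractRows(string html, string rowSelector)
    {
        Document doc = NSoupClient.Parse(html);
        var rows = new List<List<string>>();
        foreach (Element row in doc.Select(rowSelector))
        {
            var cells = new List<string>();
            foreach (Element cell in row.Select("td"))
            {
                cells.Add(cell.Text());
            }
            // Skip rows without <td> cells, such as header rows built from <th>
            if (cells.Count > 0) rows.Add(cells);
        }
        return rows;
    }
}
```

The string returned by HttpGet can be passed in as html, with a selector such as "table tr" as a starting point.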
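Finally, here is part of the data I scraped from the site:
What is learned on paper always feels shallow; to truly understand something, you have to do it yourself.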