Simulating a Sina Weibo login in Java (via cookie)

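I have been working on a Sina Weibo crawler for the past few days and found that you have to be logged in before you can crawl Weibo data. My original plan was to simulate a browser login with an account name and password, but Weibo's current login mechanism is fairly complicated, and I have not yet managed to log in that way QAQ. So for now, this post records how to access the HTML of your own Weibo home page directly with a cookie.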

The Weibo login authentication process

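The details of the Weibo login have already been covered thoroughly in other blogs. Roughly: after the user enters an account name and password, the browser and the server exchange several requests. If authentication succeeds, the Weibo server returns a cookie to the browser. From then on, sending that cookie with each request is enough to access other Weibo content normally. Accessing Weibo via cookie therefore reduces to two steps: obtain the cookie, then have a program imitate the browser and request the Weibo home page.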
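The cookie hand-off described above can be sketched with the JDK's own java.net.HttpCookie class. This is only an illustration of the general mechanism, not Weibo's actual exchange: the cookie names, values, and the example.com domain below are made-up placeholders, and the class name CookieRoundTrip is hypothetical.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CookieRoundTrip {
    /**
     * Turns the Set-Cookie header values sent by a server into the single
     * Cookie header value a client sends back on later requests.
     */
    static String toCookieHeader(List<String> setCookieHeaders) {
        List<String> pairs = new ArrayList<>();
        for (String header : setCookieHeaders) {
            // HttpCookie.parse drops attributes such as Path and Domain
            // from the name=value pair and exposes them separately.
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        return String.join("; ", pairs);
    }

    /** Placeholder response headers, not real Weibo cookies. */
    static List<String> sampleResponseHeaders() {
        return Arrays.asList(
                "sessionId=abc123; Path=/; Domain=.example.com",
                "userToken=xyz789; Path=/");
    }
}
```

Calling CookieRoundTrip.toCookieHeader(CookieRoundTrip.sampleResponseHeaders()) yields the string "sessionId=abc123; userToken=xyz789", which is the shape of value that goes into the Cookie request header.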

Obtaining the Weibo cookie

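Either a packet-capture tool or the browser's built-in developer tools can capture a page's cookie. This post uses the HttpFox add-on for Firefox to obtain the Weibo cookie.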

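1. Open the Weibo home page and open HttpFox.
[Figure 1]
2. Enter your user name and password, tick "Remember me", and click Log in. After you click, HttpFox lists many URLs. Once the home page has loaded, find the URL that corresponds to your home page in HttpFox, as shown in the figure below:
[Figure 2]
Click the home page URL and some details appear at the lower left, including "Headers" and "Cookies".
3. Under "Headers" there is an entry named "Cookie". This is the cookie we need. Right-click it and save the cookie.
With that, we have the cookie needed for login!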
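Before pasting the saved value into a program, it can help to check it. The sketch below is my own assumption about a convenient helper and is not part of the original workflow: the class name SavedCookie is hypothetical, and it assumes the copied text may still carry a leading "Cookie:" label or a trailing newline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SavedCookie {
    /** Strips surrounding whitespace and an optional leading "Cookie:" label. */
    static String normalize(String raw) {
        String value = raw.trim();
        if (value.regionMatches(true, 0, "Cookie:", 0, 7)) {
            value = value.substring(7).trim();
        }
        return value;
    }

    /** Splits a header value such as "a=1; b=2" into an ordered name-to-value map. */
    static Map<String, String> toMap(String cookieHeader) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : normalize(cookieHeader).split(";")) {
            String trimmed = pair.trim();
            // Split on the first '=' only, since cookie values may contain '='.
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return cookies;
    }
}
```

The normalized string is what you would pass as the Cookie header value; the map is only for inspecting which cookies were captured.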

Code implementation

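Because we log in directly with the cookie, most of the authentication process is skipped. Using the Apache HttpClient packages and attaching the cookie obtained earlier is enough to access your personal home page. Once the home page has been fetched, the Weibo data can be analyzed with regular expressions.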

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.cookie.CookieSpec;
import org.apache.http.cookie.CookieSpecProvider;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.DefaultCookieSpec;
import org.apache.http.message.BasicHeader;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

/**
 * Fetches a Weibo home page by sending a cookie captured from the browser.
 *
 * @author zkw
 */
public class cookieLogin {
    private HttpClient client;
    private HttpPost post;
    private HttpGet get;
    private BasicCookieStore cookieStore;

    public cookieLogin() {
        // Cookie policy: if none is set, HttpClient reports "cookie rejected";
        // setting a policy lets the cookie information be kept.
        cookieStore = new BasicCookieStore();
        CookieSpecProvider myCookie = new CookieSpecProvider() {
            @Override
            public CookieSpec create(HttpContext context) {
                return new DefaultCookieSpec();
            }
        };
        Registry<CookieSpecProvider> rg = RegistryBuilder.<CookieSpecProvider> create()
                .register("myCookie", myCookie)
                .build();
        client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setDefaultCookieSpecRegistry(rg)
                .build();
        get = new HttpGet();
        post = new HttpPost();
    }

    public void Login() throws ClientProtocolException, IOException, URISyntaxException {
        String LoginUrl = "URL of your Weibo home page";
        get.setURI(new URI(LoginUrl));

        // Imitate the request headers a browser would send.
        get.addHeader("Host", "weibo.com");
        get.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        get.addHeader("Accept", "*/*");
        get.addHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
        get.addHeader("Accept-Encoding", "gzip, deflate");
        get.addHeader("Referer", "http://weibo.com/");
        // Attach the cookie captured from the browser.
        get.addHeader(new BasicHeader("Cookie", "the cookie value obtained above"));

        HttpResponse resp = client.execute(get);
        HttpEntity entity = resp.getEntity();
        // Fall back to UTF-8 when the response does not declare a charset.
        String cont = EntityUtils.toString(entity, "UTF-8");
        System.out.println("Fetched Weibo content: " + cont);
    }

    public HttpClient getClient() {
        return client;
    }

    public void setClient(HttpClient client) {
        this.client = client;
    }

    public HttpPost getPost() {
        return post;
    }

    public void setPost(HttpPost post) {
        this.post = post;
    }

    public HttpGet getGet() {
        return get;
    }

    public void setGet(HttpGet get) {
        this.get = get;
    }

    public BasicCookieStore getCookieStore() {
        return cookieStore;
    }

    public void setCookieStore(BasicCookieStore cookieStore) {
        this.cookieStore = cookieStore;
    }

    public static void main(String[] args) throws ClientProtocolException, IOException, URISyntaxException {
        new cookieLogin().Login();
    }
}
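As a small illustration of the regular-expression step mentioned earlier, the sketch below pulls the contents of the title element out of an HTML string. It is an assumption-laden example rather than the author's parser: the class name HtmlTitle is hypothetical, and the title element is used only because every HTML page has one, not because it is where Weibo keeps its post data.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlTitle {
    // DOTALL lets the title span several lines; the reluctant (.*?) stops
    // at the first closing tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed text of the first title element, if there is one. */
    static Optional<String> extract(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find()
                ? Optional.of(matcher.group(1).trim())
                : Optional.empty();
    }
}
```

Passing the fetched page string (the variable cont in the listing above) to HtmlTitle.extract is a quick way to confirm that the response is your home page and not a login page. Regular expressions are fine for small, fixed patterns like this one; for heavier extraction an HTML parser is usually less fragile.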

Summary

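Logging in to Weibo via cookie is a shortcut, but it has quite a few problems. So I am still studying the Weibo account authentication process and hope to make a breakthrough in a few days QAQ.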
