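Some websites only grant access after you log in, so a crawler has to log in before it can fetch anything. If logging in is all that is required, there is a very handy tool I can recommend.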
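Firefox has an add-on called Firebug (it has to be installed separately). With it you can inspect in detail how a site is accessed (the redirects between links and the order in which they are requested), the request and response headers of every request, and the data submitted by POST. IE and Chrome also ship developer tools that open directly with F12, but I personally find Firebug easier to use; I only use the IE and Chrome tools occasionally.
With the tool described above you can capture every detail of the login process, and reproducing that login in code is then straightforward.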
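As a rough illustration of replaying what the browser tool shows, the sketch below turns captured form fields into the application/x-www-form-urlencoded body that a login POST carries. The class name FormBodySketch and the field values are made up for illustration; only JDK classes are used (Java 10 or later for this URLEncoder.encode overload).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of any library.
class FormBodySketch {
    /** Encodes name/value pairs the way a browser encodes a submitted form. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Field names as they might appear in a captured POST; values are placeholders.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "demo user");
        fields.put("newrandom", "a1&b2");
        System.out.println(encode(fields)); // username=demo+user&newrandom=a1%26b2
    }
}
```

HTTP client libraries normally do this encoding for you; the point is only that the body is nothing more than the name/value pairs visible in the tool.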
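What I want to cover here is the case where the login also requires a CAPTCHA, which is more troublesome. The solution I came up with, and a fairly common one, is to pop up the CAPTCHA image and have a person type in the code.
The implementation follows. Simulated login: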
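    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    import org.apache.http.Header;
    import org.apache.http.HttpResponse;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    public class LoginByCode {

        public static void main(String[] args) {
            CloseableHttpClient httpClient = HttpClientBuilder.create().build();
            // HH is the 24-hour clock, so file names do not repeat between morning and afternoon
            SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
            String path = "d:/img/tmp/" + format.format(new Date()) + ".jpg";
            try {
                // Download the CAPTCHA image; the session cookie set here stays in httpClient
                String imgurl = "http://www.shanghaiip.cn/wasWeb/login/Random.jsp";
                HttpResponse res = httpClient.execute(new HttpGet(imgurl));
                byte[] img = EntityUtils.toByteArray(res.getEntity());
                saveFile(path, img);

                // Pop up the image and read the code the user types in
                String code = new ImgDialog().showDialog(null, path);

                String login = "http://www.shanghaiip.cn/wasWeb/login/loginServer.jsp";
                HttpPost post = new HttpPost(login);
                List<NameValuePair> data = new ArrayList<NameValuePair>();
                data.add(new BasicNameValuePair("username", "zhpatent"));
                // Firebug in Firefox shows that the password is submitted in encrypted form
                data.add(new BasicNameValuePair("password", "5ca072839350b0733a2a456cc4004371"));
                data.add(new BasicNameValuePair("newrandom", code));
                post.setEntity(new UrlEncodedFormEntity(data));
                res = httpClient.execute(post);

                // The login response redirects; read the target before following it
                Header location = res.getFirstHeader("Location");
                EntityUtils.consume(res.getEntity()); // release the connection for the next request
                if (location == null) {
                    System.out.println("Login failed: the response contains no redirect");
                    return;
                }
                res = httpClient.execute(new HttpGet(location.getValue()));
                String body = EntityUtils.toString(res.getEntity());
                int pos = body.indexOf("zhpatent");
                if (pos >= 0) {
                    // Print some context around the user name, clamped to the page bounds
                    System.out.println("Simulated login succeeded: "
                            + body.substring(Math.max(0, pos - 40), Math.min(body.length(), pos + 40)));
                }
            } catch (Exception e) {
                System.out.println("Exception: " + e.getMessage());
            } finally {
                // Remove the temporary CAPTCHA image
                File file = new File(path);
                if (file.exists()) {
                    file.delete();
                }
                try {
                    httpClient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        private static void saveFile(String path, byte[] data) {
            try {
                Path target = Paths.get(path);
                Files.createDirectories(target.getParent()); // make sure d:/img/tmp exists
                Files.write(target, data);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }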
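One detail worth guarding against when following the redirect by hand: a Location header may hold a relative URL, in which case it cannot be used as a request URL as it stands. The sketch below resolves it against the login URL with the JDK's java.net.URI. The class name RedirectSketch and the sample Location values are hypothetical and not taken from the site's real responses.

```java
import java.net.URI;

// Hypothetical helper, not part of HttpClient.
class RedirectSketch {
    /** Turns a Location header value into an absolute URL, relative to the request URL. */
    static String resolve(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        String login = "http://www.shanghaiip.cn/wasWeb/login/loginServer.jsp";
        // Made-up Location values covering the three common shapes.
        System.out.println(resolve(login, "index.jsp"));
        System.out.println(resolve(login, "/wasWeb/index.jsp"));
        System.out.println(resolve(login, "http://www.shanghaiip.cn/wasWeb/index.jsp"));
    }
}
```

An absolute Location passes through unchanged, so it is safe to resolve every redirect this way.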
CAPTCHA dialog helper class:
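    import java.awt.Container;
    import java.awt.Dimension;
    import java.awt.event.ActionEvent;
    import java.awt.event.ActionListener;

    import javax.swing.ImageIcon;
    import javax.swing.JButton;
    import javax.swing.JDialog;
    import javax.swing.JFrame;
    import javax.swing.JLabel;
    import javax.swing.JTextField;
    import javax.swing.border.EtchedBorder;

    public class ImgDialog {

        private JDialog dialog;
        private JTextField field; // Swing text field, to avoid mixing AWT and Swing components
        private String result = "";

        /** Shows the image at path in a modal dialog and returns the text the user typed. */
        public String showDialog(JFrame father, String path) {
            JLabel label = new JLabel();
            label.setBorder(new EtchedBorder(EtchedBorder.LOWERED, null, null));
            label.setBounds(10, 10, 125, 51);
            label.setIcon(new ImageIcon(path));

            field = new JTextField();
            field.setBounds(145, 10, 65, 20);

            JButton confirm = new JButton("OK");
            confirm.setBounds(145, 40, 65, 20);
            confirm.addActionListener(new ActionListener() {
                @Override
                public void actionPerformed(ActionEvent e) {
                    result = field.getText();
                    dialog.dispose(); // closing the dialog lets setVisible(true) return
                }
            });

            dialog = new JDialog(father, true); // modal: blocks until the dialog is disposed
            dialog.setTitle("Enter the code shown in the image");
            Container pane = dialog.getContentPane();
            pane.setLayout(null); // absolute positioning via setBounds
            pane.add(label);
            pane.add(field);
            pane.add(confirm);
            dialog.pack();
            dialog.setSize(new Dimension(235, 110));
            dialog.setLocation(750, 430);
            // dialog.setLocationRelativeTo(father);
            dialog.setVisible(true);
            return result;
        }
    }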
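The result of a test run:
Running the program downloads the CAPTCHA image and pops it up in a dialog.
After the code is entered, my user information is read from the page that the login redirects to.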
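I used HttpClient for the simulated login here. With HttpClient you do not have to manage cookies yourself: the same client instance stores the session cookie it receives along with the CAPTCHA and sends it back with the login request. That makes it convenient to use, and the CAPTCHA cannot end up mismatched with the session.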
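To make that behaviour concrete without any network access, here is a minimal sketch that uses the JDK's java.net.CookieManager rather than HttpClient's own cookie store (the idea is the same, the classes differ). A cookie recorded from the CAPTCHA response is offered again for the login URL on the same host. The class name, the example.com URLs and the JSESSIONID value are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; stands in for the cookie store inside an HTTP client.
class SharedCookieStoreSketch {
    /** Records a Set-Cookie header seen at captchaUrl and returns the Cookie values due at loginUrl. */
    static List<String> cookiesForLogin(String captchaUrl, String setCookie, String loginUrl) {
        try {
            CookieManager store = new CookieManager();
            // Step 1: the CAPTCHA response arrives and its cookie is stored.
            store.put(URI.create(captchaUrl), Map.of("Set-Cookie", List.of(setCookie)));
            // Step 2: the login request asks the store which cookies apply to it.
            Map<String, List<String>> headers = store.get(URI.create(loginUrl), Map.of());
            return headers.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookiesForLogin(
                "http://www.example.com/login/Random.jsp",
                "JSESSIONID=abc123; Path=/",
                "http://www.example.com/login/loginServer.jsp"));
    }
}
```

Because the store is keyed by host and path, the login request carries the same session that the CAPTCHA was issued for, which is why the typed code matches.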
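Simulating the login with Jsoup is slightly more work because you have to manage the cookies yourself: when you request the CAPTCHA page you need to both download the image and keep the cookies from that response, and then send those cookies along with the login request.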
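The manual bookkeeping can be sketched with JDK classes alone: collect the Set-Cookie headers from the CAPTCHA response into a name/value map, then build the Cookie header for the login request. With Jsoup itself, Connection.Response.cookies() returns such a map and Connection.cookies(Map) attaches it to the next request, so you would not write the parsing by hand. The class name ManualCookiesSketch and the header values below are hypothetical.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper showing what "managing cookies yourself" amounts to.
class ManualCookiesSketch {
    /** Step 1: keep the cookies that came back with the CAPTCHA image. */
    static Map<String, String> collect(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : setCookieHeaders) {
            for (HttpCookie c : HttpCookie.parse(header)) {
                cookies.put(c.getName(), c.getValue());
            }
        }
        return cookies;
    }

    /** Step 2: format them as the Cookie header of the login request. */
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Made-up response headers from the CAPTCHA request.
        Map<String, String> cookies = collect(List.of(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "theme=dark"));
        System.out.println(toCookieHeader(cookies)); // JSESSIONID=abc123; theme=dark
    }
}
```

If the login request is sent without these cookies, the server sees a new session and the CAPTCHA the user typed no longer matches.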