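I originally wanted to grab the OSChina logo. However, perhaps because I have requested the OSChina home page with HttpClient too many times over the past few days, every request now comes back as 403. The likely cause is that my requests carried no browser headers, so the site detected them and rejected them.
So I am switching targets and fetching the Baidu logo instead.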
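As a rough illustration of what "adding browser parameters" could look like, here is a minimal sketch that attaches a browser-like User-Agent header to a request. It deliberately uses the JDK's own HttpURLConnection rather than HttpClient so that it runs without extra dependencies; the class name, the User-Agent string, and the example.com URL are all placeholders I made up, and whether such a header would actually lift the 403 is untested. With HttpClient the analogous call should be httpGet.setHeader("User-Agent", ...).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class BrowserHeaderSketch {
    // Hypothetical value: any realistic browser User-Agent string would do.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Prepares (but does not send) a GET request carrying a browser-like header.
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; no network traffic happens until the response is read.
        HttpURLConnection conn = prepare("http://example.com/");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The request is only prepared here, not sent, so the sketch shows where the header goes rather than proving that the site accepts it.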
After the warm-up of the first three posts, this one finally puts regular expressions and HttpClient to work on a real target.
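Let's review the steps:
Request the page with HttpClient
Analyze the page source
Write a regular expression that matches the right string
Capture and output the target data with the Java API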
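Step 1: request the page, HttpGetUtils.java. The previous post introduced this request helper class, so I will just paste it here. If the request steps are unclear, see Java Simple Crawler Series (2) --- Using HttpClient.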
package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Created by liuhj on 2016/1/4.
 */
public class HttpGetUtils {

    /**
     * Sends a GET request and returns the response body.
     *
     * @param url the page to request
     * @return the body, or "" if the request fails or the status is not 200
     */
    public static String get(String url) {
        String result = "";
        // try-with-resources closes the response first, then the client
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                result = readResponse(entity, "utf-8");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * Reads the entity's stream into a string using the given charset.
     * Lines are appended without separators, so the result is one long line.
     *
     * @param resEntity the response entity
     * @param charset   the character encoding, e.g. "utf-8"
     * @return the content, or "" if there is no entity
     */
    private static String readResponse(HttpEntity resEntity, String charset) {
        if (resEntity == null) {
            return "";
        }
        StringBuilder res = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(resEntity.getContent(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                res.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return res.toString();
    }
}
With the method above we can download the page source. Of course, it is only a helper class and has to be combined with the code that follows.
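Step 2: analyze the page source
Analyzing the source certainly does not mean reading the printed output, which is far too messy. It is better to inspect the target page directly.
Open www.baidu.com and, in Chrome, right-click and choose Inspect Element.
The line shown above is where the Baidu logo sits. After looking through the page, the only image that follows hidefocus is the logo, so I can now write the regular expression.
Step 3: the regular expression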
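String regex = "hidefocus.+?src=\"//(.+?)\"";

The pattern starts at hidefocus and allows any number of characters up to src. The content inside the double quotes after src is what we want, so it is written as a group, that is, wrapped in (). That way it lines up with Java's group function when we use the API.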
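To see the pattern at work before pointing it at a live page, here is a small sketch that runs it against a single hard-coded tag. The markup is a made-up stand-in for what the inspector shows (only hidefocus, src, and the logo path come from this post), and the class and method names are just for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogoRegexSketch {
    // Same pattern as in the post: lazy match from "hidefocus" to src="//...".
    static final String REGEX = "hidefocus.+?src=\"//(.+?)\"";

    // Returns capture group 1 of the first match, or "" when nothing matches.
    static String firstSrc(String html) {
        Matcher m = Pattern.compile(REGEX).matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical markup, not copied from the real Baidu page.
        String html = "<img hidefocus=\"true\" "
                + "src=\"//www.baidu.com/img/bd_logo1.png\" width=\"270\">";
        System.out.println(firstSrc(html)); // prints www.baidu.com/img/bd_logo1.png

        // Without "hidefocus" the pattern does not match at all.
        System.out.println(firstSrc("<img src=\"//example.com/other.png\">").isEmpty()); // prints true
    }
}
```

Because both quantifiers are lazy (.+?), the match stops at the first src after hidefocus and at the first closing quote after it. Note also that . does not match line breaks by default; that is harmless here because HttpGetUtils joins the page into one long line.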
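Step 4: RegexStringUtils.java, capturing the output with the Java API. For the details, see the third post, Java Simple Crawler Series (3) --- Regular Expressions and the Java Regex API.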
package com.hldh.river;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by liuhj on 2016/1/4.
 */
public class RegexStringUtils {

    public static String regexString(String targetStr, String patternStr) {
        Pattern pattern = Pattern.compile(patternStr);
        // Create a matcher to run the pattern against the target string
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found, return the first capture group
        if (matcher.find()) {
            // System.out.println(matcher.group(1));
            return matcher.group(1);
        }
        return "";
    }
}
Here is the main function, App.java:
package com.hldh.river;

/**
 * Fetch the Baidu logo with a regular expression
 */
public class App {

    public static void main(String[] args) {
        String url = "http://www.baidu.com/";
        String regex = "hidefocus.+?src=\"//(.+?)\"";
        System.out.println(regex);
        String result = HttpGetUtils.get(url);
        System.out.println(result);
        String src = RegexStringUtils.regexString(result, regex);
        System.out.println(src);
    }
}
Output:
hidefocus.+?src="//(.+?)"
HTTP/1.1 200 OK
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">.....(middle omitted)</script></body></html>
www.baidu.com/img/bd_logo1.png
The last line is the address of the Baidu logo image.
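Notice that the captured address has no scheme: the regex consumes the leading // and keeps only the host and path. The download helper later in this post simply prepends "http://". A slightly more defensive version of that step might look like the sketch below; the class and method names are hypothetical, and defaulting to http rather than https is an assumption carried over from the rest of the post.

```java
class UrlSketch {
    // Turns the regex capture (host + path, usually no scheme) into a requestable URL.
    static String toHttpUrl(String captured) {
        if (captured.startsWith("http://") || captured.startsWith("https://")) {
            return captured;              // already absolute, leave untouched
        }
        if (captured.startsWith("//")) {
            return "http:" + captured;    // protocol-relative form
        }
        return "http://" + captured;      // bare host + path, as printed above
    }

    public static void main(String[] args) {
        System.out.println(toHttpUrl("www.baidu.com/img/bd_logo1.png"));
        // prints http://www.baidu.com/img/bd_logo1.png
    }
}
```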
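You may be thinking that an address alone is not much use: if what I had found were pictures of pretty girls, I still could not look at them. Don't worry.
Below is a helper class that downloads the image with HttpClient, DownloadUtils.java: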
package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.*;

/**
 * Created by liuhj on 2016/1/8.
 * Downloads the image that was found
 */
public class DownloadUtils {

    public static String get(String url) {
        String filename = "";
        // The regex capture has no scheme, so add one here
        String targetUrl = "http://" + url;
        // try-with-resources closes the response first, then the client
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(targetUrl))) {
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                filename = download(entity);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return filename;
    }

    private static String download(HttpEntity resEntity) {
        if (resEntity == null) {
            return "";
        }
        // Directory the image is saved to
        String dirPath = "d:\\img\\";
        // Image file name; could be generated instead of hard-coded
        String fileName = "b_logo.png";
        // Create the directory if it does not exist yet
        File dir = new File(dirPath);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        File filePath = new File(dirPath.concat(fileName));
        // Read the input stream in chunks and write them through a buffered
        // output stream; FileOutputStream creates the file if it is missing,
        // and both streams are closed automatically
        try (InputStream in = resEntity.getContent();
             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(filePath))) {
            byte[] bytes = new byte[1024];
            int len;
            while ((len = in.read(bytes)) != -1) {
                out.write(bytes, 0, len);
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return filePath.toString();
    }
}
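The buffer-and-loop part of the download can also be handed to the JDK. The sketch below is an alternative, not what DownloadUtils does: it uses java.nio.file.Files.copy to write a stream to disk and Files.createDirectories to make sure the target folder exists. The class name, the temporary directory, and the fake bytes standing in for an HTTP response body are all assumptions for the sake of a self-contained example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveStreamSketch {
    // Copies the whole stream into dir/fileName, creating dir if needed
    // and overwriting any existing file with the same name.
    static Path save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes instead of a real response body.
        byte[] fake = {1, 2, 3, 4};
        Path dir = Files.createTempDirectory("img");
        try (InputStream in = new ByteArrayInputStream(fake)) {
            Path saved = save(in, dir, "b_logo.png");
            System.out.println(saved + " (" + Files.size(saved) + " bytes)");
        }
    }
}
```

In the real helper, the stream passed to save would be resEntity.getContent() and the directory would be the d:\img\ path used above.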
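Below is the result of saving the image.
That wraps up fetching the Baidu image with a regular expression. Next I will write an extension that uses Jsoup to fetch the image and save it.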