Simple Java Crawler Series (4) --- Fetching the Baidu Logo with Regular Expressions

Date: 2019-11-13

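Originally I wanted to fetch the OSChina logo. However, I seem to have requested the OSChina home page with HttpClient too many times over the past few days, and every request now comes back with 403. The likely cause is that my requests carried no browser headers, so the site detected them and rejected them.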
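As a rough illustration of what "adding browser parameters" could look like, the sketch below builds a GET request that carries a browser-like User-Agent header. It uses the JDK's own java.net.http API (Java 11+) rather than Apache HttpClient so that it runs without extra dependencies; with the HttpGet object used later in this post, the corresponding call would be httpGet.setHeader("User-Agent", ...). The User-Agent string and the OSChina URL are assumed example values, and whether this alone would lift the 403 is not something I have verified.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) a GET request with a browser-like User-Agent.
class BrowserLikeRequest {
    // Assumed example value; any real browser's User-Agent string could be used instead.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36";

    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://www.oschina.net/");
        // Print the header that would be sent with the request
        System.out.println(request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```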

So I switched targets and will fetch the Baidu logo instead.

After the warm-up in the first three posts, this one finally puts regular expressions and HttpClient to work on a real target.

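Let's review the steps:

1. Request the page with HttpClient
2. Analyze the page source
3. Write a regular expression that matches the right string
4. Capture and print the target data with the Java regex API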

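Step 1: request the resource with HttpGetUtils.java. I wrote this request utility class earlier in the series and reproduce it here. If the request steps are unclear, see Simple Java Crawler Series (2) --- Using HttpClient.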

package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Created by liuhj on 2016/1/4.
 */
public class HttpGetUtils {
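    /**
     * Sends a GET request and returns the response body as a string.
     * @param url the URL of the page to request
     * @return the response body, or "" when the request throws or the status is not 200
     */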
    public static String get(String url){
        String result = "";
        try {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(url);
            CloseableHttpResponse response = httpclient.execute(httpGet);
            try {
                if (response != null
                        && response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    System.out.println(response.getStatusLine());
                    HttpEntity entity = response.getEntity();
                    result = readResponse(entity, "utf-8");
                }
            } finally {
                // Release the response before the client that owns its connection
                response.close();
                httpclient.close();
            }
        }catch (Exception e){
            e.printStackTrace();
        }
        return result;
    }

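    /**
     * Reads the entity's content stream into a string using the given charset.
     * Line breaks are dropped, so the page comes back as one long line.
     * @param resEntity the response entity
     * @param charset the character encoding name, e.g. "utf-8"
     * @return the content as a string, or null if the entity is null
     */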
    private static String readResponse(HttpEntity resEntity, String charset) {
        StringBuilder res = new StringBuilder();
        BufferedReader reader = null;
        try {
            if (resEntity == null) {
                return null;
            }

            reader = new BufferedReader(new InputStreamReader(
                    resEntity.getContent(), charset));
            String line = null;

            while ((line = reader.readLine()) != null) {
                res.append(line);
            }

        } catch (Exception e) {
            e.printStackTrace();
        } finally {

            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (IOException e) {
            }

        }
        return res.toString();
    }
}

The methods above fetch the page source. Of course, this is only a utility class; it has to be used together with the classes that follow.

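Step 2: analyze the resource

Don't try to analyze the raw printed output, which is far too messy. It is much easier to inspect the target page directly.

Open www.baidu.com in Chrome, right-click, and choose Inspect Element.

The inspector highlights the line that contains the Baidu logo. Looking through the page, the only image that follows a hidefocus attribute is the logo, so that is enough to write the regular expression.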

Step 3: the regular expression

String regex = "hidefocus.+?src=\"//(.+?)\"";
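The pattern starts at hidefocus and lazily skips any characters up to src.
The content inside the double quotes after src (minus the leading //) is what we want, so it is wrapped in () as a capturing group.
That way it can be read back with Java's group method when using the regex API.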
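To make the role of the capturing group concrete, here is a small sketch that runs the same pattern against a short HTML fragment. The fragment is a hypothetical stand-in modeled on the logo tag, not the real Baidu markup, and the class and method names are mine. It shows the difference between group(0), the whole match, and group(1), the part inside the parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogoRegexDemo {
    // Same pattern as in the article
    static final String REGEX = "hidefocus.+?src=\"//(.+?)\"";

    // Returns {whole match, group 1} for the first match, or an empty array if nothing matches
    static String[] groups(String html) {
        Matcher matcher = Pattern.compile(REGEX).matcher(html);
        if (matcher.find()) {
            return new String[] { matcher.group(0), matcher.group(1) };
        }
        return new String[0];
    }

    public static void main(String[] args) {
        // Hypothetical fragment standing in for the real page source
        String html = "<img hidefocus=\"true\" src=\"//www.baidu.com/img/bd_logo1.png\" width=\"270\">";
        String[] g = groups(html);
        System.out.println("group(0): " + g[0]);
        System.out.println("group(1): " + g[1]);
    }
}
```

Because both quantifiers are lazy (.+?), the match stops at the first src and the first closing quote instead of running on to a later one in the page.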

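Step 4: RegexStringUtils.java captures the result with the Java regex API. For the details, see the third post, Simple Java Crawler Series (3) --- Regular Expressions and the Java Regex API.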

package com.hldh.river;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Created by liuhj on 2016/1/4.
 */
public class RegexStringUtils {
    public static String regexString(String targetStr, String patternStr){
        Pattern pattern = Pattern.compile(patternStr);
        // Create a matcher that runs the pattern against the target string
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found
        if (matcher.find()) {
            // Return the first capturing group
            // System.out.println(matcher.group(1));
            return matcher.group(1);
        }
        }
        return "";
    }
}
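One detail is worth spelling out. By default, "." in a Java pattern does not match line terminators. The readResponse method in HttpGetUtils joins lines without re-adding the line breaks, so the whole page is a single line and the pattern can span what were separate lines in the source. If the HTML were kept with its line breaks, the same pattern would need Pattern.DOTALL. The sketch below illustrates this with a hypothetical two-line fragment; the class and method names are assumptions of mine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Same pattern as in the article
    static final String REGEX = "hidefocus.+?src=\"//(.+?)\"";

    // flags is 0 for the default behaviour, or Pattern.DOTALL to let '.' match line terminators
    static String firstGroup(String html, int flags) {
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(html);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        // Hypothetical markup with a line break between the two attributes
        String html = "<img hidefocus=\"true\"\n src=\"//www.baidu.com/img/bd_logo1.png\">";
        System.out.println("default: [" + firstGroup(html, 0) + "]");
        System.out.println("DOTALL:  [" + firstGroup(html, Pattern.DOTALL) + "]");
    }
}
```

With the default flags the first call finds nothing and returns the empty string; with DOTALL the second call returns the logo address.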

Here is the main class, App.java

package com.hldh.river;

/**
 *  Fetches the Baidu logo with a regular expression
 */
public class App {
    public static void main( String[] args ){

        String url = "http://www.baidu.com/";
        String regex = "hidefocus.+?src=\"//(.+?)\"";
        System.out.println(regex);
        String result = HttpGetUtils.get(url);
        System.out.println(result);
        String src = RegexStringUtils.regexString(result, regex);
        System.out.println(src);
    }
}

Output

hidefocus.+?src="//(.+?)"
HTTP/1.1 200 OK
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8">.....(middle omitted)</script></body></html>
www.baidu.com/img/bd_logo1.png

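The last line is the address of the Baidu logo image.

You may be thinking that an address on its own is not much use: if the target were a photo I actually wanted to look at, I still could not see it. Don't worry.

Here is DownloadUtils.java, a utility class that downloads the image with HttpClient.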

package com.hldh.river;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.*;

/**
 * Created by liuhj on 2016/1/8.
 * Downloads the image that was found
 */
public class DownloadUtils {

    public static String get(String url){
        String filename = "";
        String targetUrl = "http://" + url;
        try {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(targetUrl);
            CloseableHttpResponse response = httpclient.execute(httpGet);

            try {
                if (response != null
                        && response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    System.out.println(response.getStatusLine());
                    HttpEntity entity = response.getEntity();
                    filename = download(entity);
                }
            } finally {
                // Release the response before the client that owns its connection
                response.close();
                httpclient.close();
            }
        }catch (Exception e){
            e.printStackTrace();
        }
        return filename;
    }
    private static String download(HttpEntity resEntity) {
        // Directory where the image is saved
        String dirPath = "d:\\img\\";
        // Image file name; this could also be generated dynamically
        String fileName = "b_logo.png";
        // Create the directory if it is missing, then the file if it is missing
        File file = new File(dirPath);
        if(!file.exists()){
            file.mkdirs();
        }
        String realPath = dirPath.concat(fileName);
        File filePath = new File(realPath);
        if (!filePath.exists()) {
            try {
                filePath.createNewFile();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Copy the input stream to a buffered output stream in 1 KB chunks, then flush and release both
        BufferedOutputStream out = null;
        InputStream in = null;
        try {
            if (resEntity == null) {
                return null;
            }
            in = resEntity.getContent();

            out = new BufferedOutputStream(new FileOutputStream(filePath));
            byte[] bytes = new byte[1024];
            int len = -1;
            while((len = in.read(bytes)) != -1){
                out.write(bytes,0,len);
            }
            out.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close both streams even if the copy fails
            try {
                if (out != null) {
                    out.close();
                }
            } catch (IOException e) {
            }
            try {
                if (in != null) {
                    in.close();
                }
            } catch (IOException e) {
            }
        }
        return filePath.toString();
    }
}
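The download method hard-codes the file name and notes that it could be generated instead. As one possible way to do that, the sketch below derives the name from the last path segment of the URL and saves the stream with the JDK's java.nio.file.Files.copy, which replaces the manual read/write loop. The class and method names are hypothetical, the byte array stands in for the stream returned by entity.getContent(), and the naive name extraction does not handle URLs with query strings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ImageSaver {
    // Takes the text after the last '/' of the URL as the file name
    static String fileNameFromUrl(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // Copies the stream into dir/fileName, creating dir if needed, and returns the written path
    static Path save(InputStream in, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        String name = fileNameFromUrl("http://www.baidu.com/img/bd_logo1.png");
        // Placeholder bytes stand in for the HTTP response body
        try (InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3})) {
            Path saved = save(in, Files.createTempDirectory("img"), name);
            System.out.println(saved.getFileName() + " " + Files.size(saved) + " bytes");
        }
    }
}
```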

The result of running DownloadUtils is the logo image saved to disk.

That completes fetching the Baidu logo with regular expressions. As an extension, a later post will use Jsoup to find the image and save it.