Crawling Blog Posts with Java

Preface

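I recently bought a domain name from a cloud provider, planning to buy a server later and host my own website on it. That takes too much preparation, so I have put it on hold for now and built a static site on GitHub Pages instead. The setup had its twists (the domain configuration in particular nearly wore me out), but overall it went fairly smoothly. The site is at https://chenchangyuan.cn (an empty blog for now; the theme looks quite nice, and I will fill it in over time).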

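The site is built with git + npm + hexo, plus the corresponding configuration on GitHub. There are plenty of tutorials online; if you have questions, feel free to leave a comment.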

I worked with Java for a few years, but my role at the company gradually steered me elsewhere, and I now mainly do front-end development.

So I want to crawl my posts with Java and then convert the downloaded HTML to Markdown (not implemented yet; pointers are welcome).

1. Collect all post URLs of the blog

The blog's listing pages look like https://www.cnblogs.com/ccylovehs/default.html?page=1

Iterate over as many listing pages as your number of posts requires.

Store each post's detail-page URL in a Set; a detail-page URL looks like https://www.cnblogs.com/ccylovehs/p/9547690.html
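The collection step can be sketched without any network access. The following is a minimal illustration, not the post's actual implementation: the class name `PostUrlCollector`, its two helper methods, and the sample HTML fragment (including the post id 1234567) are assumptions made for the example, and the real listing-page markup may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class PostUrlCollector {

    // Detail-page URL shape from the text above; dots are escaped so they match literally.
    static final Pattern POST_URL =
            Pattern.compile("https://www\\.cnblogs\\.com/ccylovehs/p/[0-9]+\\.html");

    /** Builds the listing-page URLs for pages 1..pageCount. */
    static List<String> listingPages(int pageCount) {
        List<String> pages = new ArrayList<>();
        for (int i = 1; i <= pageCount; i++) {
            pages.add("https://www.cnblogs.com/ccylovehs/default.html?page=" + i);
        }
        return pages;
    }

    /** Adds every post URL found in the HTML to the set; duplicates collapse automatically. */
    static void collect(String html, Set<String> urls) {
        Matcher m = POST_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
    }

    public static void main(String[] args) {
        // Made-up listing fragment: the same post is linked twice, as title and as "read more".
        String html = "<a href=\"https://www.cnblogs.com/ccylovehs/p/9547690.html\">post</a>"
                + "<a href=\"https://www.cnblogs.com/ccylovehs/p/9547690.html\">read more</a>"
                + "<a href=\"https://www.cnblogs.com/ccylovehs/p/1234567.html\">other</a>";
        Set<String> urls = new TreeSet<>();
        collect(html, urls);
        System.out.println(listingPages(2));
        System.out.println(urls); // two distinct URLs, although three links were scanned
    }
}
```

Using a Set means a post that is linked several times on a listing page is downloaded only once.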

2. Generate an HTML file from each detail-page URL

Iterate over the Set and generate the HTML files one by one.

Files are saved under C://data//blog, and each file name comes from capture group 1 of the regular expression.
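To make the file-naming rule concrete, here is a small sketch of how capture group 1 can turn a detail-page URL into a target file. The class `PostFileNames` and the method `targetFile` are hypothetical names chosen for this example. Rejecting non-matching URLs with an exception is a design choice of the sketch, not something the post prescribes.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only for this illustration.
class PostFileNames {

    // Group 1 captures the trailing "<digits>.html" part of a post URL.
    static final Pattern POST_URL =
            Pattern.compile("https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)");

    /** Returns the target file for a post URL, or throws if the URL is not a post URL. */
    static File targetFile(File dir, String url) {
        Matcher m = POST_URL.matcher(url);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a post URL: " + url);
        }
        return new File(dir, m.group(1));
    }

    public static void main(String[] args) {
        File dir = new File("C:/data/blog");
        File f = targetFile(dir, "https://www.cnblogs.com/ccylovehs/p/9547690.html");
        System.out.println(f.getName()); // prints "9547690.html"
    }
}
```

Checking the match before calling `group(1)` avoids the `IllegalStateException` that `Matcher` throws when no match has been found.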

3. Implementation

package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author Jack Chen
 * */
public class BlogUtil {

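    /**
     * URL_PAGE: cnblogs listing-page URL (the page number is appended)
     * URL_PAGE_DETAIL: regex for a detail-page URL
     * PAGE_COUNT: number of listing pages
     * urlLists: Set of all detail-page URLs (a Set prevents duplicates)
     * p: compiled pattern for URL_PAGE_DETAIL
     * */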
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    // Dots are escaped so that "." matches a literal dot rather than any character
    public final static String URL_PAGE_DETAIL = "https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)";
    public final static int PAGE_COUNT = 3;
    public static Set<String> urlLists = new TreeSet<String>();
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);
    
    
    public static void main(String[] args) throws Exception {
        for(int i = 1;i<=PAGE_COUNT;i++) {
            getUrls(i);
        }
        for(Iterator<String> i = urlLists.iterator();i.hasNext();) {
            createFile(i.next());
        }
    }
    
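    /**
     * Downloads one detail page and saves it as an HTML file.
     * @param url detail-page URL
     * @throws Exception if the URL does not match the pattern or I/O fails
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a detail-page URL: " + url);
        }
        // Capture group 1 (e.g. "9547690.html") becomes the file name
        String fileName = m.group(1);
        String prefix = "C://data//blog//";
        File file = new File(prefix + fileName);
        file.getParentFile().mkdirs(); // make sure the target directory exists

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // try-with-resources closes both streams even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
             PrintStream ps = new PrintStream(file, "utf-8")) {
            String str;
            while ((str = br.readLine()) != null) {
                ps.println(str);
            }
        } finally {
            conn.disconnect();
        }
    }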
    
    /**
     * Collects the detail-page URLs found on one listing page.
     * @param idx listing page number (starting at 1)
     * @throws Exception if the request or reading fails
     */
    private static void getUrls(int idx) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(URL_PAGE + idx).openConnection();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"))) {
            String str;
            while ((str = br.readLine()) != null) {
                Matcher m = p.matcher(str);
                // A line may contain several post links, so loop over every match
                while (m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        } finally {
            conn.disconnect();
        }
    }
    
}

4. Closing remarks

If you find this useful, please take a moment to give the repository a star; your encouragement is my biggest motivation.

https://github.com/chenchangyuan/getHtmlForJava

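Since I don't want to create the md files by hand one post at a time, the next step is to batch-convert the HTML files to md so I can fill in my blog. To be continued.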
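The conversion step is not implemented in this post. As one possible starting point, the sketch below converts a handful of tags (headings, links, bold text, paragraphs) with regular expressions and drops everything else. The class `HtmlToMarkdownSketch` and its `convert` method are hypothetical, and the approach is an assumption rather than the author's method. It does not handle HTML entities, lists, images, or code blocks, so real cnblogs pages would likely need a proper HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex-based conversion of a few simple tags only.
class HtmlToMarkdownSketch {

    private static final Pattern HEADING =
            Pattern.compile("(?is)<h([1-6])[^>]*>(.*?)</h\\1>");
    private static final Pattern LINK =
            Pattern.compile("(?is)<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>");

    static String convert(String html) {
        // <h1>..<h6> become "#".."######" followed by the heading text
        Matcher h = HEADING.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (h.find()) {
            int level = Integer.parseInt(h.group(1));
            String hashes = "######".substring(0, level);
            String line = hashes + " " + h.group(2).trim() + "\n\n";
            h.appendReplacement(sb, Matcher.quoteReplacement(line));
        }
        h.appendTail(sb);
        String md = sb.toString();

        // <a href="u">text</a> becomes [text](u)
        md = LINK.matcher(md).replaceAll("[$2]($1)");
        // <strong> and <b> become **bold**
        md = md.replaceAll("(?is)<(strong|b)>(.*?)</\\1>", "**$2**");
        // the end of a paragraph becomes a blank line
        md = md.replaceAll("(?i)</p>", "\n\n");
        // any tag that is still left is dropped
        md = md.replaceAll("<[^>]+>", "");
        return md.trim();
    }

    public static void main(String[] args) {
        // Made-up input fragment for demonstration
        String html = "<h2>Title</h2><p>See <a href=\"https://chenchangyuan.cn\">my <strong>blog</strong></a>.</p>";
        System.out.println(convert(html));
    }
}
```

For the sample input this yields a `## Title` heading followed by a paragraph containing `[my **blog**](https://chenchangyuan.cn)`.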

 

