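I recently bought my own domain name from a cloud provider, planning to buy a server later and build my own website on it. There was too much to prepare, so I put that on hold and decided to set up a static site on GitHub Pages first. The process was a bit bumpy, mainly because configuring the domain nearly wore me out, but overall it went fairly smoothly. Site address: https://chenchangyuan.cn (an empty blog for now; the theme looks quite nice, and I will keep adding to it).
The site is built with git + npm + hexo, plus the corresponding configuration on GitHub. There are plenty of tutorials online; if you have questions, feel free to leave a comment.
I worked with Java for a few years. Because of my role at the company I was gradually bent in another direction, and these days I mainly do front-end development.
So I want to use Java to crawl my articles and then convert the crawled HTML into md (not implemented yet; pointers are welcome).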
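Look at the blog's list page: https://www.cnblogs.com/ccylovehs/default.html?page=1
Loop over the list pages according to how many posts you have written
Store each post's detail-page URL in a Set, for example https://www.cnblogs.com/ccylovehs/p/9547690.html
Iterate over the Set and generate one HTML file per URL
The files are saved under the C:/data/blog directory, and each file name is taken from capture group 1 of the URL pattern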
package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Downloads every post of a cnblogs blog as a local HTML file.
 *
 * @author Jack Chen
 */
public class BlogUtil {

    /** cnblogs list-page URL; the page number is appended to it. */
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    /** Detail-page URL pattern; capture group 1 is the file name (dots escaped to match literally). */
    public final static String URL_PAGE_DETAIL = "https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)";
    /** Number of list pages to walk. */
    public final static int PAGE_COUNT = 3;
    /** All detail-page URLs; a Set prevents duplicates. */
    public static Set<String> urlLists = new TreeSet<String>();
    /** Compiled matching pattern. */
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);

    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= PAGE_COUNT; i++) {
            getUrls(i);
        }
        for (String url : urlLists) {
            createFile(url);
        }
    }

    /**
     * Downloads one detail page and writes it to a local HTML file.
     *
     * @param url detail-page URL
     * @throws Exception if the page cannot be read or the file cannot be written
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        if (!m.find()) {
            return; // not a detail-page URL, nothing to save
        }
        String fileName = m.group(1);
        String prefix = "C:/data/blog/";
        new File(prefix).mkdirs(); // make sure the target directory exists
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.connect();
        // try-with-resources closes the reader and the file even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
             PrintStream ps = new PrintStream(new File(prefix + fileName), "utf-8")) {
            String str;
            while ((str = br.readLine()) != null) {
                ps.println(str);
            }
        } finally {
            conn.disconnect();
        }
    }

    /**
     * Reads one list page and collects the detail-page URLs it contains.
     *
     * @param idx list-page number, starting at 1
     * @throws Exception if the page cannot be read
     */
    private static void getUrls(int idx) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(URL_PAGE + idx).openConnection();
        conn.connect();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"))) {
            String str;
            while ((str = br.readLine()) != null) {
                Matcher m = p.matcher(str);
                // a single line may contain more than one detail-page URL
                while (m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        } finally {
            conn.disconnect();
        }
    }
}
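To see how the pattern turns a detail-page URL into a file name, the matching step can be tried on its own, without any network access. The following is only a minimal sketch: the class name `DetailUrlDemo`, its two helper methods, and the sample anchor tag are made up for illustration (the real list-page markup may look different), and the dots in the pattern are escaped so that only a literal "." matches.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original project.
class DetailUrlDemo {

    // Detail-page pattern; capture group 1 is the part used as the file name.
    static final Pattern DETAIL =
            Pattern.compile("https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)");

    /** Returns the file name captured by group 1, or null if the line has no detail-page URL. */
    static String fileNameOf(String line) {
        Matcher m = DETAIL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Collects every detail-page URL found in the given lines; the Set drops duplicates. */
    static Set<String> collect(String... lines) {
        Set<String> urls = new TreeSet<>();
        for (String line : lines) {
            Matcher m = DETAIL.matcher(line);
            while (m.find()) {
                urls.add(m.group()); // group() is the whole matched URL
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample of one line from a list page.
        String line = "<a href=\"https://www.cnblogs.com/ccylovehs/p/9547690.html\">post</a>";
        System.out.println(fileNameOf(line));    // prints 9547690.html
        System.out.println(collect(line, line)); // prints [https://www.cnblogs.com/ccylovehs/p/9547690.html]
    }
}
```

Passing the same line twice to `collect` still yields a single entry, which is the reason the URLs are kept in a Set rather than a List.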
If you find this useful, please take a moment to give the project a star; your encouragement is my biggest motivation.
https://github.com/chenchangyuan/getHtmlForJava
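Because I do not want to create the md files by hand one post at a time, the next step is to batch-convert the HTML files into md files so that I can fill out my personal blog. To be continued~~~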
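As one possible starting point for that conversion step, here is a rough sketch in plain Java. It is not the finished solution for this project: the class name `HtmlToMdSketch` and the method `toMarkdown` are hypothetical, and the regex-based approach only handles a few simple tags (headings, links, bold, italic, paragraphs, line breaks). Real post HTML with nested markup, code blocks, or tables would most likely need a proper HTML parser instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: converts a small subset of HTML to Markdown with regular expressions.
class HtmlToMdSketch {

    private static final Pattern HEADING = Pattern.compile("(?is)<h([1-6])[^>]*>(.*?)</h\\1>");
    private static final Pattern LINK =
            Pattern.compile("(?is)<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>");
    private static final Pattern BOLD = Pattern.compile("(?is)<(strong|b)>(.*?)</\\1>");
    private static final Pattern ITALIC = Pattern.compile("(?is)<(em|i)>(.*?)</\\1>");

    static String toMarkdown(String html) {
        // Headings: <h2>Title</h2> becomes "## Title" on its own line.
        Matcher h = HEADING.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (h.find()) {
            int level = Integer.parseInt(h.group(1));
            String hashes = "######".substring(0, level);
            String heading = "\n\n" + hashes + " " + h.group(2).trim() + "\n\n";
            h.appendReplacement(sb, Matcher.quoteReplacement(heading));
        }
        h.appendTail(sb);
        String md = sb.toString();

        // Inline elements.
        md = LINK.matcher(md).replaceAll("[$2]($1)");
        md = BOLD.matcher(md).replaceAll("**$2**");
        md = ITALIC.matcher(md).replaceAll("*$2*");

        // Block breaks, then drop every tag that is still left.
        md = md.replaceAll("(?i)</p>", "\n\n");
        md = md.replaceAll("(?i)<br\\s*/?>", "\n");
        md = md.replaceAll("<[^>]+>", "");

        // A few common entities; &amp; goes last so "&amp;lt;" is not decoded twice.
        md = md.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
               .replace("&quot;", "\"").replace("&amp;", "&");

        // Collapse runs of blank lines and trim the ends.
        return md.replaceAll("\n{3,}", "\n\n").trim();
    }

    public static void main(String[] args) {
        String html = "<h2>Title</h2><p>Hello <strong>world</strong>, "
                + "see <a href=\"https://example.com\">link</a>.</p>";
        System.out.println(toMarkdown(html));
    }
}
```

To batch-convert the downloaded files, each HTML file in the output directory could be read into a string, passed through `toMarkdown`, and written back out with an `.md` extension.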