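Periodically crawl web page links, extract the page content, and store it in a database
Workflow
- Provide the web page address(es) to crawl (a list)
- Extract all target links from the listed pages
- Fetch every page those links point to (the crawler)
- Parse the main body content
- Store it in the database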
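The link-extraction step in the workflow above could be sketched roughly as follows. This is a minimal illustration, not the article's `ExtractPage.getLinksByConditions`: the class `LinkFilter`, the method `linksWithPrefix`, and the regular expression are all assumptions made for this example, and a regex is only a rough stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the article's ExtractPage class.
class LinkFilter {

    // Captures the href value of <a> tags. A rough pattern, not a full HTML parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the distinct links in html that start with prefix, in document order.
    static List<String> linksWithPrefix(String html, String prefix) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link.startsWith(prefix) && !links.contains(link)) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.free9.net/others/gift/1.html\">A</a>"
                + "<a href=\"http://example.com/x\">B</a>";
        // Only the link under the target prefix is kept.
        System.out.println(linksWithPrefix(html, "http://www.free9.net/others/gift/"));
    }
}
```

Filtering by URL prefix keeps the crawler on the target section of the site, which is the role the second argument of `getLinksByConditions` appears to play in the main program.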
1. Crawl job (main program)
    package com.test;

    import java.util.List;

    public class CatchJob {

        // Only links under this prefix are followed.
        private static final String TARGET_PREFIX = "http://www.free9.net/others/gift/";

        /**
         * Fetches the list page at url, collects the links that match TARGET_PREFIX,
         * then fetches each linked page and hands its HTML to ExtractPage.readByHtml.
         * Returns "success", or "failure" if an exception interrupted the run.
         */
        public String catchJob(String url) {
            try {
                String document = ExtractPage.getContentByUrl(url);
                List<?> allLinks = ExtractPage.getLinksByConditions(document, TARGET_PREFIX);
                if (allLinks != null && !allLinks.isEmpty()) {
                    for (Object item : allLinks) {
                        String link = (String) item;
                        String content = ExtractPage.getContentByUrl(link);
                        ExtractPage.readByHtml(content);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                return "failure";
            }
            return "success";
        }

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            System.out.println(">>start.......");

            // Route HTTP requests through a proxy; adjust or remove for your network.
            System.setProperty("http.proxyHost", "211.167.0.131");
            System.setProperty("http.proxyPort", "80");

            CatchJob job = new CatchJob();
            System.out.println(job.catchJob(TARGET_PREFIX));

            // Format the elapsed time by arithmetic. Formatting a Date built from a
            // duration would add the local time-zone offset to the result.
            long seconds = (System.currentTimeMillis() - startTime) / 1000;
            String elapsed = String.format("%02d:%02d:%02d",
                    seconds / 3600, (seconds % 3600) / 60, seconds % 60);
            System.out.println(">>end.......elapsed " + elapsed);
        }
    }

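The main program above runs the crawl once. To run it on a schedule, as the title describes, one option is `ScheduledExecutorService` from the JDK. The sketch below is only one possible approach, not the article's method: `CrawlScheduler` and `schedule` are hypothetical names, and the printed message stands in for a call such as `new CatchJob().catchJob(url)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler wrapper for illustration; not part of the article's code.
class CrawlScheduler {

    // Runs job immediately, then again every period units, until the returned
    // executor is shut down.
    static ScheduledExecutorService schedule(Runnable job, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs,
                // so log it and keep the schedule alive.
                e.printStackTrace();
            }
        }, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short demo: three runs, 100 ms apart. A real crawl would use a much
        // longer period, for example 1 with TimeUnit.HOURS.
        CountDownLatch threeRuns = new CountDownLatch(3);
        ScheduledExecutorService scheduler = schedule(() -> {
            System.out.println("crawl run"); // stand-in for the real crawl job
            threeRuns.countDown();
        }, 100, TimeUnit.MILLISECONDS);

        threeRuns.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
    }
}
```

A single-threaded scheduler keeps runs from overlapping: if one crawl takes longer than the period, the next run starts late instead of running concurrently.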
2. Fetch the page content and parse it