SeimiCrawler, a Java Crawler Framework: Structured Parsing and Data Storage

This article shows how to use SeimiCrawler to extract information from pages as structured data and store it in a database, which is a very common use case. As the example, we will crawl blog posts from Cnblogs (cnblogs.com).

Setting up the basic data structure

For the demo, it is enough to create a single table that stores the two main pieces of information, the blog title and its content:

CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Also create a corresponding bean:

package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * XPath syntax reference: http://jsoupxpath.wanghaomiao.cn/
 */
public class BlogContent {
    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;
    // can also be written as @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        if (StringUtils.isNotBlank(content) && content.length() > 100) {
            // truncate the content so the output stays readable
            this.content = StringUtils.substring(content, 0, 100) + "...";
        }
        }
        return ToStringBuilder.reflectionToString(this);
    }
}

The @Xpath annotation deserves special attention: it carries the XPath extraction rule for the annotated field. As shown later, SeimiCrawler calls Response.render(Class&lt;T&gt; bean) to parse the page and populate those fields automatically. For a developer, this is the bulk of the work needed to extract structured data, and that is really all there is to it. What follows shows how everything is wired together.
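Before wiring the rules into the bean, the XPath expressions can be tried out directly against an HTML snippet with the JXDocument class from JsoupXpath, the same engine SeimiCrawler uses under the hood. This is a rough sketch assuming the older JsoupXpath API (the one matching the `cn.wanghaomiao.xpath.model` package used later in this article), where JXDocument can be built from an HTML string; the HTML snippet is made up for illustration:

```java
import cn.wanghaomiao.xpath.model.JXDocument;

import java.util.List;

public class XpathSmokeTest {
    public static void main(String[] args) throws Exception {
        // a minimal stand-in for a cnblogs post page
        String html = "<html><body>"
                + "<a id='cb_post_title_url'>Hello SeimiCrawler</a>"
                + "<div id='cnblogs_post_body'>post body text</div>"
                + "</body></html>";
        JXDocument doc = new JXDocument(html);
        // the same rule as the @Xpath annotation on BlogContent.title
        List<Object> titles = doc.sel("//a[@id='cb_post_title_url']/text()");
        System.out.println(titles);
    }
}
```

Checking the rules this way is much faster than re-running the whole crawler after every tweak.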

Implementing data storage

This article uses paoding-jade, an ORM framework open-sourced by Renren in its early days. Since SeimiCrawler manages its object pool and dependency injection with Spring, it naturally supports any ORM framework that can integrate with Spring. To enable Jade, add the following dependencies to your pom:

<dependency>
	<groupId>net.paoding</groupId>
	<artifactId>paoding-rose-jade</artifactId>
	<version>2.0.u01</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-dbcp2</artifactId>
	<version>2.1.1</version>
</dependency>
<dependency>
	<groupId>mysql</groupId>
	<artifactId>mysql-connector-java</artifactId>
	<version>5.1.37</version>
</dependency>

Then add a seimi-jade.xml configuration file under resources:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

       <bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource">
           <property name="driverClassName" value="com.mysql.jdbc.Driver" />
           <property name="url" value="jdbc:mysql://127.0.0.1:3306/xiaohuo?useUnicode=true&characterEncoding=UTF8&autoReconnect=true&autoReconnectForPools=true&zeroDateTimeBehavior=convertToNull" />
           <property name="username" value="xx" />
           <property name="password" value="xx" />
       </bean>
       <!-- enable Jade -->
       <bean class="net.paoding.rose.jade.context.spring.JadeBeanFactoryPostProcessor" />
</beans>

Writing the DAO:

package cn.wanghaomiao.dao;

import cn.wanghaomiao.model.BlogContent;
import net.paoding.rose.jade.annotation.DAO;
import net.paoding.rose.jade.annotation.ReturnGeneratedKeys;
import net.paoding.rose.jade.annotation.SQL;

@DAO
public interface StoreToDbDAO {
    @ReturnGeneratedKeys
    @SQL("insert into blog (title,content,update_time) values (:1.title,:1.content,now())")
    public int save(BlogContent blog);
}

With storage taken care of, the next step is our crawler rule class.

Crawler

Here it is:

package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.StoreToDbDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Stores the parsed data directly into the database
 */
@Crawler(name = "storedb")
public class DatabaseStoreDemo extends BaseSeimiCrawler {
    @Autowired
    private StoreToDbDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s:urls){
                push(Request.build(s.toString(),"renderBean"));
            }
        } catch (Exception e) {
            logger.error("index page parse error, url={}", response.getUrl(), e);
        }
    }
    public void renderBean(Response response){
        try {
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}",blog,response.getUrl());
            // store to the DB via paoding-jade
            int blogId = storeToDbDAO.save(blog);
            logger.info("store success, blogId = {}", blogId);
        } catch (Exception e) {
            logger.error("render or store error, url={}", response.getUrl(), e);
        }
    }
}
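To actually run the crawler, an entry point is still needed. With the classic SeimiCrawler API (the same generation as the `cn.wanghaomiao.seimi` packages imported above), a minimal bootstrap looks roughly like this; treat it as a sketch and check the boot API of the exact version you depend on, since later versions start differently:

```java
import cn.wanghaomiao.seimi.core.Seimi;

public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        // start the crawler registered as "storedb" via @Crawler(name = "storedb")
        s.start("storedb");
    }
}
```

Once started, SeimiCrawler scans for @Crawler-annotated classes, fetches the start URLs, and dispatches each response to the callback named in the corresponding Request.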

A complete demo is also available on GitHub; feel free to download it and try it out yourself.
