Seimi Basics Series 2: Integrating MyBatis with SeimiCrawler to Store Data

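Quite a few people have recently been asking about integrating MyBatis with SeimiCrawler, so this article is offered as a starting point. If you are not yet familiar with SeimiCrawler, it also works as a brief introduction.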

About SeimiCrawler

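SeimiCrawler is an agile, independently deployable Java crawler framework with distributed support. It aims to lower the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to write the scraping business logic; Seimi takes care of the rest. In its design, SeimiCrawler is inspired by Scrapy, the Python crawler framework, while incorporating the characteristics of the Java language and the features of Spring. It also hopes to make XPath, a more efficient way of parsing HTML, easier and more widely used in China, so its default HTML parser is JsoupXpath (an independent extension project, not bundled with jsoup), and HTML data extraction is done with XPath by default (you can, of course, choose another parser for data processing). Combined with SeimiAgent, it is also designed to handle the rendering and scraping of complex dynamic pages.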

Project source code

Hosted on GitHub

Now for the MyBatis integration itself. MySQL is used as the example database.

Dependencies

<dependency>
	<groupId>cn.wanghaomiao</groupId>
	<artifactId>SeimiCrawler</artifactId>
	<version>1.2.0</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-dbcp2</artifactId>
	<version>2.1.1</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-pool2</artifactId>
	<version>2.4.2</version>
</dependency>
<dependency>
	<groupId>mysql</groupId>
	<artifactId>mysql-connector-java</artifactId>
	<version>5.1.37</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis-spring</artifactId>
	<version>1.3.0</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis</artifactId>
	<version>3.4.1</version>
</dependency>

Table structure

Assume a database named xiaohuo already exists and contains the following table:

CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The corresponding model object

package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * For XPath syntax, see http://jsoupxpath.wanghaomiao.cn/
 * @since 2015/10/27.
 */
public class BlogContent {
    private Integer id;

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    // Can also be written as @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

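    @Override
    public String toString() {
        // Truncate a copy for easier viewing; the field itself is left intact,
        // so logging the bean before save() does not shorten what gets stored.
        String shown = content;
        if (StringUtils.isNotBlank(content) && content.length() > 100) {
            shown = StringUtils.substring(content, 0, 100) + "...";
        }
        return new ToStringBuilder(this)
                .append("id", id)
                .append("title", title)
                .append("content", shown)
                .toString();
    }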
}
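
The title expression above joins two XPath branches with `|`, so it can match either of two page layouts. As a rough illustration of how such a union behaves, here is a minimal sketch that evaluates the same expression with the JDK's built-in XPath engine. It is not JsoupXpath and not how SeimiCrawler renders beans internally; the class name, the method name, and the two tiny well-formed sample pages are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XpathTitleSketch {
    // Same expression as the @Xpath annotation on BlogContent.title
    static final String TITLE_XPATH =
            "//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()";

    /** Evaluates the title expression against a well-formed XML/XHTML string. */
    static String extractTitle(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // STRING evaluation yields the first matching node's text,
            // or an empty string when nothing matches.
            return (String) xpath.evaluate(
                    TITLE_XPATH,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample pages, one per layout covered by the union
        String layoutA = "<html><body><h1 class='postTitle'>"
                + "<a href='#'>Hello Seimi</a></h1></body></html>";
        String layoutB = "<html><body>"
                + "<a id='cb_post_title_url' href='#'>Hello MyBatis</a></body></html>";
        System.out.println(extractTitle(layoutA)); // prints "Hello Seimi"
        System.out.println(extractTitle(layoutB)); // prints "Hello MyBatis"
    }
}
```

The `allText()` function used for the content field is specific to JsoupXpath, so it is left out of this sketch.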

MyBatis integration configuration files

  • Add mybatis-config.xml under resources

Some basic global settings:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>
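
The mapUnderscoreToCamelCase setting asks MyBatis to map column names such as update_time onto camel-case property names when it auto-maps query results. The sketch below only illustrates that naming rule in plain Java; it is not MyBatis code, and the class and method names are made up for this example.

```java
public class UnderscoreToCamelSketch {

    /** Converts an underscore-separated column name to a camel-case property name. */
    static String toCamelCase(String column) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : column.toCharArray()) {
            if (c == '_') {
                // Drop the underscore and capitalize the character that follows it
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("update_time")); // prints "updateTime"
        System.out.println(toCamelCase("id"));          // prints "id"
    }
}
```

The DAO in this article only inserts rows, so the setting has no visible effect yet; it would start to matter once a select maps update_time onto a (hypothetical) updateTime property.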

  • Add seimi-mybatis.xml under resources

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">


    <context:annotation-config />
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${database.driverClassName}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>

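The ${database.driverClassName} style placeholders in the configuration are there because the SeimiCrawler demo project also has dynamic configuration set up. You can just as well hard-code the values here and skip reading any other configuration.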
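
To make the placeholder idea concrete, here is a minimal sketch of how ${...} keys could be substituted from a set of properties. It is a plain-Java illustration, not Spring's actual placeholder machinery. The four keys come from the configuration above; every value (driver class, URL, credentials) and the class and method names are assumptions for this example.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {
    // Matches ${some.key} and captures the key
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${key} in text with the matching property value. */
    static String resolve(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being misread
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed values; adjust to your own environment
        props.setProperty("database.driverClassName", "com.mysql.jdbc.Driver");
        props.setProperty("database.url", "jdbc:mysql://127.0.0.1:3306/xiaohuo");
        props.setProperty("database.username", "root");
        props.setProperty("database.password", "change-me");

        System.out.println(resolve("${database.driverClassName}", props));
        System.out.println(resolve("${database.url}", props));
    }
}
```

Hard-coding, as mentioned above, simply means writing these values straight into the value attributes of the mybatisDataSource bean.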

  • Add the DAO under the cn.wanghaomiao.dao.mybatis package

package cn.wanghaomiao.dao.mybatis;

import cn.wanghaomiao.model.BlogContent;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;

/**
 * @since 2016/7/27.
 */
public interface MybatisStoreDAO {

    @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
    @Options(useGeneratedKeys = true, keyProperty = "blog.id")
    int save(@Param("blog") BlogContent blog);
}

At this point the MyBatis side is ready.

Usage
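
In the DAO's @Insert statement, each #{blog.title} style token is bound as a prepared-statement parameter rather than pasted into the SQL text. The sketch below is a deliberately simplified illustration of that idea, not MyBatis's real parser: it swaps every #{...} for a JDBC ? and records the property paths in order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedParamSketch {
    // Matches #{property.path} and captures the path
    private static final Pattern PARAM = Pattern.compile("#\\{([^}]+)\\}");

    /** Returns the SQL with every #{...} token replaced by a JDBC '?' marker. */
    static String toJdbcSql(String sql) {
        return PARAM.matcher(sql).replaceAll("?");
    }

    /** Returns the property paths in the order their values would be bound. */
    static List<String> paramNames(String sql) {
        List<String> names = new ArrayList<>();
        Matcher m = PARAM.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // The statement from MybatisStoreDAO.save
        String sql = "insert into blog (title,content,update_time) "
                + "values (#{blog.title},#{blog.content},now())";
        System.out.println(toJdbcSql(sql));
        System.out.println(paramNames(sql)); // prints "[blog.title, blog.content]"
    }
}
```

Binding values as parameters this way is also what keeps scraped titles and content containing quotes from breaking the statement.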

package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Stores the parsed data directly in the database, implemented via the MyBatis integration
 *
 * @author 汪浩淼 [et.tw@163.com]
 * @since 2016/07/27.
 */
@Crawler(name = "mybatis")
public class DatabaseMybatisDemo extends BaseSeimiCrawler {
    @Autowired
    private MybatisStoreDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void renderBean(Response response) {
        try {
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            // Store to the DB through the MyBatis DAO
            int changeNum = storeToDbDAO.save(blog);
            int blogId = blog.getId();
            logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
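
renderBean reads blog.getId() right after save(). That works because the DAO declares useGeneratedKeys = true with keyProperty = "blog.id", which asks MyBatis to write the auto-increment key back into the object that was passed in. The following in-memory stand-in mimics that contract so the before/after state is easy to see; it does not use MyBatis or a database, and every class in it is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class GeneratedKeySketch {

    /** Minimal stand-in for the article's BlogContent model. */
    static class Blog {
        Integer id;
        final String title;

        Blog(String title) {
            this.title = title;
        }
    }

    /** Hypothetical in-memory DAO that imitates an AUTO_INCREMENT column. */
    static class InMemoryBlogDao {
        // Arbitrary starting point for the fake sequence
        private final AtomicInteger sequence = new AtomicInteger(256);

        int save(Blog blog) {
            // Write the generated key back into the argument, like keyProperty does
            blog.id = sequence.incrementAndGet();
            return 1; // number of affected rows
        }
    }

    public static void main(String[] args) {
        InMemoryBlogDao dao = new InMemoryBlogDao();
        Blog blog = new Blog("demo");
        System.out.println("before save: id=" + blog.id); // prints "before save: id=null"
        int changeNum = dao.save(blog);
        System.out.println("after save: id=" + blog.id + ", changeNum=" + changeNum);
    }
}
```

This is also why a bean logged before save() shows a null id while the value is available afterwards.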

Next, a quick start-up:

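import cn.wanghaomiao.seimi.core.Seimi;

public class Boot {
    public static void main(String[] args){
        // Start the crawler registered as @Crawler(name = "mybatis")
        Seimi s = new Seimi();
        s.start("mybatis");
    }
}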

You should see log output like the following:

00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定義Bahavior特效及其源碼分析CoordinatorLayout自定義Bahavior特效及其源碼分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中能夠算是最重要的一個東西,運用它能夠作出一些不錯的特效...],url=http://www.cnblogs.com/soaringEveryday/p/5711545.html
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1

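Integration complete!

Afterword

For packaging, deploying, and starting a production project, the maven-seimicrawler-plugin packaging plugin is recommended. For details, see maven-seimicrawler-plugin or "Seimi Basics Series 1: Using the SeimiCrawler Packaging and Deployment Tool".

Complete demo project

Full demo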
