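Quite a few people have recently been asking about integrating SeimiCrawler with MyBatis, so this post is offered as a starting point. If you are not yet familiar with SeimiCrawler, it also serves as a brief introduction.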
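SeimiCrawler is an agile, independently deployable Java crawler framework with support for distributed operation. It aims to lower, as far as possible, the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to write the business logic of the crawl; Seimi takes care of the rest. In its design, SeimiCrawler is inspired by Scrapy, the Python crawler framework, while also drawing on the characteristics of the Java language and the features of Spring. It also hopes to make XPath, a more efficient way of parsing HTML, easier to use and more widespread in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not part of jsoup), and HTML data extraction is done with XPath by default (you can of course choose another parser for data processing). Combined with SeimiAgent, it also handles the rendering and crawling of complex dynamic pages.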
The project is hosted on GitHub.
Now for the MyBatis integration itself. MySQL is used as the example database.
<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>SeimiCrawler</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-dbcp2</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-pool2</artifactId>
    <version>2.4.2</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.37</version>
</dependency>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis-spring</artifactId>
    <version>1.3.0</version>
</dependency>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis</artifactId>
    <version>3.4.1</version>
</dependency>
Assume a database named xiaohuo already exists and contains the following table:
CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * For XPath syntax, see http://jsoupxpath.wanghaomiao.cn/
 * @since 2015/10/27.
 */
public class BlogContent {
    private Integer id;

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    // Could also be written as @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        // Truncate only the printed copy to keep logs readable. Assigning the
        // truncated text back to the field would also shorten what is later
        // saved to the database, because the crawler logs the bean before saving it.
        String preview = content;
        if (StringUtils.isNotBlank(content) && content.length() > 100) {
            preview = StringUtils.substring(content, 0, 100) + "...";
        }
        return new ToStringBuilder(this)
                .append("id", id)
                .append("title", title)
                .append("content", preview)
                .toString();
    }
}
Add a mybatis-config.xml file with some basic global settings:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>
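To make the effect of the mapUnderscoreToCamelCase setting concrete, here is a minimal sketch of the name conversion it implies. It only illustrates the idea and is not MyBatis's own implementation; the class name UnderscoreToCamelCase is made up for this example. The DAO in this post only inserts rows, so the setting would start to matter once query results are mapped back onto a bean, for instance if BlogContent gained a (hypothetical) updateTime property for the update_time column.

```java
// Illustrative only: mimics the effect of mapUnderscoreToCamelCase,
// it is not MyBatis's actual implementation.
final class UnderscoreToCamelCase {

    /** Converts a column name such as "update_time" to "updateTime". */
    static String toCamelCase(String column) {
        StringBuilder sb = new StringBuilder(column.length());
        boolean upperNext = false;
        for (char c : column.toCharArray()) {
            if (c == '_') {
                // Drop the underscore and capitalize the next character.
                upperNext = true;
            } else if (upperNext) {
                sb.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("update_time")); // prints updateTime
        System.out.println(toCamelCase("title"));       // prints title
    }
}
```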
Add a seimi-mybatis.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context
                           http://www.springframework.org/schema/context/spring-context.xsd">
    <context:annotation-config />

    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${database.driverClassName}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>

    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>

    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>
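Placeholders such as ${database.driverClassName} appear in this configuration because the SeimiCrawler demo project also has dynamic-configuration settings. You can just as well hard-code the values here and skip reading them from another configuration.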
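As a rough illustration of what resolving such placeholders involves, the sketch below substitutes ${...} keys from a java.util.Properties object. It is not how Spring or the SeimiCrawler demo does it; the class name PlaceholderSketch and the sample JDBC URL are assumptions made for this example.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hand-rolled ${key} substitution, not Spring's resolver.
final class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${key} in text with the matching property value. */
    static String resolve(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in values from being misread.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Driver class shipped with mysql-connector-java 5.1.x.
        props.setProperty("database.driverClassName", "com.mysql.jdbc.Driver");
        // Hypothetical URL; host and port are assumptions for this sketch.
        props.setProperty("database.url", "jdbc:mysql://127.0.0.1:3306/xiaohuo");

        System.out.println(resolve("${database.driverClassName}", props));
        System.out.println(resolve("${database.url}", props));
    }
}
```

Hard-coding simply means writing the resolved values, such as com.mysql.jdbc.Driver, directly into seimi-mybatis.xml in place of the placeholders.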
Add the DAO under the cn.wanghaomiao.dao.mybatis package:
package cn.wanghaomiao.dao.mybatis;

import cn.wanghaomiao.model.BlogContent;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;

/**
 * @since 2016/7/27.
 */
public interface MybatisStoreDAO {

    // Inserts one row and writes the auto-generated primary key back to blog.id.
    @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
    @Options(useGeneratedKeys = true, keyProperty = "blog.id")
    int save(@Param("blog") BlogContent blog);
}
At this point the MyBatis side is ready.
package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Stores the parsed data directly in the database, using the MyBatis integration.
 *
 * @author 汪浩淼 [et.tw@163.com]
 * @since 2016/07/27.
 */
@Crawler(name = "mybatis")
public class DatabaseMybatisDemo extends BaseSeimiCrawler {
    @Autowired
    private MybatisStoreDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            // Collect the post links from the front page.
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                // Queue each post; its response is handled by renderBean.
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void renderBean(Response response) {
        try {
            // Fill a BlogContent bean from the page via its @Xpath annotations.
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            // Store to the database through the MyBatis DAO.
            int changeNum = storeToDbDAO.save(blog);
            int blogId = blog.getId();
            logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
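The crawler names its callback with a plain string ("renderBean") when it builds a request. One way such a name could be turned into a method call is reflection, sketched below with the standard library only. This is an assumption about the general technique, not a description of SeimiCrawler's internals; CallbackDispatchSketch and DemoCrawler are hypothetical names, and the callback takes a String instead of a Response to keep the example self-contained.

```java
import java.lang.reflect.Method;

// Illustrative only: name-based callback dispatch via reflection.
// Not SeimiCrawler's implementation.
final class CallbackDispatchSketch {

    /** Hypothetical stand-in for a crawler class with one callback method. */
    public static class DemoCrawler {
        private String lastUrl;

        public void renderBean(String url) {
            lastUrl = url;
        }

        public String getLastUrl() {
            return lastUrl;
        }
    }

    /** Finds a public method taking one String by name and invokes it. */
    static void dispatch(Object crawler, String callbackName, String url) {
        try {
            Method callback = crawler.getClass().getMethod(callbackName, String.class);
            callback.invoke(crawler, url);
        } catch (ReflectiveOperationException e) {
            // Covers a misspelled callback name or a failure inside the callback.
            throw new IllegalStateException("Cannot invoke callback: " + callbackName, e);
        }
    }

    public static void main(String[] args) {
        DemoCrawler crawler = new DemoCrawler();
        dispatch(crawler, "renderBean", "http://www.cnblogs.com/");
        System.out.println(crawler.getLastUrl());
    }
}
```

A practical consequence of naming callbacks by string is that a typo in the name is not caught by the compiler, so it is worth checking that the string matches the method name exactly.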
Next, start it up:
public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        s.start("mybatis");
    }
}
You should see log output like the following:
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定義Bahavior特效及其源碼分析CoordinatorLayout自定義Bahavior特效及其源碼分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中能夠算是最重要的一個東西,運用它能夠作出一些不錯的特效...],url=http://www.cnblogs.com/soaringEveryday/p/5711545.html
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
00:25:18 INFO c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1
Integration complete!
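For packaging, deploying, and starting the project in a production environment, the maven-seimicrawler-plugin packaging plugin is recommended. For details, see maven-seimicrawler-plugin or "Seimi Basics Series 1: Using the SeimiCrawler Packaging and Deployment Tool".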