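Background: my girlfriend is about to graduate (yes, I do have a girlfriend!!!) and is writing her thesis on sex education for children. She could not find any data on children's sex-education picture books anywhere, so the fallback was to pull it from Dangdang (dangdang.com). But how to get the data out? My first thought was Python, but I don't know Python!! After some searching on Baidu I decided to write the crawler in Java after all, since that is what I know best. Enough talk, let's get to work!
Implementation:
First set up the skeleton: create a Maven project that uses Spring Boot and MyBatis, with IDEA as the IDE. The pom.xml is as follows: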
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>cn.com.boco</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>2.0.1</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.oracle</groupId>
            <artifactId>ojdbc6</artifactId>
            <version>11.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.5</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.45</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
The directory structure is as follows:
The database is a local Oracle instance. The configuration files are as follows.
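Note: in application.yml, the setting

spring:
  profiles:
    active: dev

selects the application-dev.yml file, so that is the configuration actually in use. In real projects you can keep several configuration files this way, one per environment, and at release time just switch the active property instead of editing the configuration itself.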
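To make the naming rule concrete: Spring Boot looks for profile-specific files named application-{profile}.yml, with a hyphen. The tiny sketch below only illustrates that mapping; the class name ProfileConfigName and the profiles "test" and "prod" are made up for the example and are not part of the project.

```java
final class ProfileConfigName {

    /** Builds the profile-specific file name, e.g. "dev" -> "application-dev.yml". */
    static String forProfile(String profile) {
        return "application-" + profile + ".yml";
    }

    public static void main(String[] args) {
        // "test" and "prod" are hypothetical extra environments.
        for (String profile : new String[] {"dev", "test", "prod"}) {
            System.out.println(profile + " -> " + forProfile(profile));
        }
    }
}
```

Switching spring.profiles.active from dev to, say, prod would then pick up application-prod.yml without touching any other file.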
The application-dev.yml configuration file:
server:
  port: 8084

spring:
  datasource:
    username: system
    password: 123456
    url: jdbc:oracle:thin:@localhost
    driver-class-name: oracle.jdbc.driver.OracleDriver

mybatis:
  mapper-locations: classpath*:mapping/*.xml
  type-aliases-package: cn.com.boco.demo.entity

# show SQL: the logger name must match the package that holds the mapper interfaces
logging:
  level:
    cn.com.boco.demo.mapper: debug
The application.yml file:
spring:
  profiles:
    active: dev
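The startup class is below. The @MapperScan annotation scans the DAO-layer interfaces:
import org.mybatis.spring.annotation.MapperScan;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@MapperScan("cn.com.boco.demo.mapper")
@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}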
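The DAO-layer interface:
import org.springframework.stereotype.Repository;

import java.util.List;

@Repository
public interface BookMapper {

    // Inserts every book in the list with a single statement (see the mapper XML).
    void insertBatch(List<DangBook> list);
}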
The mapper XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="cn.com.boco.demo.mapper.BookMapper">
    <insert id="insertBatch" parameterType="java.util.List">
        INSERT ALL
        <foreach collection="list" item="item" index="index" separator=" ">
            into dangdang_message
            (title,img,author,publish,detail,price,parentUrl,inputTime)
            values
            (#{item.title,jdbcType=VARCHAR},
            #{item.img,jdbcType=VARCHAR},
            #{item.author,jdbcType=VARCHAR},
            #{item.publish,jdbcType=VARCHAR},
            #{item.detail,jdbcType=VARCHAR},
            #{item.price,jdbcType=DOUBLE},
            #{item.parentUrl,jdbcType=VARCHAR},
            #{item.inputTime,jdbcType=DATE})
        </foreach>
        select 1 from dual
    </insert>
</mapper>
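The INSERT ALL statement above sends a whole list in one go. That is fine for one search page at a time, but if the lists ever grew much larger it might be worth splitting them before calling insertBatch. The sketch below shows one way to do that with plain JDK classes; BatchSplitter and the batch size of 3 are assumptions for illustration, not something the original project contains.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSplitter {

    /** Splits items into consecutive sublists holding at most batchSize elements each. */
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            rows.add(i);
        }
        // Each printed list would correspond to one insertBatch call.
        for (List<Integer> batch : split(rows, 3)) {
            System.out.println(batch);
        }
    }
}
```

In the service layer this would amount to looping over split(list, n) and calling bookMapper.insertBatch once per chunk.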
The two entity classes:
import java.util.Date;

public class BaseModel {

    private int id;
    private Date inputTime;

    public Date getInputTime() { return inputTime; }
    public void setInputTime(Date inputTime) { this.inputTime = inputTime; }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}

import org.apache.ibatis.type.Alias;

@Alias("dangBook")
public class DangBook extends BaseModel {

    // Title
    private String title;
    // Image URL
    private String img;
    // Author
    private String author;
    // Publisher
    private String publish;
    // Detailed description
    private String detail;
    // Price
    private float price;
    // Parent link, i.e. the URL that was requested
    private String parentUrl;

    public String getParentUrl() { return parentUrl; }
    public void setParentUrl(String parentUrl) { this.parentUrl = parentUrl; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public String getPublish() { return publish; }
    public void setPublish(String publish) { this.publish = publish; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getImg() { return img; }
    public void setImg(String img) { this.img = img; }

    public String getDetail() { return detail; }
    public void setDetail(String detail) { this.detail = detail; }

    public float getPrice() { return price; }
    public void setPrice(float price) { this.price = price; }
}
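The service layer:
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class BookService {

    @Autowired
    private BookMapper bookMapper;

    // Delegates straight to the mapper's batch insert.
    public void insertBatch(List<DangBook> list) {
        bookMapper.insertBatch(list);
    }
}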
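The controller-layer code:
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/book")
public class DangdangBookController {

    @Autowired
    private BookService bookService;

    private static Logger logger = LoggerFactory.getLogger(DangdangBookController.class);

    // URL after decoding
    private static final String URL = "http://search.dangdang.com/?key=性教育繪本&act=input&att=1000006:226&page_index=";

    // URL before decoding (percent-encoded)
    private static final String URL2 = "http://search.dangdang.com/?key=%D0%D4%BD%CC%D3%FD%BB%E6%B1%BE&act=input&att=1000006%3A226&page_index=";

    @RequestMapping("/parse")
    public JSONObject parse() {
        JSONObject jsonObject = new JSONObject();
        // Walk through result pages 1 to 10; the returned JSON reflects the last page processed.
        for (int i = 1; i <= 10; i++) {
            List<DangBook> dangBooks = ParseUtils.dingParse(URL + i);
            if (dangBooks != null && dangBooks.size() > 0) {
                logger.info("Parsing finished, about to insert");
                bookService.insertBatch(dangBooks);
                logger.info("Insert finished, rows inserted: " + dangBooks.size());
                jsonObject.put("code", 1);
                jsonObject.put("result", "success");
            } else {
                jsonObject.put("code", 0);
                jsonObject.put("result", "fail");
            }
        }
        return jsonObject;
    }
}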
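Originally the front end was supposed to pass in the URL to parse, but the query parameters got lost on the way, and URL-encoding them did not help either, so in the end I hard-coded the URL in the backend.
The ParseUtils and HttpGetUtils utility classes: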
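One thing worth noting about the two URL constants: the percent-encoded bytes in URL2 are what you get when the keyword is encoded as GBK rather than UTF-8. As a sketch under that assumption (the class SearchUrlBuilder and its method names are made up here, and whether this also fixes the front-end problem is untested), the page URL can be built in the backend like this:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SearchUrlBuilder {

    private static final String BASE = "http://search.dangdang.com/";

    /** Percent-encodes a query value in the given charset, without checked exceptions. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charset, e);
        }
    }

    /** Builds the search URL for one result page, encoding query values as GBK. */
    static String pageUrl(String keyword, int pageIndex) {
        return BASE + "?key=" + encode(keyword, "GBK")
                + "&act=input"
                + "&att=" + encode("1000006:226", "GBK")
                + "&page_index=" + pageIndex;
    }

    public static void main(String[] args) {
        // The search keyword, written with Unicode escapes so the source file encoding does not matter.
        String keyword = "\u6027\u6559\u80b2\u7ed8\u672c";
        System.out.println(pageUrl(keyword, 1));
    }
}
```

With GBK the result has the same percent-encoded form as URL2 followed by the page number. Encoding the same keyword as UTF-8 produces different bytes, which may be part of why the earlier attempts at URL encoding did not work.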
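import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class HttpGetUtils {

    private static Logger logger = LoggerFactory.getLogger(HttpGetUtils.class);

    public static String getUrlContent(String url) {
        if (url == null) {
            logger.info("URL is null");
            return null;
        }
        logger.info("URL: " + url);
        logger.info("Start fetching");
        String contentLine = null;
        // Recent httpclient releases deprecate new DefaultHttpClient()
        // (HttpClients.createDefault() is the replacement), but it still works.
        HttpClient httpClient = new DefaultHttpClient();
        HttpResponse httpResponse = getResp(httpClient, url);
        // getResp returns null when the request itself failed.
        if (httpResponse != null && httpResponse.getStatusLine().getStatusCode() == 200) {
            try {
                contentLine = EntityUtils.toString(httpResponse.getEntity(), "utf-8");
            } catch (IOException e) {
                logger.error("Failed to read the response body", e);
            }
        }
        logger.info("Fetching finished");
        return contentLine;
    }

    /**
     * Fetches the response object for the given URL, or null if the request fails.
     */
    public static HttpResponse getResp(HttpClient httpClient, String url) {
        logger.info("Start requesting the response object");
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = null;
        try {
            httpResponse = httpClient.execute(httpGet);
        } catch (IOException e) {
            logger.error("Request failed: " + url, e);
        }
        logger.info("Finished requesting the response object");
        return httpResponse;
    }
}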
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class ParseUtils {

    private static Logger logger = LoggerFactory.getLogger(ParseUtils.class);

    public static List<DangBook> dingParse(String url) {
        List<DangBook> list = new ArrayList<>();
        Date date = new Date();
        if (url == null) {
            logger.info("URL is null, stopping");
            return null;
        }
        logger.info("Start fetching data");
        String content = HttpGetUtils.getUrlContent(url);
        if (content != null) {
            logger.info("Got page content to parse");
        } else {
            logger.info("Page content is empty, stopping");
            return null;
        }
        Document document = Jsoup.parse(content);
        // Walk through the Dangdang book list; items carry the classes line1 .. line60
        for (int i = 1; i <= 60; i++) {
            Elements elements = document.select("ul[class=bigimg]").select("li[class=line" + i + "]");
            for (Element e : elements) {
                String title = e.select("p[class=name]").select("a").text();
                logger.info("Title: " + title);
                String img = e.select("a[class=pic]").select("img").attr("data-original");
                logger.info("Image URL: " + img);
                // Author and publisher share one block; take the first and last token
                String authorAndPublish = e.select("p[class=search_book_author]").select("span").select("a").text();
                String[] a = authorAndPublish.split(" ");
                String author = a[0];
                logger.info("Author: " + author);
                String publish = a[a.length - 1];
                logger.info("Publisher: " + publish);
                // String publish = e.select("p[class=name]").select("a").text();
                String detail = e.select("p[class=detail]").text();
                logger.info("Description: " + detail);
                String priceS = e.select("p[class=price]").select("span[class=search_now_price]").text();
                float price = 0.0f;
                // Skip the leading currency sign and keep every digit after it
                if (priceS != null && priceS.length() > 1) {
                    try {
                        price = Float.parseFloat(priceS.substring(1));
                    } catch (NumberFormatException ex) {
                        logger.warn("Could not parse price: " + priceS);
                    }
                }
                logger.info("Price: " + price);
                logger.info("-------------------------------------------------------------------------");
                DangBook dangBook = new DangBook();
                dangBook.setTitle(title);
                dangBook.setImg(img);
                dangBook.setAuthor(author);
                dangBook.setPublish(publish);
                dangBook.setDetail(detail);
                dangBook.setPrice(price);
                dangBook.setParentUrl(url);
                dangBook.setInputTime(date);
                list.add(dangBook);
            }
        }
        return list;
    }
}
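Price extraction is the most fragile step in the parser, because it depends on the exact text inside the price span. A slightly more forgiving alternative, sketched here with JDK classes only, is to pull the first decimal number out of the text with a regular expression. PriceParser is a hypothetical helper, and the sample input assumes the span text looks like a currency sign followed by the amount.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceParser {

    // One or more digits, optionally followed by a fractional part.
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Returns the first decimal number found in text, or BigDecimal.ZERO if there is none. */
    static BigDecimal parse(String text) {
        if (text == null) {
            return BigDecimal.ZERO;
        }
        Matcher m = NUMBER.matcher(text);
        return m.find() ? new BigDecimal(m.group()) : BigDecimal.ZERO;
    }

    public static void main(String[] args) {
        // \u00a5 is the yen/yuan sign; the sample price is made up.
        System.out.println(parse("\u00a525.80"));
        System.out.println(parse("no price here"));
    }
}
```

Using BigDecimal instead of float also keeps the cents exact, which matters if the prices are later averaged or summed for the thesis.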
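The resulting data in the table looks like this:
Note: pay attention to the column types when creating the table. Oracle's VARCHAR2(255) was not long enough for the titles in my data, so the insert failed at first and I had to change the column type. Also take care of the auto-increment ID and of filling in the insert time automatically. My database skills are on the weak side, so it took some searching on Baidu to get this right.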
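A likely reason the 255 limit bites sooner than expected: by default Oracle counts the n in VARCHAR2(n) in bytes, not characters, and Chinese characters take several bytes each. The sketch below checks whether a string fits, assuming the database character set is UTF-8 and the column uses byte semantics. The post states neither, so treat both as assumptions; ColumnFit is a made-up helper name.

```java
import java.nio.charset.StandardCharsets;

final class ColumnFit {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the UTF-8 encoding of s fits into a column limited to maxBytes bytes. */
    static boolean fits(String s, int maxBytes) {
        return utf8Bytes(s) <= maxBytes;
    }

    public static void main(String[] args) {
        // Build a 100-character title out of a single CJK character (U+4E66).
        StringBuilder title = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            title.append('\u4e66');
        }
        System.out.println("characters: " + title.length());
        System.out.println("UTF-8 bytes: " + utf8Bytes(title.toString()));
        System.out.println("fits in 255 bytes: " + fits(title.toString(), 255));
    }
}
```

Under these assumptions a 100-character Chinese title already needs 300 bytes, so a check like this before insertBatch would flag titles that a 255-byte column cannot hold.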