[Repost] A Java crawler: scraping data from Dangdang

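   Background: my girlfriend is about to graduate (that's right, I do have a girlfriend!!!) and is writing her thesis on children's sex education. She simply could not find any data on children's sex-education picture books, so the fallback was to look it up on Dangdang. But how to get the data out? Python came to mind first, but I don't know it!! After some searching on Baidu I decided to write the crawler in Java after all, since I'm more familiar with it. Enough talk, let's get to work!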

  Implementation: Java

  First set up the skeleton: create a Maven project that uses Spring Boot and MyBatis, with IntelliJ IDEA as the development tool. The pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>cn.com.boco</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>2.0.1</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.oracle</groupId>
            <artifactId>ojdbc6</artifactId>
            <version>11.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.5</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.45</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

The directory structure is as follows:

The database is a local Oracle instance; the configuration files are as follows.

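Note: the following in application.yml

spring:
  profiles:
    active: dev

selects application-dev.yml as the configuration file in use. (Spring Boot looks for profile-specific files named application-{profile}.yml, so the file name takes a hyphen, not an underscore.) In real projects you can keep one configuration file per environment this way and just switch the active property at release time instead of editing the configuration itself.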

The application-dev.yml configuration file:

server:
  port: 8084

spring:
  datasource:
    username: system
    password: 123456
    url: jdbc:oracle:thin:@localhost
    driver-class-name: oracle.jdbc.driver.OracleDriver

mybatis:
  mapper-locations: classpath*:mapping/*.xml
  type-aliases-package: cn.com.boco.demo.entity

# Show SQL: the logger name has to match the package of the mapper interfaces
logging:
  level:
    cn.com.boco.demo.mapper: debug

The application.yml file:

spring:
  profiles:
    active: dev

The application class is as follows; the @MapperScan annotation scans the DAO-layer interfaces:

@MapperScan("cn.com.boco.demo.mapper")
@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

}

The DAO-layer interface:

@Repository
public interface BookMapper {

    void insertBatch(List<DangBook> list);

}

The mapper XML file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

<mapper namespace="cn.com.boco.demo.mapper.BookMapper">

    <insert id="insertBatch" parameterType="java.util.List">
        INSERT ALL
        <foreach collection="list" item="item" index="index" separator=" ">
            into dangdang_message (title,img,author,publish,detail,price,parentUrl,inputTime)  values
            (#{item.title,jdbcType=VARCHAR},
            #{item.img,jdbcType=VARCHAR},
            #{item.author,jdbcType=VARCHAR},
            #{item.publish,jdbcType=VARCHAR},
            #{item.detail,jdbcType=VARCHAR},
            #{item.price,jdbcType=DOUBLE},
            #{item.parentUrl,jdbcType=VARCHAR},
            #{item.inputTime,jdbcType=DATE})
            
        </foreach>
        select 1 from dual
    </insert>

</mapper>
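
To make the shape of the statement produced by the foreach above more concrete, here is a small stand-alone sketch that builds the same kind of Oracle INSERT ALL text by hand, with the `?` placeholders that MyBatis substitutes for each `#{...}`. The class and method names (`InsertAllSketch`, `buildInsertAll`) are made up for illustration and are not part of the project; only the table name comes from the mapper.

```java
class InsertAllSketch {

    // Build an Oracle multi-row INSERT ALL statement with JDBC-style '?' placeholders.
    // Hypothetical helper for illustration only; MyBatis does this expansion itself.
    static String buildInsertAll(String table, String[] columns, int rows) {
        String cols = String.join(",", columns);
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            placeholders.append(i == 0 ? "?" : ",?");
        }
        StringBuilder sql = new StringBuilder("INSERT ALL");
        for (int r = 0; r < rows; r++) {
            // One "into ... values (...)" clause per row, separated by a single space
            sql.append(" into ").append(table)
               .append(" (").append(cols).append(") values (")
               .append(placeholders).append(")");
        }
        // INSERT ALL must end with a subquery; "select 1 from dual" returns exactly one row
        return sql.append(" select 1 from dual").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildInsertAll("dangdang_message", new String[]{"title", "price"}, 2));
    }
}
```

For two rows and two columns this prints `INSERT ALL into dangdang_message (title,price) values (?,?) into dangdang_message (title,price) values (?,?) select 1 from dual`. With zero rows the result would not be valid SQL, which is one reason to skip the insert when the list is empty.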

 

The two entity classes:

public class BaseModel {

    private int id;
    private Date inputTime;

    public Date getInputTime() {
        return inputTime;
    }

    public void setInputTime(Date inputTime) {
        this.inputTime = inputTime;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }
}
@Alias("dangBook")
public class DangBook extends BaseModel {

    // Title
    private String title;
    // Image URL
    private String img;
    // Author
    private String author;
    // Publisher
    private String publish;
    // Detailed description
    private String detail;
    // Price
    private float price;
    // Parent link, i.e. the URL that was requested
    private String parentUrl;

    public String getParentUrl() {
        return parentUrl;
    }

    public void setParentUrl(String parentUrl) {
        this.parentUrl = parentUrl;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getPublish() {
        return publish;
    }

    public void setPublish(String publish) {
        this.publish = publish;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getImg() {
        return img;
    }

    public void setImg(String img) {
        this.img = img;
    }

    public String getDetail() {
        return detail;
    }

    public void setDetail(String detail) {
        this.detail = detail;
    }

    public float getPrice() {
        return price;
    }

    public void setPrice(float price) {
        this.price = price;
    }

}

The service layer:

@Service
public class BookService {

    @Autowired
    private BookMapper bookMapper;

    public void insertBatch(List<DangBook> list){
        bookMapper.insertBatch(list);
    }

}

The controller-layer code:

@RestController
@RequestMapping("/book")
public class DangdangBookController {

    @Autowired
    private BookService bookService;

    // Log under the controller's own class rather than DemoApplication
    private static final Logger logger = LoggerFactory.getLogger(DangdangBookController.class);
    // URL after decoding
    private static final String URL = "http://search.dangdang.com/?key=性教育繪本&act=input&att=1000006:226&page_index=";
    // URL before decoding (percent-encoded)
    private static final String URL2 = "http://search.dangdang.com/?key=%D0%D4%BD%CC%D3%FD%BB%E6%B1%BE&act=input&att=1000006%3A226&page_index=";
    @RequestMapping("/parse")
    public JSONObject parse(){
        JSONObject jsonObject = new JSONObject();
        for(int i =1;i<=10;i++){
            List<DangBook> dangBooks = ParseUtils.dingParse(URL+i);
            if(dangBooks != null && dangBooks.size() >0){

                logger.info("Parsing finished, preparing to insert");
                bookService.insertBatch(dangBooks);
                logger.info("Insert finished, rows inserted: " + dangBooks.size());
                jsonObject.put("code",1);
                jsonObject.put("result","success");
            }else{
                jsonObject.put("code",0);
                jsonObject.put("result","fail");
            }

        }
        return jsonObject;
    }

}

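Originally the front end was supposed to pass in the URL to parse, but the query parameters got lost on the way, and URL-encoding it didn't help either, so in the end the URL was hard-coded on the back end.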
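
One plausible cause of the lost parameters (an assumption, since the original request is not shown) is that the target URL itself contains `&`: when it is appended unencoded to another URL, everything after the first `&` is read as separate parameters of the outer request. The sketch below percent-encodes the whole target URL so it travels as a single parameter value. The `url` parameter name and the class `NestedUrlParam` are hypothetical, and an ASCII keyword stands in for the real search term.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class NestedUrlParam {

    // Percent-encode a complete URL so it can be passed as ONE query parameter value.
    static String encodeAsParam(String targetUrl) {
        try {
            return URLEncoder.encode(targetUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encodeAsParam: recover the original URL from the encoded value.
    static String decodeParam(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String target = "http://search.dangdang.com/?key=book&act=input&page_index=1";
        String encoded = encodeAsParam(target);
        // Hypothetical endpoint that accepts the target URL as a "url" parameter
        System.out.println("http://localhost:8084/book/parse?url=" + encoded);
        System.out.println(decodeParam(encoded));
    }
}
```

The encoded value contains no raw `&`, `?` or `=`, so the outer request sees exactly one parameter. Whether this would have fixed the original problem depends on how the front end built the request, which the post does not show.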


The ParseUtils and HttpGetUtils utility classes:
public class HttpGetUtils {

    private static final Logger logger = LoggerFactory.getLogger(HttpGetUtils.class);

    public static String getUrlContent(String url) {
        if (url == null) {
            logger.info("URL is null");
            return null;
        }
        logger.info("URL: " + url);
        logger.info("Starting request");
        String contentLine = null;
        // new DefaultHttpClient() is deprecated in recent httpclient releases;
        // HttpClients.createDefault() replaces it, and try-with-resources closes the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpResponse httpResponse = getResp(httpClient, url);
            // getResp returns null when the request itself failed
            if (httpResponse != null && httpResponse.getStatusLine().getStatusCode() == 200) {
                contentLine = EntityUtils.toString(httpResponse.getEntity(), "utf-8");
            }
        } catch (IOException e) {
            logger.error("Failed to read the response from " + url, e);
        }
        logger.info("Request finished");
        return contentLine;
    }


    /**
     * Fetch the response object for a URL; returns null if the request fails
     */
    public static HttpResponse getResp(HttpClient httpClient, String url) {
        logger.info("Fetching the response object");
        HttpGet httpGet = new HttpGet(url);
        // Start from null instead of a fabricated "200 OK" response,
        // so a failed request is never mistaken for a successful one
        HttpResponse httpResponse = null;
        try {
            httpResponse = httpClient.execute(httpGet);
        } catch (IOException e) {
            logger.error("Request failed: " + url, e);
        }
        logger.info("Finished fetching the response object");
        return httpResponse;
    }

}
public class ParseUtils {

    private static Logger logger = LoggerFactory.getLogger(ParseUtils.class);

    public static List<DangBook> dingParse(String url) {
        List<DangBook> list = new ArrayList<>();
        Date date = new Date();
        if (url == null) {
            logger.info("URL is null, aborting");
            return null;
        }

        logger.info("Fetching data");
        String content = HttpGetUtils.getUrlContent(url);
        if (content != null)
            logger.info("Got the page content");
        else {
            logger.info("Page content is empty, aborting");
            return null;
        }

        Document document = Jsoup.parse(content);
        // Iterate over the Dangdang book list
        for(int i =1;i<=60;i++){
            Elements elements = document.select("ul[class=bigimg]").select("li[class=line"+i+"]");
            for (Element e : elements) {
                String title = e.select("p[class=name]").select("a").text();
                logger.info("Title: " + title);
                String img = e.select("a[class=pic]").select("img").attr("data-original");
                logger.info("Image URL: " + img);
                String authorAndPublish = e.select("p[class=search_book_author]").select("span").select("a").text();
                String []a = authorAndPublish.split(" ");
                String author = a[0];
                logger.info("Author: " + author);
                String publish = a[a.length - 1];
                logger.info("Publisher: " + publish);
//            String publish =e.select("p[class=name]").select("a").text();
                String detail = e.select("p[class=detail]").text();
                logger.info("Description: " + detail);
                String priceS = e.select("p[class=price]").select("span[class=search_now_price]").text();
                float price = 0.0f;
                // Check for null before calling length(); substring(1) drops only the
                // leading currency sign and keeps the last digit of the price
                if(priceS != null && priceS.length()>1){
                    price = Float.parseFloat(priceS.substring(1));
                }
                logger.info("Price: " + price);
                logger.info("-------------------------------------------------------------------------");
                DangBook dangBook = new DangBook();
                dangBook.setTitle(title);
                dangBook.setImg(img);
                dangBook.setAuthor(author);
                dangBook.setPublish(publish);
                dangBook.setDetail(detail);
                dangBook.setPrice(price);
                dangBook.setParentUrl(url);
                dangBook.setInputTime(date);
                list.add(dangBook);
            }
        }
        return list;
    }

}

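
The price handling in ParseUtils depends on the currency sign being exactly one leading character, and it stores the amount in a float. As a more defensive alternative, here is a minimal sketch that pulls the first decimal number out of the text with a regular expression and keeps it as a BigDecimal. It assumes the price text looks like a currency sign followed by digits; the class `PriceParser` and its method are hypothetical and not part of the original project.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceParser {

    // First run of ASCII digits with an optional decimal part, e.g. "25.85"
    private static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");

    // Returns the first number found in the text, or BigDecimal.ZERO if there is none.
    static BigDecimal parsePrice(String text) {
        if (text == null) {
            return BigDecimal.ZERO;
        }
        Matcher m = NUMBER.matcher(text);
        return m.find() ? new BigDecimal(m.group()) : BigDecimal.ZERO;
    }

    public static void main(String[] args) {
        // "\u00A5" is the yen/yuan sign, written as an escape to avoid source-encoding issues
        System.out.println(parsePrice("\u00A525.85"));
        System.out.println(parsePrice(""));
    }
}
```

BigDecimal avoids the rounding surprises of float for money, but using it in this project would also mean changing the type of the `price` field in DangBook, so treat it as an option rather than a drop-in replacement.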

The resulting data in the table looks like this:

 

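Note: pay attention to the column types when creating the table. A 255-character varchar column in Oracle was not long enough for the titles in this data set, so the insert failed at first and only worked after I changed the column type. Also take care of the auto-incrementing ID and of filling in the insert time automatically. My database skills are weak, so it took some searching on Baidu to get this right.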
