Day 18: BoilerPipe -- Article Extraction for Java Developers

Date: 2019-12-11

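Editor's note: We came across an interesting article series, "Learning 30 Technologies in 30 Days", and are translating it, publishing one installment a day as a year-end gift. Below is Day 18.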

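Today I decided to learn how to extract text and images from web links using Java. This is a very common requirement on content discovery sites such as Prismatic. Today's goal is to learn how to do it with a Java library called boilerpipe.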

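Prerequisites

Basic Java knowledge is required. Install the latest Java Development Kit (JDK); either OpenJDK 7 or Oracle JDK 7 will do.
Sign up for an OpenShift account. It is completely free, and each user is allotted 1.5 GB of memory and 3 GB of disk space.
Install the RHC client tools, which require Ruby 1.8.7 or newer. If you already have RubyGems, run sudo gem install rhc and make sure it is the latest version. To update RHC, run sudo gem update rhc. For further help installing the RHC command-line tool, see: https://www.openshift.com/developers/rhc-client-tools-install
Set up your OpenShift account with the rhc setup command. It helps you create a namespace and upload your SSH keys to the OpenShift server.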

Step 1: Create a JBoss EAP application

Start by creating the sample application, which we will call newsapp.

$ rhc create-app newsapp jbosseap

If you have access to medium gears, you can use the following command instead:

$ rhc create-app newsapp jbosseap -g medium

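This creates an application container and sets up all the required SELinux policies and cgroup configuration. OpenShift also creates a private git repository and clones it to your local machine. Finally, OpenShift sets up a public DNS entry, so the application will be accessible at http://newsapp-{domain-name}.rhcloud.com/ (replace domain-name with your own domain name).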

Step 2: Add Maven dependencies

Add the following dependencies to the pom.xml file:

<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.9.1</version>
</dependency>

<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.13</version>
</dependency>

You also need to add a new repository:

<repository>
    <id>boilerpipe-m2-repo</id>
    <url>http://boilerpipe.googlecode.com/svn/repo/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>

Update the Maven project to Java 7 by changing these properties in pom.xml:

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

Now update the Maven project (Right-click > Maven > Update Project).

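Step 3: Enable CDI

We will use CDI for dependency injection. CDI, or Contexts and Dependency Injection, is a Java EE 6 specification that enables dependency injection in Java EE 6 projects.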
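
To make the idea concrete, here is a minimal plain-Java sketch of what dependency injection means, with the wiring done by hand. All names in it (ExtractionService, FixedTitleService, TitleResource, ManualInjectionDemo) are hypothetical and exist only for illustration. With CDI enabled, the container is expected to do this wiring for fields marked @Inject, so no code like the main method below is needed in the application.

```java
// Plain Java only, no CDI: every name here is hypothetical.

// The dependency, expressed as an interface.
interface ExtractionService {
    String titleOf(String url);
}

// A trivial implementation that returns a fixed-format title.
class FixedTitleService implements ExtractionService {
    public String titleOf(String url) {
        return "Title for " + url;
    }
}

// The dependent class receives its service from outside
// instead of constructing it itself.
class TitleResource {
    private final ExtractionService service;

    TitleResource(ExtractionService service) {
        this.service = service;
    }

    String title(String url) {
        return service.titleOf(url);
    }
}

public class ManualInjectionDemo {
    public static void main(String[] args) {
        // Hand-wiring: this is the step a CDI container would perform.
        TitleResource resource = new TitleResource(new FixedTitleService());
        System.out.println(resource.title("http://example.com"));
    }
}
```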

Create a new XML file named beans.xml in the src/main/webapp/WEB-INF folder, and replace its contents with the following:

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">

</beans>

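Step 4: Create the Boilerpipe content extraction service

Now create a service class for Boilerpipe content extraction. The class takes a URL and extracts the title and the article content from it.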

import java.net.URL;
import java.util.Collections;
import java.util.List;

import com.newsapp.boilerpipe.image.Image;
import com.newsapp.boilerpipe.image.ImageExtractor;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerpipeContentExtractionService {

    public Content content(String url) {
        try {
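            // Fetch the page and parse it into boilerpipe's text model.
            final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
            final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
            String title = doc.getTitle();

            // Extract the main article text, dropping navigation and other boilerplate.
            String content = ArticleExtractor.INSTANCE.getText(doc);

            final BoilerpipeExtractor extractor = CommonExtractors.KEEP_EVERYTHING_EXTRACTOR;
            final ImageExtractor ie = ImageExtractor.INSTANCE;

            // Collect the page's images; after sorting, the first one is used as the lead image.
            List<Image> images = ie.process(new URL(url), extractor);

            Collections.sort(images);
            String image = null;
            if (!images.isEmpty()) {
                image = images.get(0).getSrc();
            }

            // Keep at most the first 200 characters; substring(0, 200) alone throws on shorter text.
            return new Content(title, content.substring(0, Math.min(content.length(), 200)), image);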
        } catch (Exception e) {
            return null;
        }

    }
}

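The code above does the following:

First, it fetches the document at the given URL.
Then it parses the HTML document and returns a TextDocument.
Next, it extracts the title from the text document.
Finally, it extracts the content from the text and returns a new instance of the application's value object.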
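
The Content value object itself is not shown in this article. A minimal sketch, assuming it simply holds the three strings passed to its constructor in the service above, might look like the following. The field and getter names are assumptions; only the class name and the three-argument constructor come from the code above. Judging by the imports used later in the resource class, it would live in the com.newsapp.service package.

```java
// Sketch of the value object returned by the extraction service.
// Field and getter names are assumptions, not taken from the article.
// In the application this class would be declared in com.newsapp.service.
public class Content {

    private final String title;
    private final String description;
    private final String image;

    public Content(String title, String description, String image) {
        this.title = title;
        this.description = description;
        this.image = image;
    }

    // Public getters expose the values as bean properties,
    // which JSON providers typically rely on for serialization.
    public String getTitle() {
        return title;
    }

    public String getDescription() {
        return description;
    }

    // May be null when no image was found on the page.
    public String getImage() {
        return image;
    }
}
```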

Step 5: Enable JAX-RS

To enable JAX-RS, create a class that extends javax.ws.rs.core.Application and specify the application path with the javax.ws.rs.ApplicationPath annotation, as shown below.

import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;

@ApplicationPath("/api/v1")
public class JaxrsInitializer extends Application{


}

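Step 6: Create ContentExtractionResource

Next comes the ContentExtractionResource class, which returns a Content object as JSON. Create a new class named ContentExtractionResource and replace its contents with the following: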

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

import com.newsapp.service.BoilerpipeContentExtractionService;
import com.newsapp.service.Content;

@Path("/content")
public class ContentExtractionResource {

    @Inject
    private BoilerpipeContentExtractionService boilerpipeContentExtractionService;

    @GET
    @Produces(value = MediaType.APPLICATION_JSON)
    public Content extractContent(@QueryParam("url") String url) {
        return boilerpipeContentExtractionService.content(url);
    }
}
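
Given the @ApplicationPath("/api/v1") and @Path("/content") annotations above, the resource should be reachable at /api/v1/content, with the article address passed in the url query parameter. Because that address is itself a URL, it needs to be percent-encoded before it is placed in the query string. The sketch below builds such a request URL using only the JDK. The class name ContentRequestUrl and the host newsapp-example.rhcloud.com are hypothetical placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the request URL for the content endpoint.
public class ContentRequestUrl {

    // baseUrl is the application root without a trailing slash.
    static String build(String baseUrl, String articleUrl) {
        try {
            // Encode the article address so its ':', '/', '?' and '='
            // characters survive inside the query string.
            String encoded = URLEncoder.encode(articleUrl, "UTF-8");
            return baseUrl + "/api/v1/content?url=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The host name is a placeholder; use your own application's domain.
        System.out.println(build("http://newsapp-example.rhcloud.com",
                "http://example.com/story?id=1"));
        // prints http://newsapp-example.rhcloud.com/api/v1/content?url=http%3A%2F%2Fexample.com%2Fstory%3Fid%3D1
    }
}
```

The two-argument URLEncoder.encode overload is used here because it is available on Java 7, the version this tutorial targets.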

Deploy to OpenShift

Finally, deploy the changes to OpenShift:

$ git add .
$ git commit -am "NewApp"
$ git push

Once the code is pushed and deployed, we can see the running application at http://newsapp-{domain-name}.rhcloud.com. My sample application is shown below.

That's all for today. Feedback is welcome.

Original: Day 18: BoilerPipe--Article Extraction for Java Developers
Translated and edited by SegmentFault
