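Page Downloader (Personal Java Crawler, Part 1)

A Few Other Things First

Packaging with Maven

Officially predefined packaging

Packaging is done with the Maven Assembly Plugin. The plugin is configured inside the <build> element of pom.xml, in the following form: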

<build>
    [...]
    <plugins>
      <plugin>
        <!-- NOTE: We don't need a groupId specification because the group is
             org.apache.maven.plugins ...which is assumed by default.
         -->
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>

The <executions> element, which goes inside the same <plugin> element, binds the plugin goal to a phase of the Maven lifecycle:

<executions>
  <execution>
    <id>make-assembly</id> <!-- this is used for inheritance merges -->
    <phase>package</phase> <!-- bind to the packaging phase -->
    <goals>
      <goal>single</goal>
    </goals>
  </execution>
</executions>

Creating an executable JAR

<build>
   [...]
   <plugins>
     <plugin>
       <artifactId>maven-assembly-plugin</artifactId>
       <version>3.1.0</version>
       <configuration>
         [...]
         <archive>
           <manifest>
             <mainClass>org.sample.App</mainClass>
           </manifest>
         </archive>
       </configuration>
       [...]
     </plugin>
     [...]
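
The <mainClass> setting above ends up as a Main-Class entry in the JAR's META-INF/MANIFEST.MF, which is what java -jar reads to find the entry point. As a rough sketch of how to check that entry, the snippet below parses manifest text with the JDK's java.util.jar.Manifest. The class name ManifestMainClass, the method mainClassOf, and the sample manifest text are made up for illustration; they are not part of the plugin.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestMainClass {

    // Hypothetical helper: returns the Main-Class attribute found in the
    // given MANIFEST.MF text, or null when the attribute is absent.
    public static String mainClassOf(String manifestText) {
        try {
            Manifest manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample manifest text, assumed to resemble what the <archive> configuration produces.
        String manifestText = "Manifest-Version: 1.0\r\n"
                + "Main-Class: org.sample.App\r\n"
                + "\r\n";
        System.out.println(mainClassOf(manifestText));
    }
}
```

For a real JAR, the same value could be read through java.util.jar.JarFile#getManifest() instead of from a string.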

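Custom packaging

As mentioned above, the officially predefined packaging only needs the <descriptorRefs></descriptorRefs> element. For custom packaging, use the <descriptors></descriptors> element instead and point it at your own descriptor file.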

<project>
  [...]
  <build>
    [...]
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.1.0</version>
        <configuration>
          <descriptors>
            <descriptor>src/assembly/src.xml</descriptor>
          </descriptors>
        </configuration>
        [...]
</project>

The format of src.xml is roughly as follows:

<assembly
        xmlns="http://maven.apache.org/ASSEMBLY/2.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <id>snapshot</id>
    <formats>
        <format>jar</format>
    </formats>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/lib</outputDirectory>
        </dependencySet>
    </dependencySets>
</assembly>

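<fileSets> lets you control what gets packaged at file or directory granularity. A common setup is a bin directory holding runnable scripts. How do you run a package built this way?
There are two ways:

  1. Put every dependency on the classpath with -cp and then run: java -cp lib/dependency1:lib/dependency2 fully.qualified.ClassName (the separator is : on Unix-like systems and ; on Windows)
  2. java -Djava.ext.dirs=lib fully.qualified.ClassName. The extension mechanism was removed in Java 9, so this approach only works on Java 8 and earlier.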
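
To make the first option concrete, here is a small sketch that assembles the java -cp command line from a list of JAR names. The class ClasspathCommand, its build method, and the JAR file names are assumptions for illustration only; the application's own JAR has to appear in the list as well.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClasspathCommand {

    // Hypothetical helper: builds "java -cp <entries> <mainClass>", joining
    // the classpath entries with the given separator (":" or ";").
    public static String build(String libDir, List<String> jarNames,
                               String mainClass, String separator) {
        List<String> entries = new ArrayList<>();
        for (String jar : jarNames) {
            entries.add(libDir + "/" + jar);
        }
        return "java -cp " + String.join(separator, entries) + " " + mainClass;
    }

    public static void main(String[] args) {
        // JAR names are assumed; they depend on what actually ends up in lib/.
        List<String> jars = Arrays.asList("app.jar", "httpclient-4.5.3.jar", "fluent-hc-4.5.3.jar");
        // File.pathSeparator picks the right separator for the current platform.
        System.out.println(build("lib", jars, "org.sample.App", File.pathSeparator));
    }
}
```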

Page Downloader

Preparation

Add the dependencies in Maven:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>fluent-hc</artifactId>
    <version>4.5.3</version>
</dependency>

Downloader, first version

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.Charset;

public void testGet1() {
  CloseableHttpClient clients = HttpClients.createDefault();
  RequestBuilder builder = RequestBuilder.get("http://www.qq.com");
  HttpGet httpGet = new HttpGet(builder.build().getURI());
  CloseableHttpResponse execute = null;
  try {
    execute = clients.execute(httpGet);
    HttpEntity entity = execute.getEntity();
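    // A custom charset-detection method could go here; without one,
    // EntityUtils.toString falls back to ISO-8859-1 when the response declares no charset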
    String page = EntityUtils.toString(entity);
    System.out.println(page);
  } catch (Exception e) {
    e.printStackTrace();
  } finally {
    if (execute != null) {
      try {
        execute.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}
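
The comment in the first version suggests writing your own charset-parsing method. A minimal sketch of one, using only the JDK, is shown below. It reads the charset parameter from a Content-Type header value and falls back to a caller-supplied default. The class CharsetSniffer and the method fromContentType are hypothetical names, and the regular expression is a simplification rather than a full header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Matches charset=xxx, optionally quoted, ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the fallback when none is declared or the name is not supported.
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

In the downloader, the header value could be taken from entity.getContentType() and the result passed to EntityUtils.toString(entity, charset). Pages that declare their encoding only in an HTML meta tag would need an extra pass over the body.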

Second version

Anonymous inner class version:

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;

public void testGet2() {
  CloseableHttpClient clients = HttpClients.createDefault();
  RequestBuilder builder = RequestBuilder.get("http://www.qq.com");
  HttpGet httpGet = new HttpGet(builder.build().getURI());
  try {
    // The handler turns the response into a String; the client releases the connection afterwards
    String page = clients.execute(httpGet, new ResponseHandler<String>() {
      @Override
      public String handleResponse(HttpResponse httpResponse) throws ClientProtocolException, IOException {
        HttpEntity entity = httpResponse.getEntity();
        String s = EntityUtils.toString(entity);
        return s;
      }
    });
    System.out.println(page);
  } catch (Exception e) {
    e.printStackTrace();
  }
}

The anonymous inner class can be replaced with a lambda expression:

String page = clients.execute(httpGet, (HttpResponse httpResponse) -> {
    HttpEntity entity = httpResponse.getEntity();
    String s = EntityUtils.toString(entity);
    return s;
  });
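
The swap works because ResponseHandler declares a single abstract method. The JDK-only sketch below shows the same equivalence without any HTTP dependency. Handler, execute, and the two upper... methods are hypothetical stand-ins for illustration, not HttpClient types.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class HandlerStyles {

    // Hypothetical single-method callback, shaped like ResponseHandler<T>.
    @FunctionalInterface
    public interface Handler<T> {
        T handle(String body) throws IOException;
    }

    // Hypothetical stand-in for client.execute(request, handler).
    public static <T> T execute(String body, Handler<T> handler) {
        try {
            return handler.handle(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Anonymous inner class form.
    public static String upperWithAnonymousClass(String body) {
        return execute(body, new Handler<String>() {
            @Override
            public String handle(String b) {
                return b.toUpperCase();
            }
        });
    }

    // Equivalent lambda form.
    public static String upperWithLambda(String body) {
        return execute(body, (String b) -> b.toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(upperWithAnonymousClass("page"));
        System.out.println(upperWithLambda("page"));
    }
}
```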

Third version

Using the API in the org.apache.http.client.fluent package:

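import org.apache.http.client.fluent.Request;
import org.apache.http.client.fluent.Response;

public void testGet3() throws IOException {
  Response response = Request.Get("http://www.qq.com").execute();
  // The page encoding is hard-coded as gb2312 here
  String s = response.returnContent().asString(Charset.forName("gb2312"));
  System.out.println(s);
}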