A First Look at Web Crawlers (1): robots.txt in crawler4j

Date: 2019-11-17

Tags: first look at crawlers, crawler4j, crawler, robots. Category: web crawlers


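I have only just started looking into web crawlers and, as a complete beginner, had no idea where to begin. A quick search showed that the main open-source Java crawlers are Nutch (apache/nutch on GitHub), Heritrix (internetarchive/heritrix3 on GitHub) and crawler4j (yasserg/crawler4j on GitHub), along with WebCollector (CrawlScript/WebCollector on GitHub) and WebMagic (code4craft/webmagic on GitHub).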

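Since I am new to crawlers, I decided to start with a small project, crawler4j. I cloned the repository with git and then found I did not know how to import it into Eclipse (I really am a beginner, so bear with me). After some digging I learned that it is a Maven project, so it can simply be imported as a Maven project. In the end I got the bundled example running.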

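While getting familiar with the code I came across the robots protocol. I looked it up and, somewhat to my surprise, it turned out to be a protocol for crawlers. robots.txt is a plain text file placed in the root directory of a website, so I tried fetching the robots.txt of Dianping (大衆點評) and found that the file really exists. It is only a code of conduct, though, because it cannot stop a "robber" from coming in. The file syntax is easy to look up online.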
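As a small illustration of "placed in the root directory", the sketch below derives the robots.txt address from an arbitrary page URL using only the JDK. The class name RobotsTxtLocation, the method name and the example.com URL are made up for this example and are not part of crawler4j.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of crawler4j.
class RobotsTxtLocation {

  // robots.txt sits at the root of the host, whatever page we start from.
  static String robotsUrlFor(String pageUrl) {
    // Resolving an absolute path keeps scheme, host and port
    // and replaces the path, query and fragment.
    return URI.create(pageUrl).resolve("/robots.txt").toString();
  }

  public static void main(String[] args) {
    // Placeholder URL, chosen only for the example.
    System.out.println(robotsUrlFor("https://www.example.com/shop/123?page=2"));
  }
}
```

Fetching the resulting address in a browser is enough to see whether a site publishes the file.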

crawler4j supports the robots.txt protocol as well. The relevant classes are the following:

1. RobotstxtConfig

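This class is very simple. It holds just three variables: whether the robots protocol is enabled, the user-agent name, and the cache size (the maximum number of robots.txt files that can be cached; once that number is exceeded, the least recently used one is replaced).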

2. HostDirectives

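This class holds the contents of one robots.txt. It mainly stores disallows and allows (both are instances of RuleSet, a class written by the crawler4j author and covered below), plus an expiry time. Once that time has passed, the corresponding robots.txt has to be fetched again.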
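The expiry idea can be sketched in a few lines. This is a stand-in under stated assumptions, not the library's class: the name ExpiringDirectives, the explicit "now" parameter and the one-day lifetime are all chosen for illustration only.

```java
// Hypothetical stand-in for the expiry part of HostDirectives.
class ExpiringDirectives {
  private final long fetchedAtMillis;
  private final long maxAgeMillis;

  ExpiringDirectives(long fetchedAtMillis, long maxAgeMillis) {
    this.fetchedAtMillis = fetchedAtMillis;
    this.maxAgeMillis = maxAgeMillis;
  }

  // True once the entry is older than the allowed age.
  boolean needsRefetch(long nowMillis) {
    return nowMillis - fetchedAtMillis > maxAgeMillis;
  }

  public static void main(String[] args) {
    long oneDay = 24L * 60 * 60 * 1000; // example lifetime, an assumption
    long now = System.currentTimeMillis();
    ExpiringDirectives fresh = new ExpiringDirectives(now, oneDay);
    System.out.println(fresh.needsRefetch(now));
  }
}
```

Passing the current time in as a parameter keeps the check easy to test without waiting for a real clock.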

3. RuleSet

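This class stores the actual robots rules. It extends TreeSet, which keeps its elements in natural order (here, ascending string comparison), and it makes sure that a prefix path covers every path that extends it (the author thought this through carefully); for example, a/b covers a/b/c. The check is a plain string prefix comparison, though, so a/b/c1 also covers a/b/c12, which looked like a small flaw to me. That said, robots.txt rules are themselves matched as path prefixes, so this behaviour is consistent with the protocol. The source: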

public boolean add(String str) {
  SortedSet<String> sub = headSet(str);
  if (!sub.isEmpty() && str.startsWith(sub.last())) {
    // no need to add; prefix is already present
    return false;
  }
  boolean retVal = super.add(str);
  sub = tailSet(str + "\0");
  while (!sub.isEmpty() && sub.first().startsWith(str)) {
    // remove redundant entries
    sub.remove(sub.first());
  }
  return retVal;
}
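To watch the covering behaviour in action, the method can be dropped into a standalone class. The class name PrefixRuleSet and the sample paths are made up for this experiment; only the body of add follows the source quoted above.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Standalone re-creation for experimenting; the class name is hypothetical.
class PrefixRuleSet extends TreeSet<String> {

  @Override
  public boolean add(String str) {
    // Entries sorted before str: only the last one can be a prefix of str.
    SortedSet<String> sub = headSet(str);
    if (!sub.isEmpty() && str.startsWith(sub.last())) {
      return false; // a shorter prefix already covers str
    }
    boolean added = super.add(str);
    // Entries sorted after str: drop those that str now covers.
    sub = tailSet(str + "\0");
    while (!sub.isEmpty() && sub.first().startsWith(str)) {
      sub.remove(sub.first());
    }
    return added;
  }

  public static void main(String[] args) {
    PrefixRuleSet rules = new PrefixRuleSet();
    rules.add("a/b/c");
    rules.add("a/b");     // covers a/b/c, which is removed
    rules.add("a/b/c12"); // rejected, a/b already covers it
    System.out.println(rules); // prints [a/b]

    PrefixRuleSet plain = new PrefixRuleSet();
    plain.add("a/b/c1");
    System.out.println(plain.add("a/b/c12")); // prints false
  }
}
```

The second half of main shows the plain string prefix effect: a/b/c12 is treated as covered by a/b/c1 even though they are different path segments.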

4. RobotstxtParser

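As the name suggests, this class parses a robots.txt into a HostDirectives. It has a single static method, parse, which handles the file line by line. For each user-agent named in the file, the disallow and allow lines that follow are added to the rules only if that user-agent includes our own. If you are interested in the details, read the source.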

One piece of the parser's code is not clear to me, though:

int commentIndex = line.indexOf('#');
if (commentIndex > -1) {
  line = line.substring(0, commentIndex);
}

// remove any html markup
line = line.replaceAll("<[^>]+>", "");

I hope someone can explain it to me.
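To make the steps in this section concrete, here is a minimal, self-contained sketch of line-by-line parsing. It is not crawler4j's code: the class name SimpleRobotsParser, its fields and the sample input are made up, and it simply merges the "*" group with any group whose user-agent contains ours, which is a simplification. The clean method reproduces the two statements quoted above. The first cuts the line at '#', which starts a comment in robots.txt; the second deletes anything shaped like an HTML tag. A plausible reading, which the source does not confirm, is that the second step guards against servers that return markup instead of plain text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified parser; not the crawler4j implementation.
class SimpleRobotsParser {
  final List<String> disallows = new ArrayList<>();
  final List<String> allows = new ArrayList<>();

  // The two cleaning steps from the snippet above, plus a trim added here.
  static String clean(String line) {
    int commentIndex = line.indexOf('#');
    if (commentIndex > -1) {
      line = line.substring(0, commentIndex); // drop the comment
    }
    line = line.replaceAll("<[^>]+>", ""); // drop anything shaped like a tag
    return line.trim();
  }

  static SimpleRobotsParser parse(String content, String myUserAgent) {
    SimpleRobotsParser result = new SimpleRobotsParser();
    String me = myUserAgent.toLowerCase(Locale.ROOT);
    boolean inMatchingGroup = false;
    boolean previousWasUserAgent = false;

    for (String rawLine : content.split("\\R")) {
      String line = clean(rawLine);
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue; // blank or unrecognised line
      }
      String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      String value = line.substring(colon + 1).trim();

      if (field.equals("user-agent")) {
        String ua = value.toLowerCase(Locale.ROOT);
        boolean matches = ua.equals("*") || ua.contains(me);
        // Consecutive User-agent lines share one group of rules.
        inMatchingGroup = previousWasUserAgent ? (inMatchingGroup || matches) : matches;
        previousWasUserAgent = true;
        continue;
      }
      previousWasUserAgent = false;
      if (!inMatchingGroup || value.isEmpty()) {
        continue; // rule for another crawler, or an empty rule
      }
      if (field.equals("disallow")) {
        result.disallows.add(value);
      } else if (field.equals("allow")) {
        result.allows.add(value);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String robots = "User-agent: *\n"
        + "Disallow: /private/ # keep out\n"
        + "\n"
        + "User-agent: otherbot\n"
        + "Disallow: /\n"
        + "Allow: /public/\n";
    SimpleRobotsParser rules = parse(robots, "mybot");
    System.out.println(rules.disallows + " " + rules.allows);
  }
}
```

With the sample input, a crawler named mybot only picks up the rule from the "*" group, while one named otherbot would also pick up the two rules from the second group.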

5. RobotstxtServer

This is the main class of the robots package. It exposes one public method:

public boolean allows(WebURL webURL) {
  if (config.isEnabled()) {
    try {
      URL url = new URL(webURL.getURL());
      String host = getHost(url);
      String path = url.getPath();

      HostDirectives directives = host2directivesCache.get(host);

      if ((directives != null) && directives.needsRefetch()) {
        synchronized (host2directivesCache) {
          host2directivesCache.remove(host);
          // My note: double-checked locking would fit better here,
          // otherwise the remove may throw an exception
          directives = null;
        }
      }

      if (directives == null) {
        directives = fetchDirectives(url);
      }

      return directives.allows(path);
    } catch (MalformedURLException e) {
      logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
    }
  }

  return true;
}

The code is clear: it first obtains the HostDirectives (fetching and parsing robots.txt if necessary) and then checks whether the path is allowed.

public boolean allows(String path) {
  timeLastAccessed = System.currentTimeMillis();
  return !disallows.containsPrefixOf(path) || allows.containsPrefixOf(path);
}

A path is allowed as long as allows contains a prefix of it or disallows does not.
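The decision rule is easy to try out on its own. The sketch below replaces RuleSet with a plain list and a linear scan, so it only illustrates the boolean logic; the class name AllowCheckDemo and the sample rules are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of the allow/disallow decision; not crawler4j code.
class AllowCheckDemo {

  // True if any rule is a string prefix of the path.
  static boolean containsPrefixOf(List<String> rules, String path) {
    for (String rule : rules) {
      if (path.startsWith(rule)) {
        return true;
      }
    }
    return false;
  }

  // Same boolean expression as the method quoted above.
  static boolean allows(List<String> disallows, List<String> allows, String path) {
    return !containsPrefixOf(disallows, path) || containsPrefixOf(allows, path);
  }

  public static void main(String[] args) {
    List<String> disallows = Arrays.asList("/private/");
    List<String> allows = Arrays.asList("/private/press/");
    System.out.println(allows(disallows, allows, "/index.html"));         // true
    System.out.println(allows(disallows, allows, "/private/data"));       // false
    System.out.println(allows(disallows, allows, "/private/press/2019")); // true
  }
}
```

Note that with this expression a matching allow rule always wins over a matching disallow rule, regardless of which of the two is more specific.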

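Finally, this class keeps a map from each host to the HostDirectives parsed from its robots.txt. Because several threads are involved, adding a HostDirectives to the map has to happen under a lock; otherwise the remove could fail. When the cache is already full, the entry with the oldest access time is evicted first.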

synchronized (host2directivesCache) {
  if (host2directivesCache.size() == config.getCacheSize()) {
    String minHost = null;
    long minAccessTime = Long.MAX_VALUE;
    for (Map.Entry<String, HostDirectives> entry : host2directivesCache.entrySet()) {
      if (entry.getValue().getLastAccessTime() < minAccessTime) {
        minAccessTime = entry.getValue().getLastAccessTime();
        minHost = entry.getKey();
      }
    }
    host2directivesCache.remove(minHost);
  }
  host2directivesCache.put(host, directives);
}
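For comparison, the JDK can do this kind of eviction by itself. The sketch below is an alternative technique, not what crawler4j does: a LinkedHashMap in access order that drops its eldest entry once a size limit is exceeded. The class name LruCache and the sample keys are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache built on LinkedHashMap; an alternative to the
// manual scan above, not the crawler4j implementation.
class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruCache(int maxEntries) {
    super(16, 0.75f, true); // true = iterate in access order
    this.maxEntries = maxEntries;
  }

  // Called by put; returning true evicts the least recently used entry.
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }

  public static void main(String[] args) {
    LruCache<String, String> cache = new LruCache<>(2);
    cache.put("a.example", "rules for a");
    cache.put("b.example", "rules for b");
    cache.get("a.example");                // a.example is now the most recent
    cache.put("c.example", "rules for c"); // evicts b.example
    System.out.println(cache.keySet());    // prints [a.example, c.example]
  }
}
```

This avoids scanning every entry on each insertion, but it is not thread-safe: in access order even get changes the map's internal ordering, so concurrent callers would still need a lock or a wrapper such as Collections.synchronizedMap.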
