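When we configure Nutch to crawl http://yangshangchuan.iteye.com, every page it fetches contains nothing but: "Your access request was denied ......". This is the simplest kind of anti-crawler strategy: the site just reads the value of the HTTP User-Agent request header to decide whether the client is a person (a browser) or a crawler. To get around the restriction, we only need to configure Nutch to simulate a web browser.
nutch-default.xml contains five properties related to the User-Agent: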
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are assembled into the User-Agent:
this.userAgent = getAgentString(
    conf.get("http.agent.name"),
    conf.get("http.agent.version"),
    conf.get("http.agent.description"),
    conf.get("http.agent.url"),
    conf.get("http.agent.email") );

private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {

  if ( (agentName == null) || (agentName.trim().length() == 0) ) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }

  StringBuffer buf= new StringBuffer();

  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
    buf.append(" (");

    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ( (agentURL != null) || (agentEmail != null) )
        buf.append("; ");
    }

    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }

    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);

    buf.append(")");
  }
  return buf.toString();
}
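To make the composition rule easier to try out, here is a minimal standalone sketch that mirrors the logic of getAgentString quoted above. The class name AgentStringSketch, the isBlank helper, and the sample values ("MyCrawler", the example.com URL, the mangled email) are our own placeholders and not part of Nutch; logging is left out.

```java
public class AgentStringSketch {

    // Hypothetical helper (not in Nutch): true when a value is null or empty.
    private static boolean isBlank(String s) {
        return s == null || s.isEmpty();
    }

    // Mirrors the quoted logic: name, then "/" + version, then an optional
    // parenthesised part built from description, URL and email.
    public static String buildAgentString(String name, String version,
                                          String desc, String url, String email) {
        StringBuilder buf = new StringBuilder();
        buf.append(name);
        if (version != null) {
            buf.append("/").append(version);
        }
        if (!isBlank(desc) || !isBlank(url) || !isBlank(email)) {
            buf.append(" (");
            if (!isBlank(desc)) {
                buf.append(desc);
                if (url != null || email != null) {
                    buf.append("; ");
                }
            }
            if (!isBlank(url)) {
                buf.append(url);
                if (email != null) {
                    buf.append("; ");
                }
            }
            if (!isBlank(email)) {
                buf.append(email);
            }
            buf.append(")");
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Only name and version set: no parenthesised suffix is added.
        System.out.println(buildAgentString("MyCrawler", "Nutch-1.7", "", "", ""));
        // All five values set (placeholder values).
        System.out.println(buildAgentString("MyCrawler", "Nutch-1.7",
                "test bot", "http://example.com/bot", "info at example dot com"));
    }
}
```

The point to take away is that when the description, URL and email are all empty, the header is simply the name, a slash, and the version.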
The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java is where the User-Agent request header is written. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) {
    Http.LOG.error("User-agent is not set!");
  }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
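As a rough illustration of where that header line ends up, the sketch below assembles a bare-bones HTTP request by hand. It is a simplification under our own assumptions, not the actual HttpResponse code: the class RequestHeaderSketch, the buildRequest method, and the host, path and agent values are hypothetical, and the real request contains more headers than shown here.

```java
public class RequestHeaderSketch {

    // Builds a minimal request. As in the quoted snippet, the User-Agent
    // line is only written when the agent string is non-empty.
    public static String buildRequest(String host, String path, String userAgent) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        if (userAgent == null || userAgent.isEmpty()) {
            System.err.println("User-agent is not set!");
        } else {
            reqStr.append("User-Agent: ");
            reqStr.append(userAgent);
            reqStr.append("\r\n");
        }
        reqStr.append("\r\n"); // blank line terminates the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        String req = buildRequest("example.com", "/", "MyCrawler/Nutch-1.7");
        // Make the CRLF line endings visible when printing.
        System.out.print(req.replace("\r\n", "\\r\\n\n"));
    }
}
```

Whatever string the configuration produces is sent verbatim after "User-Agent: ", which is why the server cannot tell it apart from a browser's header.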
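From the analysis above, adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Each one splits the browser's User-Agent string at its last "/": the part before it goes into http.agent.name and the part after it into http.agent.version, so that getAgentString joins them back together. This relies on http.agent.description, http.agent.url and http.agent.email staying empty (their defaults); otherwise a parenthesised suffix is appended.
1. Imitating Firefox: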
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>
2. Imitating IE:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>
3. Imitating Chrome:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>
4. Imitating Safari:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>
5. Imitating Opera:
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>
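As a quick sanity check on the five configurations above, the following sketch joins each name/version pair the way the agent string is built when description, URL and email are empty, and prints the header value Nutch would be expected to send. The class BrowserAgentCheck and its methods are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserAgentCheck {

    // With description, URL and email empty, the agent string reduces to
    // name + "/" + version.
    public static String compose(String name, String version) {
        return name + "/" + version;
    }

    // The name/version pairs from the five configurations, keyed by browser.
    public static Map<String, String> browserAgents() {
        Map<String, String> agents = new LinkedHashMap<>();
        agents.put("Firefox", compose(
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko",
                "20100101 Firefox/27.0"));
        agents.put("IE", compose(
                "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident",
                "6.0)"));
        agents.put("Chrome", compose(
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari",
                "537.36"));
        agents.put("Safari", compose(
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari",
                "534.57.2"));
        agents.put("Opera", compose(
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR",
                "19.0.1326.59"));
        return agents;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : browserAgents().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Comparing the printed strings against what a real browser reports is an easy way to catch a misplaced split point.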
Postscript: ways to look up User-Agent strings:
1. http://www.useragentstring.com
2. http://www.enhanceie.com/ua.aspx