Problem 1: WARN crawl.Generator: Generator: 0 records selected for fetching
Possible causes:
1) The regular expressions in regex-urlfilter.txt are wrong, so no URLs pass the filter (a minimal example follows).
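For reference, a minimal regex-urlfilter.txt that crawls a single site looks roughly like this (the domain below is a placeholder; note that the final catch-all rule rejects anything not explicitly accepted, so a missing or misplaced accept rule yields "0 records selected"):

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept anything under the target site (placeholder domain)
+^http://www\.example\.com/
# reject everything else
-.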
Problem 2: Bad Request
request: http://XXXXX:8080/solr/CultureSearch/update?wt=javabin&version=2
This is caused by a problem in the SolrCloud configuration files, chiefly the related schema.xml.
In my case the first cause was a jar that schema.xml referenced but that was missing from the classpath; the second was that the declared type of the _version_ field did not match (see the example below).
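For the second cause: in Solr 4.x the _version_ field must be declared as a long, as in the field definition that ships with the stock example schema.xml:

<field name="_version_" type="long" indexed="true" stored="true"/>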
Problem 3:
15/04/07 23:31:03 INFO mapreduce.Job: Task Id : attempt_1427710479955_0129_r_000000_0, Status : FAILED
Error: java.io.IOException
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:137)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:511)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:334)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
... 14 more
Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
... 17 more
15/04/07 23:45:44 INFO mapreduce.Job: Task Id : attempt_1427710479955_0129_r_000000_1, Status : FAILED
Error: java.lang.RuntimeException: problem advancing post rec#954132
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1364)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:213)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:209)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:176)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: Cannot initialize the class: class org.apache.hadoop.io.NullWritable
at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:49)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1421)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1361)
... 11 more
15/04/08 00:08:39 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
This one is odd: judging from the log it seems to be related to Solr. I did not find a clean fix right away. You still need to bring the Solr configuration and the related jars into place, and also check the execution permissions of the Tomcat process (a rough sketch follows).
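As a rough sketch only, with hypothetical paths that must be adapted to your own deployment, pulling the SolrJ client jars into Nutch's runtime lib and giving the Tomcat user ownership of the Solr webapp could look like:

# hypothetical paths -- adjust to your install
cp /opt/solr/dist/solrj-lib/*.jar /opt/nutch/runtime/local/lib/
chown -R tomcat:tomcat /opt/tomcat/webapps/solr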
Problem 4: When this error occurs the job stops, and the crawler will not crawl any further.
Error: java.io.IOException
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:137)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:511)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:334)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
... 14 more
Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2114)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1096)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)
... 17 more
Error: java.lang.RuntimeException: problem advancing post rec#954132
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1364)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:213)
at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:209)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:176)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:462)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.io.IOException: Cannot initialize the class: class org.apache.hadoop.io.NullWritable
at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:49)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1421)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1361)
... 11 more
Job failed as tasks failed. failedMaps:0 failedReduces:1
File System Counters
  FILE: Number of bytes read=4550678317
  FILE: Number of bytes written=9135325272
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=1948818838
  HDFS: Number of bytes written=0
  HDFS: Number of read operations=84
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=0
Job Counters
  Failed reduce tasks=4
  Killed map tasks=10
  Launched map tasks=31
  Launched reduce tasks=4
  Data-local map tasks=18
  Rack-local map tasks=13
  Total time spent by all maps in occupied slots (ms)=2958546
  Total time spent by all reduces in occupied slots (ms)=3033513
Map-Reduce Framework
  Map input records=12879328
  Map output records=12879328
  Map output bytes=4555697629
  Map output materialized bytes=4582546030
  Input split bytes=2693
  Combine input records=0
  Spilled Records=25446303
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=4776
  CPU time spent (ms)=314730
  Physical memory (bytes) snapshot=7077011456
  Virtual memory (bytes) snapshot=25862344704
  Total committed heap usage (bytes)=7811366912
File Input Format Counters
  Bytes Read=1948816145
15/04/08 00:08:39 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Two fixes found online (editing nutch-default.xml, and keeping the Nutch and Solr configuration files consistent) did not solve the problem. An expert in a chat group told me to check the Hadoop logs, and the Hadoop logs did indeed show an error; searching on that error led to the following.
2015-04-10 04:04:33,899 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop4:50010:DataXceiver error processing WRITE_BLOCK operation src: /XXXXXXX dest: /XXXXX:50010
Based on that error I adjusted the following setting in Hadoop's hdfs-site.xml:
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
Other remedies:
delete the /linkdb path in HDFS;
change the value of the plugin.folders property in nutch-default.xml (see the sketch below).
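For reference, the property as shipped in nutch-default.xml is shown below. Its default value is the relative path "plugins"; when running the job jar on a cluster it often has to be changed to an absolute path to the unpacked plugins directory (the right path depends on your deployment):

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
</property>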