Mr. Qian - Hadoop Big Data Operations Field Notes

Date: 2020-05-14


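1. If the NameNode (NN) goes down and failover does not happen, check the zkfc log first.
In this case the path configured in dfs.ha.fencing.ssh.private-key-files was wrong, so the SSH private key could not be found.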
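
A quick way to catch this kind of misconfiguration before a failover is to read the property out of hdfs-site.xml and check that each listed key file exists. The following is a minimal sketch, not part of the original notes: the function name `fencing_key_problems` is hypothetical, and it assumes the property holds a comma-separated list of key file paths.

```python
import os
import xml.etree.ElementTree as ET

PROPERTY = "dfs.ha.fencing.ssh.private-key-files"


def fencing_key_problems(hdfs_site_path):
    """Return a list of problems with the configured fencing key files.

    Hypothetical helper: parses hdfs-site.xml and checks every path listed
    in the property (assumed to be comma-separated).
    """
    root = ET.parse(hdfs_site_path).getroot()
    value = None
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == PROPERTY:
            value = prop.findtext("value", "").strip()
    if not value:
        return [PROPERTY + " is not set"]

    problems = []
    for path in value.split(","):
        path = path.strip()
        if not os.path.isfile(path):
            problems.append(path + ": file does not exist")
        elif not os.access(path, os.R_OK):
            problems.append(path + ": not readable by the current user")
    return problems
```

Run it as the user that starts zkfc, for example `fencing_key_problems("/path/to/hdfs-site.xml")` (the path is a placeholder); an empty list means every configured key file was found and is readable.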

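2. dfs.namenode.shared.edits.dir points to the addresses where the JournalNodes (JN) run. During deployment the JN on each of those servers must be started first; otherwise the NN metadata cannot be copied.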

3. zkfc is a ZooKeeper client that is responsible for the failover action. The standby server switches to active only when zkfc is running.

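4. dfs.ha.fencing.methods defines how the failed NameNode is fenced off during failover, so that it can no longer act as active. With sshfence, for example, the fencing host logs in over SSH and kills the NameNode process.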

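5. Hadoop native library problem: recompile the native library and replace it (see Shanghai Shangxuetang's introduction to the Hadoop native library):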

http://www.shsxt.com/it/Big-data/656.html

6. ERROR snappy.SnappyCompressor: failed to load SnappyCompressor
java.lang.UnsatisfiedLinkError: Cannot load libsnappy.so.1 (libsnappy.so.1: cannot open shared object file: No such file or directory)!

Cause: the required environment variable is not set.

Solution:
http://www.cnblogs.com/smartvessel/archive/2011/01/21/1940868.html
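
The notes do not name the environment variable. Assuming it is the dynamic linker search path LD_LIBRARY_PATH (an assumption, not stated in the original), a rough sketch like the one below shows which directories on that path actually contain libsnappy.so.1. The function name `find_shared_library` is hypothetical.

```python
import os


def find_shared_library(name, search_path):
    """Return the directories in a colon-separated search path that contain `name`.

    Assumption: the missing variable is LD_LIBRARY_PATH; the original notes
    do not say which variable was unset.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.exists(os.path.join(directory, name)):
            hits.append(directory)
    return hits
```

For example, `find_shared_library("libsnappy.so.1", os.environ.get("LD_LIBRARY_PATH", ""))` returns an empty list when the variable is unset or when none of its directories hold the library.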

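7. On the first SSH connection to a host, the yes/no confirmation prompt must be answered before the nodes can communicate normally.
Solution: connect to every node once.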

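8. When HDFS starts, it needs to create its HDFS node directory under /data/ to record the version number. If it lacks the permissions to create it, startup fails.
Solution: fix the permissions or create the HDFS node directory manually.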
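
This can be checked before starting HDFS by running a small test as the HDFS user. A minimal sketch, with a hypothetical helper name; the actual directory under /data/ depends on your configuration and is not given in the notes.

```python
import os


def ensure_writable_dir(path):
    """Create `path` if needed and report whether the current user can write to it.

    Raises PermissionError if the directory cannot be created, which is the
    same condition that makes the HDFS startup fail.
    """
    os.makedirs(path, exist_ok=True)
    # Both write and search (execute) permission are needed to create files inside.
    return os.access(path, os.W_OK | os.X_OK)
```

If the call raises PermissionError or returns False, fix the ownership or mode of the parent directory before starting the daemons.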

9. The SSH confirmation prompt blocks connections. The following SSH client option disables the prompt:

StrictHostKeyChecking no
http://www.cnblogs.com/yuxc/archive/2012/11/15/2772484.html
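
Both SSH items (connecting to every node once, and disabling the prompt) can be combined in a small script. This is a sketch under assumptions: the host names are placeholders, key-based login is already set up, and the helper names are hypothetical. It passes StrictHostKeyChecking=no on the command line instead of editing the SSH configuration file.

```python
import subprocess


def ssh_command(host, remote_command="true"):
    """Build an ssh invocation that never stops at the yes/no host-key prompt."""
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=no",  # accept new host keys without asking
        "-o", "BatchMode=yes",             # fail instead of prompting for a password
        host,
        remote_command,
    ]


def connect_once(hosts):
    """Connect to every host once and return a mapping of host to ssh exit code."""
    results = {}
    for host in hosts:
        results[host] = subprocess.run(ssh_command(host), capture_output=True).returncode
    return results
```

Calling `connect_once(["node01", "node02"])` (placeholder names) from each node records the host keys in known_hosts; a non-zero exit code points at a host that still cannot be reached without interaction. Note that disabling strict host key checking trades away protection against spoofed hosts, so it is best limited to a trusted internal network.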

10. No Route to Host from xxxx-xxxxxop-namenode01.node.kddi.op.xxxx.com/xxx.xxx.11.1 to xxxx-hadoop-datanode05.node.xxxx.op.xxxx.com:8485 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
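
The notes record no fix for this error. As a first diagnostic step, a plain TCP probe from the NameNode host to the target host and port (8485 is the default JournalNode RPC port) shows whether the failure is at the network level or inside Hadoop. A minimal sketch with a hypothetical function name:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connection and return (ok, detail).

    Name resolution failures, timeouts, refused connections and
    "no route to host" all surface as OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        return False, type(exc).__name__ + ": " + str(exc)
```

If the probe fails in the same way as the Hadoop client, the next places to look are the checks listed on the wiki page linked in the error message.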

11. mysql.sock cannot be found (mysql.sock is missing): solution

12. [ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Solution: replace the jline.jar in Hadoop's lib directory with the jline.jar from Hive's lib directory.

13. After migrating the Hive tables, select * from xxxx no longer works (queries fail) and reports:
FAILED: SemanticException Unable to determine if hdfs://hadoop2service/hive/warehouse/pv_tmpis encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://hadoop2service/hive/warehouse/pv_tmp, expected: hdfs://hadoop2kddi

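Solution: in the backup SQL, replace every occurrence of the old cluster name with the new cluster name, then restore the backup into the database that stores the new Hive metadata.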
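
The replacement step can be scripted. The sketch below assumes the metastore backup is a plain-text SQL dump and takes the two URIs from the error message above (hdfs://hadoop2service as the old name, hdfs://hadoop2kddi as the new one); the function name is hypothetical.

```python
def rewrite_nameservice(sql_text, old_uri, new_uri):
    """Replace every occurrence of the old cluster URI in a SQL dump.

    Returns the rewritten text and the number of replacements made, so the
    count can be sanity-checked before the dump is restored.
    """
    count = sql_text.count(old_uri)
    return sql_text.replace(old_uri, new_uri), count
```

A typical use is to read the dump into a string, call `rewrite_nameservice(text, "hdfs://hadoop2service", "hdfs://hadoop2kddi")`, write the result to a new file, and restore that file. Keeping the original dump untouched makes the step easy to repeat if the count looks wrong.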

14.

Cause: a syntax error; this is a Python error.

It should not be written this way; change it to

15. SELECT clientid, url, COUNT(1) pv FROM pv_tmp GROUP BY clientid, url HAVING pv > 50

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.ppd.ExprWalkerInfo.getConvertedNode(Lorg/apache/hadoop/hive/ql/lib/Node;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeDesc;

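Troubleshooting:
Debug command: hive -hiveconf hive.root.logger=DEBUG,console

Cause: Hadoop 2.6.0 is incompatible with Hive 1.1.0 and later.

Solution: revert Hive to 1.0.1, after which it works normally.
Note: when migrating, first make sure the new installation is healthy and usable, and only then migrate the data. If both steps are done in one go and something fails, you cannot tell whether the cause is a compatibility problem or an operational mistake.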

16. Using beeline (a client that connects to Hive through HiveServer)
Advantage: query results are shown in a complete table layout.
beeline

!connect jdbc:hive2://tech-hadoop-namenode01.node.xxxx.op.xxxx.com:10000 hadoop RgWrXlKN9j3VkYQO org.apache.hive.jdbc.HiveDriver

17. Syntax issues when moving from MySQL 5.1 to MySQL 5.5

To run SQL without a list, use query; for statements with values, use execute.

18. Hive data skew
One solution: use distribute by to set the map output key to a column whose values hash evenly across reducers.
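
To see why the choice of distribution column matters, the following small simulation counts how many rows each reducer would receive for a given key column. It is only an illustration under assumptions: it uses CRC32 as a stand-in hash (Hive's own hash function differs), and the sample rows and column names are made up.

```python
import zlib


def reducer_loads(rows, key_column, num_reducers):
    """Count rows per reducer when rows are partitioned by hash(key) % num_reducers.

    CRC32 is a stand-in for illustration; it is not the hash Hive uses.
    """
    loads = [0] * num_reducers
    for row in rows:
        key = str(row[key_column]).encode("utf-8")
        loads[zlib.crc32(key) % num_reducers] += 1
    return loads


# Made-up sample: 900 of 1000 rows share one url, but every clientid is unique.
rows = [{"clientid": "c%d" % i, "url": "/home" if i < 900 else "/page%d" % i}
        for i in range(1000)]
by_url = reducer_loads(rows, "url", 4)            # one reducer gets at least 900 rows
by_clientid = reducer_loads(rows, "clientid", 4)  # rows spread across reducers
```

Distributing by the skewed column sends every row with the hot value to the same reducer, while a high-cardinality column spreads the same rows out. Whether a different column is acceptable depends on the query, since rows that must be aggregated together still have to meet in one reducer.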

21.

Solution:
① Check the logs: Jobs history > Map kill logs > full logs. They show the following error:

Log Type: syslog
Log Upload Time: 25-Aug-2015 08:50:15
Log Length: 5503
2015-08-25 08:49:32,935 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-08-25 08:49:33,039 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-08-25 08:49:33,039 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2015-08-25 08:49:33,055 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2015-08-25 08:49:33,055 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1440463117446_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@21683789)
2015-08-25 08:49:33,214 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2015-08-25 08:49:34,279 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:35,280 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:36,281 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:37,282 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:38,283 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:39,283 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:40,284 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:41,285 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:42,285 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:43,286 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:6819. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-25 08:49:43,289 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From xxxx-xxxoop-datanode06.node.xxxx.op.xxxx.com/xxx.xxx.12.6 to localhost:6819 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
at com.sun.proxy.$Proxy9.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:132)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 4 more

2015-08-25 08:49:43,290 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2015-08-25 08:49:43,291 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2015-08-25 08:49:43,291 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

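② The error keeps retrying against localhost, so the connection is being lost at localhost, most likely because of the IPv6 localhost binding. Commenting out the IPv6 localhost mapping in the hosts file resolved the problem completely.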
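
The hosts file edit can be expressed as a small text transformation. This sketch assumes the mapping in question is the usual `::1 localhost` line (the notes do not quote the exact line), and the function name is hypothetical. It works on the file contents as a string, so the result can be reviewed before it is written back.

```python
def comment_out_ipv6_localhost(hosts_text):
    """Comment out lines that map the IPv6 loopback address ::1 to localhost.

    Assumption: the problematic entry is the "::1 localhost" line.
    Lines that are already commented out are left unchanged.
    """
    out = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == "::1" and "localhost" in fields[1:]:
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Apply it to a copy of /etc/hosts first and compare the output with the original; the 127.0.0.1 mapping must stay in place.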

Note: always, always read the logs!