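Today our Alluxio HA environment got into a state where users could not write files. Creating a directory worked normally, but running copyFromLocal on a file produced no response at all. Afterwards the new file was visible, but its size was 0.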
alluxio fs copyFromLocal test.txt /user/mytest/prefix2
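In the end I decided to restart the master and see what would happen. So I restarted it, and then... there was no "and then". The master would not come back up!
1. master.log showed the problem. At first the master replayed the journal log files normally, but then it hit a journal entry that refers to inodeId 52, which could not be found (FileDoesNotExistException), and that caused the master startup to fail.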
2018-07-25 16:39:12,897 INFO logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Processing a completed log file.
.......
2018-07-25 16:39:21,461 INFO logger.type (JournalReader.java:getNextInputStream) - Opening journal log file: hdfs://azbeta/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000007
2018-07-25 16:39:21,512 INFO logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Processing a completed log file.
2018-07-25 16:39:21,520 ERROR logger.type (ServerUtils.java:run) - Uncaught exception while running Alluxio master, stopping it and exiting.
java.lang.RuntimeException: alluxio.exception.FileDoesNotExistException: inodeId 52 does not exist
    at alluxio.master.file.FileSystemMaster.processJournalEntry(FileSystemMaster.java:347)
    at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:123)
    at alluxio.master.AbstractMaster.start(AbstractMaster.java:148)
    at alluxio.master.file.FileSystemMaster.start(FileSystemMaster.java:419)
    at alluxio.master.DefaultAlluxioMaster.startMasters(DefaultAlluxioMaster.java:263)
    at alluxio.master.FaultTolerantAlluxioMaster.start(FaultTolerantAlluxioMaster.java:91)
    at alluxio.ServerUtils.run(ServerUtils.java:38)
    at alluxio.master.AlluxioMaster.main(AlluxioMaster.java:43)
>>>>>>>>>>>>>>>Caused by: alluxio.exception.FileDoesNotExistException: inodeId 52 does not exist<<<<<<<<<<<<<<<<<
    at alluxio.master.file.meta.InodeTree.lockFullInodePath(InodeTree.java:351)
    at alluxio.master.file.FileSystemMaster.setAttributeFromEntry(FileSystemMaster.java:3006)
    at alluxio.master.file.FileSystemMaster.processJournalEntry(FileSystemMaster.java:345)
    ... 7 more
2018-07-25 16:39:21,522 INFO logger.type (DefaultAlluxioMaster.java:stop) - Stopping Alluxio master @ aznballuxiosl01.envazure.com/10.24.101.103:19998
2018-07-25 16:39:21,522 ERROR logger.type (LeaderSelectorClient.java:takeLeadership) - aznballuxiosl01.envazure.com:19998 was interrupted.
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at alluxio.LeaderSelectorClient.takeLeadership(LeaderSelectorClient.java:177)
    at org.apache.curator.framework.recipes.leader.LeaderSelector$3.run(LeaderSelector.java:328)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:319)
    at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:376)
    at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:48)
    at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:197)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
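When the master dies during journal replay, the useful facts are which journal log file was being replayed and what the root cause was. The following is only a rough sketch of how to pull those two facts out of master.log. The function name find_failing_journal is made up for illustration, and the patterns are copied from the Alluxio 1.4.0 log lines quoted above, so other versions may need different patterns.

```python
import re

# Patterns taken from the master.log excerpt in this post (Alluxio 1.4.0).
# They are assumptions about the log wording, not a documented format.
OPEN_RE = re.compile(r"Opening journal log file: (\S+)")
FATAL_RE = re.compile(r"Uncaught exception while running Alluxio master")
CAUSE_RE = re.compile(r"Caused by: ([\w.$]+): ([^<\n]*)")


def find_failing_journal(log_text):
    """Return (journal_file, cause) for the first fatal startup error.

    journal_file is the last journal log file opened before the fatal line
    (None if no file was opened yet); cause is the first "Caused by" message
    at or after the fatal line (None if absent). Returns None when the log
    contains no fatal line at all.
    """
    last_opened = None
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        opened = OPEN_RE.search(line)
        if opened:
            last_opened = opened.group(1)
        if FATAL_RE.search(line):
            cause = None
            for later in lines[i:]:
                found = CAUSE_RE.search(later)
                if found:
                    cause = f"{found.group(1)}: {found.group(2).strip()}"
                    break
            return last_opened, cause
    return None
```

Calling find_failing_journal(open("master.log").read()) on a log like the one above should point at the completed log file that was open when replay failed, which is the file worth backing up or inspecting before doing anything destructive.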
2. Running hdfs fsck did not report any corrupt files either.
$ hdfs fsck /user/alluxio/
Connecting to namenode via https://azcbetannl02.envazure.com:50470/fsck?ugi=hdfs&path=%2Fuser%2Falluxio
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.24.101.76 for path /user/alluxio at Wed Jul 25 16:38:07 CST 2018
.........Status: HEALTHY
 Total size: 61438434 B (Total open files size: 275 B)
 Total dirs: 7
 Total files: 9
 Total symlinks: 0 (Files currently being written: 1)
 Total blocks (validated): 8 (avg. block size 7679804 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks: 8 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 6
 Number of racks: 1
FSCK ended at Wed Jul 25 16:38:07 CST 2018 in 4 milliseconds

The filesystem under path '/user/alluxio' is HEALTHY
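One detail in this output is easy to miss: it reports one file currently being written, and the block of that open file is listed as "not validated". So HEALTHY here only covers the closed files. Which file is open is not shown; it may be the journal's in-progress log, but that is a guess on my part. Below is a small sketch for pulling those fields out of the fsck text so they can be checked alongside the status. The function name parse_fsck_summary is invented for this example, and the field labels are the ones that appear in the output above.

```python
import re


def parse_fsck_summary(output):
    """Extract a few fields from the text printed by `hdfs fsck`.

    Field labels follow the fsck output quoted in this post; any field
    that is not found in the text is returned as None.
    """
    def grab(pattern, cast=int):
        match = re.search(pattern, output)
        return cast(match.group(1)) if match else None

    return {
        "status": grab(r"Status:\s*(\w+)", str),
        "corrupt_blocks": grab(r"Corrupt blocks:\s*(\d+)"),
        "open_files": grab(r"Files currently being written:\s*(\d+)"),
        "open_bytes": grab(r"Total open files size:\s*(\d+)\s*B"),
    }
```

A result with status HEALTHY but open_files greater than 0 means there is at least one file whose blocks fsck did not validate, so the healthy status alone says nothing about that file.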
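3. Searching Google did not turn up any good solution either.
4. Finally I played the trump card: format the master. The effect was immediate. The master came up, and all the data was gone!!! Note: ALL of the data is gone!!
Because our Alluxio is only an intermediate temporary cache, the impact of the format was small.
5. Let's look at what the log looks like when Alluxio starts up normally: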
2018-07-25 16:40:52,545 INFO logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting leader master.
2018-07-25 16:40:52,550 INFO logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2018-07-25 16:40:52,557 INFO logger.type (AbstractMaster.java:start) - FileSystemMaster: journal checkpoint does not exist, nothing to process.
2018-07-25 16:40:52,562 INFO logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: hdfs://azbeta/user/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2018-07-25 16:40:52,564 INFO logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 0 Next journal sequence number: 1
2018-07-25 16:40:52,699 INFO logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: hdfs://azbeta/user/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2018-07-25 16:40:52,720 INFO logger.type (CheckpointManager.java:updateCheckpoint) - Renamed the checkpoint file from hdfs://azbeta/user/alluxio/journal/FileSystemMaster/checkpoint.data.tmp to hdfs://azbeta/user/alluxio/journal/FileSystemMaster/checkpoint.data
2018-07-25 16:40:52,720 INFO logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2018-07-25 16:40:52,722 INFO logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting completed log: hdfs://azbeta/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000000
2018-07-25 16:40:52,723 INFO logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2018-07-25 16:40:52,735 INFO logger.type (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2018-07-25 16:40:52,750 INFO util.log (Log.java:initialized) - Logging initialized @5111ms
2018-07-25 16:40:52,917 INFO server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2018-07-25 16:40:52,943 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@7ee0a1b3{/metrics/json,null,AVAILABLE}
2018-07-25 16:41:04,080 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.w.WebAppContext@4e3fbce1{/,file:/data1/alluxio-1.4.0/core/server/src/main/webapp/,AVAILABLE}{/data1/alluxio-1.4.0/core/server/src/main/webapp}
2018-07-25 16:41:04,087 INFO server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@47fe4ff8{HTTP/1.1}{0.0.0.0:19999}
2018-07-25 16:41:04,087 INFO server.Server (Server.java:doStart) - Started @16448ms
2018-07-25 16:41:04,087 INFO logger.type (WebServer.java:start) - Alluxio Master Web service started @ /0.0.0.0:19999
2018-07-25 16:41:04,088 INFO logger.type (DefaultAlluxioMaster.java:startServing) - Alluxio master version 1.4.0 started @ aznballuxiosl01.envazure.com/10.24.101.103:19998 (gained leadership)
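Summary:
We adopted Alluxio HA for the sake of data safety and stability.
However, the shared files in the journal folder on HDFS have been corrupted several times now, so stability has actually gotten worse. We need to reconsider whether to keep using HDFS as the basis for HA.
Also, hdfs fsck has reported HEALTHY every time a problem occurred. In other words, the blame does not lie with HDFS: the journal file was corrupted because a write from Alluxio to HDFS did not succeed.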