2014-3-10
【Requirements】
The work I have taken on involves processing massive amounts of data. The first step is to use tooling to produce some operational metrics. Hadoop is the choice so that, as the data volume grows, we can simply add machines without touching the statistics logic.
The Hadoop community is very active right now and new tools keep appearing around it. Below is a first look at some of the popular ones:
hadoop: includes HDFS and MapReduce
hbase: supports very large tables; requires ZooKeeper
zookeeper: distributed cluster coordination, zk for short
flume/scribe/Chukwa: distributed log collection systems that aggregate logs from many machines onto one node
sqoop: data transfer between a traditional DB and HDFS/HBase
hive: a SQL query interface
pig: a scripting query interface
hadoop streaming: MapReduce over standard input/output, with the logic written in a scripting language (shell/python/php/ruby); a minimal sketch follows this list
hadoop pipes: socket-based input/output, with the logic written in C++
avro: a serialization tool
oozie: chains several MR jobs together
snappy: a compression tool
mahout: a machine learning toolkit
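As a quick illustration of Hadoop streaming (a sketch only, to be run once the cluster described below is up): the classic smoke test uses cat as the mapper and wc as the reducer. The streaming jar path is an assumption based on typical CDH packaging, and the input directory is the one created later in the YARN example.
# Hypothetical streaming smoke test: cat as mapper, wc as reducer
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input input \
    -output output_streaming \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
hadoop fs -cat output_streaming/part-*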
Of course, brand-new tools keep emerging all the time, such as Spark and Impala. The current requirement is to stand up a single-machine pseudo-distributed Hadoop, and then switch over fairly smoothly to a multi-machine distributed deployment as the business data volume grows.
In the internet spirit of small steps and fast iterations, and without much hands-on experience yet, we start with a simple setup and keep supplementing, tuning, and adding new tools later on.
The initial setup includes: hadoop (required for everything), hive (convenient querying), a simple wrapper around Hadoop streaming (to make scripting languages easy to use), sqoop (importing data from a traditional DB), and pig (worth a try, not required).
Later additions may include: zookeeper, hbase (tables at the hundred-million-row scale), mahout (machine learning), and so on.
[PS] The working environment is 64-bit Ubuntu 12.04; the offline experiments use the Desktop edition, while the actual deployment uses the Server edition.
【Java 7】
First comes the Java installation. Ubuntu may ship with OpenJDK by default, but Oracle Java is preferable, so remove OpenJDK first:
sudo apt-get purge openjdk*
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
After installation, add JAVA_HOME to the environment variables. Among the many files that can set them, such as /etc/profile and ~/.bashrc, Ubuntu recommends /etc/environment, but I ran into all sorts of odd problems with it, probably formatting related, so for now I recommend /etc/profile, which is also what Programming Hive uses.
export JAVA_HOME="/usr/lib/jvm/java-7-oracle/"
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
. /etc/profile
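A quick sanity check that the Oracle JDK is active and the variables took effect:
# Should report the Oracle Java 7 runtime and the path set above
java -version
echo $JAVA_HOME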
【Hadoop Setup】
各類不一樣的發行版有官方版本,Cloudera,MapR等免費開源版,商業版就不說了,反正用不上。一個數據:國內公司75%用cloudera,由於方便 vie 利用Cloudera實現Hadoop,官方版的安裝說明
筆者嘗試用Apache的Hadoop版本,在64位Ubuntu上搭建最新穩定版2.2.0(2014/3/10),竟然直接拿來用的庫文件不支持64位,要本身編譯,這水就很深了,編譯這種系統老是缺胳膊少腿的,不是少了編譯工具,就是少了依賴。實際狀況中碰到各類bug,各類問題不斷壓棧,使用成本不小。大概花了一天才把Hadoop搭建起來,並且感受這樣東拼西湊,感受隨時能夠崩潰。遂決定換Cloudera的CDH4。
在Cloudera官方網站查得支持的各類軟硬件條件,支持個人64位的Ubuntu (12.04) (使用命令 uname -a 和 cat /etc/issue)
# 官方僞分佈安裝說明 就幾步:
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install hadoop-conf-pseudo  # lists the packages to be installed, including zookeeper; can be slow depending on network speed, so consider running it in the background with nohup
sudo -u hdfs hdfs namenode -format  # format the NameNode
# Start HDFS
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
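At this point the HDFS daemons should be running; a quick check uses the same init scripts, and the NameNode web UI (port 50070 by default) should also respond:
# Check the status of each HDFS daemon
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x status ; done
# Or browse the NameNode web UI at http://localhost:50070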
# Create the /tmp directory and the YARN and log directories
sudo -u hdfs hadoop fs -rm -r /tmp
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
# Check the directories
sudo -u hdfs hadoop fs -ls -R /
The result should be:
drwxrwxrwt   - hdfs   supergroup  0 2012-05-31 15:31 /tmp
drwxr-xr-x   - hdfs   supergroup  0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt   - mapred mapred      0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x   - mapred mapred      0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt   - mapred mapred      0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x   - hdfs   supergroup  0 2012-05-31 15:31 /var
drwxr-xr-x   - hdfs   supergroup  0 2012-05-31 15:31 /var/log
drwxr-xr-x   - yarn   mapred      0 2012-05-31 15:31 /var/log/hadoop-yarn
# Start YARN
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
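Once these services are up, the ResourceManager web UI (port 8088 by default) should be reachable; a quick check:
# The cluster overview page should return HTML
curl -s http://localhost:8088/cluster | head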
# Create user directories
sudo -u hdfs hadoop fs -mkdir /user/danny
sudo -u hdfs hadoop fs -chown danny /user/danny
The general format is:
sudo -u hdfs hadoop fs -mkdir /user/<user>
sudo -u hdfs hadoop fs -chown <user> /user/<user>
# Running an example application with YARN
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop fs -ls input
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
hadoop fs -ls
hadoop fs -ls output23
hadoop fs -cat output23/part-r-00000 | head
The result should be:
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.permissions.enabled
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
【Hive】
Install Hive, then MySQL (which will hold the metastore):
sudo apt-get install hive hive-metastore hive-server
sudo apt-get install mysql-server
sudo service mysql start
If you need to set or change the root password:
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
To make sure the MySQL server starts at boot, install sysv-rc-conf via apt (a replacement for chkconfig):
sudo apt-get install sysv-rc-conf
sudo sysv-rc-conf mysql on
Create the metastore database, register a user, and grant privileges:
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
mysql> create user 'hive'@'%' identified by 'hive';
mysql> create user 'hive'@'localhost' identified by 'hive';
mysql> revoke all privileges, grant option from 'hive'@'%';
mysql> revoke all privileges, grant option from 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'%';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
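A quick way to confirm that the grants work (the password "hive" matches the one set above):
# Log in as the hive user and check that the metastore schema is visible
mysql -u hive -phive -e "USE metastore; SHOW TABLES;"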
Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory:
sudo apt-get install libmysql-java
sudo ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar
Configure the metastore service to communicate with the MySQL database by editing hive-site.xml:
sudo cp /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak
sudo vim /etc/hive/conf/hive-site.xml
Change the relevant properties to:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  <description>the URL of the MySQL database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
Start the services and initialize the HDFS directories:
sudo service hive-metastore start
sudo service hive-server start
sudo -u hdfs hadoop fs -mkdir /user/hive
sudo -u hdfs hadoop fs -chown hive /user/hive
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod 777 /tmp  # already exists
sudo -u hdfs hadoop fs -chmod o+t /tmp
sudo -u hdfs hadoop fs -mkdir /data
sudo -u hdfs hadoop fs -chown hdfs /data
sudo -u hdfs hadoop fs -chmod 777 /data
sudo -u hdfs hadoop fs -chmod o+t /data
sudo chown -R hive:hive /var/lib/hive
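As a final sanity check (a minimal sketch; the table name smoke_test is made up for illustration), the Hive CLI should now be able to create and drop a table through the MySQL-backed metastore:
# Hypothetical table used only to exercise the metastore connection
hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT, msg STRING);"
hive -e "SHOW TABLES;"
hive -e "DROP TABLE smoke_test;"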