postgres-xl是個好東西?爲何呢?javascript
它能夠搞oltp, 抗衡mysql, mysql沒有分析函數功能html
它能夠搞oltp, 抗衡oracle, oracle生態弱, 無法實時java
它對json支持好, 抗衡mongodb, mongodb用得人愈來愈少了。。。node
它還恐怖地支持各類語言擴展, java, javascript, r, python, haskell。。。python
而且開源免費,簡單強大的沒朋友。。。mysql
它比greenpulm版本新: 它跟greenpulm本是一家人,都是MPP架構的postgresql集羣
可是原生改造, 版本基本上與postgresql一致, greenpulm的版本升不動啊。。。linux
它比oracle RAC/Teradata便宜,無償使用sql
它比hadoop省資源,沒有GC,基於C語言資源利用率高,而且生態圈豐富,可視化方便mongodb
它出道早,版本穩定性強。數據庫
postgres-xl分爲如下組件:
a. gtm 負責全局事務
b. coordinator 處理分發執行
c. datanode 負責底層處理
datanode跟coordinator都 鏈接到gtm,
客戶端鏈接到coordinator運行sql,
coordinator使用gtm進行一些事務功能分發給datanode執行
大數據的發展方向
什麼是MPP?
MPP (Massively Parallel Processing),即大規模並行處理,是分佈式、並行、結構化數據庫集羣,具有高性能、高可用、高擴展特性,能夠爲超大規模數據管理提供高性價比的通用計算平臺,普遍用於支撐各種數據倉庫系統、BI 系統。
MPP架構特徵:
•任務並行執行
•數據分佈式存儲(本地化)
•分佈式計算
•橫向擴展
思考對比
Oracle集羣的特色:每一個節點全量存儲
Mysql熱備的特色:主從,每一個節點全量存儲
MPP——分佈式關係數據庫
GreenPlum:基於Postgres XL 8.2進行封裝,多年沒升級過。
Mysql Cluster:剛出來兩年,穩定性和性能比較差。
Postgres XL 與 Postgres LL ,是Postgresql數據庫兩種集羣模式,Postgres XL是當今主流MPP
Postgres XL 集羣架構
GTM:Global Transaction Manager
Coordinator:協調器
Datanode:數據節點
GTM-Proxy:GTM代理器
組件介紹
Global Transaction Monitor (GTM)
全局事務管理器,確保羣集範圍內的事務一致性。 GTM負責發放事務ID和快照做爲其多版本併發控制的一部分。
集羣可選地配置一個備用GTM(GTM Standby),以改進可用性。此外,能夠在協調器間配置代理GTM, 可用於改善可擴展性,減小GTM的通訊量。
GTM Standby
GTM的備節點,在pgxc,pgxl中,GTM控制全部的全局事務分配,若是出現問題,就會致使整個集羣不可用,爲了增長可用性,增長該備用節點。當GTM出現問題時,GTM Standby能夠升級爲GTM,保證集羣正常工做。
GTM-Proxy
GTM須要與全部的Coordinators通訊,爲了下降壓力,能夠在每一個Coordinator機器上部署一個GTM-Proxy。
Coordinator
協調員管理用戶會話,並與GTM和數據節點進行交互。協調員解析,產生查詢計劃,並給語句中的每個組件發送下一個序列化的全局性計劃。
一般此服務和數據節點部署在一塊兒。
正式安裝
Note: 其實在生產環境,若是你集羣的數量少於20臺的話,甚至能夠不須要使用gtm-proxy
#1)System Initialization Optimization on every nodes cat >> /etc/security/limits.conf << EOF * hard memlock unlimited * soft memlock unlimited * - nofile 65535 EOF setenforce 0 sed -i 's/^SELINUX=.*$/SELINUX=disabled/' /etc/selinux/config systemctl stop firewalld.service systemctl disable firewalld.service cat >/etc/hosts <<EOF 172.31.1.81 neo4j01 172.31.4.146 neo4j02 172.31.3.178 neo4j03 172.31.8.178 neo4j04 EOF #2)create postgres user on every nodes useradd postgres echo Ad@sd119|passwd --stdin postgres echo 'postgres ALL=(ALL) NOPASSWD: ALL' >>/etc/sudoers ###################################################################### #3)Configure ssh authentication to avoid inputing password for pgxc_ctl(run this commad on every nodes) ###################################################################### su - postgres ssh-keygen -t rsa cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys chmod 600 authorized_keys cat ~/.ssh/id_rsa.pub | ssh neo4j01 'cat >> ~/.ssh/authorized_keys' ############################################ #4)Install dependency packages on every nodes# ############################################ sudo yum install -y flex bison readline-devel zlib-devel openjade docbook-style-dsssl gcc bzip2 e2fsprogs-devel uuid-devel libuuid-devel make wget wget -c http://download.cashalo.com/schema/postgres-xl-9.5r1.6.tar.bz2 && tar jxf postgres-xl-9.5r1.6.tar.bz2 cd postgres-xl-9.5r1.6 ./configure --prefix=/home/postgres/pgxl9.5 --with-uuid=ossp --with-uuid=ossp && make && make install && cd contrib/ && make && make install #5)Configuring environment variables on every nodes cat >>/home/postgres/.bashrc <<EOF export PGHOME=/home/postgres/pgxl9.5 export LD_LIBRARY_PATH=\$PGHOME/lib:\$LD_LIBRARY_PATH export PATH=\$PGHOME/bin:\$PATH EOF source /home/postgres/.bashrc #6)create data dirsctory on every nodes mkdir -p /home/postgres/data/ #7) just run command pgxc_ctl on neo4j01 pgxc_ctl # then input command prepare prepare # finally run q to exit q #8) Edit file pgxc_ctl.conf (just do int on neo4j01) vim /home/postgres/pgxc_ctl/pgxc_ctl.conf ########################################### #!/usr/bin/env bash pgxcInstallDir=/home/postgres/pgxl9.5 pgxlDATA=/home/postgres/data #---- OVERALL ----------------------------------------------------------------------------- # pgxcOwner=postgres # owner of the Postgres-XC databaseo cluster. Here, we use this # both as linus user and database user. This must be # the super user of each coordinator and datanode. pgxcUser=postgres # OS user of Postgres-XC owner tmpDir=/tmp # temporary dir used in XC servers localTmpDir=$tmpDir # temporary dir used here locally configBackup=n # If you want config file backup, specify y to this value. #configBackupHost=pgxc-linker # host to backup config file #configBackupDir=$HOME/pgxc # Backup directory #configBackupFile=pgxc_ctl.bak # Backup file name --> Need to synchronize when original changed. #---- GTM ------------------------------------------------------------------------------------ # GTM is mandatory. You must have at least (and only) one GTM master in your Postgres-XC cluster. # If GTM crashes and you need to reconfigure it, you can do it by pgxc_update_gtm command to update # GTM master with others. Of course, we provide pgxc_remove_gtm command to remove it. This command # will not stop the current GTM. It is up to the operator. #---- GTM Master ----------------------------------------------- #---- Overall ---- gtmName=gtm1 gtmMasterServer=neo4j01 gtmMasterPort=6666 gtmMasterDir=$pgxlDATA/gtm1 #---- Configuration --- gtmExtraConfig=none # Will be added gtm.conf for both Master and Slave (done at initilization only) gtmMasterSpecificExtraConfig=none # Will be added to Master's gtm.conf (done at initialization only) #---- GTM Slave ----------------------------------------------- # Because GTM is a key component to maintain database consistency, you may want to configure GTM slave # for backup. #---- Overall ------ gtmSlave=y # Specify y if you configure GTM Slave. Otherwise, GTM slave will not be configured and # all the following variables will be reset. gtmSlaveName=gtm2 gtmSlaveServer=neo4j02 # value none means GTM slave is not available. Give none if you don't configure GTM Slave. gtmSlavePort=6666 # Not used if you don't configure GTM slave. gtmSlaveDir=$pgxlDATA/gtm2 # Not used if you don't configure GTM slave. # Please note that when you have GTM failover, then there will be no slave available until you configure the slave # again. (pgxc_add_gtm_slave function will handle it) #---- Configuration ---- gtmSlaveSpecificExtraConfig=none # Will be added to Slave's gtm.conf (done at initialization only) #---- GTM Proxy ------------------------------------------------------------------------------------------------------- # GTM proxy will be selected based upon which server each component runs on. # When fails over to the slave, the slave inherits its master's gtm proxy. It should be # reconfigured based upon the new location. # # To do so, slave should be restarted. So pg_ctl promote -> (edit postgresql.conf and recovery.conf) -> pg_ctl restart # # You don't have to configure GTM Proxy if you dont' configure GTM slave or you are happy if every component connects # to GTM Master directly. If you configure GTL slave, you must configure GTM proxy too. #---- Shortcuts ------ gtmProxyDir=$pgxlDATA/gtm_proxy #---- Overall ------- gtmProxy=n # Specify y if you conifugre at least one GTM proxy. You may not configure gtm proxies # only when you dont' configure GTM slaves. # If you specify this value not to y, the following parameters will be set to default empty values. # If we find there're no valid Proxy server names (means, every servers are specified # as none), then gtmProxy value will be set to "n" and all the entries will be set to # empty values. #gtmProxyNames=(gtm_pxy1 gtm_pxy2 gtm_pxy3 gtm_pxy4) # No used if it is not configured #gtmProxyServers=(neo4j01 neo4j02 neo4j03 neo4j04) # Specify none if you dont' configure it. #gtmProxyPorts=(6660 6666 6666 6666) # Not used if it is not configured. #gtmProxyDirs=($gtmProxyDir $gtmProxyDir $gtmProxyDir $gtmProxyDir) # Not used if it is not configured. #---- Configuration ---- gtmPxyExtraConfig=none # Extra configuration parameter for gtm_proxy. Coordinator section has an example. gtmPxySpecificExtraConfig=(none none none none) #---- Coordinators ---------------------------------------------------------------------------------------------------- #---- shortcuts ---------- coordMasterDir=$pgxlDATA/coord coordSlaveDir=$pgxlDATA/coord_slave coordArchLogDir=$pgxlDATA/coord_archlog #---- Overall ------------ coordNames=(coord1 coord2 coord3 coord4) # Master and slave use the same name coordPorts=(5432 5432 5432 5432) # Master ports poolerPorts=(6667 6667 6667 6667) # Master pooler ports #coordPgHbaEntries=(192.168.29.0/24) # Assumes that all the coordinator (master/slave) accepts coordPgHbaEntries=(0.0.0.0/0) # the same connection # This entry allows only $pgxcOwner to connect. # If you'd like to setup another connection, you should # supply these entries through files specified below. # Note: The above parameter is extracted as "host all all 0.0.0.0/0 trust". If you don't want # such setups, specify the value () to this variable and suplly what you want using coordExtraPgHba # and/or coordSpecificExtraPgHba variables. #coordPgHbaEntries=(::1/128) # Same as above but for IPv6 addresses #---- Master ------------- coordMasterServers=(neo4j01 neo4j02 neo4j03 neo4j04) # none means this master is not available coordMasterDirs=($coordMasterDir $coordMasterDir $coordMasterDir $coordMasterDir) coordMaxWALsernder=0 # max_wal_senders: needed to configure slave. If zero value is specified, # it is expected to supply this parameter explicitly by external files # specified in the following. If you don't configure slaves, leave this value to zero. coordMaxWALSenders=($coordMaxWALsernder $coordMaxWALsernder $coordMaxWALsernder $coordMaxWALsernder) # max_wal_senders configuration for each coordinator. #---- Slave ------------- coordSlave=n # Specify y if you configure at least one coordiantor slave. Otherwise, the following # configuration parameters will be set to empty values. # If no effective server names are found (that is, every servers are specified as none), # then coordSlave value will be set to n and all the following values will be set to # empty values. #coordSlaveSync=y # Specify to connect with synchronized mode. #coordSlaveServers=(node07 node08 node09 node06) # none means this slave is not available #coordSlavePorts=(20004 20005 20004 20005) # Master ports #coordSlavePoolerPorts=(20010 20011 20010 20011) # Master pooler ports #coordSlaveDirs=($coordSlaveDir $coordSlaveDir $coordSlaveDir $coordSlaveDir) #coordArchLogDirs=($coordArchLogDir $coordArchLogDir $coordArchLogDir $coordArchLogDir) #---- Configuration files--- # Need these when you'd like setup specific non-default configuration # These files will go to corresponding files for the master. # You may supply your bash script to setup extra config lines and extra pg_hba.conf entries # Or you may supply these files manually. coordExtraConfig=coordExtraConfig # Extra configuration file for coordinators. # This file will be added to all the coordinators' # postgresql.conf # Pleae note that the following sets up minimum parameters which you may want to change. # You can put your postgresql.conf lines here. cat > $coordExtraConfig <<EOF #================================================ # Added to all the coordinator postgresql.conf # Original: $coordExtraConfig log_destination = 'stderr' logging_collector = on log_directory = 'pg_log' listen_addresses = '*' max_connections = 50 EOF # Additional Configuration file for specific coordinator master. # You can define each setting by similar means as above. coordSpecificExtraConfig=(none none none none) coordExtraPgHba=none # Extra entry for pg_hba.conf. This file will be added to all the coordinators' pg_hba.conf coordSpecificExtraPgHba=(none none none none) #----- Additional Slaves ----- # # Please note that this section is just a suggestion how we extend the configuration for # multiple and cascaded replication. They're not used in the current version. # #coordAdditionalSlaves=n # Additional slave can be specified as follows: where you #coordAdditionalSlaveSet=(cad1) # Each specifies set of slaves. This case, two set of slaves are # # configured #cad1_Sync=n # All the slaves at "cad1" are connected with asynchronous mode. # # If not, specify "y" # # The following lines specifies detailed configuration for each # # slave tag, cad1. You can define cad2 similarly. #cad1_Servers=(node08 node09 node06 node07) # Hosts #cad1_dir=$HOME/pgxc/nodes/coord_slave_cad1 #cad1_Dirs=($cad1_dir $cad1_dir $cad1_dir $cad1_dir) #cad1_ArchLogDir=$HOME/pgxc/nodes/coord_archlog_cad1 #cad1_ArchLogDirs=($cad1_ArchLogDir $cad1_ArchLogDir $cad1_ArchLogDir $cad1_ArchLogDir) #---- Datanodes ------------------------------------------------------------------------------------------------------- #---- Shortcuts -------------- datanodeMasterDir=$pgxlDATA/dn_master datanodeSlaveDir=$pgxlDATA/dn_slave datanodeArchLogDir=$pgxlDATA/datanode_archlog #---- Overall --------------- #primaryDatanode=datanode1 # Primary Node. # At present, xc has a priblem to issue ALTER NODE against the primay node. Until it is fixed, the test will be done # without this feature. primaryDatanode=dn1 # Primary Node. datanodeNames=(dn1 dn2 dn3 dn4) datanodePorts=(5433 5433 5433 5433) # Master ports datanodePoolerPorts=(6668 6668 6668 6668) # Master pooler ports datanodePgHbaEntries=(0.0.0.0/0) # Assumes that all the coordinator (master/slave) accepts # the same connection # This list sets up pg_hba.conf for $pgxcOwner user. # If you'd like to setup other entries, supply them # through extra configuration files specified below. # Note: The above parameter is extracted as "host all all 0.0.0.0/0 trust". If you don't want # such setups, specify the value () to this variable and suplly what you want using datanodeExtraPgHba # and/or datanodeSpecificExtraPgHba variables. #datanodePgHbaEntries=(::1/128) # Same as above but for IPv6 addresses #---- Master ---------------- datanodeMasterServers=(neo4j01 neo4j02 neo4j03 neo4j04) # none means this master is not available. # This means that there should be the master but is down. # The cluster is not operational until the master is # recovered and ready to run. datanodeMasterDirs=($datanodeMasterDir $datanodeMasterDir $datanodeMasterDir $datanodeMasterDir) datanodeMaxWalSender=0 # max_wal_senders: needed to configure slave. If zero value is # specified, it is expected this parameter is explicitly supplied # by external configuration files. # If you don't configure slaves, leave this value zero. datanodeMaxWALSenders=($datanodeMaxWalSender $datanodeMaxWalSender $datanodeMaxWalSender $datanodeMaxWalSender) # max_wal_senders configuration for each datanode #---- Slave ----------------- datanodeSlave=n # Specify y if you configure at least one coordiantor slave. Otherwise, the following # configuration parameters will be set to empty values. # If no effective server names are found (that is, every servers are specified as none), # then datanodeSlave value will be set to n and all the following values will be set to # empty values. #datanodeSlaveServers=(node07 node08 node09 node06) # value none means this slave is not available #datanodeSlavePorts=(20008 20009 20008 20009) # value none means this slave is not available #datanodeSlavePoolerPorts=(20012 20013 20012 20013) # value none means this slave is not available #datanodeSlaveSync=y # If datanode slave is connected in synchronized mode #datanodeSlaveDirs=($datanodeSlaveDir $datanodeSlaveDir $datanodeSlaveDir $datanodeSlaveDir) #datanodeArchLogDirs=( $datanodeArchLogDir $datanodeArchLogDir $datanodeArchLogDir $datanodeArchLogDir ) # ---- Configuration files --- # You may supply your bash script to setup extra config lines and extra pg_hba.conf entries here. # These files will go to corresponding files for the master. # Or you may supply these files manually. datanodeExtraConfig=none # Extra configuration file for datanodes. This file will be added to all the # datanodes' postgresql.conf datanodeSpecificExtraConfig=(none none none none) datanodeExtraPgHba=none # Extra entry for pg_hba.conf. This file will be added to all the datanodes' postgresql.conf datanodeSpecificExtraPgHba=(none none none none) #----- Additional Slaves ----- datanodeAdditionalSlaves=n # Additional slave can be specified as follows: where you # datanodeAdditionalSlaveSet=(dad1 dad2) # Each specifies set of slaves. This case, two set of slaves are # configured # dad1_Sync=n # All the slaves at "cad1" are connected with asynchronous mode. # If not, specify "y" # The following lines specifies detailed configuration for each # slave tag, cad1. You can define cad2 similarly. # dad1_Servers=(node08 node09 node06 node07) # Hosts # dad1_dir=$HOME/pgxc/nodes/coord_slave_cad1 # dad1_Dirs=($cad1_dir $cad1_dir $cad1_dir $cad1_dir) # dad1_ArchLogDir=$HOME/pgxc/nodes/coord_archlog_cad1 # dad1_ArchLogDirs=($cad1_ArchLogDir $cad1_ArchLogDir $cad1_ArchLogDir $cad1_ArchLogDir) #---- WAL archives ------------------------------------------------------------------------------------------------- walArchive=n # If you'd like to configure WAL archive, edit this section. ####################################################### #8) when u first time to setup cluster(run it just on neo4j01) pgxc_ctl -c /home/postgres/pgxc_ctl/pgxc_ctl.conf init all #9) start cluster(run it just on neo4j01) pgxc_ctl -c /home/postgres/pgxc_ctl/pgxc_ctl.conf start all #10) stop cluster(run it just on neo4j01) pgxc_ctl -c /home/postgres/pgxc_ctl/pgxc_ctl.conf stop all #11) check every one is running(run it just on neo4j01) pgxc_ctl monitor all #12) view System Table (run it just on neo4j01) psql -p5432 select * from pgxc_node;
postgres-XL 下存在兩種數據表,分別是replication表和distribute表
REPLICATION複製表:各個datanode節點中,表的數據徹底相同,也就是說,插入數據時,會分別在每一個datanode節點插入相同數據。讀數據時,只須要讀任意一個datanode節點上的數據。小表採用。
建表語法:
postgres=# create table rep(col1 int,col2 int)distribute by replication;
DISTRIBUTE表 :會將插入的數據,按照拆分規則,分配到不一樣的datanode節點中存儲,也就是sharding技術。每一個datanode節點只保存了部分數據,經過coordinate節點能夠查詢完整的數據視圖。分佈式存儲,大表採用,默認
postgres=# CREATE TABLE dist(col1 int, col2 int) DISTRIBUTE BY HASH(col1);
如何驗證分佈式存儲?
分別插入100行數據:
postgres=# INSERT INTO rep SELECT generate_series(1,100), generate_series(101, 200); postgres=# INSERT INTO dist SELECT generate_series(1,100), generate_series(101, 200);
psql -p 5432,經過Coordinater 訪問查詢完整的數據視圖;
psql -p 5433,5433是Datanode的端口,此時只訪問該單個節點
如何連接到指定的數據庫呢?看下面的例子
psql -p 5432 aa ,aa爲指定的庫名,不指定時默認是postgres庫,至關於hive裏的default庫
查詢這個分佈表數據在每一個節點的分佈:
postgres=# SELECT xc_node_id, count(*) FROM dist GROUP BY xc_node_id; xc_node_id | count ------------+------- -700122826 | 19 352366662 | 27 -560021589 | 23 823103418 | 31 (4 rows)
查詢每一個節點的ID信息
postgres=# select * from pgxc_node;
咱們再來看看複製表
postgres=# select xc_node_id,count(*) from rep group by xc_node_id; xc_node_id | count ------------+------- -560021589 | 100 (1 row)
由於咱們是在neo4j01節點上查詢的,因此顯示的id就是neo4j01節點的id;同理,若是咱們是在其它節點查詢的話那就顯示其它節點的id。也就是說當咱們查詢複製表的時候,它只會走一個節點,不會走多個節點。針對這個特色,若是將來數據量很大,咱們查詢的時候,能夠走負載均衡。
https://www.cnblogs.com/sfnz/p/7908380.html
Psql是PostgreSQL的一個命令行交互式客戶端工具。PostgreSQL 一些命令、用法、語法,在Postgres xl集羣都是通用的。