Migrating Oracle to Greenplum (GP) in Practice

I have recently been working on a migration from Oracle to Greenplum. The steps are as follows:
1. Migrating the Oracle data structures to GP with ora2pg
2. Migrating the Oracle data to GP
 
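1. Using ora2pg
 The relationship between the components is as follows:
 [Figure: ora2pg component diagram]
ora2pg requires the DBD-Oracle, DBD-Pg and DBI modules. Once the conf file is set up, it can convert Oracle data structures (table, view, package and so on) into PG data structures. It can also be configured to import the data of the Oracle database directly into PG.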
Environment:
OS RHEL6.5 64bit
Oracle client 10.2.0.5.0
GP 4.2.6.0
The module parameters are labeled in detail in the figure. The modules are installed with the standard Perl procedure:
perl Makefile.PL
make
make test
make install
 
The configuration file looks like this:
 1 ORACLE_HOME    /home/oracle/client_1
 2 ORACLE_DSN    dbi:Oracle:host=192.168.11.1;sid=orcl
 3 ORACLE_USER    manager
 4 ORACLE_PWD    tiger
 5 SCHEMA        test
 6 TYPE        TABLE VIEW PACKAGE COPY
 7 PG_NUMERIC_TYPE    0
 8 PG_INTEGER_TYPE    1
 9 DEFAULT_NUMERIC float
10 SKIP    fkeys pkeys ukeys indexes checks
11 NLS_LANG    AMERICAN_AMERICA.UTF8
12 PG_DSN        dbi:Pg:dbname=easyetl;host=127.0.0.1;port=5432
13 PG_USER    easyetl
14 PG_PWD    password
15 OUTPUT        output.sql

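Lines 1-4 configure the source Oracle connection.

Line 5 is the Oracle schema to export.

Line 6 lists the object types to convert; COPY additionally exports the data as copy commands.

Lines 7-9 control how Oracle number(p,s) is converted to PG types:

Line 7 controls whether PG internal data types are used: 0 means no, 1 means yes.

Line 8 applies when line 7 is 0: if line 8 is 1, number(p) becomes integer; if line 8 is 0, number(p) becomes numeric(p).

Line 9 applies when line 8 is 1: a bare number is converted to float. If line 8 is 0, line 9 has no effect.

In short, if lines 7 and 8 are both 0, then number(p) --> numeric(p), number(p,s) --> numeric(p,s), number --> numeric.

Line 10 decides which constraints and indexes are skipped rather than created.

Line 11 sets the client language and character set.

Lines 12-14 configure the target PG (GP works as well). Leaving these three lines out is fine: ora2pg then just generates the Oracle-to-PG conversion script.

Line 15 is the output file.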
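The number conversion rules for lines 7-9 can be written down as a small Python function. This is only a sketch of the rules as described above for PG_NUMERIC_TYPE = 0, not ora2pg's own code. The function name `map_oracle_number` is hypothetical, and keeping number(p,s) as numeric(p,s) when line 8 is 1 is an assumption. The real tool may also pick a different integer width depending on the precision.

```python
def map_oracle_number(precision=None, scale=None,
                      pg_integer_type=1, default_numeric="float"):
    """Map an Oracle NUMBER declaration to a PG type name.

    Models only the PG_NUMERIC_TYPE = 0 case described in the text.
    """
    if precision is None:
        # Bare NUMBER: DEFAULT_NUMERIC only matters when PG_INTEGER_TYPE is 1.
        return default_numeric if pg_integer_type == 1 else "numeric"
    if scale:
        # NUMBER(p,s) with a non-zero scale stays numeric(p,s) (assumption
        # for the PG_INTEGER_TYPE = 1 case).
        return "numeric(%d,%d)" % (precision, scale)
    # NUMBER(p): integer when PG_INTEGER_TYPE is 1, numeric(p) otherwise.
    return "integer" if pg_integer_type == 1 else "numeric(%d)" % precision


if __name__ == "__main__":
    for args in [(10, None, 0), (10, 2, 0), (None, None, 0), (10, None, 1)]:
        print(args, "->", map_oracle_number(*args))
```

With the sample configuration (line 8 set to 1, line 9 set to float), number(10) maps to integer and a bare number maps to float.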

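Issues encountered during the migration:

(1) Tables migrate completely.

(2) Views that lack aliases need them added by hand.

(3) Packages need manual changes. Note: with version 13, package generation requires disabling the perform/decode handling, because these two conversions are not done well. The relevant module is PLSQL.pm.

Of course, package conversion involves more than this. The main points are:

a. Aliases must be written out explicitly.

b. Implicit conversions must be written out explicitly.

c. Function differences (there is an official GP implementation of the Oracle functions, which is mostly sufficient).

d. Non-standard Oracle syntax, for example a left join b written as a,b where a.xx=b.xx(+).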
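Since points such as the (+) outer join and decode have to be fixed by hand, it can help to list where they occur before editing. The following is a rough sketch rather than a SQL parser: `find_manual_fixes` and the pattern labels are hypothetical names, and a plain regular expression will also match inside comments and string literals.

```python
import re

# Constructs that the text says need manual attention after conversion.
PATTERNS = {
    "outer join (+)": re.compile(r"\(\s*\+\s*\)"),
    "decode()": re.compile(r"\bdecode\s*\(", re.IGNORECASE),
}


def find_manual_fixes(sql_text):
    """Return (line_number, label) pairs for lines that need a manual rewrite."""
    hits = []
    for lineno, line in enumerate(sql_text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits


if __name__ == "__main__":
    sample = (
        "select a.id, decode(a.flag, 1, 'Y', 'N')\n"
        "from a, b\n"
        "where a.xx = b.xx(+)"
    )
    for lineno, label in find_manual_fixes(sample):
        print("line %d: %s" % (lineno, label))
```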

 

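2. Migrating the Oracle data to GP

 sqluldr2 unloads the data from Oracle into a named pipe, and gpload then loads it into GP.
 
dataload.sh
There are two key points:
(1) sqluldr2 produces the data and writes it into the pipe. gpload reads its control file, takes the data from the pipe, starts gpfdist by itself, creates an external table, and loads the data into GP.
(2) When the data volume is small, the sqluldr2 process may finish before gpload has fully started. gpload then keeps waiting for data to arrive on the pipe and hangs. To work around this, dbms_lock.sleep(2) is added to the sqluldr2 presql, which ensures that gpload is already running before sqluldr2 finishes. Alternatively, a small C program could check whether the pipe is blocked.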
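#!/bin/bash
if [ $# -lt 3 ]; then
    # double quotes so that `basename $0` is actually expanded
    echo "Usage: `basename $0` pipe tablename control [condition]"
    exit 1
fi

pipename=$1
tablename=$2
control=$3
# fall back to an always-true filter when no condition is given
condition=${4:-1=1}

mknod "$pipename" p
# unload in the background so that gpload can read the pipe at the same time
/root/software/sqluldr2 user=manager/tigerd@orcl query="select * from $tablename where $condition" field=0x7c file="$pipename" charset=utf8 text=CSV safe=yes presql="begin dbms_lock.sleep(2); end;" &
gpload -f "$control" -l gpload.log
rm -rf "$pipename"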
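The handoff through a named pipe can be tried out without Oracle or GP. In the sketch below a thread stands in for sqluldr2 and the main thread stands in for gpload; `producer` and `transfer` are hypothetical names, and the code assumes a POSIX system where `os.mkfifo` is available. It only illustrates the pipe mechanics, not the behaviour of either tool.

```python
import os
import tempfile
import threading


def producer(path, rows):
    # Opening a FIFO for writing blocks until a reader has opened it.
    with open(path, "w") as pipe:
        for row in rows:
            pipe.write(row + "\n")


def transfer(rows):
    """Send rows through a named pipe and return what the reader received."""
    tmpdir = tempfile.mkdtemp()
    fifo = os.path.join(tmpdir, "demo_pipe")
    os.mkfifo(fifo)
    writer = threading.Thread(target=producer, args=(fifo, rows))
    writer.start()
    try:
        # The reader sees end-of-file once the writer closes its end.
        with open(fifo) as pipe:
            received = [line.rstrip("\n") for line in pipe]
    finally:
        writer.join()
        os.remove(fifo)
        os.rmdir(tmpdir)
    return received


if __name__ == "__main__":
    print(transfer(["1,alice", "2,bob"]))
```

No data is ever stored on disk: the rows only exist while both ends have the pipe open, which is why the relative start-up timing of the two processes matters.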

ora2gp.sh: generates the control file, including the pipe file name, and then calls the script above to perform the load.
#!/usr/bin/env python
#-*- coding:utf-8 -*-
import yaml
import subprocess
import sys
import os

# Script starts from here
paramnum = len(sys.argv)
datadt = 20140820
condition = "1=1"
tplpath = "/root/template/"
pipepath = "/tmp/pipe"
batname = "/root/script/dataload.sh"

if paramnum == 1:
    print('Usage:' + sys.argv[0] + ' tablename ')
    sys.exit(1)
elif paramnum == 2:
    tablename = sys.argv[1]
elif paramnum == 3:
    tablename = sys.argv[1]
    datadt = sys.argv[2]
elif paramnum == 4:
    tablename = sys.argv[1]
    datadt = sys.argv[2]
    condition = sys.argv[3]
else:
    print('Usage:' + sys.argv[0] + ' tablename datadt condition. (datadt condition is optional)!')
    sys.exit(1)

# use the process id to get a pipe name that is unique per run
pid = os.getpid()
pipename = pipepath + str(pid)

# load the gpload control file template
with open(tplpath + "gp_template_load.ctl") as f:
    dataMap = yaml.safe_load(f)

# fill in the pipe, the target table and the error table
dataMap['GPLOAD']['INPUT'][0]['SOURCE']['FILE'][0] = pipename
dataMap['GPLOAD']['OUTPUT'][0]['TABLE'] = tablename
dataMap['GPLOAD']['INPUT'][6]['ERROR_TABLE'] = tablename + '_err'

# write the per-table control file
filename = tplpath + tablename + '.ctl'
with open(filename, 'w') as f:
    yaml.dump(dataMap, f)

# run dataload.sh and wait for it to finish
handle = subprocess.Popen([batname, pipename, tablename, filename, condition])
handle.communicate()
Control file template
VERSION: 1.0.0.1
DATABASE: dw
USER: manager
HOST: gp
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gp
         FILE:
           - /tmp/mypipe
         PORT_RANGE: [8001,9000]
    - FORMAT: csv
    - DELIMITER: ','
    - QUOTE: '"'
    - HEADER: true
    - ERROR_LIMIT: 10000
    - ERROR_TABLE: tablename_err
   OUTPUT:
    - TABLE: tablename
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true

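Follow-up:

The program above can be used for synchronization, but running it in real production is not very reassuring.

There are three reasons:

(1) sqluldr2 in dataload.sh runs in the background. If sqluldr2 fails, gpload may keep waiting; if gpload fails, sqluldr2 still keeps unloading data. In addition, dataload.sh forks two processes, so when one of them fails it has to be found and killed by hand.

(2) There is no routine logging or handling.

(3) The Oracle and GP table structures must match exactly.

For these reasons, I wrote a version that manages the forked processes in one place, fetches the GP column list, and adds log handling.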
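Fetching the GP column list means the unload query can name the columns of the target table instead of relying on select *. A minimal sketch of that step follows. It assumes the column description arrives as unaligned, tuples-only text (as produced by `psql -tA`), one column per line with the column name in the first `|`-separated field; `build_column_list` and the sample columns are hypothetical.

```python
def build_column_list(psql_output):
    """Join the first '|'-separated field of each non-empty line with commas."""
    columns = []
    for line in psql_output.splitlines():
        line = line.strip()
        if line:
            columns.append(line.split("|")[0])
    return ",".join(columns)


if __name__ == "__main__":
    sample = "id|integer|not null\nname|character varying(20)|\n"
    collist = build_column_list(sample)
    print("select %s from some_table" % collist)
```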

oraconf file format:

#CONFNAME:USER^PASS^TNSNAME

gpconf file format:
#host:port:database:user:passwd

For the control file, see the template above and the official documentation.
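The two line formats above are simple enough to parse in a few lines. The following sketch shows one way to read them, skipping blank lines and `#` comments; the function names and the sample values are hypothetical, and a password containing `:` or `^` would need a different format.

```python
def parse_oraconf_line(line):
    """Parse 'CONFNAME:USER^PASS^TNSNAME' into a dict."""
    confname, rest = line.strip().split(":", 1)
    user, password, tnsname = rest.split("^")
    return {"confname": confname, "user": user,
            "password": password, "tnsname": tnsname}


def parse_gpconf_line(line):
    """Parse 'host:port:database:user:passwd' into a dict."""
    host, port, database, user, password = line.strip().split(":")
    return {"host": host, "port": int(port), "database": database,
            "user": user, "password": password}


def load_entries(lines, parser):
    """Apply parser to every line that is neither blank nor a '#' comment."""
    return [parser(line) for line in lines
            if line.strip() and not line.lstrip().startswith("#")]


if __name__ == "__main__":
    ora_lines = ["#CONFNAME:USER^PASS^TNSNAME", "src1:manager^secret^orcl"]
    gp_lines = ["#host:port:database:user:passwd", "gp:5432:dw:manager:secret"]
    print(load_entries(ora_lines, parse_oraconf_line))
    print(load_entries(gp_lines, parse_gpconf_line))
```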

#!/bin/bash
# bash is required: the script uses declare -a, function and arrays
. greenplum_loaders_path.sh
. setenv

if [ $# -lt 4 ];then
    echo "Usage : `basename $0` confname etl_date mode src_tbname tgt_tbname "
    echo "       confname  : configuration at ${PWD}/conf/oraconf "
    echo "       etl_date : YYYYMMDD "
    echo "       mode : 1 truncate; 2 append"
    echo "       src_tbname : oracle datasource tablename "
    echo "       tgt_tbname(optional) : greenplum datasource tablename"
    exit 1
fi

#trap the exception quit
trap 'log_info "TERM/INTERRUPT(subprocess) close";close_subproc' INT TERM

declare -a sublist

function log_info()
{
    DATETIME=`date +"%Y%m%d %H:%M:%S"`
    echo -e "S $DATETIME P[$$]: $*"| tee -a "$LOGPATH"/"$ETLDATE"/"$GPTABLE".log
}

function collect_subproc()
{
    # append the pid to the array; ${#sublist} is the length of the
    # first element, not the number of elements, so it cannot be used as an index
    sublist+=("$1")
}

function close_subproc()
{
    for subid in ${sublist[@]}
    do
        log_info "kill processid: $subid"
        kill $subid
    done
}

function parse_yaml()
{
    local file=$1
    local tablename=$2
    local pipename=$3
    local etldate=$4
    sed -i -e "s/mypipe/"$pipename"/" -e "s/tablename_err/public."$tablename"_err/" -e "s/\<tablename\>/"$tablename"/" -e "s/etl_date/"$etldate"/" $file
}


if [ $(dirname $0) == '.' ];then
    PRIPATH=${PWD}
else
    PRIPATH=$(dirname $0)
fi

TPLPATH="$PRIPATH"/template
LOGPATH="$PRIPATH"/log
CONFNAME=$1
ETLDATE=$2
MODE=$3
ORATABLE=$4
GPTABLE=$5

[ -z "$GPTABLE" ] && GPTABLE="$ORATABLE"
[ ! -d "$LOGPATH"/"$ETLDATE" ] && mkdir -p "$LOGPATH"/"$ETLDATE"
PIPENAME="P"$$"$GPTABLE"
eval `grep "^$CONFNAME" "$PRIPATH"/conf/oraconf |awk -F':' '{print $2}'|awk -F'^' '{print "ORACLE_USER="$1";ORACLE_PASS="$2";ORACLE_SID="$3}'`
eval $(eval `grep ^[^#] "$PRIPATH"/conf/gpconf |awk -F':' -v table=$GPTABLE '{printf("psql -h %s -p %d -U %s %s -tAc \047\\\d %s \047",$1,$2,$4,$3,table)}'`|awk -F"|" '{cmd=cmd$1","}END{print "collist="cmd}')
collist=`echo $collist|sed "s/,$//g"`


echo >> "$LOGPATH"/"$ETLDATE"/"$GPTABLE".log
#create and modify template for gpload use
log_info "create template "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl."
cp "$TPLPATH"/gp_template_load_"$MODE".ctl "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl
if [ $? -ne 0 ]; then
    log_info "create template "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl failed."
    exit 2
fi

parse_yaml "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl $GPTABLE $PIPENAME $ETLDATE
if [ $? -ne 0 ]; then
    log_info "modify template "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl failed."
    exit 2
fi

#create pipename
log_info "create pipe /tmp/"$PIPENAME"."
mknod /tmp/"$PIPENAME" p
if [ $? -ne 0 ];then
    log_info "create pipe failed!"
    exit 3
fi

gpload -f "$LOGPATH"/"$ETLDATE"/"$GPTABLE".ctl -l "$LOGPATH"/"$ETLDATE"/"$GPTABLE".log &
collect_subproc $!
log_info "unload sql:select $collist from $ORATABLE"
sqluldr2 user="$ORACLE_USER"/"$ORACLE_PASS"@"$ORACLE_SID" query="select $collist from $ORATABLE" head=Yes field=0x7c file=/tmp/"$PIPENAME" charset=gb18030 text=CSV safe=yes presql="begin dbms_lock.sleep(5); end;" log=+"$LOGPATH"/"$ETLDATE"/"$GPTABLE".log &
collect_subproc $!


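# a bare wait always returns 0, so wait on each pid to catch a failed child
rc=0
for subid in "${sublist[@]}"
do
    wait "$subid" || rc=1
done
if [ $rc -ne 0 ];then
    log_info "$GPTABLE load failed!"
else
    log_info "$GPTABLE load succ!"
fi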
log_info "rm -rf /tmp/"$PIPENAME""
rm -rf /tmp/"$PIPENAME"
if [ $? -ne 0 ];then
    log_info "rm /tmp/"$PIPENAME" failed."
    exit 4
fi
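The same supervision idea (start the loader and the unloader, and stop the survivor as soon as one of them fails) can also be expressed in Python. This is a sketch under assumptions, not a replacement for the script: `run_pair` is a hypothetical helper, the polling interval is arbitrary, and in the demo two short Python child processes stand in for gpload and sqluldr2.

```python
import subprocess
import sys
import time


def run_pair(loader_cmd, unloader_cmd, poll_interval=0.1):
    """Run two commands side by side; if one fails, terminate the other.

    Returns a dict mapping 'loader' and 'unloader' to their exit codes.
    """
    procs = {
        "loader": subprocess.Popen(loader_cmd),
        "unloader": subprocess.Popen(unloader_cmd),
    }
    codes = {}
    while len(codes) < len(procs):
        for name, proc in procs.items():
            if name in codes:
                continue
            rc = proc.poll()
            if rc is None:
                continue  # still running
            codes[name] = rc
            if rc != 0:
                # One side failed: stop whatever is still running.
                for other, other_proc in procs.items():
                    if other not in codes:
                        other_proc.terminate()
                        codes[other] = other_proc.wait()
        time.sleep(poll_interval)
    return codes


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_pair(slow, failing))
```

In the demo the failing child exits with code 3 and the long-running one is terminated instead of being left waiting, which is the situation described in reason (1).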