R語言做爲BI中ETL的工具

時間 2019-11-09

標籤語言做爲 etl 工具欄目大數據简体版

原文原文鏈接

R語言做爲BI中ETL的工具，增刪改web

R語言提供了強大的R_package與各類數據庫進行數據交互。
外加其強大數據變換清洗函數，爲ＥＴＬ提供一條方便快捷的道路。
RODBC
ROracal
RMysql
Rmongodb http://mirrors.ustc.edu.cn/CRAN/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdfsql

step1 新建鏈接con，並查看其信息

library(RODBC) 
con<-odbcConnect("LI")
con

RODBC Connection 1
Details:
  case=nochange
  DSN=LI
  UID=
  Trusted_Connection=Yes
  APP=RStudio
  WSID=LIYI-PC

step2 引入數據集USArrests

data(USArrests)
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

step3 將USArrests保存到sqlserver

sqlSave(con, USArrests)

sqlSave(channel, dat, tablename = NULL, append = FALSE,
        rownames = TRUE, colnames = FALSE, verbose = FALSE,
        safer = TRUE, addPK = FALSE, typeInfo, varTypes,
        fast = TRUE, test = FALSE, nastring = NULL)
channel      數據庫鏈接通道
dat              data frame.要存入的數據集
tablename  character 數據庫中表名
index          character 索引列的名字
append       logical邏輯變量 ，是否數據集添加/寫入已存在的表
rownames  logical 邏輯變量 or character字符串,若是是logical，表示是否把rowname這個字符串做爲數據庫表首列列名，若是是字符串，則將新字符串做爲表首列列名
colnames  logical 邏輯變量 是否將數據集的列名保留做爲表的首行 (謹慎更改，可能致使，列名變成數據第一行，各列的數據類型所有變爲varchar(255)
verbose   display statements as they are sent to the server?
safer     logical邏輯變量.若是真，生成一個新表，不在已存在的表後添加。若是假，強制刪除已存在的同名表並新建，或者刪除表中已存在的數據，覆蓋寫入
addPK     logical邏輯變量,是否將首列做爲主鍵
typeInfo  list 數據框中數據類型。包括character ,double ,integer
varTypes  an optional named character vector giving the DBMSs datatypes to be used for some (or all) of the columns if a table is to be created.可選項涉及各列數據類型轉換，由於數據庫中的數據類型比R語言中的要多不少。
fast      logical. If false, write data a row at a time. If true, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation. 邏輯變量，若是F，數據將一次一行地寫入，若是爲Ture，將用到變量插入INSERT INTO，或者UPDATE 將數據一次性寫入
nastring     optional character string to be used for writing NAs to the database. 選擇哪一種字符串，將缺失項在數據庫中填充

getSqlTypeInfo("Microsoft SQL Server")
$double
[1] "float"

$integer
[1] "int"

$character
[1] "varchar(255)"

$logical
[1] "varchar(5)"

sqlSave(con, USArrests,rownames = "city", addPK = T) # 原沒有列名的rownames更名爲city,並設置首列爲主鍵key

sqlSave(con, USArrests,'USA2',rownames = "city", addPK = T,fast=T,test=T)  #注意此操做可能在數據庫中create名爲USA2的空表

Binding: 'city' DataType 12, ColSize 255
Binding: 'Murder' DataType 6, ColSize 53
Binding: 'Assault' DataType 4, ColSize 10
Binding: 'UrbanPop' DataType 4, ColSize 10
Binding: 'Rape' DataType 6, ColSize 53
Parameters:
no: 1: city Alabama/***/no: 2: Murder 13.2/***/no: 3: Assault 236/***/no: 4: UrbanPop 58/***/no: 5: Rape 21.2/***/
no: 1: city Alaska/***/no: 2: Murder 10/***/no: 3: Assault 263/***/no: 4: UrbanPop 48/***/no: 5: Rape 44.5/***/
no: 1: city Arizona/***/no: 2: Murder 8.1/***/no: 3: Assault 294/***/no: 4: UrbanPop 80/***/no: 5: Rape 31/***/
no: 1: city Arkansas/***/no: 2: Murder 8.8/***/no: 3: Assault 190/***/no: 4: UrbanPop 50/***/no: 5: Rape 19.5/***/
no: 1: city California/***/no: 2: Murder 9/***/no: 3: Assault 276/***/no: 4: UrbanPop 91/***/no: 5: Rape 40.6/***/
no: 1: city Colorado/***/no: 2: Murder 7.9/***/no: 3: Assault 204/***/no: 4: UrbanPop 78/***/no: 5: Rape 38.7/***/
no: 1: city Connecticut/***/no: 2: Murder 3.3/***/no: 3: Assault 110/***/no: 4: UrbanPop 77/***/no: 5: Rape 11.1/***/
no: 1: city Delaware/***/no: 2: Murder 5.9/***/no: 3: Assault 238/***/no: 4: UrbanPop 72/***/no: 5: Rape 15.8/***/
no: 1: city Florida/***/no: 2: Murder 15.4/***/no: 3: Assault 335/***/no: 4: UrbanPop 80/***/no: 5: Rape 31.9/***/
no: 1: city Georgia/***/no: 2: Murder 17.4/***/no: 3: Assault 211/***/no: 4: UrbanPop 60/***/no: 5: Rape 25.8/***/
no: 1: city Hawaii/***/no: 2: Murder 5.3/***/no: 3: Assault 46/***/no: 4: UrbanPop 83/***/no: 5: Rape 20.2/***/
no: 1: city Idaho/***/no: 2: Murder 2.6/***/no: 3: Assault 120/***/no: 4: UrbanPop 54/***/no: 5: Rape 14.2/***/
no: 1: city Illinois/***/no: 2: Murder 10.4/***/no: 3: Assault 249/***/no: 4: UrbanPop 83/***/no: 5: Rape 24/***/
no: 1: city Indiana/***/no: 2: Murder 7.2/***/no: 3: Assault 113/***/no: 4: UrbanPop 65/***/no: 5: Rape 21/***/
no: 1: city Iowa/***/no: 2: Murder 2.2/***/no: 3: Assault 56/***/no: 4: UrbanPop 57/***/no: 5: Rape 11.3/***/
no: 1: city Kansas/***/no: 2: Murder 6/***/no: 3: Assault 115/***/no: 4: UrbanPop 66/***/no: 5: Rape 18/***/
# 此處省略10000字

sqlColumns    Enquire about the column structure of tables on an ODBC database connection. 訪問數據庫表的結構

columnsenquire<-sqlColumns(con,'USA2')

str(columnsenquire)

str(columnsenquire)
'data.frame':   5 obs. of  29 variables:
 $ TABLE_CAT                           : chr  "master" "master" "master" "master" ...
 $ TABLE_SCHEM                         : chr  "dbo" "dbo" "dbo" "dbo" ...
 $ TABLE_NAME                          : chr  "USA2" "USA2" "USA2" "USA2" ...
 $ COLUMN_NAME                         : chr  "city" "Murder" "Assault" "UrbanPop" ...
 $ DATA_TYPE                           : int  12 6 4 4 6
 $ TYPE_NAME                           : chr  "varchar" "float" "int" "int" ...
 $ COLUMN_SIZE                         : int  255 53 10 10 53
 $ BUFFER_LENGTH                       : int  255 8 4 4 8
 $ DECIMAL_DIGITS                      : int  NA NA 0 0 NA
 $ NUM_PREC_RADIX                      : int  NA 2 10 10 2
 $ NULLABLE                            : int  0 1 1 1 1
 $ REMARKS                             : chr  NA NA NA NA ...
 $ COLUMN_DEF                          : chr  NA NA NA NA ...
 $ SQL_DATA_TYPE                       : int  12 6 4 4 6
 $ SQL_DATETIME_SUB                    : int  NA NA NA NA NA
 $ CHAR_OCTET_LENGTH                   : int  255 NA NA NA NA
 $ ORDINAL_POSITION                    : int  1 2 3 4 5
 $ IS_NULLABLE                         : chr  "NO" "YES" "YES" "YES" ...
 $ SS_IS_SPARSE                        : int  0 0 0 0 0
 $ SS_IS_COLUMN_SET                    : int  0 0 0 0 0
 $ SS_IS_COMPUTED                      : int  0 0 0 0 0
 $ SS_IS_IDENTITY                      : int  0 0 0 0 0
 $ SS_UDT_CATALOG_NAME                 : chr  NA NA NA NA ...
 $ SS_UDT_SCHEMA_NAME                  : chr  NA NA NA NA ...
 $ SS_UDT_ASSEMBLY_TYPE_NAME           : chr  NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_CATALOG_NAME: chr  NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_SCHEMA_NAME : chr  NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_NAME        : chr  NA NA NA NA ...
 $ SS_DATA_TYPE                        : chr  "39" "109" "38" "38" ...

step4 查詢數據

sqlQuery(con,'select * from USArrests')

# 注意此時第一列的名字已經爲city了
             city Murder Assault UrbanPop Rape
1         Alabama   13.2     236       58 21.2
2          Alaska   10.0     263       48 44.5
3         Arizona    8.1     294       80 31.0
4        Arkansas    8.8     190       50 19.5
5      California    9.0     276       91 40.6
6        Colorado    7.9     204       78 38.7
7     Connecticut    3.3     110       77 11.1
8        Delaware    5.9     238       72 15.8
9         Florida   15.4     335       80 31.9
10        Georgia   17.4     211       60 25.8
11         Hawaii    5.3      46       83 20.2
12          Idaho    2.6     120       54 14.2
13       Illinois   10.4     249       83 24.0
14        Indiana    7.2     113       65 21.0
...

對於sql語句多是以‘XX’ 結尾則須要用以下形式來進行查詢mongodb

sqlQuery(con,paste('select * from USArrests',
         "where city = 'Alabama'"))

city Murder Assault UrbanPop Rape
1 Alabama   13.2     236       58 21.2

可是對於Update，如下倒是失效的數據庫

a<-paste("update [master].[dbo].[USArrests]",
         "set Murder =13.2","where city ='Alabama'")
sqlQuery(con,a)   # 失效
sqlUpdate(con,a)  # 失效

sqlUpdate()  sqlUpdate(channel, dat, tablename = NULL, index = NULL,
          verbose = FALSE, test = FALSE, nastring = NULL,
          fast = TRUE) 
不能進行腳本語句直接更新，可是能夠進行以下操做
foo <- cbind(city=row.names(USArrests), USArrests)[1:3, c(1,3)]
 foo
           city Assault
Alabama Alabama     236
Alaska   Alaska     263
Arizona Arizona     294
 foo[1,2] <- 200
foo
           city Assault
Alabama Alabama     200
Alaska   Alaska     263
Arizona Arizona     294
流程是先選定要更新的行列，將值更新，而後再將值update入庫

實例以下app

temp<-sqlQuery(con,paste('select * from USArrests',
"where city = 'Alabama'"))
temp
     city Murder Assault UrbanPop Rape
1 Alabama   13.2     300       58 21.2
  str(temp)
'data.frame':   1 obs. of  5 variables:
 $ city    : Factor w/ 1 level "Alabama": 1
 $ Murder  : num 13.2
 $ Assault : num 300
 $ UrbanPop: int 58
 $ Rape    : num 21.2
  temp[1,]
[1] 300

  temp[1,3]<-200
  sqlUpdate(con, temp, "USArrests")
  sqlQuery(con,paste('select * from USArrests',
"where city = 'Alabama'"))
     city Murder Assault UrbanPop Rape
1 Alabama   13.2     200       58 21.2

sqlFetch(con, "USArrests", rownames = "city", max = 20,rows_at_time = 10)

實踐後發現，單單對於簡單的ETL，sqlQuery,sqlUpdate是足夠了，
寫一些for循環+list.files/liset.dir+reshape/ddplyr/tidyr（進行數據篩選，清洗，變換）,
對於腳本是否執行的問題，能夠寫日誌文件，對ETL過程進行檢測。函數

step5 構建日誌文件
1. 監督機制，對於每次要寫入/更新的數據進行count（originaldata），保存數據，而後與每次執行寫入/變動的行數Rows進行比對，比對結果result爲邏輯變量。將三者連同執行時間，寫成一個數據框，並在數據中新建對應的日誌表，對執行結果進行監督。（對於追蹤Rows沒有找到很好的解決方法，send a letter ...)
2. 暴力美學，對於sqlSave / sqlQuery 的執行結果進行追蹤，成功爲「1」,失敗則"ON Errors"
step6 快速操做數據庫

sqlClear deletes all the rows of the table sqtable. #清楚表中數據

sqlDrop removes the table sqtable (if permitted). #刪除表

sqlClear(channel, sqtable, errors = TRUE)

sqlDrop(channel, sqtable, errors = TRUE)

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。