Using R as an ETL tool in BI: inserting, deleting, and updating
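R offers powerful packages for exchanging data with many kinds of databases.
Combined with its strong data transformation and cleaning functions, this makes R a quick and convenient route to ETL.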
RODBC
ROracle
RMySQL
rmongodb http://mirrors.ustc.edu.cn/CRAN/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
library(RODBC)
con <- odbcConnect("LI")
con

RODBC Connection 1
Details:
  case=nochange
  DSN=LI
  UID=
  Trusted_Connection=Yes
  APP=RStudio
  WSID=LIYI-PC

data(USArrests)
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
sqlSave(con, USArrests)
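sqlSave(channel, dat, tablename = NULL, append = FALSE, rownames = TRUE, colnames = FALSE, verbose = FALSE, safer = TRUE, addPK = FALSE, typeInfo, varTypes, fast = TRUE, test = FALSE, nastring = NULL)

channel: the database connection channel.
dat: data frame. The data set to store.
tablename: character. The name of the table in the database.
index: character. The name of the index column (this argument is used by sqlUpdate, not sqlSave).
append: logical. Whether to append the data to an existing table.
rownames: logical or character. If logical, whether to save the row names as the first column of the table, named rownames. If character, the column name under which the row names are saved.
colnames: logical. Whether to save the column names as the first row of the table. Change this with care: the column names may end up as the first row of data, and every column's data type becomes varchar(255).
verbose: logical. Display statements as they are sent to the server?
safer: logical. If TRUE, create the table if it does not exist, but only allow appends to an existing table. If FALSE, allow sqlSave to delete all the rows of an existing table, or to drop and recreate it.
addPK: logical. Whether to declare the row names column (the first column, if saved) as the primary key.
typeInfo: list. The DBMS data types to use for the R types in the data frame, with elements named character, double, and integer.
varTypes: an optional named character vector giving the DBMS data types to use for some (or all) of the columns if a table is to be created. This matters because databases have many more data types than R.
fast: logical. If FALSE, write the data one row at a time. If TRUE, use a parametrized INSERT INTO or UPDATE query to write all the data in one operation.
test: logical. If TRUE, only show what would be done.
nastring: optional character string used when writing NAs to the database.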
getSqlTypeInfo("Microsoft SQL Server")

$double
[1] "float"

$integer
[1] "int"

$character
[1] "varchar(255)"

$logical
[1] "varchar(5)"
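The type map above is what sqlSave falls back on when varTypes is not given. As a minimal sketch of how a varTypes vector could be prepared by hand, the code below maps each column's R type through such a list and then applies per-column overrides. The helper build_var_types, the data frame usa, and the override varchar(50) are hypothetical names and values chosen for illustration; they are not part of RODBC. The type list is typed in by hand here so that the example runs without a database connection.

```r
# Hypothetical helper (not part of RODBC): build a named character vector
# suitable for the varTypes argument of sqlSave().
build_var_types <- function(dat, type_info, overrides = character()) {
  # typeof() returns "double", "integer", "character", "logical", ...
  # (note that a factor column reports "integer")
  r_types <- vapply(dat, typeof, character(1))
  unknown <- setdiff(r_types, names(type_info))
  if (length(unknown) > 0) {
    stop("no DBMS type for R type(s): ", paste(unknown, collapse = ", "))
  }
  var_types <- unlist(type_info[r_types])
  names(var_types) <- names(dat)
  # Per-column overrides, e.g. a narrower varchar for a short key column
  if (length(overrides) > 0) {
    var_types[names(overrides)] <- overrides
  }
  var_types
}

# Same mapping as the getSqlTypeInfo("Microsoft SQL Server") output above,
# written out by hand (assumption: no live connection is available here).
type_info <- list(double = "float", integer = "int",
                  character = "varchar(255)", logical = "varchar(5)")

usa <- data.frame(city = rownames(USArrests), USArrests,
                  stringsAsFactors = FALSE)
var_types <- build_var_types(usa, type_info, c(city = "varchar(50)"))
print(var_types)

# With an open channel `con`, the vector could then be passed on:
# sqlSave(con, usa, "USA2", rownames = FALSE, varTypes = var_types)
```

Keeping the mapping in one place makes it easy to review which database type each column will get before any table is created.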
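sqlSave(con, USArrests, rownames = "city", addPK = TRUE)  # the row names, which had no column name, are saved under the column name city, and this first column is set as the primary key
sqlSave(con, USArrests, 'USA2', rownames = "city", addPK = TRUE, fast = TRUE, test = TRUE)  # note: this call may create an empty table named USA2 in the database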
Binding: 'city' DataType 12, ColSize 255
Binding: 'Murder' DataType 6, ColSize 53
Binding: 'Assault' DataType 4, ColSize 10
Binding: 'UrbanPop' DataType 4, ColSize 10
Binding: 'Rape' DataType 6, ColSize 53
Parameters:
no: 1: city Alabama/***/no: 2: Murder 13.2/***/no: 3: Assault 236/***/no: 4: UrbanPop 58/***/no: 5: Rape 21.2/***/
no: 1: city Alaska/***/no: 2: Murder 10/***/no: 3: Assault 263/***/no: 4: UrbanPop 48/***/no: 5: Rape 44.5/***/
no: 1: city Arizona/***/no: 2: Murder 8.1/***/no: 3: Assault 294/***/no: 4: UrbanPop 80/***/no: 5: Rape 31/***/
no: 1: city Arkansas/***/no: 2: Murder 8.8/***/no: 3: Assault 190/***/no: 4: UrbanPop 50/***/no: 5: Rape 19.5/***/
no: 1: city California/***/no: 2: Murder 9/***/no: 3: Assault 276/***/no: 4: UrbanPop 91/***/no: 5: Rape 40.6/***/
no: 1: city Colorado/***/no: 2: Murder 7.9/***/no: 3: Assault 204/***/no: 4: UrbanPop 78/***/no: 5: Rape 38.7/***/
no: 1: city Connecticut/***/no: 2: Murder 3.3/***/no: 3: Assault 110/***/no: 4: UrbanPop 77/***/no: 5: Rape 11.1/***/
no: 1: city Delaware/***/no: 2: Murder 5.9/***/no: 3: Assault 238/***/no: 4: UrbanPop 72/***/no: 5: Rape 15.8/***/
no: 1: city Florida/***/no: 2: Murder 15.4/***/no: 3: Assault 335/***/no: 4: UrbanPop 80/***/no: 5: Rape 31.9/***/
no: 1: city Georgia/***/no: 2: Murder 17.4/***/no: 3: Assault 211/***/no: 4: UrbanPop 60/***/no: 5: Rape 25.8/***/
no: 1: city Hawaii/***/no: 2: Murder 5.3/***/no: 3: Assault 46/***/no: 4: UrbanPop 83/***/no: 5: Rape 20.2/***/
no: 1: city Idaho/***/no: 2: Murder 2.6/***/no: 3: Assault 120/***/no: 4: UrbanPop 54/***/no: 5: Rape 14.2/***/
no: 1: city Illinois/***/no: 2: Murder 10.4/***/no: 3: Assault 249/***/no: 4: UrbanPop 83/***/no: 5: Rape 24/***/
no: 1: city Indiana/***/no: 2: Murder 7.2/***/no: 3: Assault 113/***/no: 4: UrbanPop 65/***/no: 5: Rape 21/***/
no: 1: city Iowa/***/no: 2: Murder 2.2/***/no: 3: Assault 56/***/no: 4: UrbanPop 57/***/no: 5: Rape 11.3/***/
no: 1: city Kansas/***/no: 2: Murder 6/***/no: 3: Assault 115/***/no: 4: UrbanPop 66/***/no: 5: Rape 18/***/
# (about 10,000 more characters of output omitted here)
sqlColumns: enquire about the column structure of tables on an ODBC database connection.
columnsenquire <- sqlColumns(con, 'USA2')
str(columnsenquire)

'data.frame': 5 obs. of 29 variables:
 $ TABLE_CAT                           : chr "master" "master" "master" "master" ...
 $ TABLE_SCHEM                         : chr "dbo" "dbo" "dbo" "dbo" ...
 $ TABLE_NAME                          : chr "USA2" "USA2" "USA2" "USA2" ...
 $ COLUMN_NAME                         : chr "city" "Murder" "Assault" "UrbanPop" ...
 $ DATA_TYPE                           : int 12 6 4 4 6
 $ TYPE_NAME                           : chr "varchar" "float" "int" "int" ...
 $ COLUMN_SIZE                         : int 255 53 10 10 53
 $ BUFFER_LENGTH                       : int 255 8 4 4 8
 $ DECIMAL_DIGITS                      : int NA NA 0 0 NA
 $ NUM_PREC_RADIX                      : int NA 2 10 10 2
 $ NULLABLE                            : int 0 1 1 1 1
 $ REMARKS                             : chr NA NA NA NA ...
 $ COLUMN_DEF                          : chr NA NA NA NA ...
 $ SQL_DATA_TYPE                       : int 12 6 4 4 6
 $ SQL_DATETIME_SUB                    : int NA NA NA NA NA
 $ CHAR_OCTET_LENGTH                   : int 255 NA NA NA NA
 $ ORDINAL_POSITION                    : int 1 2 3 4 5
 $ IS_NULLABLE                         : chr "NO" "YES" "YES" "YES" ...
 $ SS_IS_SPARSE                        : int 0 0 0 0 0
 $ SS_IS_COLUMN_SET                    : int 0 0 0 0 0
 $ SS_IS_COMPUTED                      : int 0 0 0 0 0
 $ SS_IS_IDENTITY                      : int 0 0 0 0 0
 $ SS_UDT_CATALOG_NAME                 : chr NA NA NA NA ...
 $ SS_UDT_SCHEMA_NAME                  : chr NA NA NA NA ...
 $ SS_UDT_ASSEMBLY_TYPE_NAME           : chr NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_CATALOG_NAME: chr NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_SCHEMA_NAME : chr NA NA NA NA ...
 $ SS_XML_SCHEMACOLLECTION_NAME        : chr NA NA NA NA ...
 $ SS_DATA_TYPE                        : chr "39" "109" "38" "38" ...
sqlQuery(con,'select * from USArrests')
# note that the first column is now named city
          city Murder Assault UrbanPop Rape
1      Alabama   13.2     236       58 21.2
2       Alaska   10.0     263       48 44.5
3      Arizona    8.1     294       80 31.0
4     Arkansas    8.8     190       50 19.5
5   California    9.0     276       91 40.6
6     Colorado    7.9     204       78 38.7
7  Connecticut    3.3     110       77 11.1
8     Delaware    5.9     238       72 15.8
9      Florida   15.4     335       80 31.9
10     Georgia   17.4     211       60 25.8
11      Hawaii    5.3      46       83 20.2
12       Idaho    2.6     120       54 14.2
13    Illinois   10.4     249       83 24.0
14     Indiana    7.2     113       65 21.0
...
If the SQL statement ends with a quoted value such as 'XX', write the query in the following form:
sqlQuery(con,paste('select * from USArrests', "where city = 'Alabama'"))
     city Murder Assault UrbanPop Rape
1 Alabama   13.2     236       58 21.2
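When the quoted value comes from a variable rather than being typed into the statement, the string can be assembled before it is handed to sqlQuery. The following is only a sketch: sql_quote and build_select are hypothetical helpers, not RODBC functions, and only the value is escaped (by doubling embedded single quotes, the standard SQL escape), so the table and column names are assumed to be trusted.

```r
# Hypothetical helper: turn an R string into a SQL string literal by wrapping
# it in single quotes and doubling any single quote inside it.
sql_quote <- function(x) {
  paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
}

# Hypothetical helper: build a simple equality query. Table and column names
# are pasted in unchanged, so they must come from trusted code, not user input.
build_select <- function(table, column, value) {
  paste("select * from", table, "where", column, "=", sql_quote(value))
}

query <- build_select("USArrests", "city", "Alabama")
print(query)

# With an open channel `con`, the string would be run as:
# sqlQuery(con, query)
```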
For UPDATE, however, the following did not work:
a <- paste("update [master].[dbo].[USArrests]", "set Murder =13.2", "where city ='Alabama'")
sqlQuery(con, a)   # did not work
sqlUpdate(con, a)  # did not work
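sqlUpdate()
sqlUpdate(channel, dat, tablename = NULL, index = NULL, verbose = FALSE, test = FALSE, nastring = NULL, fast = TRUE)

sqlUpdate cannot run an update directly from a SQL statement string, because it expects a data frame as dat. The following approach does work:

foo <- cbind(city = row.names(USArrests), USArrests)[1:3, c(1, 3)]
foo

           city Assault
Alabama Alabama     236
Alaska   Alaska     263
Arizona Arizona     294

foo[1, 2] <- 200
foo

           city Assault
Alabama Alabama     200
Alaska   Alaska     263
Arizona Arizona     294

The workflow is: first select the rows and columns to update, change the values in R, and then write the changed values back to the database with sqlUpdate.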
A worked example:
temp <- sqlQuery(con, paste('select * from USArrests', "where city = 'Alabama'"))
temp

     city Murder Assault UrbanPop Rape
1 Alabama   13.2     300       58 21.2

str(temp)

'data.frame': 1 obs. of 5 variables:
 $ city    : Factor w/ 1 level "Alabama": 1
 $ Murder  : num 13.2
 $ Assault : num 300
 $ UrbanPop: int 58
 $ Rape    : num 21.2

temp[1, 3]

[1] 300

temp[1, 3] <- 200
sqlUpdate(con, temp, "USArrests")
sqlQuery(con, paste('select * from USArrests', "where city = 'Alabama'"))

     city Murder Assault UrbanPop Rape
1 Alabama   13.2     200       58 21.2
sqlFetch(con, "USArrests", rownames = "city", max = 20,rows_at_time = 10)
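In practice, sqlQuery and sqlUpdate turn out to be enough for simple ETL.
Add a few for loops together with list.files/list.dirs and reshape/dplyr/tidyr (for filtering, cleaning, and transforming the data).
To check whether a script actually ran, write a log file and use it to monitor the ETL process.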
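The loop-plus-log idea can be sketched as below. This is an illustration under stated assumptions, not the author's script: log_line and run_etl are hypothetical helpers, the input is assumed to be a directory of CSV files, and the load step is passed in as a function so that the example runs without a database. In real use, load_one would wrap a call such as sqlSave or sqlUpdate on an open channel.

```r
# Hypothetical helper: append one timestamped line to a log file.
log_line <- function(log_file, ...) {
  cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " ", ..., "\n",
      file = log_file, sep = "", append = TRUE)
}

# Hypothetical helper: run `load_one` on every CSV file in `dir` and record
# success or failure per file, so one bad file does not stop the whole run.
run_etl <- function(dir, log_file, load_one) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  ok <- logical(length(files))
  for (i in seq_along(files)) {
    ok[i] <- tryCatch({
      dat <- read.csv(files[i], stringsAsFactors = FALSE)
      load_one(dat)  # e.g. function(dat) sqlSave(con, dat, "some_table", append = TRUE)
      log_line(log_file, "OK ", basename(files[i]), " rows=", nrow(dat))
      TRUE
    }, error = function(e) {
      log_line(log_file, "FAIL ", basename(files[i]), " ", conditionMessage(e))
      FALSE
    })
  }
  names(ok) <- basename(files)
  ok
}

# Demo in a temporary directory with a load step that does nothing.
demo_dir <- tempfile("etl_demo")
dir.create(demo_dir)
write.csv(head(USArrests), file.path(demo_dir, "a.csv"))
log_file <- file.path(demo_dir, "etl.log")
status <- run_etl(demo_dir, log_file, load_one = function(dat) invisible(NULL))
print(status)
cat(readLines(log_file), sep = "\n")
```

Returning a named logical vector alongside the log lets a scheduler or a follow-up script decide whether to retry only the files that failed.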
sqlClear deletes all the rows of the table sqtable.   # clears the data in the table
sqlDrop removes the table sqtable (if permitted).     # drops the table
sqlClear(channel, sqtable, errors = TRUE)
sqlDrop(channel, sqtable, errors = TRUE)