Querying a relational database with SQL

Date: 2019-11-12

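In the previous sections, we learned how to write data into a SQLite database. In this section, we will learn
how to query the database to retrieve the data we need. The following examples
use data/datasets.sqlite, which we created earlier.
First, we need to connect to the database: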
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "data/datasets.sqlite")
dbListTables(con)

## [1] "diamonds" "flights"
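The database contains two tables. We use a select statement to retrieve all the data in diamonds. Here we
want every column (field), so we call dbGetQuery() with the database connection con and the query
string as arguments: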
db_diamonds <- dbGetQuery(con,
"select * from diamonds")
head(db_diamonds, 3)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
Note that * stands for all fields. If we only need a subset of the fields, we can list
the field names one by one:
db_diamonds <- dbGetQuery(con,
"select carat, cut, color, clarity,
depth, price
from diamonds")
head(db_diamonds, 3)
## carat cut color clarity depth price
## 1 0.23 Ideal E SI2 61.5 326
## 2 0.21 Premium E SI1 59.8 326
## 3 0.23 Good E VS1 56.9 327
To select only the distinct values in the data, use select distinct. For example, the following
code returns every distinct value of the cut field in the diamonds table:
dbGetQuery(con, "select distinct cut from diamonds")
## cut
## 1 Ideal
## 2 Premium
## 3 Good
## 4 Very Good
## 5 Fair
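Note that dbGetQuery() always returns a data frame, even when it has only one column. To turn a
single-column data frame back into an atomic vector, simply extract its first column: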
dbGetQuery(con, "select distinct clarity from diamonds")[[1]]
## [1] "SI2" "SI1" "VS1" "VS2" "VVS2" "VVS1" "I1" "IF"
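When selecting columns with select, the column names in the original table may not be what we want. In that case, we can
write A as B to get a column named B whose data is the same as column A in the original table: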
db_diamonds <- dbGetQuery(con,
"select carat, price, clarity as clarity_level from diamonds")
head(db_diamonds, 3)
## carat price clarity_level
## 1 0.23 326 SI2
## 2 0.21 326 SI1
## 3 0.23 327 VS1
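Sometimes the value we want is not stored directly in the database and has to be computed.
We can again use the form A as B, where A is now an arithmetic expression over existing columns: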
db_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size
from diamonds")
head(db_diamonds, 3)
## carat price size
## 1 0.23 326 38.20203
## 2 0.21 326 34.50586
## 3 0.23 327 38.07688
What if we want to create a new column from existing columns and then create another column from that new column?
db_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size,
price / size as value_density
from diamonds")
## Error in sqliteSendQuery(con, statement, bind.data): error in statement:
## no such column: size
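This does not work. In A as B, A must be built from columns that already exist. If we really
need this, we can use a nested query: an inner select statement produces a temporary table,
and we then select the columns we need from it: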
db_diamonds <- dbGetQuery(con,
"select *, price / size as value_density from
(select carat, price, x * y * z as size
from diamonds)")
head(db_diamonds, 3)
## carat price size value_density
## 1 0.23 326 38.20203 8.533578
## 2 0.21 326 34.50586 9.447672
## 3 0.23 327 38.07688 8.587887
Here, size is already defined in the temporary table by the time price / size is computed.
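Another important part of database queries is conditional querying. We use where to state the conditions the
results must satisfy. For example, to select the diamonds whose cut is Good: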
good_diamonds <- dbGetQuery(con,
"select carat, cut, price from diamonds
where cut = 'Good'")
head(good_diamonds, 3)
## carat cut price
## 1 0.23 Good 327
## 2 0.31 Good 335
## 3 0.30 Good 339
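Note that records with cut equal to Good make up only a small share of the data (here diamonds is the original data frame in the R session):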
nrow(good_diamonds) / nrow(diamonds)
## [1] 0.09095291
If a query must satisfy several conditions at once, we can combine them with and. For example, to select
all records whose cut is Good and whose color is E:
good_e_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where cut = 'Good' and color = 'E'")
head(good_e_diamonds, 3)
## carat cut color price
## 1 0.23 Good E 327
## 2 0.23 Good E 402
## 3 0.26 Good E 554
nrow(good_e_diamonds) / nrow(diamonds)
## [1] 0.017297
The same logic applies to or and not.
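As a minimal sketch of or and not, the following uses a small in-memory table rather than the real diamonds data. The table name toy and its four rows are made up for illustration only.

```r
library(DBI)
library(RSQLite)

# Hypothetical toy table held in memory; not the real diamonds data
con <- dbConnect(SQLite(), ":memory:")
toy <- data.frame(
  cut = c("Good", "Ideal", "Fair", "Good"),
  color = c("E", "E", "J", "D"),
  price = c(327, 326, 337, 400),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy", toy)

# or: keep rows that satisfy at least one of the conditions
either <- dbGetQuery(con,
  "select cut, color, price from toy
   where cut = 'Good' or color = 'E'")

# not: keep rows for which the condition is false
not_good <- dbGetQuery(con,
  "select cut, color, price from toy
   where not (cut = 'Good')")

nrow(either)   # 3: the two Good rows plus the Ideal/E row
not_good$cut   # the Ideal and Fair rows
dbDisconnect(con)
```

The parentheses after not are optional here, but they make it explicit which condition is being negated.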
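Beyond these simple logical operators, we can use in to filter records by checking whether a field's value
belongs to a given set. For example, to select the records whose color is E or F: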
color_ef_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where color in ('E', 'F')")
nrow(color_ef_diamonds)
## [1] 19339
The counts for E and F in the following table, 9797 + 9542 = 19339, confirm the result:
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
With in, we specify a set. With between ... and ..., we specify
a range instead (both endpoints are included):
some_price_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where price between 5000 and 5500")
nrow(some_price_diamonds) /nrow(diamonds)
## [1] 0.03285132
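In fact, the range does not have to be numeric; the field's data type only needs to be comparable. For a
string column, we can write between 'string1' and 'string2' to filter records by dictionary
order.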
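A small sketch of between on a string column, again on a made-up in-memory table (toy and its rows are assumptions for illustration). It relies on SQLite's default text comparison, under which a longer string that starts with 'P' sorts after 'P' itself.

```r
library(DBI)
library(RSQLite)

# Hypothetical toy table: one row per cut level
con <- dbConnect(SQLite(), ":memory:")
toy <- data.frame(
  cut = c("Fair", "Good", "Ideal", "Premium", "Very Good"),
  price = c(337, 327, 326, 326, 336),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy", toy)

# Keep the cuts that fall between 'G' and 'P' in dictionary order
mid_cuts <- dbGetQuery(con,
  "select cut from toy
   where cut between 'G' and 'P'")[[1]]

mid_cuts   # Good and Ideal; Premium sorts after 'P' and is excluded
dbDisconnect(con)
```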
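For string fields there is another useful operator, like, which selects values that match a simple
pattern. For example, we can select the records whose cut ends with Good, which means
either Good or Very Good. We write like '%Good', where % matches any string, including an empty one.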
good_cut_diamonds <- dbGetQuery(con,
"select carat, cut, color, price from diamonds
where cut like '%Good' ")
nrow(good_cut_diamonds) / nrow(diamonds)
## [1] 0.3149425
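Another important feature of database queries is reordering the data by specified fields, which is done with order by.
For example, to retrieve the carat and price fields of all records in ascending order of price: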
cheapest_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price")
This gives a data frame of diamonds ordered from cheapest to most expensive:
head(cheapest_diamonds)
## carat price
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
Adding desc after the sort field sorts in descending order, which gives a data frame in
exactly the opposite order:
most_expensive_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price desc")
head(most_expensive_diamonds)
## carat price
## 1 2.29 18823
## 2 2.00 18818
## 3 1.51 18806
## 4 2.07 18804
## 5 2.00 18803
## 6 2.29 18797
We can also sort by several fields (columns). For example, sort by price in ascending order first and,
where two records have the same price, by carat in descending order:
cheapest_diamonds <- dbGetQuery(con,
"select carat, price from diamonds
order by price, carat desc")
head(cheapest_diamonds)
## carat price
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
As in the select clause, the column used for sorting can be computed from existing columns:
dense_diamonds <- dbGetQuery(con,
"select carat, price, x * y * z as size from diamonds
order by carat /size desc")
head(dense_diamonds)
## carat price size
## 1 1.07 5909 47.24628
## 2 1.41 9752 74.41726
## 3 1.53 8971 85.25925
## 4 1.51 7188 133.10400
## 5 1.22 3156 108.24890
## 6 1.12 6115 100.97448
Combining where and order by gives a sorted subset:
head(dbGetQuery(con,
"select carat, price from diamonds
where cut = 'Ideal' and clarity = 'IF' and color = 'J'
order by price"))
## carat price
## 1 0.30 489
## 2 0.30 489
## 3 0.32 521
## 4 0.32 533
## 5 0.32 533
## 6 0.35 569
If we only care about the first few rows, we can use limit to restrict the number of records returned:
dbGetQuery(con,
"select carat, price from diamonds
order by carat desc limit 3")
## carat price
## 1 5.01 18018
## 2 4.50 18531
## 3 4.13 17329
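Besides selecting fields (columns), filtering by conditions, and sorting, we can also group and aggregate
records in the database. For example, to count the records for each color: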
dbGetQuery(con,
"select color, count(*) as number from diamonds
group by color")
## color number
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
Calling table() on the original data confirms the query result:
table(diamonds$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
Besides counting, other aggregate functions include avg(), max(), min(), and sum(). For example,
to compute the average price at each clarity level:
dbGetQuery(con,
"select clarity, avg(price) as avg_price
from diamonds
group by clarity
order by avg_price desc")
## clarity avg_price
## 1 SI2 5063.029
## 2 SI1 3996.001
## 3 VS2 3924.989
## 4 I1 3924.169
## 5 VS1 3839.455
## 6 VVS2 3283.737
## 7 IF 2864.839
## 8 VVS1 2523.115
We can also check the largest carat available at each of the five lowest prices:
dbGetQuery(con,
"select price, max(carat) as max_carat
from diamonds
group by price
order by price limit 5")
## price max_carat
## 1 326 0.23
## 2 327 0.23
## 3 334 0.29
## 4 335 0.31
## 5 336 0.24
We can compute several aggregates per group at once. The following code computes the price range and
the average price at each clarity level:
dbGetQuery(con,
"select clarity,
min(price) as min_price,
max(price) as max_price,
avg(price) as avg_price
from diamonds
group by clarity
order by avg_price desc")
## clarity min_price max_price avg_price
## 1 SI2 326 18804 5063.029
## 2 SI1 326 18818 3996.001
## 3 VS2 334 18823 3924.989
## 4 I1 345 18531 3924.169
## 5 VS1 327 18795 3839.455
## 6 VVS2 336 18768 3283.737
## 7 IF 369 18806 2864.839
## 8 VVS1 336 18777 2523.115
The next example computes the carat-weighted average price at each clarity level, so that heavier diamonds count for more:
dbGetQuery(con,
"select clarity,
sum(price * carat) / sum(carat) as wprice
from diamonds
group by clarity
order by wprice desc")
## clarity wprice
## 1 SI2 7012.257
## 2 VS2 6173.858
## 3 VS1 6059.505
## 4 SI1 5919.187
## 5 VVS2 5470.156
## 6 I1 5233.937
## 7 IF 5124.584
## 8 VVS1 4389.112
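To make the weighting concrete, here is a worked sketch on three made-up records (the table toy and its values are assumptions, not taken from diamonds). It puts the weighted and the plain average side by side.

```r
library(DBI)
library(RSQLite)

# Hypothetical toy table with hand-checkable numbers
con <- dbConnect(SQLite(), ":memory:")
toy <- data.frame(
  clarity = c("IF", "IF", "SI2"),
  carat = c(1, 3, 2),
  price = c(1000, 2000, 500),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy", toy)

res <- dbGetQuery(con,
  "select clarity,
     sum(price * carat) / sum(carat) as wprice,
     avg(price) as avg_price
   from toy
   group by clarity
   order by clarity")
res
# For IF: wprice = (1000 * 1 + 2000 * 3) / (1 + 3) = 1750,
# while avg_price = (1000 + 2000) / 2 = 1500
dbDisconnect(con)
```

The 3-carat diamond pulls the weighted figure toward its own price, which is why wprice exceeds avg_price for IF. With a single record, as for SI2, the two values coincide.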
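Just as we can sort by several fields, we can group by several fields. The following code computes
the average price for each combination of clarity and color and shows the five most expensive combinations: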
dbGetQuery(con,
"select clarity, color,
avg(price) as avg_price
from diamonds
group by clarity, color
order by avg_price desc
limit 5")
## clarity color avg_price
## 1 IF D 8307.370
## 2 SI2 I 7002.649
## 3 SI2 J 6520.958
## 4 SI2 H 6099.895
## 5 VS2 I 5690.506
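In relational data, the operation that best reflects the idea of a "relation" is the join, which connects several tables through
certain fields. For example, we create a new data frame, diamond_selector, with three records over the fields cut, color,
and clarity, and then use these three records to filter the data: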
diamond_selector <- data.frame(
cut = c("Ideal", "Good", "Fair"),
color = c("E", "I", "D"),
clarity = c("VS1", "I1", "IF"),
stringsAsFactors = FALSE
)
diamond_selector
## cut color clarity
## 1 Ideal E VS1
## 2 Good I I1
## 3 Fair D IF
With the data frame ready, we write it to the database, then join the diamonds table with the
diamond_selector table to pick out the matching records:
dbWriteTable(con, "diamond_selector", diamond_selector,
row.names=FALSE, overwrite=TRUE)
## [1] TRUE
The join clause declares the columns to match on:
subset_diamonds <- dbGetQuery(con,
"select cut, color, clarity, carat, price
from diamonds
join diamond_selector using (cut, color, clarity)")
head(subset_diamonds)
## cut color clarity carat price
## 1 Ideal E VS1 0.60 2774
## 2 Ideal E VS1 0.26 556
## 3 Ideal E VS1 0.70 2818
## 4 Ideal E VS1 0.70 2837
## 5 Ideal E VS1 0.26 556
## 6 Ideal E VS1 0.26 556
In total, only a small share of the records match one of the three selector combinations:
nrow(subset_diamonds) / nrow(diamonds)
## [1] 0.01104931
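One way to read join ... using (...) is as shorthand for an explicit equality condition on each listed column. The sketch below compares the two forms on made-up tables; the names stones and selector and their rows are assumptions for illustration.

```r
library(DBI)
library(RSQLite)

# Hypothetical toy tables held in memory
con <- dbConnect(SQLite(), ":memory:")
stones <- data.frame(
  cut = c("Ideal", "Good", "Fair", "Ideal"),
  color = c("E", "I", "D", "E"),
  price = c(500, 400, 300, 600),
  stringsAsFactors = FALSE
)
selector <- data.frame(
  cut = c("Ideal", "Fair"),
  color = c("E", "J"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "stones", stones)
dbWriteTable(con, "selector", selector)

# Short form: match on the columns named in using (...)
with_using <- dbGetQuery(con,
  "select cut, color, price
   from stones
   join selector using (cut, color)
   order by price")

# Long form: spell out one equality per matching column
with_on <- dbGetQuery(con,
  "select stones.cut as cut, stones.color as color, stones.price as price
   from stones
   join selector
     on stones.cut = selector.cut and stones.color = selector.color
   order by price")

nrow(with_using)   # 2: only the two Ideal/E rows match a selector row
all.equal(with_using, with_on)
dbDisconnect(con)
```

The Fair/D row is dropped because the selector asks for Fair with color J: a record must agree with a selector row on every listed column to be kept.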
Finally, do not forget to disconnect from the database so that all resources are released properly:
dbDisconnect(con)
## [1] TRUE
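When queries are wrapped in a function, one possible way to make sure the connection is always closed, even if the query fails, is base R's on.exit(). The helper query_once below is a hypothetical name introduced here for illustration; it is not part of DBI or RSQLite.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: open a connection, run one query, always disconnect
query_once <- function(path, sql) {
  con <- dbConnect(SQLite(), path)
  # Runs when the function exits, whether normally or through an error
  on.exit(dbDisconnect(con), add = TRUE)
  dbGetQuery(con, sql)
}

# Demonstrated on an in-memory database so the sketch is self-contained;
# a file path such as "data/datasets.sqlite" could be passed instead
query_once(":memory:", "select 1 + 1 as two")
```

The trade-off is that every call opens and closes a connection, so this suits occasional queries better than the long interactive session shown above.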
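The examples above show only the basics of using SQL to query a relational database, with SQLite as the
example. SQL is far richer and more powerful than what we have demonstrated. For more details,
visit http://www.w3schools.com/sql.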