The examples in this book are included in the nutshell R package. To use the data, install and load the package:
> install.packages("nutshell")
> library(nutshell)
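R provides a way to run a large set of commands in sequence and save the results to a file.
One way to run R in batch mode is from the system command line (not the R console). The advantage of running R from the command line is that you can run a series of commands without starting an interactive R session, which is very helpful for automating analyses.
# Here script.R stands for the name of your R script
$ R CMD BATCH script.R
# For more information about running R from the command line, run:
$ R --help
# A second command for running an R script in batch mode:
$ Rscript script.R
# To run an R script in batch mode from within R, use the source function, for example source("script.R")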
RExcel software (http://rcom.univie.ac.at / http://rcom.univie.ac.at/download.html)
If R is already installed, you can install the RExcel packages directly from R. The code below follows these steps:
Download RExcel -> configure the RCOM server -> install RDCOM -> start the RExcel installer
> # pass the package names as one character vector; listing them as separate arguments does not work
> install.packages(c("RExcelInstaller", "rcom", "rsproxy"))
> # configure rcom
> library(rcom)
> comRegisterRegistry()
> library(RExcelInstaller)
> # execute the following command in R to start the installer for RDCOM
> installstatconnDCOM()
> # execute the following command in R to start the installer for REXCEL
> installRExcel()
Once RExcel is installed, you can access it from the Excel menu.
As a web application
The rApache software allows you to incorporate analyses from R into a web
application. (For example, you might want to build a server that shows sophisticated
reports using R lattice graphics.) For information about this project, see
http://biostat.mc.vanderbilt.edu/rapache/.
As a server
The Rserve software allows you to access R from within other applications. For
example, you can produce a Java program that uses R to perform some calculations.
As the name implies, Rserve is implemented as a network server, so a
single Rserve instance can handle calculations from multiple users on different
machines. One way to use Rserve is to install it on a heavy-duty server with lots
of CPU power and memory, so that users can perform calculations that they
couldn't easily perform on their own desktops. For more about this project, see
http://www.rforge.net/Rserve/index.html.
As we described above, you can also use RStudio to run R on a server and access
it from a web browser.
Inside Emacs
The ESS (Emacs Speaks Statistics) package is an add-on for Emacs that allows
you to run R directly within Emacs. For more on this project, see http://ess.r-project.org/
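A vector is the simplest data structure; an array is a multidimensional vector, and a matrix is a two-dimensional array.
A data frame is a list containing multiple named vectors of the same length, much like a spreadsheet or a database table.
@ Define an array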
> a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), dim=c(3, 4))
Here is what the array looks like:
> a
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
And here is how you reference one cell:
> a[2,2]
[1] 5
@ Define a matrix
> m <- matrix(data=c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=3,ncol=4)
> m
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
@ Define a data frame
> teams <- c("PHI","NYM","FLA","ATL","WSN")
> w <- c(92, 89, 94, 72, 59)
> l <- c(70, 73, 77, 90, 102)
> nleast <- data.frame(teams,w,l)
> nleast
teams w l
1 PHI 92 70
2 NYM 89 73
3 FLA 94 77
4 ATL 72 90
5 WSN 59 102
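Every object in R has a type. In addition, every object is a member of a class.
You can use the class function to determine the class of an object. For example: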
> class(class)
[1] "function"
> class(mtcars)
[1] "data.frame"
> class(letters)
[1] "character"
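Methods for different classes can share the same name; the function that dispatches to them is called a generic function.
For example, + is a generic function for adding objects. It can add numbers, add days to dates, and so on: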
> 17 + 6
[1] 23
> as.Date("2009-09-08") + 7
[1] "2009-09-15"
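Incidentally, the R interpreter calls the print(x) function to print results. This means that if you define a new class, you can define a print method to specify how objects of that class are displayed on the console.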
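A minimal sketch of that idea, assuming a hypothetical S3 class named "temperature" (the class name and its fields are invented for illustration and do not come from the book):

```r
# Build an object of a hypothetical S3 class "temperature"
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

# print method for the class; print() dispatches to it for such objects
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, x$unit, "\n")
  invisible(x)
}

print(temp)   # uses print.temperature instead of the default list printing
```

Typing temp at the interactive console has the same effect, because the console prints results by calling print.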
To statisticians, a model is a concise way to describe a set of data, usually with a mathematical formula. Sometimes, the goal is to build a predictive model with training data to predict values based on other data. Other times, the goal is to build a descriptive model that helps you understand the data better.
R has a special notation for describing relationships between variables. Suppose that
you are assuming a linear model for a variable y, predicted from the variables x1,
x2, ..., xn. (Statisticians usually refer to y as the dependent variable, and x1, x2, ...,
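xn as the independent variables.) As an equation, this can be written as
y = c0 + c1*x1 + c2*x2 + ... + cn*xn + e
where c0, ..., cn are coefficients and e is an error term. In R, you write this relationship as
y ~ x1 + x2 + ... + xn
which is one form of a formula object.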
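Take the cars data set (in the datasets package that ships with R) as an example to briefly explain how formula objects are used. The cars data set records the speed and stopping distance of different cars. We assume that stopping distance is a linear function of speed, so we use linear regression to estimate the relationship between the two. The formula is written dist ~ speed. Use the lm function to estimate the parameters of the model; it returns an object of class lm.
For more information, use the summary function.
The summary function shows the function call, the distribution
of the residuals from the fit, the coefficients, and information about the fit.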
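The steps described above can be sketched as follows, using only the cars data set that ships with R. The variable name fit is chosen here for illustration.

```r
# cars: speed (mph) and stopping distance (ft) for 50 observations
fit <- lm(dist ~ speed, data = cars)

class(fit)     # an object of class "lm"
coef(fit)      # estimated intercept and slope
summary(fit)   # call, residual distribution, coefficients, fit statistics
```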
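R includes several packages for visualizing data: graphics, grid, and lattice. To briefly illustrate the graphics functions, we use data on field goal attempts in the National Football League (from the nutshell package). A team can attempt to kick the football through a set of goalposts to score 3 points. If it misses the field goal, the ball is turned over to the other team.
@ First, look at the distribution of distances, using the hist function.
You can then use the breaks argument to add more bins to the histogram.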
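A small sketch of the effect of breaks. It assumes a stand-in variable (the stopping distances in the built-in cars data set) rather than the field goal data, so that it runs without the nutshell package; plot = FALSE is used only so the sketch runs without a graphics device.

```r
# Stand-in data: stopping distances from the built-in cars data set
h.default <- hist(cars$dist, plot = FALSE)               # default bin choice
h.fine <- hist(cars$dist, breaks = 25, plot = FALSE)     # ask for about 25 bins

length(h.default$counts)   # number of bins with the default setting
length(h.fine$counts)      # more bins when breaks is larger
```

Note that a single number passed to breaks is only a suggestion; R picks "pretty" break points near that count. Drop plot = FALSE to draw the histograms.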
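@ An example using the lattice package
The data set (from the nutshell package) shows how American eating habits changed between 1980
and 2005.
Specifically, we want to look at how Amount (the amount of food consumed) varies by Year, and also plot the trend separately for each value of the Food variable. In the lattice package, you specify the data to plot through a formula, in this case Amount ~ Year | Food.
However, the default plot is hard to read: the axis labels are too large and every panel uses the same scale, so some fine-tuning is needed.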
> library(lattice)
> data(consumption)
> dotplot(Amount~Year|Food,consumption,aspect="xy",scales=list(relation='sliced',cex=.4))
# The aspect option changes the aspect ratios of each plot to try to show changes from
45° angles (making changes easier to see). The scales option changes how the axes
are drawn.
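@ Get help on a function, such as glm
# help(glm) or, equivalently, ?glm
@ For special characters such as +, enclose the name in backquotes
# ?`+`
@ Run the examples in a help file, such as the examples for glm
# example(glm)
@ Search the help system for a topic, such as "regression", using the help.search function
# help.search("regression")
A shorthand for this is: ??regression
@ Get the help for a package, such as grDevices, using
# library(help='grDevices')
@ The vignette function
1) Some packages (especially those from Bioconductor) include at least one vignette, a short description of how to use the package, with examples. For example, to view the vignette for the affy package (provided affy is installed), use
# vignette("affy")
2) To list the available vignettes for all attached packages, use
# vignette(all=FALSE)
3) To list the vignettes for all installed packages, use
# vignette(all=TRUE)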
The first step in using a package is to install it into a local library; the second step is to load it into the current session.
R's help system can become very slow as more and more packages are added to the search path. In addition, two packages may both use
functions with names like "fit" that work very differently, resulting in strange and
unexpected results. By loading only packages that you need, you can minimize the
chance of these conflicts.
@ To get the list of packages loaded by default,
# getOption("defaultPackages")
This command omits the base package; the base package implements many key
features of the R language and is always loaded.
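@ To view the list of currently loaded packages, use
# (.packages())
@ To view all available packages, use
# (.packages(all.available=TRUE))
@ You can also call library() with no arguments, which opens a new window showing the set of available packages.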
The two largest sources of packages are CRAN (Comprehensive R Archive Network) and Bioconductor; another is R-Forge. There are others as well, such as GitHub.
Where to look up all available packages:
Repository      URL
CRAN            See http://cran.r-project.org/web/packages/ for an authoritative list, but you should try to find your local mirror and use that site instead
Bioconductor    http://www.bioconductor.org/packages/release/Software.html
R-Forge         http://r-forge.r-project.org/
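> install.packages(c("tree","maptree")) # install packages to the default location
> remove.packages(c("tree", "maptree"),.Library) # remove packages from the library
Commonly used package-related commands
Creating a package directory
When building a package, you need to put all of the package files (code, data, documentation, and so on) in a single directory. You can use the package.skeleton function to create the appropriate directory structure:
package.skeleton(name = "anRpackage", list, environment = .GlobalEnv, path = ".", force = FALSE, namespace = FALSE, code_files = character())
This function can also copy a set of R objects into that directory. Below are descriptions of its arguments.
The package.skeleton function creates several files and directories: a help file directory named man, R source files, data files in data, and a DESCRIPTION file.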
R includes a set of functions that help automate the creation of help files for packages:
prompt (for generic documentation), promptData (for documenting data files),
promptMethods (for documenting methods of a generic function), and promptClass
(for documenting a class). See the help files for these functions for additional
information.
You can add data files to the data directory in several different forms: as R data files
(created by the save function and named with either a .rda or a .Rdata suffix), as
comma-separated value files (with a .csv suffix), or as an R source file containing R
code (with a .R suffix).
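Building the package
After you have added all the materials to the package, you can build it from the command line. Before doing so, make sure the package complies with the CRAN rules by using the check command:
# $ R CMD check nutshell
# $ R CMD check --help : get more information about the CMD check command
# $ R CMD build nutshell : build the package
For more on building packages, see http://cran.r-project.org/doc/manuals/R-exts.pdf.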
Expressions include assignment statements, conditional statements, and arithmetic expressions.
Here are a few examples:
> x <- 1
> if (1 > 2) "yes" else "no"
[1] "no"
> 127 %% 10
[1] 7
Expressions are composed of objects and functions. You can separate expressions with new lines or with semicolons. For example:
> "this expression will be printed"; 7 + 13; exp(0+1i*pi)
[1] "this expression will be printed"
[1] 20
[1] -1+0i
Examples of objects in R include numeric vectors, character vectors, lists, and functions:
> # a numerical vector (with five elements)
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> # a character vector (with one element)
> "This is an object too"
[1] "This is an object too"
> # a list
> list(c(1,2,3,4,5),"This is an object too", " this whole thing is a list")
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] "This is an object too"
[[3]]
[1] " this whole thing is a list"
> # a function
> function(x,y) {x + y}
function(x,y) {x + y}
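Variable names in R are called symbols. When you assign an object to a variable name, you are actually assigning the object to a symbol in the current environment. For example, x <- 1 assigns the symbol "x" to the object "1" in the current environment.
A function is an object in R that takes some input objects (called the arguments of
the function) and returns an output object. For example:
> animals <- c("cow", "chicken", "pig", "tuba")
> animals[4] <- "duck" # change the fourth element to "duck"
The statement above is parsed as a call to the [<- function and is equivalent to
> animals <- `[<-`(animals,4,"duck")
Some other examples of R syntax and the corresponding function calls
Four special values: NA, Inf/-Inf, NaN, and NULL
NA represents a missing value (not available). For example: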
> v <- c(1,2,3)
> v
[1] 1 2 3
> length(v) <- 4 # extending a vector, matrix, or array beyond the range where values were defined fills the new space with NA
> v
[1] 1 2 3 NA
Inf and -Inf represent positive and negative infinity.
R returns these values when the result of a computation is too large. For example:
> 2 ^ 1024
[1] Inf
> - 2 ^ 1024
[1] -Inf
They are also returned when dividing by zero:
> 1 / 0
[1] Inf
NaN represents a meaningless result (not a number).
R returns this value when a computation produces a result that makes little sense. For example:
> Inf - Inf
[1] NaN
> 0 / 0
[1] NaN
NULL is often used as an argument to a function to indicate that no value was assigned to the argument. Some functions may also return NULL.
Here is an overview of the coercion rules:
• Logical values are converted to numbers: TRUE is converted to 1 and FALSE to 0.
• Values are converted to the simplest type required to represent all information.
• The ordering is roughly logical < integer < numeric < complex < character < list.
• Objects of type raw are not converted to other types.
• Object attributes are dropped when an object is coerced from one type to
another.
When passing arguments to a function, you can use the I function (which marks an object with class "AsIs") to prevent coercion.
> if (x > 1) "orange" else "apple"
[1] "apple"
To show how this expression is parsed, use the quote() function, which returns its argument unevaluated. Calling quote on an R expression returns a language object.
> typeof(quote(if (x > 1) "orange" else "apple"))
[1] "language"
> quote(if (x > 1) "orange" else "apple")
if (x > 1) "orange" else "apple"
If that does not reveal much, you can convert the language object into a list to get the parse tree of the expression:
> as(quote(if (x > 1) "orange" else "apple"),"list") # the as function converts an object to a specified class
[[1]]
`if`
[[2]]
x > 1
[[3]]
[1] "orange"
[[4]]
[1] "apple"
You can also run typeof on each element of the list to see the type of each object:
> lapply(as(quote(if (x > 1) "orange" else "apple"), "list"),typeof)
[[1]]
[1] "symbol"
[[2]]
[1] "language"
[[3]]
[1] "character"
[[4]]
[1] "character"
As you can see, some parts of the if-then statement are not included in the parsed expression (in particular, the else keyword).
The deparse function
The deparse function can take the parse tree and turn it back into properly formatted R code(The deparse function will use proper R syntax when translating a language object back into the original code)
> deparse(quote(x[2]))
[1] "x[2]"
> deparse(quote(`[`(x,2)))
[1] "x[2]"
As you read through this book, you might want to try using quote, substitute, typeof, class, and methods to see how the R interpreter parses expressions.
Constants are the basic building blocks for data objects in R: numbers, character values, and symbols.
Many functions in R can be written as operators. An operator is a function that takes one or two arguments and can be written without parentheses.
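Addition, subtraction, multiplication, division, modulus, and so on.
A user-defined binary operator consists of a string of characters enclosed between two % characters. For example: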
> `%myop%` <- function(a, b) {2*a + 2*b}
> 1 %myop% 1
[1] 4
> 1 %myop% 2
[1] 6
Order of operations
• Function calls and grouping expressions
• Index and lookup operators
• Arithmetic
• Comparison
• Formulas
• Assignment
• Help
Table 6-1 shows a complete list of operators in R and their precedence.
Most assignments simply assign an object to a symbol (that is, a variable). For example:
> x <- 1
> y <- list(shoes="loafers", hat="Yankees cap", shirt="white")
> z <- function(a, b, c) {a ^ b / c}
> v <- c(1, 2, 3, 4, 5, 6, 7, 8)
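There is another kind of assignment statement that differs from the usual one, because a function call appears on the left side of the assignment operator. For example: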
> dim(v) <- c(2, 4)
> v[2, 2] <- 10
> formals(z) <- alist(a=1, b=2, c=3)
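The logic behind this is as follows, for an assignment statement of the form:
fun(sym) <- val # generally speaking, fun refers to a property of the object represented by sym
R provides several different ways to group expressions: semicolons,
parentheses, and curly braces.
@ Separating expressions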
You can write a series of expressions on separate lines:
> x <- 1
> y <- 2
> z <- 3
Alternatively, you can place them on the same line, separated by semicolons:
> x <- 1; y <- 2; z <- 3
@ Parentheses
The parentheses notation returns the result of evaluating the expression inside the parentheses, and can be used to override the default order of operations:
> 2 * (5 + 1)
[1] 12
> # equivalent expression
> f <- function (x) x
> 2 * f(5 + 1)
[1] 12
@ Curly braces
{expression_1; expression_2; ... expression_n}
Typically, these are used to group a set of operations in the body of a function:
> f <- function() {x <- 1; y <- 2; x + y}
> f()
[1] 3
However, curly braces can also be used outside a function body:
> {x <- 1; y <- 2; x + y}
[1] 3
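The difference is:
We have now discussed two important sets of constructs: operators and grouping brackets. Next, we go into more depth.
@ Conditional statements
# Two forms
if (condition) true_expression else false_expression
or
if (condition) expression
Because the true and false expressions are not always evaluated, the type of if is special: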
> typeof(`if`)
[1] "special"
Examples:
> if (FALSE) "this will not be printed"
> if (FALSE) "this will not be printed" else "this will be printed"
[1] "this will be printed"
> if (is(x, "numeric")) x/2 else print("x is not numeric")
[1] 5
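In R, conditional statements are not vector operations. If the condition is a vector with more than one logical value, only the first element is used (in R 4.2.0 and later, this situation is an error rather than a warning):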
> x <- 10
> y <- c(8, 10, 12, 3, 17)
> if (x < y) x else y
[1] 8 10 12 3 17
Warning message:
In if (x < y) x else y :
the condition has length > 1 and only the first element will be used
To use vector operations, use the ifelse function:
> a <- c("a", "a", "a", "a", "a")
> b <- c("b", "b", "b", "b", "b")
> ifelse(c(TRUE, FALSE, TRUE, FALSE, TRUE), a, b)
[1] "a" "b" "a" "b" "a"
Often, you want to return different values (or call different functions) depending on an input value:
> switcheroo.if.then <- function(x) {
+ if (x == "a")
+ "camel"
+ else if (x == "b")
+ "bear"
+ else if (x == "c")
+ "camel"
+ else
+ "moose"
+ }
This is clearly a bit verbose, however, and the switch function can be used instead:
> switcheroo.switch <- function(x) {
+ switch(x,
+ a="camel",
+ b="bear",
+ c="camel",
+ "moose") # the unnamed argument specifies the default value
+ }
+ }
> switcheroo.if.then("a")
[1] "camel"
> switcheroo.if.then("f")
[1] "moose"
> switcheroo.switch("a")
[1] "camel"
> switcheroo.switch("f")
[1] "moose"
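@ Loops
There are three different looping constructs in R. The simplest is repeat, which simply repeats the same expression:
repeat expression
To stop repeating the expression, use the keyword break; to skip to the next iteration of the loop, use the next command. For example: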
> i <- 5
> repeat {if (i > 25) break else {print(i); i <- i + 5;}}
Another looping construct is the while loop, which repeats an expression while a condition
is true:
while (condition) expression
> i <- 5;while (i <= 25) {print(i); i <- i + 5}
As with repeat, you can use break and next inside while loops.
Finally, there is the for loop, which iterates through each item in a vector (or a list):
for (var in list) expression
Example:
> for (i in seq(from=5, to=25, by=5)) print(i)
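You can also use break and next inside for loops.
There are two things to keep in mind about loops. First, results are not printed inside a loop unless you explicitly call the print function. For example, this prints nothing:
> for (i in seq(from=5, to=25, by=5)) i
Second, the variable var that is set in a for loop is changed in the calling environment.
Like conditional statements, the looping functions repeat, while, and for have type special, because expression is not necessarily evaluated.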
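A short sketch of break, next, and the loop variable persisting after the loop. The variable name total is chosen here for illustration.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # leave the loop at the first odd number above 7
  total <- total + i      # adds 1, 3, 5, 7
}
print(total)   # sum of the odd numbers up to 7
print(i)       # i is still defined after the loop, holding its last value
```

The loop stops when i reaches 9, so i keeps that value in the calling environment, and any earlier variable named i would have been overwritten.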
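R does not provide iterators or foreach loops in the base language, but add-on packages supply this functionality.
For iterators, install the iterators package. Iterators can return elements of a vector, array, data frame, or other object.
Usage: iter(obj, checkFunc=function(...) TRUE, recycle=FALSE,...)
The obj argument specifies the object, and recycle specifies whether the iterator should reset when it runs through all the elements. If the next value matches checkFunc, it is returned; otherwise, the function tries other values. nextElem will check values until it finds one that matches checkFunc or it runs out of values. When there are no elements left, the iterator calls stop with the message "StopIteration." For example, you could create an iterator that returns the values 1:5.
The second is the foreach loop, which requires loading the foreach package. foreach provides an elegant way to loop through multiple elements of another object (such as a vector, matrix, data frame, or iterator), evaluate an expression for each element, and return the results. Here is the prototype of the foreach function: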
foreach(..., .combine, .init, .final=NULL, .inorder=TRUE,
.multicombine=FALSE,
.maxcombine=if (.multicombine) 100 else 2,
.errorhandling=c('stop', 'remove', 'pass'),
.packages=NULL, .export=NULL, .noexport=NULL,
.verbose=FALSE)
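The foreach function returns a foreach object. To actually evaluate the loop, you need to apply the foreach loop to an R expression (using the %do% or %dopar% operators). For example, you can use a foreach loop to calculate the square roots of the numbers 1:5.
The %do% operator evaluates the expression in serial, while the %dopar% operator can be used
to evaluate expressions in parallel.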
You can fetch items by location within a data structure or by name.
@ Data structure operators
Table 6-2 shows the operators in R used for accessing objects in a data structure.
Key point: the difference between single brackets and double brackets
@ Indexing by integer vector
Examples:
> v <- 100:119
> v[5]
[1] 104
> v[1:5]
[1] 100 101 102 103 104
> v[c(1, 6, 11, 16)]
[1] 100 105 110 115
In particular, you can use double brackets to reference a single element (in this case, it works the same as the single bracket):
> v[[3]]
[1] 102
You can also use negative integers to return a vector containing all elements except the specified ones:
> # exclude elements 1:15 (by specifying indexes -1 to -15)
> v[-15:-1]
[1] 115 116 117 118 119
The same notation works for lists.
It also works for multidimensional data structures such as matrices and arrays.
When you take a subset, R automatically coerces the result to the most appropriate number of dimensions. If you select a subset of elements that corresponds to a matrix, R will return a matrix object; if you select a subset that corresponds to only a vector, R will return a vector object. To disable this behavior, you can use the
drop=FALSE option.
You can even use this notation to extend data structures. A special NA element is used to represent values that are not defined.
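A brief sketch of these indexing rules on a small matrix and vector. The objects m and v are defined here for illustration.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

m[1:2, 1:2]            # a 2 x 2 subset is still a matrix
m[1, ]                 # a single row is returned as a plain vector
m[1, , drop = FALSE]   # drop = FALSE keeps it as a 1 x 4 matrix

v <- 1:3
v[6] <- 6L             # assigning past the end pads the gap with NA
print(v)
```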
@ Indexing by logical vector
Often, it is useful to calculate a logical vector from the vector itself.
> # trivial example: return element that is equal to 103
> v[(v==103)]
[1] 103
> # more interesting example: multiples of three
> v[(v %% 3 == 0)]
[1] 102 105 108 111 114 117
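Note that the index vector does not need to be the same length as the vector itself; R recycles the shorter vector and returns the matching values.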
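A small sketch of that recycling behavior, reusing the vector v <- 100:119 from the examples above:

```r
v <- 100:119

v[c(TRUE, FALSE)]          # mask is recycled: keeps elements 1, 3, 5, ...
v[c(FALSE, FALSE, TRUE)]   # keeps every third element: 3, 6, 9, ...
```

The second mask has length 3, which does not divide 20 evenly; R still recycles it as far as needed.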
@ Indexing by name
With lists, you can use names to index elements:
> l <- list(a=1, b=2, c=3, d=4, e=5, f=6, g=7, h=8, i=9, j=10)
> l$j
[1] 10
> l[c("a", "b", "c")]
$a
[1] 1
$b
[1] 2
$c
[1] 3
You can also index with double brackets, and even use partial matching (by setting exact=FALSE):
> dairy <- list(milk="1 gallon", butter="1 pound", eggs=12)
> dairy[["milk"]]
[1] "1 gallon"
> dairy[["mil",exact=FALSE]]
[1] "1 gallon"
In this book, I've tried to stick to Google's R Style Guide, which is available at http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html. Here is a summary
of its suggestions:
Indentation
Indent lines with two spaces, not tabs. If code is inside parentheses, indent to
the innermost parentheses.
Spacing
Use only single spaces. Add spaces between binary operators and operands. Do
not add spaces between a function name and the argument list. Add a single
space between items in a list, after each comma.
Blocks
Don't place an opening brace ("{") on its own line. Do place a closing brace
("}") on its own line. Indent inner blocks (by two spaces).
Semicolons
Omit semicolons at the end of lines when they are optional.
Naming
Name objects with lowercase words, separated by periods. For function names,
capitalize the name of each word that is joined together, with no periods. Try
to make function names verbs.
Assignment
Use <-, not = for assignment statements
Basic vectors
These are vectors containing a single type of value: integers, floating-point
numbers, complex numbers, text, logical values, or raw data.
Compound objects
These objects are containers for the basic vectors: lists, pairlists, S4 objects, and
environments. Each of these objects has unique properties (described below),
but each of them contains a number of named objects.
Special objects
These objects serve a special purpose in R programming: any, NULL, and ... .
Each of these means something important in a specific context, but you would
never create an object of these types.
R language
These are objects that represent R code; they can be evaluated to return other
objects.
Functions
Functions are the workhorses of R; they take arguments as inputs and return
objects as outputs. Sometimes, they may modify objects in the environment or
cause side effects outside the R environment like plotting graphics, saving files,
or sending data over the network.
Internal
These are object types that are formally defined by R but which aren't normally
accessible within the R language. In normal R programming, you will probably
never encounter any of the objects.
Bytecode Objects
If you use the bytecode compiler, R will generate bytecode objects that run on
the R virtual machine.
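In R, you will encounter six basic vector types. R includes several different ways to create a new vector; the simplest is the c function, which combines its arguments into a vector: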
> # a vector of five numbers
> v <- c(.295, .300, .250, .287, .215)
> v
[1] 0.295 0.300 0.250 0.287 0.215
The c function coerces all of its arguments into a single type:
> # creating a vector from four numbers and a char
> v <- c(.295, .300, .250, .287, "zilch")
> v
[1] "0.295" "0.3" "0.25" "0.287" "zilch"
With the recursive=TRUE argument, you can combine data from other data structures into a single vector:
> # creating a vector from four numbers and a list of three more
> v <- c(.295, .300, .250, .287, list(.102, .200, .303), recursive=TRUE)
> v
[1] 0.295 0.300 0.250 0.287 0.102 0.200 0.303
Note that if you pass a list as an argument without recursive=TRUE, the result is a list:
> v <- c(.295, .300, .250, .287, list(.102, .200, .303), recursive=TRUE)
> v
[1] 0.295 0.300 0.250 0.287 0.102 0.200 0.303
> typeof(v)
[1] "double"
> v <- c(.295, .300, .250, .287, list(1, 2, 3))
> typeof(v)
[1] "list"
> class(v)
[1] "list"
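Another useful tool for assembling a vector is the ":" operator. This operator creates a sequence of values from the first operand to the second operand.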
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
A more flexible way is to use the seq function:
> seq(from=5, to=25, by=5)
[1] 5 10 15 20 25
You can manipulate the length of a vector through its length attribute:
> w <- 1:10
> w
[1] 1 2 3 4 5 6 7 8 9 10
> length(w) <- 5
> w
[1] 1 2 3 4 5
> length(w) <- 10
> w
[1] 1 2 3 4 5 NA NA NA NA NA
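An R list is an ordered collection of objects (details omitted).
@ Matrices
A matrix is an extension of a vector to two dimensions. A matrix is used to represent two-dimensional data of a single type.
The function for generating a matrix is matrix.
You can use the as.matrix function to convert other data structures to a matrix. Unlike other classes, matrices do not have an explicit class attribute.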
@ Arrays
An array is an extension of a vector to more than two dimensions. Arrays are used to represent multidimensional data of a single type.
Use the array function to generate an array.
As with matrices, arrays don't have an explicit class attribute.
@ Factors
A factor is an ordered collection of items. The different values that the factor can take are called levels.
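In the eye color example, order doesn't matter, but sometimes the order of the factor levels matters a great deal. For example, suppose that in a survey you ask respondents how they feel about the statement "melon is delicious with an omelet," and they can give the following responses: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
There are many ways to represent this in R. One is to encode the responses as integers, on a scale of 1 to 5. But this approach has drawbacks. For example, is the difference between Strongly Disagree and Disagree the same as the difference between Disagree and Neutral? Can you be sure that a Disagree response and an Agree response
average out to Neutral?
To get around this problem, you can use an ordered factor to represent the responses.
Factors are implemented internally using integers. The levels attribute maps each integer to a factor level.
By setting the class attribute, you can turn an integer vector with a levels attribute into a factor.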
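A minimal sketch of an ordered factor for the survey responses described above. The sample responses are made up for illustration.

```r
# Hypothetical survey answers
responses <- c("Agree", "Neutral", "Strongly Agree", "Disagree", "Agree")

survey <- factor(responses,
                 levels = c("Strongly Disagree", "Disagree", "Neutral",
                            "Agree", "Strongly Agree"),
                 ordered = TRUE)

print(survey)
unclass(survey)          # the underlying integer codes plus the levels attribute
survey[1] > survey[2]    # comparisons respect the level order
```

Because the factor is ordered, comparisons such as > are meaningful, but no claim is made about the distance between adjacent levels.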
A data frame is a useful way to represent tabular data. Each column may be a different type, but each row in the data frame must have the same length.
The format of the data is as follows:
R provides a formula class that lets you describe the relationship. Next, let's create a formula.
Here is an explanation of the meaning of different items in formulas:
Variable names
Represent variable names.
Tilde (~)
Used to show the relationship between the response variables (to the left) and
the stimulus variables (to the right).
Plus sign (+)
Used to express a linear relationship between variables.
Zero (0)
When added to a formula, indicates that no intercept term should be included.
For example:
y~u+w+v+0
Vertical bar (|)
Used to specify conditioning variables (in lattice formulas; see "Customizing
Lattice Graphics" on page 312).
Identity function (I())
Used to indicate that the enclosed expression should be interpreted by its arithmetic meaning. For example:
a+b
means that both a and b should be included in the formula. The formula:
I(a+b)
means that "a plus b" should be included in the formula.
Asterisk (*)
Used to indicate interactions between variables. For example:
y~(u+v)*w
is equivalent to:
y~u+v+w+I(u*w)+I(v*w)
Caret (^)
Used to indicate crossing to a specific degree. For example:
y~(u+w)^2
is equivalent to:
y~(u+w)*(u+w)
Function of variables
Indicates that the function of the specified variables should be interpreted as a
variable. For example:
y~log(u)+sin(v)+w
Some additional items have special meaning in formulas, for example s() for
smoothing splines in formulas passed to gam. We'll revisit formulas in Chapter 14
and Chapter 20.
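One way to check how R expands a formula is to inspect its terms object. The sketch below uses the formulas from the list above; the variables y, u, v, and w do not need to exist, because the formula is only parsed, not evaluated.

```r
# term.labels lists the terms R derives from a formula
attr(terms(y ~ (u + v) * w), "term.labels")     # main effects plus u:w and v:w
attr(terms(y ~ (u + w)^2), "term.labels")       # crossing to degree 2
attr(terms(y ~ u + w + v + 0), "intercept")     # 0 means no intercept term
```

R reports interaction terms with the : notation (for example u:w).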
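Many important problems look at how a variable changes over time. R includes a class to represent this kind of data: time series objects. Regression functions for time series (such as ar or arima) use time series objects. In addition, many plotting functions have special methods for time series.
To create a time series object (of class ts), use the ts function: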
ts(data = NA, start = 1, end = numeric(0), frequency = 1,
deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )
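# The data argument specifies the series of observations; the other arguments specify when the observations were taken. Below is a description of the arguments to ts.
The print method for time series objects can produce nicely formatted output when used with months or quarters. For example, create a time series representing eight consecutive quarters from Q2 2008 through Q1 2010.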
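A sketch of that quarterly series, assuming placeholder observations 1 through 8 (the values themselves are invented for illustration):

```r
# Eight quarterly observations starting in the second quarter of 2008
q <- ts(1:8, start = c(2008, 2), frequency = 4)

print(q)      # printed as a year-by-quarter table
start(q)      # year and quarter of the first observation
end(q)        # year and quarter of the last observation
```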
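Another time series example: turkey prices. The US Department of Agriculture has a program that collects retail prices for various meats. The data come from supermarkets representing approximately 20% of the US market and have been averaged by month and region. The data set is included in the nutshell package (as turkey.price.ts).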
> library(nutshell)
> data(turkey.price.ts)
> turkey.price.ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2001 1.58 1.75 1.63 1.45 1.56 2.07 1.81 1.74 1.54 1.45 0.57 1.15
2002 1.50 1.66 1.34 1.67 1.81 1.60 1.70 1.87 1.47 1.59 0.74 0.82
2003 1.43 1.77 1.47 1.38 1.66 1.66 1.61 1.74 1.62 1.39 0.70 1.07
2004 1.48 1.48 1.50 1.27 1.56 1.61 1.55 1.69 1.49 1.32 0.53 1.03
2005 1.62 1.63 1.40 1.73 1.73 1.80 1.92 1.77 1.71 1.53 0.67 1.09
2006 1.71 1.90 1.68 1.46 1.86 1.85 1.88 1.86 1.62 1.45 0.67 1.18
2007 1.68 1.74 1.70 1.49 1.81 1.96 1.97 1.91 1.89 1.65 0.70 1.17
2008 1.76 1.78 1.53 1.90
R includes many useful functions for examining time series objects:
> start(turkey.price.ts)
[1] 2001 1
> end(turkey.price.ts)
[1] 2008 4
> frequency(turkey.price.ts)
[1] 12
> deltat(turkey.price.ts) # time between observations: deltat = 1/frequency = 1/12
[1] 0.08333333
A shingle is a generalization of a factor to a continuous variable. A shingle consists of a numeric vector and a set of intervals. The intervals are allowed to overlap (much like roof shingles; hence the name "shingles"). Shingles are used extensively in the lattice package; they allow you to easily use a continuous variable as a conditioning or grouping variable.
R includes a set of classes to represent dates and times:
Date
Represents dates but not times.
POSIXct
Stores dates and times as seconds since January 1, 1970, 12:00 A.M.
POSIXlt
Stores dates and times in separate vectors. The list includes sec (0–61) , min
(0–59), hour (0–23), mday (day of month, 1–31), mon (month, 0–11), year
(years since 1900), wday (day of week, 0–6), yday (day of year, 0–365), and
isdst (flag for "is daylight savings time").
The date and time classes include functions for addition and subtraction.
In addition, R includes a number of other functions for manipulating time and date objects. Many plotting functions require dates and times.
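A short sketch of date arithmetic with the Date class. The two dates are arbitrary examples.

```r
d1 <- as.Date("2009-09-08")
d2 <- as.Date("2009-12-25")

d1 + 7                 # adding a number of days gives another Date
d2 - d1                # subtracting two dates gives a difftime object
as.numeric(d2 - d1)    # the difference as a plain number of days
```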
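R includes special object types for receiving data from, or sending data to, applications or files outside the R environment. You can create connections to files, URLs, zip-compressed files, gzip-compressed files, bzip-compressed files, Unix pipes, network sockets, and FIFO (first in, first out) objects. You can even read from the system clipboard.
To use a connection, you create it, open it, use it, and close it. For example, suppose you saved some data objects in a file named consumption.RData and want to load the data. R saved the file in a compressed format, so you need to create a connection to the file with gzfile, as follows: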
> consumption.connection <- gzfile(description="consumption.RData",open="r")
> load(consumption.connection)
> close(consumption.connection)
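For more information about connections, see the help file for connection.
Objects in R can have many properties associated with them, called attributes. These properties explain what an object represents and how it should be interpreted by R. Table 7 lists some important attributes.
For the attributes in the table, the standard way to query an attribute a of an object x is a(x). Use the attributes function to get a list of all the attributes of an object.
@ View the attributes of the object
You can also use dimnames(m) directly
@ Convenience functions for accessing row and column names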
> colnames(m)
[1] "c1" "c2" "c3"
> rownames(m)
[1] "r1" "r2" "r3" "r4"
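You can turn a matrix into another kind of object simply by changing its attributes. For example, removing the dim attribute converts the object to a vector.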
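A sketch of inspecting and removing attributes. The definition of m is an assumption made here to match the row and column names shown above (r1 to r4, c1 to c3); the original definition is not shown.

```r
# Assumed definition of m: a 4 x 3 matrix with named rows and columns
m <- matrix(1:12, nrow = 4, ncol = 3,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("c1", "c2", "c3")))

attributes(m)    # a list holding dim and dimnames
dimnames(m)      # the same dimnames, queried directly

v <- m
dim(v) <- NULL   # removing dim also removes dimnames
is.matrix(v)     # no longer a matrix
print(v)         # a plain integer vector
```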
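One more small point:
R has an all.equal function that compares the data and attributes of two objects. The result tells you whether they are equal and, if not, gives the reasons: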
> all.equal(a,b)
[1] "Attributes: < Modes: list, NULL >"
[2] "Attributes: < Lengths: 1, 0 >"
[3] "Attributes: < names for target but not for current >"
[4] "Attributes: < current is not list-like >"
[5] "target is matrix, current is numeric"
If you only want to check whether two objects are exactly the same, and don't care about the reasons, use the identical function:
> identical(a,b)
[1] FALSE
> dim(b) <- c(3,4)
> b[2,2]
[1] 5
> all.equal(a,b)
[1] TRUE
> identical(a,b)
[1] TRUE
For simple objects, class and type are closely related. For complex objects, the two can differ.
To determine the class of an object, you can use the class function. You can determine the underlying type of object using the typeof function. For example:
> x<-c(1,2,3)
> typeof(x)
[1] "double"
> class(x)
[1] "numeric"
You can convert an integer array into a factor.
Every symbol in R is defined within a specific environment. An environment is an R object that contains the set of symbols available in a given context, the objects associated with those symbols, and a pointer to a parent environment. The symbols and associated objects are called a frame.
When R attempts to resolve a symbol, it begins by looking through the current environment. If there is no match in the local environment, then R will recursively search through parent environments looking for a match.
When you define a variable in R, you are actually assigning a symbol to a value within an environment. For example:
> x <- 1
## it assigns the symbol x to a vector object of length 1 with the constant (double) value 1 in the global environment
> x <- 1
> y <- 2
> z <- 3
> v <- c(x, y, z)
> v
[1] 1 2 3
> # v has already been defined, so changing x does not change v
> x <- 10
> v
[1] 1 2 3
You can delay evaluation of an expression so that its symbols are not evaluated immediately:
> x <- 1
> y <- 2
> z <- 3
> v <- quote(c(x, y, z))
> eval(v)
[1] 1 2 3
> x <- 5
> eval(v)
[1] 5 2 3
The same effect can be achieved by creating a promise object, using the delayedAssign function:
> x <- 1
> y <- 2
> z <- 3
> delayedAssign("v", c(x, y, z))
> x <- 5
> v
[1] 5 2 3
Promise objects are used within packages to make objects available to users without
loading them into memory
An R environment is also an object. Table 8-1 shows the functions in R for manipulating environment objects.
To show the set of objects available in the current environment (more precisely, the set of symbols in the current environment associated with objects), use the objects function.
> x<-1
> y<-2
> z<-3
> objects()
[1] "a" "b" "m" "v" "x" "y" "z"
You can use the rm function to remove an object from the current environment.
> rm(x)
> objects()
[1] "a" "b" "m" "v" "y" "z"
When a user starts a new session in R, the R system creates a new environment for objects created during that session. This environment is called the global environment. The global environment is not actually the root of the tree of environments. It's actually the last environment in the chain of environments in the search path. Here's the list of parent environments for the global environment in my R installation.
Every environment has a parent environment, except for the empty environment.
Local variables in functions versus the global environment are easy to understand (details omitted).
@@@ Working with the Call Stack
R maintains a stack of calling environments. (A stack is a data structure in which objects can be added or subtracted from only one end. Think about a stack of trays in a cafeteria; you can only add a tray to the top or take a tray off the top. Adding an object to a stack is called "pushing" the object onto the stack. Taking an object off of the stack is called "popping" the object off the stack.) Each time a new function is called, a new environment is pushed onto the call stack. When R is done evaluating a function, the environment is popped off the call stack.
Table 8-2 shows the functions for manipulating the call stack.
@@@ Evaluating functions in different environments
You can evaluate an expression within an arbitrary environment using the eval function:
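eval(expr, envir = parent.frame(), enclos = if(is.list(envir) || is.pairlist(envir)) parent.frame() else baseenv())
Arguments:
expr is the expression to evaluate. envir is the environment, data frame, or pairlist in which to evaluate expr. When envir is a data frame or pairlist, enclos is the enclosure in which R looks up object definitions. For example: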
timethis <- function(...) {
start.time <- Sys.time();
eval(..., sys.frame(sys.parent(sys.parent())));
end.time <- Sys.time();
print(end.time - start.time);
}
As another example, we record the time it takes to set 10,000 elements of a vector to 1.
> create.vector.of.ones <- function(n) {
+ return.vector <- NA;
+ for (i in 1:n) {
+ return.vector[i] <- 1;
+ }
+ return.vector;
+ }
> timethis(returned.vector<-create.vector.of.ones(10000))
Time difference of 0.165 secs
The main point of these two examples is that the eval function evaluates an expression in the calling environment. Notice that the symbol returned.vector is now defined in that environment:
> length(returned.vector)
[1] 10000
A more efficient version of the code above is:
> create.vector.of.ones.b <- function(n) {
+ return.vector <- NA;
+ length(return.vector) <- n;
+ for (i in 1:n) {
+ return.vector[i] <- 1;
+ }
+ return.vector;
+ }
> timethis(returned.vector <- create.vector.of.ones.b(10000))
Time difference of 0.04076099 secs
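Three useful shorthands are evalq, eval.parent, and local. When you want to quote the expression, use evalq, which is equivalent to eval(quote(expr), ...). When you want to evaluate an expression in the parent environment, use eval.parent, which is equivalent to eval(expr, parent.frame(n)). When you want to evaluate an expression in a new environment, use local, which is equivalent to eval(quote(expr), envir=new.env()).
Here is an example of how to use the eval.parent function:
timethis.b <- function(...) {
start.time <- Sys.time();
eval.parent(...);
end.time <- Sys.time();
print(end.time - start.time);
}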
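Sometimes it is convenient to treat a data frame or list as an environment, which lets you refer to each item in the data frame or list by name. R provides the with and within functions for this:
with(data, expr, ...) # evaluates the expression and returns the result
within(data, expr, ...) # makes changes within the data object and returns the modified object
The argument data is the data frame or list to treat as an environment, expr is the expression, and additional arguments in ... are passed to other methods.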
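The difference between with and within can be sketched with a small example. The scores data frame and its columns are hypothetical, made up only for illustration.

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(math = c(90, 80, 70), art = c(60, 75, 95))

# with: evaluate an expression using the columns as variables,
# and return the value of the expression
mean.total <- with(scores, mean(math + art))

# within: evaluate the expression, then return a modified copy of the data
scores2 <- within(scores, total <- math + art)

print(mean.total)
print(scores2)
```

Note that within returns a new data frame; the original scores object is left unchanged.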
@@@ Adding Objects to an Environment
attach and detach
attach(what, pos = 2, name = deparse(substitute(what)),
warn.conflicts = TRUE)
detach(name, pos = 2, unload = FALSE)
Arguments:
The argument what is the object to attach (called a database), pos specifies the position in the search path in which to attach the element within what, name is the name to use for the attached database (more on what this is used for below), warn.conflicts specifies whether to warn the user if there are conflicts.
You can use attach to make all the elements within a data frame or list accessible by name on the search path.
Be careful when using attach: if objects with the same names already exist in the environment, the result can be confusing. It is often better to use functions like transform to change values within a data frame or with to evaluate expressions using values in a data frame.
You may have noticed that R reports an error when you enter an invalid expression. For example:
> 12 / "hat"
Error in 12/"hat" : non-numeric argument to binary operator
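Sometimes R gives a warning instead. This section explains how the error-handling system works.
@@@Signaling errors
If something occurs in your code that requires you to stop execution, you can use the stop function. For example, when a file that your code needs cannot be found, you can stop execution and print a helpful error message.
If something occurs in your code that you want to tell the user about without stopping, use the warning function. For instance, a function that checks a file could return "lalala" if the file exists and otherwise warn the user that the file does not exist.
If you just want to tell the user something, use the message function. For example: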
> doNothing <- function(x) {
+ message("This function does nothing.")
+ }
> doNothing("another input value")
This function does nothing.
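The stop and warning calls described above can be sketched as follows. The check.file function and its strict argument are hypothetical names chosen for this illustration; they are not from the original text.

```r
# check.file is a hypothetical helper: it returns "lalala" when the file
# exists, and otherwise either stops (strict) or only warns (not strict)
check.file <- function(filename, strict = TRUE) {
  if (file.exists(filename)) {
    return("lalala")
  }
  if (strict) {
    stop("could not open the file: ", filename)
  }
  warning("file does not exist: ", filename)
  invisible(NULL)
}

existing <- tempfile()
invisible(file.create(existing))   # create an empty file to test with
result <- check.file(existing)
print(result)
```

Calling check.file on a missing path raises an error by default, and only a warning when strict = FALSE.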
@@@Catching errors
Use the try function:
Usage: try(expr, silent) # The second argument specifies whether the error message should be suppressed instead of printed to the R console (or stderr); the default is to print errors
#### If the expression results in an error, then try returns an object of class "try-error"
Use the tryCatch function:
Usage: tryCatch(expression, handler1, handler2, ..., finally=finalexpr)
##### tryCatch takes an expression to try, a set of handlers for different conditions, and a final expression to evaluate.
The R interpreter first evaluates expression. If a condition occurs (an error or a warning), R picks the appropriate handler for that condition. After expression has been evaluated, R evaluates finalexpr. (The handlers will not be active when this expression is evaluated.)
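As a minimal sketch of both functions, the following code catches the error from the invalid expression shown earlier. The safe.divide helper is a hypothetical name introduced here for illustration.

```r
# try: the error is captured instead of stopping the script
res <- try(12 / "hat", silent = TRUE)
failed <- inherits(res, "try-error")

# tryCatch: return NA on error, and always run the finally expression
# (safe.divide is a hypothetical helper)
safe.divide <- function(a, b) {
  tryCatch(
    a / b,
    error = function(e) NA,
    finally = message("division attempted")
  )
}

ok <- safe.divide(12, 4)
bad <- safe.divide(12, "hat")
print(c(failed = failed, ok = ok, bad = bad))
```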
Functions are the R objects that evaluate a set of input arguments and return an output value.
In R, function objects are defined with the syntax function(arguments) body. For example:
f <- function(x,y) x + y
f <- function(x,y) {x + y}
1) Arguments may include default values. If you specify a default value for an argument, then the argument is considered optional:
> f <- function(x, y) {x + y}
> f(1,2)
[1] 3
> g <- function(x, y=10) {x + y}
> g(1)
[1] 11
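If an argument has no default value and you do not supply it, R reports an error when the function uses that argument.
2) In R, you can use an ellipsis (...) in the arguments to pass extra arguments on to other functions. For example, you could create a function that prints its first argument and then passes all of the other arguments to the summary function.
In such a function, all of the arguments after x are passed to summary.
3) You can also read the arguments from the variable-length argument list. To do this, convert the ... object to a list within the body of the function.
You can also directly refer to items within the list ... through the variables ..1, ..2, to ..9. Use ..1 for the first item, ..2 for the second, and so on. Named arguments are valid symbols within the body of the function.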
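The three uses of the ellipsis described above can be sketched as follows. All function names here (show.and.summarize, count.args, second.minus.first) are hypothetical and exist only for this illustration.

```r
# Print the first argument, then pass everything else on to summary()
show.and.summarize <- function(x, ...) {
  print(x)
  summary(...)
}
s <- show.and.summarize("first", c(1, 2, 3, 4))

# Convert ... to a list to read a variable-length argument list
count.args <- function(...) {
  args <- list(...)
  length(args)
}
n <- count.args(1, "b", TRUE)

# ..1 and ..2 refer to the first and second items in ...
second.minus.first <- function(...) ..2 - ..1
d <- second.minus.first(10, 25)

print(s)
print(c(n = n, d = d))
```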
Use the return function to specify the value returned by a function:
> f <- function(x) {return(x^2 + 3)}
> f(3)
[1] 12
However, R simply returns the last evaluated expression as the result of a function, so return can usually be omitted:
> f <- function(x) {x^2 + 3}
> f(3)
[1] 12
Functions can also be passed as arguments to other functions. For example:
> a <- 1:7
> sapply(a, sqrt)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
@@@@Anonymous functions
So far, we have only seen named functions. It is possible to create functions that do not have names. These are called anonymous functions. Anonymous functions are usually passed as arguments to other functions. For example:
> apply.to.three <- function(f) {f(3)}
> apply.to.three(function(x) {x * 7}) # anonymous function
[1] 21
In effect, R does the following: it assigns f = function(x) {x * 7}, then evaluates f(3) with x = 3, and finally evaluates 3 * 7 = 21.
Another example:
> a <- c(1, 2, 3, 4, 5)
> sapply(a, function(x) {x + 1})
[1] 2 3 4 5 6
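@@@@Properties of functions
1) R includes many functions that operate on function objects. To see the set of arguments accepted by a function, use the args function.
2) If you want to manipulate the argument list with R code, use the formals function. formals returns a pairlist object (with a pair for every argument). The name of each pair will correspond to each argument name in the function. When a default value is defined, the corresponding value in the pairlist is set to that default; an argument without a default has an empty value. formals works only on objects of type closure.
You may also use formals on the left-hand side of an assignment statement to change the formal arguments of a function.
3) Use the alist function to construct an argument list. alist lets you specify the argument list as if you were defining a function. (Note that for an argument with no default, you do not need to include a value but still need to include the equals sign.)
4) Use the body function to return the body of a function: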
> body(f)
{
x + y + z
}
Like the formals function, the body function can be used on the left-hand side of an assignment statement.
> f
function (x, y = 3, z = 2)
{
x + y + z
}
> body(f) <- expression({x * y * z})
> f
function (x, y = 3, z = 2)
{
x * y * z
}
Note that the body of a function has type expression, so when you assign a new value it must have the type expression.
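As a minimal sketch, the following code inspects and then replaces the arguments and body of a function like the f shown above. The replacement defaults (100 and 200) are arbitrary values chosen for illustration, and the new body is supplied as a quoted call, which is one of the forms that the body assignment accepts.

```r
f <- function(x, y = 3, z = 2) {x + y + z}

# Inspect the formal arguments
arg.names <- names(formals(f))   # "x" "y" "z"
default.y <- formals(f)$y        # 3

# Replace the formals; x has no default, so only "x = " is written
formals(f) <- alist(x = , y = 100, z = 200)
new.sum <- f(1)                  # 1 + 100 + 200

# Replace the body with a quoted call
body(f) <- quote({x * y * z})
new.product <- f(1)              # 1 * 100 * 200

print(c(new.sum = new.sum, new.product = new.product))
```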
(Omitted.)
All functions in R return a value. Some functions also do other things: change variables in the current environment (or in other environments), plot graphics, load or save files, or access the network. These operations are called side effects.
The <<- operator causes side effects. It has the following form: var <<- value.
This operator will cause the interpreter to search through the enclosing environments to find the symbol var, starting with the parent of the current environment. The interpreter will recursively search through parent environments until it either finds the symbol var or reaches the global environment. If it reaches the global environment before the symbol var is found, then R will assign value to var in the global environment. This differs from the <- operator, which always assigns in the current environment.
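A minimal sketch comparing the two assignment operators follows. The function names local.assign and super.assign are hypothetical.

```r
x <- "original"

# <- creates a new local x; the outer x is untouched
local.assign <- function() {
  x <- "set with <-"
  invisible(NULL)
}

# <<- finds x in an enclosing environment and changes it there
super.assign <- function() {
  x <<- "set with <<-"
  invisible(NULL)
}

local.assign()
after.local <- x
super.assign()
after.super <- x

print(c(after.local, after.super))
```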
@@@@Input/output
R does a lot of stuff, but it's not completely self-contained. If you're using R, you'll probably want to load data from external files (or from the Internet) and save data to files. These input/output (I/O) actions are side effects, because they do things other than just return an object. We'll talk about these functions extensively in Chapter 11.
@@@@Graphics
Graphics functions are another example of side effects in R. Graphics functions may return objects, but they also plot graphics (either on screen or to files). We'll talk about these functions in Chapters 13 and 14.
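This part of the book explains how to accomplish some common tasks with R: loading data, transforming data, and saving data. These techniques are useful for any type of data that you want to work with in R.
Method 1: enter the data directly (suitable for small data sets, for example for testing).
Method 2: use edit, which opens a GUI editor. Usage: var <- edit(var). As a shortcut, use the fix function directly: fix(var).
Saving: use the save function. save(filename, file="~/top.5.salaries.RData") # saves the object filename to the path given by file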
##### Usage:
save(..., list =, file =, ascii =, version =, envir =,compress =, eval.promises =, precheck = )
Loading: use the load function. load("~/top.5.salaries.RData") # loads the data
@@@Text files
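R includes a family of functions for importing delimited text files into R, based on the read.table function.
read.table reads a text file into R and returns a data frame object. Each row is interpreted as an observation and each column as a variable. The read.table function assumes that fields are separated by a delimiter.
##### (1) For example, consider a CSV file with the following properties:
• The first row contains the column names.
• Each text field is encapsulated in quotes.
• Each field is separated by commas.
How do you read it?
> top.5.salaries <- read.table("top.5.salaries.csv", header=TRUE, sep=",", quote="\"")
Here header=TRUE specifies that the first row contains the column names, sep="," specifies that the delimiter is a comma, and quote="\"" specifies that character values are encapsulated in double quotes. The read.table function is very flexible and accepts many other arguments.
The most important options are sep and header. R includes several convenience functions that call read.table with different default values, such as read.csv and read.delim.
So most of the time you can use read.csv to read comma-separated files and read.delim to read tab-delimited files without specifying any other arguments.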
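####### (2) As another example, suppose you want to analyze historical stock trading data, which Yahoo! Finance provides. For example, to fetch the monthly closing prices of the S&P 500 index from April 1, 1999 to April 1, 2009, use the following URL:
> URL <- "http://ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv"
> sp500 <- read.csv(URL)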
> # show the first 5 rows
> sp500[1:5,]
Date Open High Low Close Volume Adj.Close
1 2009-04-01 793.59 813.62 783.32 811.08 12068280000 811.08
2 2009-03-02 729.57 832.98 666.79 797.87 7633306300 797.87
3 2009-02-02 823.09 875.01 734.52 735.09 7022036200 735.09
4 2009-01-02 902.99 943.85 804.30 825.88 5844561500 825.88
5 2008-12-01 888.61 918.85 815.69 903.25 5320791300 903.25
If you know that the file you need to load is large, you can use the nrows= argument to load only the first 20 rows and check that your statement works. Once the test succeeds, load the whole file.
2) Fixed-width files
To read a fixed-width text file, use the read.fwf function.
Note: read.fwf also accepts the arguments used by read.table, including as.is, na.strings, colClasses, and strip.white.
For large and complex text files, it is advisable to first use a scripting language such as Perl, Python, or Ruby to process the file into a form that R can digest easily.
3) Other functions for parsing data
#### To read data into R one line at a time, use the function readLines.
##### Another useful function for reading more complex file formats is scan.
Unlike readLines, scan allows you to read data into a specifically defined data structure using the argument what.
Note: Like readLines, you can also use scan to enter data directly into R.
@@@@Other functions
To export data to a text file, use the write.table function.
There are wrapper functions for write.table that call write.table with different defaults, such as write.csv.
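The import and export functions in this section can be sketched as a round trip through a temporary file. The salaries data frame is made up for illustration, and the temporary path stands in for a real file.

```r
# Hypothetical data, written to a temporary CSV file
salaries <- data.frame(name = c("A", "B", "C"), salary = c(100, 200, 300))
path <- tempfile(fileext = ".csv")
write.csv(salaries, file = path, row.names = FALSE)

# Test the import on the first rows only, then load everything
first.two <- read.csv(path, nrows = 2)
everything <- read.table(path, header = TRUE, sep = ",", quote = "\"")

# Read the same file one line at a time
raw.lines <- readLines(path)

unlink(path)   # remove the temporary file
print(everything)
```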
One of the best approaches for working with data from a database is to export the data to a text file and then import the text file into R.
There are two sets of database interfaces available in R:
• RODBC. The RODBC package allows R to fetch data from ODBC (Open DataBase Connectivity) connections. ODBC provides a standard interface for different programs to connect to databases.
• DBI. The DBI package allows R to connect to databases using native database drivers or JDBC drivers. This package provides a common database abstraction for R software. You must install additional packages to use the native drivers for each database.
Which of the two interfaces should you choose? Here are some criteria for reference.
For this example, we will use an SQLite database containing the Baseball Databank database. You do not need to install any additional software to use this database. This file is included in the nutshell package. To access it within R, use the following expression as a filename: system.file("extdata", "bb.db", package = "nutshell").
Getting RODBC working
Before using RODBC, you need to configure an ODBC connection. You only need to do this once.
##### Install the RODBC package
> install.packages("RODBC")
> library(RODBC)
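##### Install the ODBC driver
@ For Windows users, the process for installing SQLite ODBC is as follows.
My installation process:
Step 1: Download the SQLite ODBC Driver from http://www.ch-werner.de/sqliteodbc/
Step 2: Run the installer, accepting the defaults.
Step 3: Configure a DSN (data source name) for the database. Open Administrative Tools → Data Sources (ODBC), choose Add on the User DSN tab, select the SQLite3 ODBC driver, and fill in the SQLite3 ODBC DSN configuration dialog. Enter "bbdb" as the data source name. For the database name, browse to the bb.db database file in the extdata folder of the nutshell package.
###### Open Administrative Tools and select Data Sources
Note: the following command shows the full path of an installed package.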
> system.file(package="nutshell")
[1] "C:/Users/wb-tangyang.b/Documents/R/win-library/3.1/nutshell"
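###### In the ODBC Data Source Administrator, choose Add and select the SQLite3 ODBC Driver.
##### In the SQLite3 ODBC DSN configuration dialog, fill in the details.
@@@@@ The ODBC driver is now configured. Next, access the bbdb file through ODBC. Use the following commands in R to check that the ODBC configuration works.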
> bbdb<-odbcConnect('bbdb')
> odbcGetInfo(bbdb)
Connecting to a database in R is like connecting to a file. First, you need to connect to a database. Next, you can execute any database queries. Finally, you should close the connection.
###### Opening a channel (connection)
To establish a connection, use the odbcConnect function:
odbcConnect(dsn, uid = "", pwd = "", ...)
You need to specify the DSN for the database to which you want to connect. If you did not specify a username and password in the DSN, you may specify a username with the uid argument and a password with the pwd argument. Other arguments are passed to the underlying odbcDriverConnect function. The odbcConnect function returns an object of class RODBC that identifies the connection. This object is usually called a channel.
Here is an example that uses this function to connect to the "bbdb" DSN:
> library(RODBC)
> bbdb <- odbcConnect("bbdb")
##### Getting information about the database
You can get information about an ODBC connection using the odbcGetInfo function.
This function takes a channel (the object returned by odbcConnect) as its only argument. It returns a character vector with information about the driver and connection;
To get the list of tables in the underlying database, use the sqlTables function. This function returns a data frame with information about the available tables.
> sqlTables(bbdb) # empty because there is no data
[1] TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
<0 rows> (or 0-length row.names)
To get detailed information about the columns in a particular table, use the sqlColumns function.
###### Getting data
Finally, we've gotten to the interesting part: executing queries in the database and returning results. RODBC provides some functions that let you query a database even if you don't know SQL.
To fetch a table or view from the underlying database, use the sqlFetch function. This function returns a data frame containing the contents of the table.
sqlFetch(channel, sqtable, ..., colnames = , rownames = )
> teams <- sqlFetch(bbdb,"Teams")
> names(teams)
After loading the table into R, you can easily manipulate the data using R commands
You can also execute an arbitrary SQL query in the underlying database using the sqlQuery function:
sqlQuery(channel, query, errors = , max =, ..., rows_at_time = )
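If you want to fetch data from a very large table, it is best not to fetch all of the data at once. The RODBC library provides a mechanism for fetching results piecewise. First, call sqlQuery or sqlFetch, but specify a value for max, which tells the function the maximum number of rows to retrieve each time. You can then fetch the remaining rows with the sqlGetResults function: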
sqlGetResults(channel, as.is = , errors = , max = , buffsize = ,
nullstring = , na.strings = , believeNRows = , dec = ,
stringsAsFactors = )
In fact, sqlQuery calls sqlGetResults to fetch the results of the query. (If you are using sqlFetch, the corresponding function to fetch additional rows is sqlFetchMore.)
By the way, notice that the sqlQuery function can be used to execute any valid query in the underlying database. It is most commonly used to just query results (using SELECT queries), but you can enter any valid data manipulation language query (including SELECT, INSERT, DELETE, and UPDATE queries) and data definition language query (including CREATE, DROP, and ALTER queries).
###### Closing a channel
When you are done using an RODBC channel, you can close it with the odbcClose function. This function takes the connection name as its only argument:
> odbcClose(bbdb)
Conveniently, you can also close all open channels using the odbcCloseAll function. It is generally a good practice to close connections when you are done, because this frees resources locally and in the underlying database.
One important difference between the DBI packages and the RODBC package is in the objects they use: DBI uses S4 objects to represent drivers, connections, and other objects.
Table 11-3 shows the set of database drivers available through this interface.
Installing and loading the RSQLite package
> install.packages("RSQLite")
> library(RSQLite)
If you are familiar with SQL but new to SQLite, you may want to review what SQL commands are supported by SQLite. You can find this list at http://www.sqlite.org/lang.html.
Opening a connection
To open a connection with DBI, use the dbConnect function:
dbConnect(drv, ...)
Getting database information
Querying the database
Cleaning up
There is one last database interface in R that you might find useful: TSDBI. TSDBI is an interface specifically designed for time series data. There are TSDBI packages for many popular databases, as shown in Table 11-4.
Today, one of the most important sources for data is Hadoop. To learn more about Hadoop, including instructions on how to install R packages for working with Hadoop data on HDFS or in HBase, see "R and Hadoop" on page 549.
Everyone loves building models, drawing charts, and playing with cool algorithms. Unfortunately, most of the time you spend on data analysis projects is spent on preparing data for analysis. I'd estimate that 80% of the effort on a typical project is spent on finding, cleaning, and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)
Combining data sets is mainly useful for working with data stored in different places (similar to the various joins in SQL).
R provides several functions that allow you to paste together multiple data structures into a single structure.
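The simplest of these functions is paste. It concatenates multiple character vectors into a single vector. (Any value that is not a character is first coerced to character.)
By default, values are separated by a space; you can specify another separator with the sep argument.
If you would like all of the values in the returned vector to be concatenated into a single string, specify the collapse argument. The value of collapse is used as the separator between the values.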
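The effect of sep and collapse can be sketched with a few calls. The input values are arbitrary.

```r
# Default separator is a single space; the numbers are coerced to character
spaced <- paste("a", 1:3)                               # "a 1" "a 2" "a 3"

# A custom separator between the pasted values
dashed <- paste("a", 1:3, sep = "-")                    # "a-1" "a-2" "a-3"

# collapse joins the resulting vector into one string
joined <- paste("a", 1:3, sep = "-", collapse = "+")    # "a-1+a-2+a-3"

print(spaced)
print(dashed)
print(joined)
```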
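#### The cbind function combines objects by adding columns. You can think of it as merging two tables horizontally. For example, first create the top.5.salaries data frame by entering its values with fix:
> top.5.salaries <- NULL
> top.5.salaries
NULL
> top.5.salaries <- data.frame(top.5.salaries)
> top.5.salaries <- fix(top.5.salaries)
Next, create a data frame with two columns (year and rank):
> year <- c(2008, 2008, 2008, 2008, 2008)
> rank <- c(1, 2, 3, 4, 5)
> more.cols <- data.frame(year, rank)
Then combine the two data frames using the cbind function:
> cbind(top.5.salaries, more.cols)
##### Similarly, the rbind function combines objects by rows. You can think of it as merging two tables vertically.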
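Since the example above depends on values typed into an editor, here is a self-contained sketch of cbind and rbind. The players data frame and its values are hypothetical.

```r
# Hypothetical data frames, used only for illustration
players <- data.frame(name = c("P1", "P2"), salary = c(10, 20))
extra <- data.frame(year = c(2008, 2008), rank = c(1, 2))

# cbind: same rows, more columns
wide <- cbind(players, extra)

# rbind: same columns, more rows
more <- data.frame(name = "P3", salary = 30)
tall <- rbind(players, more)

print(wide)
print(tall)
```

Note that cbind expects the same number of rows in each object, and rbind expects matching column names.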
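###### Extended example
To show how to fetch and combine data and build a data frame for analysis, we'll use an example from the previous chapter: stock quotes. Yahoo! Finance allows you to download CSV files with stock quotes for a single ticker.
Suppose we want a data set of stock quotes for multiple securities (for example, the 30 stocks in the DJIA). We need to combine the individual data sets returned by each query. First, write a function that assembles the URL and then fetches a data frame with the contents.
The function works as follows. First, it defines the URL (the author determined the format of the URL by trial and error) and uses paste to put all of the character values together. Then it fetches the URL with read.csv and assigns the data frame to the symbol tmp. The data frame has most of the information we want but no ticker symbol, so the function uses cbind to attach a vector of ticker symbols to the data frame. (By the way, the function uses Date objects to represent dates.) I also used the current date as the default value for to, and the date one year ago as the default value for from.
Example URL: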
http://ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv
get.quotes <- function(ticker, # ticker is the stock symbol
from=(Sys.Date()-365), # default date range: from one year ago until today
to=(Sys.Date()),
interval="d") { # interval between quotes; "d" means daily
# define parts of the URL
base <- "http://ichart.finance.yahoo.com/table.csv?"; # base part of the URL
symbol <- paste("s=", ticker, sep=""); # ticker symbol
# months are numbered from 00 to 11, so format the month correctly
from.month <- paste("&a=",
formatC(as.integer(format(from,"%m"))-1,width=2,flag="0"), sep=""); # month, extracted from the date and zero-padded to two digits
from.day <- paste("&b=", format(from,"%d"), sep=""); # day
from.year <- paste("&c=", format(from,"%Y"), sep=""); # year
to.month <- paste("&d=",
formatC(as.integer(format(to,"%m"))-1,width=2,flag="0"),
sep="");
to.day <- paste("&e=", format(to,"%d"), sep="");
to.year <- paste("&f=", format(to,"%Y"), sep="");
inter <- paste("&g=", interval, sep="");
last <- "&ignore=.csv";
# put together the url
url <- paste(base, symbol, from.month, from.day, from.year,
to.month, to.day, to.year, inter, last, sep="");
# get the file
tmp <- read.csv(url);
# add a new column with ticker symbol labels
cbind(symbol=ticker,tmp);
}
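Next, write a function that returns a data frame containing stock quotes for multiple ticker symbols. This function simply calls get.quotes once for each ticker in the tkrs vector and combines the results using rbind: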
get.multiple.quotes <- function(tkrs,
from=(Sys.Date()-365),
to=(Sys.Date()),
interval="d") {
tmp <- NULL;
for (tkr in tkrs) {
if (is.null(tmp))
tmp <- get.quotes(tkr,from,to,interval)
else tmp <- rbind(tmp,get.quotes(tkr,from,to,interval))
}
tmp
}
Finally, define a vector containing the set of ticker symbols in the DJIA and build a data frame from the fetched data.
> dow.tickers <- c("MMM", "AA", "AXP", "T", "BAC", "BA", "CAT", "CVX",
"CSCO", "KO", "DD", "XOM", "GE", "HPQ", "HD", "INTC",
"IBM", "JNJ", "JPM", "KFT", "MCD", "MRK", "MSFT", "PFE",
"PG", "TRV", "UTX", "VZ", "WMT", "DIS")
> # date on which I ran this code
> Sys.Date()
[1] "2012-01-08"
> dow30 <- get.multiple.quotes(dow.tickers) # get.multiple.quotes only needs the ticker symbols
For example, to fetch the stock data for Alibaba, just enter:
> alibaba<-get.multiple.quotes('BABA')
> head(alibaba)
symbol Date Open High Low Close Volume Adj.Close
1 BABA 2015-04-08 83.30 85.54 83.07 85.39 26087700 85.39
2 BABA 2015-04-07 81.94 82.95 81.88 82.21 9386400 82.21
3 BABA 2015-04-06 82.05 82.59 81.61 81.82 12758900 81.82
4 BABA 2015-04-02 82.88 83.00 81.25 82.28 19784800 82.28
5 BABA 2015-04-01 83.37 83.72 82.18 82.36 14856100 82.36
6 BABA 2015-03-31 83.64 84.45 83.20 83.24 11763800 83.24
nice job!!!
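For example, let's return to the Baseball Databank database that we used in "Importing Data From Databases". In this database, player information is stored in the Master table, and players are uniquely identified by the playerID column.
> dbListFields(con,"Master")
Batting information is stored in the Batting table. Players are identified by the playerID column in this table as well.
> dbListFields(con, "Batting")
Suppose that you want to show batting statistics for each player (along with the player's name and age). This requires merging data from the two tables. In R, use the merge function.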
> batting <- dbGetQuery(con, "SELECT * FROM Batting")
> master <- dbGetQuery(con, "SELECT * FROM Master")
> batting.w.names <- merge(batting, master)
In this case, the two tables have only one variable in common, playerID:
> intersect(names(batting), names(master))
[1] "playerID"
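By default, merge uses the variables common to both data frames as the merge keys, so in this case we do not need to specify any other arguments. Here is the usage of the merge function: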
merge(x, y, by = , by.x = , by.y = , all = , all.x = , all.y = ,
sort = , suffixes = , incomparables = , ...)
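By default, merge is equivalent to a NATURAL JOIN in SQL. You can specify other columns with by (or by.x and by.y) to get, for example, an INNER JOIN on those columns. You can specify the all argument (or all.x and all.y) to get a full or one-sided OUTER JOIN. If there are no matching field names, or if by is of length 0 (or by.x and by.y are of length 0), then merge will return the full Cartesian product of x and y.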
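The join types described above can be sketched with two small data frames. The batting.toy and master.toy tables are hypothetical stand-ins for the Batting and Master tables.

```r
# Hypothetical stand-ins for the Batting and Master tables
batting.toy <- data.frame(playerID = c("p1", "p2", "p3"), HR = c(5, 0, 12))
master.toy <- data.frame(playerID = c("p1", "p2", "p4"),
                         nameLast = c("Adams", "Baker", "Diaz"))

# Natural join on the shared column playerID: only p1 and p2 match
inner <- merge(batting.toy, master.toy)

# Left outer join: keep every row of batting.toy (p3 gets NA for nameLast)
left <- merge(batting.toy, master.toy, all.x = TRUE)

# Full outer join: keep unmatched rows from both sides
full <- merge(batting.toy, master.toy, all = TRUE)

# by of length 0: full Cartesian product (3 x 3 rows)
cross <- merge(batting.toy, master.toy, by = NULL)

print(left)
```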
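Sometimes, there will be some variables in your source data that aren't quite right. This section explains how to change a variable in a data frame.
The most convenient way to redefine a variable in a data frame is to use the assignment operators. For example, suppose that you want to change the type of a variable in the alibaba data frame created above. When read.csv imported the data, the Date field was interpreted as a character string and converted to a factor.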
> class(alibaba$Date)
[1] "factor"
Luckily, Yahoo! Finance prints dates in the default date format for R, so we can just transform these values into Date objects using the as.Date function:
> class(alibaba$Date)
[1] "factor"
> alibaba$Date<-as.Date(alibaba$Date)
> class(alibaba$Date)
[1] "Date"
Of course, you can make other changes as well. For example, define a new midpoint variable that is the mean of the high and low price:
> alibaba$mid<-(alibaba$High+alibaba$Low)/2
> names(alibaba)
[1] "symbol" "Date" "Open" "High" "Low" "Close"
[7] "Volume" "Adj.Close" "mid"
A convenient function for changing variables in a data frame is the transform function. It is defined as follows:
transform(`_data`, ...)
To use transform, you specify a data frame (as the first argument) and a set of expressions that use variables within the data frame. The transform function applies each expression to the data frame and then returns the final data frame. For example, we can use transform to perform both of the tasks above: converting the Date column to Date objects and adding a new midpoint column.
> alibaba.transformed<-transform(alibaba,Date=as.Date(Date),mid=(High+Low)/2)
> head(alibaba.transformed)
When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object).
To apply a function to parts of an array (or matrix), use the apply function:
apply(X, MARGIN, FUN, ...)
apply accepts three arguments: X is the array to which a function is applied, FUN is the function, and MARGIN specifies the dimensions to which you would like to apply a function. Optionally, you can specify arguments to FUN as additional arguments to apply.
Example 1) To show how this function works, here is a simple example. Start by constructing a data set x.
First, apply the max function to select the largest element of each row. (These are the values in the rightmost column: 16, 17, 18, 19, and 20.) In the call to apply, specify X=x, MARGIN=1 (rows are the first dimension), and FUN=max.
> apply(X = x,MARGIN = 1,FUN = max)
[1] 16 17 18 19 20
Applying max to the columns instead gives:
> apply(X = x,MARGIN = 2,FUN = max)
[1] 5 10 15 20
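The construction of x is not shown above. A matrix that is consistent with the printed results, assuming x holds the integers 1 to 20 in five rows and four columns filled column by column, can be sketched as:

```r
# Assumed construction of x: 5 rows, 4 columns, filled column by column
x <- matrix(1:20, nrow = 5)

row.max <- apply(X = x, MARGIN = 1, FUN = max)   # largest value in each row
col.max <- apply(X = x, MARGIN = 2, FUN = max)   # largest value in each column

print(row.max)
print(col.max)
```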
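Example 2) Here is a more complicated example that uses the MARGIN argument to apply a function to a multidimensional data set: a three-dimensional array. (We'll switch to the function paste to show which elements were included.)
First, look at which values are grouped for each value of MARGIN: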
> apply(X = x, MARGIN = 1,FUN = paste,collapse='')
[1] "147101316192225" "258111417202326" "369121518212427"
> apply(X = x, MARGIN = 2,FUN = paste,collapse='')
[1] "123101112192021" "456131415222324" "789161718252627"
> apply(X = x, MARGIN = 3,FUN = paste,collapse='')
[1] "123456789" "101112131415161718" "192021222324252627"
Next, a more complicated example. Let's select MARGIN=c(1, 2) to see which elements are selected.
With MARGIN=c(1, 2), this is the equivalent of doing the following: for each value of i between 1 and 3 and each value of j between 1 and 3, calculate FUN of x[i, j, 1], x[i, j, 2], x[i, j, 3].
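As a sketch of the MARGIN=c(1, 2) case, assume the three-dimensional array holds the integers 1 to 27 in a 3 x 3 x 3 layout, which is consistent with the single-margin results printed above.

```r
# Assumed construction of the array: 3 x 3 x 3, values 1 to 27
x <- array(1:27, dim = c(3, 3, 3))

# For each (i, j), paste together x[i, j, 1], x[i, j, 2], x[i, j, 3]
by.cell <- apply(X = x, MARGIN = c(1, 2), FUN = paste, collapse = "")

print(by.cell)
```

The result is a 3 x 3 character matrix. For instance, the cell for i = 1 and j = 1 combines the values 1, 10, and 19.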
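To apply a function to each element in a vector or a list and return a list, you can use the function lapply. The function lapply requires two arguments: an object X and a function FUN. (You may specify additional arguments that will be passed to FUN.)
You can also apply a function to a data frame; the function is applied to each vector (column) in the data frame.
Sometimes you would prefer to get back a vector, matrix, or array instead of a list. In that case, use the sapply function. It works just like lapply, except that it returns a vector or matrix when appropriate.
Another related function is mapply, the multivariate version of sapply:
mapply(FUN, ..., MoreArgs = , SIMPLIFY = , USE.NAMES = )
This function applies FUN to the first elements of each vector, then to the second elements, and so on, until it reaches the last elements. For example:
> mapply(paste,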
+ c(1, 2, 3, 4, 5),
+ c("a", "b", "c", "d", "e"),
+ c("A", "B", "C", "D", "E"),
+ MoreArgs=list(sep="-"))
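The differences between lapply, sapply, and mapply can be sketched with small inputs. The data frame d is hypothetical.

```r
# lapply always returns a list
squares.list <- lapply(1:3, function(v) v^2)

# sapply simplifies the result to a vector when it can
squares.vec <- sapply(1:3, function(v) v^2)

# On a data frame, the function is applied to each column
d <- data.frame(x = 1:3, y = c(2, 4, 6))
col.means <- sapply(d, mean)

# mapply walks several vectors in parallel
pairs <- mapply(paste, c(1, 2, 3), c("a", "b", "c"),
                MoreArgs = list(sep = "-"))

print(squares.vec)
print(col.means)
print(pairs)
```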
The plyr package contains a set of 12 logically named functions for applying another function to an R data object and returning the results. Each of these functions takes an array, data frame, or list as input and returns an array, data frame, list, or nothing as output.
The most commonly used plyr functions share a common set of arguments. Other arguments depend on the input and output: whether the input is an array or a data frame, and whether the output is dropped.
Examples are omitted here; see the plyr package study notes.
Another common data transformation is to group a set of observations into bins based on the value of a specific variable.
For example, suppose you have some time series data with daily observations, but you want to summarize the data by month. R has several functions for binning numeric data.
Shingles are a way to represent intervals in R. They can be overlapping, like roof shingles (hence the name). Shingles are used extensively in the lattice package, for example when you want to use a numeric value as a conditioning value.
To create shingles in R, use the shingle function:
shingle(x, intervals=sort(unique(x)))
Use the intervals argument to specify where to separate the bins. It can be a numeric vector of breaks or a two-column matrix in which each row represents a specific interval.
To create shingles where the number of observations is the same in each bin, you can use the equal.count function:
equal.count(x, ...)
The function cut is useful for taking a continuous variable and splitting it into discrete pieces. Here is the default form of cut for use with numeric vectors:
# numeric form
cut(x, breaks, labels = NULL,
include.lowest = FALSE, right = TRUE, dig.lab = 3,
ordered_result = FALSE, ...)
There is another version of cut for manipulating Date objects:
# Date form
cut(x, breaks, labels = NULL, start.on.monday = TRUE,
right = FALSE, ...)
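The cut function takes a numeric vector as input and returns a factor. Each level in the factor corresponds to an interval of values in the input vector.
For example, suppose that you want to count the number of players whose batting averages fall within certain ranges. You can use the cut function together with the table function.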
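The combination of cut and table can be sketched with a few made-up batting averages. The values and break points below are hypothetical.

```r
# Hypothetical batting averages
avg <- c(0.21, 0.25, 0.27, 0.31, 0.33)

# Split into three intervals; by default each interval is closed on the right
bins <- cut(avg, breaks = c(0.20, 0.25, 0.30, 0.35))

# Count how many values fall in each interval
counts <- table(bins)
print(counts)
```

Because right = TRUE by default, the value 0.25 falls in the first interval rather than the second.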
Sometimes you would like to combine a set of similar objects (either vectors or data frames) into a single data frame, with a column labeling the source. You can do this with the make.groups function in the lattice package:
library(lattice)
make.groups(...)
For example, combine three different vectors into one data frame:
hat.sizes <- seq(from=6.25, to=7.75, by=.25)
pants.sizes <- c(30, 31, 32, 33, 34, 36, 38, 40)
shoe.sizes <- seq(from=7, to=12)
make.groups(hat.sizes, pants.sizes, shoe.sizes)
One way to take a subset of a data set is to use the bracket notation.
For example, suppose we want to select only the batting data for 2008. The column batting.w.names$yearID contains the year, so the expression batting.w.names$yearID==2008 generates a vector of logical values. Now we just have to index the data frame batting.w.names with this vector to select only rows for the year 2008.
Similarly, we can use the same notation to select columns. Suppose that we wanted to keep only the variables nameFirst, nameLast, AB, H, and BB. We could provide these in the brackets as well.
As an alternative, you can use the subset function to select rows and columns from a data frame or matrix:
subset(x, subset, select, drop = FALSE, ...)
Compared with the bracket notation, subset requires much less code, because subset allows you to use variable names from the data frame when selecting subsets.
For example, here is the subsetting from above, done again with subset:
> batting.w.names.2008 <- subset(batting.w.names, yearID==2008)
> batting.w.names.2008.short <- subset(batting.w.names, yearID==2008,
+ c("nameFirst","nameLast","AB","H","BB"))
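The two approaches can be compared side by side on a small data frame. The toy data frame is a hypothetical stand-in for batting.w.names.

```r
# Hypothetical stand-in for batting.w.names
toy <- data.frame(nameLast = c("Adams", "Baker", "Cole"),
                  yearID = c(2007, 2008, 2008),
                  AB = c(300, 410, 95),
                  H = c(80, 120, 20))

# Bracket notation: logical row index, then a vector of column names
by.bracket <- toy[toy$yearID == 2008, c("nameLast", "AB", "H")]

# subset: column names can be used directly in the condition
by.subset <- subset(toy, yearID == 2008, select = c("nameLast", "AB", "H"))

print(by.subset)
```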
Often, it is desirable to take a random sample of a data set. Sometimes, you might have too much data (for statistical reasons or for performance reasons). Other times, you simply want to split your data into different parts for modeling (usually into training, testing, and validation subsets).
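The simplest way to take a random sample is to use the sample function. It returns a random sample of the elements of a vector:
sample(x, size, replace = FALSE, prob = NULL)
Be careful when using sample on a data frame: a data frame is implemented as a list of vectors, so sample just takes a random sample of the elements of the list and returns a random sample of the columns.
##### In practice, to take a random sample of the observations in a data set, use sample to create a random sample of row numbers and then use the index operators to select those rows. For example, you could take a random sample of five rows from the batting.2008 data set this way.
##### You can also use this approach to select a more complicated random subset. For example, suppose that we wanted to select statistics for three randomly chosen teams:
> field.goals.3teams <- field.goals[is.element(field.goals$away.team, sample(levels(field.goals$away.team), 3)), ]
This approach is useful when you just want to sample all observations at random. Often, however, you want to do something more complicated, such as stratified sampling, cluster sampling, maximum entropy sampling, or other sophisticated methods. You can find many of these methods in the sampling package. For an example using this package to do stratified sampling, see "Machine Learning Algorithms for Classification" on page 477.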
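The difference between sampling columns and sampling rows can be sketched as follows. The toy data frame and the seed are arbitrary choices made for illustration.

```r
set.seed(1)   # arbitrary seed, only to make the sketch repeatable

# Hypothetical data set with 20 observations
toy <- data.frame(id = 1:20, value = (1:20)^2)

# sample() on a data frame picks columns, not rows
cols <- sample(toy, 1)

# To sample observations, sample row numbers and index with them
rows <- toy[sample(nrow(toy), 5), ]

print(dim(cols))
print(rows)
```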
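Suppose that you want to know the average number of pages delivered to each user. To find the answer, you need to look at each HTTP transaction (each request for content), group the requests into sessions, and then count the number of requests.
1) The tapply function is a very flexible way to summarize a vector X. You can specify which subsets of X to summarize:
tapply(X, INDEX, FUN = , ..., simplify = )
##### For example, use tapply to sum the number of home runs by team. This again uses the batting.2008 data set, which is in the nutshell package. The following commands show the path where the nutshell package is installed: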
> system.file(package = 'nutshell')
[1] "C:/Users/wb-tangyang.b/Documents/R/win-library/3.1/nutshell"
> system.file("data",package = 'nutshell')
[1] "C:/Users/wb-tangyang.b/Documents/R/win-library/3.1/nutshell/data"
Then open that path. The data subdirectory contains the file batting.2008.rda, so load the data directly with the data function and call tapply:
> tapply(X=batting.2008$HR, INDEX=list(batting.2008$teamID), FUN=sum)
##### You can also apply functions that return multiple items, such as fivenum (which returns a vector containing the minimum, lower-hinge, median, upper-hinge, and maximum values). For example, here fivenum is applied to the batting averages of each player, aggregated by league:
> tapply(X = (batting.2008$H/batting.2008$AB),INDEX = list(batting.2008$lgID),FUN = fivenum)
#### You can also use tapply to calculate summaries over multiple dimensions. For example, calculate the mean number of home runs per player by league and batting hand:
> tapply(X=(batting.2008$HR),INDEX=list(batting.2008$lgID,batting.2008$bats),FUN=mean)
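(As a side note, there is no equivalent to tapply in the plyr package.)
The function closest to tapply is by. The only difference is that by works on data frames. The INDEX argument of tapply is replaced by the INDICES argument.
Another function for summarizing data is aggregate.
Usage: aggregate(x, by, FUN, ...)
aggregate can also be applied to time series, in which case the arguments are slightly different: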
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
ts.eps = getOption("ts.eps"), ...)
For example, we can use aggregate to summarize batting statistics by team:
> aggregate(x=batting.2008[, c("AB", "H", "BB", "2B", "3B", "HR")], by=list(batting.2008$teamID), FUN=sum)
To calculate the sum of particular variables in an object, grouped by a grouping variable, use the rowsum function.
Usage: rowsum(x, group, reorder = TRUE, ...)
For example:
> rowsum(batting.2008[,c("AB", "H", "BB", "2B", "3B", "HR")],group=batting.2008$teamID)
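The three summary functions can be compared on a small data frame that does not require the nutshell package. The toy data and its team labels are hypothetical.

```r
# Hypothetical batting data for two teams
toy <- data.frame(teamID = c("A", "A", "B", "B", "B"),
                  HR = c(1, 2, 3, 4, 5),
                  H = c(10, 20, 30, 40, 50))

# tapply: summarize one vector by group
hr.by.team <- tapply(X = toy$HR, INDEX = list(toy$teamID), FUN = sum)

# aggregate: summarize several columns by group, returning a data frame
agg <- aggregate(x = toy[, c("HR", "H")], by = list(team = toy$teamID), FUN = sum)

# rowsum: group sums only, with the groups as row names
rs <- rowsum(toy[, c("HR", "H")], group = toy$teamID)

print(hr.by.team)
print(agg)
print(rs)
```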
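1) The simplest function for counting the number of observations that take on a value is the tabulate function. This function takes a vector of integers, counts how many times each positive integer value occurs, and returns a vector of counts.
For example, count the number of players who hit 0 HR, 1 HR, 2 HR, 3 HR, and so on. Because tabulate ignores values below 1, add 1 to HR so that players with 0 HR are counted:
> HR.cnts <- tabulate(batting.w.names.2008$HR + 1)
> # tabulate doesn't label results, so let's add names:
> names(HR.cnts) <- 0:(length(HR.cnts) - 1)
2) A related function (for categorical values) is the table function.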
table(..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
"ifany", "always"), dnn = list.names(...), deparse.level = 1)
The table function returns a table object showing the number of observations that have each possible categorical value.
###### For example, suppose we wanted to count the number of left-handed batters, right-handed batters, and switch hitters in 2008:
> table(batting.2008$bats)
B L R
118 401 865
##### As another example, table can generate a two-dimensional table showing the number of players who batted and threw with each hand.
3) Another useful function is xtabs, which creates contingency tables from factors using formulas:
xtabs(formula = ~., data = parent.frame(), subset, na.action,
exclude = c(NA, NaN), drop.unused.levels = FALSE)
Note: xtabs is similar to table, except that xtabs lets you specify the groupings with a formula and a data frame. For example, use xtabs to tabulate batting statistics by batting arm and league:
> xtabs(~bats+lgID, batting.2008)
The table function treats its inputs as factors, but sometimes you may want to compute tables from numeric variables. For example, suppose you wanted to count the number of players with batting averages in certain ranges. In that case, you can use the cut function together with table:
> # first, add batting average to the data frame:
> batting.w.names.2008 <- transform(batting.w.names.2008, AVG = H/AB)
> # now, select a subset of players with over 100 AB (for some
> # statistical significance):
> batting.2008.over100AB <- subset(batting.w.names.2008, subset=(AB > 100))
> # finally, split the results into 10 bins:
> battingavg.2008.bins <- cut(batting.2008.over100AB$AVG,breaks=10)
> table(battingavg.2008.bins)
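The counting functions in this section can be sketched on a small data frame. The toy data is hypothetical, and the tabulate call adds 1 to the values because tabulate only counts positive integers.

```r
# Hypothetical players: batting hand, throwing hand, home runs
toy <- data.frame(bats = c("L", "R", "R", "B", "R"),
                  throws = c("L", "R", "R", "R", "L"),
                  HR = c(0, 2, 0, 1, 2))

# One-way and two-way counts with table
one.way <- table(toy$bats)
two.way <- table(toy$bats, toy$throws)

# The same two-way table using a formula and a data frame
same.two.way <- xtabs(~ bats + throws, data = toy)

# tabulate counts positive integers, so shift HR by 1 to include 0
hr.counts <- tabulate(toy$HR + 1)
names(hr.counts) <- 0:(length(hr.counts) - 1)

print(two.way)
print(hr.counts)
```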
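### On a matrix: the t function returns the transpose of the matrix.
### On a vector: when t is called on a vector, the vector is treated as a single column of a matrix, so t returns a matrix with a single row.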
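Both cases can be sketched with small inputs chosen for illustration.

```r
# Transposing a 2 x 3 matrix gives a 3 x 2 matrix
m <- matrix(1:6, nrow = 2)
tm <- t(m)

# Transposing a vector of length 3 gives a 1 x 3 matrix
v <- 1:3
tv <- t(v)

print(dim(tm))
print(dim(tv))
```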
R includes several functions for converting between narrow and wide data formats. Here we use the stock data to see how these functions work.
> my.quotes <- get.multiple.quotes(my.tickers, from=as.Date("2015-01-01"), to=as.Date("2015-03-31"), interval="m")
> my.quotes.narrow<-my.quotes[,c(1,2,6)]
> unstack(my.quotes.narrow, form=Close~symbol) # form is a formula: the left side gives the values, the right side the grouping variables
Notice that the unstack operation retains the order of observations but loses the Date column. (It's probably best to use unstack with data in which there are only two variables that matter.)
R includes a more powerful tool for changing the shape of a data frame: the reshape function.
Before explaining how to use this function in detail, let's look at a few examples:
> my.quotes.wide <- reshape(my.quotes.narrow, idvar="Date", timevar="symbol", direction="wide")
> my.quotes.wide
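The arguments to reshape are stored as attributes of the resulting data frame.
Alternatively, you can make each row represent a stock and each column represent a different date: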
> reshape(my.quotes.narrow, idvar="symbol", timevar="Date", direction="wide")
The tricky thing about reshape is that it is actually two functions in one: a function that transforms long data to wide data and a function that transforms wide data to long data. The direction argument specifies whether you want a data frame that is "long" or "wide."
When transforming to wide data, you need to specify the idvar and timevar arguments.When transforming to long data, you need to specify the varying argument.
By the way, calls to reshape are reversible. If you have an object d that was created by a call to reshape, you can call reshape(d) to get back the original data frame:
reshape(data, varying = , v.names = , timevar = , idvar = , ids = , times = ,
drop = , direction, new.row.names = , sep = , split = )
The arguments are described below.
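The long-to-wide conversion and the reversibility of reshape can be sketched with a tiny data frame. The `quotes` data and the ticker names "AAA" and "BBB" are hypothetical; they only mimic the Date/symbol/Close layout of the narrow quote data.

```r
# Hypothetical narrow (long) data: one row per date and symbol
quotes <- data.frame(
  Date   = rep(as.Date(c("2015-01-01", "2015-02-01")), times = 2),
  symbol = rep(c("AAA", "BBB"), each = 2),
  Close  = c(10, 11, 20, 21)
)

# Long to wide: one row per Date, one Close column per symbol
wide <- reshape(quotes, idvar = "Date", timevar = "symbol",
                direction = "wide")
names(wide)

# reshape() stored its arguments as attributes of `wide`,
# so calling it again with no other arguments reverses the operation
back <- reshape(wide)
nrow(back)
```

The wide result has one `Close.<symbol>` column per ticker, and the reversed result has one row per date and symbol again.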
Many R users (like me) find the built-in functions for reshaping data (like stack, unstack, and reshape) confusing. Luckily, there's an alternative: Hadley Wickham developed a package called reshape. (Don't confuse the reshape library with the reshape function.)
Melting and casting
The process of turning a table of data into a set of transactions is called melting, and the process of turning the list of transactions back into a table is called casting.
An example using reshape
First, melt the stock quote data:
my.molten.quotes <- melt(my.quotes)
Now that we have the data in molten form, we can manipulate it with the cast function:
cast(data=my.molten.quotes, variable~Date, subset=(symbol=='baba'))
That was a brief introduction; the details follow.
Melt
melt is a generic function; the reshape package includes methods for data frames, arrays, and lists.
melt.data.frame(data, id.vars, measure.vars, variable_name, na.rm,
preserve.na, ...)
Argument descriptions
You simply need to specify the dimensions to keep, and melt will melt the array.
melt.array(data, varnames, ...)
The list form of melt will recursively melt each element in the list, join the results, and return the joined form.
melt.list(data, ..., level)
Cast
After you have melted your data, you use cast to reshape the results. Here is a description of the arguments to cast
cast(data, formula, fun.aggregate=NULL, ..., margins, subset, df, fill,
add.missing, value = guess_value(data))
Data cleaning doesn't mean changing the meaning of data. It means identifying problems caused by data collection, processing, and storage processes and modifying the data so that these problems don't interfere with analysis.
Data sources often contain duplicate values. Depending on how you plan to use the data, the duplicates might cause problems. It's a good idea to check for duplicates in your data.
R provides several useful tools for detecting duplicate values.
The duplicated function returns a logical vector showing which elements are duplicates of values with lower indices:
> duplicated(my.quotes.2)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
The last three rows are flagged as duplicates. Next, remove them:
> my.quotes.unique <- my.quotes.2[!duplicated(my.quotes.2),]
Alternatively, you can use the unique function to remove duplicates, which does the steps above in one call:
> my.quotes.unique <- unique(my.quotes.2)
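The two approaches can be compared on a small example. The `quotes` data frame here is hypothetical (it is not my.quotes.2); its last two rows repeat the first two.

```r
# Hypothetical data: rows 4 and 5 repeat rows 1 and 2
quotes <- data.frame(
  symbol = c("AAA", "BBB", "CCC", "AAA", "BBB"),
  close  = c(10, 20, 30, 10, 20)
)

# duplicated() flags rows that repeat an earlier row
dup <- duplicated(quotes)
dup

# Two equivalent ways of dropping the repeats
dedup1 <- quotes[!dup, ]
dedup2 <- unique(quotes)
```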
Finally, there are two more operations that you may find very useful in data analysis: sorting and ranking.
To sort the elements of an object, use the sort function:
> w <- c(5, 4, 7, 2, 7, 1)
> sort(w)
[1] 1 2 4 5 7 7
Add the decreasing=TRUE option to sort in reverse order:
> sort(w, decreasing=TRUE)
[1] 7 7 5 4 2 1
You can also set the na.last argument to control how NA values are handled:
> length(w)
[1] 6
> length(w) <- 7
> # note that by default, na.last=NA and NA values are not shown
> sort(w)
[1] 1 2 4 5 7 7
> # set na.last=TRUE to put NA values last
> sort(w, na.last=TRUE)
[1] 1 2 4 5 7 7 NA
> # set na.last=FALSE to put NA values first
> sort(w, na.last=FALSE)
[1] NA 1 2 4 5 7 7
2) Sorting data frames
To sort a data frame, you need to create a permutation of the indices from the data frame and use these to fetch the rows of the data frame in the correct order. You can generate an appropriate permutation of the indices using the order function:
order(..., na.last = , decreasing = )
##### Example 1:
First, let's see how order works. We'll define a vector with two elements out of order:
> v <- c(11, 12, 13, 15, 14)
You can see that the first three elements (11, 12, 13) are in order, and the last two (15, 14) are reversed.
> order(v)
[1] 1 2 3 5 4
> v[order(v)]
[1] 11 12 13 14 15
Suppose that we created the following data frame from the vector v and a second vector u:
> u <- c("pig", "cow", "duck", "horse", "rat")
> w <- data.frame(v, u)
> w
v u
1 11 pig
2 12 cow
3 13 duck
4 15 horse
5 14 rat
We could sort the data frame w by v using the following expression
> w[order(w$v),]
v u
1 11 pig
2 12 cow
3 13 duck
5 14 rat
4 15 horse
###### Example 2: sorting the my.quotes data frame by closing price
Sorting a whole data frame is a little strange. You can create a suitable permutation using the order function, but you need to call order using do.call for it to work properly. (The reason for this is that order expects a list of vectors and interprets the data frame as a single vector, not as a list of vectors.)
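A minimal sketch of the do.call idiom, using a hypothetical `animals` data frame rather than my.quotes. Because a data frame is a list of columns, `do.call(order, animals)` amounts to `order(animals$v, animals$u)`: rows are sorted by the first column, with ties broken by the second.

```r
# Hypothetical data with ties in the first column
animals <- data.frame(
  v = c(12, 11, 12, 11),
  u = c("pig", "rat", "cow", "duck"),
  stringsAsFactors = FALSE
)

# Each column becomes one argument to order()
perm <- do.call(order, animals)
perm

# Use the permutation to fetch the rows in sorted order
sorted <- animals[perm, ]
sorted
```

One caveat under this approach: column names are passed along as argument names, so a column called `decreasing` or `na.last` would be misread as an option to order.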
This part of the book explains how to plot data with R.
There are many ways to plot data in R; here we focus on the three most popular packages: graphics, lattice, and ggplot2.
The graphics package contains a wide variety of functions for plotting data. It is easy to customize or modify charts with the graphics package, or to interact with plots on the screen. The lattice package contains an alternative set of functions for plotting data. Lattice graphics are well suited for splitting data by a conditioning variable. Finally, ggplot2 uses a different metaphor for graphics, allowing you to easily and quickly create stunning charts.
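The graphics package can draw familiar chart types (bar charts, pie charts, line charts, and scatter plots) as well as some less familiar ones (quantile-quantile (Q-Q) plots, mosaic plots, and contour plots). The table below lists the chart types in the graphics package with descriptions.
R graphics can be displayed on screen or saved in many different formats.
The sample data for the scatter plots covers cancer cases in 2008 and toxic waste releases by state in 2006.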
> library(nutshell)
> data(toxins.and.cancer)
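To draw a scatter plot, use the plot function. plot is a generic function that can draw many different types of objects, including vectors, tables, and time series. For a simple scatter plot of two vectors, the plot.default method is used: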
plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL,
log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
ann = par("ann"), axes = TRUE, frame.plot = axes,
panel.first = NULL, panel.last = NULL, asp = NA, ...)
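@@@@ A brief description of the arguments to plot:
1) First plot: compare the overall cancer death rate (cancer deaths divided by state population) with the toxin release rate (total toxic chemical releases divided by state area).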
> library(nutshell)
> data(toxins.and.cancer)
> head(toxins.and.cancer)
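> attach(toxins.and.cancer)  # make the columns visible by name
> plot(total_toxic_chemicals/Surface_Area,deaths_total/Population)
The data also suggests a strong positive correlation between airborne toxin releases and lung cancer death rates.
2) Suppose you want to know which state corresponds to which point. R provides some interactive tools for identifying points on a plot. You can use the locator function to get the coordinates of a specific point (or set of points). To do this, first plot the data. Next, type locator(1). Then click a point in the open graphics window. For example, with the data plotted above, type locator(1) and click the highlighted point in the upper-right corner; the coordinates of that point are printed in the R console.
3) Another useful function for identifying points is identify. This function can be used to label points on a plot interactively. To use identify with the data above: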
> plot(air_on_site/Surface_Area, deaths_lung/Population)
> identify(air_on_site/Surface_Area, deaths_lung/Population,labels = State_Abbrev)
[1] 10 12 14 17 22
While this command is running, you can click on individual points on the chart, and R will label those points with state names.
> plot(air_on_site/Surface_Area, deaths_lung/Population,xlab='Air Release Rate of Toxic Chemicals',ylab='Lung Cancer Death Rate')
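> text(air_on_site/Surface_Area, deaths_lung/Population,labels=State_Abbrev,cex=0.5,adj=c(0,-1)) # adj adjusts the position, cex adjusts the size
Notice that we used the xlab and ylab arguments to add x- and y-axis labels, which makes the plot easier to read. The text function draws a label next to each point (we used the cex and adj arguments to tweak the size and position of the labels).
Is this relationship statistically significant? (See "Correlation tests" on page 384.) We don't have enough information to prove that there is a causal relationship here.
5) If you want to plot two columns of data in one chart, plot is a good choice. However, you may want to plot several columns at once, split the data into different categories, or plot all the columns of one matrix against all the columns of another. To plot multiple sets of columns against one another, use the matplot function: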
matplot(x, y, type = "p", lty = 1:5, lwd = 1, pch = NULL,
col = 1:6, cex = NULL, bg = NA,
xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL,
..., add = FALSE, verbose = getOption("verbose"))
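matplot accepts the following arguments:
Many of the arguments to matplot have the same names as the standard par parameters. However, because matplot draws several data series at once, these arguments can be given as vectors of multiple values when you call matplot.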
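A minimal matplot sketch, assuming made-up curves rather than the book's data: each column of `y` is drawn as its own series, and `lty` and `col` are given one value per column.

```r
# Draw off-screen so the sketch also runs non-interactively;
# drop this line (and dev.off) to see the plot on screen
pdf(NULL)

x <- seq(0, 2 * pi, length.out = 50)
# Three hypothetical series, one per column
y <- cbind(sin = sin(x), cos = cos(x), half = 0.5 * sin(2 * x))

# One line type and one color per column of y
series_cols <- c("black", "red", "blue")
matplot(x, y, type = "l", lty = 1:3, col = series_cols,
        xlab = "x", ylab = "value")
legend("topright", legend = colnames(y), lty = 1:3, col = series_cols)

invisible(dev.off())
```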
6) If you want to plot a very large number of points, you can use the smoothScatter function.
smoothScatter(x, y = NULL, nbin = 128, bandwidth,
colramp = colorRampPalette(c("white", blues9)),
nrpoints = 100, pch = ".", cex = 1, col = "black",
transformation = function(x) x^.25,
postPlotHook = box,
xlab = NULL, ylab = NULL, xlim, ylim,
xaxs = par("xaxs"), yaxs = par("yaxs"), ...)
> library(nutshell)
> data(batting.2008)
> pairs(batting.2008[batting.2008$AB>100,c('H','R','SO','BB','HR')])
R includes tools for plotting time series data; the plot function has a method for time series:
plot(x, y = NULL, plot.type = c("multiple", "single"),
xy.labels, xy.lines, panel = lines, nc, yax.flip = FALSE,
mar.multi = c(0, 5.1, 0, if(yax.flip) 5.1 else 2.1),
oma.multi = c(6, 0, 5, 0), axes = TRUE, ...)
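The arguments x and y specify ts objects, panel specifies how to draw the time series (lines by default), and the other arguments specify how to break the time series into separate plots.
1) For example, let's plot the turkey price data: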
> library(nutshell)
> data(turkey.price.ts)
> turkey.price.ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2001 1.58 1.75 1.63 1.45 1.56 2.07 1.81 1.74 1.54 1.45 0.57 1.15
2002 1.50 1.66 1.34 1.67 1.81 1.60 1.70 1.87 1.47 1.59 0.74 0.82
2003 1.43 1.77 1.47 1.38 1.66 1.66 1.61 1.74 1.62 1.39 0.70 1.07
2004 1.48 1.48 1.50 1.27 1.56 1.61 1.55 1.69 1.49 1.32 0.53 1.03
2005 1.62 1.63 1.40 1.73 1.73 1.80 1.92 1.77 1.71 1.53 0.67 1.09
2006 1.71 1.90 1.68 1.46 1.86 1.85 1.88 1.86 1.62 1.45 0.67 1.18
2007 1.68 1.74 1.70 1.49 1.81 1.96 1.97 1.91 1.89 1.65 0.70 1.17
2008 1.76 1.78 1.53 1.90
> plot(turkey.price.ts)
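As the plot shows, turkey prices are very seasonal. Prices drop sharply in November and December (sales for Thanksgiving and Christmas), with smaller dips in spring (probably for Easter).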
> acf(turkey.price.ts)
As you can see, points are correlated over 12-month cycles (and inversely correlated over 6-month cycles).
> pacf(turkey.price.ts)
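The 12-month and 6-month pattern can be reproduced on a synthetic series. The sketch below assumes a pure sine wave with a 12-month period (not the turkey price data) and reads the autocorrelations at lags 6 and 12 from the object that acf returns.

```r
# Hypothetical monthly series: 8 years of a pure 12-month cycle
months <- 1:96
seasonal <- ts(sin(2 * pi * months / 12), start = c(2001, 1), frequency = 12)

# plot = FALSE returns the autocorrelations without drawing
ac <- acf(seasonal, lag.max = 12, plot = FALSE)

# Element 1 is lag 0, so lag k is element k + 1
lag6  <- ac$acf[7]
lag12 <- ac$acf[13]
c(lag6 = lag6, lag12 = lag12)
```

A strongly negative value at lag 6 and a strongly positive value at lag 12 is the same signature described for the turkey prices.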
To draw bar charts (column charts), use the barplot function.
#### Construct the data
doctorates <- data.frame (
year=c(2001, 2002, 2003, 2004, 2005, 2006),
engineering=c(5323, 5511, 5079, 5280, 5777, 6425),
science=c(20643, 20017, 19529, 20001, 20498, 21564),
education=c(6436, 6349, 6503, 6643, 6635, 6226),
health=c(1591, 1541, 1654, 1633, 1720, 1785),
humanities=c(5213, 5178, 5051, 5020, 5013, 4949),
other=c(2159, 2141, 2209, 2180, 2480, 2436))
> doctorates
year engineering science education health humanities other
1 2001 5323 20643 6436 1591 5213 2159
2 2002 5511 20017 6349 1541 5178 2141
3 2003 5079 19529 6503 1654 5051 2209
4 2004 5280 20001 6643 1633 5020 2180
5 2005 5777 20498 6635 1720 5013 2480
6 2006 6425 21564 6226 1785 4949 2436
Note: the data above is also included in the nutshell package and can be loaded directly with data.
#### Convert it to a matrix for plotting
> doctorates.m<-as.matrix(doctorates[2:7])
> rownames(doctorates.m)<-doctorates[,1]
> doctorates.m
engineering science education health humanities other
2001 5323 20643 6436 1591 5213 2159
2002 5511 20017 6349 1541 5178 2141
2003 5079 19529 6503 1654 5051 2209
2004 5280 20001 6643 1633 5020 2180
2005 5777 20498 6635 1720 5013 2480
2006 6425 21564 6226 1785 4949 2436
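Because the barplot function cannot handle data frames, we created a matrix object here.
1.1) First, look at a bar chart of doctorates awarded in 2001 (the first row of the data):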
> barplot(doctorates.m[1,])
As you can see, by default R shows the y-axis along with the size of each bar but does not show the x-axis. R automatically uses the column names to label the bars.
1.2) Suppose that we wanted to show all the different years as bars placed next to one another. Suppose that we also wanted the bars plotted horizontally and wanted to show a legend for the different years:
> barplot(doctorates.m, beside=TRUE, horiz=TRUE, legend=TRUE, cex.names=.75)
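1.3) Finally, suppose that we wanted to show doctorates by year as stacked bars. Here we need to transpose the matrix so that each column is a year and each row is a discipline. We also need to leave enough room for the legend, so the y-axis limits are extended: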
> barplot(t(doctorates.m), legend=TRUE, ylim=c(0, 66000))
Here is a detailed description of the barplot function:
barplot(height, width = 1, space = NULL,
names.arg = NULL, legend.text = NULL, beside = FALSE,
horiz = FALSE, density = NULL, angle = 45,
col = NULL, border = par("fg"),
main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
axes = TRUE, axisnames = TRUE,
cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
add = FALSE, args.legend = NULL, ...)
The barplot function is very flexible; its arguments are described below:
One of the most popular ways to plot data is the pie chart. Pie charts can be an effective way to compare different parts of a quantity, though there are lots of good reasons not to use pie charts.
Here is the pie function:
pie(x, labels = names(x), edges = 200, radius = 0.8,
clockwise = FALSE, init.angle = if(clockwise) 90 else 0,
density = NULL, angle = 45, col = NULL, border = NULL,
lty = NULL, main = NULL, ...)
> domestic.catch.2006 <- c(7752, 1166, 463, 108)
> names(domestic.catch.2006) <- c("Fresh and frozen", "Reduced to meal, oil, etc.","Canned", "Cured")
> pie(domestic.catch.2006, init.angle=100, cex=.6)
# note: the cex=.6 setting shrinks text size by 40% so you can see the labels
The graphics package includes some very useful, and possibly unfamiliar, tools for looking at categorical data.
1) Suppose we want to look at the conditional density of a set of categories as a function of a numeric value. Use the cdplot function:
cdplot(x, y,
plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
bw = "nrd0", n = 512, from = NULL, to = NULL,
col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...)
The form of cdplot when called with a formula:
cdplot(formula, data = list(),
plot = TRUE, tol.ylab = 0.05, ylevels = NULL,
bw = "nrd0", n = 512, from = NULL, to = NULL,
col = NULL, border = 1, main = "", xlab = NULL, ylab = NULL,
yaxlabels = NULL, xlim = NULL, ylim = c(0, 1), ...,
subset = NULL)
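The cdplot function uses the density function to compute kernel density estimates across the numeric values and then plots these estimates. Here is the argument list for cdplot:
1.1) Example: see how the distribution of batting hand varies with batting average among MLB players in 2008.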
> batting.w.names.2008 <- transform(batting.2008,AVG=H/AB, bats=as.factor(bats), throws=as.factor(throws))
> head(batting.w.names.2008)
> cdplot(bats~AVG,data=batting.w.names.2008,subset=(batting.w.names.2008$AB>100))
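As you can see, the proportion of switch hitters (bats=="B") increases with higher batting average.
2) Suppose you simply want to plot the proportion of observations for different categorical variables. There are many tools for visualizing this kind of data; one of the most interesting in R is mosaicplot, which shows the number of observations with certain properties. A mosaic plot shows a set of boxes corresponding to different factor values. The x-axis corresponds to one factor and the y-axis to another factor. Use the mosaicplot function to create a mosaic plot; here is the mosaicplot method for a contingency table: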
mosaicplot(x, main = deparse(substitute(x)),
sub = NULL, xlab = NULL, ylab = NULL,
sort = NULL, off = NULL, dir = NULL,
color = NULL, shade = FALSE, margin = NULL,
cex.axis = 0.66, las = par("las"),
type = c("pearson", "deviance", "FT"), ...)
There is also a form that lets you specify the data as a formula and a data frame:
mosaicplot(formula, data = NULL, ...,
main = deparse(substitute(data)), subset,
na.action = stats::na.omit)
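2.1) Example: create a mosaic plot showing the number of batters in MLB in 2008.
On the x-axis, we'll show batting hand (left, right, or both), and on the y-axis we'll show throwing hand (left or right). This function can accept a matrix, or a formula and a data frame. In this example, we use a formula and a data frame:
> mosaicplot(formula=bats~throws, data=batting.w.names.2008, color=TRUE) # bats and throws are both categorical variables
3.1) Example: using the same data as the mosaic plot, draw a spine plot with the spineplot function: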
> spineplot(formula=bats~throws, data=batting.w.names.2008)
4) Another function for looking at tables of data is assocplot. The resulting chart is called a Cohen-Friendly association plot. (This function plots a set of bar charts, showing the deviation of each combination of factors from independence.)
4.1) Example: reuse the data from the previous example.
> assocplot(table(batting.w.names.2008$bats, batting.w.names.2008$throws),
xlab="Throws", ylab="Bats")
R includes several functions for visualizing three-dimensional data. All of these functions can be used to plot a matrix of values. (Row indices correspond to x values, column indices to y values, and values in the matrix to z values.)
persp(x = seq(0, 1, length.out = nrow(z)),
y = seq(0, 1, length.out = ncol(z)),
z, xlim = range(x), ylim = range(y),
zlim = range(z, na.rm = TRUE),
xlab = NULL, ylab = NULL, zlab = NULL,
main = NULL, sub = NULL,
theta = 0, phi = 15, r = sqrt(3), d = 1,
scale = TRUE, expand = 1,
col = "white", border = NULL, ltheta = -135, lphi = 0,
shade = NA, box = TRUE, axes = TRUE, nticks = 5,
ticktype = "simple", ...)
1.1) Example: using three-dimensional data for Yosemite Valley.
Specifically, let's look toward Half Dome. To plot this elevation data, I needed to make two transformations. First, I needed to flip the data horizontally. In the data file, values move east to west (or left to right) as x indices increase and from north to south (or top to bottom) as y indices increase. Unfortunately, persp plots y coordinates slightly differently: it plots increasing y coordinates from bottom to top. So I selected y indices in reverse order.
# load the data:
> library(nutshell)
> data(yosemite)
> head(yosemite)
# check dimensions of data
> dim(yosemite)
[1] 562 253
> yosemite.flipped<-yosemite[,seq(from = 253,to=1)]
> head(yosemite.flipped)
Next, select only a square subset of the elevation points. Here we take only the rightmost 253 columns of the elevation grid, which are rows 310 through 562 of the yosemite matrix because row indices correspond to x values. (Note the "+ 1" in the statement below; that's to make sure that we take exactly 253 of them and avoid a fencepost error.)
To plot the figure, I rotated the image by 225° (through theta=225) and changed the viewing angle to 20° (phi=20). I adjusted the light source to be from a 45° angle (ltheta=45) and set the shading factor to 0.75 (shade=.75) to exaggerate topological features. Putting it all together:
> # create halfdome subset in one expression:
# select a square of data: rows 310:562, columns 253:1
> halfdome <- yosemite[(nrow(yosemite) - ncol(yosemite) + 1):562,seq(from=253,to=1)]
> persp(halfdome,col=grey(.25), border=NA, expand=.15,theta=225, phi=20, ltheta=45, lphi=20, shade=.75)
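The "+ 1" in the row index above can be checked with plain arithmetic. This sketch only uses the dimensions reported by dim(yosemite), 562 rows and 253 columns; the variable names are illustrative.

```r
n_rows <- 562
n_cols <- 253

# First row of the square subset
first_row <- n_rows - n_cols + 1
first_row

# With the "+ 1", the range holds exactly n_cols rows
n_selected <- length(first_row:n_rows)

# Without it, the range would be one row too long (the fencepost error)
n_without_plus_one <- length((n_rows - n_cols):n_rows)

c(n_selected, n_without_plus_one)
```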
image(x, y, z, zlim, xlim, ylim, col = heat.colors(12),
add = FALSE, xaxs = "i", yaxs = "i", xlab, ylab,
breaks, oldstyle = FALSE, ...)
The arguments to image are described below.
Here is an expression that generates an image plot from the Yosemite Valley data:
> data(yosemite)
> image(yosemite, asp=253/562, ylim=c(1,0), col=sapply((0:32)/32, gray))
heatmap(x, Rowv=NULL, Colv=if(symm)"Rowv" else NULL,
distfun = dist, hclustfun = hclust,
reorderfun = function(d,w) reorder(d,w),
add.expr, symm = FALSE, revC = identical(Colv, "Rowv"),
scale=c("row", "column", "none"), na.rm = TRUE,
margins = c(5, 5), ColSideColors, RowSideColors,
cexRow = 0.2 + 1/log10(nr), cexCol = 0.2 + 1/log10(nc),
labRow = NULL, labCol = NULL, main = NULL,
xlab = NULL, ylab = NULL,
keep.dendro = FALSE, verbose = getOption("verbose"), ...)
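No example accompanies the heatmap signature above, so here is a minimal sketch. It assumes the built-in mtcars data set (not the Yosemite data) and scales each column so that variables measured in different units are comparable.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

# heatmap() needs a numeric matrix; scale = "column" standardizes
# each variable before colors are assigned
hm <- heatmap(as.matrix(mtcars), scale = "column", margins = c(5, 8))

invisible(dev.off())

# The (invisible) return value records how rows and columns were
# reordered by the hierarchical clustering
head(hm$rowInd)
hm$colInd
```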
4) In addition, there is the contour function.
contour(x = seq(0, 1, length.out = nrow(z)),
y = seq(0, 1, length.out = ncol(z)),
z,
nlevels = 10, levels = pretty(zlim, nlevels),
labels = NULL,
xlim = range(x, finite = TRUE),
ylim = range(y, finite = TRUE),
zlim = range(z, finite = TRUE),
labcex = 0.6, drawlabels = TRUE, method = "flattest",
vfont, axes = TRUE, frame.plot = axes,
col = par("fg"), lty = par("lty"), lwd = par("lwd"),
add = FALSE, ...)
Here is the argument list for contour:
4.1) Example: an expression that draws a contour plot from the Yosemite Valley data:
> contour(yosemite, asp=253/562, ylim=c(1, 0))
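As with image, we needed to flip the y-axis and to specify an aspect ratio.
When analyzing data, it is important to understand how the data is distributed. The distribution can tell you whether there are outliers in the data, whether a particular modeling technique is appropriate for the data, and so on.
1.1) Example: look at the number of plate appearances for batters during the 2008 MLB season.
# load the data set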
> library(nutshell)
> data(batting.2008)
# Let's calculate the plate appearances for each player and then plot a histogram
# Note: PA (plate appearances) = AB (at bats) + BB (base on balls) + HBP (hit by pitch) + SF (sacrifice flies) + SH (sacrifice bunts)
> batting.2008 <- transform(batting.2008,PA=AB+BB+HBP+SF+SH)
> hist(batting.2008$PA)
The histogram shows that there were a large number of players with fewer than 50 plate appearances. If you were to perform further analysis on this data (for example, looking at the average on-base percentage [OBP]), you might want to exclude these players from your analysis.
1.2) Generate a second histogram, this time excluding players with 25 or fewer plate appearances. We'll also increase the number of bars, using the breaks argument to specify that we want 50 bins:
> hist(batting.2008[batting.2008$PA>25, "PA"], breaks=50, cex.main=.8)
> plot(density(batting.2008[batting.2008$PA>25, "PA"]),cex.main = 0.9)
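A brief note on the density function:
The object returned by density includes the components x, y, bw, n, call, data.name, and has.na.
x and y hold the grid of points and the kernel density estimates at those points; together they are used to draw the smooth curve.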
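The components of a density object can be inspected directly. The sketch below assumes simulated data (`pa` is a made-up, right-skewed sample standing in for plate appearances) and checks that the estimated curve integrates to roughly 1.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

set.seed(42)
pa <- rexp(500, rate = 1 / 200)   # hypothetical skewed sample

d <- density(pa)
names(d)

# x is an evenly spaced grid (512 points by default), y the estimate
grid_points <- length(d$x)

# A simple Riemann sum over the grid should be close to 1
area <- sum(d$y) * diff(d$x[1:2])

plot(d, main = "Kernel density of simulated data")
rug(pa)   # mark the individual observations along the x-axis

invisible(dev.off())
```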
Another common addition to a kernel density plot is the rug function.
#### add a rug to the kernel density plot with an expression like:
> rug(batting.2008[batting.2008$PA>25, "PA"])
To generate quantile-quantile (Q-Q) plots in R, use the qqnorm function. Without arguments, this function will plot the distribution of points in each quantile, assuming a theoretical normal distribution:
> qqnorm(y = batting.2008$AB)
If you would like to compare two actual distributions, or compare the data distribution to a different theoretical distribution, then try the function qqplot.
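Another useful way to visualize a distribution is the box plot.
The adjacent values are meant to show extreme values, but they do not always correspond to the absolute maximum or minimum. When there are values far outside the interquartile range, these outlying values are plotted separately. Specifically, how are the adjacent values calculated? The upper adjacent value is the largest observation that is less than or equal to the upper quartile plus 1.5 times the length of the interquartile range; the lower adjacent value is the smallest observation that is greater than or equal to the lower quartile minus 1.5 times the length of the interquartile range. Values beyond the whiskers are called outside values and are plotted individually.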
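The calculation of the adjacent values can be worked through by hand and compared with boxplot.stats, which boxplot uses internally. The vector `x` is made up; note that boxplot.stats takes the quartiles from the hinges returned by fivenum, so this sketch does the same.

```r
# Hypothetical data with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# fivenum() gives min, lower hinge, median, upper hinge, max
hinges <- fivenum(x)
iqr <- hinges[4] - hinges[2]

# Fences: 1.5 times the interquartile range beyond each hinge
upper_fence <- hinges[4] + 1.5 * iqr
lower_fence <- hinges[2] - 1.5 * iqr

# Adjacent values: the most extreme observations inside the fences
upper_adjacent <- max(x[x <= upper_fence])
lower_adjacent <- min(x[x >= lower_fence])

# Compare with what boxplot() would draw
bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # outside values, plotted individually
```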
To draw a box plot, use the boxplot function.
### Here is the default boxplot method for vectors:
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
notch = FALSE, outline = TRUE, names, plot = TRUE,
border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
horizontal = FALSE, add = FALSE, at = NULL)
### Here is the formula form of boxplot:
boxplot(formula, data = NULL, ..., subset, na.action = NULL)
The arguments are described below:
> batting.2008 <- transform(batting.2008,OBP=(H+BB+HBP)/(AB+BB+HBP+SF))
> boxplot(OBP~teamID,data=batting.2008[batting.2008$PA>100 & batting.2008$lgID=="AL",],cex.axis=.7)
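Graphics in R are drawn on a graphics device. You can specify a graphics device manually or use the default. In an interactive R environment, the default is a device that draws graphics on the screen: on Windows it is the windows device, on most Unix systems it is X11, and on Mac OS X it is the quartz device. You can use the bmp, jpeg, png, and tiff devices to generate graphics in common formats. Other devices include postscript, pdf, pictex (which generates LaTeX/PicTeX), xfig, and bitmap.
Most devices let you specify the width, height, and point size of the output (through the width, height, and pointsize arguments). For devices that generate files, the file name is usually given through the file argument. After writing a graphic to a file, remember to call dev.off to close the device and save the file.
> png("scatter.1.png", width=4.3, height=4.3, units="in", res=72)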
> attach(toxins.and.cancer)
> plot(total_toxic_chemicals/Surface_Area, deaths_total/Population)
> dev.off()
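The same open, plot, close pattern works for any file device. This sketch assumes the pdf device and the built-in mtcars data (not toxins.and.cancer), and writes to a temporary file so that nothing in the working directory is touched.

```r
# A temporary output file; replace with a real path as needed
out <- tempfile(fileext = ".pdf")

# Open the device; for pdf(), width and height are in inches
pdf(out, width = 4.3, height = 4.3)

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Nothing is guaranteed to be on disk until the device is closed
invisible(dev.off())

file.exists(out)
```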
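There are many ways to change how R draws charts. The most intuitive is to set arguments passed to the charting function. Another way to customize charts is to set session parameters. A third way is to use functions that modify the chart (for example, adding titles or trend lines). Finally, you can write your own charting function from scratch.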
This section describes common arguments and parameters for controlling how charts are plotted.
A brief introduction to common arguments for charting functions:
Conveniently, most charting functions in R share some arguments. Here is a table of common arguments for charting functions.
This section describes the graphical parameters available in the graphics package.
In most cases, you can specify these parameters as arguments to graphics functions. However, you can also use the par function to set graphics parameters. The par function sets the graphics parameters for a specific graphics device. These new settings will be the defaults for any new plot until you close the device.
Setting parameters with par is very useful when you want to set parameters once and then draw several charts in a row, or reuse the same settings many times. You can write a function that sets the right parameters and call it whenever you want to draw some charts:
> my_graphics_params <- function () {
par(some graphics parameters)
}
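A concrete version of the template above might look like the following. The particular settings (mar, las, cex.axis) are arbitrary choices for illustration, not values from the book. Because par returns the previous values of whatever it changes, the helper's return value can be used to restore them later.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

my_graphics_params <- function() {
  # par() returns the old values of the parameters it changes
  par(mar = c(4, 4, 2, 1), las = 1, cex.axis = 0.8)
}

old <- my_graphics_params()    # apply the settings, keep the old ones
las_inside <- par("las")       # now 1 (axis labels always horizontal)

plot(1:5, (1:5)^2)

par(old)                       # restore the previous settings
las_restored <- par("las")

invisible(dev.off())
```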
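To check the value of a parameter with par, pass the parameter name as a character string; to set a parameter, use the parameter name as an argument name. Almost all parameters can be both read and written; the only exceptions are cin, cra, csi, cxy, and din, which are read-only. For example: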
> par("bg")
[1] "transparent"
You could use the par function to change the bg parameter to "white":
> par(bg="white")
> par("bg")
[1] "white"
Titles and axis labels are called chart annotation.
> par(ann = FALSE)
> plot(x = 1:5,y = 21:25)
R allows you to control the size of the margin around a plot. The whole graphics device is called the device region. The area where data is plotted is called the plot region.
By default, R maximizes the use of available space out to the margins (pty="m"), but you can easily ask R to use a square region by setting pty="s".
> par('mai')
[1] 1.360000 1.093333 1.093333 0.560000
> par('mar')
[1] 5.1 4.1 4.1 2.1
> par('mex')
[1] 1
1) In R, you can draw multiple plots within the same chart area by using the mfcol parameter. For example, the following draws six figures in the chart area (in three rows of two columns):
> par(mfcol=c(3, 2))
Each time a new figure is plotted, it will be plotted in a different row or column within the device, starting from the top-left corner. Each time a figure is added, R fills each column first, from top to bottom, and then moves to the next column to the right.
1.1) Example: draw six different figures
> png("~/Documents/book/current/figs/multiplefigs.1.png",
+ width=4.3, height=6.5, units="in", res=72)
> par(mfcol=c(3, 2))
> pie(c(5, 4, 3))
> plot(x=c(1, 2, 3, 4, 5), y=c(1.1, 1.9, 3, 3.9, 6))
> barplot(c(1, 2, 3, 4, 5))
> barplot(c(1, 2, 3, 4, 5), horiz=TRUE)
> pie(c(5, 4, 3, 2, 1))
> plot(c(1, 2, 3, 4, 5, 6), c(4, 3, 6, 2, 1, 1))
> dev.off()
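If you are drawing a matrix of subplots on a graphics device, you can use the parameter mfg=c(row, column, nrows, ncolumns) to specify the position of the next figure.
There is an outer margin around the figure region, which you can control with the omi, oma, and omd parameters. Within each figure, there is a second margin area, controlled by mai, mar, and mex. If you write your own graphics functions, you may want to use the xpd parameter to control where graphics are clipped.
To check the size of the current plot region (within the grid), use the pin parameter; to get the coordinates of the plot region, use plt. To check the dimensions of the current figure region in normalized device coordinates, use the fig parameter.
You may find it easier to use the functions layout or split.screen. Better still, use the packages grid or lattice.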
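As a sketch of the layout alternative: the matrix passed to layout says which figure occupies each cell, so a figure can span several cells, which mfcol cannot do. The arrangement and the plotted values below are arbitrary.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

# Figure 1 spans the whole top row; figures 2 and 3 share the bottom row
cells <- matrix(c(1, 1,
                  2, 3), nrow = 2, byrow = TRUE)
n_fig <- layout(cells)   # returns the number of figures defined

barplot(c(1, 2, 3, 4, 5))                              # figure 1
pie(c(5, 4, 3))                                        # figure 2
plot(x = c(1, 2, 3, 4, 5), y = c(1.1, 1.9, 3, 3.9, 6)) # figure 3

invisible(dev.off())
n_fig
```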
Many parameters control the way text is shown within a plot.
Text size
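Parameters:
ps: the default point size of text; cex: the default scaling factor for text; cex.axis: for axis annotation; cex.lab: for x- and y-axis labels; cex.main: for main titles; cex.sub: for subtitles.
1) To determine the point size for a chart title, multiply ps * cex * cex.main.
You may also use the read-only parameters cin, cra, csi, and cxy to check the size of characters.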
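The title-size rule can be made concrete with a little arithmetic. The parameter values set below (ps=10, cex.main=1.5) are arbitrary and chosen only so that the product is easy to check.

```r
# Open an off-screen device so par() has something to query
pdf(NULL)

# Set known values, then read them back
par(ps = 10, cex = 1, cex.main = 1.5)

# Effective point size of the main title
title_pt <- par("ps") * par("cex") * par("cex.main")

invisible(dev.off())
title_pt
```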
###Typeface
The text style is specified through the font parameter. You can specify the style for the axis with font.axis, for labels with font.lab, for main titles with font.main, and for subtitles with font.sub.
### Alignment and spacing
To control how text is aligned, use the adj parameter. To change the spacing between lines of text, use the lheight parameter.
###Rotation
To rotate each character, use the crt parameter. To rotate whole strings,use the srt parameter
@@@ First, get the default values of these parameters; then tweak them and observe the effect.
>unlist(par('ps','cex','cex.axis','cex.lab','cex.main','cex.sub','font','font.axis','font.lab','font.main','font.sub','adj','crt','srt') )
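You can specify colors in several ways: as a string, using RGB components, or by referencing a palette through an integer index. To get a list of valid color names, use the colors function. To specify a color using RGB components, use a string of the form "#RRGGBB", where RR, GG, and BB are hexadecimal values specifying the amount of red, green, and blue, respectively. To view or change the color palette, use the palette function. Other related functions include rgb, hsv, hcl, gray, and rainbow.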
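The different color specifications can be converted into one another, as this short sketch shows. The particular colors are arbitrary examples.

```r
# Build a "#RRGGBB" string from RGB components given on a 0-255 scale
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex

# Go the other way: decompose a hex string into its components
components <- as.vector(col2rgb("#4682B4"))
components

# Number of built-in color names
n_named <- length(colors())

# The current palette, used when a color is given as an integer index
palette()
```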
##### (1) Generate a color band from colors() (which contains 657 colors), as follows:
idx<-1:657
colorband<-colors()
plot(1,1,xlim = c(1,700),ylim=c(1,700))
j=1
> for(i in idx) {
abline(h=i,col=colorband[j])
j=j+1
}
##### (2) Generate a color band of 12 colors with the rainbow() function
> plot(1,1,xlim = c(1,15),ylim=c(1,15))
> rainbowband<-rainbow(12)
> j=1
> for(i in idx[1:12]) {
abline(h=i,col=rainbowband[j],lwd=4)
j=j+1
}
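You can change the plotting symbol by setting the pch parameter; for a list of point types, see the help for the points function.
The following table shows all of the graphical parameters that can be set with par():
Here is a list of the low-level graphics functions called by each high-level graphics function (we can usually look at the arguments of the low-level function to figure out how to customize the appearance of charts produced by the corresponding high-level function):
You can use the points function to draw points on a plot: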
points(x, y = NULL, type = "p", ...)
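# This is very useful for adding extra points to an existing plot (usually a scatter plot); the added points are typically drawn in a different color or with a different symbol. The most useful arguments are col (foreground color of the points), bg (background color of the points), pch (the plotting character), cex (the size of the points), and lwd (the line width of the plotting symbol).
Similarly, you can use the matpoints function to add points to an existing matrix plot: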
matpoints(x, y, type = "p", lty = 1:5, lwd = 1, pch = NULL,col = 1:6, ...)
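lines(x, y = NULL, type = "l", ...)
# Like points, lines is used to add to an existing plot. The lines function draws a set of line segments on an existing plot (the x and y values specify the points where the segments meet). Some useful arguments are lty (line type), lwd (line width), col (line color), lend (line end style), ljoin (line join style), and lmitre (line mitre limit).
Similarly, you can use matlines to add lines to an existing plot:
matlines (x, y, type = "l", lty = 1:5, lwd = 1, pch = NULL,col = 1:6, ...)
To draw a curve on the current graphics device, use the curve function: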
curve(expr, from = NULL, to = NULL, n = 101, add = FALSE,type = "l", ylab = NULL, log = NULL, xlim = NULL, ...)
### Here is the argument list for this function:
##### A simple example: plot the sine and cosine functions
> curve(sin, -2*pi, 2*pi, xname = "t")
> plot(cos, -pi, 3*pi)
> curve(cos, xlim = c(-pi, 3*pi), n = 1001, col = "blue", add = TRUE) ## uses the add argument
## Use the text function to add text to an existing plot.
text (x, y = NULL, labels = seq_along(x), adj = NULL,pos = NULL, offset = 0.5, vfont = NULL,
cex = 1, col = NULL, font = NULL, ...)
## Here is the argument list
## To draw a line across the entire plot region, use the abline function:
abline(a = NULL, b = NULL, h = NULL, v = NULL, reg = NULL,coef = NULL, untf = FALSE, ...)
## Here is the argument list for abline:
## Typically, you call abline once to draw a single line. For example:
> # (1) draw a simple plot as a background
> plot(x=c(0, 10), y=c(0, 10))
> # (2) plot a horizontal line at y=4
> abline(h=4)
> #(3) plot a vertical line at x=3
> abline(v=3)
> #(4) plot a line with a y-intercept of 1 and slope of 1
> abline(a=1, b=1)
> # (5)plot a line with a y-intercept of 10 and slope of -1,but this time, use the coef argument:
> abline(coef=c(10, -1))
@@@ abline can also draw several lines in one call. For example:
> plot(x=c(0, 10), y=c(0, 10))
> # plot a grid of lines between 1 and 10
> abline(h=1:10, v=1:10)
@@@ Addendum: if you want to draw a grid on a plot, use the grid function:
grid(nx = NULL, ny = nx, col = "lightgray", lty = "dotted",lwd = par("lwd"), equilogs = TRUE)
## An example:
> plot(x=c(0, 10), y=c(0, 10))
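> grid(nx = 10,ny = 5,col = rainbow(15),lwd = 3) # draws a grid with nx by ny cells
# Add a polygon to an existing plot
polygon(x, y = NULL, density = NULL, angle = 45,border = NULL, col = NA, lty = par("lty"), ..., fillOddEven = FALSE)
# The x and y arguments specify the vertices of the polygon. For example:
> polygon(x=c(2, 2, 4, 4), y=c(2, 4, 4, 2)) # draws a 2 x 2 square centered at (3,3) on the plot
@@@ Special case: if you want to draw a rectangle, just use the rect function: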
rect(xleft, ybottom, xright, ytop, density = NULL, angle = 45,col = NA, border = NULL, lty = par("lty"), lwd = par("lwd"),...)
An example:
> plot(c(100, 250), c(300, 450), type = "n", xlab = "", ylab = "",
+ main = "2 x 11 rectangles; 'rect(100+i,300+i, 150+i,380+i)'")
> i <- 4*(0:10) ## offsets between the rectangles: 0 4 8 12 16 20 24 28 32 36 40
> ## draw rectangles with bottom left (100, 300)+i
> ## and top right (150, 380)+i
> rect(100+i, 300+i, 150+i, 380+i, col = rainbow(11, start = 0.7, end = 0.1))
# Draw line segments and arrows
segments(x0, y0, x1, y1,col = par("fg"), lty = par("lty"), lwd = par("lwd"),...)
## x0, y0:coordinates of points from which to draw.
## x1, y1:coordinates of points to which to draw
# This function draws a set of line segments between the pairs of points given by (x0[i], y0[i]) and (x1[i], y1[i]).
# A small example:
> x <- stats::runif(12); y <- stats::rnorm(12)
> i <- order(x, y); x <- x[i]; y <- y[i]
> plot(x, y, main = "arrows(.) and segments(.)")
> i
[1] 10 6 12 9 5 4 7 1 2 3 8 11
> s <- seq(length(x)-1)
> arrows(x[s], y[s], x[s+1], y[s+1], col= 1:3)
> plot(x, y, main = "arrows(.) and segments(.)")
> segments(x[s], y[s], x[s+2], y[s+2], col= 'pink')
# Add a legend to a plot
legend(x, y = NULL, legend, fill = NULL, col = par("col"),lty, lwd, pch,angle = 45, density = NULL, bty = "o", bg = par("bg"),box.lwd = par("lwd"), box.lty = par("lty"), box.col = par("fg"),pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,adj = c(0, 0.5), text.width = NULL, text.col = par("col"),merge = do.lines && has.pch, trace = FALSE,plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,inset = 0, xpd, title.col = text.col)
# Here is the argument list:
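A minimal legend sketch, assuming two made-up curves. The legend is positioned by keyword ("topleft") rather than by coordinates, and its `col` and `lty` vectors mirror the settings used to draw each line.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

x <- 1:10
plot(x, x^2, type = "l", col = "blue", lty = 1, xlab = "x", ylab = "y")
lines(x, 10 * x, col = "red", lty = 2)

# One legend entry per line; bty = "n" suppresses the surrounding box
lg <- legend("topleft", legend = c("x^2", "10x"),
             col = c("blue", "red"), lty = c(1, 2), bty = "n")

invisible(dev.off())

# legend() invisibly returns the position of its box and text
names(lg)
```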
# Add chart annotation
title(main = NULL, sub = NULL, xlab = NULL, ylab = NULL,line = NA, outer = FALSE, ...)
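# This function can add a main title (main), a subtitle (sub), an x-axis label (xlab), and a y-axis label (ylab). Specify a value for line to move the labels outward from the edge of the plot; specify outer=TRUE to place the labels in the outer margin.
# Add axes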
axis(side, at = NULL, labels = TRUE, tick = TRUE, line = NA,
pos = NA, outer = FALSE, font = NA, lty = "solid",
lwd = 1, lwd.ticks = lwd, col = NULL, col.ticks = NULL,
hadj = NA, padj = NA, ...)
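# Here is the argument list:
# The box function draws a box around the current plot region. It is generally useful when drawing multiple plots in one graphics device.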
box(which = "plot", lty = "solid", ...)
### The which argument specifies where to draw the box; possible values are "plot", "figure", "inner", and "outer".
# The mtext function adds text to the margins of a plot.
mtext(text, side = 3, line = 0, outer = FALSE, at = NA,adj = NA, padj = NA, cex = NA, col = NA, font =NA, ...)
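### The side argument specifies which margin to draw the text in (side = 1 for bottom, side = 2 for left, side = 3 for top, and side = 4 for right); the line argument specifies where to write the text, in terms of margin lines, starting at 0 for the line closest to the plot region.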
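The annotation functions title, axis, box, and mtext can be combined to build up a chart piece by piece. In this sketch the data, the weekday labels, and the text strings are all made up; the plot is first drawn with no axes and no annotation, and each element is then added separately.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

# Start with a bare plot: no axes, no titles or labels
plot(1:5, c(2, 4, 3, 5, 1), type = "b", axes = FALSE, ann = FALSE)

# Bottom axis with custom labels; axis() returns the tick positions
ticks <- axis(side = 1, at = 1:5,
              labels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
axis(side = 2, las = 1)           # left axis with horizontal labels

box(which = "plot", lty = "solid")                # frame the plot region
title(main = "Weekly values", ylab = "Value")     # main title and y label
mtext("hypothetical data", side = 3, line = 0.25, cex = 0.8)  # top margin

invisible(dev.off())
```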
# Add lines or points to a perspective plot (drawn with the persp function)
trans3d(x,y,z, pmat)
# This function takes vectors of points x, y, and z and translates them into the correct screen position. The argument pmat is a perspective matrix that is used for translation. The persp function will return an appropriate perspective matrix object for use by trans3d.
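A short sketch of this workflow, assuming a made-up surface (a Gaussian bump) instead of the Yosemite data: persp returns the perspective matrix, trans3d projects a 3D point with it, and the projected point is then drawn with the ordinary points function.

```r
# Draw off-screen so the sketch also runs non-interactively
pdf(NULL)

# Hypothetical surface: a smooth bump with its peak at (0, 0, 1)
x <- seq(-2, 2, length.out = 30)
y <- x
z <- outer(x, y, function(a, b) exp(-(a^2 + b^2)))

# persp() returns the viewing transformation (a 4 x 4 matrix)
pmat <- persp(x, y, z, theta = 30, phi = 25, expand = 0.6)

# Project the peak into 2D screen coordinates and mark it
peak <- trans3d(x = 0, y = 0, z = 1, pmat = pmat)
points(peak, col = "red", pch = 19)

invisible(dev.off())
```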