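1. Reading and writing text files
2. Counting and translating characters
3. String concatenation
4. String splitting
5. String searching
6. String replacement
7. String extraction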
text <- readLines("file.txt", encoding = "UTF-8") # assumes such a file is available
scan("file.txt", what = character(0)) # default: each whitespace-separated word becomes one element of the character vector
scan("file.txt", what = character(0), sep = "\n") # each line of text becomes one element, similar to readLines
scan("file.txt", what = character(0), sep = ".") # each sentence (text up to a ".") becomes one element
cat(text, file = "file.txt", sep = "\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = F)
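The read and write calls above can be tried without an existing file.txt. The following is a minimal sketch that assumes a temporary file created with tempfile(); the variable names are illustrative and not part of the original.

```r
# Write a small character vector to a scratch file and read it back.
tmp <- tempfile(fileext = ".txt")  # hypothetical scratch file, stands in for "file.txt"
lines_out <- c("we are the world", "we are the children")
writeLines(lines_out, con = tmp)

lines_in <- readLines(tmp, encoding = "UTF-8")            # one element per line
words_in <- scan(tmp, what = character(0), quiet = TRUE)  # one element per word

identical(lines_in, lines_out)  # the round trip preserves the lines
length(words_in)                # number of whitespace-separated words
unlink(tmp)                     # remove the scratch file
```

Using a temporary file keeps the example from overwriting any real file in the working directory.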
x <- c("we are the world", "we are the children")
x
## [1] "we are the world" "we are the children"
nchar(x)
## [1] 16 19
length(x)
## [1] 2
nchar("")
## [1] 0
length("") # although the string is empty, it is still one element
## [1] 1
dna <- "AgCTaaGGGcctTagct"
dna
## [1] "AgCTaaGGGcctTagct"
tolower(dna)
## [1] "agctaagggccttagct"
chartr("Tt", "Uu", dna) # replace the base T with the base U
## [1] "AgCUaaGGGccuUagcu"
paste("control", 1:3, sep = "_")
## [1] "control_1" "control_2" "control_3"
x <- list(a = "aa", b = "bb")
y <- list(c = 1, d = 2)
paste(x, y, sep = "-")
## [1] "aa-1" "bb-2"
paste(x, y, sep = "-", collapse = ";")
## [1] "aa-1;bb-2"
paste(x, collapse = ":")
## [1] "aa:bb"
x
## $a
## [1] "aa"
##
## $b
## [1] "bb"
as.character(x) # convert an object of another type to character
## [1] "aa" "bb"
unlist(x)
## a b
## "aa" "bb"
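The difference between sep and collapse, and the way paste() recycles shorter vectors, can be summarised in a short sketch. paste0() is the base R shorthand for paste(..., sep = ""); the variable names here are illustrative only.

```r
# sep joins the arguments element by element; collapse then joins the
# resulting elements into a single string.
ids <- paste0("control", 1:3)                   # same as paste("control", 1:3, sep = "")
recycled <- paste(c("a", "b"), 1:4, sep = "_")  # the shorter vector is recycled
joined <- paste(ids, collapse = ", ")           # one string built from all elements

ids
recycled
joined
```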
strsplit() is a splitting function that can split at regular-expression matches. Its usage is: strsplit(x, split, fixed= F, perl= F, useBytes= F)
text <- "We are the world.\nWe are the children!"
text
## [1] "We are the world.\nWe are the children!"
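cat(text) # note that \n is printed as a newline: R resolves backslash escapes in string literals itself, before any regular expression is involved
## We are the world.
## We are the children!
strsplit(text, " ")
## [[1]]
## [1] "We" "are" "the" "world.\nWe" "are"
## [6] "the" "children!"
strsplit(text, "\\s") # split at any whitespace character; note the double backslash
## [[1]]
## [1] "We" "are" "the" "world." "We" "are"
## [7] "the" "children!"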
class(strsplit(text, "\\s"))
## [1] "list"
strsplit(text, "")
## [[1]]
## [1] "W" "e" " " "a" "r" "e" " " "t" "h" "e" " " "w" "o" "r"
## [15] "l" "d" "." "\n" "W" "e" " " "a" "r" "e" " " "t" "h" "e"
## [29] " " "c" "h" "i" "l" "d" "r" "e" "n" "!"
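Because split is treated as a regular expression by default, splitting at a character that has a special meaning in regex needs care. The sketch below contrasts three ways of splitting at a dot; the sample string and variable names are assumptions made for illustration.

```r
s <- "a.b.c"

# "." as a regex matches any character, so every character is a separator
# and only empty strings remain.
as_regex <- strsplit(s, ".")

# fixed = TRUE treats the separator as a literal string.
as_fixed <- strsplit(s, ".", fixed = TRUE)

# Escaping the dot works too; note the double backslash in the R literal.
escaped <- strsplit(s, "\\.")

as_fixed[[1]]     # the pieces between the dots
unlist(as_fixed)  # strsplit() returns a list; unlist() gives a plain vector
```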
grep(pattern, x, ignore.case= F, perl= F, value= F, fixed= F, useBytes= F, invert= F)
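grepl(pattern, x, ignore.case= F, perl= F, fixed= F, useBytes= F)
As the usage shows, grep() returns which elements of the vector x match pattern (that is, their indices in x), or the matching elements themselves when the value argument is set. grepl() returns a logical vector of the same length as x that indicates whether each element matches. Neither gives position information, that is, where within an element of x the pattern matches.
text <- c("We are the world", "we are the children")
grep("We", text) # which elements of text match the word 'We'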
## [1] 1
grep("We", text, invert = T) # which elements of text do not match the word 'We'
## [1] 2
grep("we", text, ignore.case = T) # ignore case when matching
## [1] 1 2
grepl("are", text) # whether each element of text matches the word 'are'; returns only TRUE or FALSE
## [1] TRUE TRUE
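The value argument mentioned above, and the use of a grepl() result as a logical index, can be sketched as follows. The variable names are illustrative.

```r
text <- c("We are the world", "we are the children")

# value = TRUE returns the matching elements instead of their indices.
matched <- grep("We", text, value = TRUE)

# The logical vector from grepl() can subset the original vector directly.
with_children <- text[grepl("children", text)]

# Summing the logical vector counts the matching elements.
n_we <- sum(grepl("we", text, ignore.case = TRUE))

matched
with_children
n_we
```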
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)
text <- c("We are the world", "we are the children")
regexpr("e", text)
## [1] 2 2
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE
class(regexpr("e", text))
## [1] "integer"
gregexpr("e", text)
## [[1]]
## [1] 2 6 10
## attr(,"match.length")
## [1] 1 1 1
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 2 6 10 18
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"useBytes")
## [1] TRUE
class(gregexpr("e", text))
## [1] "list"
regexec("e", text)
## [[1]]
## [1] 2
## attr(,"match.length")
## [1] 1
##
## [[2]]
## [1] 2
## attr(,"match.length")
## [1] 1
class(regexec("e", text))
## [1] "list"
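The result of regexpr() is an integer vector that carries two additional attributes: the length of each match and whether matching was done byte by byte. Here regexpr() returns 2 and 2, meaning that in both elements of text the first "e" is found at character position 2, and the match.length attribute gives a match length of 1 in each case; a value of -1 would mean that the element contains no match. gregexpr() returns the positions of all matches, whereas regexpr() returns only the position of the first match in each element of text; the result of gregexpr() is a list. regexec() returns much the same as regexpr(), that is, only the position of the first match, but its result is a list, and in the output above the list elements lack one attribute, namely attr(,"useBytes").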
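The positions and match lengths described above are enough to pull the matched text back out. The sketch below does this once by hand with substring() and once with the base R helper regmatches(); the pattern "the [^ ]+" and the variable names are assumptions chosen for illustration.

```r
text <- c("We are the world", "we are the children")

# First match of "the" followed by one word, per element.
m <- regexpr("the [^ ]+", text)
start <- as.integer(m)          # start position of each match
len <- attr(m, "match.length")  # length of each match

# Manual extraction from start position and match length.
by_position <- substring(text, start, start + len - 1)

# regmatches() performs the same extraction from the match object.
by_regmatches <- regmatches(text, m)

# With gregexpr(), regmatches() returns a list holding every match, so the
# number of matches per element is the length of each list entry.
e_counts <- sapply(regmatches(text, gregexpr("e", text)), length)

by_position
e_counts
```

Note that regmatches() drops elements without a match when used with regexpr(), so the manual version is the safer choice when the result must stay aligned with text.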
Besides the searches above, an exact match is sometimes needed, and match() is used for this. Its usage is: match(x, table, nomatch = NA_integer_, incomparables = NULL)
text <- c("We are the world", "we are the children", "we")
match("we", text)
## [1] 3
match(2, c(3, 4, 2, 8))
## [1] 3
match("xx", c("abc", "xxx", "xx", "xx")) # only the index of the first exactly matching element is returned
## [1] 3
match(2, c(3, 4, 2, 8, 2))
## [1] 3
match("xx", c("abc", "xxx")) # no exact match, so NA is returned
## [1] NA
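There is also charmatch(), whose usage is similar to match() but whose behaviour, judging from the examples below, is somewhat odd. It also returns the index in table of the string it matches. Matching starts from the leftmost (first) character of each string in table, so a string that matches only further to the right does not count. If both partial matches and one exact match exist, the exact match is preferred. If there are several exact matches, or no exact match and several partial matches, the result is 0. If there is no match at all, the result is NA. There is also pmatch(), which performs a similar partial matching but is not identical: for example, it returns NA rather than 0 for an ambiguous partial match.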
charmatch("xx", c("abc", "xxa"))
## [1] 2
charmatch("xx", c("abc", "axx")) # matching starts from the leftmost character
## [1] NA
charmatch("xx", c("xxa", "xxb")) # not unique
## [1] 0
charmatch("xx", c("xxa", "xxb", "xx")) # the exact match is preferred, despite two partial matches
## [1] 3
charmatch(2, c(3, 4, 2, 8))
## [1] 3
charmatch(2, c(3, 4, 2, 8, 2))
## [1] 0
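To show where charmatch() and pmatch() agree and where they differ, the following sketch runs both on the same inputs, taken from the examples above. The variable names are illustrative.

```r
ambiguous <- c("xxa", "xxb")         # two partial matches, no exact match
with_exact <- c("xxa", "xxb", "xx")  # two partial matches plus one exact match

cm_ambiguous <- charmatch("xx", ambiguous)  # charmatch() signals ambiguity with 0
pm_ambiguous <- pmatch("xx", ambiguous)     # pmatch() signals ambiguity with NA

cm_exact <- charmatch("xx", with_exact)     # both prefer the exact match
pm_exact <- pmatch("xx", with_exact)

c(cm_ambiguous, pm_ambiguous, cm_exact, pm_exact)
```

Code that tests the result with is.na() therefore behaves differently depending on which of the two functions produced it.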
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
text <- c("we are the world", "we are the children")
sub("w", "W", text)
## [1] "We are the world" "We are the children"
gsub("w", "W", text)
## [1] "We are the World" "We are the children"
sub(" ", "", "abc def ghi")
## [1] "abcdef ghi"
gsub(" ", "", "abc def ghi")
## [1] "abcdefghi"
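Note: gsub() searches each element of the vector and, if the pattern matches at several positions within an element, replaces all of them. grep() also searches each element of the vector, but it only reports whether an element matches the pattern (returning the index of that element in the vector); it cannot tell how many times the pattern matches within the element. The example here only illustrates the difference between the two and may rarely matter in practice.
text <- c("we are the world", "we are the children")
grep("w", text) # grep returns 1 and 2: both elements of text contain the character w, although the first element contains it twice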
## [1] 1 2
gsub("w", "W", text)
## [1] "We are the World" "We are the children"
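When the number of matches per element is needed, which grep() cannot report, one option is to count the positions returned by gregexpr(). This is a minimal sketch with illustrative variable names, not the method of the original text.

```r
text <- c("we are the world", "we are the children")

# gregexpr() returns one integer vector of match positions per element;
# an element without any match yields the single value -1.
hits <- gregexpr("w", text, fixed = TRUE)

# Count only real positions (> 0) so that "no match" counts as zero.
counts <- sapply(hits, function(h) sum(h > 0))
counts
```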
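String extraction is in some ways similar to string splitting. The commonly used extraction functions are substr() and substring(). Both extract by position and do not use regular expressions themselves, but combined with the regex functions regexpr(), gregexpr() and regexec() they make it easy to extract the required information from text. Their usage is as follows: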
substr(x, start, stop)
substring(text, first, last)
x <- "123456789"
substr(x, c(2, 4), c(4, 5, 8))
## [1] "234"
substring(x, c(2, 4), c(4, 5, 8))
## [1] "234" "45" "2345678"
y <- c("12345678", "abcdefgh")
substr(y, c(2, 4), c(4, 5, 8))
## [1] "234" "de"
substring(y, c(2, 4), c(4, 5, 8))
## [1] "234" "de" "2345678"
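In the output above, x has length 1, and substr() returns one result per element of x, so it uses only the first value of each of the other two arguments, however long they are: 2 and 4 are the start and stop positions, and the result is the string "234". With substring(), the length of the result follows the longest argument, here last. Note also that first and last differ in length, so R's recycling rule for shorter vectors applies: first is extended to c(2, 4, 2), and the function extracts the three substrings from 2 to 4, from 4 to 5, and from 2 to 8.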
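The recycling described above can be made explicit. The sketch below extends every argument to the common length by hand with rep_len() and checks that substr() then gives the same result as substring(); the helper variables are illustrative.

```r
x <- "123456789"
first <- c(2, 4)
last <- c(4, 5, 8)

# Length of the longest argument; the others are recycled up to it.
n <- max(length(x), length(first), length(last))
first_full <- rep_len(first, n)  # c(2, 4, 2)

# With all arguments at full length, substr() extracts element by element.
manual <- substr(rep_len(x, n), first_full, rep_len(last, n))

manual
identical(manual, substring(x, first, last))
```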
dna <- paste(sample(c("A", "G", "C", "T"), 12, replace = T), collapse = "")
dna
substring(dna, seq(1, 10, by = 3), seq(3, 12, by = 3))
## [1] "ATA" "ACG" "CGT" "GGG"
strtrim(x, width)
strtrim(c("abcde", "abcde", "abcde"), c(1, 5, 10))
## [1] "a" "abcde" "abcde"
strtrim(c(1, 123, 12345), 4) # the shorter vector is recycled
## [1] "1" "123" "1234"
strwrap(x, width, indent= 0, exdent= 0, prefix= "", simplify= T, initial= prefix)
string <- "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."
string
## [1] "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."
cat(string)
## Each character string in the input is first split into
##  paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries.
strwrap(string) # the newline is simply ignored
## [1] "Each character string in the input is first split into paragraphs"
## [2] "(or lines containing whitespace only). The paragraphs are then"
## [3] "formatted by breaking lines at word boundaries."
strwrap(string, width = 40, indent = 4) # indent the first line
## [1] "    Each character string in the input"
## [2] "is first split into paragraphs (or"
## [3] "lines containing whitespace only). The"
## [4] "paragraphs are then formatted by"
## [5] "breaking lines at word boundaries."
strwrap(string, width = 40, exdent = 4) # indent every line except the first
## [1] "Each character string in the input is"
## [2] "    first split into paragraphs (or"
## [3] "    lines containing whitespace only)."
## [4] "    The paragraphs are then formatted"
## [5] "    by breaking lines at word"
## [6] "    boundaries."
strwrap(string, width = 40, simplify = F) # the result is a list rather than a character vector
## [[1]]
## [1] "Each character string in the input is"
## [2] "first split into paragraphs (or lines"
## [3] "containing whitespace only). The"
## [4] "paragraphs are then formatted by"
## [5] "breaking lines at word boundaries."
strwrap(string, width = 40, prefix = "******")
## [1] "******Each character string in the"
## [2] "******input is first split into"
## [3] "******paragraphs (or lines containing"
## [4] "******whitespace only). The paragraphs"
## [5] "******are then formatted by breaking"
## [6] "******lines at word boundaries."
Source: http://rstudio-pubs-static.s3.amazonaws.com/13823_dbf87ac4114b44f8a4b4fbd2ea5ea162.html