Plain Text Processing in R: A Summary

Date: 2019-11-13

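Contents:
1. Reading and writing text files
2. Character counting and character translation
3. String concatenation
4. String splitting
5. String searching
6. String replacement
7. String extraction
8. Custom string output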

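Notes:

Plain text files differ from the tabular text files we usually work with. The files discussed here are pure text, consisting mostly of strings, whereas tabular text files are data files with fairly regular rows and columns; reading those calls for functions such as read.table() or read.csv().
Regular expressions themselves are not introduced here.
Packages such as stringr, parser, and R.utils are not covered either, although the functions they provide are admittedly more convenient.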

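1. Reading and writing text files

The main functions for reading text files in R are readLines() and scan(). Both let you specify the encoding of the input (via the encoding argument), and once read into R the whole text is stored in a character vector.

text <- readLines("file.txt", encoding = "UTF-8")  # assumes such a file is available
scan("file.txt", what = character(0))  # default: each whitespace-separated word becomes one element
scan("file.txt", what = character(0), sep = "\n")  # each line becomes one element, similar to readLines()
scan("file.txt", what = character(0), sep = ".")  # each sentence (split at ".") becomes one element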

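Likewise, the contents of an R object can be written to a text file, mainly with cat() and writeLines(). By default cat() writes the elements of a vector one after another separated by a single space; the separator can be changed with the sep argument.

cat(text, file = "file.txt", sep = "\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = FALSE)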
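
The read and write functions above can be checked with a small round trip. This is a minimal sketch, not from the original post: it writes to a temporary file created with tempfile() rather than a real "file.txt", and the variable names are illustrative.

```r
# Round trip through a temporary file so that no real data is touched
path <- tempfile(fileext = ".txt")
lines_out <- c("We are the world.", "We are the children.")

writeLines(lines_out, con = path)                 # one vector element per line
lines_in <- readLines(path, encoding = "UTF-8")   # read the lines back

# scan() with the default separator splits on whitespace
words <- scan(path, what = character(0), quiet = TRUE)
# scan() with sep = "\n" returns one element per line
by_line <- scan(path, what = character(0), sep = "\n", quiet = TRUE)

identical(lines_in, lines_out)   # TRUE
identical(by_line, lines_out)    # TRUE
length(words)                    # 8 whitespace-separated words

unlink(path)                     # remove the temporary file
```

quiet = TRUE only suppresses scan()'s "Read N items" message.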

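2. Character counting and character translation

nchar() counts the number of characters in each element. Note the difference from length(), which counts the number of elements in a vector.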

x <- c("we are the world", "we are the children")
x

## [1] "we are the world"    "we are the children"

nchar(x)

## [1] 16 19

length(x)

## [1] 2

nchar("")

## [1] 0

length("")  # the string is empty, but it is still one element

## [1] 1
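
For non-ASCII text the character count and the byte count can differ. The sketch below is an addition to the original post; it uses nchar()'s type argument from base R, and the sample string (two CJK characters written with \u escapes so the code does not depend on the file encoding) is an illustrative choice.

```r
# Two CJK characters, written with \u escapes (stored as UTF-8)
x <- "\u4e2d\u6587"

n_chars <- nchar(x)                  # counts characters: 2
n_bytes <- nchar(x, type = "bytes")  # counts bytes in the UTF-8 encoding: 6
n_elems <- length(x)                 # still a single element: 1

c(chars = n_chars, bytes = n_bytes, elements = n_elems)
```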

Commonly used functions for character translation are tolower(), toupper(), and chartr().

dna <- "AgCTaaGGGcctTagct"
dna

## [1] "AgCTaaGGGcctTagct"

tolower(dna)

## [1] "agctaagggccttagct"

toupper(dna)

## [1] "AGCTAAGGGCCTTAGCT"

chartr("Tt", "Uu", dna)  # replace the base T with U

## [1] "AgCUaaGGGccuUagcu"
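
Because chartr() maps characters one to one, it can also build a reverse complement. This is an illustrative sketch; rev_comp() is a hypothetical helper name, not a function from the original post or from base R.

```r
# Hypothetical helper: reverse complement of a DNA string
rev_comp <- function(dna) {
  upper <- toupper(dna)
  # chartr() translates position by position: A->T, C->G, G->C, T->A
  complement <- chartr("ACGT", "TGCA", upper)
  # reverse: split into single characters, reverse them, and re-join
  paste(rev(strsplit(complement, "")[[1]]), collapse = "")
}

rev_comp("AACG")             # "CGTT"
rev_comp(rev_comp("AACG"))   # applying it twice gives back "AACG"
```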

3. String concatenation

paste() is R's function for concatenating strings, but it can do considerably more than that.

paste("control", 1:3, sep = "_")

## [1] "control_1" "control_2" "control_3"

x <- list(a = "aa", b = "bb")
y <- list(c = 1, d = 2)
paste(x, y, sep = "-")

## [1] "aa-1" "bb-2"

paste(x, y, sep = "-", collapse = ";")

## [1] "aa-1;bb-2"

paste(x, collapse = ":")

## [1] "aa:bb"

x

## $a
## [1] "aa"
## 
## $b
## [1] "bb"

as.character(x)  # convert an object of another type to character

## [1] "aa" "bb"

unlist(x)

##    a    b 
## "aa" "bb"
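
The difference between sep and collapse, and the way paste() recycles shorter arguments, can be sketched as follows. paste0() is the base R shorthand for paste(..., sep = ""); the variable names are illustrative.

```r
# sep joins the arguments element by element; paste0() uses sep = ""
ids <- paste0("sample", 1:3)

# collapse then joins the resulting elements into a single string
header <- paste(ids, collapse = "\t")

# shorter arguments are recycled to the length of the longest one
labelled <- paste("control", c("a", "b"), 1:4, sep = "_")

ids        # "sample1" "sample2" "sample3"
labelled   # "control_a_1" "control_b_2" "control_a_3" "control_b_4"
```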

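4. String splitting

strsplit() splits strings, and it can match the split points with a regular expression. Its usage is:
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

x is a character vector; the function splits each of its elements in turn.
split is a character vector giving the string(s) at which to split. It is interpreted as a regular expression by default; with fixed = TRUE it is matched as a literal string, which is also faster.
perl = TRUE switches to Perl-compatible regular expressions (PCRE); it does not depend on an installed version of Perl. For long or complex patterns the Perl-style engine may run faster.
useBytes controls whether matching is done byte by byte; the default, FALSE, matches character by character.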

text <- "We are the world.\nWe are the children!"
text

## [1] "We are the world.\nWe are the children!"

cat(text)  # \n is printed as a newline: the escape is handled by R's string syntax, not by a regular expression

## We are the world.
## We are the children!

strsplit(text, " ")

## [[1]]
## [1] "We"         "are"        "the"        "world.\nWe" "are"       
## [6] "the"        "children!"

strsplit(text, "\\s")  # split at any whitespace character; note the double backslash

## [[1]]
## [1] "We"        "are"       "the"       "world."    "We"        "are"      
## [7] "the"       "children!"

class(strsplit(text, "\\s"))

## [1] "list"

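strsplit() returns a list. To turn the result into a character vector, use unlist() as shown above. Note that as.character() applied to the whole list deparses each element into a single string, so extract the element first (for example with [[1]]).

One special case: if split is the empty string, the function returns the individual characters.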

strsplit(text, "")

## [[1]]
##  [1] "W"  "e"  " "  "a"  "r"  "e"  " "  "t"  "h"  "e"  " "  "w"  "o"  "r" 
## [15] "l"  "d"  "."  "\n" "W"  "e"  " "  "a"  "r"  "e"  " "  "t"  "h"  "e" 
## [29] " "  "c"  "h"  "i"  "l"  "d"  "r"  "e"  "n"  "!"
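
Because split is a regular expression by default, characters such as "." need care. A minimal sketch with a made-up version string:

```r
version <- "3.6.1"

# As a regex, "." matches ANY character, so every character is a delimiter
# and only empty strings remain
as_regex <- strsplit(version, ".")[[1]]

# Two ways to split at a literal dot
as_fixed <- strsplit(version, ".", fixed = TRUE)[[1]]   # literal matching
escaped  <- strsplit(version, "\\.")[[1]]               # escaped regex

as_regex   # only empty strings
as_fixed   # "3" "6" "1"
escaped    # "3" "6" "1"
```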

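5. String searching

String searching relies mainly on regular-expression matching. The relevant functions in R include grep(), grepl(), regexpr(), gregexpr(), and regexec().

grep() and grepl() have the following usage:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
As the signatures suggest, grep() returns the indices of the elements of x that match pattern (or the matching elements themselves when value = TRUE), while grepl() returns a logical vector of the same length as x indicating whether each element matches. Neither reports where inside an element the match occurs.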

text <- c("We are the world", "we are the children")
grep("We", text)  # which elements of text match 'We'

## [1] 1

grep("We", text, invert = TRUE)  # which elements of text do not match 'We'

## [1] 2

grep("we", text, ignore.case = TRUE)  # ignore case when matching

## [1] 1 2

grepl("are", text)  # whether each element of text matches 'are'; returns only TRUE or FALSE

## [1] TRUE TRUE

regexpr(), gregexpr(), and regexec() also search strings. Unlike grep() and grepl(), their results include the position and length of each match (so they can be used for string extraction). Their usage is:
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)

text <- c("We are the world", "we are the children")
regexpr("e", text)

## [1] 2 2
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE

class(regexpr("e", text))

## [1] "integer"

gregexpr("e", text)

## [[1]]
## [1]  2  6 10
## attr(,"match.length")
## [1] 1 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1]  2  6 10 18
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"useBytes")
## [1] TRUE

class(gregexpr("e", text))

## [1] "list"

regexec("e", text)

## [[1]]
## [1] 2
## attr(,"match.length")
## [1] 1
## 
## [[2]]
## [1] 2
## attr(,"match.length")
## [1] 1

class(regexec("e", text))

## [1] "list"

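regexpr() returns an integer vector that carries extra attributes, including the length of each match (match.length) and whether matching was done byte by byte (useBytes). Here the result is 2 and 2: in both elements of text the first "e" is at character 2, and the match is 1 character long (as given by match.length). An element with no match would give -1. regexpr() reports only the first match in each element of text, whereas gregexpr() returns the positions of all matches, as a list. regexec() is much like regexpr() in that it reports only the first match, but its result is a list, and in the output above its elements lack one attribute, attr(,"useBytes").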
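
The no-match case is easier to see with a pattern that fails on one element. A minimal sketch that reads the positions and the match.length attribute directly (the variable names are illustrative):

```r
text <- c("We are the world", "we are the children")

# "we" (lowercase) does not occur in the first element
m <- regexpr("we", text)
first_pos <- as.integer(m)            # -1 means no match; 1 is the start position
first_len <- attr(m, "match.length")  # -1 for no match; 2 characters matched

# gregexpr() returns one vector of positions per element
all_e <- gregexpr("e", text)
pos_in_second <- as.integer(all_e[[2]])

first_pos       # -1  1
first_len       # -1  2
pos_in_second   # 2  6 10 18
```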

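Besides the searches above, an exact match is sometimes needed, and match() provides it. Its usage is: match(x, table, nomatch = NA_integer_, incomparables = NULL)
match() returns the index in table of an element only when that element equals x exactly; otherwise it returns the value of nomatch (NA by default).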

text <- c("We are the world", "we are the children", "we")
match("we", text)

## [1] 3

match(2, c(3, 4, 2, 8))

## [1] 3

match("xx", c("abc", "xxx", "xx", "xx"))  # returns only the index of the first exact match

## [1] 3

match(2, c(3, 4, 2, 8, 2))

## [1] 3

match("xx", c("abc", "xxx"))  # no exact match, so NA is returned

## [1] NA
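
match() is vectorised over x, and the %in% operator is built on it. A short sketch with illustrative data:

```r
pool   <- c("We are the world", "we are the children", "we")
wanted <- c("we", "you", "We are the world")

idx      <- match(wanted, pool)                 # one index (or NA) per element of wanted
idx_zero <- match(wanted, pool, nomatch = 0L)   # 0 instead of NA when nothing matches
present  <- wanted %in% pool                    # TRUE where an exact match exists

idx        # 3 NA  1
idx_zero   # 3  0  1
present    # TRUE FALSE TRUE
```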

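There is also charmatch(), whose usage resembles match() but whose behaviour, judging from the examples below, is a little odd. It too returns the index in table of the matching string. Matching starts from the left of each string in table (that is, from the first character), so x must be a prefix of the table entry; if it is not, the result is NA. If there is both an exact match and partial matches, the exact match wins. If there are several exact matches, or several partial matches and no exact one, the result is 0. If nothing matches at all, the result is NA. A related function, pmatch(), also does partial matching, but it returns NA rather than 0 when a partial match is ambiguous.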

charmatch("xx", c("abc", "xxa"))

## [1] 2

charmatch("xx", c("abc", "axx"))  # matching starts from the left

## [1] NA

charmatch("xx", c("xxa", "xxb"))  # not unique

## [1] 0

charmatch("xx", c("xxa", "xxb", "xx"))  # the exact match wins despite two partial matches

## [1] 3

charmatch(2, c(3, 4, 2, 8))

## [1] 3

charmatch(2, c(3, 4, 2, 8, 2))

## [1] 0

I do not know yet where such an odd function comes in handy; I look forward to finding out!
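
One plausible use is resolving an abbreviated option name against a list of allowed values. The sketch below, with made-up option names, also shows how charmatch() and pmatch() report an ambiguous abbreviation differently.

```r
choices <- c("mean", "median", "max")

unique_c    <- charmatch("med", choices)  # unique prefix: index 2
ambiguous_c <- charmatch("me", choices)   # "mean" and "median" both match: 0
ambiguous_p <- pmatch("me", choices)      # pmatch() reports ambiguity as NA
none_c      <- charmatch("x", choices)    # no entry starts with "x": NA

c(unique_c, ambiguous_c, ambiguous_p, none_c)
```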

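6. String replacement

sub() and gsub() provide replacement, but they do not modify the original object: they return a new object, and assigning that result back to the original name makes it look as if the contents had been "replaced". (R passes arguments by value, not by reference.)

The usage of sub() and gsub() is: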
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

text <- c("we are the world", "we are the children")
sub("w", "W", text)

## [1] "We are the world"    "We are the children"

gsub("w", "W", text)

## [1] "We are the World"    "We are the children"

sub(" ", "", "abc def ghi")

## [1] "abcdef ghi"

gsub(" ", "", "abc def ghi")

## [1] "abcdefghi"

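As the output shows, sub() replaces only the first match in each element (note the initial letter of "world" in the output), while gsub() replaces every match.
Note: gsub() searches each element of the vector and replaces every position that matches the pattern. grep() also searches each element, but it only learns whether the element matches (and returns its index in the vector), not how many times. This is only meant to illustrate the difference between the two; it may not matter in practice.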

text <- c("we are the world", "we are the children")
grep("w", text)  # returns 1 and 2: both elements of text contain "w", although the first element contains it twice

## [1] 1 2

gsub("w", "W", text)

## [1] "We are the World"    "We are the children"
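
If the number of matches per element is needed, which grep() cannot report, one option is to count the positions returned by gregexpr(). A minimal sketch; the third test string is added here for illustration.

```r
text <- c("we are the world", "we are the children", "no match here")

hits <- gregexpr("w", text)
# gregexpr() returns -1 for an element without a match, so count positive positions only
counts <- vapply(hits, function(m) sum(m > 0), integer(1))

counts   # 2 1 0
```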

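7. String extraction

String extraction is in some ways similar to string splitting. The usual extraction functions are substr() and substring(). Both extract by position and do not take regular expressions themselves, but combined with the regex functions regexpr(), gregexpr(), and regexec() they make it easy to pull the required information out of text. Their usage is:
substr(x, start, stop)
substring(text, first, last)
x and text are the character vectors to extract from, start and first are vectors of start positions, and stop and last are vectors of end positions. The two functions differ slightly in the length of the value they return:

substr() returns as many strings as the length of its first argument
substring() returns as many strings as the length of the longest of its three arguments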

x <- "123456789"
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234"     "45"      "2345678"

y <- c("12345678", "abcdefgh")
substr(y, c(2, 4), c(4, 5, 8))

## [1] "234" "de"

substring(y, c(2, 4), c(4, 5, 8))

## [1] "234"     "de"      "2345678"

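From the output above: x has length 1, so substr() uses only the first value of each of the other two arguments, however long they are. Those values are 2 and 4, the start and end positions, and the result is the string "234". substring() instead follows its longest argument, last. Since first and last differ in length, R's recycling rule applies: first is extended to c(2, 4, 2), and the function extracts characters 2 to 4, 4 to 5, and 2 to 8 in turn.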
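
The combination of position-based extraction with the regex functions mentioned at the start of this section can be sketched as follows. The input strings and the pattern are made up for illustration.

```r
text <- c("id=42; name=foo", "name=bar", "id=7")

m <- regexpr("[0-9]+", text)                 # first run of digits in each element
first <- as.integer(m)                       # start position, -1 if no match
last  <- first + attr(m, "match.length") - 1L

# Extract by position; elements without a match become NA
ids <- ifelse(first > 0, substring(text, first, last), NA_character_)

ids   # "42" NA "7"
```

Base R also offers regmatches() for this purpose; the version above only uses functions already discussed in this post.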

substring() makes it easy to split a DNA or RNA sequence into codons (groups of three bases), the first step in translating it.

dna <- paste(sample(c("A", "G", "C", "T"), 12, replace = TRUE), collapse = "")
dna

## [1] "ATAACGCGTGGG"

substring(dna, seq(1, 10, by = 3), seq(3, 12, by = 3))

## [1] "ATA" "ACG" "CGT" "GGG"
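
The same idea can be wrapped in a small function that derives the start positions from the sequence length. split_codons() is a hypothetical helper name, and the sketch assumes the length is a positive multiple of three.

```r
# Hypothetical helper: split a sequence into consecutive triplets
split_codons <- function(dna_seq) {
  n <- nchar(dna_seq)
  # assumption of this sketch: a non-empty sequence whose length is a multiple of 3
  stopifnot(n > 0, n %% 3 == 0)
  starts <- seq(1, n - 2, by = 3)
  substring(dna_seq, starts, starts + 2)
}

split_codons("ATAACGCGTGGG")   # "ATA" "ACG" "CGT" "GGG"
```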

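8. Custom string output

This topic is somewhat similar to string concatenation. strtrim() trims strings to a given display width. Its usage is:
strtrim(x, width)
The returned character vector has the same length as x.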

strtrim(c("abcde", "abcde", "abcde"), c(1, 5, 10))

## [1] "a"     "abcde" "abcde"

strtrim(c(1, 123, 12345), 4)  # width is recycled

## [1] "1"    "123"  "1234"

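strtrim() trims each string according to the number given in width. If width exceeds the number of characters in a string, the string is left unchanged; no padding such as spaces is added.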
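
When fixed-width output is needed, trimming can be combined with padding. The sketch below is an addition to the original post and pairs strtrim() with base R's formatC(), whose flag = "-" left-justifies and pads with spaces; the labels are made up.

```r
labels <- c("a", "abcdefgh", "xyz")

trimmed <- strtrim(labels, 5)                       # cut to at most 5 characters
padded  <- formatC(trimmed, width = 5, flag = "-")  # pad shorter strings on the right

nchar(padded)   # 5 5 5
```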

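strwrap() treats a string as a paragraph (whether or not it contains line breaks), indents and breaks it accordingly, and returns the resulting lines as strings. Its usage is:
strwrap(x, width = 0.9 * getOption("width"), indent = 0, exdent = 0, prefix = "", simplify = TRUE, initial = prefix)
Each returned line is shorter than width characters. Lines are broken at word boundaries, so they are generally not all the same length.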

string <- "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."
string

## [1] "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."

cat(string)

## Each character string in the input is first split into
##  paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries.

strwrap(string)  # the newline is treated as ordinary whitespace

## [1] "Each character string in the input is first split into paragraphs"
## [2] "(or lines containing whitespace only). The paragraphs are then"   
## [3] "formatted by breaking lines at word boundaries."

strwrap(string, width = 40, indent = 4)  # indent the first line

## [1] "    Each character string in the input"
## [2] "is first split into paragraphs (or"    
## [3] "lines containing whitespace only). The"
## [4] "paragraphs are then formatted by"      
## [5] "breaking lines at word boundaries."

strwrap(string, width = 40, exdent = 4)  # indent every line except the first

## [1] "Each character string in the input is" 
## [2] "    first split into paragraphs (or"   
## [3] "    lines containing whitespace only)."
## [4] "    The paragraphs are then formatted" 
## [5] "    by breaking lines at word"         
## [6] "    boundaries."

strwrap(string, width = 40, simplify = FALSE)  # returns a list rather than a character vector

## [[1]]
## [1] "Each character string in the input is"
## [2] "first split into paragraphs (or lines"
## [3] "containing whitespace only). The"     
## [4] "paragraphs are then formatted by"     
## [5] "breaking lines at word boundaries."
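
How width limits the wrapped lines can be checked with nchar(), and writeLines() prints the result as a paragraph. A minimal sketch using the same sample string:

```r
string <- "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."

wrapped <- strwrap(string, width = 40)
widths  <- nchar(wrapped)

all(widths < 40)     # every line stays below width
writeLines(wrapped)  # print one wrapped line per output line
```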

strwrap(string, width = 40, prefix = "******")

## [1] "******Each character string in the"    
## [2] "******input is first split into"       
## [3] "******paragraphs (or lines containing" 
## [4] "******whitespace only). The paragraphs"
## [5] "******are then formatted by breaking"  
## [6] "******lines at word boundaries."

Source: http://rstudio-pubs-static.s3.amazonaws.com/13823_dbf87ac4114b44f8a4b4fbd2ea5ea162.html
