R Study Notes: String Handling

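I wanted to build the file name of a plot in R: the prefix fitbit, then the month, then ".jpg". Before searching online, I tried the equivalent syntax from other languages, and none of it worked:

C#: "fitbit" + month + ".jpg"

VB: "fitbit" & month & ".jpg"

Haskell: "fitbit" ++ month ++ ".jpg"

I also thought of functions such as concat, but those did not work either. In the end I had to read the help: you have to use the paste function.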

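paste() and paste0(): concatenating strings

paste() not only concatenates several strings, it also converts objects to strings automatically before joining them, and it handles vectors, which makes it more powerful than plain concatenation.

paste("fitbit", month, ".jpg", sep="")

The unusual thing about this function is that its default separator is a space, so sep="" has to be given explicitly. With month=10, this produces the string fitbit10.jpg.

There is also a paste0 function, whose default is sep="".

So paste0("fitbit", month, ".jpg") is a little more concise than the code above.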
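
A quick check of the separator behaviour described above. This is only a small sketch; month is set to the example value 10 used in the text.

```r
month <- 10

# paste() puts a space between its arguments unless sep says otherwise
with_default <- paste("fitbit", month, ".jpg")
with_empty   <- paste("fitbit", month, ".jpg", sep = "")
with_paste0  <- paste0("fitbit", month, ".jpg")

print(with_default)  # "fitbit 10 .jpg"
print(with_empty)    # "fitbit10.jpg"
print(with_paste0)   # "fitbit10.jpg"
```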

To generate the fitbit file names for all 12 months:

paste("fitbit", 1:12, ".jpg", sep = "")

[1] "fitbit1.jpg"  "fitbit2.jpg"  "fitbit3.jpg"  "fitbit4.jpg"  "fitbit5.jpg"  "fitbit6.jpg"  "fitbit7.jpg"

[8] "fitbit8.jpg"  "fitbit9.jpg"  "fitbit10.jpg" "fitbit11.jpg" "fitbit12.jpg"

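This shows that vector arguments are pasted together element by element. If one vector is shorter, it is recycled to the length of the longest one:

a <- c("甲","乙","丙", "丁","戊","己","庚","辛","壬","癸")

b <- c("子","丑","寅","卯","辰","巳","午","未","申","酉","戌","亥")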

paste0(a, b)

[1] "甲子" "乙丑" "丙寅" "丁卯" "戊辰" "己巳" "庚午" "辛未" "壬申" "癸酉" "甲戌" "乙亥"
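
Recycling is perhaps easier to see with two short vectors. The values below are made up purely for illustration.

```r
# The 2-element vector is reused until the 4-element vector is exhausted
labels <- paste0(c("a", "b"), 1:4)
print(labels)  # "a1" "b2" "a3" "b4"

# A length-1 argument is recycled in the same way
single <- paste0("x", 1:3)
print(single)  # "x1" "x2" "x3"
```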

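paste also has a collapse argument, which joins the resulting strings into one long string instead of leaving them as separate elements of a vector.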

> paste("fitbit", 1:3, ".jpg", sep = "", collapse = "; ")

[1] "fitbit1.jpg; fitbit2.jpg; fitbit3.jpg"
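
The roles of sep and collapse are easy to mix up, so here is a minimal sketch with made-up values: sep works inside each element, collapse works across elements.

```r
parts <- c("a", "b", "c")

# sep goes between the arguments of each element-wise paste
by_sep <- paste(parts, 1:3, sep = "-")

# collapse then joins the resulting elements into a single string
joined <- paste(parts, 1:3, sep = "-", collapse = "+")

print(by_sep)  # "a-1" "b-2" "c-3"
print(joined)  # "a-1+b-2+c-3"
```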

 

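nchar(): counting characters

nchar() returns the length of a string; its result is not the same as that of length().

nchar(c("abc", "abcd"))    # number of characters in each string; returns the vector c(3, 4)

length(c("abc", "abcd"))  # returns 2, the number of elements in the vector

Note that nchar(NA) returns 2, because the logical NA is first converted to the string "NA".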
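
A short sketch of the points above, using made-up inputs: nchar() works per element, length() counts elements, and non-character input is converted to character first.

```r
words <- c("abc", "abcd", "")

chars_per_word <- nchar(words)
print(chars_per_word)  # 3 4 0
print(length(words))   # 3

# Non-character input is converted to character before counting
print(nchar(123456))   # 6
print(nchar(NA))       # 2
```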

tolower(x) and toupper(x): case conversion.

Nothing more needs to be said.
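
For completeness, a tiny example. The file name is invented for illustration.

```r
name <- "Fitbit_Month_10.JPG"

lower_name <- tolower(name)
upper_name <- toupper(name)

print(lower_name)  # "fitbit_month_10.jpg"
print(upper_name)  # "FITBIT_MONTH_10.JPG"

# Handy for case-insensitive comparison
same_ext <- tolower("JPG") == tolower("jpg")
print(same_ext)  # TRUE
```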

 

strsplit: splitting strings

strsplit("2014-10-30 2262 10367 7.4 18 1231 77 88 44", split=" ")

[[1]]

[1] "2014-10-30" "2262"       "10367"      "7.4"        "18"         "1231"       "77"         "88"         "44" 

In fact, the split argument of this function is a regular expression, which makes it very powerful.
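
To illustrate the regular-expression support, here is a sketch with made-up inputs. It also shows fixed = TRUE, which switches the regular expression off when the delimiter is a special character.

```r
# One or more spaces as the delimiter
fields <- strsplit("2014-10-30   2262  10367", split = " +")[[1]]
print(fields)  # "2014-10-30" "2262" "10367"

# fixed = TRUE treats "." as a literal dot rather than "any character"
pieces <- strsplit("fitbit.10.jpg", split = ".", fixed = TRUE)[[1]]
print(pieces)  # "fitbit" "10" "jpg"

# strsplit always returns a list, one element per input string
dates <- strsplit(c("2014-10-30", "2014-11-05"), split = "-")
print(dates[[2]])  # "2014" "11" "05"
```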

 

substr and substring: extracting substrings

substr("abcdef", 2, 4)
[1] "bcd"

substr(c("abcdef", "ghijkl"), 2, 4)
[1] "bcd" "hij"

substr("abcdef", 1:6, 1:6)
[1] "a"

Note that there is also a substring function, which behaves differently:

substring("abcdef",1:6,1:6)
[1] "a" "b" "c" "d" "e" "f"

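The difference: substr returns as many strings as its first argument has elements,
whereas substring returns as many strings as the longest of its three arguments, recycling the shorter ones.

substr("123456789", c(2, 3), c(4, 5, 6))  # equivalent to substr("123456789", 2, 4)

[1] "234"

substring("123456789", c(2, 3), c(4,5,6)) # the longest vector has length 3; the others are recycled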

[1] "234"   "345"   "23456"
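
One practical use of this recycling is breaking a string into single characters. The sketch below compares it with strsplit and also shows that the end position of substring is optional.

```r
s <- "abcdef"

# substring() recycles s across the six start/end pairs
chars <- substring(s, 1:nchar(s), 1:nchar(s))
print(chars)  # "a" "b" "c" "d" "e" "f"

# Splitting on the empty string gives the same result
same <- identical(chars, strsplit(s, "")[[1]])
print(same)  # TRUE

# The end position of substring() defaults to the end of the string
tail_part <- substring(s, 4)
print(tail_part)  # "def"
```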

 

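grep: searching for a pattern

grep(pattern, x) searches the character vector x for the regular expression pattern and returns the indices of the elements that contain a match.

grep("def", c("abc","abcdef","def"))    # the 2nd and 3rd strings both contain def, so 2 and 3 are returned

[1] 2 3

The following example lists all .exe files in the Windows directory.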

files <- list.files("c:/windows")

files[grep("\\.exe$", files)]

[1] "adb.exe" "explorer.exe" "hh.exe" "notepad.exe"

[5] "regedit.exe" "slrundll.exe" "twunk_16.exe" "twunk_32.exe"

[9] "winhelp.exe" "winhlp32.exe"
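
The same idea can be tried without a Windows directory. The file list below is made up so that the sketch runs anywhere; it also shows the value and ignore.case arguments of grep.

```r
# A made-up file list, so the example does not depend on c:/windows
files <- c("notepad.exe", "readme.txt", "SETUP.EXE", "hh.exe", "exe.log")

idx <- grep("\\.exe$", files)
print(idx)  # 1 4

# value = TRUE returns the matching strings instead of their indices
exe_files <- grep("\\.exe$", files, value = TRUE)
print(exe_files)  # "notepad.exe" "hh.exe"

# ignore.case = TRUE also picks up "SETUP.EXE"
any_case <- grep("\\.exe$", files, value = TRUE, ignore.case = TRUE)
print(any_case)  # "notepad.exe" "SETUP.EXE" "hh.exe"
```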

sub: search and replace

Still to be written up.

sprintf()

This function is similar to the C function of the same name.

sprintf("2+3=%d", 2+3)

[1] "2+3=5"

sprintf("today: %s", date())

[1] "today: Wed Nov 05 14:05:47 2014"
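
sprintf is also an alternative to paste for building file names, because the format string can pad numbers. A small sketch follows; the fitbit_month_ prefix is borrowed from the example later in these notes.

```r
# %02d pads the month to two digits; sprintf is vectorised over its arguments
file_names <- sprintf("fitbit_month_%02d.jpg", c(1L, 10L, 12L))
print(file_names)  # "fitbit_month_01.jpg" "fitbit_month_10.jpg" "fitbit_month_12.jpg"

# %.2f rounds to two decimal places
rounded <- sprintf("%.2f", pi)
print(rounded)  # "3.14"

# A width right-aligns by default; a leading "-" left-aligns
aligned <- sprintf("%5s|%-5s|", "ab", "cd")
print(aligned)  # "   ab|cd   |"
```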

regexpr()

> regexpr("def", "abcdefghi")    # def first occurs at character position 4

[1] 4

attr(,"match.length")

[1] 3

attr(,"useBytes")

[1] TRUE

gregexpr()

This is similar to the function above, but it finds the positions of all matches.

gregexpr("def", "abcdefghijabcdef")   # def occurs at character positions 4 and 14

[[1]]

[1]  4 14

 

Solving a real problem of mine with R

Now I want to generate the Fitbit chart for October automatically and save it as the file fitbit_month_10.jpg.

m <- 10

jpeg(paste0("fitbit_month_", m, ".jpg"))

monthData <- fitbit[as.double(format(fitbit$date, "%m"))==m, ]

plot(format(monthData$date,"%d"), monthData$step, type="l", xlab="date", ylab="steps", main=paste("2014年",m,"月步數統計圖",sep=""))

dev.off()

[Figure: fitbit_month_10.jpg, the chart produced by the code above]
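
The same approach extends to several months. The sketch below is only an illustration: the fitbit data frame is synthetic, its date and steps columns are assumptions, and the plotting calls are left out so that the block runs without a graphics device.

```r
# Synthetic stand-in for the fitbit data frame (column names are assumptions)
fitbit <- data.frame(
  date  = seq(as.Date("2014-10-01"), as.Date("2014-12-31"), by = "day"),
  steps = 5000
)

month_of <- as.integer(format(fitbit$date, "%m"))
months   <- sort(unique(month_of))

# One file name per month, built with paste0 as in the text
file_names <- paste0("fitbit_month_", months, ".jpg")

# Number of rows that would go into each monthly chart
days <- sapply(months, function(m) nrow(fitbit[month_of == m, ]))

print(file_names)  # "fitbit_month_10.jpg" "fitbit_month_11.jpg" "fitbit_month_12.jpg"
print(days)        # 31 30 31
```

For real use, the jpeg(), plot() and dev.off() calls shown above would go inside a loop over months.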

 

 

 

 

The rest has not been organized yet; for now, the material I found online is collected below.

===============================================

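grepl() is similar to grep, but the trailing "l" means that it returns logical values.

String replacement: gsub() searches a string for a given expression and replaces every match with new content. sub() is similar, but replaces only the first match.

chartr(old, new, x)

casefold(x, upper = FALSE)

agrep(): approximate (fuzzy) string matching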
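
The functions listed above are shown without examples, so here is a minimal sketch of each with made-up inputs. The agrep call relies on its default tolerance, which for a four-letter pattern allows one edit.

```r
x <- c("abc", "abcdef", "def")

has_def <- grepl("def", x)
print(has_def)  # FALSE TRUE TRUE

# chartr() maps each character of old to the character at the same position in new
mapped <- chartr("abc", "xyz", "aabbcc")
print(mapped)  # "xxyyzz"

# casefold() is a wrapper around tolower() and toupper()
folded_lower <- casefold("Fitbit")
folded_upper <- casefold("Fitbit", upper = TRUE)
print(folded_lower)  # "fitbit"
print(folded_upper)  # "FITBIT"

# agrep() tolerates small differences: "lasy" approximately matches "lazy"
fuzzy <- agrep("lasy", c("1 lazy 2", "nothing here"))
print(fuzzy)  # 1
```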

 

regexec("Adam", text)

## [[1]]
## [1] 9
## attr(,"match.length")
## [1] 4
##
## [[2]]
## [1] 5
## attr(,"match.length")
## [1] 4
##
## [[3]]
## [1] 14
## attr(,"match.length")
## [1] 4

 

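The grep and grepl functions:
These two functions report matches at the level of vector elements; they give no detailed position information within the matched strings.

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Although the arguments look much the same, the results differ. The following example lists all files in the C:\windows directory and then uses grep and grepl to find the exe files:

files <- list.files("c:/windows")
grep("\\.exe$", files)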

## [1] 8 28 30 35 36 58 69 99 100 102 111 112 115 117

grepl("\\.exe$", files)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [34] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [100] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE

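grep returns only the indices of the matching elements, whereas grepl returns a result for every element, as a logical vector indicating whether a match was found. When used to extract a subset, the two give the same result: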

files[grep("\\.exe$", files)]

## [1] "bfsvc.exe" "explorer.exe" "fveupdate.exe" "HelpPane.exe"
## [5] "hh.exe" "notepad.exe" "regedit.exe" "twunk_16.exe"
## [9] "twunk_32.exe" "uninst.exe" "winhelp.exe" "winhlp32.exe"
## [13] "write.exe" "xinstaller.exe"

files[grepl("\\.exe$", files)]

## [1] "bfsvc.exe" "explorer.exe" "fveupdate.exe" "HelpPane.exe"
## [5] "hh.exe" "notepad.exe" "regedit.exe" "twunk_16.exe"
## [9] "twunk_32.exe" "uninst.exe" "winhelp.exe" "winhlp32.exe"
## [13] "write.exe" "xinstaller.exe"
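
The relationship between the two results can be checked directly. This is a sketch on a made-up file list, so it does not need a Windows directory.

```r
files   <- c("bfsvc.exe", "system.ini", "explorer.exe", "win.ini")
pattern <- "\\.exe$"

idx  <- grep(pattern, files)   # integer positions of the matches
flag <- grepl(pattern, files)  # one logical value per element

# which() turns the logical vector into the positions grep() returns
same <- identical(idx, which(flag))
print(same)  # TRUE

# A logical vector is convenient for counting and for negation
n_exe   <- sum(flag)
non_exe <- files[!flag]
print(n_exe)    # 2
print(non_exe)  # "system.ini" "win.ini"
```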

 

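6 String replacement
6.1 The sub and gsub functions
Although sub and gsub are the functions used for string replacement, strictly speaking R has no function that replaces a string in place, because R passes arguments by value rather than by reference, whatever the operation.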

text

## [1] "Hellow, Adam!" "Hi, Adam!" "How are you, Adam."

sub(pattern = "Adam", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."

text

## [1] "Hellow, Adam!" "Hi, Adam!" "How are you, Adam."

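As you can see, although this is called "replacement", the original string is unchanged; to change the original variable, the result has to be assigned back to it. The difference between sub and gsub is that sub makes only one replacement per string (however many matches there are), while gsub replaces every match: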

sub(pattern = "Adam|Ava", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."

gsub(pattern = "Adam|Ava", replacement = "world", text)

## [1] "Hellow, world!" "Hi, world!" "How are you, world."
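
In the example above each string contains only one match, so sub and gsub give the same result. The difference shows up once a string contains several matches, as in this sketch with a made-up sentence.

```r
greeting <- "Adam, meet Ava. Ava, meet Adam."

first_only <- sub("Adam|Ava", "world", greeting)
every_one  <- gsub("Adam|Ava", "world", greeting)

print(first_only)  # "world, meet Ava. Ava, meet Adam."
print(every_one)   # "world, meet world. world, meet world."

# The input stays untouched until the result is assigned back
greeting <- every_one
```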

sub and gsub can use backreferences (an escape character followed by a digit) so that a captured part replaces the whole match:

sub(pattern = ".*(Adam).*", replacement = "\\1", text)

## [1] "Adam" "Adam" "Adam"
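
Backreferences can also reorder several captured groups. The date below is just an example value.

```r
# Three capture groups: year, month, day
iso_date <- "2014-10-30"
pattern  <- "([0-9]{4})-([0-9]{2})-([0-9]{2})"

# \\3, \\2 and \\1 refer back to the groups, here in reverse order
reordered <- sub(pattern, "\\3/\\2/\\1", iso_date)
print(reordered)  # "30/10/2014"

# Keep only the month
month_only <- sub(pattern, "\\2", iso_date)
print(month_only)  # "10"
```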

7 String extraction
You can try this yourself: obtain the position information with regexpr, gregexpr, or regexec, and then extract the substrings.
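
One possible way to do this is sketched below: once by hand with substring, and once with the base function regmatches, which takes the match data directly. The text vector repeats the one used in section 6.1; the pattern is an arbitrary choice for illustration.

```r
text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

# regexpr() gives the start and match.length of the first match in each string
m <- regexpr("[A-Z][a-z]+", text)

# Extraction by hand from the position information
first   <- as.integer(m)
last    <- first + attr(m, "match.length") - 1
by_hand <- substring(text, first, last)
print(by_hand)  # "Hellow" "Hi" "How"

# regmatches() does the same extraction in one step
direct <- regmatches(text, m)
print(direct)  # "Hellow" "Hi" "How"

# With gregexpr() it returns every match, as a list with one element per string
all_words <- regmatches(text, gregexpr("[A-Z][a-z]+", text))
print(all_words[[1]])  # "Hellow" "Adam"
```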

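8 Miscellaneous:
8.1 The strtrim function
This function trims strings to a given display width. Its usage is strtrim(x, width), and the character vector it returns has the same length as x. Because it "trims", it can only drop surplus characters and never adds any: if a string is already narrower than width, you get the original string back. Do not expect it to pad with spaces or any other character: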

strtrim(c("abcdef", "abcdef", "abcdef"), c(1, 5, 10))

## [1] "a" "abcde" "abcdef"

strtrim(c(1, 123, 1234567), 4)

## [1] "1" "123" "1234"
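
When padding is what you want, formatC or sprintf can do it. The sketch below contrasts them with strtrim on made-up inputs.

```r
x <- c("a", "abc", "abcdefgh")

# strtrim() cuts long strings and leaves short ones unchanged
trimmed <- strtrim(x, 4)
print(trimmed)  # "a" "abc" "abcd"

# formatC() pads to a minimum width; flag = "-" left-justifies
padded <- formatC(x, width = 4, flag = "-")
print(padded)  # "a   " "abc " "abcdefgh"

# sprintf() can do both at once: "%-4.4s" pads to 4 and truncates at 4
fixed_width <- sprintf("%-4.4s", x)
print(fixed_width)  # "a   " "abc " "abcd"
```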

8.2 The strwrap function
This function treats a string as a paragraph of text (whether or not it contains line breaks) and breaks it into lines according to the paragraph format (indentation and width), splitting at word boundaries. Each line becomes one string in the result. For example:

str1 <- "Each character string in the input is first split into paragraphs\n(or lines containing whitespace only). The paragraphs are then\nformatted by breaking lines at word boundaries. The target\ncolumns for wrapping lines and the indentation of the first and\nall subsequent lines of a paragraph can be controlled\nindependently."
str2 <- rep(str1, 2)
strwrap(str2, width = 80, indent = 2)

## [1] "  Each character string in the input is first split into paragraphs (or lines"
## [2] "containing whitespace only). The paragraphs are then formatted by breaking"
## [3] "lines at word boundaries. The target columns for wrapping lines and the"
## [4] "indentation of the first and all subsequent lines of a paragraph can be"
## [5] "controlled independently."
## [6] "  Each character string in the input is first split into paragraphs (or lines"
## [7] "containing whitespace only). The paragraphs are then formatted by breaking"
## [8] "lines at word boundaries. The target columns for wrapping lines and the"
## [9] "indentation of the first and all subsequent lines of a paragraph can be"
## [10] "controlled independently."

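The simplify argument controls the form of the result. The default is TRUE, which puts all the resulting strings, in order, into a single character vector (as above); with FALSE the result is a list. Another argument, exdent, sets the indentation of every line except the first: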

strwrap(str1, width = 80, indent = 0, exdent = 2)

## [1] "Each character string in the input is first split into paragraphs (or lines"
## [2] "  containing whitespace only). The paragraphs are then formatted by breaking"
## [3] "  lines at word boundaries. The target columns for wrapping lines and the"
## [4] "  indentation of the first and all subsequent lines of a paragraph can be"
## [5] "  controlled independently."
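
The effect of simplify is easier to see on short inputs. The sketch below uses made-up text and a deliberately small width.

```r
txt <- c("one two three four five six", "seven eight nine ten")

# Default: the lines of all inputs end up in one character vector
flat <- strwrap(txt, width = 12)

# simplify = FALSE keeps the lines of each input string together
nested <- strwrap(txt, width = 12, simplify = FALSE)
print(length(nested))  # 2

# Every wrapped line is shorter than width
fits <- all(nchar(flat) < 12)
print(fits)  # TRUE

# Joining with newlines gives text that is ready for cat()
cat(paste(nested[[1]], collapse = "\n"), "\n")
```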

8.3 match and charmatch

match("xx", c("abc", "xx", "xxx", "xx"))

## [1] 2

match(2, c(3, 1, 2, 4))

## [1] 3

charmatch("xx", "xx")

## [1] 1

charmatch("xx", "xxa")

## [1] 1

charmatch("xx", "axx")

## [1] NA

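match operates on vectors and returns the position of the first matching element (if there is one); it also works on non-character vectors. The charmatch function is a real trap. I am not going to look at the rest; regular expressions are really all you need.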
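
Part of what makes charmatch surprising is that it does partial matching and has three different outcomes. A short sketch with made-up inputs, next to match and %in% for comparison:

```r
haystack <- c("abc", "xx", "xxx", "xx")

# match() gives the first position, or NA if the value is absent
positions <- match(c("xx", "zz"), haystack)
print(positions)  # 2 NA

# %in% only reports whether each value is present
present <- c("xx", "zz") %in% haystack
print(present)  # TRUE FALSE

# charmatch() matches prefixes and distinguishes three cases
unique_hit <- charmatch("med", c("mean", "median"))  # unique partial match
ambiguous  <- charmatch("m",   c("mean", "median"))  # several partial matches
no_hit     <- charmatch("z",   c("mean", "median"))  # no match at all
print(unique_hit)  # 2
print(ambiguous)   # 0
print(no_hit)      # NA
```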

 

 

 

=================================

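Reportedly there is also a stringr package, which wraps the existing string functions and unifies their names and arguments. Besides adding features, it handles vectorized data and is compatible with non-character data. The stringr package claims to cut the time spent on string processing by 95%. Some of its main functions are listed below.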

library(stringr)

# concatenate strings
fruit <- c("apple","banana","pear","pineapple")
res <- str_c(1:4,fruit,sep=' ',collapse=' ')
str_c('I want to buy ',res,collapse=' ')

# string length
str_length(c("i","like","programming R",123,res))

# extract substrings by position
str_sub(fruit,1,3)
# assign to a substring
capital <-toupper(str_sub(fruit,1,1))
str_sub(fruit,rep(1,4),rep(1,4))<- capital

# repeat strings
str_dup(fruit,c(1,2,3,4))

# pad with whitespace
str_pad(fruit,10,"both")
# trim whitespace
str_trim(fruit)

# test whether a regular expression matches
str_detect(fruit,"a$")
str_detect(fruit,"[aeiou]")

# locate the position of a match
str_locate(fruit,"a")

# extract the matching part
str_extract(fruit,"[a-z]+")
str_match(fruit,"[a-z]+")

# replace the matching part
str_replace(fruit,"[aeiou]","-")

# split
str_split(res," ")
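
For comparison, several of these calls have close base R counterparts, so the same results can be obtained without installing the package. The sketch below spells out its own fruit vector so that it runs on its own; strrep and trimws are assumed to be available, which is the case in reasonably recent versions of R.

```r
fruit <- c("apple", "banana", "pear", "pineapple")

joined    <- paste(1:4, fruit, sep = " ", collapse = " ")  # like str_c()
lengths   <- nchar(fruit)                                  # like str_length()
prefixes  <- substr(fruit, 1, 3)                           # like str_sub()
repeated  <- strrep(fruit, 1:4)                            # like str_dup()
trimmed   <- trimws("  pear  ")                            # like str_trim()
ends_in_a <- grepl("a$", fruit)                            # like str_detect()
replaced  <- sub("[aeiou]", "-", fruit)                    # like str_replace()

print(joined)       # "1 apple 2 banana 3 pear 4 pineapple"
print(lengths)      # 5 6 4 9
print(prefixes)     # "app" "ban" "pea" "pin"
print(repeated[2])  # "bananabanana"
print(trimmed)      # "pear"
print(ends_in_a)    # FALSE TRUE FALSE FALSE
print(replaced)     # "-pple" "b-nana" "p-ar" "p-neapple"
```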
