Word Cloud - R

Date: 2019-11-08

In earlier posts I covered how to build word clouds with Python and JavaScript; today's topic is R. In fact, the Word Cloud extension templates for SPSS and SAS are both implemented on top of R.

>> Create Word Cloud via R

1) Prepare the text.

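Once again we use the Word Cloud History.txt file saved last time, so that at the end we can compare the word clouds produced by the different methods. (OK, honestly it is mostly laziness, so let's just keep using it.)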

2) Install and load the required R packages.

# Install
install.packages("tm")  # for text mining
install.packages("wordcloud") # word-cloud generator 
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("wordcloud")
library("RColorBrewer")
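
Calling install.packages() on every run is unnecessary once the packages are already present. The following is a minimal sketch of installing only what is missing; the helper name missing_packages is hypothetical (not part of any package), and the CRAN mirror URL is an assumption that can be swapped for any other mirror.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

needed <- c("tm", "wordcloud", "RColorBrewer")
to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  # The mirror URL is an assumption; any CRAN mirror works.
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

After this, the library() calls above can be run as usual.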

3) Read the text data. Once the data has been read, we can use inspect() to check whether the text was loaded successfully.

# Read the text file (file.choose() opens a file picker)
text <- readLines(file.choose())
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
# Inspect the first 10 documents
# inspect(docs[1:10])
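
Because file.choose() needs an interactive session, it does not suit a script run unattended. A minimal sketch of reading from a fixed path instead is shown below; the sample file is generated on the fly purely for illustration, and the UTF-8 encoding is an assumption about the input file.

```r
# Write a tiny sample file so the example is self-contained;
# in practice `path` would point to your own text file.
path <- tempfile(fileext = ".txt")
writeLines(c("Word clouds show word frequency.",
             "Bigger words are more frequent."), path)

# Read the file line by line; UTF-8 is an assumption about the file.
text <- readLines(path, encoding = "UTF-8", warn = FALSE)
length(text)  # number of lines read
```

The resulting text vector can be passed to Corpus(VectorSource(text)) exactly as in the step above.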

4) Clean the data. We use the tm_map() function to convert the text to lower case and to remove extra whitespace, common stopwords, and so on.

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
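
removePunctuation deletes characters without inserting a space, so a token such as "tag/word" would become "tagword". One possible way to avoid this is a custom transformation built with content_transformer() and run before the punctuation step. The sketch below is an illustration only: the name to_space and the sample strings are made up, and VCorpus is used (instead of Corpus as above) so that each document can be read back with content().

```r
library(tm)

# Tiny sample corpus, made up for illustration.
docs <- VCorpus(VectorSource(c("tag/word cloud", "user@example.com")))

# Custom transformation: replace every match of `pattern` with a space.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, to_space, "/")
docs <- tm_map(docs, to_space, "@")

content(docs[[1]])  # text of the first document after the replacements
```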

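5) Build a matrix from the text data that stores the words and their frequencies. The TermDocumentMatrix() function used here comes from the tm (text mining) package. After the conversion, we can use head() to look at the data.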

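# Build the term-document matrix from the cleaned corpus
dtm <- TermDocumentMatrix(docs)
# Convert it into a matrix format
m <- as.matrix(dtm)
# Get the frequency of every word, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Look at the top 10 words
# head(d, 10)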
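
To make the shape of this data concrete, here is a small sketch on a two-document corpus made up for illustration. Rows of the matrix are terms, columns are documents, and rowSums() adds up each term's count across documents. Note that TermDocumentMatrix() by default keeps only terms of at least three characters, so short words such as "in" and "r" do not appear.

```r
library(tm)

# Two made-up documents, for illustration only.
docs <- Corpus(VectorSource(c("word cloud in r", "word cloud in python")))

dtm <- TermDocumentMatrix(docs)   # terms in rows, documents in columns
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)   # total count per term
d <- data.frame(word = names(v), freq = v)
d
```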

6) Generate the word cloud.

wordcloud(words = d$word, freq = d$freq, scale=c(5,0.5), min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Accent"))
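
The layout of the cloud involves random placement, so two runs can look different. If the image is meant to be compared with clouds from other tools, it may help to fix the random seed and write the result to a file. The sketch below is one way to do that; the frequencies, the seed value, and the output path are all assumptions made for illustration.

```r
library(wordcloud)
library(RColorBrewer)

# Made-up frequencies for illustration; in the steps above this is `d`.
d <- data.frame(word = c("cloud", "word", "text", "data", "plot"),
                freq = c(10, 8, 5, 3, 2))

out <- file.path(tempdir(), "wordcloud.png")  # output path is an assumption
png(out, width = 800, height = 800)
set.seed(1234)  # fix the random layout so reruns give the same picture
wordcloud(words = d$word, freq = d$freq, scale = c(5, 0.5), min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Accent"))
dev.off()  # close the device so the file is written
```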