盼望着,盼望着,《復聯3》終於在國內上映。《復仇者聯盟:無限戰爭》的表現也不負衆望,國內上映3天后票房即達12億元,目前豆瓣評分爲8.5。git
不用說你也知道,「復仇者聯盟」裏每一個成員都性格迥異,因此說話用詞都有各自鮮明的特色。那他們說話都愛用哪些詞兒?windows
國外有幾位漫威的鐵桿粉絲把每一個復仇者的說話習慣用 R 語言可視化了出來,圖中每一個詞對應的條形長度,表明了他比其餘復仇者更愛說這個詞的程度。api
咱們能夠看到,美隊老愛喊別人名字,特別是託尼(emmmmmm...);黑豹常常唸叨一些很高大上的詞(好比朋友,國王),不像蜘蛛俠,滿嘴嗯啊個不停(好比嘿,啊,呃),還跟個孩子似的;浩克和鷹眼說的最多的是黑寡婦,不過兩人喊得稱呼卻不一樣(緣由你猜);幻視和緋紅女巫頗有共同話題,因此這是倆人互生愛慕的緣由?果真,雷神唸叨最多的仍是老弟洛基,並且總是想着「宇宙大事」,說的話都和第三部《無限戰爭》緊密相關;至於洛基嘛,意料之中的常常嗶嗶「權力」「王位」這些,可是跟洛基同樣也渴望權力的奧創卻說話不同,人家說的詞就頗有詩意。bash
這麼有意思的可視化圖形是怎麼作出來的呢?祕笈以下:app
首先咱們會用到如下 R 語言包:ide
library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)
複製代碼
有些人可能認爲使用「清除全部」代碼行很很差,可是在腳本頂部用它能夠確保在執行腳本時,腳本不會依賴不當心遺留在工做區內的任何對象。函數
rm(list = ls())
複製代碼
這是包含全部復仇者圖像的文件夾:字體
dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)
複製代碼
設置字體ui
windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))
複製代碼
character_names <- c("black_panther","black_widow","bucky","captain_america",
"falcon","hawkeye","hulk","iron_man",
"loki","nick_fury","rhodey","scarlet_witch",
"spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")
複製代碼
讀取和簡化版復仇者名字對應的圖像文件的函數this
read_image <- function(filename){
char_name <- gsub(pattern = "\\.jpg$", "", filename)
img <- jpeg::readJPEG(filename)
return(img)
}
複製代碼
將全部圖像讀取爲一個列表
all_images <- lapply(image_filenames, read_image)
複製代碼
爲這列圖像分配名字,這樣後面就能夠被字符檢索到了
names(all_images) <- character_names
複製代碼
其實使用圖像名字很簡單,好比下面這個例子
# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))
複製代碼
獲取文本數據 這幾位漫威粉並無將他們本身的電影臺詞數據集分享出來,不過咱們能夠在 IMSDB 上下載,而後用文本分析技術稍做處理。若是原做者後面將本身的數據集公開,咱們會第一時間分享。
加載本地數據集。
修正人物名字的大小寫
capitalize <- Vectorize(function(string){
substr(string,1,1) <- toupper(substr(string,1,1))
return(string)
})
proper_noun_list <- c("clint","hydra","steve","tony",
"sam","stark","strucker","nat","natasha",
"hulk","tesseract", "vision",
"loki","avengers","rogers", "cap", "hill")
# Run the capitalization function
word_data <- word_data %>%
mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
mutate(word = ifelse(word == "jarvis", "JARVIS", word))
複製代碼
注意前面的簡化版人物名字,不要匹配文本數據框中已經處理好格式的人物名字。
unique(word_data$Speaker)
## [1] "Black Panther" "Black Widow" "Bucky"
## [4] "Captain America" "Falcon" "Hawkeye"
## [7] "Hulk" "Iron Man" "Loki"
## [10] "Nick Fury" "Rhodey" "Scarlet Witch"
## [13] "Spiderman" "Thor" "Ultron"
## [16] "Vision"
複製代碼
製做一個查詢表,將簡寫的文件名轉換爲美觀的人物名字
character_labeler <- c(`black_panther` = "Black Panther",
`black_widow` = "Black Widow",
`bucky` = "Bucky",
`captain_america` = "Captain America",
`falcon` = "Falcon", `hawkeye` = "Hawkeye",
`hulk` = "Hulk", `iron_man` = "Iron Man",
`loki` = "Loki", `nick_fury` = "Nick Fury",
`rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
`spiderman`="Spiderman", `thor`="Thor",
`ultron` ="Ultron", `vision` ="Vision")
複製代碼
得到兩個不一樣版本的人物名字
其中一個版本用來展現(由於美觀),另外一個版本用於簡單的組織和引用圖像文件(由於簡單)。
convert_pretty_to_simple <- Vectorize(function(pretty_name){
# pretty_name = "Vision"
simple_name <- names(character_labeler)[character_labeler==pretty_name]
# simple_name <- as.vector(simple_name)
return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
# simple_name = "vision"
pretty_name <- character_labeler[simple_name] %>% as.vector()
return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision","black_panther"))
## [1] "Vision" "Black Panther"
複製代碼
爲文本數據框添加簡化版人物名字。
word_data$character <- convert_pretty_to_simple(word_data$Speaker)
複製代碼
爲每一個人物分配一個主要顏色。
character_palette <- c(`black_panther` = "#51473E",
`black_widow` = "#89B9CD",
`bucky` = "#6F7279",
`captain_america` = "#475D6A",
`falcon` = "#863C43", `hawkeye` = "#84707F",
`hulk` = "#5F5F3F", `iron_man` = "#9C2728",
`loki` = "#3D5C25", `nick_fury` = "#838E86",
`rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
`spiderman`="#A23A37", `thor`="#323D41",
`ultron` ="#64727D", `vision` ="#81414F" )
複製代碼
製做水平方向的條形圖
avengers_bar_plot <- word_data %>%
group_by(Speaker) %>%
top_n(5, amount) %>%
ungroup() %>%
mutate(word = reorder(word, amount)) %>%
ggplot(aes(x = word, y = amount, fill = character))+
geom_bar(stat = "identity", show.legend = FALSE)+
scale_fill_manual(values = character_palette)+
scale_y_continuous(name ="Log Odds of Word",
breaks = c(0,1,2)) +
theme(text = element_text(family = "Franklin"),
# axis.title.x = element_text(size = rel(1.5)),
panel.grid = element_line(colour = NULL),
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white",
colour = "white"))+
# theme(strip.text.x = element_text(size = rel(1.5)))+
xlab("")+
coord_flip()+
facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot
複製代碼
看起來很不錯。
可是咱們想畫個更酷炫的圖:用每一個復仇者的照片來填充條形圖。
也就是說咱們只在條形圖區域內展現出復仇者的照片,在條形區域之外的地方則不展現(以下圖所示)。
若是想作到這點,咱們須要顯示一個透明的條形,而後在條形的末尾畫一個白色的條形,延伸至圖像邊緣覆蓋人物照片的剩餘部分。
在數據框部分,咱們如今想用所需的值的餘數來補充數字值,以實現總體最大化,這樣當把值和餘數相加時,全部數值都會增長到同一最大數值,以一樣的格式將不一樣行組合到一塊兒。
max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
複製代碼
只提取每一個復仇者說的最多的5個詞
word_data_top5 <- word_data %>%
group_by(character) %>%
arrange(desc(amount)) %>%
slice(1:5) %>%
ungroup()
複製代碼
將數量&餘數轉換爲長格式
這樣能保證每一個人物和所說詞語的匹配關係有兩個 entry,一個用以真實數量(「amount」),一個用以選擇在哪裏結束,達到常見的最大值(「remainder」)。
這會將「amount」和「remainder」重疊爲一個單獨的列稱爲「variable」,表示是什麼值,而另外一個列「value」包含來自這些值中每個值的數字。
word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))
複製代碼
Variable 是一個值是真實數量仍是補充數量的標記。
如今咱們按順序將它們放在一塊兒,和在melt函數中的肯定它們的順序相反。不然「amount」和「remainder」會以相反的順序展示在圖形中。
word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
levels = rev(levels(word_data_top5_m$variable)))
複製代碼
以簡單的形式聲明人物名字,好比用 black_panther 而不是 Black Panther。
plot_char <- function(character_name){
# example: character_name = "black_panther"
# plot details that we might want to fiddle with
# thickness of lines between bars
bar_outline_size <- 0.5
# transparency of lines between bars
bar_outline_alpha <- 0.25
#
# The function takes the simple character name,
# but here, we convert it to the pretty name,
# because we'll want to use that on the plot.
pretty_character_name <- convert_simple_to_pretty(character_name)
# Get the image for this character,
# from the list of all images.
temp_image <- all_images[character_name]
# Make a data frame for only this character
temp_data <- word_data_top5_m %>%
dplyr::filter(character == character_name) %>%
mutate(character = character_name)
# order the words by frequency
# First, make an ordered vector of the most common words
# for this character
ordered_words <- temp_data %>%
mutate(word = as.character(word)) %>%
dplyr::filter(variable == "amount") %>%
arrange(value) %>%
`[[`(., "word")
# order the words in a factor,
# so that they plot in this order,
# rather than alphabetical order
temp_data$word = factor(temp_data$word, levels = ordered_words)
# Get the max value,
# so that the image scales out to the end of the longest bar
max_value <- max(temp_data$value)
fill_colors <- c(`remainder` = "white", `value` = "white")
# Make a grid object out of the character's image
character_image <- rasterGrob(all_images[[character_name]],
width = unit(1,"npc"),
height = unit(1,"npc"))
# make the plot for this character
output_plot <- ggplot(temp_data)+
aes(x = word, y = value, fill = variable2)+
# add image
# draw it completely bottom to top (x),
# and completely from left to the the maximum log-odds value (y)
# note that x and y are flipped here,
# in prep for the coord_flip()
annotation_custom(character_image,
xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
size = bar_outline_size, width = 1)+
scale_fill_manual(values = fill_colors)+
theme_classic()+
coord_flip(expand = FALSE)+
# use a facet strip,
# to serve as a title, but with color
facet_grid(. ~ character, labeller = labeller(character = character_labeler))+
# figure out color swatch for the facet strip fill
# using character name to index the color palette
# color= NA means there's no outline color.
theme(strip.background = element_rect(fill = character_palette[character_name],
color = NA))+
# other theme elements
theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
text = element_text(family = "Franklin"),
legend.position = "none",
panel.grid = element_blank(),
axis.text.x = element_text(size = rel(0.8)))+
# omit the axis title for the individual plot,
# because we'll have one for the entire ensemble
theme(axis.title = element_blank())
return(output_plot)
}
複製代碼
plot_x_axis_text <- paste("Tendency to use this word more than other characters do",
"(units of log odds ratio)", sep = "\n")
複製代碼
下面是函數在這裏的工做示例
sample_plot <- plot_char("black_panther")+
theme(axis.title = element_text())+
# x lab is still declared as y lab
# because of coord_flip()
ylab(plot_x_axis_text)
sample_plot
複製代碼
由於隨着數字增大,差別也會隨之增大(具體數學知識這裏再也不講述);將它們轉換爲對數尺度,能夠約束變化幅度的大小,方便咱們在屏幕上展現。
若是想將這些對數差別轉化爲簡單的機率形式,能夠用以下函數:
logit2prob <- function(logit){
odds <- exp(logit)
prob <- odds / (1 + odds)
return(prob)
}
複製代碼
這樣處理後水平軸會以下所示:
logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418
複製代碼
注意此序列中連續項目之間的差別在慢慢消失:
diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474
複製代碼
Okay,如今咱們製做出了一個圖···
咱們接着將函數應用到列表中全部復仇者身上,將全部繪圖放入一個列表對象。
all_plots <- lapply(character_names, plot_char)
複製代碼
不只僅是文本,還有其它畫出的信息。
你能夠選擇提取 X 軸名稱仍是 Y 軸名稱:
get_axis_grob <- function(plot_to_pick, which_axis){
# plot_to_pick <- sample_plot
tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
# tmp$grobs
# find the grob that looks like
# it would be the x axis
axis_x_index <- which(sapply(tmp$grobs, function(x){
# for all the grobs,
# return the index of the one
# where you can find the text
# "axis.title.x" or "axis.title.y"
# based on input argument `which_axis`
grepl(paste0("axis.title.",which_axis), x)}
))
axis_grob <- tmp$grobs[[axis_x_index]]
return(axis_grob)
}
複製代碼
px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")
複製代碼
下面是使用這些提取的軸的方法:
grid.newpage()
grid.draw(px_axis_x)
複製代碼
將全部繪圖排成一個對象
big_plot <- arrangeGrob(grobs = all_plots)
複製代碼
將 X 軸嵌入繪圖的底部,由於每一個圖並無 X 軸,而咱們想讓它們都有 X 軸。
注意這時繪圖會看着很不協調,高度差很少是寬度的十倍。
big_plot_w_x_axis_title <- arrangeGrob(big_plot,
px_axis_x,
heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)
複製代碼
繪圖所佔的空間大小不一,由於每一個圖的詞彙長度不一樣。
這樣看起來有些混亂。
一般咱們會用 facet_grid() 或 facet_wrap() 來確保繪圖整潔有序,但這裏卻不能使用由於每一個圖的背景圖各不相同,沒法像數據框中的其它列同樣映射到平面上(由於背景圖像實際上並不是數據框的一部分)。
這樣繪圖的軸會垂直對齊:
big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)
複製代碼
和以前同樣,將X軸名稱添加至繪圖對齊後網格的下方。
big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned,
px_axis_x,
heights = c(10,1))
複製代碼
下面是將總體效果圖繪製在屏幕上的方法:
grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)
複製代碼
很好!
保存最終圖像:
ggsave(big_plot_w_x_axis_title_aligned,
file = "Avengers_Word_Usage.png",
width = 12, height = 6.3)
複製代碼
這樣,咱們就可視化出了《復聯》中各個復仇者都最愛說那些話!