2016年總統選舉的預測

ASA的美國總統競選

在這個大選之年,美國統計協會(ASA)將學生競賽和總統選舉放在一塊兒,將學生預測誰是2016年總統大選的贏家準確的百分比做爲比賽點。詳情見:html

 http://thisisstatistics.org/electionprediction2016/git

獲取數據

互聯網上有不少公開的民調數據。能夠下面的網站獲取總統大選的相關數據:github

http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/app

其餘較好的數據源是:ide

http://www.realclearpolitics.com/epolls/latest_polls/post

http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton網站

http://www.gallup.com/products/170987/gallup-analytics.aspx)ui

值得注意的是:數據是天天更新的,因此你在看本文的時候極可能數據變化而獲得不一樣的結果。this

由於原始的數據是JSON文件,R拉取下來將其做爲了lists中的一個list(列表)。url

原文的Github地址:https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd

複製代碼
##載入須要的包
require(XML)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
require(RCurl)
require(ggplot2)
require(lubridate)
require(RJSONIO)


##數據拉取

url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url, useInternalNodes = TRUE) #爬取網頁內容

sc = xpathSApply(doc, 
                 "//script[contains(., 'race.model')]", 
                 function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))

jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)

data = fromJSON(jsobj)
allpolls <- data$polls

#unlisting the whole thing
indx <- sapply(allpolls, length)
pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, 'length<-', max(indx))))

##數據清洗
#unlisting the weights
pollswt <- as.data.frame(t(as.data.frame(do.call(cbind, 
                                                 lapply(pollsdf$weight, 
                                                       data.frame, 
                                                       stringsAsFactors=FALSE)))))
names(pollswt) <- c("wtpolls", "wtplus", "wtnow")
row.names(pollswt) <- NULL

pollsdf <- cbind(pollsdf, pollswt)

#unlisting the voting
indxv <- sapply(pollsdf$votingAnswers, length)
pollsvot <- as.data.frame(do.call(rbind, lapply(pollsdf$votingAnswers,
                                                'length<-', max(indxv))))
pollsvot1 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V1, data.frame,
                                                       stringsAsFactors=FALSE))))
pollsvot2 <- rbind(as.data.frame(do.call(rbind, lapply(pollsvot$V2, data.frame,
                                                       stringsAsFactors=FALSE))))

pollsvot1 <- cbind(polltype = rownames(pollsvot1), pollsvot1, 
                   polltypeA = gsub('[0-9]+', '', rownames(pollsvot1)),
                   polltype1 = extract_numeric(rownames(pollsvot1)))

pollsvot1$polltype1 <- ifelse(is.na(pollsvot1$polltype1), 1, pollsvot1$polltype1 + 1)


pollsvot2 <- cbind(polltype = rownames(pollsvot2), pollsvot2, 
                   polltypeA = gsub('[0-9]+', '', rownames(pollsvot2)),
                   polltype1 = extract_numeric(rownames(pollsvot2)))

pollsvot2$polltype1 <- ifelse(is.na(pollsvot2$polltype1), 1, pollsvot2$polltype1 + 1)


pollsdf <- pollsdf %>% 
  mutate(population = unlist(population), 
         sampleSize = as.numeric(unlist(sampleSize)), 
         pollster = unlist(pollster), 
         startDate = ymd(unlist(startDate)),
         endDate = ymd(unlist(endDate)), 
         pollsterRating = unlist(pollsterRating)) %>%
  select(population, sampleSize, pollster, startDate, endDate, pollsterRating,
         wtpolls, wtplus, wtnow)

allpolldata <- cbind(rbind(pollsdf[rep(seq_len(nrow(pollsdf)), each=3),],
                           pollsdf[rep(seq_len(nrow(pollsdf)), each=3),]), 
                     rbind(pollsvot1, pollsvot2))

allpolldata <- allpolldata %>%
  arrange(polltype1, choice) 
複製代碼

查看全部的選擇數據:allolldata

 

快速可視化

在找出2016年美國總統競選的預測選票比例以前,簡單的查看數據是很是有必要的。數據集已經整理好了,使用ggplot2包對其進行可視化(選取2016年8月之後的數據,x軸爲endDate,y軸爲adj_pct,顏色根據choice也就是兩種顏色克林頓和希拉里,並根據wtnow設置點的大小):

複製代碼
##快速可視化
ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))), 
       aes(y=adj_pct, x=endDate, color=choice)) + 
  geom_line() + geom_point(aes(size=wtnow)) + 
  labs(title = "Vote percentage by date and poll weight\n", 
       y = "Percent Vote if Election Today", x = "Poll Date", 
       color = "Candidate", size="538 Poll\nWeight")
複製代碼

快速分析

考慮到每位候選人的選票比例會基於當前投票的票數百分比,因此,必須基於538人(樣本容量samplesize)的想法(投票舉動)和投票關閉天數(day sine poll)進行選票權重設置。權重的計算公式以下:

 

使用計算出的權重,我將計算被預測選票百分比的加權平均和其標準誤差(SE)。標準誤差(SE)計算公式來自 Cochran (1977) 。

複製代碼
##快速分析

# 參考文獻
# code found at http://stats.stackexchange.com/questions/25895/computing-standard-error-in-weighted-mean-estimation
# cited from http://www.cs.tufts.edu/~nr/cs257/archive/donald-gatz/weighted-standard-error.pdf
# Donald F. Gatz and Luther Smith, "THE STANDARD ERROR OF A WEIGHTED MEAN CONCENTRATION-I. BOOTSTRAPPING VS OTHER METHODS"

weighted.var.se <- function(x, w, na.rm=FALSE)
  #  Computes the variance of a weighted mean following Cochran 1977 definition
{
  if (na.rm) { w <- w[i <- !is.na(x)]; x <- x[i] }
  n = length(w)
  xWbar = weighted.mean(x,w,na.rm=na.rm)
  wbar = mean(w)
  out = n/((n-1)*sum(w)^2)*(sum((w*x-wbar*xWbar)^2)-2*xWbar*sum((w-wbar)*(w*x-wbar*xWbar))+xWbar^2*sum((w-wbar)^2))
  return(out)
}

# 計算累計平均和加權平均值Cumulative Mean / Weighted Mean
allpolldata2 <- allpolldata %>%
  filter(wtnow > 0) %>%
  filter(polltypeA == "now") %>%
  mutate(dayssince = as.numeric(today() - endDate)) %>%
  mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
  mutate(votewt = wt*pct) %>%
  group_by(choice) %>%
  arrange(choice, -dayssince) %>%
  mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
  mutate(cum.mean = cummean(pct))
View(allpolldata2 )
複製代碼

 

可視化累計平均和加權平均值

複製代碼
##繪製累計平均/加權平均Cumulative Mean / Weighted Mean
# 累計平均
ggplot(subset(allpolldata2, ( endDate > ymd("2016-01-01"))), 
       aes(y=cum.mean, x=endDate, color=choice)) + 
  geom_line() + geom_point(aes(size=wt)) + 
  labs(title = "Cumulative Mean Vote Percentage\n", 
       y = "Cumulative Percent Vote if Election Today", x = "Poll Date", 
       color = "Candidate", size="Calculated Weight")

# 加權平均
ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))), 
       aes(y=cum.mean.wt, x=endDate, color=choice)) + 
  geom_line() + geom_point(aes(size=wt)) + 
  labs(title = "Cumulative Weighted Mean Vote Percentage\n", 
       y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date", 
       color = "Candidate", size="Calculated Weight")
複製代碼

選票百分比預測

 此外,加權平均和平均的標準誤差(科克倫(1977))能夠對每一個候選人進行計算。使用這個公式,咱們能夠預測主要候選人的最後的百分比!

複製代碼
pollsummary <- allpolldata2 %>% 
  select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
  group_by(choice) %>%
  summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),
            std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))

pollsummary

## # A tibble: 2 x 3
##     choice mean.vote  std.vote
##      <chr>     <dbl>     <dbl>
## 1 Clinton  43.48713 0.5073771
## 2   Trump  38.95760 1.0717574
複製代碼

 顯然,主要的候選人是克林頓和希拉里,克林頓的選票平均百分比高於希拉里,而且其標準誤差小於希拉里,也就是說其選票變化穩定,最後勝出的極可能就是克林頓,可是按照希拉里的變化波動大,也不排除希拉里獲勝的可能。能夠看到希拉里的選票比例最高曾達到51%。

 原文連接:https://www.r-statistics.com/2016/08/presidential-election-predictions-2016/

 本文連接:http://www.cnblogs.com/homewch/p/5811945.html

相關文章
相關標籤/搜索