SA: Sentiment Analysis Resources (Corpus, Dictionary)

To start, the following is excerpted mainly from a Chinese-language survey:
http://wenku.baidu.com/view/0c33af946bec0975f465e277.htmlhtml

 



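4.2 Building Resources for Sentiment Analysis

4.2.1 Corpora for Sentiment Analysis

Besides the corpora provided by the three international/domestic evaluations described in Section 4.1, many research groups and individuals have also released corpora of a reasonable size.

1. The movie review datasets from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/): built from movie reviews, with 1,000 positive and 1,000 negative documents. They also include 5,331 sentences each labeled with positive and negative polarity, and 5,000 sentences each labeled as subjective and objective. These datasets are now widely used for sentiment analysis research at every granularity: word, sentence, and document level.

2. The product review corpus from Hu and Liu at the University of Illinois at Chicago (UIC): mainly online reviews of five electronic products downloaded from Amazon and Cnet (digital cameras of two brands, a mobile phone, an MP3 player, and a DVD player). The corpus is annotated sentence by sentence with the opinion targets and with the polarity and strength of each opinion sentence. It is therefore suited to research on opinion target extraction, sentence-level subjectivity detection, and sentiment classification. Liu has also contributed a corpus for research on comparative sentences [74].

3. The MPQA (Multiple-Perspective QA) corpus developed by Janyce Wiebe et al.: 535 news commentary articles written from different perspectives, with deep annotation. For each clause, annotators manually labeled sentiment information such as the opinion holder, the opinion target, the subjective expression, and its polarity and intensity. Reference [75] describes the full annotation process. The MPQA corpus is suited to tasks in the news commentary domain.

4. The multi-aspect restaurant review corpus built by Barzilay et al. at the Massachusetts Institute of Technology (MIT): 4,488 reviews, each rated on a scale of 1 to 5 for five aspects (food, ambience, service, price, overall experience). This corpus provides a research platform for single-document, aspect-based sentiment summarization.

5. The fairly large Chinese hotel review corpus provided by Dr. Songbo Tan of the Institute of Computing Technology, Chinese Academy of Sciences: about 10,000 reviews labeled as positive or negative, which offers a platform for document-level sentiment classification in Chinese.

4.2.2 Lexicon Resources for Sentiment Analysis

Sentiment analysis has matured to the point where earlier researchers have compiled many sentiment resources, most of them in the form of opinion-word lexicons.

1. The GI (General Inquirer) opinion-word lexicon (English, http://www.wjh.harvard.edu/~inquirer/). It collects 1,914 positive words and 2,293 negative words, and tags each word by polarity, intensity, part of speech, and so on, which makes it flexible to apply in sentiment analysis tasks.

2. The NTU opinion-word lexicon (Traditional Chinese). Collected by National Taiwan University, it contains 2,812 positive words and 8,276 negative words [76].

3. The subjectivity lexicon (English, http://www.cs.pitt.edu/mpqa/). Its subjective words come from the OpinionFinder system. It contains 8,221 subjective words, each annotated with part of speech, stemming status, and sentiment polarity.

4. The HowNet opinion-word lexicon (Simplified Chinese and English, http://www.keenage.com/html/e_index.html). It contains 9,193 Chinese opinion words/phrases and 9,142 English opinion words/phrases, divided into positive and negative classes. Because it also covers opinion phrases, it offers a richer sentiment resource.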
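As a rough illustration of how a positive/negative word lexicon like the ones above is commonly put to use, here is a minimal sketch of word-counting polarity scoring. The tiny word sets and the function name `lexicon_polarity` are made up for this example; a real experiment would load the entries of GI, HowNet, or another lexicon instead, and would usually need better tokenization.

```python
import re

# Toy stand-ins for a real lexicon (hypothetical entries, for illustration only).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def lexicon_polarity(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (#positive hits - #negative hits) over lowercase word tokens.

    A result above 0 suggests positive text, below 0 negative, 0 undecided.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_hits = sum(1 for t in tokens if t in positive)
    neg_hits = sum(1 for t in tokens if t in negative)
    return pos_hits - neg_hits


if __name__ == "__main__":
    print(lexicon_polarity("Great camera, excellent lens, but poor battery."))
```

Plain counting ignores negation and intensity, which is exactly the information the more richly annotated resources try to capture.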



And here is what I summarized last time:
http://site.douban.com/204776/widget/notes/12599608/note/284723117/
##Datasets for SA:
###Lexicons:
[1]
The General Inquirer Lexicon
•Homepage: http://www.wjh.harvard.edu/~inquirer
•Categories
–Positive (1,915 words) and Negative (2,291 words)
–Strong vs Weak, Active vs Passive, Overstated versus Understated
–Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
•Free for research use
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

[2]
LIWC (Linguistic Inquiry and Word Count)
•Homepage: http://www.liwc.net/
•2,300 words, > 70 classes
–Affective Processes
•negative emotion (bad, weird, hate, problem, tough)
•positive emotion (love, nice, sweet)
–Cognitive Processes
•Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
–Pronouns, Negation (no, never), Quantifiers (few, many)
•$30 or $90 fee
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007.


[3]
MPQA Subjectivity Cues Lexicon
•Homepage: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
•6,885 words from 8,221 lemmas
–2,718 positive
–4,912 negative
•Each word annotated for intensity (strong, weak)
•GNU GPL
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003
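A minimal sketch of reading such a lexicon into a dictionary, assuming each entry is one line of space-separated `key=value` pairs with fields named `type`, `word1`, and `priorpolarity`. That layout is my assumption about the distributed file, so check the README that ships with the lexicon. The two sample lines and the helper names `parse_clue` and `load_prior_polarity` are illustrative only.

```python
def parse_clue(line):
    """Parse one 'key=value key=value ...' line into a dict."""
    entry = {}
    for field in line.split():
        if "=" in field:
            key, value = field.split("=", 1)
            entry[key] = value
    return entry


def load_prior_polarity(lines):
    """Map each word to (type, priorpolarity).

    If a word appears more than once, the later entry overwrites the earlier one.
    """
    lexicon = {}
    for line in lines:
        clue = parse_clue(line)
        if "word1" in clue:
            lexicon[clue["word1"]] = (clue.get("type"), clue.get("priorpolarity"))
    return lexicon


# Illustrative sample lines in the assumed layout (not copied from the real file).
SAMPLE = [
    "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative",
    "type=strongsubj len=1 word1=admire pos1=verb stemmed1=y priorpolarity=positive",
]

if __name__ == "__main__":
    for word, info in load_prior_polarity(SAMPLE).items():
        print(word, info)
```

Keying on the word alone throws away the part-of-speech distinction; keying on `(word1, pos1)` would preserve it, at the cost of needing a tagger at lookup time.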


[4]
Opinion Lexicon
•Homepage: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
•6,786 words
–2,006 positive
–4,783 negative
•Bing Liu's Page on Opinion Mining
Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004



[5]
SentiWordNet
•Homepage: http://sentiwordnet.isti.cnr.it/
•All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness
–[estimable(J,3)] "may be computed or estimated"
•Pos 0 Neg 0 Obj 1
–[estimable(J,1)] "deserving of respect or high regard"
•Pos .75 Neg 0 Obj .25
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010.

Sentiment Classification of Reviews Using SentiWordNet
http://arrow.dit.ie/cgi/viewcontent.cgi?article=1000&context=ittpapnin
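The Pos/Neg/Obj triples in the SentiWordNet entry above sum to 1 for each synset, so the objectivity score can be recovered from the other two. A small sketch using the `estimable` values quoted above (the function names are my own, not part of any SentiWordNet tooling):

```python
def objectivity(pos_score, neg_score):
    """Obj = 1 - (Pos + Neg), since the three scores of a synset sum to 1."""
    obj = 1.0 - (pos_score + neg_score)
    if obj < 0:
        raise ValueError("Pos + Neg must not exceed 1")
    return obj


def net_polarity(pos_score, neg_score):
    """A simple signed summary of a synset: positive minus negative score."""
    return pos_score - neg_score


if __name__ == "__main__":
    # estimable(J,1): "deserving of respect or high regard"
    print(objectivity(0.75, 0.0), net_polarity(0.75, 0.0))
    # estimable(J,3): "may be computed or estimated"
    print(objectivity(0.0, 0.0), net_polarity(0.0, 0.0))
```

Because the scores are attached to synsets rather than words, scoring running text also requires deciding which sense of a word is meant, or averaging over its senses.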




###Corpus and Reviews:
[1]
Movie reviews
–Internet Movie Database (IMDb)
http://www.cs.cornell.edu/people/pabo/movie-review-data/
http://reviews.imdb.com/Reviews/
–700 positive / 700 negative

[2]
MOVIEREVIEWSET (Pang and Lee 2004)
[3]
MPQACORPUS (Wiebe et al. 2005)
[4]
PRODUCTREVIEWSET (Yi et al. 2003)

[2]-[4]
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://www.cs.pitt.edu/mpqa/
http://ai.stanford.edu/~amaas/data/sentiment
http://people.csail.mit.edu/jrennie/20Newsgroups

[5]
BOOKREVIEWSET (Aue and Gamon, 2005)
[6]
SENTENCESET (Kim and Hovy 2004)

[7]
The J.D. Power and Associates Sentiment Corpus
http://verbs.colorado.edu/jdpacorpus/
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. They have been manually annotated for named, nominal, and pronominal mentions of entities. Entities are marked with the aggregate sentiment expressed toward them in the document. Mentions of each entity are marked as co-referential. Mentions are assigned semantic types consisting of the Automatic Content Extraction (ACE) mention types and additional domain-specific types. Meronymy (part-of and feature-of) and instance relations are also annotated. Expressions which convey sentiment toward an entity are annotated with the polarity of their prior and contextual sentiments as well as the mentions they target. The following modifiers are annotated; these may target other modifiers or sentiment expressions:

negators (expressions which invert the polarity of a sentiment expression or modifier)
neutralizers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier)
committers (expressions which shift the commitment of the speaker toward the truth of a sentiment expression or modifier)
intensifiers (expressions which shift the intensity of a sentiment expression or modifier)
Additionally, we have annotated when the opinion holder of a sentiment expression is someone other than the author of the blog by linking the expression to the holder. We also annotate when two entities are compared on a particular dimension.
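To make the difference between prior and contextual sentiment concrete, here is a toy sketch of how the modifier types listed above could turn a prior polarity into a contextual one. Representing polarity as a signed number, the factor of 2.0, and the function name `contextual_polarity` are all my assumptions for illustration. The JDPA corpus annotates these modifiers as linked text spans, not as numeric operations, and committers are left out because they shift speaker commitment rather than polarity.

```python
def contextual_polarity(prior, modifiers, intensify_factor=2.0):
    """Apply modifier types, in the given order, to a signed prior polarity.

    prior: positive number for positive sentiment, negative for negative.
    modifiers: sequence of "negator", "intensifier", or "neutralizer".
    """
    score = float(prior)
    for mod in modifiers:
        if mod == "negator":
            score = -score              # invert the polarity
        elif mod == "intensifier":
            score *= intensify_factor   # shift the intensity
        elif mod == "neutralizer":
            score = 0.0                 # speaker is not committed to the sentiment
        else:
            raise ValueError(f"unsupported modifier type: {mod}")
    return score


if __name__ == "__main__":
    # "good" -> "very good" -> "not very good" under this toy model
    print(contextual_polarity(1, ["intensifier", "negator"]))
```

Applying modifiers in a fixed order is a simplification; in the annotation scheme a modifier may target another modifier, so the real structure is closer to a tree than a list.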

The data, organized into training and testing sets, consists of 515 documents (blog posts) covering 330,762 tokens which make up 19,322 sentences. 87,532 mentions, 15,637 sentiment expressions, and 22,662 relations between entities (co-reference groups) are annotated.

Please see the included README file for more information about this data. For a more detailed explanation of the preparation of the corpus, please read The JDPA Sentiment Corpus Annotation Guidelines or The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain.



##Packages and APIs for SA: 
http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
https://sites.google.com/site/miningtwitter/questions/sentiment




##Apps for SA:
Twitrratr
Tweetfeel
Twitter sentiment / Sentiment140
