Summary:
The article Large-Scale Sentiment Analysis for News and Blogs is a study by Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena that assesses news analysis with a particular focus on sentiment. Their method uses a system that assigns a score indicating positive or negative opinion to each distinct entity in the text corpus. The system consists of a sentiment identification phase and a sentiment aggregation and scoring phase. The sentiment identification phase associates the expressed opinions with each relevant entity, while the sentiment aggregation and scoring phase scores each entity relative to others in the same class. The study ends with an evaluation of the scoring techniques over a large corpus of news and blogs.
By building on the Lydia text analysis system, the authors determine the public sentiment on thousands of entities and how that sentiment varies with time.
The sentiment analysis system has three main aspects: Algorithmic Construction of Sentiment Dictionaries, Sentiment Index Formulation, and Evaluation of Significance. The Algorithmic Construction of Sentiment Dictionaries portion of the study involves tracking the reference frequencies of adjectives with positive and negative connotations. The authors present a method that expands small candidate seed lists of positive and negative words into full sentiment lexicons using path-based analysis of synonym and antonym sets in WordNet. Furthermore, the authors use sentiment-alternation hop counts to determine the polarity strength of the candidate terms and eliminate any ambiguous terms. Sentiment Index Formulation involves constructing a statistical index that reflects the significance of sentiment term juxtaposition. Overall entity sentiment is scored using the juxtaposition of sentiment terms and entities together with a frequency-weighted interpolation with world happiness levels. Finally, the Evaluation of Significance element provides statistical evidence of the validity of the sentiment evaluation by correlating the index with real-world events.
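To make the idea of scoring by juxtaposition concrete, here is a minimal sketch in Python. It is not the authors' index: the tiny lexicon, the function name entity_polarity, the sentence-level co-occurrence rule, and the (positive - negative) / total ratio are all assumptions chosen for illustration, and the frequency-weighted interpolation step is left out entirely.

```python
from collections import defaultdict

# Hypothetical toy lexicon (assumption); the paper builds its lexicons algorithmically.
LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def entity_polarity(sentences, entities, lexicon=LEXICON):
    """Score each entity from the sentiment words that share a sentence with it.

    Assumed score: (positive - negative) / (positive + negative), or 0.0
    when no sentiment word co-occurs with the entity.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for sentence in sentences:
        # Crude tokenization: lowercase, strip periods and commas, split on spaces.
        tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
        hits = [lexicon[t] for t in tokens if t in lexicon]
        for entity in entities:
            if entity.lower() in tokens:
                pos[entity] += sum(1 for h in hits if h > 0)
                neg[entity] += sum(1 for h in hits if h < 0)
    scores = {}
    for entity in entities:
        total = pos[entity] + neg[entity]
        scores[entity] = (pos[entity] - neg[entity]) / total if total else 0.0
    return scores

sentences = [
    "Alice had a great game.",
    "Bob made a terrible, bad call.",
    "Alice was good but the weather was bad.",
]
print(entity_polarity(sentences, ["Alice", "Bob"]))
```

In this toy corpus Bob only ever appears next to negative words, so his score sits at the negative extreme, while Alice's mixed sentence pulls her score toward the middle. A real system would also need entity recognition and a notion of which sentiment word attaches to which entity, neither of which is attempted here.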
After presenting the overall structure of the study, the authors include a section describing a method to determine the semantic orientation of words, along with an overview of sentiment analysis systems. The next section focuses on sentiment lexicon generation. The authors define separate lexicons for the seven sentiment dimensions used in the study: general, health, crime, sports, business, politics, and media. The sentiment word generation algorithm used in the study expands a set of seed words through synonym and antonym queries in several steps. First, a polarity is associated with each word, and its synonyms and antonyms are queried. Second, the significance of a path decreases as a function of its length, or depth, from a seed word. The final score of each word is the sum of the scores received over all paths. Two iterations are run on each word. The first iteration calculates a preliminary score estimate, while the second re-enumerates the paths and counts the number of apparent sentiment alternations. Next, WordNet orders the synonyms and antonyms by sense. Overall, the algorithm generates over 18,000 words.
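The expansion step described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the SYNONYMS and ANTONYMS dictionaries are a hypothetical toy graph standing in for WordNet, the function name expand and the parameters max_depth and decay are assumptions, and the second pass that counts sentiment alternations is omitted.

```python
from collections import defaultdict

# Hypothetical toy graph standing in for WordNet links (assumption).
SYNONYMS = {
    "good": ["fine", "decent"],
    "fine": ["okay"],
    "bad": ["poor"],
    "poor": ["awful"],
}
ANTONYMS = {"good": ["bad"], "bad": ["good"]}

def expand(seeds, max_depth=3, decay=0.5):
    """Expand seed words into a scored lexicon.

    seeds maps each seed word to +1 or -1. Synonyms inherit the current
    polarity, antonyms flip it, and a path of length d contributes
    polarity * decay**d. A word's score is the sum over all paths.
    """
    scores = defaultdict(float)

    def walk(word, polarity, depth, path):
        scores[word] += polarity * decay ** depth
        if depth == max_depth:
            return
        for nxt in SYNONYMS.get(word, []):
            if nxt not in path:  # avoid revisiting a word on the same path
                walk(nxt, polarity, depth + 1, path | {nxt})
        for nxt in ANTONYMS.get(word, []):
            if nxt not in path:
                walk(nxt, -polarity, depth + 1, path | {nxt})

    for seed, polarity in seeds.items():
        walk(seed, polarity, 0, frozenset({seed}))
    return dict(scores)

lexicon = expand({"good": 1})
print(lexicon["fine"])   # 0.5
print(lexicon["awful"])  # -0.125
```

Starting from the single positive seed "good", the word "awful" is reached through one antonym link and two synonym links, so it ends up negative but with a small magnitude because it sits three hops away. Words with scores near zero would be natural candidates for the ambiguity filtering the authors describe.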
The sentiment lexicon generation was evaluated in two different ways: using an "un-test" and by comparing the generated sentiment lexicons against the lexicons obtained by Wiebe. To interpret and score the data, the authors used the sentiment lexicons to mark up all the sentiment words and associated entities in the corpus. Finally, a section preceding the conclusion discusses news versus blogs and the significant differences between the results they generated.
Critique:
The article compares sentiment analysis in both news sources and blogs but lacks an intelligence perspective. Although the authors note that the two sources are not directly comparable, since the issues and people discussed in blogs varied considerably from those in newspapers, both analyses are important and deserve their own study. After conducting separate studies, a comparative case study may prove to be effective. The authors state they are interested in how sentiment can vary by demographic group and geographic location. These findings can vary drastically between news sources and blogs, reiterating the need for separate studies.
The overall study would have been clearer and more effective if it were broken into two studies, one on blogs and another on news sources. This would likely have allowed for a more detailed analysis of each topic.
It also seems that certain dimensions, such as politics, may receive more negative sentiment than sports. Additionally, the time period may have a significant impact on sentiment. For instance, election years will generate more emotion than non-election years; other examples include the Super Bowl or World Series, when people are paying more attention to sports. In business, dips in the stock market or times when the market is doing very well will generate more emotion. Other events, such as the release of a new movie followed by reviews, can skew results in the media dimension. Location is another factor not included in the study that can affect sentiment. For example, conservative newspapers or blogs from more conservative areas will likely produce different sentiment than more liberal newspapers and blog posts from more liberal areas. Overall, although this research generates useful conclusions, many potential factors that could skew the data are not accounted for in the study.
Source:
Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs. Proceedings of the International Conference on Weblogs and Social Media (ICWSM).