I think we've pretty much established that dictionary methods are not good at classifying the sentiment of *short* docs *individually*

But that's not the same as validating dictionary methods on *many* docs together as an *aggregate*, and I'd like to see more work on that

1/
To illustrate how these are not the same, I'd like to briefly discuss the @hedonometer project by @compstorylab

2/
They calculate sentiment as the dictionary-weighted average over many tweets, *not* the average of the weighted averages of individual tweets

These are mathematically and conceptually different (a quick sketch of the difference is below). I think the differences warrant further investigation for at least 2 reasons

3/
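A minimal sketch of the difference, assuming a toy three-word lexicon with made-up weights (the real labMT lexicon is far larger); this is not the hedonometer's code, just an illustration of why the two aggregates can diverge:

```python
from collections import Counter

# Hypothetical word weights; real lexicons like labMT have ~10k scored words
scores = {"happy": 8.3, "sad": 2.4, "okay": 5.4}

tweets = [
    ["happy", "happy", "okay"],
    ["sad"],
]

def tweet_avg(tokens):
    """Weighted average sentiment of a single tweet (only scored words count)."""
    scored = [scores[w] for w in tokens if w in scores]
    return sum(scored) / len(scored) if scored else None

# (a) Corpus-level aggregate: pool word frequencies across ALL tweets,
#     then take one frequency-weighted average over the pooled counts
counts = Counter(w for t in tweets for w in t if w in scores)
corpus_score = sum(scores[w] * n for w, n in counts.items()) / sum(counts.values())

# (b) Average of per-tweet averages: score each tweet, then average the scores
per_tweet = [s for s in (tweet_avg(t) for t in tweets) if s is not None]
mean_of_tweets = sum(per_tweet) / len(per_tweet)

print(corpus_score)    # (8.3*2 + 5.4 + 2.4) / 4 = 6.1
print(mean_of_tweets)  # ((8.3+8.3+5.4)/3 + 2.4) / 2 ≈ 4.87
```

In (a) every scored *word* counts equally, so longer tweets contribute more; in (b) every *tweet* counts equally regardless of length. The two can give noticeably different numbers for the same corpus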
1) In their "Geography of Happiness" paper, they find reasonable correlations between simple dictionary sentiment over the tweets from entire US states and other state-level measures of well-being
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0064417

4/
2) For a decade now, applying this sentiment analysis to 10% of all English tweets has consistently identified reasonable and interpretable spikes and trends in expressed emotion

5/
Please please please do not come into my mentions talking about Twitter as a non-representative sample

That's not my point

My point is that there's enough evidence here to warrant further investigating how well dictionary methods perform over large aggregate corpora of text

6/
So I would like to see less about how well dictionary methods do at *classification*, and more about how well they do as *continuous measures* across many texts / large volumes of text

This requires very different annotation and validation tasks and there's a lot of room for innovation

7/7