I think we've pretty much established that dictionary methods are not good at classifying the sentiment of *short* docs *individually*

But that's not the same as validating dictionary methods on *many* docs together as an *aggregate*, and I'd like to see more work on that

To illustrate how these are not the same, I'd like to briefly discuss the @hedonometer project by @compstorylab

They calculate sentiment as the dictionary-weighted average over many tweets, *not* the average of the weighted averages of individual tweets
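
A quick toy sketch of the distinction, in case it helps. To be clear, the lexicon scores, the tweets, and the function names here are all made up for illustration; this is not the hedonometer's actual word list or code, just the two kinds of averaging side by side:

```python
from collections import Counter

# Hypothetical toy lexicon (word -> score). Values are invented for
# illustration and are NOT the real hedonometer scores.
LEXICON = {"love": 8.0, "happy": 8.0, "hate": 2.0}


def pooled_average(tweets, lexicon):
    """Weighted average over all scored words in the whole corpus."""
    counts = Counter(w for tokens in tweets for w in tokens if w in lexicon)
    total = sum(counts.values())
    return sum(lexicon[w] * n for w, n in counts.items()) / total


def mean_of_tweet_averages(tweets, lexicon):
    """Score each tweet on its own, then average the per-tweet scores."""
    per_tweet = []
    for tokens in tweets:
        scored = [lexicon[w] for w in tokens if w in lexicon]
        if scored:  # skip tweets with no dictionary words
            per_tweet.append(sum(scored) / len(scored))
    return sum(per_tweet) / len(per_tweet)


# Two toy "tweets": one short and negative, one longer and positive.
tweets = [
    ["hate"],
    ["love", "love", "love", "happy"],
]

pooled = pooled_average(tweets, LEXICON)
mean_of_means = mean_of_tweet_averages(tweets, LEXICON)
print(pooled, mean_of_means)
```

The pooled version lets every scored word count equally, so longer tweets contribute more; the per-tweet version lets every tweet count equally regardless of length. On this toy input the two disagree, which is the whole point.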

These are mathematically and conceptually different. I think the differences warrant further investigation for at least 2 reasons

1) In their "Geography of Happiness" paper, they find reasonable correlations between simple dictionary sentiment over the tweets from entire US states and other state-level measures of well-being

2) For a decade now, applying this sentiment analysis to 10% of all English tweets has consistently identified reasonable and interpretable spikes and trends in expressed emotion

Please please please do not come into my mentions talking about Twitter as a non-representative sample

That's not my point

My point is that there's enough evidence here to warrant further investigating how well dictionary methods perform over large aggregate corpora of text

So I would like to see less about how well dictionary methods do on *classification*, and more on how well they do as *continuous measures* across many texts or large amounts of text

This requires very different annotation and validation tasks and there's a lot of room for innovation

