OK, so I'm really excited about my new machine learning project and I wanna talk about it, and hopefully get some interest and help with parts of it, but mainly I just wanna talk about it. I've been working on it all morning.
As I mentioned earlier, it's about detecting semi-organized, decentralized campaigns targeting consumer products. (There's already been a lot of research on harassment targeting individuals, and studying that requires more ethical safeguards.)
Essentially, my project stems from my own personal hobby: my investment in a particular entertainment title that was targeted pre- and especially post-release by angry groups with various ideological positions.
Initially this project was a way for me to learn machine learning while engaging with something I enjoyed, but I've realized it could also be used by private businesses in their decision-making processes, if I can get sufficient data. To do that, I have to increase the scope substantially.
Essentially, since at least 2014, online discussion of mass media (*also politics, etc., but I'm limiting my scope) has become dominated by what are frequently recognized as, but hard to prove to be, semi-organized or limited communities (I don't have the best language for this).
This usually takes the form of hostility toward the media property (as seen in elements of GG, the Last Jedi backlash, the TLOU2 backlash, and I believe a few other controversies I'm forgetting).
Sometimes it is positive toward a media property and hostile toward its critics (K-Pop stans, sometimes), and sometimes both coexist, although in all the examples I can think of, the hostility flowed overwhelmingly in one direction or the other.
Yesterday, I realized something I had been overlooking: there is a massive gap in the discourse trying to explain these kinds of reactions. Right now there are basically two explanations: "bots" and "overwhelming public opinion".
The "bots" claim is transparently wrong; having read through a great deal of the tens of thousands of reviews of The Last of Us Part II, I saw little to no evidence that actual automation, scripting, or text generation was being used.
This makes sense, as this is a developing phenomenon; if someone is *trying* to create controversy, they need it to be recognized as such, and journalists these days know how to spot a "bot" (and platforms have *always* been good at removing them; see spam filters in the '90s!).
You are unlikely to get a headline like "Massive Backlash Against Deadpool 3" even on a clickbait website if you run the campaign by having a bunch of accounts saying the same thing, especially since reporters look for viral quotes (I've been harvested by journalists this way myself).
So mostly these backlashes aren't bots (although they may be sockpuppets; more on that later).

On the other side of things are the people who want to take these reactions at face value.
Some assume that there really are that many people Big Mad on the Internet, but this claim has to deal with the fact that *many*, and I'm not claiming all, but many, of these cases produced minimal financial fallout or impact on the product being criticized.
Star Wars: The Last Jedi and The Last of Us Part II are both on the all-time top sales lists for their respective platforms, but the Internet reaction would suggest they should have been abysmal financial failures.
Notably, in the case of The Last Jedi, I can't prove this, but it seems pretty clear that the creative team was changed and the next product in the series was reworked to appease the Internet anyway, and that was, objectively, a mistake.
TLJ (which I dislike, for the record, to show I'm at least slightly unbiased) is the #14 grossing film of all time. Rise of Skywalker made money, but it grossed about 300 million dollars less than TLJ, a gap a bit less than its whole budget. The backlash wasn't a real business concern.
So, I have a theory, which poses a machine learning problem, which poses an opportunity. The theory is one of those things We All Know but are still stuck at (citation needed) on.
Insular online communities of real individuals with real opinions, who are nevertheless not a major factor in the financial success or failure of a product, are sometimes, but not always, behind these outrage campaigns against particular popular media properties.
Again, I'm limiting this to popular media. I'm not getting into the "cancel culture" discourse, and I have some examples, which I'd rather keep out of my main thread but can get into in replies, of cases where I think online outrage was genuine; I also want to test my algorithms against those.
Anyway, so in cases like GG, TLOU2, TLJ, and the campaign in favor of a certain 4 hour HBO special, groups of individuals on Twitter, image boards, or other forms of organization communicate about their displeasure with or desire for (thing).
In some/many cases they are also driven by YouTube celebrities and other influencers (I suspect something similar happens with aggressive K-pop stans, but I'm too Millennial to get that).
A large number of people by normal people standards, but a very small number by "giant corporation bottom line" standards, get mad and launch attacks against (thing).
(I want to note that my inclusion of GG isn't meant to imply it's exactly like this; it mainly targeted individuals and a few small companies, but it is worth noting how many companies decided initially to pander to them.)
If we had a way of estimating the probability that a given post originated in an insular community with some sort of marching orders, companies could use that algorithm to determine who their real customers are and, ideally, avoid missteps like catering to GG or the TLJ haters.
(Although I'm framing this in business logic, this is a social justice issue: we want a way to stop corporations catering to these people.)
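To make that concrete, here's a toy sketch of what scoring a post's probability of insular-community origin could look like. Everything in it is an assumption for illustration: the marker phrases, the weights, and the bias are all made up; a real version would learn all of them from labeled data rather than hand-picking them.

```python
import math

# HYPOTHETICAL marker features: phrases that, in this sketch, circulate
# inside a community but rarely appear in organic posts. The weights are
# invented log-odds contributions; a trained model would estimate them.
MARKER_WEIGHTS = {
    "neil drukmann": 2.5,       # a plausible deliberate misspelling (invented)
    "get woke go broke": 2.5,   # a shared talking point
    "sjw": 1.5,
}
BIAS = -2.0  # baseline log-odds: most posts are assumed organic

def insularity_probability(post: str) -> float:
    """Score a post with a hand-weighted logistic model: sum the weights
    of any markers present, add the bias, and squash through a sigmoid."""
    text = post.lower()
    score = BIAS + sum(w for marker, w in MARKER_WEIGHTS.items() if marker in text)
    return 1 / (1 + math.exp(-score))

# Usage: an unmarked post scores low, a marker-heavy post scores high.
print(round(insularity_probability("I just found the pacing slow."), 3))
print(round(insularity_probability("Typical SJW writing, get woke go broke"), 3))
```

A real pipeline would also tokenize properly (the substring check here would happily match "sjw" inside another word), but the shape of the idea, markers in, probability out, is the same.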
Anyway, I came to this realization when I was going through TLOU2 reviews and trying to algorithmically prove that the critics were bigoted (which is surprisingly difficult), and I realized that wasn't the thing.
The thing was that they were saying the same things, many of which are untrue, and they were saying the same things the same *way*, and many of those things stand out. I'm still working on my data pipeline so I can actually show this,
but things like the choice to misspell Neil Druckmann's name, or to misgender Abby: these aren't just dick moves or bigotry, they're evidence that these people talk to each other, or at least listen to the same pundits.
And my argument that, in business logic, these kinds of reviews should be disregarded is pretty strong, given the financial performance of TLOU2 and TLJ and what I anticipate to be the epic failure of that one HBO special with the Supermans.
Anyway, I'm interested in discussion, and I'm also interested in anyone suggesting datasets that somehow have clearly identifiable insular communities (that's the term I'm using for now) *mixed with organic posts*.
What I aim to do is classify posts (first manually, as a training dataset, then algorithmically) as pro-thing or anti-thing, but also as organic, insular-community, or bot; further analysis can then be run on only the organic pro-thing and anti-thing posts (and the trends discussed by insular communities and bots can also be mapped and quantified).
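The two-axis labeling scheme above (stance × origin) can be sketched as a data model plus the downstream filter step. The class and field names here are my own placeholders, not anything from an existing library.

```python
from dataclasses import dataclass
from typing import List, Literal

# Two independent label axes, as described: what the post argues,
# and where it (probably) came from.
Stance = Literal["pro", "anti"]
Origin = Literal["organic", "insular", "bot"]

@dataclass
class LabeledPost:
    text: str
    stance: Stance   # pro-thing / anti-thing
    origin: Origin   # organic / insular community / bot

def organic_only(posts: List[LabeledPost]) -> List[LabeledPost]:
    """Filter step: downstream pro/anti analysis runs only on organic
    posts; insular and bot posts get mapped and quantified separately."""
    return [p for p in posts if p.origin == "organic"]

# Usage: a hand-labeled toy batch, as in the manual training phase.
posts = [
    LabeledPost("Loved every minute of it", "pro", "organic"),
    LabeledPost("Same talking point, same exact wording", "anti", "insular"),
    LabeledPost("Mixed feelings, the ending dragged", "anti", "organic"),
]
print(len(organic_only(posts)))  # → 2
```

Keeping stance and origin as separate labels is the point: it lets you report "organic sentiment" and "insular-community sentiment" as distinct numbers instead of one muddled aggregate.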
If anyone has an *intact* early GG dataset from Twitter, including both GG and anti-GG posts, that would be really, really great, because I think it's the best example of that.
Failing that, I'm gonna have to acquire TLJ reviews (because the fact that TLJ was trolled is uncontroversial now, so I'll get less pushback using it than TLOU2).
You can follow @BootlegGirl.