๐—ฆ๐˜๐—ผ๐—ฝ ๐—ช๐—ผ๐—ฟ๐—ฑ๐˜€ & ๐—ฆ๐—˜๐—ข

So, @jroakes has had a reaction to a post by
@semrush's Connor Lahey

https://twitter.com/jroakes/status/1358059103667511298

So ... I have some questions for the #SEO community :D
(There are several - please answer each one :D)

>>>
1) 𝗗𝗼𝗲𝘀 𝗚𝗼𝗼𝗴𝗹𝗲 𝘂𝘀𝗲 𝗦𝘁𝗼𝗽 𝗪𝗼𝗿𝗱𝘀?
2) 𝗗𝗼𝗲𝘀 𝘁𝗵𝗲 𝘂𝘀𝗲 𝗼𝗳 𝗦𝘁𝗼𝗽 𝗪𝗼𝗿𝗱𝘀 𝗶𝗺𝗽𝗮𝗰𝘁 𝘆𝗼𝘂𝗿 𝗿𝗮𝗻𝗸𝗶𝗻𝗴?
3) 𝗦𝗵𝗼𝘂𝗹𝗱 𝘆𝗼𝘂 𝗮𝗹𝘁𝗲𝗿 𝘆𝗼𝘂𝗿 𝗰𝗼𝗻𝘁𝗲𝗻𝘁 𝘁𝗼 𝗿𝗲𝗱𝘂𝗰𝗲/𝗮𝘃𝗼𝗶𝗱 𝗦𝘁𝗼𝗽 𝗪𝗼𝗿𝗱𝘀?
4) 𝗗𝗼 𝗦𝘁𝗼𝗽 𝗪𝗼𝗿𝗱𝘀 𝗺𝗮𝗸𝗲 𝗮 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘁𝗼 𝗦𝗲𝗮𝗿𝗰𝗵 𝗥𝗲𝘀𝘂𝗹𝘁𝘀?
Okay - so you should have (hopefully) answered 4 questions:

1) Does Google use Stop Words?

2) Does the use of Stop Words impact your ranking?

3) Should you alter your content to reduce/avoid Stop Words?

4) Do Stop Words make a difference to Search Results?

>>>
If you've not answered the above questions
(all 4 of them),
please do :D

>>>
Right ... so ... #StopWords ... are a carry-over from #InformationRetrieval.

Remember - we are going back in time to before home computers.
We're talking about machines that had limited storage, memory and processing power.

>>>
Processing text often required breaking it down into pieces.

Part of this was a result of the processing (additional data, such as Part of Speech (PoS) tags, would be attached),
and thus things took up more memory.

>>>
But, they found a way to reduce the "costs" and retain the accuracy - Stop Words.

By removing a number of highly common words, they could reduce the storage, and speed up queries too!
Over time, they identified other terms that weren't contributing to selection, and expanded the lists.
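
To make that concrete, here's a minimal Python sketch of the idea (the stop list is a toy one for illustration, not any real system's):

# Classic stop-word filtering, as early IR systems did it.
# The stop list here is a toy example, not any particular system's.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on"}

def strip_stop_words(text: str) -> list[str]:
    # Tokenise naively on whitespace, then drop the stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

# Both the indexed text and the query shrink, so the index is smaller
# and matching is faster - hopefully at (next to) no cost in accuracy.
print(strip_stop_words("The cat sat on the mat"))  # ['cat', 'sat', 'mat']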

>>>
So, when doing a book search, or looking up papers etc.,
words such as "the" and "and" etc. were irrelevant.
So you could strip them from the text/query, and lose (next to?) nothing.

For Natural Language Processing, this could provide significant resource gains too!

>>>
Classification of texts could be done by stripping non-informative words/tuples/triplets, and looking at what was left. Less to process, little informational loss.
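
As a hedged sketch of that style of classification (the topics and keyword "rules" are invented for illustration):

# Early-style text classification: score each document against
# hand-crafted keyword sets - stripped stop words simply never match,
# so there's less to process with little informational loss.
TOPIC_KEYWORDS = {
    "cooking": {"recipe", "oven", "flour", "bake"},
    "search":  {"ranking", "crawl", "index", "query"},
}

def classify(tokens: list[str]) -> str:
    # Score each topic by how many of its keywords appear in the text.
    scores = {topic: len(kws & set(tokens)) for topic, kws in TOPIC_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(classify(["bake", "the", "cake", "in", "the", "oven"]))  # 'cooking'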

At the time, they were often using hand-crafted rules (100% human, or later, machine suggested).

>>>
Fast forward ... machines have greater resources - but there's been an explosion in data.
Universities have access to far larger corpora, and NLP has expanded to look at numerous other tasks.

New processes/approaches have been developed ... we still use Stop Words though.

>>>
We also still tend to approach things with the "bag of words" view (sequence etc. isn't that relevant).

For things like grammar/syntax, text was broken down into tokens: words, punctuation and additions (things like "'s" were separate! (so "she's" was "she" + "'s"))

>>>
And then they'd be utilised as pairs/triples (so they'd look at the token to the left/right (or two tokens etc.)).
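
A rough sketch of that tokenise-then-pair process (the regex is a gross simplification, of course):

import re

# Tokenisation as described above: words, punctuation and "additions"
# like 's each become their own token ...
def tokenise(text: str) -> list[str]:
    return re.findall(r"'s|\w+|[^\w\s]", text)

# ... and are then read off as pairs/triples of neighbouring tokens.
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenise("She's on the mat.")
print(tokens)             # ['She', "'s", 'on', 'the', 'mat', '.']
print(ngrams(tokens, 2))  # [('She', "'s"), ("'s", 'on'), ('on', 'the'), ...]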

Fast forward a little more ... we're now looking at automation of extraction!
We want to use the machines to build ontologies, identify senses etc.

>>>
And that's when things started to change.

At a similar time, developments in psycholinguistics etc. had also occurred, and again, the then-current "approaches" were found a little lacking.

The results needed to be better (people wanted more than 80% accuracy!)

>>>
The solution?
Well, there were multiple.

We had associations, clustering, larger frames (using multiple tokens in a sequence) - greater resources, new minds, new demands ... things were changing!

But one of the biggest changes was ... not singular words!

>>>
The goal was accuracy.
Didn't matter if it was for presenting candidates for translations,
or identifying entities and their relations,
or allocating a text to a domain/topic,
or labelling words as positive/negative,
there was demand for accuracy and speed!

>>>
And what did we discover?
What was one of the biggest alterations that contributed?

Multi-words.

Meaning is often more than the sum of its parts.

Things like negators ("not" etc.) were introducing errors.
Things like distanced modifiers (adjectives after nouns) too.
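
You can see the negator problem in a couple of lines (toy sentences, obviously):

from collections import Counter

# A unigram "bag of words" can't tell these two apart ...
a = "the food was good , not bad".split()
b = "the food was bad , not good".split()
print(Counter(a) == Counter(b))  # True - identical as bags of words!

# ... while pairs (bigrams) keep the negation attached to its target.
bigrams_a = set(zip(a, a[1:]))
bigrams_b = set(zip(b, b[1:]))
print(("not", "bad") in bigrams_a, ("not", "bad") in bigrams_b)  # True False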

>>>
And for comprehension?
Sense evaluation?
Identification of relations?

Why - it was realised that using multiple words in sequence was more effective.

Windows/Framing took a strong lead.
Select your seed words.
Find them in the texts.
Take the 3 words before/after.

>>>
Search for the first 3 words, followed by the second 3 words, with one/two/three words between.
You have a bunch of potential synonyms, related words, words of similar usage (if not meaning) etc. etc. etc.
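
Something like this, as a hedged sketch (the corpus and seed words are invented):

# Windowing/framing: find a seed word, take the k tokens either side,
# then compare frames across words to surface similar-usage candidates.
def frames(tokens: list[str], seed: str, k: int = 3) -> list[tuple]:
    out = []
    for i, tok in enumerate(tokens):
        if tok == seed:
            out.append((tuple(tokens[max(0, i - k):i]),   # k words before
                        tuple(tokens[i + 1:i + 1 + k])))  # k words after
    return out

corpus = "see the quick dog run across the field and see the quick cat run across the field".split()
print(frames(corpus, "dog"))  # [(('see', 'the', 'quick'), ('run', 'across', 'the'))]
print(frames(corpus, "cat"))  # the same frame - "dog"/"cat" are candidates for similar usage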

And what words are often found with the things that were sought after?

>>>
Why ... many of the words on what is now a prolific number of Stop Word Lists (SWLs) :D

Now ... don't rush to the conclusion that SWLs are useless.
They aren't!
They are still more than useful in various applications.

Also remember, diminishing returns!
>>>
As SEOs, programmers etc.,
we should all be familiar with the concept that as you work down a prioritised list,
each additional step has less impact
(yes - so long as it's done in descending order! (smart asses!))

So, you can get good results with small efforts.
>>>
The more effort you exert, the more steps you take, the smaller the gains :(

<SideNote: Studies were done on URLs for Spam Identification - with approx. 78% accuracy! From the URL alone! No parsing of the content! Now consider that ratio of info/effort for Information Retrieval!>
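
(For flavour, the sort of cheap URL-only features such studies leaned on might look like this - the exact features are my own invention, not the studies':)

from urllib.parse import urlparse

# Features computed from the URL alone - no fetching, no content parsing.
# These would then feed a classifier; everything here is illustrative.
def url_features(url: str) -> dict:
    parsed = urlparse(url)
    return {
        "length": len(url),
        "digits": sum(c.isdigit() for c in url),
        "hyphens": parsed.netloc.count("-"),
        "path_depth": parsed.path.count("/"),
    }

print(url_features("http://cheap-pills-4u-online.example/buy/now/cheap/pills"))
# {'length': 56, 'digits': 1, 'hyphens': 3, 'path_depth': 4}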

>>>
Why am I saying this?

Because I don't know if G use a SWL.

I assume they do - but not a single list (ignoring languages!).
Instead, they likely have various lists for different usages (inc. spam candidate flagging :D)

I think they may use them for classification etc.
>>>
But I'd hazard that they do that for initial efforts,
and then do deeper/broader later?

The only way we'd know is by asking ... and I think the answers are going to be messy :D

But, I seriously doubt that SW are anything we should be thinking about!

>>>
As explained above - many of the words in common SWLs are informative (to some degree, in various situations).
You'd struggle to extract entity relations without things like
A
Is
Of
etc. (it's doable, but you'd find you'd have far more low-quality candidates etc.)
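
A toy illustration of how much work "is" and "a" do in relation extraction (the pattern and sentences are invented; real systems use far richer patterns):

import re

# A simple lexico-syntactic pattern (in the spirit of Hearst patterns)
# leaning entirely on "is" + "a/an" to find is-a relations.
IS_A = re.compile(r"(\w+) is an? (\w+)")

text = "A ferret is a mammal. Python is a language. Paris is a city."
for x, y in IS_A.findall(text):
    print(f"{x} --is_a--> {y}")
# ferret --is_a--> mammal
# Python --is_a--> language
# Paris --is_a--> city
# Strip "is"/"a" as stop words, and the pattern has nothing to anchor on.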

>>>
So ... personally ... Forget Stop Words!

You'd likely do far better optimising your Meta Descriptions (and we all know how often those are overwritten!).

Instead, focus on producing natural, informative and useful content, including titles/URLs!