Stop Words & SEO
So, @jroakes has had a reaction to a post by
@semrush's Connor Lahey
https://twitter.com/jroakes/status/1358059103667511298
So ... I have some questions for the #SEO community :D
(There are several, please answer each one :D
>>>
1) Does Google use Stop Words?
2) Does the use of Stop Words impact your ranking?
3) Should you alter your content to reduce/avoid Stop Words?
4) Do Stop Words make a difference to Search Results?
Okay - so you should have (hopefully) answered 4 questions:
1) Does Google use Stop Words?
2) Does the use of Stop Words impact your ranking?
3) Should you alter your content to reduce/avoid Stop Words?
4) Do Stop Words make a difference to Search Results?
>>>
If you've not answered the above questions
(all 4 of them),
please do :D
>>>
Right ... so ... #StopWords ... are a carry-over from #InformationRetrieval.
Remember - we are going back in time to before home-computers.
We're talking about machines that had limited storage, memory and processing power.
>>>
Processing text often required breaking it down into pieces.
Part of this was a result of processing (additional data, such as Part of Speech (PoS) tags, would be attached),
and thus things took up more memory.
>>>
But they found a way to reduce the "costs" yet retain the accuracy - Stop Words.
By removing a number of highly common words, they could reduce the storage, and speed up queries too!
Over time, they identified other terms that weren't contributory to selection, and expanded the lists.
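To make that concrete - a minimal sketch of the classic filtering step (the stop list here is just illustrative, not any engine's actual list):

```python
# Classic stop-word filtering: drop highly common, low-information words.
# The stop list below is a tiny illustrative sample, not a real system's list.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

def strip_stop_words(text: str) -> list[str]:
    """Tokenise naively on whitespace, then drop any stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(strip_stop_words("The history of the printing press and its impact"))
# -> ['history', 'printing', 'press', 'its', 'impact']
```

Less to store, less to match at query time.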
>>>
So, when doing a book search, or looking up papers etc.,
words such as "the" and "and" etc. were irrelevant.
So you could strip them from the text/query and lose (next to?) nothing.
For Natural Language Processing, this could provide significant gains for resources too!
>>>
Classification of texts could be done by stripping non-informative words/pairs/triples, and looking at what was left. Less to process, little informational loss.
At the time, they were often using hand-crafted rules (100% human, or later, machine-suggested).
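Something like this (a toy, with hand-picked keyword sets standing in for those hand-crafted rules):

```python
# Toy bag-of-words classifier in the early-IR spirit: strip stop words,
# then score each topic by keyword overlap. The keyword sets are
# hand-crafted, exactly as the rules of that era were.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "in", "on", "to", "you", "this"}

TOPIC_KEYWORDS = {
    "cooking": {"recipe", "oven", "flour", "bake"},
    "seo":     {"ranking", "crawl", "index", "query"},
}

def classify(text: str) -> str:
    tokens = {t for t in text.lower().split() if t not in STOP_WORDS}
    # Pick the topic whose keyword set overlaps most with the remaining tokens.
    return max(TOPIC_KEYWORDS, key=lambda topic: len(tokens & TOPIC_KEYWORDS[topic]))

print(classify("This recipe asks you to bake the flour in the oven"))  # -> cooking
```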
>>>
Fast forward ... machines have greater resources - but there's been an explosion in data.
Universities have access to far larger corpora, and NLP has expanded to look at numerous other tasks.
New processes/approaches have been developed ... we still use Stop Words though.
>>>
We also still tend to approach things with the "bag of words" view (sequence etc. isn't that relevant).
For things like grammar/syntax, text was broken down into tokens: words, punctuation and additions (things like "'s" were separate, so "she's" became "she" + "'s").
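Roughly like this (the regex is a crude stand-in for a real tokeniser):

```python
import re

# Crude tokeniser: words, clitics like "'s", and punctuation become
# separate tokens; then adjacent tokens are grouped into pairs/triples.
def tokenise(text: str) -> list[str]:
    return re.findall(r"\w+|'[a-z]+|[^\w\s]", text.lower())

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenise("She's reading the paper.")
print(tokens)             # -> ['she', "'s", 'reading', 'the', 'paper', '.']
print(ngrams(tokens, 2))  # -> the pairs used for left/right context
```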
>>>
And then they'd be utilised as pairs/triples (looking at the token to the left/right, or the two tokens either side, etc.).
Fast forward a little more ... we're now looking at automation of extraction!
We want to use the machines to build ontologies, identify senses etc.
>>>
And that's when things started to change.
At a similar time, developments in psycholinguistics etc. had also occurred, and again, the current "approaches" were found a little lacking.
The results needed to be better (people want more than 80% accuracy!)
>>>
The solution?
Well, there were multiple.
We had associations, clustering, larger frames (using multiple tokens in a sequence) - greater resources, new minds, new demands ... things were changing!
But one of the biggest changes was ... not singular words!
>>>
The goal was accuracy.
Didn't matter if it was for presenting candidates for translations,
or identifying entities and their relations,
or allocating a text to a domain/topic,
or labelling words as positive/negative,
there was demand for accuracy and speed!
>>>
And what did we discover?
What was one of the biggest alterations that contributed?
Multi-words.
Meaning is often more than the sum of its parts.
Things like negators ("not" etc.) were introducing errors.
Things like distanced modifiers (adjectives after nouns) too.
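A tiny example of the negator problem - if "not" is on your stop list, the two sentences below become identical (lexicon and scoring invented for the sketch):

```python
# Toy sentiment scorer: a negator flips the polarity of the next word.
# Strip "not" as a stop word first, and the negative sentence scores positive.
LEXICON = {"good": 1, "great": 1, "bad": -1, "awful": -1}

def score(tokens: list[str]) -> int:
    total, negate = 0, False
    for t in tokens:
        if t == "not":
            negate = True          # remember the negator for the next token
            continue
        s = LEXICON.get(t, 0)
        total += -s if negate else s
        negate = False
    return total

print(score("the service was not good".split()))  # -> -1 (correctly negative)
print(score("the service was good".split()))      # -> +1
```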
>>>
And for comprehension?
Sense evaluation?
Identification of relations?
Why - it was realised that using multiple words in sequence was more effective.
Windows/Framing took a strong lead.
Select your seed words.
Find them in the texts.
Take the 3 words before/after.
>>>
Search for the first 3 words, followed by the second 3 words, with one/two/three words between.
You have a bunch of potential synonyms, related words, words of similar usage (if not meaning) etc. etc. etc.
And what words are often found with the things that were sought after?
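Here's a rough sketch of that windowing idea (window size and text invented for illustration) - and note which word tops the counts:

```python
from collections import Counter

# Window/frame extraction: find a seed word, keep the 3 tokens either side,
# and count the co-occurring words. Real systems add PoS filters, weights etc.
def window_counts(tokens: list[str], seed: str, size: int = 3) -> Counter:
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == seed:
            lo, hi = max(0, i - size), i + size + 1
            counts.update(t for t in tokens[lo:hi] if t != seed)
    return counts

text = "the quick dog chased the cat and the slow dog watched the cat sleep"
print(window_counts(text.split(), "dog").most_common(3))
# -> [('the', 4), ('cat', 2), ...] - a "stop word" leads the list!
```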
>>>
Why ... many of the words on what is now a prolific number of Stop Word Lists :D
Now ... don't rush to the conclusion that SWLs are useless.
They aren't!
They are still more than useful in various applications.
Also remember, diminishing returns!
>>>
As SEOs, programmers etc.,
we should all be familiar with the concept that as you work down a prioritised list,
each additional step has less impact
(yes - so long as it's done in descending order! (smart asses!))
So, you can get good results with small efforts.
>>>
The more effort you exert, the more steps you take, the smaller the gains :(
<SideNote: Studies were done on URLs for Spam Identification - with approx. 78% accuracy! From the URL alone! No parsing of the content! Now consider that ratio of info/effort to Information Retrieval!>
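A toy of that idea (every token, feature and weight below is invented for the sketch, not taken from the actual studies):

```python
import re

# Purely illustrative URL-only spam scoring: keyword hits, digit-laden
# tokens and hyphen-stuffing. The real studies trained classifiers over
# many such features; this list and these weights are made up.
SPAMMY_TOKENS = {"free", "cheap", "casino", "pills", "win"}

def url_spam_score(url: str) -> int:
    tokens = re.split(r"[/\-._?=&]+", url.lower())
    score = sum(t in SPAMMY_TOKENS for t in tokens)
    score += sum(any(c.isdigit() for c in t) for t in tokens)  # digit-laden tokens
    score += url.count("-") // 3                               # hyphen stuffing
    return score

print(url_spam_score("https://example.com/blog/stop-words-and-seo"))   # low
print(url_spam_score("http://win-free-pills-casino4u.biz/cheap?x=1"))  # high
```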
>>>
Why am I saying this?
Because I don't know if G use a SWL.
I assume they do - but not a single list (ignoring languages!).
Instead, they likely have various lists for different usages (inc. spam candidate flagging :D)
I think they may use them for classification etc.
>>>
But I'd hazard that they do that for initial efforts,
and then go deeper/broader later?
The only way we'd know is by asking ... and I think the answers are going to be messy :D
But, I seriously doubt that SW are anything we should be thinking about!
>>>
As explained above - many of the words in common SWLs are informative (to some degree, in various situations).
You'd struggle to extract entity relations without things like
A
Is
Of
etc. (it's doable, but you'd find far more low-quality candidates etc. - a quick sketch follows)
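The simplest "is a" pattern (Hearst-style) hinges entirely on words that sit on most stop lists:

```python
import re

# The most basic hypernym pattern: "<X> is a/an <Y>". Strip "is"/"a"/"an"
# as stop words first, and there's nothing left to anchor the pattern on.
IS_A = re.compile(r"\b(\w+) is an? (\w+)\b", re.IGNORECASE)

text = "A cheetah is a cat. Python is a language. An oboe is an instrument."
print(IS_A.findall(text))
# -> [('cheetah', 'cat'), ('Python', 'language'), ('oboe', 'instrument')]
```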
>>>
So ... personally ... Forget Stop Words!
You'd likely do far better optimising your Meta Descriptions (and we all know how often those are overwritten!).
Instead, focus on producing natural, informative and useful content, including titles/URLs!