TF-IDF Text Analysis in SEO

Sara Taher
Clusters created by a Python Script using TF-IDF

I recently attended a talk by Mike King called "The Mechanics of Modern Search" and decided to write a blog post (or a series of posts, not sure yet) that simplifies some of the concepts in this space. Here's one of the slides from his deck.

Sooo What's going on Sara? Let's start with the simplest topic, TF-IDF.

What is TF-IDF?

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a method to numerically represent the importance of a word within a document relative to a collection of documents: a word scores high when it appears often in one document but rarely in the others.

💡
TF-IDF measures the importance of a word within a document relative to a collection of documents. You can use TF-IDF to figure out the most important words in a document.
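To make the definition concrete, here's a minimal sketch of the calculation in plain Python. It assumes one common variant of the formula (term frequency = count divided by document length, inverse document frequency = log of total documents divided by documents containing the word); libraries often use slightly different smoothing. The three sample "documents" are made up for illustration.

```python
import math
from collections import Counter

# Three tiny made-up "documents" (illustrative only).
docs = [
    "the pizza place has the best pizza",
    "the pasta place has fresh pasta",
    "the delivery was fast",
]


def tf_idf(docs):
    """Return one {word: score} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # TF (share of the document) times IDF (rarity across documents).
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
# Print the words of the first document, most important first.
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In this sketch "pizza" comes out on top for the first document, while "the" scores exactly zero because it appears in every document.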

Are there Benefits to Using TF-IDF for SEO?

TF-IDF is a text analysis technique. While it's not perfect and comes with limitations, there are still benefits to analyzing SERPs using this method. TF-IDF analysis for SEO can help you:

  • Identify important and relevant keywords beyond just basic keyword research
  • Understand which topics and terms the top-ranking pages for a given search query emphasize
  • Reduce the impact of common words (e.g., "the", "a")

This quote captures the value of using TF-IDF:

💡
"There is a fundamental difference between retrieving variations of the same keyword and retrieving apparently unrelated, yet relevant, terms." ~ CXL

How is TF-IDF used in SEO?

I asked ChatGPT and Claude to help me create a simple example to explain the concept:

Document Representation

    • TF-IDF converts each webpage into a set of numbers.
    • These numbers (the vectors) represent how important different keywords are in that document.
    • Example: A pizza restaurant's webpage might be represented as: [pizza: 0.8, cheese: 0.6, delivery: 0.7]
    • This means "pizza" is very important, "delivery" is quite important, and "cheese" is somewhat important on this page.
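As a rough sketch of what "a set of numbers" means here: each page becomes a list with one slot per word in a shared vocabulary, and words the page doesn't use get a zero. The weights are the illustrative ones from the example above, and the `vocabulary` list and `to_vector` helper are hypothetical names I made up for this sketch.

```python
# Illustrative weights from the pizza example, not real measurements.
pizza_page = {"pizza": 0.8, "cheese": 0.6, "delivery": 0.7}

# Hypothetical shared vocabulary for the whole collection of pages.
vocabulary = ["best", "cheese", "delivery", "pasta", "pizza"]


def to_vector(weights, vocabulary):
    """Return one number per vocabulary word; missing words get 0.0."""
    return [weights.get(word, 0.0) for word in vocabulary]


vector = to_vector(pizza_page, vocabulary)
print(vector)  # [0.0, 0.6, 0.7, 0.0, 0.8]
```

Because every page uses the same vocabulary order, the vectors can be compared position by position.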

Query Representation

    • Convert your query into a similar set of numbers (vectors).
    • Example: Searching for "best pizza delivery" might be represented as: [pizza: 1.0, delivery: 0.9, best: 0.3]
    • This shows "pizza" is most important in the query, followed by "delivery", then "best".

Matching Process

    • Now compare the numbers representing your query to the numbers representing each webpage (a vector comparison, commonly done with cosine similarity).
    • Webpages with similar numbers to your query are considered more relevant.
    • Example: A webpage about pizza delivery will have numbers similar to the "best pizza delivery" query, so it's likely to appear in the search results.
    • A webpage about pasta would have very different numbers, so it probably won't appear in these results.
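The matching step can be sketched with cosine similarity, a common way to compare two such vectors (the example above doesn't name a specific measure, so treat that choice as my assumption). The query and pizza page weights are the illustrative numbers from this section; the pasta page weights are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


query = {"pizza": 1.0, "delivery": 0.9, "best": 0.3}
pages = {
    "pizza delivery page": {"pizza": 0.8, "cheese": 0.6, "delivery": 0.7},
    # Made-up weights for a page about pasta.
    "pasta page": {"pasta": 0.9, "sauce": 0.6, "delivery": 0.2},
}

# Rank pages from most to least similar to the query.
ranked = sorted(
    pages,
    key=lambda name: cosine_similarity(query, pages[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(cosine_similarity(query, pages[name]), 2))
```

The pizza delivery page scores far higher than the pasta page, because it shares the query's two heaviest words.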

I have used TF-IDF to cluster keywords in my Python for Marketers Training. As the example above suggests, TF-IDF can also be used to analyze webpages in SERPs and figure out the most important keywords on those pages, beyond the simple variations of the main keyword. I will probably add a script for that very soon in the training.

TL;DR: What does that mean for SEOs

You may be scratching your head thinking, OK, what should I do with this information now? Here's how to apply this in your day-to-day SEO tasks:

  • Using simple Python scripts, you can input a list of keywords and cluster them. The results are not perfect, but I used this recently when I wanted to cluster 5k+ keywords. Here's an example from my course of clusters created by a TF-IDF Python script:
Clusters created by a Python Script using TF-IDF
  • You can also use TF-IDF to analyze the top-ranking pages in SERPs for the most important keywords. The output goes beyond simple variations of a keyword. Instead of the usual "healthy breakfast recipes", "best healthy breakfast recipes", "easy healthy breakfast ideas", etc., expect something like this:
    • "healthy breakfast recipes"
    • "high-protein breakfast options"
    • "vegan breakfast recipes"
    • "quick breakfast meals for busy mornings"
    • "gluten-free breakfast ideas"
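For the keyword clustering idea in the first bullet, here's a minimal sketch. To be clear, this is not the script from my course: it's a simplified greedy approach in plain Python, where each keyword joins the first cluster whose first keyword is similar enough. The `threshold` value, the smoothed IDF formula, and the sample keywords are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(keywords):
    """Treat every keyword as a tiny document and score its words."""
    tokenized = [kw.lower().split() for kw in keywords]
    n = len(tokenized)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    return [
        {
            # Smoothed IDF so that no word ends up with a zero weight.
            w: (c / len(tokens)) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, c in Counter(tokens).items()
        }
        for tokens in tokenized
    ]


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_keywords(keywords, threshold=0.3):
    """Greedy clustering: compare each keyword to every cluster's first member."""
    vectors = tfidf_vectors(keywords)
    clusters = []  # lists of keyword indexes; index 0 of each list is the seed
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # nothing similar enough: start a new cluster
    return [[keywords[i] for i in cluster] for cluster in clusters]


keywords = [
    "vegan breakfast recipes",
    "easy vegan breakfast",
    "running shoes for women",
    "best running shoes",
    "vegan breakfast ideas",
]
clusters = cluster_keywords(keywords)
for cluster in clusters:
    print(cluster)
```

With these sample keywords the sketch produces two groups, one about vegan breakfast and one about running shoes. The result depends on keyword order and on the threshold, which is one reason the output of this kind of clustering is never perfect.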

You can then use this information to update your content, beyond the basic keyword variations. That's probably my next python project! If you're a course member stay tuned.
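Until that script exists, here's a rough sketch of the SERP analysis idea. It assumes you already have the text of the top-ranking pages (no scraping shown); the URLs, page texts, tiny stop word list, and `top_terms` helper are all hypothetical. It scores single words and two-word phrases, and terms that appear on every page score zero, so the shared main keyword drops out and the distinctive terms surface.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to", "with"}  # tiny illustrative list


def terms(text):
    """Lowercase words plus two-word phrases, skipping stop words."""
    words = [
        w for w in re.findall(r"[a-z][a-z-]*", text.lower())
        if w not in STOP_WORDS
    ]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def top_terms(pages, top_n=3):
    """Return the top_n highest TF-IDF terms for each page."""
    term_lists = {url: terms(text) for url, text in pages.items()}
    n_pages = len(term_lists)
    df = Counter(t for ts in term_lists.values() for t in set(ts))
    result = {}
    for url, ts in term_lists.items():
        scores = {
            t: (c / len(ts)) * math.log(n_pages / df[t])
            for t, c in Counter(ts).items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        result[url] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return result


# Made-up page texts standing in for real top-ranking pages.
pages = {
    "example.com/a": "Healthy breakfast recipes with high-protein eggs and high-protein yogurt",
    "example.com/b": "Healthy breakfast recipes for vegan eaters: vegan oats and vegan smoothies",
    "example.com/c": "Healthy breakfast recipes that are gluten-free, with gluten-free pancakes",
}
results = top_terms(pages)
for url, best in results.items():
    print(url, best)
```

Each page's list starts with the term that sets it apart ("high-protein", "vegan", "gluten-free") rather than another variation of "healthy breakfast recipes".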

Should you just use a TF-IDF tool?

There are tools on the market right now that do this. Should you just sign up for one? My answer is no. While this information is valuable, relying on a TF-IDF tool limits your recommendations to what the tool offers and gives your copywriters the false impression that your content is complete.

Writing content suddenly becomes a checklist. The analysis is useful, but do not rely solely on it. This is just one aspect of content analysis and recommendations.

That's that for today folks! Hope you find this useful. Sorry for shamelessly plugging my Python training. Have a great rest of your day!
