A detailed analysis of ChatGPT search and Google’s performance across 62 queries, with scoring metrics and practical examples.
The emergence of ChatGPT search has led to many questions about the quality of the overall results compared to Google.
This is a difficult question to answer, and in today’s article, I will provide some insights into how to do just that.
Note that, as we understand it, the technology behind OpenAI’s search capability is called SearchGPT, but the actual product name is ChatGPT search.
In this article, we will use the name ChatGPT search.
What’s in this report
This report presents an analysis of 62 queries to assess the strengths and weaknesses of each platform.
Each response was meticulously fact-checked and evaluated for alignment with potential user intents.
The process, requiring about an hour per query, highlighted that “seemingly good” and “actually good” answers often differ.
Additionally, when Google provided an AI Overview, it was scored against ChatGPT search.
A combined score for the AI Overviews and the rest of Google’s SERP was also included.
Of the queries tested – two-thirds of which were informational – Google returned an AI Overview in 25 instances (40% of the time).
The queries analyzed fell into multiple categories:
The category percentages above add up to more than 100% because some queries fall into more than one classification.
For example, about 13% of the queries were considered informational and commercial.
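To make the arithmetic concrete, here is a minimal Python sketch of why multi-label shares can exceed 100%. The five queries and their labels are a made-up toy sample, not the study's data, and the function name `category_shares` is only illustrative. Each query is counted once for every category it belongs to, so the shares overlap.

```python
from collections import Counter

# Hypothetical toy sample, NOT the 62 queries from the study.
# Each query may carry more than one intent label.
labeled_queries = {
    "example query 1": {"informational"},
    "example query 2": {"informational", "commercial"},
    "example query 3": {"transactional", "local"},
    "example query 4": {"navigational"},
    "example query 5": {"informational", "commercial"},
}


def category_shares(queries):
    """Return each category's share of all queries, as a percentage."""
    counts = Counter(label for labels in queries.values() for label in labels)
    total = len(queries)
    return {label: 100 * n / total for label, n in counts.items()}


shares = category_shares(labeled_queries)
total_share = sum(shares.values())  # exceeds 100 whenever labels overlap

for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0f}%")
print(f"sum of shares: {total_share:.0f}%")
```

In this toy sample, two of the five queries carry both the informational and commercial labels, which is the same kind of overlap as the 13% figure mentioned above.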
Detailed information from SparkToro on the makeup of queries suggests a natural distribution of search queries as follows:
Metrics used in this study
I designed 62 queries to reflect diverse query intents, aiming to highlight each platform’s strengths and weaknesses.
Each response was scored across specific metrics to evaluate performance effectively.
- Errors: Did the response include incorrect information?
- Omissions: Was important information not in the response?
- Weaknesses: Were other aspects of the response considered weak but not scored as an error or omission?
- Fully addresses: Was the user’s query intent substantially addressed?
- Follow-up resources: Did the response provide suitable resources for follow-up research?
- Quality: An assessment by me of the overall quality of the response. This was done by weighing the other factors contained in this list.
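One way to keep these metrics organized per response is a simple record, sketched below in Python. The field names, the count-based error, omission, and weakness fields, and the numeric scales are all assumptions for illustration; the article does not specify how each metric was recorded. Note that quality is stored as a reviewer-assigned value rather than computed, since it was a judgment call and not a formula.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ResponseScore:
    """Scores for one platform's response to one query (hypothetical schema)."""
    platform: str
    query: str
    errors: int                 # assumed: count of incorrect statements
    omissions: int              # assumed: count of important missing points
    weaknesses: int             # assumed: count of other weak aspects
    fully_addresses: bool       # was the query intent substantially addressed?
    follow_up_resources: float  # assumed numeric rating
    quality: float              # assigned by the reviewer, not computed


def average_quality(scores, platform):
    """Mean reviewer-assigned quality for one platform, or None if no scores."""
    values = [s.quality for s in scores if s.platform == platform]
    return mean(values) if values else None


# Made-up records for illustration only; these are not results from the study.
scores = [
    ResponseScore("ChatGPT search", "example query A", 1, 0, 1, False, 3.0, 4.0),
    ResponseScore("ChatGPT search", "example query B", 0, 1, 0, True, 5.0, 6.0),
    ResponseScore("Google", "example query A", 0, 0, 1, True, 7.0, 7.0),
]

print(average_quality(scores, "ChatGPT search"))
print(average_quality(scores, "Google"))
```

Keeping the individual metrics alongside the quality score makes it possible to go back and see why a given response was rated the way it was.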
At the end of this article are the total scores for each platform across the 62 queries.
Competitive observations
When considering how different search platforms provide value, it’s important to understand the many aspects of the search experience. Here are some of those areas:
Advertising
Multiple reviewers note that ChatGPT search is ad-free and tout how much better this makes it than Google. That is certainly the case now, but it won’t stay that way.
Microsoft has $13 billion committed to OpenAI so far, and they want to make that money back (and then some).
In short, don’t expect ChatGPT search to remain ad-free. That will change significantly at some point.
An important note is that advertising works best on commercial queries.
As you will see later in this article, I scored Google’s performance on commercial queries significantly higher than ChatGPT search.
Understanding user intent
Google has been working on understanding user intent across nearly infinite scenarios since 2004 or earlier.
They’ve been collecting data based on all the user interactions within search and leveraging what they have seen with the Chrome browser since its launch in 2008.
This data has most likely been used to help train Google algorithms to understand user intent and brand authority on a per query basis.
For reference, as of November 2024, Statcounter pegs Chrome’s market share at 67.5%, Safari at 18.2%, and Edge at 4.8%.
This is a critical advantage for Google because understanding the user intent of a query is what it’s all about.
You can’t possibly answer the user’s need without understanding their need. As I’ll illustrate in the next section, this is complex!
How query sessions work
Part of the problem with understanding user intent is that the user may not have fully worked out what they’re looking for until they start the process.
Consider the following example of a query sequence that was given to me via Microsoft many years ago:
The initial query seems quite simple: “Merrell Shoes.”
You can imagine that the user entering that query often has a specific Merrell shoe in mind, or at least a shoe type, that they want to buy.
However, we see this user’s path has many twists and turns.
For example, the second site they visit is www.merrell.com, a website you might suspect has authoritative information about Merrell shoes.
However, this site doesn’t appear to satisfy the user’s needs.
The user ends up trying four more different queries and visiting six different websites before they finally execute a transaction on www.zappos.com.
This degree of uncertainty in search query journeys is quite common.
Some of the reasons users lack this clarity are that they:
- Don’t fully understand the need that they’re feeling.
- Don’t know how to ask the right questions to address their need.
- Need more information on a topic before deciding what they need.
- Are in general exploration mode.
Addressing this is an essential aspect of providing a great search experience. This is why the Follow-Up Resources score is part of my analysis.
Understanding categories of queries
Queries can be broadly categorized into several distinct groups, as outlined below:
- Informational: Queries where the user wants information (e.g., “what is diabetes?”).
- Navigational: Queries where the user wants to go to a specific website or page (e.g., “United Mileage Club”).
- Commercial: Queries where the user wants to learn about a product or service (e.g., “Teak dining table”).
- Transactional: Queries where the user is ready to conduct a transaction (e.g., “pizza near me”).
Recent data from SparkToro’s Rand Fishkin provides some insight into the percentage of search queries that fall into each of these categories:
Be advised that the above is a broad view of the categories of queries.
The real work in search relates to handling searches on a query-by-query basis. Each query has many unique aspects that affect how it can be interpreted.
Next, we’ll examine several examples to illustrate this. Then, we’ll compare how ChatGPT search and Google performed on these queries.
Query type: Directions
This query type is a natural strength for Google (as is any locally oriented query). We can see ChatGPT search’s weaknesses in this area in its response:
The problems with this response are numerous.
For example, I wasn’t in Marlborough, Massachusetts, when I did the query (I was in the neighboring town of Southborough).
In addition, steps 1 and 2 in the directions are unclear. Anyone following them and heading east on Route 20 would end up at Kenmore Square in Boston without ever crossing I-90 East.
In contrast, Google nails it:
The reason why Google handles this better is simple.
Google Maps has an estimated 118 million users in the U.S., and Waze adds another 30 million users.
I wasn’t able to find a reasonable estimate for Bing Maps, but suffice it to say that it’s far lower than Google’s.
Google is so much better than Bing here because I use Google Maps, and that lets Google know exactly where I am.
This advantage applies to all Google Maps and Waze users in the U.S.
Query type: Local
Other types of local queries present similar issues for ChatGPT search. Note that a large percentage of search queries have local intent.
One estimate pegged this at 46% of all queries. This was reportedly shared by a Googler during a Secrets of Local Search conference at Google HQ in 2018.
Here is ChatGPT’s response to one example query that I tested:
As with the directions example, it thinks that I’m in Marlborough.
In addition, it shows two pizza shops in Marlborough (only one of the two is shown in my screenshot).
Google’s response to this query is much more on point:
I also gave Google a second version of the query, “Pizza shops in Marlborough,” and it returned 11 locations, nine more than I saw from ChatGPT search.
This shows us that Google also has far more access to local business data than ChatGPT search.
For this query class (including the Directions discussed previously), I assigned these scores:
- ChatGPT search: 2.00.
- Google: 6.25.
Query type: Content gap analysis
A content gap analysis is one of the most exciting SEO tasks that you can potentially do with generative AI tools.
The concept is simple: provide the tool of your choice a URL from a page on your site that you’d like to improve and ask it to identify weaknesses in the content.
As with most things involving generative AI tools, it’s best to treat the output of this type of query as brainstorming input that your subject matter expert writer can draw on when updating your content.
There are many other types of content analysis queries that you can do with generative AI that you can’t do with Google (even with AI Overviews) at this point.
For this study, I did four content gap analysis queries to evaluate how well ChatGPT search did with its responses.
Google presented search results related to the page I targeted in the query but did not generate an AI Overview in any of the four cases.
However, ChatGPT search’s responses had significant errors for three of the four queries I tested.
Here is the beginning of ChatGPT search’s response to the one example query where the scope of errors was small:
This result from ChatGPT isn’t perfect (there are a few weaknesses), but it’s pretty good. The start of Google’s response to the same query:
Overall, I tried four different content gap analysis queries, and ChatGPT search made significant errors in three of them. For this query class, I assigned these scores:
- ChatGPT search: 3.25.
- Google: 1.00.
Query type: Individual bio
How well each platform handles these queries depends on how well-known the person is.
If the person is very famous, such as Lionel Messi, there will be large volumes of material written about them.
If the amount of material written about the person is relatively limited, there is a higher probability that the published online information hasn’t been kept up to date or fact-checked.
We see that in the responses to the query from both ChatGPT search and Google.
Here is what we see from ChatGPT search:
Query type: Debatable user intent
Arguably, nearly every search query has debatable user intent, but some cases are more extreme than others.
Consider, for example, queries like these:
- Diabetes.
- Washington Commanders.
- Physics.
- Ford Mustang.
Each of these examples represents an extremely broad query that could have many different intents behind it.
In the case of diabetes:
- Did the person just discover that they have (or a loved one has) diabetes, and do they want a wide range of general information on the topic?
- Are they focused on treatment options? Long-term outlook? Medications? All of the above?
Or, for a term like physics:
- Do they want a broad definition of what it’s about?
- Or is there some specific aspect of physics that they wish to learn much more about?
Creating the best possible user experience for queries like these is tricky because your response should provide opportunities for each of the most common possible user intents.
For example, here is how ChatGPT responded to the query “physics”:
Query type: Disambiguation
One special class of debatable-intent queries involves words or phrases that require disambiguation. Here are some example queries that I included in my test set:
- Where is the best place to buy a router?
- What is a jaguar?
- What is mercury?
- What is a joker?
- What is a bat?
- Racket meaning.
For example, here is how ChatGPT search responded to the query “What is a joker?”
Query type: Maintaining context in query sequences
Another interesting aspect of search is that users tend to enter queries in sequences.
Sometimes those query sequences contain a lot of information that helps clarify the user’s intent.
An example query sequence is as follows:
- What is the best router to use for cutting a circular table top?
- Where can I buy a router?
As we’ve seen, the default assumption when people speak about routers is that they mean the networking devices that connect multiple devices to a single Internet source.
However, a different type of device, also called a router, is used in woodworking.
In the query sequence above, the reference to cutting a circular table top should make it clear that the user’s interest is in the woodworking type of router.
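The idea of carrying context across a query sequence can be illustrated with a toy Python sketch. This is not how ChatGPT search or Google works; the cue-word lists, the `infer_sense` helper, and the default sense are all assumptions made up for this example. It simply counts cue words across every query in the session, so an earlier query can settle the meaning of a later, ambiguous one.

```python
import re

# Hypothetical cue words for two senses of "router"; illustrative only.
SENSE_CUES = {
    "woodworking": {"cutting", "wood", "table", "edge", "trim"},
    "networking": {"wifi", "internet", "network", "modem", "ethernet"},
}
DEFAULT_SENSE = "networking"  # assumed fallback when the session has no cues


def tokenize(query):
    """Lowercase a query and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", query.lower()))


def infer_sense(session_queries):
    """Pick the sense whose cue words appear most often across the session."""
    tokens_per_query = [tokenize(q) for q in session_queries]
    hits = {
        sense: sum(len(cues & tokens) for tokens in tokens_per_query)
        for sense, cues in SENSE_CUES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else DEFAULT_SENSE


session = [
    "What is the best router to use for cutting a circular table top?",
    "Where can I buy a router?",
]

print(infer_sense(session))      # uses the whole session
print(infer_sense(session[1:]))  # second query alone, no earlier context
```

With the full session, the words "cutting" and "table" in the first query point to the woodworking sense. With the second query alone, no cues are present and the sketch falls back to the assumed default.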
ChatGPT’s response to the first query was to mention two specific models of routers and the general characteristics of different types of woodworking routers.
Then the response to “where can I buy a router” was a map with directions to Staples and the following content: