February Report: How consistent are AI responses for the exact same query?
I asked ChatGPT the same five questions 100+ times each, then recorded how consistent the answers were.
Brand Consistency Across Responses for the Same Query
Responses Where Brands Exactly Match, in Any Order
Responses Where Brands And Their Order Match Exactly
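As a rough sketch of how the two match rates above could be computed: the page doesn't define "match" precisely, so this assumes a response matches when it equals the most common brand set (any order) or the most common brand sequence (same order). The function name `match_rates` and the sample responses are hypothetical.

```python
from collections import Counter


def match_rates(responses):
    """Return (any_order_rate, exact_order_rate) for a list of brand lists.

    Assumption: a response "matches" when it equals the most common
    brand set (any order) or the most common brand sequence (same order).
    """
    if not responses:
        return 0.0, 0.0
    total = len(responses)
    # frozenset ignores order; tuple preserves it
    set_counts = Counter(frozenset(r) for r in responses)
    seq_counts = Counter(tuple(r) for r in responses)
    any_order = set_counts.most_common(1)[0][1] / total
    exact_order = seq_counts.most_common(1)[0][1] / total
    return any_order, exact_order


# Hypothetical responses, for illustration only
sample = [
    ["HubSpot", "Salesforce", "Pipedrive"],
    ["Salesforce", "HubSpot", "Pipedrive"],
    ["HubSpot", "Salesforce", "Pipedrive"],
    ["HubSpot", "Zoho", "Pipedrive"],
]
any_order_rate, exact_order_rate = match_rates(sample)
print(any_order_rate, exact_order_rate)  # 0.75 0.5
```

The exact-order rate can never be higher than the any-order rate, since two responses with the same order always contain the same brands.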
Brand "Ranking" Position Consistency
How consistently do brands appear in the same position?
| Brand | Appearances | Avg Pos | Most Common | Consistency | Position Distribution |
|---|---|---|---|---|---|
Consistency = % of appearances at most common position. Dot size = frequency at that position.
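A minimal sketch of that consistency definition, assuming each response is an ordered list of brands and a brand appears at most once per response. The function name `position_stats` and the sample data are hypothetical, not taken from the live report.

```python
from collections import Counter, defaultdict


def position_stats(responses):
    """Per-brand appearances, average position, most common position,
    and consistency (% of appearances at the most common position)."""
    positions = defaultdict(list)
    for brands in responses:
        for rank, brand in enumerate(brands, start=1):
            positions[brand].append(rank)

    stats = {}
    for brand, ranks in positions.items():
        top_pos, top_count = Counter(ranks).most_common(1)[0]
        stats[brand] = {
            "appearances": len(ranks),
            "avg_pos": sum(ranks) / len(ranks),
            "most_common": top_pos,
            "consistency": 100 * top_count / len(ranks),
        }
    return stats


# Hypothetical responses, for illustration only
sample = [
    ["HubSpot", "Salesforce", "Pipedrive"],
    ["Salesforce", "HubSpot", "Pipedrive"],
    ["HubSpot", "Salesforce", "Pipedrive"],
    ["HubSpot", "Zoho", "Pipedrive"],
]
stats = position_stats(sample)
print(stats["HubSpot"])
# {'appearances': 4, 'avg_pos': 1.25, 'most_common': 1, 'consistency': 75.0}
```

Note that a brand appearing only once scores 100% under this definition, so consistency is best read alongside the appearances column.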
Average Number of Brands Per Response
Average Number of URLs Per Response
Sentiment Scores By Brand
Commentary next to each brand mention was monitored for its sentiment over time.
| Brand | Avg Score | Consistency | Score Trend |
|---|---|---|---|
| | 8.6 | 90% | |
| | 8.5 | 89% | |
| | 8.1 | 90% | |
| | 8.1 | 87% | |
| | 8.0 | 86% | |
| | 7.3 | 88% | |
Questions I Thought You Might Have
This page is new, so go easy on me, but hopefully I've covered your questions below.
What was the inspiration behind this tool?
I've been tracking and analyzing AI search responses for over a year now, and it's a big part of my focus at Ahrefs where the team tracks hundreds of millions of prompts.
That said, full credit for the inspiration to create this page goes to Rand Fishkin, who shared a detailed guide on the SparkToro blog about how consistent responses are.
Near the end of his article, Rand said "More data is needed. More people should look into these questions." I wanted some more answers for myself, so I decided to start my own live report on the topic.
Won't answers naturally change, and affect consistency anyway?
Yes! Answers will change based on model updates, changes to data sources (such as Google updates) and more.
The current data only covers responses from the last few days, to keep the focus on individual response consistency.
That said, going forward I should either check far more results over the course of a day, or separate data based on short time periods (such as daily, or weekly).
How do sentiment scores work?
There are lots of different APIs to assign sentiment to text.
Here I kept things simple.
I took the commentary associated with each brand and ran it through OpenAI's GPT 5.2 model, providing custom instructions on how to score responses.
I checked each snippet three times and generated an average score, then repeated that across dozens of brand mentions.
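The averaging step could look something like the sketch below. It assumes a brand's score is the mean of its per-snippet averages, which the description above implies but doesn't state outright. `score_fn` is a hypothetical stand-in for the model call with custom scoring instructions, and `fake_scorer` exists only to make the example runnable.

```python
from statistics import mean


def average_sentiment(snippets, score_fn, runs=3):
    """Score each snippet `runs` times, average those runs, then
    average across all snippets for the brand.

    Assumption: the brand score is the mean of per-snippet averages.
    """
    if not snippets:
        raise ValueError("need at least one snippet")
    per_snippet = [
        mean(score_fn(text) for _ in range(runs)) for text in snippets
    ]
    return mean(per_snippet)


def fake_scorer(text):
    # Hypothetical stand-in for the real model call; returns a fixed
    # score so the example is deterministic.
    return 8.0 if "great" in text else 6.0


brand_score = average_sentiment(
    ["A great CRM for small teams", "Solid, but pricey"], fake_scorer
)
print(brand_score)  # 7.0
```

Repeating each check smooths out run-to-run variation in the scoring model itself, so the brand score reflects the commentary rather than a single noisy rating.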
Are you using the OpenAI API or something else?
I'm not using the OpenAI API, as the results are too different to what you get with the web interface.
I hope that one day their chat models will be much closer to the "real" thing.
For each query I start a new conversation in the web interface, from the US 🇺🇸.
I check results just a few times per hour, as I don't think there's a need to go overboard running this analysis. I'll likely slow this down much further now that the page is live and we have a lot of responses already.
How are you handling brand name variations?
I'm merging them.
If one response says 'HubSpot' and one response says 'HubSpot CRM', I count them both as HubSpot. I do the same for Monday, Monday CRM and Monday.com.
In the past I built a system to help me manage these for other projects. I approve all variations manually, and skip those which aren't accurate. That might sound like a lot of work, but it's easy when they're close matches.
This process might help increase the number of brand consistency matches, but I'm careful not to merge anything which is not a correct match. The Nike Fundamental weighted rope and Nike Fundamental jump rope are not the same product.
Finally, I don't count a brand when it's only mentioned in a sentence alongside other companies. It has to be a separate recommendation.
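To illustrate the merging step, here's a minimal sketch built around a manually approved alias map. The map contents come from the examples above, but the names `APPROVED_ALIASES` and `canonical_brand`, and the choice of "Monday" as the canonical label, are assumptions for illustration rather than the actual system.

```python
# Manually approved variations only; anything not listed is left alone.
# Keys are lowercased so matching is case-insensitive.
APPROVED_ALIASES = {
    "hubspot crm": "HubSpot",
    "monday crm": "Monday",
    "monday.com": "Monday",
}


def canonical_brand(name):
    """Map an approved variation to its canonical brand name.

    Unapproved names pass through unchanged, so close-but-different
    products are never merged by accident.
    """
    cleaned = name.strip()
    return APPROVED_ALIASES.get(cleaned.lower(), cleaned)


print(canonical_brand("HubSpot CRM"))  # HubSpot
print(canonical_brand("Monday.com"))   # Monday
# Different products stay separate because no alias was approved
print(canonical_brand("Nike Fundamental weighted rope"))
print(canonical_brand("Nike Fundamental jump rope"))
```

An explicit map is more work than fuzzy matching, but it keeps every merge a deliberate decision.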
What's the plan for this page going forward?
If there's enough interest, I'm happy to keep this page running, while being conscious of not going overboard with the tracking.
This page is purposefully only looking at one query per category, and responses in one platform (ChatGPT). I'm happy to add more sources, like Perplexity and Google's AI Mode, going forward.
If you have any comments or feature requests, please send an email to [email protected]


