How Colgate Replaced a $50k Survey with 8 Prompts
Synthetic Consumers: Reproducing Real Behavior with LLMs for Market Research
What if you could replace a $50,000 consumer survey with eight prompts? Thomas Wiecki’s team did it for Colgate: 57 real products, synthetic respondents that matched real human purchasing behaviour at 90% of the ceiling, no fine-tuning, no training data needed.
Thomas co-created PyMC, the probabilistic programming library, did a PhD in computational psychiatry at Brown, ran data science at Quantopian, and now leads PyMC Labs, applying Bayesian modelling to generative AI problems. We met in Bayesian inference circles (the real generative modelling!) and he dropped into a recent session of our Building AI Applications course to share the Colgate work.
The paper (first author: Ben Maier) caught Ethan Mollick’s attention and spread fast. The idea of using LLMs as synthetic survey respondents has floated around for two or three years, but nobody had got them to work reliably. Thomas’s team cracked it with a simple trick: stop forcing the model to pick a number and let it talk.
This post captures some of our favourite parts of the session, including:
why asking LLMs for numerical ratings produces garbage,
a simple method that turns textual responses into accurate survey data,
how to validate that your synthetic users behave like real ones,
demographic effects that shouldn’t work but do,
why synthetic respondents might be more honest than real humans, and
what this means for anyone building products.
👉 This was a guest Q&A from our November cohort of Building AI Applications for Data Scientists and Software Engineers. It’s a live cohort with hands-on exercises and office hours. Our final cohort is in March. Here is a 25% discount code for readers. 👈
You can check out the full presentation here:
Timestamps:
00:00 Introduction and Thomas’s Background
02:00 The Business Problem: Consumer Survey Research
04:00 How to Measure Success
07:00 The Naive Method: Asking LLMs for Ratings
09:00 The SSR Method: Let the Model Talk
11:00 Results: Correlation and Distribution Recovery
14:00 Demographic Effects: Income, Age, Price
16:00 Benefits Over Human Surveys
17:00 Q&A: Overfitting, Personas, and Broader Applications
LLMs and the Always-Three Problem
Let’s say you have a product and you want to know whether people will buy it. The standard approach is to show it to a few hundred people, ask them to rate their purchase intent from 1 to 5, and look at the distribution.
Thomas’s team worked with Colgate on 57 personal care products to see if they could reproduce such expensive survey results with synthetic LLM-generated responses.
They already had real survey data to compare it to: each of the 57 products had been surveyed with 150 to 400 real participants, with demographics across age, gender, region, and income. The method isn’t limited to products (you could test ads, packaging, pricing) but product surveys were where they started.
The first thing they tried was treating the LLM like a survey respondent. Give it a persona, show it the product image, ask: “How likely would you be to purchase this product?” Constrain the output to 1 through 5.
The correlation comes in at R=0.66, which is 82% of the theoretical ceiling:
Not bad! Until you look at the distribution, that is.
No matter which product it was shown, the model gravitated toward 3. The distributions are nearly identical across all 57 products, with a mean similarity of just 0.26.
Research on LLMs as random number generators shows the same pattern: ask for a number between 1 and 10 and 7 is massively overrepresented. Ask for 1 to 5 and you get 3.
“It will just try and avoid conflict, and say something middle of the road.” Thomas Wiecki, [07:00]
Let Your Model Speak!
Thomas’s team spent a long time staring at this problem before Ben Maier, the first author, found the fix.
Instead of constraining the LLM to a number, let it respond in natural language: Show it a product, give it a demographic persona, ask the same question, but allow a brief textual answer. The model might say: “I’m somewhat interested. If it works well and isn’t too expensive, I might give it a try.”
The embedding trick
Now you have text, but you need a number. Here’s how they bridge the gap:
Embed the textual response into a vector;
Embed a set of reference responses, one for each point on the 1-to-5 scale (“It’s unlikely that I’d buy it” for 1, “It’s somewhat possible I’d buy it” for 4, and so on).
Measure how close the response vector is to each reference (using cosine similarity)
A response like “I might try it if the price is right” lands mostly on 3 and 4. “This is exactly what I’ve been looking for” loads up on 5.
“The answer is simple. Instead of saying just give responses 1, 2, 3, 4, 5, you actually allow it to give a brief textual response.” Thomas Wiecki, [09:00]
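The mapping from free text to the 1-to-5 scale can be sketched in a few lines. This is a minimal illustration with toy 3-d vectors standing in for real embeddings, and a simple min-shift normalisation of the cosine similarities; the paper’s exact embedding model and normalisation scheme may differ.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ssr_distribution(response_vec: np.ndarray,
                     reference_vecs: list) -> np.ndarray:
    """Map a response embedding to a probability distribution over
    the 1-to-5 scale by comparing it against one reference embedding
    per scale point, then normalising the similarities."""
    sims = np.array([cosine(response_vec, r) for r in reference_vecs])
    # Shift so the least-similar reference contributes zero weight,
    # then renormalise to sum to 1 (one simple choice of scheme).
    weights = sims - sims.min()
    if weights.sum() == 0:
        return np.full(len(sims), 1 / len(sims))
    return weights / weights.sum()

# Toy 3-d "embeddings" standing in for real embedding-model output,
# ordered from "unlikely to buy" (1) to "definitely buying" (5):
references = [np.array(v, dtype=float) for v in
              [[1, 0, 0], [0.8, 0.6, 0], [0.4, 0.9, 0.2],
               [0, 0.8, 0.6], [0, 0.2, 1]]]
response = np.array([0.1, 0.9, 0.4])  # closest to the 3/4 region

dist = ssr_distribution(response, references)
expected_rating = float(np.dot(dist, np.arange(1, 6)))
```

A hedged response like “I might try it if the price is right” would land between 3 and 4 rather than collapsing to a single forced integer, which is exactly what lets SSR recover realistic distributions.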
How Do You Know Your Synthetic Users Are Real?
If you ran the same survey twice with real humans, you wouldn’t get identical results.
So before evaluating SSR, Thomas’s team established a ceiling: split the real respondents in half at random, correlate the two halves, repeat 1,000 times. That gives you the maximum correlation any method could hit, real or synthetic.
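The split-half procedure is straightforward to sketch. Here it is on synthetic toy data (product names, sample sizes, and noise levels are invented for illustration; the team ran this on the real Colgate survey responses):

```python
import numpy as np

def split_half_ceiling(responses_per_product, n_iter=1000, seed=0):
    """Estimate the test-retest ceiling: randomly split each product's
    respondents in half, correlate the per-product mean ratings of the
    two halves across products, and average over many random splits."""
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_iter):
        half_a, half_b = [], []
        for resp in responses_per_product:
            shuffled = rng.permutation(resp)
            mid = len(shuffled) // 2
            half_a.append(shuffled[:mid].mean())
            half_b.append(shuffled[mid:].mean())
        corrs.append(np.corrcoef(half_a, half_b)[0, 1])
    return float(np.mean(corrs))

# Toy data: 20 "products", each with a true mean purchase intent
# and 200 noisy 1-to-5 ratings around it.
rng = np.random.default_rng(42)
true_means = rng.uniform(1.5, 4.5, size=20)
surveys = [np.clip(np.round(rng.normal(mu, 1.0, size=200)), 1, 5)
           for mu in true_means]

ceiling = split_half_ceiling(surveys, n_iter=200)
```

No synthetic method can beat this number: it is the correlation real humans achieve against themselves.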
With SSR, the correlation jumps to R=0.72: 90% of that ceiling, up from 82% with the naive method.
Where it really shows is in the distributions. The naive method scored 0.26, where every product looked the same. SSR hits 0.88. The model now produces distributions that vary by product and match the shape of real human responses.
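The post doesn’t spell out which distribution-similarity metric was used, but one plausible, easy-to-read choice is histogram overlap: the probability mass shared by two rating distributions. A minimal sketch, with invented distributions for illustration:

```python
import numpy as np

def histogram_overlap(p, q) -> float:
    """Similarity between two rating distributions over the 1-to-5
    scale: the mass shared by both histograms (1.0 = identical,
    0.0 = disjoint). One plausible metric; the paper's may differ."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.minimum(p, q).sum())

human = [0.05, 0.15, 0.30, 0.35, 0.15]  # varied, product-specific shape
naive = [0.05, 0.10, 0.70, 0.10, 0.05]  # everything piled onto 3

identical = histogram_overlap(human, human)
collapsed = histogram_overlap(human, naive)
```

A method that always answers 3 can still correlate decently across products while scoring poorly on this kind of metric, which is why the 0.26-versus-0.88 gap matters.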
Both GPT-4o and Gemini 2.0 Flash, frontier models at the time of publication, reach these numbers despite being different architectures, with no fine-tuning and no survey data shown to the model.
“I personally was really amazed by how well it worked. I still can’t quite believe it... There’s no training here. We’re just asking, just vanilla LLM. Here is the product, what do you think? And then it reproduces these effects.” Thomas Wiecki, [11:00]
The Mango Toothpaste Test
Thomas’s team then looked at the data across demographics.
Income and age
In real surveys, people with lower incomes are less likely to say they’ll buy premium personal care products.
SSR reproduces this and matches the shape of the curve, not just the direction.
Caption - Income effects on purchase intent: real surveys vs synthetic
Then there’s age: In the human survey data, purchase intent peaks in the middle age ranges and drops at both extremes: an inverted U-shape.
As Thomas put it, “that’s a pretty specific pattern that I’m not sure should be in the model.”
But it is!
Younger consumers show more openness to novel products (a teenager is more likely to try mango-flavoured toothpaste than a retiree!) and the synthetic consumers reproduce this too.
The model picks up on price sensitivity across product segments and cultural attitudes: when a product is labelled as AI-generated, both real humans and synthetic consumers rate it lower.
Without demographics, it falls apart
Remove the persona and prompt the model without specifying age, income, or gender, and the correlation drops to R=0.39. The model can’t tell products apart and it says everything is pretty good, which ironically pushes distribution similarity up to 0.91 (because real humans have the same positivity bias), but the signal is gone.
“What made me believe that this is something real is how well we reproduce the
demographic data.” Thomas Wiecki, [14:00]
The Only Time AI Is More Honest Than Real Humans
Positivity bias is one of the most replicated findings in consumer research: put people in a survey, especially if they’re being paid, and they’ll rate products higher than they’d actually buy.
In contrast, when a product is overpriced or unnecessary, LLMs say so!
GPT-4o: It seems a bit too high-end for my needs and budget.
Gemini: Seems kinda bougie for personal care.
What real humans say vs what LLMs say
LLMs also provide more detailed responses, on average.
Ask real survey respondents “What did you like about the concept?” and you get: “It’s good.” “Inexpensive and affordable. New & light. Idea is cool.” “Not much, just the steps and how it tells you what it was for.”
Ask the LLM and you get: “I might consider purchasing it. The ease of use and safety are appealing, but I’d want to know more about its effectiveness and any potential side effects.” Or: “I might consider purchasing it. The ease of use and the promise of gentleness are appealing. Plus, it’s from a trusted brand.”
You can then summarise the LLM responses, cluster them by theme, run qualitative analysis: the kind of work that’s impossible with one-word human answers.
“Something in the pre-training must make it think like humans do. And in that is the ability to answer to novel product ideas. Who would have thought? This stuff is wild.” Thomas Wiecki, [12:00]
Eight Prompts or Fine-Tuning?
The whole pipeline:
give the LLM a demographic persona (age, gender, income level, region),
show it the product image and description,
ask “How likely would you be to purchase this product?”,
allow a brief textual response, and
map the text back to the 1-to-5 scale through embeddings.
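The prompt side of that pipeline is just templating. A hypothetical sketch of a prompt builder — the field names and wording are illustrative, not the paper’s exact prompt, and image handling is omitted since it depends on the model API:

```python
def build_survey_prompt(persona: dict, product_description: str) -> str:
    """Assemble the survey question for one synthetic respondent.
    Persona fields (age, gender, region, income) mirror the
    demographics used in the real survey data."""
    return (
        f"You are a {persona['age']}-year-old {persona['gender']} "
        f"from the {persona['region']}, with a {persona['income']} "
        f"household income.\n\n"
        f"Here is a product:\n{product_description}\n\n"
        "How likely would you be to purchase this product? "
        "Answer in one or two sentences, in your own words."
    )

prompt = build_survey_prompt(
    {"age": 34, "gender": "woman", "region": "Midwest",
     "income": "middle"},
    "A gentle, mango-flavoured whitening toothpaste.",
)
```

The brief-textual-answer instruction at the end is the whole trick: it replaces the “pick a number from 1 to 5” constraint that caused the always-three collapse.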
When I asked Thomas how many times they iterated on the prompts, he replied: “Eight or ten. Low double digits.” For context, some of our course participants iterate 10 to 20 times for production prompts, and others in healthcare iterate several thousand. SSR got to 90% of the human ceiling in about eight goes.
Thomas and his team tried a range of LLMs. GPT-4o and Gemini 2.0 Flash gave the best results; other models at the time fell short. Both were frontier when the research was done. Both are now mid-tier. Gemini 3, Claude Opus 4.6, and the latest GPT models have all shipped since. If SSR hit 90% of the ceiling with models that were barely clearing the bar, today’s frontier models should push that further.
PyMC Labs is already building on this: they’re launching a product for ad testing using the same approach.
What This Means for Builders
Thomas’s team worked with Colgate on personal care products, but the method extends to ads, messaging, pricing: anything you want consumer feedback on. If you have a product and you know your users, you could build a synthetic version of your customer base in an afternoon. Just make sure to have robust evaluation methods in place!
Say you’re considering a new pricing tier: prompt a synthetic panel across income and usage levels, ask open-ended questions about the change, map responses back to a quantitative scale through embeddings. You get signal before running a full survey, and you can rerun it as the product changes.
Eight prompts and an embedding model, no fine-tuning, no training data, no six-figure research contract, and the models have only got better since Thomas ran these experiments. The paper is on arXiv. The code is on GitHub. Try it on your own product and see what your synthetic customers think.
👉 This was a guest Q&A from our November cohort of Building AI Applications for Data Scientists and Software Engineers. It’s a live cohort with hands-on exercises and office hours. Our final cohort is in March. Here is a 25% discount code for readers. 👈