Shopify Just Removed the Last Excuse for Not Testing
One of the questions I get asked most often by brands that should be running experiments is some version of "What tool do we need to get started with A/B testing?"
It's usually not the real question. The real question is closer to: "What's the minimum viable reason to start?"
Shopify's Winter '26 Edition answered that. They shipped two tools in the same release that together remove almost every structural objection brands have historically used to avoid testing. One is called Rollouts. The other is SimGym.
They do different things and they're worth understanding separately before thinking about how they work together.
Rollouts allows you to test theme changes
Until this year, running a proper A/B test on a Shopify storefront required a third-party app: Shoplift, Intelligems, Convert, ABlyft, or Optimizely, for example. All are capable tools, and all add injected JavaScript to your storefront, a monthly subscription cost, and an ongoing maintenance overhead that somebody on the team has to own.
The JavaScript injection problem was more significant than most brands appreciated. External testing scripts add page weight and introduce latency at precisely the moment that matters most. For a drop-model brand where a significant proportion of revenue clears in a 48-hour window, a few hundred milliseconds of additional load time isn't an abstract performance metric. It's a conversion rate problem sitting inside the tool you're using to improve your conversion rate.
Rollouts is server-side. The split happens before the page is served to the visitor. You create a version of your theme, push a defined percentage of real traffic to it, and Shopify tracks performance natively. There are no third-party scripts, no external platforms, and no additional monthly costs. It's built directly into the admin and it's free on all Shopify plans.
The practical workflow: you duplicate your published theme, make whatever change you want to test, set a traffic split, and let it run. Shopify handles the assignment and the measurement. You look at the data and make a decision.
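To make "the split happens before the page is served" concrete, here is a minimal Python sketch of how a deterministic server-side percentage split can work in general. It is not Shopify's implementation. The function name `assign_variant`, the idea of a stable visitor ID, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib


def assign_variant(visitor_id: str, test_name: str, test_share: float) -> str:
    """Return 'test' or 'control' for a visitor (hypothetical helper).

    test_share is the fraction of traffic sent to the test theme, e.g. 0.2.
    """
    if not 0.0 <= test_share <= 1.0:
        raise ValueError("test_share must be between 0 and 1")
    # Hash the test name together with the visitor ID so the same visitor
    # always gets the same answer for this test, with no stored state.
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex characters (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_share else "control"


# The server decides which theme to render before any HTML is sent.
variant = assign_variant("visitor-123", "pdp-layout", 0.2)
```

The point of the sketch is the ordering: the decision is made on the server from an ID, so there is no script in the browser swapping content after the page loads.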
What it currently tests is theme-level changes. Layout, content blocks, navigation structure, hero sections, PDP layouts. It doesn't test individual elements in isolation the way a dedicated CRO tool does. You can't run a headline test without touching anything else on the page. That's a real limitation and worth being honest about. If you're running a sophisticated experimentation programme with a high volume of granular tests, you'll still want a dedicated platform. But most brands I work with aren't doing that. Most brands I work with aren't testing at all.
For that majority, Rollouts is not a compromise. It's a better starting point than most of what they've been paying for.
What SimGym is and what it isn't
SimGym is a separate, first-party Shopify app currently in AI Research Preview. It does something genuinely new.
It sends AI-powered shoppers through your storefront. Not bots running random click patterns. Shopify has trained the shopper personas on behavioural data from billions of real transactions across their platform. It then refines those personas to the behaviour of your specific store's customer base. The simulated shoppers browse your store, navigate collections, add items to cart, and surface friction points. At the end of a simulation you get add-to-cart rate comparisons, navigation pattern data, and a winning theme recommendation.
The interesting CRO use case is pre-launch validation. Most brands change something significant (a new homepage layout, a restructured PDP, a different navigation architecture) and then push it live to 100% of traffic based on a gut call or a team vote. If they later want to know whether it worked, they're comparing apples to oranges because thirty other things also changed in the same window.
SimGym lets you run the change through AI shoppers before any real customer sees it. You get a signal on add-to-cart behaviour and friction points, and you can make a more informed call about whether to proceed, iterate, or abandon the idea before it's live.
Two things worth being clear about. SimGym is still in research preview, which means the functionality and pricing are subject to change. The current model is pay-per-credit, with some free credits allocated during the preview period. It's not the same as Rollouts, which is free on all plans. And the simulated shoppers are not a substitute for real traffic data. Shopify is explicit about this. Results from AI personas will differ from actual buyer behaviour. What SimGym gives you is a better-informed hypothesis before you commit real traffic, not a replacement for testing with real visitors.
Use SimGym, then test using Rollouts
Used in sequence, the workflow makes sense. Run a SimGym simulation to validate the hypothesis and identify obvious friction before the test goes live. Then run it through Rollouts with real traffic to get the actual result. You spend less time running tests on weak hypotheses and more time running tests on things that have already survived an initial signal check.
What this changes for CRO on Shopify
I want to be direct about the strategic shift here because I think most commentary on these tools undersells it.
The reason most Shopify brands don't test systematically isn't budget. It isn't technical capability. It's the combination of perceived complexity, setup cost, and the lack of a clear owner for the work. Third-party testing tools required someone to own the platform, manage the JavaScript, interpret the results, and justify the monthly fee. That combination meant it lived in the gap between technical and trading teams and never got prioritised.
Rollouts removes most of that friction. The tool is in the admin. The data is in the admin. The person who manages the Shopify theme is already in the admin. There's no integration to maintain and no external platform to justify.
That doesn't mean testing becomes automatic. You still need a hypothesis. You still need to run tests for long enough to get a meaningful result. You still need someone who can read the data and make a decision from it.
None of those things happen by default just because the tool exists.
But the structural excuse is gone. The question of whether to test is now a question about prioritisation and process. It's not about tooling or budget.
For drop-model brands specifically, where the conversion opportunity concentrates in a compressed window and there are genuine periods between drops for preparation work, this should shift the pre-drop checklist.
Testing a new PDP layout or a revised collection page structure in the weeks before a drop, by sending a defined percentage of real traffic to the test theme rather than pushing changes live and hoping, is now something you can do without a third-party tool or a developer setting up an external platform.
The preparation window between drops has always been where the compounding gains are. You do the structured data work, the entity maintenance, the email list warm-up, the creative refinement. Systematic testing of on-site changes belongs in that same window. Rollouts makes that realistic for brands that couldn't justify the overhead before.
The limitation worth keeping in mind
Rollouts tests themes, not elements. That's the distinction that matters.
If you want to know whether a different headline on your PDP improves conversion, you can't isolate that change cleanly in Rollouts. Not without it being the only change in your test theme, which requires discipline and slows down the rate at which you can iterate. A dedicated CRO tool with element-level targeting and statistical significance calculations is still the better instrument for granular, high-velocity experimentation.
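If you want a rough significance check on a theme-level result without a dedicated tool, a two-proportion z-test is the textbook starting point. The sketch below assumes you can get visitor and conversion counts for each theme version. The function name and the example numbers are hypothetical, and this is a simplification of what a dedicated platform calculates.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for variant B against variant A.

    conv_a, conv_b: number of conversions in each variant.
    n_a, n_b: number of visitors in each variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that the themes perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No conversions (or all conversions) in both arms: nothing to compare.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control theme (A) against test theme (B).
z, p = two_proportion_z_test(conv_a=300, n_a=10000, conv_b=400, n_b=10000)
```

A small p-value says the gap between the two themes is unlikely to be noise. It does not tell you which of several bundled changes caused the difference, which is exactly the theme-versus-element limitation.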
The brands for whom Rollouts is a complete solution are brands running a small number of meaningful tests per quarter on significant layout or structural changes. For brands running ten or fifteen tests a month across granular elements, it's a useful addition to the stack rather than a replacement for what they already have.
Most of the brands I work with are in the first category. And most of them haven't been testing at all.
For them, the question isn't whether Rollouts is as powerful as a dedicated platform. The question is whether it's good enough to start building an evidence culture inside the team. It is. And starting is the part that matters.
The brands that figure out their testing process now, even imperfectly, will have compounded a meaningful advantage by the time agentic commerce is the primary discovery surface. The ones waiting for the perfect tool to arrive are going to be designing experiments in a context that has already moved on.
Pick your highest-traffic PDP. Form a hypothesis. Set up two theme versions. Run it for three to four weeks. Read the data. The tool is free. The only thing left is the decision to start.