7 Ways to Fix Flaky Tests in Your Mobile CI/CD Pipeline

A flaky test that fails 10% of the time sounds manageable. Until your 200-test suite contains a handful of them and your pipeline fails on every other run.

That is the math most mobile teams refuse to do. And it is exactly why flaky test automation fixes have become one of the most searched topics in QA engineering this year.
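Here is a rough sketch of that math. It assumes each flaky test fails independently of the others, which is a simplification, and the function name is ours, made up for illustration. With seven tests that each fail 10% of the time, the pipeline goes green a little under half the time.

```python
def pipeline_pass_probability(flaky_tests: int, failure_rate: float) -> float:
    """Chance that every flaky test passes in a single run.

    Assumes failures are independent, which real suites only approximate.
    """
    return (1.0 - failure_rate) ** flaky_tests


# How quickly a "manageable" 10% failure rate compounds.
for count in (1, 3, 7, 20):
    chance = pipeline_pass_probability(count, 0.10)
    print(f"{count:>2} flaky tests at 10%: pipeline passes {chance:.1%} of runs")
```

The exact numbers matter less than the shape of the curve: every additional flaky test multiplies the odds of a red run.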

Here is the reality: according to Bitrise's Mobile Insights report, the likelihood of encountering a flaky test rose from 10% in 2022 to 26% in 2025. Slack's engineering team reported that flaky tests caused 56.76% of their CI failures before they built a dedicated fix system.

You do not have to live with this. Here are seven specific, actionable flaky test automation fixes, from quick wins to architectural changes.

1. Use smart waits instead of hard-coded delays

The problem with sleep() and fixed timeouts

Hard-coded waits are the single biggest cause of flaky mobile tests. Async wait and timing issues account for roughly 45% of all test flakiness, according to root cause analyses across mobile CI systems.

The trap is simple. Set the wait too short and the test fails because the element has not loaded yet. Set it too long and your pipeline takes 45 minutes instead of 10.

Mobile makes this worse than web. Network latency varies wildly between Wi-Fi and cellular. Device performance differs across hardware generations. An animation that takes 200ms on a Pixel 8 takes 600ms on a budget Samsung.

Solution: element-aware polling

Replace every sleep(3) with a smart wait that polls for a specific element state. Wait for "visible," "clickable," or "text present" instead of waiting for a fixed number of seconds.

Set a maximum timeout (say 15 seconds) with a polling interval (250ms). The test moves on the instant the element is ready, but it will not hang forever if something breaks.

If you are using Appium, WebDriverWait with ExpectedConditions handles this. If you are using an AI-native platform like Autosana, smart waits are built in. The agent waits for the screen to stabilize before acting.
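To make the polling idea concrete, here is a minimal, framework-free sketch. The helper name wait_until and the fake element check are our own illustrations, not part of Appium or any other tool. The 15-second timeout and 250ms interval mirror the values suggested above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 15.0,
    interval: float = 0.25,
) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Move on the instant the element is ready.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Demo: a stand-in "element" that only becomes visible after 0.3 seconds.
start = time.monotonic()


def login_button_visible() -> Optional[str]:
    return "login-button" if time.monotonic() - start > 0.3 else None


print(wait_until(login_button_visible, timeout=2.0, interval=0.05))
```

In a real Appium suite written with the Python client, you would not hand-roll this. The same behavior comes from `WebDriverWait(driver, 15, poll_frequency=0.25).until(expected_conditions.element_to_be_clickable(locator))`.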

This single fix typically eliminates 30-40% of flaky failures overnight.

2. Isolate test data and state

Why shared state causes flakiness

Test A creates a user account. Test B logs in with that account. Test C deletes it.

Run them in order and everything passes. Run them in parallel (which your CI system does to save time) and Test B fails because Test C deleted the account first.

Shared test data is the second most common cause of mobile test flakiness. It is also the most frustrating to debug because the failures look random.

The fix: every test run gets its own data. Generate unique usernames, unique email addresses, unique test records. Each test creates what it needs in setup and tears it down after.

This is harder than it sounds for mobile apps. Many teams use a shared staging environment where test data persists between runs.

If your app has social features, tests that follow or unfollow users can interfere with each other. If your app has inventory, tests that purchase items can exhaust stock for other tests running in parallel.

The cleanest solution is a dedicated test environment that resets between runs. If that is not possible, generate unique test data with timestamps or UUIDs so tests never collide.
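A small sketch of the UUID approach follows. The GeneratedUser type, the make_test_user helper, and the example.test domain are assumptions for illustration. Swap in whatever fields your app's signup actually requires.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedUser:
    username: str
    email: str


def make_test_user(prefix: str = "qa") -> GeneratedUser:
    """Build credentials that will not collide with any other test run."""
    suffix = uuid.uuid4().hex[:12]  # random, so parallel runs get distinct values
    return GeneratedUser(
        username=f"{prefix}_{suffix}",
        email=f"{prefix}+{suffix}@example.test",
    )


first = make_test_user()
second = make_test_user()
print(first.username, second.username)
```

Call the helper in each test's setup, and delete the account in teardown, so no test ever depends on an account another test created.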

For mobile apps with backend dependencies, use API shortcuts. Instead of navigating four screens to create a test account through the UI, hit the API endpoint directly. Reserve the UI steps for the actual thing you are testing.

3. Adopt self-healing test automation

Let AI fix broken selectors automatically

A developer renames a button from btn-submit to submit-button. Every test that references the old selector breaks.

Not because the feature is broken. Because the test is brittle.

Self-healing test automation solves this. The AI detects that the element has changed, identifies the new selector using multiple attributes (text, position, context), and automatically updates the test.
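The core idea can be sketched in a few lines. This is a deliberately simplified illustration, not how Autosana, mabl, testRigor, or any other product works internally. Real platforms weigh far richer signals. The Element type and locate function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Element:
    element_id: str
    text: str
    role: str


def locate(
    screen: Sequence[Element], known: Element, min_score: int = 2
) -> Optional[Element]:
    """Find `known` on the screen, falling back to other attributes if its id changed."""
    # First choice: the selector the test was written against.
    for element in screen:
        if element.element_id == known.element_id:
            return element

    # Fallback: score candidates on the attributes that did not change.
    def score(element: Element) -> int:
        return int(element.text == known.text) + int(element.role == known.role)

    best = max(screen, key=score, default=None)
    if best is not None and score(best) >= min_score:
        return best  # "Healed": same text and role, new id.
    return None  # Nothing similar enough; treat as a genuine failure.


recorded = Element("btn-submit", "Submit", "button")
current_screen = [
    Element("nav-back", "Back", "button"),
    Element("submit-button", "Submit", "button"),
]
healed = locate(current_screen, recorded)
print(healed.element_id if healed else "not found")  # prints "submit-button"
```

The min_score threshold is the important design choice. Set it too low and the test "heals" onto the wrong element, hiding a real bug.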

The numbers are striking. Self-healing tests improve reliability from typical 70-80% pass rates to consistent 95-98% pass rates. Organizations report 85-95% reductions in maintenance effort after adopting self-healing platforms.

Platforms with built-in self-healing include Autosana (which uses agentic AI to navigate apps like a real user, eliminating selector dependence entirely), mabl, and testRigor. If you are comparing self-healing tests vs traditional test scripts, the maintenance difference is the biggest factor.

For most teams, this is the highest-impact fix on this list. You can read our full guide to self-healing test automation for a deeper look.

4. Test on real devices, not just emulators

Emulator limitations that cause flakiness

Emulators are great for development. They are terrible for reliable test automation.

Performance timing differs between emulators and real hardware. Features like camera, GPS, biometrics, and push notifications behave differently or do not work at all. Android emulators are particularly inconsistent because they cannot replicate the hardware variations across Samsung, Xiaomi, OnePlus, and dozens of other manufacturers.

A test that passes on your local emulator but fails in CI is not flaky. It is revealing a real environment difference.

Cloud device farms give you on-demand access to thousands of real devices without maintaining a physical lab. Services like AWS Device Farm, BrowserStack, and Sauce Labs handle this. Autosana includes cloud-hosted real devices for both iOS and Android, so there is no separate device farm to configure.

The cost is lower than you think. Maintaining a physical device lab with 20+ devices costs $15,000-30,000 per year. Cloud device farms start at a fraction of that.

5. Implement retry logic with quarantine

Smart retries vs dumb retries

Retrying every failed test three times is not a fix. It is a band-aid that triples your pipeline runtime and hides real bugs.

Smart retry logic distinguishes between infrastructure failures and assertion failures.

If a test fails because the device farm had a connectivity blip, retry it. If a test fails because the login button is genuinely missing, do not retry. That is a real bug.
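Here is one way that distinction might look in code. The InfrastructureError class is a hypothetical marker. In practice you would map your device farm's connection and session errors onto it.

```python
class InfrastructureError(Exception):
    """Hypothetical marker for failures outside the app under test."""


def run_with_retry(test_fn, max_attempts: int = 3):
    """Retry only infrastructure failures; let assertion failures surface at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return test_fn()
        except InfrastructureError:
            if attempt == max_attempts:
                raise  # Still failing after all retries: report it.
        # AssertionError is not caught, so a real bug fails on the first attempt.


# Demo: the first two attempts hit a connectivity blip, the third succeeds.
attempts = {"count": 0}


def login_test():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise InfrastructureError("device farm connection dropped")
    return "passed"


print(run_with_retry(login_test), "after", attempts["count"], "attempts")
```

Log every retry somewhere visible. A test that needs retries most days is a signal in itself.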

The quarantine pattern takes this further. When a test's intermittent failures reach a threshold (say, 3 failures in its last 10 runs), automatically quarantine it: move it out of the blocking pipeline into a separate investigation queue.

This keeps your main pipeline green and trustworthy while giving you visibility into which tests need attention. Datadog's Flaky Test Management and similar tools automate this tracking.
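If you want to prototype the quarantine rule before buying a tool, a sketch like this is enough. The QuarantineTracker class is our own illustration and does not reflect how Datadog or any other product implements it. It treats 3 or more failures in the last 10 runs as the threshold, and it leaves tests that fail every time in the blocking pipeline, since those are broken rather than flaky.

```python
from collections import defaultdict, deque


class QuarantineTracker:
    """Track recent pass/fail results per test over a sliding window."""

    def __init__(self, window: int = 10, max_failures: int = 3):
        self.max_failures = max_failures
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, test_name: str, passed: bool) -> None:
        self.history[test_name].append(passed)

    def is_quarantined(self, test_name: str) -> bool:
        runs = self.history[test_name]
        failures = runs.count(False)
        # Mixed results past the threshold mean flaky. All failures mean broken.
        return self.max_failures <= failures < len(runs)


tracker = QuarantineTracker()
for passed in [True, False, True, True, False, True, True, False, True, True]:
    tracker.record("test_checkout", passed)
print(tracker.is_quarantined("test_checkout"))  # prints True (3 failures in 10 runs)
```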

Track your flakiness rate over time. If it is trending down, your fixes are working. If it is trending up, you are adding tests faster than you are stabilizing them.

Set a flakiness budget for your team. Something like: "No PR merges if the flaky test rate exceeds 5%." This creates accountability and prevents the slow creep that turns a healthy pipeline into an unreliable mess.
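A budget only works if everyone computes the rate the same way. One possible definition, sketched below, counts a test as flaky when it both passed and failed on the same commit. That definition, the function names, and the sample data are assumptions for illustration.

```python
from typing import Mapping, Sequence


def flaky_rate(results: Mapping[str, Sequence[bool]]) -> float:
    """Share of tests that both passed and failed across reruns of one commit."""
    if not results:
        return 0.0
    flaky = sum(1 for outcomes in results.values() if len(set(outcomes)) > 1)
    return flaky / len(results)


def within_budget(results: Mapping[str, Sequence[bool]], budget: float = 0.05) -> bool:
    """Gate for the merge check: True while the flaky rate is at or under budget."""
    return flaky_rate(results) <= budget


# Hypothetical rerun results for one commit: test name -> outcomes per run.
sample = {
    "test_login": [True, True, True],
    "test_search": [True, False, True],  # flaky: mixed outcomes
    "test_cart": [True, True, True],
    "test_checkout": [False, False, False],  # consistently failing, not flaky
}
print(f"flaky rate: {flaky_rate(sample):.0%}, within budget: {within_budget(sample)}")
```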

6. Reduce test scope and dependencies

Smaller, more focused tests

A 40-step E2E test that covers login, search, add to cart, checkout, and order confirmation is doing too much. If any step is flaky, the whole test is flaky. And you have no idea which step caused it.

Each test should validate one user journey. One clear path. One assertion.

Break the mega-test into five focused tests: login test, search test, cart test, checkout test, confirmation test. Now when the cart test is flaky, you know exactly where to look.

Reduce cross-test dependencies too. If your checkout test depends on the search test having run first, you have created an invisible ordering requirement. Use API calls to set up test preconditions instead of depending on previous tests to create the right state.
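As a sketch of what that looks like, the checkout test below seeds its own cart instead of relying on a search test having run first. FakeBackend is an in-memory stand-in so the example runs on its own. The class, its methods, and seed_cart are hypothetical, and in a real suite the calls would go to your app's backend API.

```python
class FakeBackend:
    """In-memory stand-in for the app's backend API (hypothetical)."""

    def __init__(self):
        self.carts = {}

    def add_to_cart(self, user_id: str, sku: str) -> None:
        self.carts.setdefault(user_id, []).append(sku)

    def get_cart(self, user_id: str) -> list:
        return list(self.carts.get(user_id, []))


def seed_cart(api, user_id: str, skus) -> None:
    """Create the precondition directly, with no UI steps and no earlier test."""
    for sku in skus:
        api.add_to_cart(user_id, sku)


def test_checkout_starts_with_seeded_cart():
    api = FakeBackend()
    seed_cart(api, "user-123", ["sku-1", "sku-2"])
    # The UI-driven checkout steps, the thing actually under test, would go here.
    assert api.get_cart("user-123") == ["sku-1", "sku-2"]


test_checkout_starts_with_seeded_cart()
print("checkout precondition seeded without running the search test")
```

Because the test builds its own state, it passes whether it runs first, last, or in parallel with everything else.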

This makes tests faster, more debuggable, and significantly less flaky.

7. Switch to AI-native testing platforms

The nuclear option that actually works

Sometimes the right fix is not patching your existing test framework. Sometimes it is replacing it.

AI-native testing platforms like Autosana eliminate selector fragility entirely. Instead of locating elements by XPath or CSS selectors, the AI agent looks at the screen like a user does. It reads the text, understands the layout, and taps what makes sense.

When a developer changes the button label from "Submit" to "Place Order," the AI agent figures it out. No broken selectors. No maintenance ticket.

No flaky failure at 2 AM.

Natural language tests eliminate an entire category of flakiness. You write "log in, navigate to settings, toggle dark mode" and the agent handles the rest. There is no brittle locator to break.

This is the path for teams that have tried fixes 1 through 6 and still spend more time maintaining tests than writing features. For a deeper look at the natural language approach, check our guide to no-code mobile app testing.

Frequently asked questions

What's an acceptable flaky test rate?

Google's research suggests maintaining a greater than 90% signal-to-noise ratio as the minimum acceptable standard. In practical terms, aim for under 2-3% of your test runs being flaky. If your suite has a higher rate, prioritize these fixes. Each flaky test erodes team confidence in the entire pipeline.

Should I delete flaky tests or fix them?

Fix or quarantine them. Never delete.

A flaky test usually covers a real user scenario. Deleting it means you have lost coverage for that scenario entirely.

Quarantine the test, investigate the root cause using the seven fixes above, and bring it back once it is stable. The goal is zero flaky tests, not fewer tests.

Take your pipeline back

Flaky tests are a solvable problem. Start with the quick wins: smart waits (fix #1) and data isolation (fix #2) can be implemented in a day and will cut your flakiness rate significantly.

Then consider self-healing platforms (fix #3) for the structural advantage. Teams that reduce test maintenance with AI report spending 80% less time on test upkeep and 3x more time on actual feature testing.

You do not have to tolerate a red pipeline. Fix it, or replace the tools causing it.
