When users of Duolingo, the wildly popular language-learning app, neglect their Russian grammar or Korean vocabulary lessons for too long, Duo the Owl cries. The app's pudgy, animated feathered mascot shows up in a user's inbox, weeping, hoping to get them to return to the app.
What users don't see, however, is the work Duolingo put into those tears. Duolingo embraces constant self-reinvention through small improvements, all driven by data. Since its founding, Duolingo has run over 3,000 A/B tests, gauging user response to app features large and small. It might run as many as 200 at any one time. All those experiments produce an avalanche of data that guides the company's decisions.
And yes — as it rolled out those emails in 2018, Duolingo ran A/B tests to find out just how big Duo’s tears should be in order to get learners to return to their lessons.
While app development teams commonly use A/B testing to improve monetization or retain users, Duolingo also runs its experiments to figure out the best ways for people to learn; in the US, where Duolingo is based, more people study languages via the app than are currently learning languages in the public school system.
Duolingo team members believe they are improving their app by 1 percent each week. To do that, they encourage experimentation, analysis, and rigorous use of data to shape decisions.
Of course, all this plays out in ways unique to Duolingo — but the company's strategic approach can help developers with completely different missions. Duolingo's testing process can be broken down into three steps, all aimed at weaving experimentation into a company's development philosophy and practice.
Step 1: Foster a culture
Duolingo credits its experimental culture for boosting the app from 13% D1 retention when it first launched in 2011 to 55% D1 retention today.
Of course, internal motivation helps fuel that culture — Duolingo’s team embraces the company’s mission, which shows up in one of 10 core operating principles as: “Learners first. Our mission and entire reason for existing is to make sure everyone in the world has access to high-quality language education”. They really want to know how the app can help people learn.
But they also have another principle, “Test everything”, which urges colleagues to make informed decisions based on data. In other words, use data to keep your intuition in check.
One of Duolingo’s standard exercises asks users to translate sentences from their native language into the language they’re studying by tapping words in a jumbled collection of options.
Intuitively, it seems like a capitalized word among those options would stand out as a bug, since the capital letter reveals which word goes first in the translated sentence.
However, when Duolingo tested versions of the exercise with no capitalized words, retention dropped enough that removing the hint wasn't worth it. In other words, the experiment found that Duolingo's users preferred a little help along the way.
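Deciding whether a retention difference like this is real, rather than noise, usually comes down to a standard significance test. Here is a minimal sketch using a two-proportion z-test; the retention figures and sample sizes are hypothetical, not Duolingo's actual numbers:

```python
import math

def two_proportion_z(retained_a, n_a, retained_b, n_b):
    """Two-proportion z-test: is retention in variant B significantly
    different from control A? Returns the z statistic."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    p_pool = (retained_a + retained_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: the control keeps the capitalization hint,
# the variant removes it, and D1 retention dips from 55% to 53%.
z = two_proportion_z(5500, 10000, 5300, 10000)
print(round(z, 2))  # → -2.84, |z| > 1.96, significant at the 5% level
```

With samples this large, even a two-point retention dip clears the conventional significance bar, which is exactly the kind of evidence that would justify keeping the hint.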
The takeaway: even if something seems like it should be true, you don’t know until you test it. And to make those tests really matter, you have to build them right. Which brings us to our next step.
Step 2: Hone the process
Another Duolingo operating principle calls for employees to “Prioritize ruthlessly.” Even as the company runs its hundreds of tests, it works to keep its focus on the ideas with the highest ROI.
They “make their time matter,” as they say internally, by doing these two things:
Starting with a one-pager
Every Duolingo A/B test starts as a one-page proposal. Written to a concise, clear template, this one-pager outlines the key information needed to evaluate the potential experiment. To get the green light, it must successfully argue that there's a problem to be solved, and make the business and user-experience case for tackling it. A/B testing can require significant resources, so this document ensures buy-in from the right people before anything goes forward; it also makes the experiment-design process accessible across the team.
Respecting your process
The company sticks to a defined set of steps for every A/B test, keeping the process grounded and the results transparent.
Even when experiments result in apparent wins, the process is followed all the way through before any changes are adopted. Duolingo uses guardrail metrics to make sure changes aren’t implemented too fast, without a clear understanding of their implications.
For example, the company recently tested a download button that let users save lessons for offline study, a Duolingo Plus feature. When users tapped the button, they were brought to a screen offering Duolingo Plus.
The experimental user cohort had 20 percent higher conversion compared to the control group. But the experiment also resulted in a 1 percent drop in DAUs (daily active users). Further investigation revealed that some users were confused about what the download buttons meant, so the change was not implemented.
In this analysis, a short-term gain wasn't worth the cost to long-term, sustainable engagement and retention. Without Duolingo's full process, the company might not have reached that conclusion. Decisions like these are tough, but the company has found that it can usually find ways to roll out app changes without damaging critical metrics.
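A guardrail check of the kind described above can be sketched as a simple ship/no-ship rule: the primary metric must improve, and no guardrail metric may drop past a tolerated threshold. The metric names, thresholds, and numbers below are illustrative, not Duolingo's actual ones:

```python
GUARDRAILS = {
    # guardrail metric: maximum tolerated relative drop vs. control
    "daily_active_users": 0.005,  # tolerate at most a 0.5% DAU drop
    "d1_retention": 0.005,
}

def ship_decision(primary_lift, control, treatment):
    """Ship only if the primary metric improved AND no guardrail
    metric dropped by more than its tolerated threshold."""
    if primary_lift <= 0:
        return "no ship: primary metric did not improve"
    for metric, max_drop in GUARDRAILS.items():
        rel_change = (treatment[metric] - control[metric]) / control[metric]
        if rel_change < -max_drop:
            return f"no ship: guardrail '{metric}' dropped {-rel_change:.1%}"
    return "ship"

# The download-button experiment from the text: +20% conversion,
# but a 1% DAU drop trips the guardrail.
control = {"daily_active_users": 100_000, "d1_retention": 0.55}
treatment = {"daily_active_users": 99_000, "d1_retention": 0.55}
print(ship_decision(0.20, control, treatment))
# → no ship: guardrail 'daily_active_users' dropped 1.0%
```

The point of encoding the rule is that a flashy primary-metric win can never short-circuit the check: every guardrail is evaluated on every experiment, automatically.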
The takeaway: no matter how different your experiments may be, follow the same framework — and always scrutinize results with your long-term objective in mind.
Step 3: Get everyone involved
At Duolingo, a few key practices show how far the company goes to embody its ethic of testing everything in order to draw the right conclusions.
Everyone an A/B tester
Duolingo trains everyone in the organization to be fluent in running A/B tests. Everyone!
In the company’s early days, only two team members could set up experiments, which caused bottlenecks. In September 2017, Duolingo launched an online course, developed internally, teaching employees how to create experiments, then followed up with a quiz.
The company’s approach to experimentation balances process and autonomy. After the training, anyone — from PMs to junior engineers — can dive into the process, starting with the one-pager, to propose an experiment.
The number of A/B tests has since skyrocketed.
Building better tests
Duolingo has done a lot to make sure its experimentation makes a difference. That starts with designing experiments carefully, so the results provide real answers to precise questions.
Here’s an under-appreciated truth: a lot of A/B testing, as traditionally done, doesn’t actually lead to improvements. If you don’t set up your experimental groups correctly, your results won’t mean what you think they mean.
Imagine you set up an A/B test in which 50 percent of users across your app form a control group, while 50 percent of users see the experimental feature you're investigating. The problem with such broad groups is that some users in the experimental cohort might never encounter the feature you're testing in their normal use of the app.
For instance, Duolingo language learners advance through their courses in stages of progressively harder content — just like any student of anything. It wouldn’t make sense for Duolingo to test an advanced course feature on a group that included beginners — their behavior would skew results.
To avoid that pitfall, Duolingo has evolved its practice and now embraces counterfactual A/B testing. The app’s teams develop cohorts of users defined by specific criteria, and design experiments to test the behavior of cohorts relevant to the particular question the A/B test looks to answer.
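The cohort-gated assignment described above can be sketched in a few lines: only users who meet the cohort criteria enter the experiment at all, and everyone eligible is bucketed deterministically so assignments stay stable between sessions. The eligibility rule and field names below are hypothetical, not Duolingo's actual implementation:

```python
import hashlib

def bucket(user_id, experiment, variants=("control", "treatment")):
    """Deterministically assign a user to a variant by hashing
    user id + experiment name, so assignment is stable over time."""
    h = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return variants[h % len(variants)]

def assign(user, experiment, eligible):
    """Cohort-gated assignment: users outside the cohort are excluded
    entirely, so they can't dilute the experiment's results."""
    if not eligible(user):
        return None  # not in the experiment at all
    return bucket(user["id"], experiment)

# Hypothetical cohort: users who have opened Stories at least once.
stories_reader = lambda u: u.get("stories_sessions", 0) > 0

print(assign({"id": "u1", "stories_sessions": 3}, "stories_test", stories_reader))
print(assign({"id": "u2"}, "stories_test", stories_reader))  # → None
```

Compared with a blanket 50/50 split, gating on eligibility means every user counted in either arm could actually have encountered the feature, which is the whole point of the cohort-based design.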
As an example, the app recently launched a feature called Stories, which helps learners build reading and comprehension skills through short stories in several languages. By identifying a cohort that engages with Stories — generally speaking, more “serious” and dedicated users — Duolingo can design experiments to specifically test features with that group.
This more nuanced approach can often mean smaller experiments that take longer to design, implement, and return results. But with more fine-tuned results in hand, confidence in findings grows. Duolingo has also developed a host of analytics tools that make seeing and understanding those results easy for team members. Analyzing experiment results is now an embedded part of life at Duolingo.
Talk it over
Duolingo employees take pride in the tests that they set up. After an experiment’s results are in, it’s time to spread the word. Team members share A/B test results in all-hands meetings called Parliament (the name, of course, for a group of owls) and via an opt-in email list. They are encouraged to share findings directly with learners on Duolingo’s forums and to publish stories about the process and resulting changes in the public blog. Culturally, there’s no shame in a failed experiment, as it’s just another opportunity to learn.
Recognizing that they’re not alone in their quest to improve learning innovation, they even publish tools and data to advance the field.
The takeaway: Once the process is right, ensure everyone can be involved — in both shaping and sharing the tests that you run.
A final company operating principle states: “We haven’t won yet. Until every person who wants to learn a language is doing so with Duolingo, we need to keep innovating, pushing the envelope, and learning ways to get better”.
Duolingo maintains a pipeline of test ideas, big and small. Some A/B experiments test whether minor wording changes can move users from free to paid. Others have led to major redesign decisions that have boosted retention, like the creation of leaderboards and an overhaul of how users advance to harder courses.
Whether you're doing simple A/B testing, A/B/C/D testing, multivariate testing, or counterfactual A/B testing, the three precepts outlined in this article provide a framework to build a company that loves experimentation and backs all decisions with data:
- Foster a culture of relying on data to better serve your mission. Take small steps for big improvements.
- Hone the process to avoid spaghetti testing (whereby you just see what ideas stick). And then respect that process.
- Empower your full team to be involved, and invest in tools that let you scale your ability to fine-tune your approach to testing.
P.S. In mature and sophisticated testing systems like Duolingo's, you get data not only on first-order effects (e.g., an improvement in conversion rates from free to paid) but also on second- and third-order effects, like user retention, time on site, and average revenue per order. This helps prevent promoting seemingly winning tests that in reality degrade other parts of the user experience.