Scraped data of 2.6 million Duolingo users released on hacking forum

The scraped data of 2.6 million Duolingo users was leaked on a hacking forum, allowing threat actors to conduct targeted phishing attacks using the exposed information.

Duolingo is one of the largest language learning sites in the world, with over 74 million monthly users worldwide.

In January 2023, the scraped data of 2.6 million Duolingo users was offered for sale on the now-shutdown Breached hacking forum for $1,500.

The data mixes public information, such as login names and real names, with non-public information, including email addresses and internal information related to the Duolingo service.

While the real name and login name are publicly available as part of a user’s Duolingo profile, the email addresses are more concerning as they allow this public data to be used in attacks.

Scraped Duolingo data for sale on a hacking forum
Source: Falcon Feeds

When the data was for sale, Duolingo confirmed to The Record that it was scraped from public profile information and that the company was investigating whether further precautions should be taken.

However, Duolingo did not address the fact that the data also listed email addresses, which are not public information.

As first spotted by VX-Underground, the scraped dataset of 2.6 million users was released yesterday on a new version of the Breached hacking forum for 8 site credits, worth only $2.13.

“Today I have uploaded the Duolingo Scrape for you to download, thanks for reading and enjoy!,” reads a post on the hacking forum.

Duolingo scraped data leaked essentially for free
Source: BleepingComputer

This data was scraped using an exposed application programming interface (API) that has been shared openly since at least March 2023, with researchers tweeting and publicly documenting how to use the API.

The API allows anyone to submit a username and retrieve JSON output containing the user’s public profile information. However, it is also possible to feed an email address into the API and confirm whether it is associated with a valid Duolingo account.

BleepingComputer has confirmed that this API is still openly available to anyone on the web, even after its abuse was reported to Duolingo in January.

This allowed the scraper to feed millions of email addresses, likely exposed in previous data breaches, into the API and confirm whether they belonged to Duolingo accounts. The matching email addresses were then combined with the returned profile data to create the dataset containing public and non-public information.
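
The difference between the two kinds of lookup can be illustrated with a toy, offline model. The sketch below is not Duolingo's API: the in-memory PROFILES store, the field names, and the functions lookup_by_username, lookup_by_email, and link_emails are all hypothetical, and the example addresses are made up. It only shows why a lookup keyed by email turns a list of candidate addresses into confirmed links to profiles, even when the response itself contains nothing but public fields.

```python
# Hypothetical in-memory model for illustration only.
# This is NOT Duolingo's real API, schema, or data.
PROFILES = {
    "alice": {"username": "alice", "name": "Alice Example", "email": "alice@example.com"},
    "bob": {"username": "bob", "name": "Bob Example", "email": "bob@example.com"},
}

# Fields assumed to be visible on a public profile page.
PUBLIC_FIELDS = ("username", "name")


def lookup_by_username(username):
    """Return only the public fields for a username, or None if unknown."""
    profile = PROFILES.get(username)
    if profile is None:
        return None
    return {field: profile[field] for field in PUBLIC_FIELDS}


def lookup_by_email(email):
    """Email-keyed lookup: any match confirms the address has an account."""
    for profile in PROFILES.values():
        if profile["email"] == email:
            return {field: profile[field] for field in PUBLIC_FIELDS}
    return None


def link_emails(candidate_emails):
    """Pair each matching candidate email with the public profile it returned."""
    linked = []
    for email in candidate_emails:
        public = lookup_by_email(email)
        if public is not None:
            # The response held only public fields, but the caller already
            # knows the email, so the combined record is no longer public.
            linked.append({"email": email, **public})
    return linked
```

In this model, lookup_by_username never reveals an email address, while link_emails produces records that join a known address to a profile. A lookup that does not accept email addresses as a key, or that does not reveal whether an address matched, would not allow that join.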

Another threat actor shared their own API scrape, pointing out that threat actors wishing to use the data in phishing attacks should pay attention to specific fields that indicate a Duolingo user has more permissions than a regular user and is thus a more valuable target.

BleepingComputer has contacted Duolingo with questions on why the API is still publicly available but had not received a reply at the time of publication.

Scraped data regularly dismissed

Companies tend to dismiss scraped data as a non-issue because most of the data is already public, even if it is not necessarily easy to compile.

However, when public data is mixed with private data, such as phone numbers and email addresses, the exposed information becomes riskier and its exposure may violate data protection laws.

For example, in 2021, Facebook suffered a massive leak after an “Add Friend” API bug was abused to link phone numbers to Facebook accounts for 533 million users. The Irish Data Protection Commission (DPC) later fined Facebook €265 million ($275.5 million) for this leak of scraped data.

More recently, a Twitter API bug was used to scrape the public data and email addresses of millions of users, leading to an investigation by the DPC.

Source: https://www.bleepingcomputer.com/news/security/scraped-data-of-26-million-duolingo-users-released-on-hacking-forum/
