Fraud-risk analysis functions are bound to reach an inflection point as a business scales: a turning point where reliance on global third-party data networks and outsourced fraud modeling sees diminishing returns as fraud incidents grow in complexity and volume.
In previous posts, we’ve covered how simplistic fraud models misclassify malicious behaviors as more variables are introduced into the risk calculus, and how more obscure but no less costly fraud typologies emerge as customer product complexity increases.
These and other factors drive an imperative for fraud teams to “get under the hood” of their risk models, unpack their data ingredients, and augment them with more refined kinds of behavioral data to achieve high levels of effectiveness.
In this post, we’ll zoom into a commonly adopted fraud mitigation tool (Google’s reCAPTCHA) and outline why it’s less than optimal for handling a common fraud type at scale: micro-deposit scams.
A primer on Google reCAPTCHA Enterprise
reCAPTCHA Enterprise is used by growing organizations as an entry point solution for dealing with fraud typologies that involve some degree of automation.
At its core, it aggregates statistics on user interactions (customer reviews, checkout, and sign-in/sign-up flows) across 5M+ sites, combining the “spammy” or “non-converting” labels applied by web admins with other indicators of automated behavior.
This data is compiled into a numerical score between 0.0 and 1.0 for each site interaction, where 0.0 indicates near-certain automation and 1.0 a likely legitimate human. Admins can then set score thresholds that trigger a CAPTCHA checkbox prompt on interactions its models deem “bad,” with the intent of never challenging legitimate human traffic.
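The threshold logic above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the cutoff values and the `decide_action` helper are hypothetical, and in a real integration the score would come back from the reCAPTCHA Enterprise assessment API rather than being passed in directly.

```python
# Illustrative sketch of acting on a 0.0-1.0 risk score server-side.
# Threshold values below are hypothetical, not Google's recommendations.

CHALLENGE_THRESHOLD = 0.5   # below this, show a CAPTCHA challenge
BLOCK_THRESHOLD = 0.1       # below this, block the interaction outright

def decide_action(score: float) -> str:
    """Map a score (0.0 = near-certain automation) to an action."""
    if score < BLOCK_THRESHOLD:
        return "block"
    if score < CHALLENGE_THRESHOLD:
        return "challenge"
    return "allow"

print(decide_action(0.05))  # block
print(decide_action(0.3))   # challenge
print(decide_action(0.9))   # allow
```

The point of the sketch is that everything interesting happens inside the score itself, which the integrating team cannot inspect.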
In practice, the context behind this score is limited; Google itself recommends careful evaluation against internally tracked metrics to understand how effective the scores are.
Taken from the reCAPTCHA FAQ:
How do I measure the quality of the scores reCAPTCHA Enterprise is returning?
Ultimately, it depends on your use case and desired results. Generally, we recommend that you use your own internal metrics you have about user behavior to determine if the score was accurate, such as:
- Did a user that reset their password and received a high score later report that their account was hijacked?
- Did a user that logged in with a low score proceed to spam others?
- Did a user that failed to log in and received a low score then proceed to try to log in to several different usernames?
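Google's suggestion above amounts to backtesting the scores against your own incident data. A minimal sketch of that validation, with invented sample data, is computing precision and recall of the score at a given threshold against internally confirmed fraud labels:

```python
# Sketch: validate reCAPTCHA-style scores against internally tracked
# outcomes. The (score, confirmed_fraud) pairs below are invented.

observations = [
    (0.1, True), (0.2, True), (0.9, False),
    (0.8, False), (0.3, False), (0.15, True),
]

def efficacy_at(threshold, obs):
    """Treat scores below `threshold` as flagged; compare to internal labels."""
    flagged = [(s, fraud) for s, fraud in obs if s < threshold]
    caught = sum(1 for _, fraud in flagged if fraud)
    total_fraud = sum(1 for _, fraud in obs if fraud)
    precision = caught / len(flagged) if flagged else 0.0
    recall = caught / total_fraud if total_fraud else 0.0
    return precision, recall

precision, recall = efficacy_at(0.5, observations)
print(precision, recall)  # 0.75 1.0
```

Sweeping the threshold over historical data is one practical way to pick a cutoff that matches your own tolerance for false positives.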
While using reCAPTCHA Enterprise can curb early signs of automation in a generalized way, there are drawbacks to relying on reCAPTCHA at scale, which we list below:
Limited ability to access the underlying data and customize models
Fraud teams have limited visibility into the specific context and data sets that result in an interaction’s numerical score.
For example, if an analyst wanted to see how device usage proficiency relates to fund transfer fraud, or how certain physical device interactions correlate with transfers deemed likely fraudulent, the score alone offers no way to answer those questions.
This black-box approach limits the analyst’s ability to understand the full behavioral context that drives the fraud in the first place; the only real mechanism for shaping the model is binary labeling.
Aggregate models misclassify targeted fraud typologies
Related to the first drawback of reCAPTCHA, the aggregated nature of Google’s opaque risk models means that targeted fraud typologies specific to an industry or app are often misclassified.
Any seasoned risk analyst understands how nuanced fraud contexts can be from a data perspective. Models trained on binary labels supplied by companies that do not share those nuances will tend to over-index on behavioral contexts irrelevant to a given team’s specific cases.
Misclassification of malicious “human” behaviors
While it is difficult to know exactly which features feed Google’s scoring, Google has indicated that it accounts for detected automation and for admin labels marking interactions as “good” or “bad”.
In more sophisticated fraud schemes, attackers have methods to bypass automation detection and stay under protective radars. When the reward outweighs the manual human effort a scheme requires, models built to broadly block automated traffic fall short in detection.
One particular example of this is a fraud trend called micro-deposit fraud, which we detail below.
Bypassing automation checks: micro-deposit fraud
When a customer opens a new bank or brokerage account, the institution or an intermediary will often issue a very small transfer to verify the ownership of an external account used for funding.
Attackers can abuse this mechanism by opening accounts at scale, thousands or tens of thousands of them, and sweeping the issued deposits into a separate account. Attackers can also use the same verification mechanism to validate target victim accounts via micro-deposits and, after obtaining confirmation, issue large transfers to accounts under their control.
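The economics of harvesting at scale are simple arithmetic. All figures in this back-of-the-envelope sketch are invented for illustration, not drawn from a real incident:

```python
# Back-of-the-envelope sketch of why micro-deposit harvesting scales.
# Every figure below is a hypothetical assumption.

accounts_opened = 10_000
deposits_per_verification = 2   # many institutions send two small deposits
avg_deposit = 0.45              # dollars; hypothetical average

harvested = accounts_opened * deposits_per_verification * avg_deposit
print(f"${harvested:,.2f}")  # $9,000.00
```

Tiny per-account amounts become worthwhile once account creation is cheap enough, which is exactly the condition automation (or organized manual labor) creates.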
These kinds of scams can be incredibly difficult to detect with things like reCAPTCHA. The transfer amounts are minimal and done over a prolonged period, and while thousands of accounts can seem large, they may pale in comparison to the total number of interactions profiled and labeled by Google’s models.
Detection becomes even harder when attackers have networks of people cycling through sign-up flows with factory-reset devices and stolen identities. And when targeting specific customer accounts for withdrawals, a single well-funded account may be all an attacker needs to make the scam worthwhile.
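One signal that does survive this evasion is structural: many freshly opened accounts funneling their harvested deposits to a small set of destinations. The following is a hypothetical heuristic sketch; the field names, sample transfers, and fan-in threshold are all assumptions for illustration:

```python
# Hypothetical heuristic: flag destination accounts that receive
# transfers from unusually many distinct source accounts.

FAN_IN_THRESHOLD = 3  # distinct sources per destination before flagging

# (source_account, destination_account) pairs from outbound transfers
transfers = [
    ("acct_001", "dest_A"), ("acct_002", "dest_A"),
    ("acct_003", "dest_A"), ("acct_004", "dest_B"),
]

def suspicious_destinations(transfers, threshold=FAN_IN_THRESHOLD):
    """Group transfers by destination and flag high fan-in."""
    sources_by_dest = {}
    for src, dest in transfers:
        sources_by_dest.setdefault(dest, set()).add(src)
    return [d for d, srcs in sources_by_dest.items() if len(srcs) >= threshold]

print(suspicious_destinations(transfers))  # ['dest_A']
```

A heuristic like this requires access to your own transfer data joined against account age, which is precisely the kind of internal context an aggregated score cannot see.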
Customized risk scoring catches what others miss
Gathering behavioral data, such as device proficiency, on freshly opened accounts lets an analyst ask why a brand-new user navigates the signup and transfer flow so quickly, despite the seemingly “human” nature of the interaction, when the average user takes far longer to complete the same flows.
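The comparison described above can be sketched as a simple outlier test against a population baseline. The sample timings and the choice of a z-score are invented for illustration:

```python
# Sketch: how far does a new user's flow completion time deviate
# from the population? Sample timings below are invented.

from statistics import mean, stdev

# Seconds taken by a sample of legitimate users to complete signup + transfer
population_times = [210, 185, 240, 300, 195, 260, 220, 275]

def z_score(observed: float, sample) -> float:
    """Standard deviations between an observed time and the sample mean."""
    return (observed - mean(sample)) / stdev(sample)

new_user_time = 35  # a fresh account finishing the flow in 35 seconds
z = z_score(new_user_time, population_times)
# A strongly negative z-score: far faster than typical human users,
# even though each interaction looked "human" in isolation.
```

Production systems would use richer features and models, but the principle is the same: the anomaly only appears once you can join behavioral timing data to your own account context.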
Without a nuanced ability to understand human behavior across fraud typologies, risk teams are beholden to black-box risk models that give them little input into shaping detection for their own use cases.
When risk teams have the ability to break down fraud models to their essential data components and rebuild them with additional behavioral data sets as they see fit, their capability to detect micro-deposit fraud and other less obvious scams is greatly expanded.