Outliers and Credit Card Fraud Detection

Credit card fraud has become more rampant as data become digitally stored, but at the same time, credit card fraud prevention has also become more powerful—primarily because of effective outlier detection using Big Data processing techniques.

Although there are many strategies for credit card fraud detection, one of the simplest to explain is the use of outlier detection. In a nutshell, outlier detection is the automated process of discovering data points that do not match the patterns inherent in the data as a whole. For credit card companies, this means looking at a client’s spending habits, identifying patterns (e.g., through cluster analysis), and flagging spending that does not match the person’s habits or patterns.
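To make the idea concrete, here is a minimal sketch of one common statistical approach: Tukey's interquartile-range (IQR) fences. It is not necessarily what any card issuer uses. The function name `iqr_outliers`, the multiplier `k=1.5`, and the purchase amounts are all assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical purchase amounts in dollars (invented for illustration).
amounts = [12.50, 8.00, 23.10, 15.75, 9.99, 31.40, 18.20, 1250.00]

flagged = iqr_outliers(amounts)
print(flagged)  # [1250.0]
```

Only the $1,250 purchase falls outside the fences. The seven smaller amounts form the "pattern," and the large one does not match it.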

For instance, let’s think about the following scenario:

Tara is a Texas high school student who opened her first bank and credit card accounts one year ago, after receiving her first paycheck from Taco Cabana. Last week, Tara went online and purchased a Willie Nelson concert ticket at a ridiculously low price: $5! Wow! Almost too good to be true, and it was. The next day, a representative from Tara’s bank called to inform her that her credit card had been used to purchase gasoline, burgers, and a dirt bike, all in Minnesota. Terrified, Tara told the representative that the charges were bogus! The bank and credit card company then began to deal with the credit card fraud. Tara was not happy.

How did the bank/credit card company know that Tara did not make those purchases? Outlier detection. In this fictitious scenario, Tara had never spent more than $100 on her credit card in any single transaction, and none of her transactions had occurred outside the state of Texas. Either of these facts could have signaled to the company that the new charges were anomalies (outliers).
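The two signals in Tara's story can be sketched as simple rules built from her transaction history. This is an illustrative toy, not a description of how a real bank's system works. The `Transaction` class, the helper functions, and every dollar amount are hypothetical, because the scenario gives no prices for the Minnesota charges.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    description: str
    amount: float  # dollars
    state: str     # two-letter US state code


def build_profile(history):
    """Summarize past behavior: largest single charge and states seen so far."""
    max_amount = max(t.amount for t in history)
    known_states = {t.state for t in history}
    return max_amount, known_states


def flag_reasons(txn, max_amount, known_states):
    """Return the reasons (possibly none) a new charge departs from the profile."""
    reasons = []
    if txn.amount > max_amount:
        reasons.append("amount above previous maximum")
    if txn.state not in known_states:
        reasons.append("state not seen before")
    return reasons


# Hypothetical history: every charge is under $100 and made in Texas.
history = [
    Transaction("tacos", 9.50, "TX"),
    Transaction("school supplies", 42.00, "TX"),
    Transaction("concert ticket", 5.00, "TX"),
]
max_amount, known_states = build_profile(history)

# The Minnesota charges from the scenario; the dollar amounts are invented.
new_charges = [
    Transaction("gasoline", 38.00, "MN"),
    Transaction("burgers", 14.00, "MN"),
    Transaction("dirt bike", 900.00, "MN"),
]
for txn in new_charges:
    print(txn.description, flag_reasons(txn, max_amount, known_states))
```

With these made-up numbers, all three charges are flagged for the unfamiliar state, and the dirt bike is also flagged for its amount. A production system would weigh many more signals and would tolerate ordinary travel, but the principle of comparing new activity against an established pattern is the same.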

Detecting Outliers Visually

Using one of our big data analysis tools (Google’s Public Data Explorer), we can visually identify an anomaly in a dynamic visualization. From there, we can begin to hypothesize a cause for the anomaly.

Navigate to the Fertility Rate vs. Life Expectancy visualization in Google Public Data Explorer to investigate an anomaly. There are many unusual trends in the data visualization; however, let’s focus on one—the case of East Timor (a.k.a., Timor-Leste, identified in the figure below).

The visualization details data along five attributes:

  • Life Expectancy—How long individuals in this country live on average. This is indicated along the x-axis.
  • Fertility Rate—The average number of children born per woman in this country. This is indicated along the y-axis.
  • Nation—This is the label applied to each bubble. Click on the bubble to view it.
  • Region—The color of each bubble indicates the part of the world in which the country is located.
  • Population—The size of each bubble is proportional to the population of the country represented. This is calculated yearly.

Note that actual values for these data can be obtained by hovering over the data point you wish to inspect. The x- and y-axis values appear along their respective axes, and the population appears in the legend.

East Timor is obviously an outlier in the static image above. It is clearly set apart from the rest of the clustered data. As the graphic indicates, life expectancy was a strikingly low 32.9 years, though the fertility rate was about average in comparison to the rest of the world!

However, the dynamic visualization shows that the decline in life expectancy was drastic and took place during the 1970s. What happened in East Timor in the 1970s? The identification of East Timor as an anomaly suggests that some factor strongly affected its life expectancy during this time.

That factor, it turns out, was a civil war that broke out between East Timorese political parties, leading to an invasion and eventual occupation by Indonesia. The occupation was marked by extreme violence and brutality and resulted in more than 100,000 deaths, with approximately 20% of those resulting from killings and 80% from hunger and illness. This very high number of conflict-related deaths sharply skewed the mortality rate in the region during and immediately after the occupation, as can be seen in the visualization.

Although the outlier in the static graph for 1977 establishes East Timor as a troubled area, the narrative expressed through the visualization not only gives context, but also indicates that the formation of the outlier was itself an anomaly.

Outliers tend to be meaningful and tell stories about unique circumstances. Identifying and investigating outliers can make descriptive statistics more accurate, and the outliers themselves can tell a story of their own.
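As a rough illustration of the point about descriptive statistics, the sketch below compares the mean and median of a small data set with and without a single extreme value. The numbers are invented for the example, and the outlier is removed by hand rather than detected automatically.

```python
import statistics

# Hypothetical purchase amounts in dollars; 2000 is the known outlier.
amounts = [20, 25, 30, 35, 40, 2000]
without_outlier = [a for a in amounts if a != 2000]

mean_all = statistics.mean(amounts)
median_all = statistics.median(amounts)
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"with outlier:    mean={mean_all:.2f}, median={median_all}")
print(f"without outlier: mean={mean_without:.2f}, median={median_without}")
```

A single extreme value pulls the mean from 30 up to about 358, while the median barely moves (30 versus 32.5). Setting an outlier aside for separate investigation, instead of silently averaging it in, keeps the summary representative of typical behavior.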