Alternative data in credit scoring

Recent innovations in alternative data have drawn growing attention, and lenders have increasingly integrated alternative data into their scoring models. The first to use alternative data, however, were not banks but fintech companies.

Banks and other financial institutions are now starting to catch up and are using alternative data to supplement their traditional data sources.

Fintech companies have used alternative data to offer new products and services, such as peer-to-peer lending, that were not possible with traditional data sources. The traditional banking sector, which relied on credit history, was unable to reach a large number of potential customers.

Alternative data as a proxy for traditional credit history

Alternative data usually serves as a proxy for traditional credit history data. However, it carries many risks precisely because it is soft proxy data rather than direct, hard credit history data.

One of the main problems is that alternative data may correlate with credit default without being causally related to it. The correlation might be only seasonal and not directly connected with repayment behavior. In other words, the correlation may be purely accidental.

Additionally, when alternative data is used in scoring, it might open up new possibilities for discrimination. If, for example, alternative data sources include social media data, this data might contain ethnic, gender, or other types of information that could lead to discriminatory lending practices.
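
One simple way to look for such effects is to compare approval rates across groups. The sketch below is only an illustration: the group labels and decisions are made up, and the function names are hypothetical rather than taken from any library. A ratio well below 1 does not prove discrimination, but it is a signal that the attributes feeding the model deserve a closer look.

```python
from collections import defaultdict

def approval_rates(decisions):
    """Return the approval rate per group.

    `decisions` is an iterable of (group_label, approved) pairs, where
    `approved` is True or False. The group labels are hypothetical.
    """
    totals = defaultdict(int)
    approved_counts = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        if approved:
            approved_counts[group] += 1
    return {g: approved_counts[g] / totals[g] for g in totals}

def adverse_impact_ratio(decisions):
    """Lowest group approval rate divided by the highest one."""
    rates = approval_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group, so there is no gap to report
    return min(rates.values()) / highest

# Made-up decisions: group A approved 8 of 10, group B approved 5 of 10.
sample = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 5 + [("B", False)] * 5
)
ratio = adverse_impact_ratio(sample)
print(f"adverse impact ratio: {ratio:.3f}")
```

In practice the protected attribute is often not collected at all, so a check like this may only be possible on a separate test sample.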

Another challenge with using alternative data is that it is often unstructured and not easy to use in a traditional credit scoring model. Therefore, special data processing techniques need to be used, and this often requires the use of artificial intelligence (AI) and machine learning (ML) methods.

The main advantages and disadvantages of using alternative data in credit scoring are summarized below.

Advantages:

  • Alternative data can help to reach a larger number of potential borrowers, including those without a traditional credit history.
  • Alternative data can provide a more holistic picture of a borrower's financial situation and can therefore help to make better lending decisions.
  • The use of alternative data can help to reduce the cost of credit scoring, as traditional data sources can be expensive to obtain and process.

Disadvantages:

  • As alternative data is often unstructured and not easily integrated into traditional credit scoring models, special data processing techniques are required, which can be costly.
  • There is a risk that alternative data might be used to discriminate against certain groups of people.
  • There is a risk that the correlation between alternative data and credit default might be accidental and not really indicative of repayment behavior.

Building a credit scoring model with alternative data

Data attributes drawn from alternative data should be subjected to the same scrutiny as the attributes of a traditional credit scoring model, or to a higher one.

This means that data attributes should be:

  • Related to the creditworthiness of the borrower: The data attributes should be able to predict the probability of default.
  • Available for a large number of borrowers: The data attributes should be available for a large number of borrowers in order to be able to train a robust predictive model.
  • Easy to obtain and process: The data attributes should be easy to obtain and process in order to keep the cost of credit scoring down.
  • Stable in time: The data attributes should be stable over time, in order to avoid building a model that quickly becomes outdated. The correlation between default and an attribute value should also be stable on an “out of time” sample.
  • Unbiased: The data attributes should not be biased in order to avoid discrimination.
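
The stability criterion above can be checked, for example, with the Population Stability Index (PSI), which compares how an attribute is distributed in the sample the model was built on and in a later sample. The following is a minimal sketch: the bins, the counts, and the attribute are invented for illustration, and the `eps` guard for empty bins is an arbitrary choice.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are lists of counts over the same bins. `eps` guards
    against log(0) for empty bins and is an arbitrary choice here.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both samples must use the same bins")
    total_e = sum(expected_counts)
    total_a = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        share_e = max(e / total_e, eps)
        share_a = max(a / total_a, eps)
        # Each term is non-negative: the difference and the log share a sign.
        value += (share_a - share_e) * math.log(share_a / share_e)
    return value

# Hypothetical bins of "on-time utility payments in the last 12 months".
development = [100, 300, 400, 200]  # sample the model was built on
recent = [150, 300, 350, 200]       # later, out-of-time sample
print(f"PSI = {psi(development, recent):.4f}")
```

A PSI of zero means the two distributions are identical across the bins. Larger values indicate a larger shift, and the threshold at which a shift is treated as a problem is a judgment call that varies between institutions.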

Examples of alternative data sources for credit scoring

Some examples of data attributes that could come from alternative data sources and that may fulfill the above criteria are:

  • Payment history on utility bills
  • Rent payment history
  • History of taking out short-term loans
  • Social media activity
  • Cell phone usage
  • Ecommerce transaction data
  • Ride-hailing history

Credit scoring methods

Once the data attributes have been selected, the next step is to build the credit scoring model. This can be done using a traditional approach, such as logistic regression, or using more advanced machine learning methods.
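
To make the traditional approach concrete, the sketch below shows how a fitted logistic regression turns attribute values into a probability of default. The attribute names, coefficients, and intercept are made up for illustration; a real model would estimate them from data.

```python
import math

def probability_of_default(features, coefficients, intercept):
    """Logistic regression: PD = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients, for illustration only.
coefficients = {
    "late_utility_payments_12m": 0.45,   # more late payments, higher risk
    "months_of_rent_history": -0.03,     # longer rent history, lower risk
}
intercept = -2.0

applicant = {"late_utility_payments_12m": 2, "months_of_rent_history": 24}
pd_estimate = probability_of_default(applicant, coefficients, intercept)
print(f"estimated probability of default: {pd_estimate:.3f}")
```

Because each coefficient has a sign and a size that can be read directly, a model of this form is comparatively easy to explain, which is one reason it remains a common baseline.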

Methods used for credit scoring with alternative data include:

  • Random forest
  • Gradient boosting (XGBoost, LightGBM)
  • Neural networks
  • SVM
  • Logistic regression
  • Other regression models, such as multivariate adaptive regression splines (MARS)

The advantage of using machine learning is that it can automatically learn complex relationships between the data attributes and the probability of default, so less manual effort is needed to specify those relationships. The disadvantage is that the results of the credit scoring model can be more difficult to explain, and there is a risk of overfitting the model to the data. For that reason, each attribute entering the final model should be scrutinized.

Once the credit scoring model has been built, it needs to be validated in order to assess its predictive power and detect overfitting. This can be done using a traditional approach, such as cross-validation, or using more advanced methods, such as out-of-time validation, in which the model is tested on loans originated after the period used to build it.
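
An out-of-time check can be sketched as follows. The example assumes the model has already produced a score for each loan and only compares its discriminatory power, measured here by the area under the ROC curve (AUC), before and after a cutoff date. The records, the cutoff, and the helper names are invented for illustration.

```python
from datetime import date

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5).

    `labels` are 1 for default and 0 for repaid; a higher score means higher risk.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need both defaults and non-defaults")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def out_of_time_split(records, cutoff):
    """Split records into those originated before the cutoff and those on or after it."""
    in_time = [r for r in records if r["originated"] < cutoff]
    out_of_time = [r for r in records if r["originated"] >= cutoff]
    return in_time, out_of_time

# Made-up scored loans: origination date, model score, observed default flag.
records = [
    {"originated": date(2022, 3, 1), "score": 0.80, "default": 1},
    {"originated": date(2022, 5, 1), "score": 0.20, "default": 0},
    {"originated": date(2022, 8, 1), "score": 0.60, "default": 1},
    {"originated": date(2022, 9, 1), "score": 0.30, "default": 0},
    {"originated": date(2023, 2, 1), "score": 0.70, "default": 0},
    {"originated": date(2023, 4, 1), "score": 0.65, "default": 1},
    {"originated": date(2023, 6, 1), "score": 0.10, "default": 0},
    {"originated": date(2023, 7, 1), "score": 0.90, "default": 1},
]

in_time, out_of_time = out_of_time_split(records, cutoff=date(2023, 1, 1))
for name, sample in (("in-time", in_time), ("out-of-time", out_of_time)):
    value = auc([r["score"] for r in sample], [r["default"] for r in sample])
    print(f"{name} AUC: {value:.2f}")
```

A clear drop in AUC on the later sample suggests that the model, or some of its attributes, captured relationships that did not hold over time. The pairwise AUC used here is quadratic in the sample size, which is fine for a sketch but slow for large portfolios.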

Conclusion

Alternative data can be used to supplement traditional data sources in credit scoring. The use of alternative data can help to reach a larger number of potential borrowers, including those without a traditional credit history.

Alternative data can provide a more holistic picture of a borrower's financial situation and can therefore help to make better lending decisions.

The use of alternative data can also help to reduce the cost of credit scoring, as traditional data sources can be expensive to obtain and process.

However, there are also risks associated with the use of alternative data, such as the risk of discrimination and the risk that the correlation between alternative data and credit default might be accidental and not really indicative of repayment behavior.