
Assessing and mitigating unfairness in credit models

By EY Canada

Multidisciplinary professional services organization

7 minute read 27 Oct. 2020

EY and Microsoft have worked together to explore how the Microsoft Fairlearn toolkit can help financial institutions embed fairness in their lending practices.

The importance of fairness in financial modeling

Financial services organizations play a central role in the financial wellbeing of individuals, communities and businesses. Every day, these organizations make decisions that impact people’s lives, such as approving loan applications, foreclosing on mortgages, or paying out life insurance claims.

In recent years, financial services organizations have begun to use artificial intelligence (AI) and machine learning (ML) in their decision-making processes. With the ability to ingest and analyse vast amounts of data, both structured and unstructured, AI algorithms have the potential to help financial services organizations better meet their customers’ needs.

As AI plays an increasingly prominent role in the financial services industry, organizations must use it responsibly and plan for unintended consequences. When an organization deploys an AI system, there’s the potential to inadvertently deny people services, amplify existing gender or racial biases, or violate laws such as the Canadian Human Rights Act.

To address these kinds of concerns, fairness must be explicitly prioritized throughout the AI development and deployment lifecycle. Transparency and fairness are the most important aspects of a sound and trustworthy AI/ML model.

Here we explore how the Microsoft Fairlearn toolkit can be used to assess and improve the fairness of an ML model for loan adjudication.

Contact us to learn more and to request a full copy of this whitepaper.

How is fairness defined?

Fairness metrics quantify the extent to which a model satisfies a given definition of fairness. Fairlearn covers several standard definitions of fairness for binary classification, as well as definitions that are appropriate for regression. These definitions either require parity in model performance (e.g., accuracy rate, error rate, precision, recall) or parity in selection rate (e.g., loan approval rate) between different groups defined in terms of a sensitive feature such as “gender” or “age.” We note that the sensitive feature need not be used as an input to the model; it is only required when evaluating fairness metrics.

The following three definitions are widely used in classification settings where being classified as “positive” results in the allocation of an opportunity or resource (e.g., a loan) and having a positive label in a dataset means that the corresponding individual is “qualified” (a short code sketch follows this list):

  • Demographic parity (independence): Individuals within each group should be classified as positive at equal rates. Equivalently, the selection rates across all groups should be equal.
  • Equal opportunity (a relaxation of separation): The qualified individuals in each group should be classified as positive at equal rates. Equivalently, the true-positive rates across all groups should be equal.
  • Equalized odds (separation): The qualified individuals within each group should be classified as positive at equal rates; the unqualified individuals within each group should also be classified as positive at equal rates. Equivalently, the true-positive rates across all groups should be equal and the false-positive rates across all groups should be equal.
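
As a concrete illustration, here is a minimal sketch of how these group-level metrics and parity gaps might be computed with Fairlearn’s metrics module; the toy labels, predictions and “gender” values below are invented purely for illustration.

  # A minimal sketch, assuming a recent Fairlearn release; the data below is
  # synthetic and purely illustrative.
  import numpy as np
  from fairlearn.metrics import (
      MetricFrame,
      selection_rate,
      true_positive_rate,
      false_positive_rate,
      demographic_parity_difference,
      equalized_odds_difference,
  )

  y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])    # "qualified" labels
  y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # model decisions
  gender = np.array(["female", "male"] * 4)      # sensitive feature

  # Per-group view of selection rate, true-positive rate and false-positive rate
  mf = MetricFrame(
      metrics={
          "selection_rate": selection_rate,
          "true_positive_rate": true_positive_rate,
          "false_positive_rate": false_positive_rate,
      },
      y_true=y_true,
      y_pred=y_pred,
      sensitive_features=gender,
  )
  print(mf.by_group)        # metric values for each gender group
  print(mf.difference())    # largest between-group gap for each metric

  # Scalar summaries corresponding to the parity definitions above
  print(demographic_parity_difference(y_true, y_pred, sensitive_features=gender))
  print(equalized_odds_difference(y_true, y_pred, sensitive_features=gender))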

Fairlearn

Fairlearn is an open-source Python toolkit for assessing and improving the fairness of AI systems. The design of Fairlearn reflects the understanding that there is no single definition of fairness and that prioritizing fairness in AI often means making trade-offs based on competing priorities. Fairlearn enables data scientists and developers to select an appropriate fairness definition, to navigate trade-offs between fairness and model performance, and to select an unfairness mitigation algorithm that fits their needs.

Fairlearn focuses on fairness-related harms that affect groups of people, such as those defined in terms of race, gender, age, or disability status. Fairlearn supports a wide range of fairness definitions for assessing a model’s effects on groups of people, covering both classification and regression tasks. These fairness definitions can be evaluated using an interactive visualization dashboard, which also helps with navigating trade-offs between fairness and model performance. Besides the assessment component, Fairlearn also provides unfairness mitigation algorithms appropriate for a wide range of contexts.

Fairlearn offers an interactive visualization dashboard that can help users assess which groups of people might be negatively impacted by a model and compare multiple models in terms of their fairness and performance.

When setting up the dashboard for fairness assessment, the user selects two items:

  • The sensitive feature (e.g., gender or age) that will be used to assess the fairness of one or multiple models
  • The performance metric (e.g., accuracy rate) that will be used to assess model performance

These selections are then used to generate visualizations of a model’s impacts on groups defined in terms of the sensitive feature (e.g., accuracy rates for “female” and “male” as defined in terms of the gender feature). The dashboard also allows the user to compare the fairness and performance of multiple models, enabling them to navigate trade-offs and find a model that fits their needs.
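
As a rough sketch of this setup, the snippet below launches the dashboard from a notebook for two candidate models. It assumes the FairlearnDashboard widget that shipped in fairlearn.widget at the time of writing (later releases moved it to the separate raiwidgets package), and all data shown is synthetic.

  # A minimal sketch, assuming a Jupyter environment and the dashboard widget
  # shipped in fairlearn.widget (later moved to the raiwidgets package).
  import numpy as np
  from fairlearn.widget import FairlearnDashboard

  y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
  gender = np.array(["female", "male"] * 4)     # the selected sensitive feature
  preds_a = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # predictions from model A
  preds_b = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # predictions from model B

  FairlearnDashboard(
      sensitive_features=gender,
      sensitive_feature_names=["gender"],
      y_true=y_true,
      # one entry per model to compare; keys become display names in the dashboard
      y_pred={"model_A": preds_a, "model_B": preds_b},
  )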

Unfairness mitigation algorithms

Fairlearn includes two types of unfairness mitigation algorithms — postprocessing algorithms and reduction algorithms — to help users improve the fairness of their AI systems. Both types operate as “wrappers” around any standard classification or regression algorithm.

Fairlearn’s postprocessing algorithms take an already-trained model and transform its predictions so that they satisfy the constraints implied by the selected fairness metric (e.g., demographic parity) while optimizing model performance (e.g., accuracy rate). There is no need to retrain the model.

For example, given a model that predicts the probability that an applicant will default on a loan, a postprocessing algorithm will try to find a threshold above which the applicant should get a loan. This threshold typically needs to be different for each group of people (defined in terms of the selected sensitive feature). This limits the scope of postprocessing algorithms: sensitive features may not be available at deployment time, may be inappropriate to use, or, in some domains, may be prohibited by law.
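
Below is a minimal sketch of this postprocessing workflow using Fairlearn’s ThresholdOptimizer; the logistic-regression base model and synthetic data are illustrative assumptions, not the model from the case study.

  # A minimal sketch of postprocessing with ThresholdOptimizer; the base model
  # and data are synthetic placeholders.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from fairlearn.postprocessing import ThresholdOptimizer

  rng = np.random.default_rng(0)
  X = rng.random((200, 3))                            # toy features
  y = (X[:, 0] + rng.random(200) > 1.0).astype(int)   # toy default labels
  gender = rng.choice(["female", "male"], size=200)   # sensitive feature

  base_model = LogisticRegression().fit(X, y)         # already-trained model

  postprocessor = ThresholdOptimizer(
      estimator=base_model,
      constraints="demographic_parity",   # or "equalized_odds"
      prefit=True,                        # reuse the trained model; no retraining
  )
  postprocessor.fit(X, y, sensitive_features=gender)

  # Group-specific thresholds mean the sensitive feature is needed at predict time
  y_adjusted = postprocessor.predict(X, sensitive_features=gender, random_state=0)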

Fairlearn’s reduction algorithms iteratively re-weight the training data points and retrain the wrapped model after each re-weighting. After 10 to 20 iterations, this process results in a model that satisfies the constraints implied by the selected fairness metric while optimizing model performance.

Reduction algorithms do not need access to sensitive features at deployment time, and they work with many different fairness metrics. These algorithms also allow for training multiple models that make different trade-offs between fairness and model performance, which users can compare using Fairlearn’s interactive visualization dashboard.
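
Here is a minimal sketch of the reduction approach with Fairlearn’s ExponentiatedGradient and GridSearch; the logistic-regression base learner and synthetic data are again illustrative assumptions only.

  # A minimal sketch of the reduction approach; base learner and data are
  # synthetic placeholders.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from fairlearn.reductions import ExponentiatedGradient, GridSearch, DemographicParity

  rng = np.random.default_rng(1)
  X = rng.random((300, 3))
  y = (X[:, 0] + 0.5 * rng.random(300) > 0.8).astype(int)
  gender = rng.choice(["female", "male"], size=300)

  # Iteratively re-weights the data and retrains the base learner until the
  # demographic-parity constraint is (approximately) satisfied.
  mitigator = ExponentiatedGradient(
      estimator=LogisticRegression(),
      constraints=DemographicParity(),
  )
  mitigator.fit(X, y, sensitive_features=gender)
  y_mitigated = mitigator.predict(X)   # no sensitive feature needed at predict time

  # GridSearch instead trains a family of models that trade off fairness against
  # performance; the fitted candidates can be compared in the dashboard.
  sweep = GridSearch(
      estimator=LogisticRegression(),
      constraints=DemographicParity(),
      grid_size=10,
  )
  sweep.fit(X, y, sensitive_features=gender)
  candidate_models = sweep.predictors_   # one fitted model per trade-off point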

Case study: Fairer loan adjudication models

When making the decision to approve or decline a loan, financial institutions gather data from the applicant, data from third parties and internal data to assess the applicant’s creditworthiness. Several models contribute to the decision, including the prediction of the borrower’s probability of default. This is usually formulated as the task of predicting the likelihood that the borrower will fall behind on payments by more than 90 days at some point in the coming year, given the previous experience the bank has had with defaulting and non-defaulting customers.

In our analysis, we used a dataset of loan applications to illustrate how an ML model fitted with a standard machine-learning algorithm (specifically, LightGBM) can lead to unfairness that affects groups defined in terms of the sensitive feature “gender,” even though “gender” is not used as an input to the model. We then show how the Fairlearn toolkit can be used to assess and mitigate this unfairness. This illustrates that even when a model has no direct access to a sensitive attribute, fairness issues can still arise because other, non-sensitive features act as proxies, leaking information about the removed sensitive attribute into the model lifecycle.

In this case study, both classes of Fairlearn’s mitigation algorithms (reduction and postprocessing) successfully mitigate the observed unfairness across gender subgroups with little effect on overall model performance.
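
To make the proxy effect concrete, here is a small, self-contained sketch on synthetic data: “gender” is excluded from the model’s inputs, yet a correlated proxy feature still produces different approval rates by gender. The feature names and data are invented for illustration and are not drawn from the case-study dataset.

  # A minimal synthetic illustration of proxy leakage; not the case-study data.
  import numpy as np
  from lightgbm import LGBMClassifier
  from fairlearn.metrics import MetricFrame, selection_rate

  rng = np.random.default_rng(0)
  n = 2000
  gender = rng.choice(["female", "male"], size=n)
  proxy = (gender == "male") * 1.0 + rng.normal(scale=0.5, size=n)  # correlates with gender
  income = rng.normal(loc=50, scale=10, size=n)
  X = np.column_stack([proxy, income])   # note: gender itself is NOT a feature
  y = (income + 10 * (gender == "male") + rng.normal(scale=5, size=n) > 55).astype(int)

  model = LGBMClassifier().fit(X, y)
  approved = model.predict(X)

  # Approval (selection) rates still differ by gender, via the proxy feature
  print(MetricFrame(metrics=selection_rate, y_true=y, y_pred=approved,
                    sensitive_features=gender).by_group)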

Conclusion

Although we focused on the use of AI in a loan adjudication scenario, AI systems are increasingly used throughout the entire credit lifecycle. After an applicant is approved for a loan, AI systems are often used to determine the loan amount and the interest rate that are offered to the applicant. During the remaining life of a credit product, a financial services organization can also use AI systems for account management based on the account behaviour, credit bureau information, and other customer information. AI systems can support pre-approval of credit-card limit increases or assist with a collection strategy when a customer exhibits difficulty with repaying.

As financial services organizations use AI systems more extensively, prioritizing fairness in AI is fundamental to their success.

Although this collaboration between Microsoft and EY focuses on the use of a software toolkit, we emphasize that fairness in AI is a sociotechnical challenge, so no software toolkit will “solve” fairness in all AI systems. That is not to say that software toolkits cannot play a role in developing fairer AI systems — simply that they need to be precise and targeted, embedded within a holistic risk management framework that considers AI systems’ broader societal context, and supplemented with additional resources and processes.

Contact us

Like what you’ve seen? Get in touch to learn more and to request a full copy of this whitepaper.

Summary

In recent years, financial services organizations have begun to use artificial intelligence (AI) and machine learning (ML) in their decision-making processes. Discover how EY and Microsoft believe the Microsoft Fairlearn toolkit can be used to assess and improve the fairness of a machine learning model for loan adjudication.
