“It is better that ten guilty persons escape than that one innocent party suffer.” Sir William Blackstone (1765), paraphrased.
Machines mess up. Humans even more so. The latter can be difficult, even impossible, to really understand. The former are a bit more straightforward. This short essay describes how we can understand some of the root causes of machine model errors, particularly as those errors relate to group bias and unfairness. It is elementary, really, as Jonny Lee Miller would say. Look at your model’s confusion matrix, defined by its false positives and false negatives as well as its true results. Then reflect on it, both overall and for the well-defined groups that exist within the sample population under study. I intend to point out (the obvious, maybe?) that the variations in each of the attributes fed into your machine learning model will determine the level of confusion that your model ultimately has towards individual groups within your larger population. That model confusion may cause group biases and unfair treatment of minority groups that are lost in the resolution of your data and chosen attributes.
Intelligent machines made in our image in our world.
We humans are cursed by an immense number of cognitive biases clouding our judgments and actions. Maybe we are also blessed by being, for most parts of life, largely ignorant of those same biases. We readily forgive our fellow humans’ mistakes, even grave ones. We frequently ignore, or are unaware of, our own mistakes. However, we hold machines to much stricter standards than our fellow humans. From machines we expect perfection. From humans? … well, the story is quite the opposite.
Algorithmic fairness, bias, explainability and ethical aspects of machine learning are hot and popular topics. Unfortunately, maybe more so in academia than elsewhere. But that is changing too. Experts, frequently academic scholars, are warning us that AI fairness is not guaranteed even as recommendations and policy outcomes are being produced by non-human means. We do not avoid biased decisions or unfair actions by replacing our wet biological carbon-based brains, subject to tons of cognitive biases, with another substrate for computation and decision making that is subjected to information coming from a fundamentally biased society. Far from it.
Bias and unfairness can be present (or introduced) at many stages of a machine learning process. Much of the data we use for our machine learning models reflects society’s good, bad and ugly sides. For example, data being used to train a given algorithmic model could be biased (or unfair) either because it reflects a fundamentally biased or unfair partition of the subject matter under study or because the data have become biased (intentionally or unintentionally) in the data preparation process. Most of us understand the concept of GiGo (i.e., “Garbage in Garbage out”). The quality of your model output, or computation, reflects the quality of your input. Unless corrected (often easier said than done), it is understandable that an outcome of a machine learning model may be biased or fundamentally unfair if the data input was flawed. Likewise, the machine learning architecture and model may also introduce (intentionally as well as unintentionally) biases or unfair results even if the original training data were unbiased and fair.
At this point, you should get a bit uneasy (or impatient). I haven’t really told you what I actually mean by bias or unfairness. While there are 42 (i.e., many, but 42 is the answer to many things unknown and known) definitions out there defining fairness (or bias), I will define it as “a systematic and significant difference in outcome of a given policy between distinct and statistically meaningful groups” (note that in case of in-group systematic bias it often means that there actually are distinct sub-groups within that main group). So, yes this is a challenge.
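To make the definition above a little more tangible, here is a minimal sketch of how one might check whether an outcome difference between two groups is statistically significant. It uses a standard two-proportion z-test. Reading “significant” as statistical significance is my assumption here, the function name and the counts are hypothetical, and a single test says nothing about whether a difference is systematic over time.

```python
import math

def two_proportion_z_test(pos_a, n_a, pos_b, n_b):
    """Return (rate difference, z statistic, two-sided p-value) for two groups.

    pos_* is the number of positive outcomes (e.g., approvals) and n_* the
    group size. Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = pos_a / n_a
    p_b = pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Hypothetical counts: approvals out of applications for group A and group B
diff, z, p = two_proportion_z_test(pos_a=620, n_a=1000, pos_b=540, n_b=1000)
print(f"difference={diff:.3f}, z={z:.2f}, p={p:.4f}")
```

A small p-value only tells you that the outcome rates differ more than chance would suggest. Whether that difference is unfair is a separate, context-dependent question.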
How “confused” is your learned machine model?
When I am exploring outcomes (or policy recommendations) of my machine learning models, I spend a fair amount of time trying to understand the nature of my false positives (i.e., predicted positive outcomes that should have been negative) as well as false negatives (i.e., predicted negative outcomes that should have been positive). My tool of choice is the so-called confusion matrix (see figure below), which summarizes your machine learning model’s performance in terms of its accuracy as well as its inability to predict outcomes. It is a simple construction. It is also very powerful.
The above figure provides a confusion matrix example of a loan policy subjected to machine learning. We have
- TRUE NEGATIVE (Light Blue color): Model suggests that the loan application should be rejected, consistent with the actual outcome of the loan being rejected. This outcome is a loss-mitigating measure, where the new business forgone should be weighed against the risk of default when providing a loan.
- FALSE POSITIVE (Yellow color): Model suggests that the loan application should be approved, in opposition to the actual outcome of the loan being rejected. Note that once this model is operational, this may lead to an increased risk of financial loss to the business offering loans that the applicant is likely to default on. It may also lead to a negative socio-economic impact on the individuals that are offered a loan they may not be able to pay back.
- FALSE NEGATIVE (Red color): Model suggests that the loan application should be rejected, in opposition to the actual outcome of the loan being accepted. Note that once this model is operational, this may lead to loss of business by rejecting a loan application that otherwise would have had a high likelihood of being paid back. It may also lead to a negative socio-economic impact on the individuals being rejected, due to lost opportunities for those individuals and their community.
- TRUE POSITIVE (Green color): Model suggests that the loan application should be approved consistent with the actual outcome of the loan being approved. This provides for new business opportunities and increased topline within an acceptable risk level.
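The four cells described above can be counted directly from pairs of actual and predicted outcomes. The following is a minimal sketch using only the Python standard library. The function names and the ten loan decisions are hypothetical and purely for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for binary labels (1 = loan approved)."""
    counts = Counter({"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1   # model approves, loan was approved
        elif p and not a:
            counts["FP"] += 1   # model approves, loan was rejected
        elif a:
            counts["FN"] += 1   # model rejects, loan was approved
        else:
            counts["TN"] += 1   # model rejects, loan was rejected
    return counts

def error_rates(counts):
    """False positive rate and false negative rate from the four counts."""
    negatives = counts["FP"] + counts["TN"]
    positives = counts["TP"] + counts["FN"]
    fpr = counts["FP"] / negatives if negatives else float("nan")
    fnr = counts["FN"] / positives if positives else float("nan")
    return fpr, fnr

# Hypothetical loan decisions: actual outcome vs. model prediction
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

counts = confusion_counts(actual, predicted)
fpr, fnr = error_rates(counts)
print(dict(counts), f"FPR={fpr:.2f}", f"FNR={fnr:.2f}")
```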
The confusion matrix can help identify the degree of bias or unfairness that your machine learning model introduces between groups (or segments) in your business processes and in your corporate decision-making.
The following example (below) illustrates how the confusion matrix varies with changes to a group’s attribute distributions, e.g., variance differences (or standard deviation), mean value differences, etc.
What is evident from the above illustration is that the policy outcome on a group basis is (very) sensitive to how the attributes’ distribution properties differ between those groups. Variations in the characteristics between groups can elicit biases that ultimately may lead to unfairness between groups but also within a defined group.
Thus, the confusion matrix leads us back to your chosen attributes (or features), their statistical distributions, and the quality of the data or measurements that make up those distributions. If your product or app or policy applies to many different groups, you had better understand whether those groups are treated the same, good or bad. Or … if you intend to differentiate between groups, you may want to be (reasonably) sure that no unintended harmful consequences will negatively expose your business model or policy.
A word of caution: even if the confusion matrix gives your model a “green light” for production, you cannot by default assume that its results will not lead to systematic group bias and, ultimately, unfairness against minority groups. Moreover, in real-world implementations, you are unlikely to completely free your machine models from errors that may lead to a certain degree of systematic bias and unfairness (however slight).
Indeterminism: learning attributes reflects our noisy & uncertain world.
So, let’s say that I have a particular policy outcome that I would like to check for bias (and possible unfairness) against certain defined groups (e.g., men & women). Let’s also assume that the intention with the given policy was to have a fair and unbiased outcome without group dependency (e.g., independence of race, gender, sexual orientation, etc.). The policy outcome is derived from a number of attributes (or features) deemed necessary but excludes obvious attributes that are thought likely to cause the policy to be systematically biased towards or against certain groups (e.g., women). In order for your machine model to perform well, it needs, in general, lots of relevant data (rather than Big Data). For each individual in your population (under study), you will gather data for the attributes deemed suitable for your model (and maybe some that you don’t think matter). Each feature can be represented by a statistical distribution reflecting the variation within the population or groups under study. It will often be the case that an attribute’s distribution appears fairly similar between different groups, either because it really is only slightly different between groups or because your data “sucks” (e.g., due to poor quality, too little data to resolve subtle differences, etc.).
If a policy is supposed to be unbiased, I should not be able to predict with any (statistical) confidence which group a policy taker belongs to, given the policy outcome and the attributes used to derive the policy. Or in other words, I should not be able to do better than what chance (or base rate) would dictate.
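The idea above can be tried out on synthetic data. The sketch below is an illustration under assumptions of my own: a single normally distributed attribute per individual, two equally sized groups A and B, and a deliberately simple nearest-group-mean classifier standing in for whatever model you would actually use. All names and numbers are hypothetical.

```python
import random

def simulate(mean_a, mean_b, n=4000, seed=0):
    """Draw one attribute value per individual for groups A and B."""
    rng = random.Random(seed)
    data = [(rng.gauss(mean_a, 1.0), "A") for _ in range(n)]
    data += [(rng.gauss(mean_b, 1.0), "B") for _ in range(n)]
    rng.shuffle(data)
    return data

def group_prediction_accuracy(data):
    """Predict group membership from the attribute; return (accuracy, base rate)."""
    half = len(data) // 2
    train, test = data[:half], data[half:]
    # "Learn" each group's mean attribute value from the training half
    mean = {}
    for g in ("A", "B"):
        values = [x for x, label in train if label == g]
        mean[g] = sum(values) / len(values)
    # Assign each test individual to the group with the closest mean
    correct = sum(
        1 for x, label in test
        if min(mean, key=lambda g: abs(x - mean[g])) == label
    )
    # Base rate: share of the largest group in the test half
    base_rate = max(sum(1 for _, l in test if l == g) for g in mean) / len(test)
    return correct / len(test), base_rate

acc_same, base_same = group_prediction_accuracy(simulate(0.0, 0.0))
acc_shift, base_shift = group_prediction_accuracy(simulate(0.0, 1.0))
print(f"identical distributions: accuracy={acc_same:.2f}, base rate={base_same:.2f}")
print(f"shifted mean:            accuracy={acc_shift:.2f}, base rate={base_shift:.2f}")
```

When the attribute is distributed identically in both groups, the accuracy should hover around the base rate. Once the means differ, group membership becomes predictable, which is a hint that the policy built on that attribute may treat the groups differently.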
For each attribute (or feature) deemed important for our machine learning model, we either have, or we collect, lots of data. Furthermore, for each of the considered attributes, we will have a distribution represented by a mean value and a variance (and higher-order moments of the distribution such as skewness, i.e., the asymmetry around the mean, and kurtosis, i.e., the shape of the distribution’s tails). Comparing two (or more) groups, we should be interested in how each attribute’s distribution compares between those groups. These differences or similarities will point towards why a machine model ends up biased against a group or groups, and will ultimately be a significant factor in why your machine model ends up being unfair.
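As a rough sketch of how these moments could be compared between groups, the snippet below computes the mean, variance, skewness and excess kurtosis of one attribute per group. It uses the population (biased) moment formulas for simplicity, which is a choice of mine, and the attribute values for the two groups are made up for illustration.

```python
def moments(values):
    """Mean, variance, skewness and excess kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,        # asymmetry around the mean
        "kurtosis": m4 / m2 ** 2 - 3.0,    # tail weight relative to a normal
    }

# Hypothetical values of one attribute (e.g., income in thousands) per group
group_a = [30, 32, 35, 36, 38, 40, 41, 45]
group_b = [20, 25, 31, 36, 40, 47, 55, 90]

stats_a = moments(group_a)
stats_b = moments(group_b)
for name in ("mean", "variance", "skewness", "kurtosis"):
    print(f"{name:>9}: A={stats_a[name]:8.2f}  B={stats_b[name]:8.2f}")
```

With real data you would want far more than eight values per group before trusting the higher-order moments, since they are very sensitive to a few extreme observations.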
Assume that we have a population consisting of two (main) groups that we are applying our new policy to (e.g., loans, life insurance, subsidies, etc.). If each attribute for both groups has statistically identical distributions, then … no surprise really … there should be no policy outcome difference between one or the other group. Even more so, unless there are attributes that are relevant for the policy outcome and have not been considered in the machine learning process, you should end up with an outcome that has (very) few false positives and negatives (i.e., the false positive & false negative rates are very low), with the remaining level determined by the variance of your attributes and the noise level of your measurements. Thus, we should not observe any difference between the two groups in the policy outcome, including the level of false positives and negatives.
From the above chart, it should be clear that I can machine learn a given policy outcome for different groups given a bunch of features or attributes. I can also “move” my class tags over to the left side and attempt to machine-learn (i.e., predict) my classes given the attributes that are supposed to make up that policy. It should be noted that if two different groups’ attributes only differ (per attribute) in their variances, it is not possible to reliably predict which class belongs to what policy outcome.
Re: Fairness. It is, in general, more difficult to judge whether a policy is fair or not than whether it is biased. One would need to look at differences between classes (or groups) as well as at in-class differentiation. For example, based on the confusion matrix, it might be unfair for members of a class (i.e., a sub-class) to end up in the false positive or false negative categories (i.e., in-group unfairness). Further along this line, one may also infer that if two different classes have substantially different false positive and false negative distributions, this might reflect between-class unfairness (i.e., one class is treated worse than another). Unfairness could also be reflected in how True outcomes are distributed between groups and maybe even within a given group. To be fair (pun intended), fairness is a much richer, context-dependent concept than a confusion matrix can capture (although the matrix will signal where attention should be given to unfairness).
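A between-class comparison of this kind can be sketched by computing the false positive and false negative rates separately per group. The records below are hypothetical, and the function name is mine.

```python
from collections import defaultdict

def rates_by_group(records):
    """records: iterable of (group, actual, predicted) with boolean labels.

    Returns the false positive rate (FPR) and false negative rate (FNR)
    for each group.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0})
    for group, actual, predicted in records:
        if predicted:
            key = "TP" if actual else "FP"
        else:
            key = "FN" if actual else "TN"
        counts[group][key] += 1
    result = {}
    for group, c in counts.items():
        negatives = c["FP"] + c["TN"]
        positives = c["TP"] + c["FN"]
        result[group] = {
            "FPR": c["FP"] / negatives if negatives else float("nan"),
            "FNR": c["FN"] / positives if positives else float("nan"),
        }
    return result

# Hypothetical outcomes: (group, actual, predicted)
records = [
    ("A", True, True), ("A", True, True), ("A", True, False),
    ("A", False, False), ("A", False, False), ("A", False, True),
    ("B", True, True), ("B", True, False), ("B", True, False),
    ("B", False, False), ("B", False, True), ("B", False, True),
]

by_group = rates_by_group(records)
for group, r in sorted(by_group.items()):
    print(f"group {group}: FPR={r['FPR']:.2f}, FNR={r['FNR']:.2f}")
```

A large gap between the groups’ rates is a signal to investigate further. It is not by itself a verdict on fairness.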
When two groups have statistically identical distributions for all attributes considered in the policy-making or machine learning model, I would also fail to predict group membership based on the policy outcome or the policy’s relevant attributes (i.e., sort of intuitively clear). I would be no better off than flipping a coin in identifying a group member based on features and policy. In other words, the two groups should be treated similarly within that policy (or you don’t have all the facts). This is also reflected by the confusion matrix for predicting group membership having approximately the same values in each position (i.e., if normalized, and with groups of similar size, it would be ca. 25% at each position).
As soon as an attribute’s (statistical) distribution starts to differ between different classes, the machine learning model is likely to result in a policy outcome difference between those classes. Often you will see that any statistically meaningful difference in just a few of the attributes that may define your policy will result in uniquely different policy outcomes and thus possibly identify bias and fairness issues. Conversely, it will also quickly allow a machine to learn a given class or group given those attribute differences and therefore allude to class differences in a given outcome.
Heuristics for group comparison
If the attribute distributions for different groups are statistically similar (per attribute) for a given policy outcome, your confusion matrix should be similar across any group within your chosen population under study, i.e., all groups are (treated) similar.
If attribute distributions for different groups are statistically similar (per attribute) and you observe a relatively large ratio of false positives or false negatives, you are likely missing significant attributes in your machine learning process.
If two groups have very different false positive and/or false negative ratios, you are either (1) missing descriptive attributes or (2) seeing a large difference in distribution variation (i.e., standard deviation) for at least some of your meaningful attributes. The latter may have to do with poor data quality in general, higher noise in the data, sub-groups within the group making that group a poor comparative representative, etc.
If one group’s attributes have larger variations (i.e., standard deviations) than the “competing” group, you are likely to see a higher than expected ratio of false positives or negatives for that group.
Just as you can machine learn a policy outcome for a particular group given its relevant attributes, you can also predict which group belongs to what policy outcome from its relevant attributes (assuming there is an outcome differentiation between them).
Don’t equate bias with unfairness or (mathematical) unbiasedness with fairness. There is much more to bias, fairness, and transparency than what a confusion matrix might be able to tell you. But it is the least you can do to get a basic level of understanding of how your model or policy performs.
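The heuristic about larger variations can be illustrated with a small simulation. This is a sketch under assumptions of my own, not a general proof: each individual has a true score, the policy’s actual outcome is whether that score exceeds a threshold, and the model only sees the score plus group-specific measurement noise. Treating the extra variation as measurement noise is my simplification, and all names and numbers are hypothetical.

```python
import random

def simulated_error_rates(noise_std, n=20000, threshold=0.0, seed=1):
    """Return (false positive rate, false negative rate) for one group."""
    rng = random.Random(seed)
    fp = fn = positives = negatives = 0
    for _ in range(n):
        true_score = rng.gauss(0.0, 1.0)
        observed = true_score + rng.gauss(0.0, noise_std)  # noisy attribute
        actual = true_score > threshold      # what the outcome should be
        predicted = observed > threshold     # what the model decides
        if actual:
            positives += 1
            fn += not predicted
        else:
            negatives += 1
            fp += predicted
    return fp / negatives, fn / positives

# Group with a cleanly measured attribute vs. group with a noisy one
fpr_low, fnr_low = simulated_error_rates(noise_std=0.2)
fpr_high, fnr_high = simulated_error_rates(noise_std=1.0)
print(f"low-noise group:  FPR={fpr_low:.2f}, FNR={fnr_low:.2f}")
print(f"high-noise group: FPR={fpr_high:.2f}, FNR={fnr_high:.2f}")
```

Both groups face exactly the same threshold, yet the group whose attribute is measured with more noise ends up with noticeably higher false positive and false negative rates.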
Machine … Why ain’t thee fair?
Understanding your attributes’ distributions and, in particular, their differences between your groups of interest will prepare you upfront for some of the obvious as well as the more subtle biases that may occur when you apply machine learning to complex policies or outcomes in general.
So to answer the question … “Machine … why ain’t thee fair?”… It may be that the machine has been made in our own image with data from our world.
The Good news is that it is fairly easy to understand your machine learning model’s biases and resulting unfairness using simple tools such as the confusion matrix and by understanding your attributes (as opposed to just “throwing” them into your machine learning process).
The Bad news is that correcting for such biases is not straightforward and may even result in unintended consequences leading to other biases or policy unfairness (e.g., by correcting for bias against one group, your machine model may increase the bias against another group, which arguably might be construed as unfair against that group).
Julia Angwin & Jeff Larson, “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks” (May 2016), ProPublica. See also the critique of the ProPublica study; Flores et al.’s “False Positives, False Negatives, and False Analyses: A Rejoinder to “Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks.”” (September 2016) Federal Probation 80.
Alexandra Chouldechova (Carnegie Mellon University), “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments” (2017).
Rachel Courtland, “Bias detectives: the researchers striving to make algorithms fair” (Nature, 2018, June).
Kate Crawford (New York University, AI Now Institute) keynote at NIPS 2017 and her important reflections on bias; “The Trouble with Bias”.
Arvind Narayanan (Princeton University) great tutorial; “Tutorial: 21 fairness definitions and their politics”.
Kim Kyllesbech Larsen, “A Tutorial to AI Ethics – Fairness, Bias and Perception” (2018), AI Ethics Workshop.
Kim Kyllesbech Larsen, “Human Ethics for Artificial Intelligent Beings” (2018), AI Strategy Blog.
I rely on many for inspiration, discussions and insights. In particular, for this piece I am indebted to Amit Keren & Ali Bahramisharif for their suggestions on how to make my essay better as well as easier to read. Any failure from my side in doing so is on me. I also gratefully acknowledge my wife Eva Varadi for her support, patience and understanding during the creative process of writing this Blog.