Generic risks and biases: Data bias types

About this sub-guideline

This sub-guideline is part of the guideline Generic risks and biases. Refer to the main guideline for context and an overview. For a discussion of risks that relate more specifically to the unique work of parliaments, refer to the guideline Risks and challenges for parliaments.

This sub-guideline focuses on data biases – a type of error in which certain elements of a data set are more heavily weighted or represented than others, painting an inaccurate picture of the population. A biased data set does not accurately represent a model’s use case, resulting in skewed outcomes, low accuracy levels and analytical errors.

Selection bias

Selection bias occurs when the data selected for training does not properly represent the population that the model is intended to serve.

In one example, an AI system for detecting Parkinson’s disease was trained using a data set containing only 18.6% women. Consequently, the rate of accurate detection of symptoms was higher among male than female patients even though, in reality, the symptoms in question are more frequently manifested by female patients.

In another example, an AI system for detecting skin cancer was unable to detect the disease in people of African descent. Researchers observed that, because rates of skin cancer were rising in Australia, the United States and Europe, the data set used to train the system consisted largely of images of people of European descent.

Sampling bias

Sampling bias is a form of selection bias in which data is not randomly selected, resulting in a sample that is not representative of the population. For example, if a poll for a national presidential election targets only middle-class voters, the sample will be biased because it will not be diverse enough to represent the entire electorate.
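As a rough illustration, the following sketch (in Python, with an invented electorate and invented support rates – none of these figures come from a real poll) shows how surveying only middle-class voters can distort an estimate of national support:

```python
import random

random.seed(0)

# Hypothetical electorate of 1,500 voters. Support for a candidate differs
# by income group; all group sizes and support rates are assumptions.
def make_group(name, size, support_rate):
    supporters = int(size * support_rate)
    return [(name, 1)] * supporters + [(name, 0)] * (size - supporters)

electorate = (
    make_group("low", 500, 0.30)
    + make_group("middle", 500, 0.70)
    + make_group("high", 500, 0.50)
)  # true overall support: 0.50

def estimate(sample):
    return sum(vote for _, vote in sample) / len(sample)

# A random sample approximates the true rate.
random_sample = random.sample(electorate, 200)

# A sample restricted to middle-class voters overestimates support.
biased_sample = random.sample([v for v in electorate if v[0] == "middle"], 200)

print(f"True support rate:      {estimate(electorate):.2f}")     # 0.50
print(f"Random sample:          {estimate(random_sample):.2f}")  # ~0.50
print(f"Middle-class-only poll: {estimate(biased_sample):.2f}")  # ~0.70
```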

Coverage bias

Coverage bias is a form of sampling bias that occurs when a selected population does not match the intended population. For example, general national surveys conducted online may miss groups with limited internet access, such as the elderly and lower-income households.

Participation bias

Participation bias is a form of sampling bias that occurs when people from certain groups decide not to take part in the sample. It commonly arises when the sample consists of volunteers, which skews it towards people who are willing and available to participate. The results will therefore mainly represent people who have strong opinions about the topic, omitting others.

Omitted variable bias

Omitted variable bias is a form of sampling bias that occurs when an important variable is left out during data collection, skewing the results. For instance, developers designing an algorithm to estimate the price of second-hand cars include the following variables: make, number of seats, accident history, mileage and engine size. However, they forget to include the car’s age. The algorithm is likely to give biased estimates because two cars with identical values for all the other variables will probably sell at different prices depending on their age, as sketched below.
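A minimal sketch of this effect, using numpy and a simple least-squares fit. All coefficients and the data-generating assumptions are invented for illustration; the point is that when age is omitted, the mileage coefficient absorbs part of the age effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical second-hand car data: price depends on both mileage and age,
# and the two are correlated (older cars have driven further).
n = 1000
age = rng.uniform(1, 15, n)                           # years
mileage = 12_000 * age + rng.normal(0, 5_000, n)      # km, correlated with age
price = 30_000 - 1_200 * age - 0.05 * mileage + rng.normal(0, 500, n)

# Full model: price ~ mileage + age (recovers coefficients near the truth).
X_full = np.column_stack([np.ones(n), mileage, age])
coef_full, *_ = np.linalg.lstsq(X_full, price, rcond=None)

# Omitted-variable model: price ~ mileage only. The mileage coefficient is
# biased away from the true -0.05 because it soaks up the age effect.
X_omit = np.column_stack([np.ones(n), mileage])
coef_omit, *_ = np.linalg.lstsq(X_omit, price, rcond=None)

print("mileage coefficient, full model: ", round(coef_full[1], 4))  # ~ -0.05
print("mileage coefficient, age omitted:", round(coef_omit[1], 4))  # ~ -0.15
```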

Popularity bias

Popularity bias is a form of sampling bias that occurs when items that are more popular gain more exposure, while less popular items are underrepresented. For example, recommendation systems tend to suggest items that are generally popular rather than personalized picks. This happens because the algorithms are often trained to maximize engagement by recommending content that is liked by many users.

Data inaccuracy

Data inaccuracy results from failures in data entry. For example, if a system registers temperature automatically and the sensor fails, the data set cannot be trusted wherever temperature is used as a variable. In addition, some systems are not strict about data entry, accepting values that do not follow standards or that contain errors.
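One common mitigation is to validate values at the point of entry rather than storing them silently. Below is a minimal sketch for the temperature-logger example; the plausible range and the sample readings are illustrative assumptions, not values from this guideline:

```python
from datetime import datetime

# Readings outside this range are treated as sensor failures (assumed range).
PLAUSIBLE_RANGE_C = (-40.0, 60.0)

def validate_reading(raw, when):
    """Return a float reading, or None if the value is non-numeric or
    physically implausible (e.g. produced by a failing sensor)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        print(f"{when:%Y-%m-%d %H:%M} rejected: not a number ({raw!r})")
        return None
    low, high = PLAUSIBLE_RANGE_C
    if not low <= value <= high:
        print(f"{when:%Y-%m-%d %H:%M} rejected: out of range ({value} °C)")
        return None
    return value

# "err" mimics a transmission failure; 999.9 mimics a stuck sensor.
readings = ["21.4", "22.0", "err", "999.9", "20.8"]
timestamp = datetime(2024, 7, 1, 12, 0)
clean = [v for r in readings if (v := validate_reading(r, timestamp)) is not None]
print("accepted:", clean)  # [21.4, 22.0, 20.8]
```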

Obsolete data

Obsolete data is data that is too old to reflect current trends. For example, a system designed to predict how long a public procurement exercise will take is trained on an excessively large data set, consisting mostly of procurement exercises that happened 10 years ago under different legislation. As a result, this system will likely produce inaccurate predictions.

Temporal bias

Temporal bias occurs when the training data is not representative of the current context in terms of time. For example, census data – which is only collected once every 10 years – is used for many predictions. However, if the last available census data was collected in 2021, i.e. in the middle of the COVID-19 pandemic, then algorithms that use this data may be biased in a number of ways.

Variable selection bias

Variable selection bias occurs when a chosen variable is not fit for purpose. For example, a national health agency looking to give an additional benefit to citizens selects, as the variable for allocating the benefit, total health spending according to age. The algorithm then selects people of European descent and those on higher incomes to receive the additional benefit, simply because people in these groups spent more on their health. The chosen variable was the root of the problem.

Confounding variable

A confounding variable, in research investigating a potential cause-and-effect relationship, is an unmeasured third variable that influences both the supposed cause and the supposed effect. For example, when researching the correlation between educational attainment and income, geographical location can be a confounding variable. This is because different regions may have varying economic opportunities, influencing income levels irrespective of education. Without controlling for location, it is impossible to determine whether education or location is driving income.
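A small simulation can make this concrete. In the invented data below, region drives both years of education and income, while education itself has no effect at all; the overall correlation is nonetheless strongly positive, and it vanishes once region is controlled for. All numbers are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: region (the confounder) influences both
# education and income; education has no direct effect on income here.
n = 10_000
region = rng.integers(0, 2, n)                       # 0 = low, 1 = high opportunity
education = 10 + 3 * region + rng.normal(0, 1, n)    # richer regions: more schooling
income = 20_000 + 15_000 * region + rng.normal(0, 2_000, n)  # driven by region only

# Naive correlation suggests education raises income...
print("overall corr(education, income):",
      round(np.corrcoef(education, income)[0, 1], 2))  # ~0.80

# ...but within each region the correlation disappears, because region
# was driving both variables.
for r in (0, 1):
    mask = region == r
    corr = np.corrcoef(education[mask], income[mask])[0, 1]
    print(f"corr within region {r}:", round(corr, 2))  # ~0.00
```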

Simpson’s paradox

Simpson’s paradox is a phenomenon that occurs when subgroups are combined into one group. The process of aggregating data can cause the apparent direction and strength of the relationship between two variables to change. For example, a study shows that, within an organization, male applicants are more successful than female applicants. However, comparing the rates within departments paints a different picture, with female applicants having a slight advantage over men in most departments.
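The reversal is easy to reproduce with a small worked example. The department names and the application and hiring figures below are invented purely for illustration:

```python
# Women have the higher success rate in each department, yet the aggregate
# favours men, because women applied mostly to the more competitive department.
applications = {
    # department: (male applicants, male hired, female applicants, female hired)
    "A (easy)":        (400, 200, 100, 55),  # men 50%, women 55%
    "B (competitive)": (100, 10,  400, 60),  # men 10%, women 15%
}

m_apps = m_hired = f_apps = f_hired = 0
for dept, (ma, mh, fa, fh) in applications.items():
    print(f"{dept}: men {mh/ma:.0%}, women {fh/fa:.0%}")
    m_apps += ma; m_hired += mh; f_apps += fa; f_hired += fh

# Aggregated over departments, the direction reverses.
print(f"overall: men {m_hired/m_apps:.0%}, women {f_hired/f_apps:.0%}")
# -> overall: men 42%, women 23%
```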

Linguistic bias

Linguistic bias occurs when an AI algorithm favours certain linguistic styles, vocabularies or cultural references over others. This can result in output that is more relatable to certain language groups or cultures, while alienating others.


The Guidelines for AI in parliaments are published by the IPU in collaboration with the Parliamentary Data Science Hub in the IPU’s Centre for Innovation in Parliament. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence. It may be freely shared and reused with acknowledgement of the IPU. For more information about the IPU’s work on artificial intelligence, please visit www.ipu.org/AI or contact [email protected].