Intro
One of the really hot but all too often ignored topics around any powerful technology is ethics, the moral guidelines for how a certain technology is used. In the case of Artificial Intelligence and Machine Learning this should be a high priority, not because the technology itself is dangerous, but because of how it can be used and the amount of harm that can be done to individuals and societies. For sure, the researchers who discovered nuclear power were happy to be praised by society and happy to show that it could be used for producing electricity, but in the end they did not think it through: that it would be used for weapons, and that even the "harmless" power plants can bring a lot of harm and destruction.

In AI and ML we work with data, sometimes quite sensitive or private data, and we have to be sure that this technology cannot be used against us by our human counterparts - and that at least the tech giants follow some rules when they create products with it, or at least that is what they say. If we think back to the beginning of "big data" and how companies treated our personal data - as a currency to be exchanged for huge monetary gains, used without remorse, without shame and without even asking for permission - we can rightfully ask ourselves how our data will be used now. We have reason to be afraid of the AI and ML era, when people without any form of respect will get powerful software to serve their own personal needs - meaning money and power.

But wait, the title says something about bias. True, that is the main topic, not some dystopian scenario about how ML and AI will destroy humanity ... I am afraid humans will manage that on their own first anyway.


Bias. When someone talks about bias, the discussion always goes in the skin color or religion direction, but not today. Today is about other biases, beyond skin or gender: the bias a business or a process has when it takes people into account.

Let's talk insurance, statistics, recommendations and prediction. And I am not talking about what Netflix or Amazon recommend you watch tonight, but rather about something more subtle that can affect a person in a much more tangible way.
Today, any serious software company tries to bring intelligence into recommendations and predictions to avoid human error or human bias, and one area highly interested in this is the financial sector - or, in other words, how to earn more and pay less. From the very first step we write the code with some bias in mind, and it is not the developer's bias, as it is trendy to say nowadays, but the business bias itself. Now imagine that, once we have access to more data dimensions, this bias becomes our enemy.
A bank usually gives a loan if a person has enough potential to pay it back, although a bank never loses even if the loan is not paid in full, thanks to insurance or the mortgage itself. For the risk assessment the bank uses many strange tools that take lots of parameters into account, but never the person themselves. There are also private companies who somehow gained the title of authority in risk assessment and who can deem a person unworthy just by looking at some numbers.

In Germany, for example, there is SCHUFA, a private company that gathers data about every individual. At first one could say: well, it is a good thing that they gather data about people who are known to be bad payers, so companies can avoid giving them loans or products without an upfront payment. But things get hairy and scary when you find out that your name is in their database even when you sign a contract and pay without any issues, and then at some point decide to change the provider because it got too expensive for no reason. I have heard of people who collected negative points simply for moving from one provider to another. For me this is a BIAS from the start, because a person can and should be free to change providers when the prices increase without any increase in quality, or just because the company wants it so. As long as there are serious providers who can maintain their prices, why should I stay with the greedy one?
Another creepy story regarding this company is that they also look at your neighbourhood, at where you live. If it is a poor area and the people there tend to have a lot of debt, an individual will get penalty points just because of the neighbourhood. Here I can say it again ... BIAS!

I can't say if the above information is 100% accurate, because no one can find any information on how SCHUFA calculates its score.
If we look at the insurance companies, there is also a lot of missing information and bias. Car insurance classes are a mystery, because they do not factor in only the price and accident statistics; on top of that, for an individual they throw in the age, how much the car will be driven in a year, the accident history and some other magic numbers. The first bias in their case is the date on which the driving licence was obtained, the second is the age. Although the statistics show a trend for those two numbers, there is no factor for the individual, and nobody checks whether you were perhaps a rally champion at 16 or whatever. In my personal case I got the motorcycle first and did not have a car, with no incidents at all, but many years later I registered a car and, to my surprise, received a huge quote just because ... well, I had no car history, and since I kept the motorcycle I could not transfer the insurance from one vehicle to the other.
Another bias is in the yearly mileage. When I said that I would drive more than 60,000 km per year I received a huge invoice, due to the associated risk of driving more than three times the number of kilometres an average person insures per year.
After three years of successfully driving 60,000 km per year without incidents I changed my mileage to a lower number, much lower ... and although in those three years I drove the equivalent of at least nine years of their average person's mileage without any issue, the fee still reflected only three calendar years without incidents. Is this a bias? I would say yes, as long as you do not factor in the individual but only the statistical data. They factor in the years when it is convenient and the mileage when it justifies a larger fee.
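Just to make the arithmetic explicit, here is a tiny sketch; the 20,000 km per year "average person" mileage is my own assumption, since insurers do not publish that figure:

```python
# Calendar years vs exposure-equivalent years of claim-free driving.
# The 20,000 km/year "average person" mileage is an assumption for illustration only.

AVERAGE_KM_PER_YEAR = 20_000

def exposure_equivalent_years(km_per_year: float, years: float) -> float:
    """Translate the kilometres actually driven into 'average person' years."""
    return (km_per_year * years) / AVERAGE_KM_PER_YEAR

calendar_years = 3
my_km_per_year = 60_000

print(exposure_equivalent_years(my_km_per_year, calendar_years))  # 9.0 equivalent claim-free years
print(calendar_years)                                             # 3 - what the insurer actually credits
```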

Now imagine that this insurance company builds a super AI to help it with better savings and yearly earnings. The model, being biased from the start, will be no better than the human-run process, but now it has learning and access to more data. Maybe it could factor in an individual's group of friends: if they are all drunks with lots of accidents, statistically speaking there is a high probability that the individual is also drinking and driving. Is the individual still relevant? No, but the environment becomes relevant, and the insurance company will argue that this way it can keep lower fees for its "standard" average person, by having the AI predict possible higher costs for groups outside that "standard" image. Bias? Yes!
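To make the point concrete, here is a toy sketch of how a group-level proxy can drown out an individual's own record; the feature names and weights are invented for illustration and are not any insurer's real model:

```python
# Toy example: a group-level proxy feature dominating an individual's risk score.
# Feature names and weights are invented purely for illustration.

def risk_score(individual_incidents: int, friend_group_incident_rate: float) -> float:
    # Suppose the learned model ended up weighting the group proxy heavily,
    # because it happened to correlate with claims in the training data.
    w_individual = 0.2
    w_group = 0.8
    return w_individual * individual_incidents + w_group * friend_group_incident_rate

# A driver with a spotless record but an accident-prone circle of friends ...
print(risk_score(individual_incidents=0, friend_group_incident_rate=5.0))  # 4.0
# ... scores far worse than a driver with one incident and a "standard" circle.
print(risk_score(individual_incidents=1, friend_group_incident_rate=0.5))  # 0.6
```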

How about health and life insurance? Never been ill but over 40? You have to pay more! Own a motorcycle? You just got more expensive! And the list can keep growing. Of course, at the moment an individual can simply claim there is no motorcycle, since the databases are not very well connected. If they die in a motorcycle crash they will probably not get the life insurance payout, but if they die of any other cause they will. The individual won, because the cause of death was something other than the undeclared dangerous activity, and also won because he or she paid less. Did the insurance company lose money? I would say no, because it would not have paid for the motorcycle crash anyway. BUT from their point of view they lose money: they could have grabbed extra cash for nothing.
And they do not like to pay anyway: if my handlebar is not approved by whatever authority, even if the crash had nothing to do with my handlebar, I risk being left alone to pay for the crash, because without an approved handlebar the motorcycle is deemed unworthy.
And I kid you not, I had a high-quality competition handlebar which is more than superior to the "average person" handlebar. It is built to survive rally and offroad, so it is more than good enough for the road and will not break when I hit the first pothole. To my surprise, my bike was deemed unworthy for the road, so I had to purchase a "big price, lower quality" road-approved handlebar which is inferior in every way and aspect. I can only imagine how happy my insurance company would be to find out that I had bought an unapproved handlebar, or that I have some "bad habits" which I omitted to tell them about. Or how happy it would be to have direct access to my medical history.
Well, this becomes possible with machine learning, because they can argue that they do not have direct access to the data: their software can factor that data into its calculations and just offer a quote at the end. From one point of view the data itself is never known to the company's employees, so it is "safe"; the data is known only to the software, which runs it through whatever algorithms and returns just a final quote.
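A minimal sketch of that argument, with the data source and the surcharge rules entirely made up:

```python
# Sketch of the "only the software sees it" argument.
# The data source, surcharge rules and numbers are all hypothetical.

def fetch_medical_history(person_id: str) -> dict:
    # Imagine this quietly pulls data the insurer claims no employee ever reads.
    return {"chronic_conditions": 1, "risky_hobbies": ["motorcycling"]}

def quote(person_id: str, base_premium: float) -> float:
    record = fetch_medical_history(person_id)        # sensitive data enters here ...
    surcharge = 0.15 * record["chronic_conditions"]
    surcharge += 0.25 * len(record["risky_hobbies"])
    return round(base_premium * (1 + surcharge), 2)  # ... and only a number leaves

print(quote("person-123", base_premium=400.0))  # 560.0 - the employee sees only this
```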

But let's return to BIAS. A few weeks ago I heard that a study relates individuals who are not vaccinated against COVID to a higher road crash risk. I would say that the bias is already floating in the air; insurance companies would love to raise the quotes "just because".
Is AI / ML biased by default? No! Is the human who tries to see whether there is a statistical connection between two factors? Maybe not, or maybe yes, depending on their personal motivation and on how the study is conducted. Is the person who decides such a data set can be used to increase the income? Maybe yes, maybe not; it is just a goal. Can the end result of all these factors be biased? Hell yeah!

How can we avoid building biased solutions?
Take the individual into account

Maybe we need to let the dataset do what it does, but then factor in the subject itself and make the individual the starting point instead of the database.
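As a rough sketch of what I mean - the credibility-style weighting below is only an illustration, not a real actuarial formula:

```python
# Rough sketch of "start from the individual": blend the population statistic with the
# individual's own history, and let the individual's evidence dominate as it accumulates.
# The weighting rule and the numbers are illustrative assumptions.

def blended_risk(population_rate: float, individual_rate: float,
                 individual_exposure_km: float, full_credibility_km: float = 100_000) -> float:
    """The more kilometres (evidence) an individual has, the more their own record counts."""
    credibility = min(1.0, individual_exposure_km / full_credibility_km)
    return credibility * individual_rate + (1 - credibility) * population_rate

# The statistics say 0.8 incidents/year for "people like me"; my own record says 0.0.
print(blended_risk(0.8, 0.0, individual_exposure_km=180_000))  # 0.0  - my record decides
print(blended_risk(0.8, 0.0, individual_exposure_km=10_000))   # 0.72 - mostly the statistic
```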
One cool and very current example is "racial profiling" by the police, which is considered evil and something to be erased. Well, the statistics say that a certain type of person, from a certain background, in a certain area, is more predisposed to commit certain crimes, so the police will perform more routine checks on a person fitting the profile. Personally, I do not find it very wrong - before you burst into spontaneous combustion, please read further - and at some points in life I fit the profile and experienced more routine checks, though I could not add the "racial" word, due to my caucasian complexion.
On one occasion I was "harassed" by police officers in civilian clothes because my train arrived on the platform next to the train from Amsterdam. I was young, a bit on the rebellious side, carrying only a backpack and walking very quickly, while the other passengers looked relaxed, business casual, pulling a trolley, in their prime or retirement, etc. The result was that I got stopped and searched for drugs.
On other occasions, having a specific number plate on a shabby car and driving through certain areas meant being stopped and checked, asked why and where to. Again, I could not add the word "racial".
Then I changed the car, got number plates from a different region, and now I can look like a bum inside my car and not be stopped once. So now I do not fit the profile; before, because of a few minor differences, I fit the bill.
Well, society finds this to be the source of all evil: it must be banished, and all the policemen jailed if they are caught using it.

Now, let's take the "harmless" example: person A wants to take out a life insurance policy for whatever reason. Person A is compared against all known statistics and is found "guilty" of coming from a certain background, living in a certain area, being a certain type of person (based on having certain hobbies, owning certain things, etc.). What is the difference between this and "racial profiling by police"? You would say there is a world of difference; I will upset you and tell you that "IT ISN'T!!!" - from the data point of view it is the same thing. One determines how probable it is that you carry drugs or whatever, the other determines how probable it is that you die before you have paid in full for whatever they are supposed to pay out after your death. One is considered harmless, and people cheer when a person is deemed possibly too dangerous and made to pay more because of whatever; the other is considered racist if a person is deemed possibly too dangerous and gets checked. From the point of view of the dataset this is a double standard and a double morality. It would brand one piece of software racist and the other a light bearer, although they both do risk assessment based on statistics.

Let's pretend we are that far, and we have the initial risk assessment based on statistics, but our software also has access to the details of the person in question. Not just name and date of birth but, in the case of a police check, the past criminal record, parking tickets, speeding tickets and other known legal history. This should be the data that carries more weight. Let's be honest: as a well-dressed gentleman, with a decent car, travelling with what seems to be a family, when nothing screams "fake", I can smuggle stuff far more easily than as a metalhead with a shabby car. And maybe the nicely dressed gentleman or lady who just passed by has a longer criminal record than the half-drunk metalhead with a PhD title who just crossed your path, but you too will have a bias and consider him a bum.
In the case of the "harmless" insurance assessment, what if it were more relevant how, where and how often the motorcycle is ridden, and whether there were any incidents while riding it, rather than the mere fact of owning one? How about the overall health rather than just the age? And the list can continue. I have seen people in their 70s healthier than 20-year-olds.

The rant continues in Part 2.