Does the Therac-25 tragedy have anything to teach us about the hidden dangers of medical AI?
On March 21, 1986, a medical nightmare unfolded in Tyler, Texas. A male cancer patient was receiving radiotherapy for a small tumor on his back from a new type of radiotherapy machine, the Therac-25. He had been prescribed a modest dose of 180 rads. When the machine turned on, the Therac-25 began buzzing loudly; the patient felt a horrific flash of pain, and the skin on his back began to sizzle. He had received a massive overdose of radiation: instead of 180 rads, he had received tens of thousands. He was immediately treated for radiation sickness, but it was too late. The patient died five months later.
Unfortunately, the incident at Tyler was not an isolated one. While many patients were treated successfully with the Therac-25, between 1985 and 1987 there were at least six such overdose incidents, several of them fatal. The Therac-25 was launched in 1983, but it was not until 1987 that the machines were finally withdrawn from use.
The Therac-25 is sometimes used as a cautionary tale of how problems with poorly designed computerized medical devices can have catastrophic consequences. The case prompted the development of tighter controls over medical devices. But how did things go so badly wrong? And, as we witness the emergence of an AI revolution in medicine, does the Therac-25 tragedy have anything to teach us about the hidden dangers of medical AI?
What was the Therac-25?
The Therac-25 was a radiotherapy system, consisting of a bed and a beam emitter, which was positioned over the site of the tumor. It was manufactured by a partnership between Atomic Energy of Canada Limited (AECL, which these days is involved in managing radioactive waste and decommissioning) and the French engineering company CGR.
The machine could be set to emit either a beam of electrons, for shallow tumors, or a more penetrating beam of X-rays, for deeper tumors. Crucially, for the electron beam, magnets were placed between the emitter and the patient to spread the beam over the treatment area. For the X-ray beam, a metal plate was placed in the beam's path to convert the electrons into X-rays.
How did the Therac-25 go wrong?
Central to the story of the Therac-25 are software bugs. Importantly, the Therac-25 was designed to be controlled solely by software. This was a departure from earlier versions of the machine (the Therac-6 and the Therac-20). Despite the differences between the earlier machines and the Therac-25, some of the software developed for the earlier versions had been carried over to the Therac-25. The bugs mattered less on the earlier versions because those machines had built-in hardware safety controls. On the Therac-25, they could be catastrophic.
One bug meant, for example, that if the radiographer changed the beam type from X-ray to electron beam within eight seconds, the machine would deliver the dose but display a message saying that no dose had been given. This could lead the radiographer to deliver repeated doses in succession while wondering why the machine appeared not to be working at all.
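Published analyses describe this as a race condition between operator input and the machine's setup routines. The following is a hypothetical Python sketch of that general failure mode, not the actual Therac-25 code (which was written in assembly and never released): the setup routine copies the entered mode once, so an edit made during the setup window changes the display but not the hardware.

```python
class BeamController:
    """Toy model of a race between operator edits and machine setup.

    This is NOT the real Therac-25 software; it is an invented sketch
    of the general failure mode described in the text.
    """
    SETUP_TIME = 8  # seconds the (simulated) magnet-setup phase takes

    def __init__(self):
        self.entered_mode = "xray"  # what the operator typed at the console
        self.active_mode = None     # what the hardware is configured for

    def begin_setup(self, now):
        # Setup copies the entered mode ONCE and starts moving the magnets.
        self.active_mode = self.entered_mode
        self.setup_done_at = now + self.SETUP_TIME

    def edit_mode(self, new_mode, now):
        # The bug: edits during setup update the display variable only;
        # the setup routine never re-reads it.
        self.entered_mode = new_mode

    def fire(self, now):
        assert now >= self.setup_done_at, "setup still in progress"
        # The display shows the edited mode; the hardware uses the stale one.
        return {"display": self.entered_mode, "hardware": self.active_mode}
```

Running this with an edit five seconds into the setup window produces a display that says "electron" while the hardware remains configured for X-ray mode: the operator and the machine no longer agree about what is happening.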
Even more dangerous were failures in the systems controlling the devices placed between the beam and the patient: the arrangement of magnets that spread the beam over an area, and the metal plate that converted electrons into X-rays. Mistakes here meant a patient could be hit with a beam roughly 100 times more intense than intended, concentrated like a laser instead of being spread over an area of skin.
The Therac-25 as a ‘black box’ problem
As ever, the truth is rarely plain and never simple. Several factors led to the Therac-25 tragedy; it was not just about software bugs. These included inadequate testing of the machine before release and the unwillingness of AECL to accept that the machine could have faults. Fundamentally, though, the case can also be seen as a 'black box' problem.
A ‘black box’ is a system that lets you see the inputs and outputs but gives you no idea of the processes at work in between. Using one is a matter of faith for the operator. The early versions of the Therac were not black boxes. The software was merely a convenience, and the machine was under the control of technicians and radiographers. There were hardware safety controls in place that were easy to understand. For example, a fuse would blow if someone tried to activate a dangerous mode of operation, such as a high-energy beam without the magnets or the metal plate in place. Technicians and radiographers knew when the machine had been configured wrongly because they were responsible for setting it up.
But on the Therac-25, these hardware safety measures had been removed. The machine was essentially controlled by software that was totally inaccessible (its source code was never released). The new version had a quicker setup, allowing more treatments to be performed. But the working principles of the machine were now invisible to the operators, and lurking dangers were impossible to spot. This black box problem was deepened by AECL's assurances that overdoses were impossible, and by the system's error messages, which were hard to understand.
The ‘Black Box Problem’ of AI
What is the connection between the Therac-25 and modern medical AI? AI systems also have a ‘black box problem’. Unlike conventional statistical tools such as regression, with most AI-based systems we may know that the system works, and how well it works, but we have no idea how it works.
Let’s take what is probably the best-known type of machine learning algorithm: the neural network. This is an arrangement of ‘neurons’ (instantiated in software) that take numerical inputs and yield numerical outputs according to their ‘weights’. These weights are learned by repeatedly exposing the network to training data. The trained neural network is simply a huge grid of numbers corresponding to these weights.
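To make this concrete, here is a minimal Python sketch of a 'trained' network reduced to its grid of weights. The numbers are invented for illustration, not taken from any real model:

```python
# A tiny neural network "by hand": the whole model is just these numbers.
# Hypothetical toy weights, not a trained medical model.
W1 = [[0.2, -0.5],
      [0.8,  0.1],
      [-0.3, 0.7]]   # 3 inputs -> 2 hidden neurons (rows = inputs)
W2 = [0.6, -0.4]     # 2 hidden neurons -> 1 output

def forward(x):
    # Each hidden neuron sums its weighted inputs, clipped at zero (ReLU).
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)))
              for col in zip(*W1)]
    # The output neuron sums the weighted hidden values.
    return sum(h * w for h, w in zip(hidden, W2))

print(forward([1.0, 2.0, 0.5]))  # prints 0.97
```

Even in this toy example, the relationship between the individual weights and the output is not obvious at a glance. Scale the grid up by many orders of magnitude and the opacity becomes total.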
With a simple neural network (just a few neurons) we could try to understand how the network works, i.e., how numbers flow through it. Even for the simplest network this would be hard, but we might consider it worth a try, even if the result wouldn’t be very useful. However, the kinds of neural networks typically used in medical AI (especially for image processing) have many thousands of neurons. The number of connections between the neurons is larger still, and does not usually even get counted, because the figure is of little practical use. Trying to understand how these many thousands of weights generate predictions is far too complex to be worth attempting. It should be noted that techniques are being developed to improve the interpretability of machine learning models, and I will examine this topic in a future article. For the time being, however, only machines can ‘understand’ the huge complexity of machine learning models. Thus, to humans (even the engineers who designed it), the neural network is a black box.
At this point, we should reflect that most of the machines we use are mysterious to us. I have very little idea about the workings of my fridge, my washing machine, or my radio. It has even been said, very plausibly, that no one understands the entire workings of something as complex as a smartphone. We do not lose sleep over these things, as our lives do not depend on them. But what about my car? If that goes wrong at high speed I am in serious trouble. We simply trust our machines because life is too short to ask too many questions.
It might be reasonable, then, to conclude that the black box is not really a problem at all. As long as we can verify that an AI solution works well in the real world, that is what really matters. Under this argument, if the Therac-25 software had been better designed and tested, the black box problem would not have been an issue, and the higher throughput and speed of the Therac-25 might have saved lives. It becomes a purely empirical question: which machine saved more lives, the Therac-25 or its earlier, less automated versions? Perhaps that is the proper criterion for medical AI, not whether there is a black box issue per se.
‘Grey Box’ Problems with AI systems: Algorithmic bias
The above argument seems straightforward enough but may not satisfy everyone. This brings us to what I would call a ‘grey box’ problem with AI: one where the workings of a machine are not impossible to understand, but where understanding them is very difficult in practice. Recently, particularly in the US, a spotlight has been shone on the relationship between AI and an aspect of social discourse that is, at present, particularly sensitive: racial bias.
In 2019 Brian Powers, a Harvard-affiliated physician-researcher, and his colleagues published a paper in Science reporting that a widely used AI-based health care algorithm suffered from racial bias. This should not surprise us. AI systems are only as good as their data, and medical data in the US tends to be disproportionately drawn from white patients. Why? The answer probably involves several factors and is beyond the scope of this article. But it remains possible that some AI health systems in Western countries will work better for white patients than for other racial groups. In other parts of the world, there will likely be corresponding biases toward other groups.
There may be no good solution to this, because data collection and data processing are highly complex processes, and biases can be introduced in many ways, with different degrees of transparency. These concerns are not ‘black box’ issues, because the data fed into AI systems can, in principle, be inspected for bias. For example, the racial makeup of a training dataset can be checked. But doing so may not be practical for all sorts of reasons. And so such problems involve the same kind of issue as black box problems: the workings of our systems become opaque, and this can blind us to their dangers.
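As a simple illustration of the kind of inspection that is possible in principle, this Python sketch (with invented records and field names) tallies the group makeup of a dataset:

```python
from collections import Counter

# Hypothetical patient records -- the field names are invented for illustration.
records = [
    {"id": 1, "group": "white"},
    {"id": 2, "group": "white"},
    {"id": 3, "group": "white"},
    {"id": 4, "group": "black"},
]

# Count how many records belong to each group.
counts = Counter(r["group"] for r in records)
total = sum(counts.values())
shares = {group: n / total for group, n in counts.items()}
# shares reveals that 75% of the records come from one group --
# a warning sign worth investigating before training on this data.
```

Counting representation is the easy part; deciding what counts as fair representation, and what to do about an imbalance, is where the practical difficulty lies.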
A Defense of Black Box Medical AI Systems
In practice, one can never eliminate every source of bias; medical AI systems will always involve ‘unknown unknowns’. There is also a philosophical question: would we rather have a system that works better for some groups than for others, but still improves outcomes for all groups? Or would we rather have a system that works worse for everyone, but at least does not discriminate between groups? The question is also a practical one, because the easiest way to address racial bias in Western medical AI systems would be simply to throw away data from overrepresented groups until every group was equally represented. That would likely make the system less accurate for everyone because, as a rule, more data means better accuracy.
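That 'throw away data' strategy can be sketched as follows, assuming hypothetical records carrying a group label. The function name and record format are invented for illustration; the point is that parity is bought by shrinking the dataset:

```python
import random
from collections import defaultdict

def downsample_to_parity(records, key, seed=0):
    """Discard records at random until every group has the same count.

    A deliberately crude balancing strategy, sketched only to illustrate
    the trade-off discussed in the text: equal representation is achieved
    by throwing training data away.
    """
    # Bucket the records by their group label.
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    # Every group is cut down to the size of the smallest group.
    smallest = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced
```

Applied to a dataset with three records from one group and one from another, this keeps exactly one record per group: representation becomes equal, but half the data (and the accuracy it could have bought) is gone.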
The point here is that the current debate about algorithmic bias is more complex than it might seem at first sight. There are no obvious conclusions to be drawn from the facts of algorithmic bias about the appropriateness of medical AI systems. These systems should be judged on their own merits, not on arguments about whether or not they are biased. As is sometimes said in British politics, ‘the perfect can be the enemy of the good’. Medical AI should not be dismissed simply because it is imperfect.
The Counter Cases: Radiotherapy Overdoses from Human Error
To err is human, even in the medical profession. Indeed, there have been several cases where cancer patients have accidentally been given fatal overdoses of radiation by human radiographers, even when those radiographers were properly qualified and experienced, and even when strict safety protocols were in place and followed. One such case, widely reported in the British press, occurred at the Western General Hospital in Edinburgh in September 2015, when a patient was given twice the intended dose of radiation. The error arose because two radiographers, both experienced and properly qualified, made the same mistake in manually calculating the dose.
Such cases are thankfully rare. But they should remind us that the proper comparator for medical AI is not perfection; it is human performance. Medical AI should not be dismissed because it makes errors. The issues surrounding medical AI, both practical and ethical, are highly complex. There may be hidden biases in the data, hidden assumptions in testing procedures, and all sorts of ‘unknown unknowns’ in the data that could lead to surprising consequences. This will be an ongoing discussion between industry, the public, the medical profession, and regulators.
The Therac-25 is a warning about the dangers of automated health systems whose workings are hidden from view. This is an important lesson in the context of medical AI systems, because these tend to suffer from ‘black box’ type problems, i.e., their workings are hidden from view. It points to the need for rigorous real-world testing of medical AI systems, and ongoing scrutiny by a range of stakeholders, including the public. We also need to strike the right balance and not be overly cautious. A basic question is: if a machine had been repeatedly demonstrated to be more successful in the real world than a human doctor, which would you rather be treated by? Such questions are no longer so hypothetical. Medical AI has already surpassed human expert performance in some areas of diagnostics, and it will continue to improve in an increasing number of ways. With time, more of us may come to prefer the machine doctor over the human one. But let’s not forget the lessons of how this can sometimes go wrong.
Felix Beacher heads Omdia’s health care technology team. He has direct responsibility for the ultrasound intelligence service and is currently working on Omdia’s forthcoming intelligence service on medical AI.