Computer Problems Similar to The Millennium Bug That Have Already Happened

by Jon Huntress

The Tenagra Corporation/Year2000.com Partnership

jon@year2000.com

------------------------------------------------------------------------

What kind of difficulties will we have when the year 2000 bug hits? The problem starts with a date routine in a program, but the results could manifest in a completely different area and in an unforeseen way. Computers nowadays do their job so well, and are so transparently efficient, that only the programmers who have written the software know what a Rube Goldberg labyrinth lies just beneath the smooth tin skin of these machines. Depending on how the program was written and by whom, there could be a minor date anomaly or a program failure when the century changes. Programmers told us this for years but we ignored them.

Computers are "literal machines" that can do nothing they aren't told to do. They follow a program written by one or a thousand programmers, one of whom may have had a headache one day while programming and made a mistake. That may be what happened to AT&T in 1990, when it switched the country over to its new software. A month after the software was installed, a combination of circumstances triggered a latent software error and every long-distance switch shut down, causing a nine-hour nationwide communications blackout. For want of a nail the horse was lost, for want of a horse the battle was lost.... Our technical society is built with billions of computer "nails," and a missing piece of code, depending on what it does or doesn't do, could have widespread consequences.

Software is naturally "buggy," because it is created by individual programmers, who often write code like poets write verse. Programming standards and object-oriented programming promise future uniformity, but we aren't there yet, not by a long shot. Scientific American magazine compares the current state of software development to the state of industrial development before Eli Whitney popularized interchangeable parts around 1799. In other words, software is essentially a handicraft item. A single wrong letter in a statement of code, in a critical spot, can crash a program. Also, two apparently identical machines can have chips with identical properties and specifications that were programmed by different programmers in different countries, so that one machine is year 2000 compliant and the next one isn't. Programmers write a program to specifications. Hopefully the person who wrote the specifications understood the job he was describing, and hopefully the programmer understood the specifications. If the specifications aren't clear, if the process the computer is to perform has not been thoroughly broken down into clear units, or if there isn't a good "fail safe" routine for when an anomaly is encountered, there is a good chance the program will have problems. The program could just quit, or go into an infinite loop, or follow some other unplanned chain of logic. Either way, it will probably not do what it is supposed to do. This may have few consequences, or many. A badly programmed chip could take a single switch or programmable logic controller offline, or it could have a cascading effect like the 1990 outage and knock the country offline.

Consider these computer-related problems. Some were the result of bad or lazy programming practices while others were caused by combinations of events that would have been almost impossible to predict. Some happened from a person ignoring or overriding a computer warning, because they thought the computer was in error (we all know they often are). Some were from simple mathematical errors, the kind we expect from the year 2000 bug. Some problems have nothing to do with date routines at all but could still crop up because when software is rewritten, new errors are introduced. The following problems happened individually in the recent past. They might all happen at once after December 31, 1999.

------------------------------------------------------------------------

Except where noted, these cases are taken from Peter G. Neumann's book, Computer-Related Risks (ACM Press, New York, 1995), and from the online magazine he moderates, The Risks Digest, located at http://catless.ncl.ac.uk/Risks/.

•One of the most interesting computer failures is the telephone outage that occurred in New York on September 17, 1991. The outage also shut down the three New York airports for four hours. The problem started with a citywide brownout (low voltage). Because low voltage can damage computer equipment, the center was automatically taken off city current and a signal was sent to start AT&T's backup generators. Because of a hookup failure, the backup generators didn't come online. The system then ran on standby batteries alone while warnings of the problem were sent to the emergency center and to alarms. The two technicians responsible for emergencies couldn't respond because they were both at a meeting discussing what to do in an emergency. The alarms didn't help because they had been deliberately disabled when construction in the area kept setting them off. After six hours the batteries went dead and so did all the phones. An estimated 5 million phone calls were blocked and almost 1,200 flights were delayed or cancelled. Cause: A comedy of errors stemming from overconfidence and inadequate testing and security procedures. The contingency plan was not followed. Even a system with triple redundancy isn't secure if one or more of the checks are removed from the system. This type of thing could happen after 2000 if part of a program or system failed and a warning was issued and ignored. Fix: Follow the contingency plan. Never assume you can bend the plan for a few hours, or that an alarm is crying "Wolf!" again, even though it has been for the last week. Find out why it's going off and fix that!

•In the Gulf War, the highly touted Patriot missile system actually did a terrible job shooting down Scud missiles. An unrecognized clock drift over a 100-hour period resulted in a tracking error of 678 meters. Cause: An error in the computer clock caused guidance problems. The software also handled the number 0.1 inconsistently in its 24-bit and 48-bit representations. In binary, 0.1 is an endlessly repeating fraction (0.000110011001100...), so depending on where the computer truncates it, two versions of 0.1 can actually be different. Similarly, in other applications, two numbers that should be the same can be seen as different by the computer. Fix: More testing in the field would have found this problem.
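The truncation effect in the Patriot case can be checked with exact arithmetic. The sketch below is illustrative only and is not the Patriot's actual code. It assumes the clock counted tenths of a second and that the stored constant kept 23 bits after the binary point; the account above says only "24 bit", so that layout and the names `chopped` and `clock_drift_seconds` are assumptions. The repeating fraction 0.000110011001100... quoted above is the binary expansion of one tenth.

```python
import math
from fractions import Fraction

TENTH = Fraction(1, 10)

def chopped(value: Fraction, frac_bits: int) -> Fraction:
    """Truncate (not round) a positive value to frac_bits binary places."""
    scale = 2 ** frac_bits
    return Fraction(math.floor(value * scale), scale)

def clock_drift_seconds(hours: int, frac_bits: int = 23) -> float:
    """Accumulated clock error after `hours` of counting tenths of a second.

    frac_bits=23 is an assumed register layout, not a figure from the article.
    """
    per_tick_error = TENTH - chopped(TENTH, frac_bits)
    ticks = hours * 3600 * 10  # ten ticks per second
    return float(per_tick_error * ticks)

if __name__ == "__main__":
    # Two truncations of the same constant are different numbers.
    print(chopped(TENTH, 23) == chopped(TENTH, 47))  # False
    print(round(clock_drift_seconds(100), 4))        # 0.3433
```

Under these assumptions the clock is off by roughly a third of a second after 100 hours, which is a long time for a system tracking a fast-moving missile. Comparing the 23-bit and 47-bit truncations also shows how "the same" constant can differ between two parts of one program.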

•Found in the simulator, the software for the F-16 fighter would cause the plane to fly upside down whenever it crossed the equator. Cause: The minus sign was missing for latitudes south of the equator. The problem was found in the simulator (which is where problems are supposed to be found) and fixed.

•In Berlin in 1993, two trains had a head-on collision, killing two engineers and a passenger, because the computer had set the track to the holiday two-way traffic setting (it was a holiday). A superintendent reset the system to the weekday (one-way) setting to continue some construction on one of the tracks and overrode the warning because he thought it was in error. Cause: Accident caused by the difference between weekend and weekday schedules and failure to believe an alarm. Fix: Never ignore an alarm or a warning, even if you are sure it is in error.

•At a cement factory, three staged conveyor belts took large boulders to the top of a rock crusher. A flaw in the MOSTEK RAM chips would drop occasional bits, which caused the second conveyor to shut down while the first and third conveyors kept running. The boulders piled up at the base of the second conveyor, then toppled off into the parking lot and crushed several cars. Cause: Embedded chip failure, similar to a failure of the programming in an embedded system. Fix: This would not have been a problem if the program for the conveyors had included an adequate fail safe mode that switched off all three conveyors whenever there was a problem with any one of them.
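The fail safe described for the conveyor case amounts to an interlock: if any one belt stops or reports a fault, stop them all. A minimal sketch of that idea follows. The `Conveyor` class and `apply_interlock` function are hypothetical names for illustration and do not describe the factory's real control software.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Conveyor:
    name: str
    running: bool = True
    faulted: bool = False

def apply_interlock(conveyors: List[Conveyor]) -> List[bool]:
    """Stop every conveyor if any one has faulted or is not running.

    Returns the resulting running state of each conveyor, in order.
    """
    if any(c.faulted or not c.running for c in conveyors):
        for c in conveyors:
            c.running = False
    return [c.running for c in conveyors]

if __name__ == "__main__":
    belts = [
        Conveyor("first"),
        Conveyor("second", running=False, faulted=True),  # the belt that dropped out
        Conveyor("third"),
    ]
    print(apply_interlock(belts))  # [False, False, False]
```

The design choice is to fail toward the safe state: a stopped line costs production time, while a half-running line drops boulders into the parking lot.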

•There have been several deaths caused by industrial robots. Most happened because workers ignored the danger areas around the machines, but in a few cases the robot's actions were probably triggered by stray electronic interference. Cause: Workers in the vicinity ignored the off-limits areas around the robots. Fix: Better safety planning and education, and better fail safe routines.

•There have been six deaths from radiation overdoses delivered by one kind of cancer-treating machine (no longer in use) because of flaws in its software. Cause: Several different mistakes in the software allowed lethal doses.

•A 99-year-old man in an emergency room had a highly abnormal white blood count that the computer reported as normal because the readings were within the limits for an infant. The man was born in 1889. The computer assumed the birth year to be 1989. Cause: Two-digit date problem. The mistake was caught by a doctor.
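The emergency room case shows how a two-digit year throws away the century. The sketch below assumes a legacy record that stores only the last two digits and a program that always prepends "19". The function names and the reference year 1990 are made up for illustration and are not taken from the case.

```python
def stored_year(full_year: int) -> int:
    """What a two-digit field keeps: only the last two digits."""
    return full_year % 100

def assumed_year(yy: int, century: int = 1900) -> int:
    """How a legacy program might expand the field: always assume 19xx."""
    return century + yy

if __name__ == "__main__":
    reference_year = 1990  # arbitrary illustration year (assumption)

    # 1889 and 1989 collapse to the same stored value.
    print(stored_year(1889), stored_year(1989))  # 89 89

    true_age = reference_year - 1889
    computed_age = reference_year - assumed_year(stored_year(1889))
    print(true_age, computed_age)  # 101 1
```

Once the century is gone, no amount of arithmetic downstream can recover it; the program can only guess, and here it guesses an infant.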

•Failures and glitches with a new software package for a London ambulance service caused many delays, leading to the deaths of as many as 20 people when the ambulances didn't arrive in time. Cause: Inadequate testing of new software before implementation. (This could be a real problem for the year 2000 because of all the "new" programs that have to come on line at the same time and the lack of time left to do testing.) Fix: More testing and having a better contingency plan for getting to a particular address.

•In 1984, the massive power outage that affected 10 western states was caused by a computer error in a substation in Oregon. The computer propagated a minor event, misreading it as a major power outage.

•In Colorado Springs one child was killed and another injured when the traffic light systems continued in weekend mode and ignored the school schedule, because the time signal from the atomic clock in Boulder failed to reach them. Cause: Loss of the time signal. The same kind of problem will happen after 2000 if embedded systems in traffic lights reset themselves to January 1, 1900, which was a Monday. They will think Thursday, January 6, 2000, is a Saturday and go into weekend mode. At least we will have several days of warning on this.
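The weekday arithmetic in the traffic light scenario can be verified with a standard calendar library. This sketch assumes a clock that restarts at January 1, 1900 at the moment the real date becomes January 1, 2000 and then keeps counting days normally. `believed_date` is a hypothetical helper, not code from any real controller.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_name(d: date) -> str:
    return DAY_NAMES[d.weekday()]  # weekday(): Monday is 0

def believed_date(real: date,
                  rollover: date = date(2000, 1, 1),
                  reset_to: date = date(1900, 1, 1)) -> date:
    """Date the controller thinks it is, given the assumed reset behavior."""
    return reset_to + (real - rollover)

if __name__ == "__main__":
    real = date(2000, 1, 6)
    wrong = believed_date(real)
    print(day_name(date(1900, 1, 1)))  # Monday
    print(day_name(real))              # Thursday
    print(wrong, day_name(wrong))      # 1900-01-06 Saturday
```

So under this assumption the controller would run its weekend program on a real school day, five days after the rollover.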

•Continental Airlines rented aircraft by the whole day, even if the airplane was used for only one hour. The programmer computed the time used by subtracting the date the plane was signed out from the date it was signed back in, but forgot to round any leftover hours up to a full day. The billings came out one day short, which can be very expensive when renting airplanes. Cause: This is a good (and very expensive) example of an "off by one" error, so common in programming that it has its own name. It should have been found by Quality Assurance.
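One plausible reading of the aircraft billing bug is sketched below: subtracting calendar dates drops any partial day, while whole-day billing needs the elapsed time rounded up. The function names and the sample timestamps are invented for illustration; the original program is not available.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60

def days_by_date_subtraction(out: datetime, back: datetime) -> int:
    """The buggy approach: date minus date, partial days silently dropped."""
    return (back.date() - out.date()).days

def billable_days(out: datetime, back: datetime) -> int:
    """Whole-day billing: round elapsed time up, with a minimum of one day."""
    elapsed = (back - out).total_seconds()
    return max(1, math.ceil(elapsed / SECONDS_PER_DAY))

if __name__ == "__main__":
    out = datetime(1990, 3, 1, 9, 0)

    one_hour = datetime(1990, 3, 1, 10, 0)
    print(days_by_date_subtraction(out, one_hour), billable_days(out, one_hour))  # 0 1

    two_days_plus = datetime(1990, 3, 3, 12, 0)
    print(days_by_date_subtraction(out, two_days_plus),
          billable_days(out, two_days_plus))  # 2 3
```

In both sample rentals the subtraction comes up exactly one day short, which is the pattern a boundary-value test in Quality Assurance is meant to catch.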

•A man parked in the San Diego airport parking lot. His ticket was stamped 1992 Feb 30, a date that does not exist (1992 was a leap year, but even then February has only 29 days). On his return on March 10, he was presented with a bill for 342 days at $11.00 a day, totaling $3771.00. Cause: Mistake in the calendar and in the math. 2000 is also a leap year, but it is one of the special years that end in 00. Years ending in 00 are leap years only if they are divisible by 400. Many programmers have ignored this second leap year test. Quality Assurance (QA) should have caught this.
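Both mistakes in the parking bill are the kind a simple sanity check would flag. The sketch below rejects a stamp that is not a real calendar date and recomputes the charge; `parse_stamp` and `charge_dollars` are hypothetical helpers, and nothing here reflects the lot's real software.

```python
from datetime import date
from typing import Optional

def parse_stamp(year: int, month: int, day: int) -> Optional[date]:
    """Return the stamped date, or None if no such calendar date exists."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

def charge_dollars(days: int, daily_rate: int) -> int:
    return days * daily_rate

if __name__ == "__main__":
    print(parse_stamp(1992, 2, 30))  # None: February 30 never exists
    print(parse_stamp(1992, 2, 29))  # 1992-02-29: valid, 1992 was a leap year

    # Even taking the billed 342 days at face value, the total is off.
    print(charge_dollars(342, 11))   # 3762, not 3771

    # A stay from the last real day of February to March 10 is 10 days.
    print((date(1992, 3, 10) - date(1992, 2, 29)).days)  # 10
```

Validating the timestamp when the ticket is printed, rather than when the bill is computed, would have stopped the error before it reached a customer.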

•The first leap year test is that any year evenly divisible by 4 that doesn't end in 00 is a leap year. Some programmers forgot to apply even this test. The Feb 22, 1997 New Zealand Herald reported that at midnight, Dec 31, 1996, all 660 process control computers at an aluminum smelter in Southland shut down, destroying five pot cells. Two hours later in Tasmania the Comalco Bell Bay smelter shut down too, indicating that the problem was date related. The computers couldn't handle the extra leap day in 1996. Cause: Failure to program the 366th day into the software, or failure in the fail safe mode.
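The two leap year tests described in these cases combine into one short rule. The sketch below is a plain statement of the Gregorian calendar rule; the function names are chosen here for illustration.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def days_in_year(year: int) -> int:
    return 366 if is_leap_year(year) else 365

if __name__ == "__main__":
    for y in (1900, 1992, 1996, 1997, 2000):
        print(y, is_leap_year(y), days_in_year(y))
    # 1900 False 365
    # 1992 True 366
    # 1996 True 366
    # 1997 False 365
    # 2000 True 366
```

A program that drops the divisible-by-400 clause treats 2000 as a 365-day year, and one that forgets the rule entirely has no day 366 at all, which matches the failure reported at the smelters on the last day of 1996.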

•An ATM program for a bank in New Zealand also had a leap year problem. On January 1, 1992 the machines erased the magnetic stripe on customers' cards, then refused all further transactions. The bug in a National Cash Register program was found and fixed before it could travel around the globe with the new year. New Zealand will be the first large industrialized country to experience the first day of the new century and will be the first to experience any Y2K problems. New Zealand, then Australia, will show us what to expect. Unfortunately, we will only have a few hours to act at that point, although that will be enough time to pull systems known to be failing offline before they have a chance to crash.

Some of these problems aren't year 2000 related but this is the same kind of thing that will happen spontaneously when the century changes because problems can manifest in many different ways. There will be a lot of simple mistakes and some of them will cause serious losses. Parts of some programs will no longer work with other parts of the same program. Which will cause harm or cost money? There is no way to tell with certainty and there isn't enough time left to fix and check everything. The best thing to do now is to look at every business from a business system viewpoint and find and provide for any system failure. Have contingency plans in place for every system that just has to work. Make sure your suppliers and vendors, and any systems in between (transportation and communication) are also aware of the problem and are planning to deal with it.