The Golden Age of Software Testing
Istvan Forgacs
Posted On: June 9, 2022
LambdaTest asked me to write blogs for them. I liked the idea and accepted this kind request. I’ve been dealing with software testing since 1986. Originally, I was a researcher, and later I worked in the industry; thus, I’m knowledgeable about both the academic and the industrial sides of software testing. My idea is to write articles that start from the basics and progress to concrete problems practitioners face every day but for which a Google search gives poor results. Each article will be understandable on its own, like episodes of Friends or The Big Bang Theory, but they are also connected. The basic difference is that I cannot guarantee a happy ending; it’s up to you to reach it.
There are good books and documentaries about mathematics and physics, such as The Story of Maths, a four-part British television series outlining aspects of the history of mathematics. We are testers, and testing also has a history, even if it is much shorter! Unfortunately, there is neither a complete book nor any documentary about the history of testing. In this article, I emphasize only one period, the Golden Age of Software Testing. This period starts in 1975 and ends in 1980. I know that this choice is subjective, and you may disagree. But it’s my view, and I will try to tell you what happened during this period; unfortunately, I was too young to play a part in it.
The earliest conception was that you “write a program and then test and debug it.”
We know that Grace Hopper, a US Navy rear admiral and computer scientist, is credited with the word ‘debugging.’ She was a key contributor to COBOL, perhaps the most successful programming language ever, whose codebase is still in production and is estimated to exceed 800 billion lines.
Testing was considered a follow-on activity and embraced the effort to discover, correct, and remove errors. A number of the earliest papers on “testing” actually covered “debugging,” and the difficulty of correcting and removing errors was long thought to be the more interesting problem. It was not until 1957 that program testing was clearly distinguished from debugging. Even after this separation, software testing remained a side effect of programming, and testers had to wait until 1972 for the first-ever software testing conference. It was organized by William C. Hetzel and held at the University of North Carolina to bring together a broad cross-section of people interested in software testing. The first software testing book, ‘Program Test Methods,’ was published as an outgrowth of this conference and established the view that “testing” encompassed a wide array of activities all associated with “obtaining confidence that a program or system performed as it was supposed to.”
And now we arrive at 1975, when the golden age started. The trigger was the first-ever theoretical software testing research paper, by John B. Goodenough and Susan L. Gerhart: ‘Toward a Theory of Test Data Selection,’ published in IEEE Transactions on Software Engineering. This paper introduced some very important principles and methods of software testing.
The first concept is the data selection criterion (now we call it the test selection criterion), which determines when to stop designing more test cases, or how many test cases have to be designed to reach an acceptable software quality. The test selection criterion for exhaustive testing requires testing each possible input in the entire input domain. The criterion for the equivalence partition method is to select one test case for each equivalence class. It’s very important to choose an appropriate test selection criterion (or criteria) based on risk analysis (I will give details in a subsequent article).
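To make the equivalence partition criterion concrete, here is a minimal sketch in Python. The function classify_age and its partitioning are my own hypothetical example, not from the paper; the tests assume pytest is available.

import pytest

# Hypothetical function: maps an age to a ticket type.
def classify_age(age: int) -> str:
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

# The input domain splits into four equivalence classes; the criterion
# is satisfied by one representative test case per class.
def test_negative_is_rejected():
    with pytest.raises(ValueError):
        classify_age(-1)                     # class: age < 0

def test_child():
    assert classify_age(10) == "child"       # class: 0 <= age < 18

def test_adult():
    assert classify_age(30) == "adult"       # class: 18 <= age < 65

def test_senior():
    assert classify_age(70) == "senior"      # class: age >= 65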
The next concept is the reliability of test selection criteria. Here I use a simpler description to be more understandable. A test set T is reliable if the program passes every test t ∈ T if and only if the code is correct; in other words, if the code is faulty, there is at least one test in T for which the software fails. A test selection criterion C is reliable if any generated test set T(C) satisfying C is reliable. We will show that only the test selection criterion resulting in exhaustive testing is reliable for every program. Despite this, test selection criterion reliability is a very important concept. There may exist reliable test selection criteria for a class of programs or a class of programming language elements. For example, there is a reliable test selection criterion for testing predicates. If we use this criterion, we can be sure that each erroneous predicate will be detected. Unfortunately, the ISTQB glossary doesn’t contain any definition of test set or test selection criterion reliability.
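To give a feel for how a criterion can be reliable for predicates, here is a small illustration of the boundary idea; the predicate, the faulty variants, and the test points are my own construction, not the criterion from the literature.

correct = lambda x: x >= 1          # the intended predicate

# Plausible near-miss variants a programmer might write instead:
faulty_variants = [
    lambda x: x > 1,                # off-by-one: boundary excluded
    lambda x: x >= 2,               # shifted boundary
    lambda x: x <= 1,               # wrong direction
]

tests = [0, 1, 2]                   # a point on and points around the boundary x = 1

for variant in faulty_variants:
    # Each listed variant disagrees with the correct predicate on at
    # least one of the three test points, so each fault is revealed.
    assert any(variant(x) != correct(x) for x in tests)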
I think these authors were the first to mention equivalence partition testing. They suggested that a solution could be to weaken exhaustive testing so that the error-detection capability remains while the number of test cases becomes lower. Here is what they wrote:
‘Although some programs cannot be exhaustively tested, a basic hypothesis for the reliability and validity of testing is that the input domain of a program can be partitioned into a finite number of equivalence classes such that a test of a representative of each class will, by induction, test the entire class, and hence, the equivalent of exhaustive testing of the input domain can be performed.’
The next primary result is from Bill Howden’s famous paper ‘Reliability of the Path Analysis Testing Strategy.’ He showed that ‘there is no procedure which, given an arbitrary program P and output specification, will produce a nonempty finite test set T ⊂ D such that if P is correct on T, then P is correct on all of D. The reason behind this result is that the nonexistent procedure is expected to work for all programs, and thus familiar noncomputability limitations are encountered.’ What does this mean? The sad truth is that, except for exhaustive testing, there is no general method resulting in a reliable test set for all programs. Therefore, we testers cannot claim that we have found all the bugs after testing the software. But this is not an excuse for poor testing, as we can still find most bugs.
Though Goodenough and Gerhart introduced a classification of program errors, Howden introduced a better one that we can still use today. According to his classification, three types of errors exist (see the sketch after this list):
- Computation error – some calculation is faulty, such as x = 1 instead of x = 0
- Domain error – the control is faulty, such as if (x > 1) instead of if (x >= 1)
- Subcase error – something is missing from the code, such as a missing conditional assignment if (x == 1) y = 2
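Here is a small Python sketch of the three error classes, reusing the examples from the list above; the functions and their “intended” versions are my own hypothetical constructions.

# 1. Computation error: a calculation is faulty.
def f_correct(x): return 0              # intended
def f_computation(x): return 1          # faulty: x = 1 instead of x = 0

# 2. Domain error: the control is faulty, so some inputs take the wrong path.
def g_correct(x):
    if x >= 1:                          # intended boundary
        return "high"
    return "low"

def g_domain(x):
    if x > 1:                           # faulty: x = 1 now takes the wrong path
        return "high"
    return "low"

# 3. Subcase error: a whole case is missing from the code.
def h_correct(x, y):
    if x == 1:                          # intended special case
        y = 2
    return y

def h_subcase(x, y):
    return y                            # faulty: the `if x == 1` case is missing

# Only specific inputs expose each fault:
assert f_computation(0) != f_correct(0)      # any input reveals it
assert g_domain(1) != g_correct(1)           # only the boundary value x = 1 reveals it
assert h_subcase(1, 0) != h_correct(1, 0)    # only x == 1 reveals it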
If we know the error classes, we can find better test design methods or coverage criteria to detect them. For example, a test should traverse the fault to detect a computation error. But this is not enough, as the faulty result must also propagate to some output to manifest the fault as a failure. Domain errors can be detected by applying general predicate testing, which will be described in a subsequent article. The most difficult error class is the subcase error; fortunately, most of these can also be detected by applying some advanced test design techniques using states.
While test selection criteria are defined with respect to test design, i.e., independently of the implementation, test data adequacy criteria are defined with respect to program execution. ISTQB calls them coverage criteria. Almost all the test data adequacy criteria were introduced during this golden age. The two main classes of criteria are control flow and data flow. The most straightforward control flow criterion is statement coverage, but the minimum acceptable one is branch/decision coverage. Nowadays, several tools support these criteria.
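The difference matters in practice: a test set can reach 100% statement coverage while exercising only half of the branches. A minimal sketch, using a hypothetical withdraw function of my own:

def withdraw(balance: float, amount: float) -> float:
    if amount > balance:
        amount = balance                # cap the withdrawal
    return balance - amount

# One test executes every statement (the condition is true), so it
# reaches full statement coverage, yet the false branch of the decision
# is never exercised: only 50% branch coverage.
assert withdraw(10.0, 20.0) == 0.0

# A second test takes the false branch and completes branch coverage.
assert withdraw(10.0, 5.0) == 5.0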
The other class of test data adequacy criteria is based on data flow. These criteria are not widespread and remain almost unknown to testers. However, to find computation errors, data-flow coverage is necessary. The primary criterion is the all d-u pairs criterion, where d is a definition and u is a use. A definition is an assignment to a variable; a use is a reference to it. A d-u pair looks like the following:
x = 1     # definition of x
...       # no further assignment to x in between
y = x     # use of x
Here y gets the value of x. If y is an output and x = 1 is a faulty assignment statement, the bug will be detected. If the definition and the use sit in different conditional statements, satisfying branch coverage may not detect the bug, but d-u coverage will.
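To illustrate, here is a sketch of my own in which two tests achieve full branch coverage yet miss a computation error that the all d-u pairs criterion forces us to catch:

def compute(a: bool, b: bool) -> int:
    if a:
        x = 1       # definition of x -- BUG: should be x = 0
    else:
        x = 2       # definition of x
    if b:
        y = x       # use of x
    else:
        y = 5
    return y

# Branch coverage: these two tests execute all four branches and both
# pass, because the faulty definition never reaches the use.
assert compute(True, False) == 5
assert compute(False, True) == 2

# All d-u pairs: the pair (x = 1, y = x) additionally requires the test
# below, which propagates the faulty value to the output and fails
# (it returns 1 instead of the expected 0), revealing the bug.
# assert compute(True, True) == 0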
Howden’s result that there is no ideal method for creating reliable test sets doesn’t mean that software testing is dead. Fortunately, during the golden age, some hypotheses were introduced based on which test selection criteria and test sets can be reliable, and the software can get close to bug-free.
The first hypothesis is the competent programmer hypothesis (CPH), identified by DeMillo et al. in their famous 1978 paper ‘Hints on Test Data Selection: Help for the Practicing Programmer.’ They observed that “Programmers have one great advantage that is almost never exploited: they create programs that are close to being correct.” Developers do not implement software randomly. They start from a specification, and the software will be very similar to their expectations, hence close to the specification.
CPH makes test design possible. If the implemented code could be anything, we would have to test the application to differentiate it from an infinite number of alternatives. This would require an infinite number of test cases. Fortunately, based on CPH, we only have to test the application to separate it from the alternative specifications very close to the one being implemented.
To demonstrate CPH, let’s consider the following requirements:
R1. Input two integers, say x and y, from the standard input device
R2. If x and y are both negative, then print x + y to the standard output device
R3. If x and y are both positive, then print their greatest common divisor to the standard output device
R4. If x and y have opposite signs (one of them is positive and the other is negative), then print their product to the standard output device
R5. If x × y is zero, print zero to the standard output device.
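A minimal sketch of a program implementing R1–R5 (my own Python rendering; the requirements don’t prescribe a language):

import math

def process(x: int, y: int) -> int:
    if x * y == 0:                  # R5: x * y is zero
        return 0
    if x < 0 and y < 0:             # R2: both negative
        return x + y
    if x > 0 and y > 0:             # R3: both positive
        return math.gcd(x, y)
    return x * y                    # R4: opposite signs

# The test set T discussed next, written as executable checks:
assert process(-2, -3) == -5        # R2
assert process(4, 2) == 2           # R3: gcd(4, 2) = 2
assert process(2, -4) == -8         # R4
assert process(2, 0) == 0           # R5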
The test set T = { ([-2, -3]; -5), ([4, 2]; 2), ([2, -4]; -8), ([2, 0]; 0) }, where each pair lists the inputs and the expected output, is adequate. However, an alternative specification can be the following:
R1. Input two integers, say x and y from the standard input device
R2. If x and y are both negative, then print x + y to the standard output device
R3. If x and y are both positive and less than 100,000, print their greatest common divisor to the standard output device; otherwise, print the smaller value to the standard output device.
R4. If x and y have opposite signs (one of them is positive and the other is negative), then print their product to the standard output device.
R5. If x × y is zero, print zero to the standard output device.
Assume that the developer implemented this alternative specification. In this case, T would not be adequate, as all test cases would pass, yet the code would be wrong. What’s more, no pair of positive test inputs below 100,000 would be error revealing. If the implementation could be anything, we could easily modify the specification again and again so that no selection of any number of positive test input pairs would be reliable.
Fortunately, because of the competent programmer hypothesis, it is unrealistic that the modified R3 gets implemented, since this requirement is far from the original, and the probability that a developer mixes them up is minimal.
The coupling effect hypothesis was introduced in the same paper. It means that complex faults are coupled to simple faults in such a way that a test data set detecting all simple faults in a program will detect a high percentage of the complex faults as well. A simple fault is a fault that can be fixed by making a single change to a source statement; a complex fault is one that cannot. If this hypothesis holds, then it is enough to consider simple faults, i.e., faults where the correct code is modified by a single change.
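A small sketch of my own to illustrate the distinction (this paper is also the root of mutation testing, where simple faults are injected as single-change “mutants”):

def max_correct(a: int, b: int) -> int:
    return a if a > b else b

# Simple fault: fixable by a single change (flip `<` back to `>`).
def max_simple(a: int, b: int) -> int:
    return a if a < b else b

# Complex fault: no single change repairs it; both the condition and
# the returned expression are wrong.
def max_complex(a: int, b: int) -> int:
    if a < b:
        return a - 1
    return b

# Consistent with the coupling effect, a test set that kills the simple
# mutant also kills the complex one:
tests = [(1, 2), (2, 1)]
assert any(max_simple(a, b) != max_correct(a, b) for a, b in tests)
assert any(max_complex(a, b) != max_correct(a, b) for a, b in tests)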
Empirical studies show that this hypothesis holds in most cases. This is very good because, based on it, we can have test design techniques that detect almost all subcase errors. I will show you how in another article.
Here I listed only the most important results of the golden age of software testing, and I omitted some significant ones, such as domain testing by White and Cohen (1980). If you are interested in the history of testing, I suggest two articles: The Growth of Software Testing and The History of Software Testing.