The Numbers Behind NUMB3RS
Authors: Keith Devlin
In practice, the actual probabilities vary, depending on several factors, but the figures calculated above are generally taken to be a fairly reliable indicator of the likelihood of a random match. That is, the RMP is accepted as a good indicator of the rarity of a particular DNA profile in the population at large, although this interpretation needs to be viewed with care. (For example, identical twins share almost identical DNA profiles.)
The denominator in the FBI's claimed figure of 1 in 26 quintillion in the Jenkins case seems absurdly high, and really of little more than theoretical value, when you consider the likelihood of other errors, such as data entry mistakes, contamination errors during sample collection, or laboratory errors during the analysis process.
Nevertheless, whatever actual numbers you compute, it is surely the case that a DNA profile match on all thirteen of the sites used by the FBI is a virtually certain identification, provided that the match was arrived at by a process consistent with the randomness that underpins the RMP. As we shall see, however, the mathematics is very sensitive to how well that assumption is satisfied.
USING DNA PROFILING
Suppose that, as often occurs, the authorities investigating a crime obtain evidence that points to a particular individual as the criminal, but fails to identify the suspect with sufficient certainty to obtain a conviction. If the suspect's DNA profile is in the CODIS database, or if a sample is taken and a profile prepared, it may be compared with a profile taken from a sample collected at the crime scene. If the two profiles agree on all thirteen loci, then for all practical (and all legal) purposes, the suspect can be assumed to have been identified with certainty. The random match probability (1 in 10 trillion) provides a reliable estimate of the likelihood that the two profiles came from different individuals. (The one caveat is that relatives should be eliminated. This is not always easy, even for close relatives such as siblings; brothers and sisters are sometimes separated at birth and may not be aware that they have a sibling, and official records do not always correspond to reality.)
Of course, all that a DNA match does is identify, within a certain degree of confidence, an individual whose DNA profile was the same as that of a sample (or samples) found at the crime scene. It does not imply that the individual committed the crime. Other evidence is required to do that. For example, if semen taken from the vagina of a woman who was raped and murdered provides a DNA profile match with a particular individual, then, within the calculated accuracy of the DNA matching procedure, it may be assumed that the individual had sex with the woman not long before her death. Other evidence would be required to conclude that the man raped the woman, and possibly further evidence still that he subsequently murdered her. A DNA match is only that: a match of two profiles.
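As to the degree of confidence that can be vested in the identification of an individual by means of a DNA profile match obtained in the above manner, the issues to be considered are the likelihood of errors in collecting the samples and analyzing them in the laboratory, and the likelihood that the profile match is purely coincidental.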
A likelihood of 1 in 10 trillion attached to the second of these two possibilities (such as is given by the RMP for a thirteen-loci match) would clearly imply that the former possibility is far more likely, since hardly any human procedure can claim a one-in-ten-trillion fallibility rate. Put differently, if there is no reason to doubt the accuracy of the sample collection procedures and the laboratory analyses, the DNA profile identification could surely be viewed with considerable confidence. Provided, that is, the match is arrived at by comparing a profile from a sample from the crime scene with a profile taken from a sample from a suspect who has already been identified by means other than his or her DNA profile. But this is not what happened in the Jenkins case. There, Jenkins became a suspect solely as a result of investigators trawling through a DNA database (two databases, in fact) until a match was found: the so-called “cold hit” process.
And that brings in a whole different mathematical calculation.
COLD HIT SEARCHES
In general, a search through a DNA database, carried out to see if a profile can be found that matches the profile of a given sample (say, one obtained from a crime scene), is called a cold hit search. A match that results from such a search would be considered “cold” because prior to the match the individual concerned was not a suspect.
For example, CODIS enables government crime laboratories at a state and local level to conduct national searches that might reveal that semen deposited during an unsolved rape in Florida could have come from a known offender from Virginia.
As in the case where DNA profiling is used to provide identification of an individual who was already a suspect, the principal question that should be asked after a cold hit search has led to a match is: Does the match indicate that the profile in the database belongs to the same person whose sample formed the basis of the search, or is the match purely coincidental? At this point, the mathematical waters rapidly become unexpectedly murky.
To illustrate the problems inherent in the cold hit procedure, consider the following analogy. In a typical state lottery, the probability of winning a major jackpot is around 1 in 35,000,000. To any single individual, buying a ticket is clearly a waste of time. Those odds are effectively nil. But suppose that each week, at least 35,000,000 people actually do buy a ticket. (This is a realistic example.) Then, every one to three weeks, on average, someone will win. The news reporters will go out and interview that lucky person. What is special about that person? Absolutely nothing. The only thing you can say about that individual is that he or she is the one who had the winning numbers. You can make absolutely no other conclusion. The 1 in 35,000,000 odds tell you nothing about any other feature of that person. The fact that there is a winner reflects the fact that 35,000,000 people bought a ticket, and nothing else.
Compare this to a reporter who hears about a person with a reputation of being unusually lucky, accompanies them as they buy their ticket, and sits alongside them as they watch the lottery result announced on TV. Lo and behold, that person wins. What would you conclude? Most likely, that there has been a swindle. With odds of 1 in 35,000,000, it's impossible to conclude anything else in this situation.
In the first case, the long odds tell you nothing about the winning person, other than that they won. In the second case, the long odds tell you a lot.
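The arithmetic behind the first lottery case can be sketched in a few lines of Python. This is only an illustration: it assumes that every ticket is an independent random pick, and the function name prob_some_winner is a hypothetical helper, not something taken from the text. The odds and the number of tickets are the figures quoted above.

```python
import math

ODDS = 35_000_000      # 1-in-ODDS chance that a single ticket wins (figure from the text)
TICKETS = 35_000_000   # tickets sold in a week (figure from the text)

def prob_some_winner(odds: int, tickets: int) -> float:
    """Chance that at least one ticket wins, assuming independent random picks."""
    p_single = 1 / odds
    # 1 - (1 - p)^tickets, computed with log1p/expm1 to avoid rounding loss
    return -math.expm1(tickets * math.log1p(-p_single))

p_single = 1 / ODDS
p_any = prob_some_winner(ODDS, TICKETS)

print(f"Chance a given ticket wins:     {p_single:.2e}")
print(f"Chance somebody wins this week: {p_any:.3f}")
print(f"Average weeks between winners:  {1 / p_any:.2f}")
```

Under these assumptions a winner turns up in well over half of all weeks, even though each individual ticket is almost certain to lose. The long odds describe the ticket, not the crowd.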
A cold hit measured by RMP is like the first case. All it tells you is that there is a DNA profile match. It does not, in and of itself, tell you anything else, and certainly not that that person is guilty of the crime.
On the other hand, if an individual is identified as a crime suspect by means other than a DNA match, then a subsequent DNA match is like the second case. It tells you a lot. Indeed, assuming the initial identification had a rational, relevant basis (such as a reputation for being lucky in the lottery case), the long RMP odds against a match could be taken as conclusive. But as with the lottery example, in order for the long odds to have any weight, the initial identification has to be made before the DNA comparison is run (or at least demonstrably independent thereof). Do the DNA comparison first, and those impressive-sounding long odds could be meaningless.
NRC I AND NRC II
In 1989, eager to make use of the newly emerging technology of DNA profiling for the identification of suspects in a criminal case, including cold hit identifications, the FBI urged the National Research Council to carry out a study of the issue. The NRC formed the Committee on DNA Technology in Forensic Science, which issued its report in 1992. Titled DNA Technology in Forensic Science, and published by the National Academy Press, the report is often referred to as NRC I. The committee's main recommendation regarding the cold hit process was:
The distinction between finding a match between an evidence sample and a suspect sample and finding a match between an evidence sample and one of many entries in a DNA profile databank is important. The chance of finding a match in the second case is considerably higher.... The initial match should be used as probable cause to obtain a blood sample from the suspect, but only the statistical frequency associated with the additional loci should be presented at trial (to prevent the selection bias that is inherent in searching a databank).
In part because of the controversy the NRC I report generated among scientists regarding the methodology proposed, and in part because courts were observed to misinterpret or misapply some of the statements in the report, in 1993 the NRC carried out a follow-up study. A second committee was assembled, and it issued its report in 1996. Often referred to as NRC II, the second report, The Evaluation of Forensic DNA Evidence, was published by National Academy Press in 1996. The NRC II committee's main recommendation regarding cold hit probabilities was:
When the suspect is found by a search of DNA databases, the random-match probability should be multiplied by N, the number of persons in the database.
The statistic that NRC II recommends using is generally referred to as the “database match probability,” or DMP. This is an unfortunate choice of name, since the DMP is not a probability, although in all actual instances it is a number between 0 and 1, and it does (in the view of the NRC II committee) provide a good indication of the likelihood of getting an accidental match when a cold hit search is carried out. (The intuition is fairly clear. In a search for a match in a database of N entries, there are N chances of finding such a match.) For a true probability measure, if an event has probability 1, then it is certain to happen. However, consider a hypothetical case where a DNA database of 1,000,000 entries is searched for a profile having an RMP of 1/1,000,000. In that case, the DMP is
1,000,000 × 1/1,000,000 = 1
However, in this case the probability that the search will result in a match is not 1 but approximately 0.6321.
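The gap between the DMP and a true probability in this hypothetical case can be checked numerically. The sketch below assumes that the N database entries are independent and that each matches with probability equal to the RMP; the function names are illustrative choices, not terms used by the NRC.

```python
import math

def database_match_probability(rmp: float, n: int) -> float:
    """NRC II statistic: the RMP multiplied by the database size N.

    Despite its name this is not a probability; nothing stops it exceeding 1.
    """
    return rmp * n

def prob_at_least_one_match(rmp: float, n: int) -> float:
    """True chance of one or more coincidental matches among n entries,
    assuming the entries are independent: 1 - (1 - rmp)^n."""
    return -math.expm1(n * math.log1p(-rmp))

n = 1_000_000          # database size in the hypothetical case
rmp = 1 / 1_000_000    # random match probability in the hypothetical case

print("DMP:", database_match_probability(rmp, n))
print("P(at least one match):", round(prob_at_least_one_match(rmp, n), 4))
```

The second figure comes out at roughly 0.63, close to 1 - 1/e, because (1 - 1/N)^N approaches 1/e as N grows. When N × RMP is small the two quantities nearly coincide, which is why the DMP works as a rough guide in practice.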
The committee's explanation for recommending the use of the DMP to provide a scientific measure of the accuracy of a cold hit match reads as follows:
A special circumstance arises when the suspect is identified not by an eyewitness or by circumstantial evidence but rather by a search through a large DNA database. If the only reason that the person becomes a suspect is that his DNA profile turned up in a database, the calculations must be modified. There are several approaches, of which we discuss two. The first, advocated by the 1992 NRC report, is to base probability calculations solely on loci not used in the search. That is a sound procedure, but it wastes information, and if too many loci are used for identification of the suspect, not enough might be left for an adequate subsequent analysis.... A second procedure is to apply a simple correction: Multiply the match probability by the size of the database searched. This is the procedure we recommend.
This is essentially the same logic as in our analogy with the state lottery. In the Jenkins case, the DMP associated with the original cold hit search of the eight-loci Virginian database (containing 101,905 profiles) would be (approximately)
100,000 × 1/100,000,000 = 1/1,000
With such a figure, the likelihood of an accidental match in a cold hit search is quite high (recall the state lottery analogy). Thus, what seemed at first like a clear-cut case suddenly begins to look less so. That's what the courts think, too. At the time of writing, the Jenkins case is still going through the legal system, having become one of several test cases across the country.
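As a quick check on the Jenkins figure, the following sketch repeats the calculation with exact fractions, once with the rounded database size and once with the 101,905 profiles quoted above. The eight-loci RMP of 1 in 100,000,000 is the approximate value used in the text, and the helper name dmp is an assumption made for illustration.

```python
from fractions import Fraction

RMP_8_LOCI = Fraction(1, 100_000_000)  # approximate eight-loci RMP used in the text
DB_ROUNDED = 100_000                   # database size as rounded in the text
DB_EXACT = 101_905                     # profiles in the Virginia database

def dmp(rmp: Fraction, n: int) -> Fraction:
    """Database match probability in the NRC II sense: RMP times database size."""
    return rmp * n

for n in (DB_ROUNDED, DB_EXACT):
    value = dmp(RMP_8_LOCI, n)
    # Report the result both as a decimal and as "1 in k" odds
    print(f"N = {n:,}: DMP = {float(value):.6f} (about 1 in {round(1 / value):,})")
```

Using the exact database size moves the answer only slightly, from 1 in 1,000 to about 1 in 981, so the rounding in the text does not affect the conclusion.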
NUMBERS IN COURT: THE STATISTICAL OPTIONS
So far, the courts have shown reluctance for juries to be presented with the statistical arguments involved in cold hit DNA cases. This is reasonable. To date, experts have proposed at least five different procedures to calculate the probability that a cold hit identification produces a false positive, that is, identifies someone who, by pure happenstance, has the same profile as the sample found at the crime scene. The five procedures are: