Virus tests, probability and false-positives
Wednesday 3 June 2020
My apologies for another blog post (see the 13 May post) that includes the words "virus" and "probability", but some interesting mathematics is undoubtedly embedded in aspects of the current global pandemic.
A recent mathematical model constructed by epidemiologists at Columbia University in New York City estimated that for every documented novel coronavirus infection in the United States, 12 more go undetected. Determining whether or not someone is infected (or was infected and has developed antibodies) relies on some kind of medical test – and the results of any medical test are not certain. We need to seriously consider the probabilistic realities of medical testing – especially the concepts of false-positives and false-negatives.
Firstly, it will help to clarify some of the terminology often used when discussing the current virus pandemic. According to the European Centre for Disease Prevention and Control, Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) is the name given to the 2019 novel coronavirus. COVID-19 is the name given to the disease associated with the virus. SARS-CoV-2 is a new strain of coronavirus that has not been previously identified in humans.
Here are some other important terms associated with the mathematics (namely, probability) of medical testing.
■ Prevalence rate (or base rate): The prevalence rate for a virus infection is the proportion of people in a population that are infected with the virus. It is important to note that the prevalence rate for novel coronavirus infection will be higher than the prevalence rate for COVID-19 disease because not everyone who becomes infected with the novel coronavirus (SARS-CoV-2) develops the disease. Typically, prevalence rates are estimated using other data. At the moment, there are a few estimates for the prevalence rate for COVID-19 (the disease) in different countries but very little on estimated prevalence rates for infection with the novel coronavirus. Although we are nearly three months into the global pandemic, the data available on different aspects of the pandemic is patchy and unreliable. Also, it is worth noting that the word “rate” is often used synonymously with “probability.” For example, if the prevalence (or base) rate for the novel coronavirus were 5% in a certain population, then the probability that a person chosen at random from that population is infected would be 0.05.
■ False-positive and false-negative: According to the excellent book, The Art of Statistics: Learning from Data by David Spiegelhalter, a false-positive is “an incorrect classification of a ‘negative’ case as a ‘positive’ case.” It follows that the reverse situation is a false-negative.
Consider an example where a test is created to detect a disease. Let the following symbols represent the indicated events. Positive test result: Pos; negative test result: Neg; person has the disease: D; person does not have the disease: ~D. It is established that the disease has a 0.02 prevalence rate. There is a test with a 0.9 probability of giving a positive result when a person has the disease (the test’s sensitivity is 0.9), and a 0.95 probability of giving a negative result when a person does not have the disease (the test’s specificity is 0.95). In this case, according to Spiegelhalter, there is a false-positive probability (or rate) of 0.05, because 5% of the people who do not have the disease will receive a positive test result, and a false-negative probability of 0.1, because 10% of the people who do have the disease will receive a negative test result. Experts have commented that the sensitivity of a test for SARS-CoV-2 is likely to be lower than the test’s specificity: many of the tests are difficult to carry out, which sometimes leaves the sample taken from an infected person defective, so the virus goes undetected.
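As a minimal sketch of the definitions in this example, the two error rates are simply the complements of the sensitivity and specificity. The variable names are my own, chosen for illustration, and the figures are those of the hypothetical test above, not of any real SARS-CoV-2 test.

```python
from fractions import Fraction

# Figures for the hypothetical test in the example (exact fractions avoid
# floating-point rounding noise in the subtraction).
sensitivity = Fraction("0.9")   # P(Pos | D)
specificity = Fraction("0.95")  # P(Neg | ~D)

# Each error rate is the complement of the matching accuracy figure.
false_negative_rate = 1 - sensitivity  # P(Neg | D)
false_positive_rate = 1 - specificity  # P(Pos | ~D)

print(f"false-negative rate: {float(false_negative_rate):.2f}")
print(f"false-positive rate: {float(false_positive_rate):.2f}")
```

Note that both rates are forward conditional probabilities: each is conditioned on whether the person has the disease, not on the test result.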
However, when it comes to medical testing the phrase false-positive (or false-negative) is not always consistently applied. For our example, the false-positive probability of 0.05 is a conditional probability (in the SL and HL maths syllabus); it is the probability that your test is positive given that you do not have the disease. That is, \({\textrm{P}}\left( {Pos\left| { \sim D} \right.} \right) = 0.05\). This is what we might refer to as a forward conditional probability because it follows the chronological order that first you either have or do not have the disease and then you are tested. What about the backward conditional probability? This is the probability that you have the disease given that your test is positive, i.e. \({\textrm{P}}\left( {D\left| {Pos} \right.} \right)\), which is ‘backwards’ in that it first takes into account your test result and then asks whether or not you have the disease. This type of backward conditional probability can be computed using Bayes’ theorem (in the HL maths syllabus). However, I think the various probabilities related to this example are better illustrated by considering what would occur on average if a large sample of people, e.g. 100 000, are tested. See the table below.
On average, 6700 of the 100 000 people tested receive a positive result, but only 1800 of them actually have the disease. Thus, the backward conditional probability that you have the disease given that your test is positive is given by \(\frac{{1800}}{{6700}} \approx 0.2686567\); \({\textrm{P}}\left( {D\left| {Pos} \right.} \right) \approx 0.269\).
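The counts behind this ratio can be reproduced with a short sketch. The function name `expected_counts` and its dictionary keys are hypothetical, chosen only for this illustration; the inputs are the prevalence, sensitivity and specificity from the example, applied to a sample of 100 000 people.

```python
def expected_counts(sample_size, prevalence, sensitivity, specificity):
    """Average number of people in each cell of the test-result table."""
    diseased = round(sample_size * prevalence)
    healthy = sample_size - diseased
    true_pos = round(diseased * sensitivity)  # have the disease, test positive
    false_neg = diseased - true_pos           # have the disease, test negative
    true_neg = round(healthy * specificity)   # no disease, test negative
    false_pos = healthy - true_neg            # no disease, test positive
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
    }


counts = expected_counts(100_000, 0.02, 0.9, 0.95)
positives = counts["true_pos"] + counts["false_pos"]
negatives = counts["false_neg"] + counts["true_neg"]

# Print the table: rows are disease status, columns are test results.
print(f"{'':<8}{'Pos':>8}{'Neg':>8}{'Total':>8}")
print(f"{'D':<8}{counts['true_pos']:>8}{counts['false_neg']:>8}"
      f"{counts['true_pos'] + counts['false_neg']:>8}")
print(f"{'~D':<8}{counts['false_pos']:>8}{counts['true_neg']:>8}"
      f"{counts['false_pos'] + counts['true_neg']:>8}")
print(f"{'Total':<8}{positives:>8}{negatives:>8}{positives + negatives:>8}")

# Backward conditional probability P(D | Pos), read straight off the counts.
print(f"P(D | Pos) = {counts['true_pos'] / positives:.3f}")
```

Working with whole-number counts rather than probabilities is the same "what would happen on average" view used in the text: the answer is the share of the positive column that sits in the D row.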
Here is the calculation of \({\textrm{P}}\left( {D\left| {Pos} \right.} \right)\) using Bayes’ theorem:
\({\textrm{P}}\left( {D\left| {Pos} \right.} \right) = \frac{{{\textrm{P}}\left( D \right) \cdot {\textrm{P}}\left( {Pos\left| D \right.} \right)}}{{{\textrm{P}}\left( D \right) \cdot {\textrm{P}}\left( {Pos\left| D \right.} \right) + {\textrm{P}}\left( { \sim D} \right) \cdot {\textrm{P}}\left( {Pos\left| { \sim D} \right.} \right)}} = \frac{{0.02 \cdot 0.9}}{{0.02 \cdot 0.9 + 0.98 \cdot 0.05}} \approx 0.2686567\)
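The same calculation can be written as a small function. This is only a sketch: the name `posterior_given_positive` and its parameter names are my own, and exact fractions are used so that the result can be compared with the ratio of counts 1800/6700.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(D | Pos) computed with Bayes' theorem."""
    true_pos = prevalence * sensitivity                 # P(D) * P(Pos | D)
    false_pos = (1 - prevalence) * false_positive_rate  # P(~D) * P(Pos | ~D)
    return true_pos / (true_pos + false_pos)


# Values from the example: P(D) = 0.02, P(Pos | D) = 0.9, P(Pos | ~D) = 0.05.
p = posterior_given_positive(Fraction("0.02"), Fraction("0.9"), Fraction("0.05"))

print(p)                   # exact value as a fraction
print(f"{float(p):.7f}")   # decimal approximation
```

The exact result is 18/67, which is 1800/6700 in lowest terms, so Bayes' theorem and the count-based approach agree.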
\({\textrm{P}}\left( {D\left| {Pos} \right.} \right) \approx 0.269\) says that if your test is positive there is only about a 27% chance that you have the disease. Is this a false-positive probability? If we strictly apply Spiegelhalter’s definition of a false-positive as “an incorrect classification of a ‘negative’ case as a ‘positive’ case” then we should only call the forward conditional probability \({\textrm{P}}\left( {Pos\left| { \sim D} \right.} \right)\) a false-positive. However, it is very common for the backward probability \({\textrm{P}}\left( {D\left| {Pos} \right.} \right)\) to also be referred to as a false-positive (strictly, it is the complement, \({\textrm{P}}\left( { \sim D\left| {Pos} \right.} \right) \approx 0.731\), that gives the chance that a positive result is wrong). I think that one of the reasons for this is that this probability is considerably lower than expected. For our example, it is counterintuitive that a test that detects the disease in people who have it with an accuracy of 90% (the test’s sensitivity) can only give us a 27% probability that a positive test means we actually have the disease. Why is this? Well, if a disease is not common then it is more likely that your positive result is a mistake. In the ‘average’ calculations for the sample of 100 000 shown above, the number of people who test positive and have the disease (1800) is much smaller than the number who test positive and do not have the disease (4900). The more prevalent the disease, the more you should trust a positive test. Hence, the prevalence rate is a critical statistic when ascertaining how much we can trust a positive test result.
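To see how strongly the prevalence rate drives this result, here is a hedged sketch that keeps the hypothetical test fixed (sensitivity 0.9, specificity 0.95) and varies only the prevalence. The function name and the list of prevalence values are assumptions made for illustration; only the 0.02 case comes from the example above.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D | Pos): the chance that a positive result is correct."""
    true_pos = prevalence * sensitivity               # P(D) * P(Pos | D)
    false_pos = (1 - prevalence) * (1 - specificity)  # P(~D) * P(Pos | ~D)
    return true_pos / (true_pos + false_pos)


# Illustrative prevalence rates; 0.02 is the value used in the example.
prevalences = [0.001, 0.02, 0.10, 0.50]
results = [positive_predictive_value(p, 0.9, 0.95) for p in prevalences]

for prevalence, ppv in zip(prevalences, results):
    print(f"prevalence {prevalence:6.1%} -> P(D | Pos) = {ppv:.3f}")
```

With the test itself unchanged, the probability that a positive result is correct rises from under 2% when one person in a thousand is infected to about 95% when half the population is.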
This leads directly to one of the biggest challenges facing scientists and politicians as they try to advise the general public on how best to live with a new virus that some experts have said may infect every human on the planet unless an effective vaccine is developed. This challenge is determining an accurate measure of what proportion of a population is currently infected with the novel coronavirus (i.e. the prevalence rate). Some recent studies have attempted to do this, but the results have varied widely – giving prevalence rates from around 1% up to 28% (granted, these studies have been conducted in different populations/countries). Unfortunately, we will not know how reliable the tests for detecting whether someone is infected with SARS-CoV-2 are until we have a better grasp of the probability that a random member of a population is infected, whether or not they have any symptoms associated with the COVID-19 disease.
A colorized scanning electron micrograph of a cell infected with coronavirus particles, in yellow (National Institute of Allergy and Infectious Diseases, USA)