Sensitivity and Specificity

This tutorial walks through the statistical concepts of sensitivity and specificity. In my experience, people get confused when they try to learn them first as a formula and not as a concept. The words sensitive and specific are not difficult, so if you start from there, you will find the concepts much easier to learn. This tutorial is the joint effort of Profjameson and two Pharm. D. candidates, Caleb Bryant and Nicholas Anderson.

Ideally all tests would be very sensitive and very specific. Unfortunately, that is rarely the case. Sensitivity and specificity are more or less important depending on the purpose of the test. A broad screening test needs to be fairly sensitive to be of any value. On the other hand, a test must be specific to be of value for a definitive diagnosis. That is why the more sensitive (and less specific) ELISA test is used to screen for HIV, but a Western Blot (very specific) test is done to confirm it.

Interestingly, the rapid strep screen is only about 75% sensitive (misses 25% of people who truly have strep), but is fairly specific (very few false positives).
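
To make the two words concrete, here is a minimal sketch that computes both from the four cells of a test-versus-truth table. The counts are hypothetical, picked only so the sensitivity matches the 75% quoted for the rapid strep screen; the 95% specificity and the function names are assumptions for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Share of people who truly have the disease that the test catches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of people without the disease that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical rapid strep results: 100 people with strep, 100 without.
strep_sens = sensitivity(true_pos=75, false_neg=25)   # misses 25 of 100 true cases
strep_spec = specificity(true_neg=95, false_pos=5)    # 5 false positives in 100

print(f"sensitivity = {strep_sens:.0%}, specificity = {strep_spec:.0%}")
```

A sensitive test keeps the false negatives low; a specific test keeps the false positives low.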

Type One and Type Two Error

The purpose of inferential statistics is to estimate differences between groups in the general population by measuring the difference in a small sample. For example:


  • Blood pressure lowering of two drugs: Whocaresapine vs Lowpressure
  • The rate of venous thrombosis in knee replacements prophylaxed with Digabigatran vs Goldiloxaparin

As you know, a p value is, loosely, the probability that a difference at least as large as the one observed would turn up through random chance alone if there were truly no difference. However, there are two main types of errors you can make.


Type One Error in statistics
You detected a difference in the sample when there truly is no difference in the larger population (oops)

  • This is like a false positive (also known as an alpha error)
  • Associated with alpha / p-value
    • Alpha is the highest ACCEPTABLE probability that the measured outcome was due to random chance
        Standard value for alpha is 0.05 (split as 0.025 in each tail for a two-tailed test)
  • The p value is the MEASURED probability that your outcome is due to random chance
  • If your p-value (measured) is less than alpha (highest acceptable) then the difference is considered to be unlikely to have occurred due to chance.
  • A type I error can only occur when your p value is less than alpha. However, as your p-value increases towards alpha it is more likely that you are committing a type I error.
  • The probability of random chance producing a difference accumulates with multiple comparisons (for a handful of comparisons it is roughly additive). The more things you compare, the more likely you are to commit a type I error.
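
The multiple-comparisons point can be sketched in a few lines. This assumes the comparisons are independent, each tested at an alpha of 0.05, and that no real differences exist; the function name is made up for illustration.

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability of at least one type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


for k in (1, 3, 10, 20):
    exact = chance_of_false_positive(k)
    additive = min(k * 0.05, 1.0)   # the simple "just add them up" estimate
    print(f"{k:2d} comparisons: {exact:.0%} (additive estimate {additive:.0%})")
```

Simply adding the alphas is a fair estimate for a few comparisons and overshoots as the number grows, but the direction is the same: more comparisons, more chances for a type I error.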

Let’s do a fun example:

  • A new drug (LowStress) is coming to market. During the testing period, LowStress was shown to make people happier than placebo. On placebo, 15% of people were happy. On LowStress, 22% of people were happy. The alpha was set at 0.05, and the p-value that LowStress made more people happy versus placebo was 0.04. This indicates that there is a 4% chance that more people were happy due to random events not related to LowStress.
  • An alpha value of 0.05 means we accept up to a 5% probability that more people would be happy due to random chance and not LowStress, which is the standard acceptable value. Because the p-value of 0.04 is less than alpha, we believe that the results seen with LowStress were not due to random chance.
  • Notice: There is still a 4% chance (1 out of 25) that this difference was, in fact, due to random chance. If it is due to random chance, we have committed a type I error.
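
If you like to see a type I error happen, here is a small simulation sketch. It assumes LowStress does nothing at all (both groups have a true 15% happiness rate), 200 people per group, and a simple pooled z-test for two proportions; the group size, the choice of test, and the function names are assumptions for illustration, not details from the example above.

```python
import random
from statistics import NormalDist


def two_proportion_p_value(happy_a, happy_b, n):
    """Two-sided p value from a pooled z-test on two groups of size n."""
    pooled = (happy_a + happy_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:                       # everyone (or no one) happy in both groups
        return 1.0
    z = (happy_a - happy_b) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.15, n=200, trials=2000, alpha=0.05, seed=1):
    """Fraction of simulated no-difference studies that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = sum(rng.random() < true_rate for _ in range(n))
        b = sum(rng.random() < true_rate for _ in range(n))
        if two_proportion_p_value(a, b, n) < alpha:
            hits += 1
    return hits / trials


print(f"studies with p < 0.05 despite no real effect: {false_positive_rate():.1%}")
```

The printed share should land somewhere near alpha: every one of those "significant" studies is a type I error.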

Type Two Error
You fail to detect a difference in the Sample when there truly is a difference in the Population

AKA false negative or a beta error
  • Beta is directly related to power (1-beta = power).
    • Acceptable standard for power is 80% (see 1 minute genus on power), therefore the acceptable standard for beta is 20%.
    • This means that there is a 20% chance that you will fail to detect a difference when one truly exists; put another way, you are 80% confident that you would have detected a difference in the sample if it exists in the population
  • As power increases your risk of committing a type II error decreases
  • If your p value is statistically significant (<0.05) then you had enough power!!! This holds even if the number in the study was less than originally ESTIMATED; it was only an estimate. If your p value is statistically insignificant (>0.05), then either there is really no difference OR you committed a type II error.

    Let’s revisit our fun example:

    You had drastic cuts made to your research budget and could only enroll 30 people in each group (LowStress vs Placebo). You found that 15% of people on placebo were happy and 22% of people on LowStress were happy. But the p value is 0.34. You found no statistically significant difference. However, you have probably committed a Type II error due to the small study size (inadequate power).
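
To get a feel for how little power 30 people per group buys, here is a rough sketch. It assumes the true rates really are 15% and 22%, a two-sided alpha of 0.05, and a normal-approximation z-test for two proportions; the function name and the choice of approximation are assumptions for illustration.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    se = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p2 - p1) / se
    # Chance the observed z lands beyond either critical value.
    return NormalDist().cdf(z_effect - z_crit) + NormalDist().cdf(-z_effect - z_crit)


small = power_two_proportions(0.15, 0.22, n_per_group=30)
print(f"power with 30 per group: {small:.0%}")
```

Under these assumptions the power comes out at roughly one in ten, far below the usual 80% standard, so a non-significant p value here says very little about whether LowStress works.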

    Power Calculations

    Statistical Power

    OK, so John Lennon didn’t really write this, but statistical power is a very abstract concept and the ability to “imagine” really helps.

    Power is the probability that you will find a statistically significant difference in your study SAMPLE if it truly exists in the larger POPULATION.

    Beta is the probability that you will not be able to detect a difference if it is truly there in the population.

    Hypothetical Dilemma

    The study you can’t afford: There are 72,000,000 people with hypertension in the U.S. If you could study them all you would find that the new drug Lowpressure® lowers blood pressure by 7 mm Hg more than Whocaresapine.

    To test the difference on an affordable scale, you need:

    Power Calculations Before the Study !!

    We will describe the process in four simple steps.

    power before the study one


    Decide how big a difference you consider clinically important.
    For our Hypothetical Dilemma Example: You think a difference of 7 mm Hg or more is clinically important.


    power before the study two


    How variable is the outcome we are testing? (this is a guess, based on available facts)
    Fact: The measure of variability used in power calculations is variance, or (standard deviation)².

    Hypothetical Dilemma: From previous studies, we know that the standard deviation of blood pressure has been 5 mm Hg (so the variance for the calculation would be 5² = 25).

    Fact: The more variable the outcome, the more difficult it is to be statistically confident that the difference you observe is real and not due to random chance (or variation).

    Fact: The more variable the data, the more people you have to study to get statistical significance.

    power before the study three


    How sure do we need to be?

    The usual beta is 0.20 (giving a power of 80%)
    If you haven’t picked this up yet, One minus beta = Power.

    power before the study four


    The Dreaded Calculation

    Because the concept is the important thing, we will spare you the headache of the power equation and just tell you that these assumptions yield a calculated N required of approximately 100 in each group.

    Hypothetical Dilemma Example: You will need to study 200 people, randomized to the drug “Lowpressure” or the drug “Whocaresapine”, to have an 80 percent power to detect a difference of 7 mm Hg or more.
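
For the curious, the four steps can be sketched with the common normal-approximation formula for comparing two means, n per group = 2 × (z for alpha + z for power)² × variance / difference². This particular shortcut, the two-sided alpha of 0.05, and the function name are assumptions for illustration; it is not necessarily the calculation behind the figure of roughly 100 per group quoted above, and the answer swings widely with the standard deviation you plug in.

```python
import math
from statistics import NormalDist


def n_per_group(difference, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided alpha
    z_beta = NormalDist().inv_cdf(power)            # power = 1 - beta
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / difference ** 2)


# More variability or a smaller difference both push the required N up.
for sd in (5, 10, 20):
    print(f"SD {sd:2d} mm Hg, difference 7 mm Hg: {n_per_group(7, sd)} per group")
```

Because the variance sits in the numerator, doubling the standard deviation roughly quadruples the number of people needed, which is the point of step two.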


    However, studies often don’t enroll exactly the number of people they need so you may have to do ….

    Power Calculations After the Study !!

    Don’t despair, there are only three steps for this part.
    power after the study one


    If you found a statistically significant difference (p less than 0.05), you had enough POWER. You don’t need power calculations. Really.


    power after the study two


    If you found a statistically non-significant difference (p greater than 0.05), there are two main possibilities.

    A. There really is no difference in the population.

    B. There is a difference, but you didn’t have enough power to detect it. (Congratulations! You have succeeded in making a Type II error.)

    power after the study three


    Since you are the insatiably curious type, we can now calculate the power we had to detect the difference we said was clinically important.

    We will use:

    • N: the number of people you actually enrolled
    • Sigma squared: the variance of the blood pressures actually measured in your study sample (not estimated as before)
    • The difference you decided before the study was clinically important (7 mm Hg in this case)
    • The power you calculate from this is the probability you had of detecting a clinically important difference if it is present in the larger population.
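
Those three ingredients can be combined in a short sketch. It uses a normal-approximation z-test for two means with a two-sided alpha of 0.05; the enrollment of 40 per group and the measured standard deviation of 18 mm Hg are hypothetical numbers, and the function name is an assumption for illustration.

```python
from statistics import NormalDist


def power_two_means(difference, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    se = sd * (2 / n_per_group) ** 0.5              # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(difference) / se
    return NormalDist().cdf(z_effect - z_crit) + NormalDist().cdf(-z_effect - z_crit)


# Hypothetical study: 40 people per group, measured SD of 18 mm Hg.
achieved = power_two_means(difference=7, sd=18, n_per_group=40)
print(f"power to detect a 7 mm Hg difference: {achieved:.0%}")
```

With these made-up inputs the power falls well short of 80%, so a non-significant result would leave possibility B (a type II error) wide open.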



    Number Needed to Treat

    Definition: The Number Needed to Treat (NNT) is the number of patients that you would need to treat to prevent one primary outcome (heart attack, death, stroke, whatever).

    • This applies to patients with the same problem studied, treated for the same duration as the study
    • Calculation:

      • First calculate the Absolute Risk Reduction (ARR)
      • Then take the ARR in decimal form (e.g. .05 for 5%) and divide it INTO 1. (1/ ARR = NNT)
      • Example:
        – 8% stroke rate with A. Fib decreased to 3% with Coumadin (warfarin)
        – Absolute risk reduction of 5%
        – NNT = 1 / ARR = 1 / 0.05 = 20
        Therefore you need to treat 20 A. Fib patients for one year with Coumadin to prevent one stroke.

        Number Needed to Harm (NNH): this is the same concept as the Number Needed to Treat except that you use:
              Incidence of Adverse Effect MINUS Incidence in the Placebo Group = Absolute Risk Increase

        The calculation is then the same using Absolute Risk Increase instead of ARR.

        – Incidence of gynecomastia is almost zero with placebo
        – Incidence of gynecomastia is 10% with spironolactone
        – Therefore: Absolute increase in risk is 10% – 0% = 10%
        – NNH = 1 / 0.10 = 10. You need to treat 10 patients with spironolactone to cause one case of gynecomastia.
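
Both calculations are the same one-liner, as this small sketch shows. The function name is made up for illustration, and rounding up to whole patients is a common convention assumed here rather than something stated above.

```python
import math
from fractions import Fraction


def number_needed(rate_without, rate_with):
    """1 / |absolute risk difference|, rounded up to whole patients."""
    difference = abs(Fraction(rate_without) - Fraction(rate_with))
    if difference == 0:
        raise ValueError("no difference in risk, so NNT/NNH is undefined")
    return math.ceil(1 / difference)


# Rates are passed as strings so Fraction keeps them exact (no float rounding).
nnt = number_needed("0.08", "0.03")   # stroke: 8% untreated vs 3% treated
nnh = number_needed("0.00", "0.10")   # gynecomastia: ~0% placebo vs 10% spironolactone

print(f"NNT = {nnt}, NNH = {nnh}")    # NNT = 20, NNH = 10
```

The smaller the absolute difference in risk, the more patients you have to treat before one of them benefits (or is harmed).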

    Absolute vs Relative Risk Reduction

    Albert and I developed an acute interest in risk reduction at about 3500 feet.


    Example 1A:

    • Consider the benefit of using Coumadin for Stroke prevention in Atrial Fibrillation. Moderate risk patients on placebo have 8% risk of stroke in ONE year
    • Coumadin decreases that to 3% risk of stroke in ONE year
    • Quick !! Instinctively, what is the risk reduction? ….. 5%, right? That’s absolute risk reduction, NOT relative to anything else.

      Relative Risk Reduction is RELATIVE to the baseline 8%, so… 5% reduction / 8% baseline = 0.05 / 0.08 = 0.625, or 62.5% relative risk reduction

      Example 1B:
      OK, now consider if there was a very high baseline risk of 93%

      • Suppose Coumadin decreased the risk to 88%
      • Quick !! The absolute reduction is? …. You’re right! 5% (the same as the first example)
      • The relative risk reduction, though, is different: 5 / 93 = 5.4% relative risk reduction

      So which is more important, absolute reduction or relative reduction? Well, they each give you different kinds of information. I prefer the absolute risk reduction, but both are important. See also the Number Needed To Treat.
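
Here are the two examples side by side as a minimal sketch; the function name is made up for illustration, and risks are written as fractions of 1.

```python
def risk_reductions(baseline_risk, treated_risk):
    """Return (absolute, relative) risk reduction as fractions of 1."""
    absolute = baseline_risk - treated_risk
    relative = absolute / baseline_risk
    return absolute, relative


for label, baseline, treated in (("Example 1A", 0.08, 0.03),
                                 ("Example 1B", 0.93, 0.88)):
    arr, rrr = risk_reductions(baseline, treated)
    print(f"{label}: ARR = {arr:.1%}, RRR = {rrr:.1%}")
```

The absolute reduction is identical in both examples, while the relative reduction shrinks dramatically when the baseline risk is high.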

    Evils of Pickle Eating-101

    Pickles are associated with all the major diseases of the body. Eating them breeds war and Communism. They can be related to most airline tragedies. Auto accidents are caused by pickles. There exists a positive relationship between crime waves and consumption of this fruit of the cucurbit family. For example …

    Nearly all sick people have eaten pickles. The effects are obviously cumulative.

    • 99.9% of all people who die from cancer have eaten pickles.
    • 100% of all soldiers have eaten pickles.
    • 96.8% of all Communist sympathizers have eaten pickles
    • 99.7% of the people involved in air and auto accidents ate pickles within 14 days preceding the accident.
    • 93.1% of juvenile delinquents come from homes where pickles are served frequently.

    Evidence points to the long term effects of pickle eating.
    Of the people born in 1839 who later dined on pickles, there has been a 100% mortality.

    • All pickle eaters born between 1849 and 1859 have wrinkled skin, have lost most of their teeth, have brittle bones and failing eyesight if the ills of pickle eating have not already caused their death.
    • Even more convincing is the report of a noted team of medical specialists: rats force fed with 20 pounds of pickles per day for 30 days developed bulging abdomens. Their appetites for WHOLESOME FOOD were destroyed.

    In spite of all the evidence, pickle growers and packers continue to spread their evil. More than 120,000 acres of fertile U.S. soil are devoted to growing pickles. Our per capita consumption is nearly four pounds.
    Eat orchid petal soup. Practically no one has as many problems from eating orchid petal soup as they do with eating pickles.

    SOURCE: “Evils of Pickle Eating,” by Everett D. Edington, originally printed in Cyanograms.