Debiasing Strategies


All men make mistakes, but a good man yields when he knows his course is wrong, and repairs the evil. The only crime is pride.

— Sophocles, Antigone

(Thanks to my friend Michelle Tanner, MD, who contributed immensely to this article.)

In the post Cognitive Bias, we went over a list of cognitive biases that may affect our clinical decisions. There are many more, and sometimes these biases are given different names. Rather than use the word bias, many authors, including the thought-leader in this field, Pat Croskerry, prefer the term cognitive dispositions to respond (CDR) to describe the many situations where clinicians’ cognitive processes might be distorted, including the use of inappropriate heuristics, cognitive biases, logical fallacies, and other mental errors. The term CDR is thought to carry less of a negative connotation, and indeed, physicians have been resistant to interventions aimed at increasing awareness of and reducing errors due to cognitive biases.

After the publication of the 2000 Institute of Medicine report To Err Is Human, which attributed up to 98,000 deaths per year to medical errors, many efforts were made to reduce errors in our practices and systems. Development of multidisciplinary teams, computerized order entry, clinical guidelines, and quality improvement task forces have attempted to lessen medical errors and their impact on patients. We have seen an increased emphasis on things like medication safety cross-checking, reduction in resident work hours, the use of checklists in hospital order sets, and ‘time-outs’ in the operating room. But most serious medical errors actually stem from misdiagnosis. Yes, every now and again a patient might have surgery on the wrong side or receive a medication that interacts with another medication, but at any given time, up to 15% of patients admitted to the hospital are being treated for the wrong diagnosis – with interventions that carry risk – while the actual cause of their symptoms remains unknown and likely untreated. To Err Is Human noted that postmortem causes of death were different from antemortem diagnoses in 40% of autopsies! How many of those deaths might have been prevented if physicians had been treating the correct diagnosis?

Most of these failures of diagnosis (probably two-thirds) are related to CDRs, and a lot of work has been done since 2000 to elucidate various causes and interventions, but physicians have been resistant to being told that there might be a problem with how they think. Physicians love to blame medical errors on someone or something else – thus the focus has been on residents’ lack of sleep or medication interaction checking. Seeking to reduce physicians’ resistance due to a feeling of being criticized is a prime reason why Croskerry and others prefer the term cognitive disposition to respond over negative words like bias or fallacy. I’m happy with either term because I’m not sure that relabeling will change the main problem: physicians tend to be a bit narcissistic and therefore resistant to the idea that all of us are biased and that all of us have to actively work to monitor those biases and avoid making decisions that are overly influenced by them.

We make poor decisions for one of two reasons: either we lack information or we don’t apply what we know correctly. Riegelman, in his 1991 book Minimizing Medical Mistakes: The Art of Medical Decision Making, called these ‘errors of ignorance’ and ‘errors of implementation.’ One of the goals of To Err Is Human was to create an environment where medical errors were attributed to system rather than personal failures, hoping to make progress in reducing error by de-emphasizing individual blame. Our focus here, of course, is to evaluate the errors of implementation. Graber et al., in 2002, further categorized diagnostic errors into three types: no-fault errors, system errors, and cognitive errors. No-fault errors will always happen (like when our hypothetical physician failed to diagnose mesenteric ischemia despite doing the correct work-up). System errors have been explored heavily since the publication of To Err Is Human. But the cognitive errors remain, and understanding our CDRs (our biases, etc.) is the first step to reducing this type of error.

Croskerry divides the CDRs into the following categories:

  • Errors of overattachment to a particular diagnosis
    • Anchoring, confirmation bias, premature closure, sunk costs
  • Errors due to failure to consider alternative diagnoses
    • Multiple alternatives bias, representativeness restraint, search satisfying, Sutton’s slip, unpacking principle, vertical line failure
  • Errors due to inheriting someone else’s thinking
    • Diagnosis momentum, framing effect, ascertainment effect, bandwagon effect
  • Errors in prevalence perception or estimation
    • Availability bias, ambiguity effect, base-rate neglect, gambler’s fallacy, hindsight bias, playing the odds, posterior probability error, order effects
  • Errors involving patient characteristics or presentation context
    • Fundamental attribution error, gender bias, psych-out error, triage cueing, contrast effect, yin-yang out
  • Errors associated with physician affect, personality, or decision style
    • Commission bias, omission bias, outcome bias, visceral bias, overconfidence, vertical line failure, belief bias, ego bias, sunk costs, zebra retreat

Some additional biases mentioned above include the bandwagon effect (doing something just because everyone else does, like giving magnesium to women in premature labor), the ambiguity effect (picking a diagnosis or treatment because more is known about it, such as its outcomes), the contrast effect (minimizing the treatment of one patient because, in contrast, her problems pale in comparison to the last patient’s), belief bias (accepting or rejecting data based on its conclusion or whether it fits with what one already believes rather than on the strength of the data itself), ego bias (overestimating the prognosis of your patients compared to that of others’ patients), and zebra retreat (not pursuing a suspected rare diagnosis out of fear of being viewed negatively by colleagues or others for wasting time, resources, etc.).

We are all vulnerable to cognitive dispositions that can lead to error. Just being aware of this is meaningful and can make us less likely to make these mistakes, but we need to do more. We need to actively work to de-bias ourselves. Let’s look at some strategies for this (adapted from Croskerry):

Develop insight/awareness: Education about CDRs is a crucial first step to reducing their impact on our clinical thinking, but it cannot stop with reading an article or a book. We have to look for examples of them in our own practices and integrate our understanding of CDRs into our quality improvement processes. We need to identify our biases and how they affect our decision-making and diagnosis formulation. An analysis of cognitive errors (and their root causes) should be a part of every peer review process, quality improvement meeting, and morbidity and mortality conference. Most cases reviewed in these formats are selected because a less than optimal outcome occurred; in most such cases, the root cause (or at least a major contributor) was a cognitive error.

Consider alternatives: We need to establish forced consideration of alternative possibilities, both in our own practices and in how and what we teach; considering alternatives should be a part of how we teach medicine routinely. Always ask the question, “What else could this be?” Ask yourself, ask your learner, ask your consultant. The ensuing conversation is perhaps the most educational thing we can ever do. Even when the diagnosis is obvious, always ask the question. This needs to become part of the culture of medicine. 

Metacognition: We all need to continually and actively examine and reflect on our thinking processes, not just when things go wrong. Even when things go right, it is a meaningful and important step to consider why they went right. We focus too much on negative outcomes (this is itself a form of bias); consequently, we develop a skewed sense of what contributed to the negative outcome. So try thinking about what went right as well, reinforcing the good things in our clinical processes.

Decrease reliance on memory: In the pre-computer days, a highly valued quality in a physician was a good memory. Unfortunately, medical schools today still emphasize this skill, selecting students who might excel in rote memorization but lag behind in critical thinking skills. In the 1950s, memory was everything: there was no quick way of searching the literature, of comprehensively checking drug interactions, of finding the latest treatment guidelines, etc. But today, memory is our greatest weakness. Our memories are poor and biased, and there is more data to master than ever before in order to be a good doctor. So stop relying on your memory. We need to encourage the habitual use of cognitive aids, whether that’s mnemonics, practice guidelines, algorithms, or computers. If you don’t treat a particular disease every week, then look it up each time you encounter it. If you don’t use a particular drug all the time, then cross-check the dose and interactions every time you prescribe it. Even if you do treat a particular disease every day, you should still do a comprehensive literature search every 6 months or so (yearly at the very least).

Many physicians are sorely dated in their treatment. What little new information they learn often comes from the worst sources: drug and product reps, throw-away journals, popular media, and even TV commercials. Education is a life-long process. Young physicians need to develop the habits of life-long learning early. Today, this means relying on electronic databases, practice guidelines, etc. as part of daily practice. I, for one, use PubMed at least five times a day (and I feel like I’m pretty up-to-date in my area of expertise).

Our memory, as a multitude of psychological experiments have shown, is among our worst assets. Stop trusting it.

Specific training: We need to identify specific flaws in our thinking and specific biases and direct efforts to overcome them. For example, the area that seems to contribute most to misdiagnosis relates to a poor understanding of Bayesian probabilities and inference, so specific training in Bayesian probabilities might be in order, or learning from examples of popular biases, like distinguishing correlation from causation, etc. 
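Bayesian reasoning is teachable with very small worked examples. The sketch below (illustrative numbers and function name, not drawn from any cited study) shows the classic base-rate problem: even a 90% sensitive, 90% specific test yields a low post-test probability when the disease is rare.

```python
def post_test_probability(prior, sensitivity, specificity):
    """Bayes' theorem: P(disease | positive test)."""
    true_positives = sensitivity * prior
    false_positives = (1 - specificity) * (1 - prior)
    return true_positives / (true_positives + false_positives)

# A 90% sensitive, 90% specific test for a disease with 1% prevalence:
p = post_test_probability(prior=0.01, sensitivity=0.90, specificity=0.90)
print(f"{p:.1%}")  # 8.3% -- far lower than the 90% that base-rate neglect suggests
```

Most people, clinicians included, intuitively guess a number closer to 90%; training that makes this arithmetic habitual is exactly the kind of specific intervention meant here.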

Simulation: We should use mental rehearsal and visualization as well as practical simulation/videos exhibiting right and wrong approaches. Though mental rehearsal may sound like a waste of your time, it is a powerful tool. If we appropriately employ metacognition, mental rehearsal of scenarios is a natural extension. Remember, one of our goals is to make our System 1 thinking better by employing System 2 thinking when we have time to do so (packing the parachute correctly). So a practical simulation in shoulder dystocia, done in a System 2 manner, will make our “instinctual” responses (the System 1 responses) better in the heat of the moment when the real shoulder dystocia happens. A real shoulder dystocia is no time to learn; you either have an absolute and definitive pathway in your mind of how you will deliver the baby before it suffers permanent injury or you don’t. But this is true even for things like making differential diagnoses. A howardism: practice does not make perfect, but good practice certainly helps get us closer. A corollary of this axiom is that bad practice makes a bad doctor; unfortunately, a lot of people have been packing the parachute incorrectly for many years and they have gotten lucky with the way the wind was blowing when they jumped out of the plane. 

Cognitive forcing strategies: We need to develop specific and general strategies to avoid bias in clinical situations. We can use our clinical processes and approaches to force us to think and avoid certain biases, even when we otherwise would not. Always asking the question, “What else could this be?” is an example of a cognitive forcing strategy. Our heuristics and clinical algorithms should incorporate cognitive forcing strategies. For example, an order sheet might ask you to provide a reason why you have elected not to use a preoperative antibiotic or thromboembolism prophylaxis. It may seem annoying to have to fill that out every time, but it makes you think. 

Make tasks easier: Reduce task difficulty and ambiguity. We need to train physicians in the proper use of relevant information databases and make these resources available to them. We need to remove as many barriers as possible to good decision making. This may come in the form of evidence-based order sets, clinical decision tools and nomograms, or efficient utilization of evidence-based resources. Bates et al. list “ten commandments” for effective clinical decision support. 

Minimize time pressures: Provide sufficient quality time to make decisions. We fall back to System 1 thinking when we are pressed for time, stressed, depressed, under pressure, etc. Hospitals and clinics should promote an atmosphere where appropriate time is given, so that System 2 critical thinking can occur when necessary, without further adding to the stress of a physician who already feels over-worked, under-appreciated, and behind. I won’t hold my breath for that. But clinicians can do this too. Don’t be afraid to tell a patient “I don’t know” or “I’m not sure” and then get back to them after finding the data you need to make a good decision. We should emphasize this idea even on simple decisions. Our snap, instinctive answers are usually correct (especially if we have been packing the parachute well), but we need to always take the time to do something if it is the right thing to do. For example, in education, you might consider always using a form of the One-Minute Preceptor. This simple tool can turn usually non-educational patient “check-outs” into an educational process for both you and your learner.

Accountability: Establish clear accountability and follow-up for decisions. Physicians too often don’t learn from cases that go wrong. They circle the wagons around themselves and go into an ego-defense mode, blaming the patient, nurses, the resident, or really anyone but themselves. While others may have some part in contributing to what went wrong, you can really only change yourself. We have to keep ourselves honest (and when we don’t, we need honest and not-always-punitive peer review processes to provide feedback). Physicians, unfortunately, often learn little from “bad cases,” or “crashes,” and they learn even less from “near-misses.” Usually, for every time a physician has a “crash,” there have been several near-misses (or, as George Carlin called them, “near-hits”). Ideally, we would learn as much from a near-miss as we might from a crash, and, in doing so, hopefully reduce the number of both. We cannot wait for things to go wrong to learn how to improve our processes.

Using personal or institutional databases for self-reflection is one way to be honest about outcomes. I maintain a database of every case or delivery I do; I can use this to compare any number of metrics to national, regional, and institutional averages (like primary cesarean rates, for example). We also need to utilize quality improvement conferences, even in nonacademic settings. Even when things go right, we can still learn and improve.

Feedback: We should provide rapid and reliable feedback so that errors are appreciated, understood, and corrected, allowing calibration of cognition. We need to do this for ourselves, our peers, and our institutions. Peer review processes should use modern tools like root-cause analysis, and utilize evidence-based data to inform the analysis. Information about potential cognitive biases should be returned to physicians with opportunities for improvement. Also, adverse situations and affective disorders that might lead to increased reliance on CDRs should be assessed, including things like substance abuse, sleep deprivation, mood and personality disorders, levels of stress, emotional intelligence, communications skills, etc. 

Leo Leonidas has suggested the following “ten commandments” to reduce cognitive errors (I have removed the Thou shalts and modified slightly):

  1. Reflect on how you think and decide.
  2. Do not rely on your memory when making decisions.
  3. Have an information-friendly work environment.
  4. Consider other possibilities even though you are sure you are right.
  5. Know Bayesian probability and the epidemiology of the diseases (and tests) in your differential.
  6. Rehearse both the common and the serious conditions you expect to see in your speciality.
  7. Ask yourself if you are the right person to make this decision.
  8. Take time when deciding; resist pressures to work faster than accuracy allows.
  9. Create accountability procedures and follow-up for decisions you have made.
  10. Use a database for patient problems and decisions to provide a basis for self-improvement.

Let’s implement these commandments with some examples:

1. Reflect on how you think and decide.

Case: A patient presents in labor with a history of diet-controlled gestational diabetes. She has been complete and pushing for the last 45 minutes. The experienced nurse taking care of the patient informs you that she is worried about her progress because she believes the baby is large. You and the nurse recall your diabetic patient last week who had a bad shoulder dystocia. You decide to proceed with a cesarean delivery for arrest of descent. You deliver a healthy baby weighing 7 lbs and 14 ounces.

What went wrong?

  • Decision was made with System 1 instead of System 2 thinking.
  • Ascertainment bias, framing effect, hindsight bias, base-rate neglect, availability, and probably a visceral bias all influenced the decision to perform a cesarean. 
  • This patient did not meet criteria for an arrest of descent diagnosis. Available methods of assessing fetal weight (like an ultrasound or even palpation) were not used and did not inform the decision. Negative feelings of the last case influenced the current case.

2. Do not rely on your memory when making decisions.

Case: A patient is admitted with severe preeclampsia at 36 weeks gestation. She also has Type IIB von Willebrand’s disease. Her condition has deteriorated and the consultant cardiologist has diagnosed cardiomyopathy and recommends, among other things, diuresis. You elect to deliver the patient. Worried about hemorrhage, you recall a patient with von Willebrand’s disease from residency, and you order DDAVP. She undergoes a cesarean delivery and develops severe thrombocytopenia and flash pulmonary edema and is transferred to the intensive care unit where she develops ARDS (and dies). 

What went wrong?

  • Overconfidence bias, commission bias. The physician treated an unusual condition without looking it up first, relying on faulty memories. 
  • DDAVP is contraindicated in patients with cardiomyopathy/pulmonary edema. DDAVP may exacerbate severe thrombocytopenia in Type IIB vWD. It also may increase blood pressure in patients with preeclampsia.

3. Have an information-friendly work environment.

Case: You’re attending the delivery of a 41 weeks gestation fetus with meconium stained amniotic fluid (MSAF). The experienced nurse offers you a DeLee trap suction. You inform her that based on recent randomized trials, which show no benefit and potential for harm from deep-suctioning for MSAF, you have stopped using the trap suction, and that current neonatal resuscitation guidelines have done away with this step. She becomes angered and questions your competence in front of the patient and tells you that you should ask the Neonatal Nurse Practitioner what she would like for you to do.

What went wrong?

  • Hindsight bias, overconfidence bias on the part of the nurse.
  • The work environment is not receptive to quality improvement based on utilizing data, and instead values opinion and anecdotal experience. This type of culture likely stems from leadership which does not value evidence based medicine, and institutions that promote ageism, hierarchy, and paternalistic approaches to patient care. An information-friendly environment also means having easy access to the appropriate electronic information databases; but all the databases in the world are useless if the culture doesn’t promote their routine utilization. 

4. Consider other possibilities even though you are sure you are right.

Case: A previously healthy 29 weeks gestation pregnant woman presents with a headache and she is found to have severe hypertension and massive proteinuria. You start magnesium sulfate. Her blood pressure is not controlled after administering the maximum dose of two different antihypertensives. After administration of betamethasone, you proceed with cesarean delivery. After delivery, the newborn develops severe thrombocytopenia and the mother is admitted to the intensive care unit with renal failure. Later, the consultant nephrologist diagnoses the mother with new onset lupus nephritis.

What went wrong?

  • Anchoring, availability, confirmation bias, premature closure, overconfidence bias, Sutton’s slip, or perhaps search satisfying. In popular culture, these biases are summed up with the phrase, If your only tool is a hammer, then every problem looks like a nail.
  • The physician failed to consider the differential diagnosis. 

5. Know Bayesian probability and the epidemiology of the diseases (and tests) in your differential.

Case: A 20 year old woman presents at 32 weeks gestation with a complaint of leakage of fluid. After taking her history, which sounds like the fluid was urine, you estimate that she has about a 5% chance of having ruptured membranes. You perform a ferning test, which is 51.4% sensitive and 70.8% specific for ruptured membranes. The test is positive and you admit the patient and treat her with antibiotics and steroids. Two weeks later she has a failed induction leading to a cesarean delivery. At that time, you discover that her membranes were not ruptured.

What went wrong?

  • Premature closure, base-rate neglect, commission bias.
  • The physician has a poor understanding of the positive predictive value of the test that was used. The PPV of the fern test in the case is very low, but when the test came back positive, the patient was treated as if the PPV were 100%, not considering what the post-test probability of the hypothesis was. 
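The case’s numbers can be run through Bayes’ theorem directly; a minimal sketch (the function name is mine; the figures are from the case above):

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV = true positives / all positive tests, via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# 5% pre-test probability; fern test 51.4% sensitive, 70.8% specific:
ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.514, specificity=0.708)
print(f"{ppv:.1%}")  # 8.5% -- a positive fern test barely moves the 5% prior
```

In other words, even after the positive test, the odds remain better than 10 to 1 that her membranes are intact.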

6. Rehearse both the common and the serious conditions you expect to see in your speciality.

Case: You are attending the delivery of a patient who has a severe shoulder dystocia. Your labor and delivery unit has recently conducted a simulated drill for managing shoulder dystocia and though the dystocia is difficult, all goes well with an appropriate team response from the entire staff, delivering a healthy newborn. You discover a fourth degree laceration, which you repair, using chromic suture to repair the sphincter. Two months later, she presents with fecal incontinence.

What went wrong?

  • Under-emphasis of the seemingly less important problem. This is a form of contrast bias. We are biased towards emphasizing “life and death” scenarios sometimes at the expense of other unusual but less important problems. Simulation was a benefit to the shoulder dystocia but rehearsal could have been a benefit too for the fourth degree laceration.

7. Ask yourself if you are the right person to make this decision.

Case: Your cousin comes to you for her prenatal care. She was considering a home-birth because she believes that the local hospital has too high a cesarean delivery rate. She says she trusts your judgment. While in labor, she has repetitive late decelerations with minimal to absent variability starting at 8 cm dilation. You are conflicted because you know how important a vaginal delivery is to her. You allow her to continue laboring and two hours later she gives birth to a newborn with Apgars of 1 and 4 and a pH of 6.91. The neonate seizes later that night.

What went wrong?

  • Visceral bias.
  • In this case, due to inherent and perhaps unavoidable bias, the physician made a poor decision. This is why we shouldn’t treat family members, for example. But this commandment also applies to the use of consultants. Physicians need to be careful not to venture outside their scope of expertise (overconfidence bias).

8. Take time when deciding; resist pressures to work faster than accuracy allows.

Case: A young nurse calls you regarding your post-operative patient’s potassium level. It is 2.7. You don’t routinely deal with potassium replacement. You tell her that you would like to look it up and call her back. She says, “Geez, it’s just potassium. I’m trying to go on my break.” Feeling rushed, you order 2 g of potassium chloride IV over 10 minutes (this is listed in some pocket drug guides!). The patient receives the dose as ordered and suffers cardiac arrest and dies.

What went wrong?

  • Overconfidence bias.
  • Arguably, the physician’s main problem is a lack of knowledge, but feeling pressured, he deviated from what should have been his normal habit and did not look it up (if this scenario seems far-fetched, it was taken from a case report from Australia).

9. Create accountability procedures and follow-up for decisions you have made.

Case: Your hospital quality review committee notes that you have a higher than average cesarean delivery wound infection rate. It is also noted that you are the only member of the department who gives prophylactic antibiotics after delivery of the fetus. You change to administering antibiotics before the case, and see a subsequent decline in wound infection rates.

What went wrong?

  • Nothing went wrong in this case. Peer review worked well, but it required the physician being receptive to it and being a willing participant in continuous quality improvement processes. It also required the non-malignant utilization of peer review. The situation might have been avoided if the physician had better habits of continuous education. 

10. Use a database for patient problems and decisions to provide a basis for self-improvement.

Case: You track all of your individual surgical and obstetric procedures in a database which records complications and provides statistical feedback. You note that your primary cesarean delivery rate is higher than the community and national averages. Reviewing indications, you note that you have a higher than expected number of arrest of dilation indications. You review current literature on the subject and decide to reassess how you decide if a patient is in active labor (now defining active labor as starting at 6 cm) and you decide to give patients 4 hours rather than 2 hours of no change to define arrest. In the following 6 months, your primary cesarean delivery rate is halved.

What went wrong?

  • Again, nothing went wrong. This type of continuous quality improvement is the hallmark of a good physician. But it must be driven by data (provided from the database) rather than a subjective recall of outcomes. We must promote a culture of using objective data rather than memory and perception to judge the quality of care that we provide. Additionally, we must be open to the idea that the way we have always done things might not be the best way, and look continuously for ways to improve. This is another skill that is strengthened with metacognition.

Trowbridge (2008) offers these twelve tips for teaching avoidance of diagnostic errors:

  1. Explicitly describe heuristics and how they affect clinical reasoning.
  2. Promote the use of ‘diagnostic time-outs’.
  3. Promote the practice of ‘worst case scenario medicine’.
  4. Promote the use of a systematic approach to common problems.
  5. Ask why.
  6. Teach and emphasize the value of the clinical exam.
  7. Teach Bayesian theory as a way to direct the clinical evaluation and avoid premature closure.
  8. Acknowledge how the patient makes the clinician feel.
  9. Encourage learners to find clinical data that doesn’t fit with a provisional diagnosis; Ask ‘‘What can’t we explain?’’
  10. Embrace Zebras.
  11. Encourage learners to slow down.
  12. Admit one’s own mistakes.

The Differential Diagnosis as a Cognitive Forcing Tool

I believe that the differential diagnosis can be one of our most powerful tools in overcoming bias in the diagnostic process. But the differential diagnosis must be made at the very beginning of a patient encounter to provide mental checks and raise awareness of looming cognitive errors before we are flooded with sources of bias. The more information that is learned about the patient, the more biased we potentially become. The traditional method is to form the differential as the patient’s story unfolds, usually after the history and physical; yet this may lead to multiple cognitive errors. Triage cueing from the patient’s first words may lay the groundwork for availability, anchoring, confirmation bias, and premature closure. The most recent and common disease processes will easily be retrieved from our memory, limiting the scope of our thinking merely by their availability.

With bias occurring during the patient interview, by default – through System 1 thinking – we may begin to anchor on the first and most likely diagnosis without full consideration of other possibilities. This causes us to use the interviewing process to seek confirmation of our initial thoughts, and it becomes harder to consider alternatives. Scientific inquiry should not seek confirmation of our hypothesis (or our favored diagnosis), but rather proof for rejection of other possibilities. Once we’ve gathered enough data to confirm our initial heuristic thinking, we close in quickly, becoming anchored to our diagnosis. A simple strategy to prevent this course of events is to pause before every patient interview and contemplate the full scope of possibilities; that is, to make the differential diagnosis after learning the chief complaint but before interviewing the patient. By using the chief complaint given on the chart, a full scope of diagnostic possibilities can be considered, including the most likely, the most common, the rare, and the life-threatening. This will help shape the interview with a larger availability of possibilities and encourage history-taking that works to exclude other diagnoses. Here’s a howardism:

You can’t diagnose what you don’t think of first.

Having taught hundreds of medical students how to make differential diagnoses, I have always been impressed by how easy it is to bias them to exclude even common and likely diagnoses. For example, a patient presents with right lower quadrant pain. The student is biased (because I am a gynecologist), so the differential diagnosis focuses only on gynecologic issues. When taking the history, the student then fails to ask about anorexia, migration of the pain, etc., and fails to consider appendicitis as a likely or even a possible diagnosis. The history and physical were limited because the differential was not broad enough. In these cases, triage cueing becomes devastating.

If bias based merely on my speciality is that profound, imagine what happens when the student opens the door and sees the patient (making assumptions about class, socioeconomic status, drug-dependency, etc.), then hears the patient speak (who narrows the complaint down to her ovary or some other source of self-identified pain), then takes a history too narrowly focused (not asking broad review of system questions, etc.). I have considered lead poisoning as a cause of pelvic/abdominal pain every time I have ever seen a patient with pain, but, alas, I have never diagnosed it nor have I ever tested a patient for it. But I did exclude it as very improbable based on history.

For further reading:

  • Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003; 78(8):775-780.
  • Wachter R. Why diagnostic errors don’t get any respect – and what can be done about them. Health Aff. 2010; 29(9):1605-1610.
  • Newman-Toker DE, Pronovost PJ. Diagnostic errors: the next frontier for patient safety. JAMA. 2009; 301(10):1060-1062.
  • Croskerry P. Cognitive forcing strategies in clinical decision making. Ann Emerg Med. 2003; 41(1).
  • Graber M, et al. Cognitive interventions to reduce diagnostic error: a narrative review. BMJ Qual Saf. 2012; 21:535-557.
  • Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009; 84(8):1022-1028.
  • Redelmeier DA. The cognitive psychology of missed diagnoses. Ann Intern Med. 2005; 142:115-120.


Filed under Cognitive Bias

Cognitive Bias



Our human reasoning and decision-making processes are inherently flawed. Faced with so many decisions to be made every day, we take short-cuts (called heuristics) that help us make “pretty good” decisions with little effort. These “pretty good” decisions are not always right; they often trade our best decision for one that is merely good enough. These heuristics carry with them assumptions which may not be relevant to the individual decision at hand, and if these assumptions are not accurate for the current problem, then a mistake may be made. We call these assumptions “cognitive biases.” Thus,

When a heuristic fails, it is referred to as a cognitive bias. Cognitive biases, or predispositions to think in a way that leads to failures in judgment, can also be caused by affect and motivation. Prolonged learning in a regular and predictable environment increases the successfulness of heuristics, whereas uncertain and unpredictable environments are a chief cause of heuristic failure (Improving Diagnosis in Healthcare).

More than 40 cognitive biases have been described which specifically affect our reasoning processes in medicine. These biases are more likely to occur with quicker decisions than with slower decisions. The term Dual Process Theory has been used to describe these two distinct ways we make decisions. Daniel Kahneman refers to these two processes as System 1 and System 2 thinking.

System 1 thinking is intuitive and largely unintentional; it makes heavy use of heuristics. It is quick and reasoning occurs unconsciously. It is effortless and automatic. It is profoundly influenced by our past experiences, emotions, and memories.

System 2 thinking, on the other hand, is slower and more analytic. System 2 reasoning is conscious and operates with effort and control. It is intentional and rational. It is more influenced by facts, logic, and evidence. System 2 thinking takes work and time, and therefore is too slow to make most of the decisions we need to make in any given day.

A System 1 decision about lunch might be to get a double bacon cheeseburger and a peanut butter milkshake (with onion rings, of course). That was literally the first meal that popped into my head as I started typing, and each of those items resonates with emotional centers in my brain that recall pleasant experiences and pleasant memories. But not everything that resonates is reasonable.

As the System 2 part of my brain takes over, I realize several things: I am overweight and diabetic (certainly won’t help either of those issues); I have to work this afternoon (if I eat that I’ll probably need a nap); etc. You get the idea. My System 2 lunch might be kale with a side of Brussels sprouts. Oh well.

These two ways of thinking actually utilize different parts of our brains; they are distinctly different processes. Because System 1 thinking is so intuitive and so affected by our past experiences, we tend to make most cognitive errors with this type of thought. Failures can occur with System 2 thinking to be sure, and not just due to cognitive biases but also due to logical fallacies or just bad data; but, overall, System 2 decisions are more often correct than System 1 decisions.

We certainly don’t need to overthink every decision. We don’t have enough time to make System 2 decisions about everything that comes our way. Yet, the more we make good System 2 decisions initially, the better our System 1 decisions will become. In other words, we need good heuristics or algorithms, deeply rooted in System 2 cognition, to make the best of our System 1 thoughts. Thus the howardism:

The mind is like a parachute: it works best when properly packed.

The packing is done slowly and purposefully; the cord is pulled automatically and without thinking. If we thoroughly think about where to eat lunch using System 2 thinking, it will have a positive effect on all of our subsequent decisions about lunch.

How does this relate to medicine? We all have cognitive dispositions that may lead us to error.

First, we need to be aware of how we make decisions and how our brains may play tricks on us; a thorough understanding of different cognitive biases can help with this. Second, we need to develop processes or tools that help to de-bias ourselves and/or prevent us from falling into some of the traps that our cognitive biases have laid for us.

Imagine that you are working in a busy ER. A patient presents who tells the triage nurse that she is having right lower quadrant pain; she says that the pain is just like pain she had 6 months ago when she had an ovarian cyst rupture. The triage nurse tells you (the doctor) that she has put the patient in an exam room and that she has pain like her previous ruptured cyst. You laugh, because you have already seen two other women tonight who had ruptured cysts on CT scans. You tell the nurse to go ahead and order a pelvic ultrasound for suspected ovarian cyst before you see her. The ultrasound is performed and reveals a 3.8 cm right ovarian cyst with some evidence of hemorrhage and some free fluid in the pelvis. You quickly examine and talk to the patient, confirm that her suspicions were correct, and send her home with some non-narcotic pain medicine and ask her to follow-up with her gynecologist in the office.

Several hours later, the patient returns, now complaining of more severe pain and bloating. Frustrated and feeling that the patient is upset that she didn’t get narcotics earlier, you immediately consult the gynecologist on-call for evaluation and management of her ovarian cyst. The gynecologist performs a consult and doesn’t believe that there is any evidence of torsion because there is blood flow to the ovary on ultrasound exam. He recommends reassurance and discharge home.

The next day she returns in shock and is thought to have an acute abdomen. She is taken to the OR and discovered to have mesenteric ischemia. She dies post-operatively.

While this example may feel extreme, the mistakes are real and they happen every day.

When the patient told the nurse that her ovary hurt, the nurse was influenced by the framing effect. The patient suffered from triage cueing because of the workflow of the ER. The physician became anchored to the idea of an ovarian cyst early on. He suffered from base-rate neglect when he overestimated the prevalence of painful ovarian cysts. When he thought about his previous patients that night, he committed the gambler’s fallacy and exhibited an availability bias. When the ER doctor decided to get an ultrasound, he was playing the odds or fell victim to Sutton’s slip. When the ultrasound was ordered for “suspected ovarian cyst,” there was diagnosis momentum that transferred to the interpreting radiologist.

When the ultrasound showed an ovarian cyst, the ER physician was affected by confirmation bias. The ER doctor’s frequent over-diagnosis of ovarian cysts was reinforced by feedback sanction. When he stopped looking for other causes of pain because he had discovered an ovarian cyst, he committed premature closure. When he felt that the patient’s return to the ER was due to her desire for narcotics, the ER doctor made a fundamental attribution error. When he never considered mesenteric ischemia because she did not complain of bloody stools, he exhibited representativeness restraint. When he consulted a gynecologist to treat her cyst rather than explore other possibilities, he fell victim to the sunk costs bias.

Each of these are examples of cognitive biases that affect our reasoning (see definitions below). But what’s another way this story could have played out?

The patient presents to the ER. The nurse tells the doctor that the patient is in an exam room complaining of right lower quadrant pain (she orders no tests or imaging before the patient is evaluated and she uses language that does not make inappropriate inferences). The doctor makes (in his head) a differential diagnosis for a woman with right lower quadrant pain (he does this before talking to the patient). While talking to the patient and performing an exam, he gathers information that he can use to rule out certain things on his differential (or at least decide that they are low probability) and determines the pretest probability for the various diagnoses on his list (this doesn’t have to be precise – for example, he decides that the chance of acute intermittent porphyria is incredibly low and decides not to pursue the diagnosis, at least at first).

After assessing the patient and refining his differential diagnosis, he decides to order some tests that will help him disprove likely and important diagnoses. He is concerned about her nausea and that her pain seems to be out of proportion to the findings on her abdominal exam. He briefly considered mesenteric ischemia but considers it lower probability because she has no risk factors and she has had no bloody stools (he doesn’t exclude it however, because he also realizes that only 16% of patients with mesenteric ischemia present with bloody stools). Her WBC is elevated. He does decide to order a CT scan because he is concerned about appendicitis.

When the CT is not consistent with appendicitis or mesenteric ischemia, he decides to attribute her pain to the ovarian cyst and discharges her home. When the patient returns later with worsened pain, he first reevaluates her carefully and starts out with the assumption that he has likely misdiagnosed her. This time, he notes an absence of bowel sounds, bloating, and increased abdominal pain on exam. He again considers mesenteric ischemia, even though the previous CT scan found no evidence of it, realizing that the negative predictive value of a CT scan for mesenteric ischemia in the absence of a small bowel obstruction is only 95% – meaning that 1 in 20 negative scans misses the diagnosis. This time, he consults a general surgeon, who agrees that a more definitive test needs to be performed, and a mesenteric angiogram reveals mesenteric ischemia. She is treated with decompression and heparin and makes a full recovery.

These two examples represent the extremes, from very poor care to excellent care. Note that even when excellent care occurred, the rare diagnosis was still initially missed. But the latter physician was not nearly as burdened by cognitive biases as the former, and the patient is the one who benefits. The latter physician definitely used a lot of System 1 thinking, at least initially, but when it mattered, he slowed down and used System 2 thinking. He also had a thorough understanding of the statistical performance of the tests he ordered and he considered the pre-test and post-test probabilities of the diseases on his differential diagnosis. He is comfortable with uncertainty and he doesn’t think of tests in a binary (positive or negative) sense, but rather as increasing or decreasing the likelihood of the conditions he’s interested in. He used the hypothetico-deductive method of clinical diagnosis, which is rooted in Bayesian inference.
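This kind of Bayesian updating can be made concrete. The sketch below shows how a negative test lowers, but does not eliminate, the probability of disease; the pretest probability, sensitivity, and specificity are illustrative assumptions for the sake of the example, not published test characteristics.

```python
# Post-test probability after a test result, via likelihood ratios.
# All numbers below are illustrative assumptions.

def post_test_probability(pretest: float, sensitivity: float,
                          specificity: float, test_positive: bool) -> float:
    """Update a pretest probability with a test result using Bayes' rule."""
    pre_odds = pretest / (1 - pretest)
    if test_positive:
        lr = sensitivity / (1 - specificity)   # positive likelihood ratio
    else:
        lr = (1 - sensitivity) / specificity   # negative likelihood ratio
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Assumed: 20% pretest probability, test sensitivity 80%, specificity 95%.
p = post_test_probability(0.20, 0.80, 0.95, test_positive=False)
print(f"{p:.2f}")  # -> 0.05: a negative test still leaves a 1-in-20 risk
```

This is why the second physician does not treat a negative CT as proof of absence: a "negative" result only shrinks the probability, and whether the residual risk is acceptable depends on how dangerous the missed diagnosis would be.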

Let’s briefly define the most common cognitive biases which affect clinicians.

  • Aggregate bias: the belief that aggregated data, such as data used to make practice guidelines, don’t apply to individual patients.
    • Example: “My patient is special or different than the ones in the study or guideline.”
    • Consequence: Ordering pap smears or other tests when not indicated in violation of the guideline (which may unintentionally lead to patient harm).
  • Anchoring: the tendency to lock onto the salient features of a diagnosis too early and not modify the theory as new data arrives.
    • Example: “Hypoglycemia with liver inflammation is probably acute fatty liver of pregnancy.”
    • Consequence: Ignoring or rationalizing away the subsequent finding of normal fibrinogen levels (which would tend to go against the diagnosis).
  • Ascertainment bias: this occurs when thinking is shaped by prior expectations, such as stereotyping or gender bias.
    • Example: “She has pain because she is drug-seeking again.”
    • Consequence: Not conducting appropriate work-up of pain.
  • Availability: the tendency to believe things are more common or more likely if they come to mind more easily, usually leading to over-diagnosis (it may also lead to under-diagnosis).
    • Example: “Ooh, I saw this once in training and it was a twin molar pregnancy!”
    • Consequence: Not considering statistically more probable diagnoses.
  • Base-rate neglect: the tendency to ignore the true prevalence of diseases, distorting Bayesian reasoning. May be unintentional or deliberate (for example, when physicians always emphasize the “worst case scenario”).
    • Example: “It’s probably GERD but we need to rule out aortic dissection.”
    • Consequence: Ordering unnecessary tests with high false positive rates and poor positive predictive values. 
  • Commission bias: the tendency toward action rather than inaction, believing action is necessary to prevent harm. More common in over-confident physicians.
    • Example: “This trick always works in my patients for that problem” or “It’s just a cold, but she made an appointment so she’ll be unhappy if I don’t give her antibiotics” or “I want you to be on strict bedrest since you are having bleeding in the first trimester to prevent a miscarriage.”
    • Consequence: Overuse of potentially risky or unnecessary therapeutics and perhaps guilt-commissioning (if, for example, the patient miscarries when she gets up to tend to her crying baby).
  • Confirmation bias: the tendency to look for supporting evidence to confirm a diagnosis rather than to look for data to disprove a diagnosis.
    • Example: “Aha! That’s what I suspected.”
    • Consequence: Incorrect diagnosis. We should always look for data to disprove our diagnosis (our hypothesis).
  • Diagnosis momentum: the effect of attaching diagnoses too early and making them stick throughout interactions with patients, nurses, consultants, etc., thereby biasing others.
    • Example: “This is probably an ectopic pregnancy” and writing suspected ectopic on the ultrasound requisition form.
    • Consequence: Radiologist reads corpus luteal cyst as ectopic pregnancy.
  • Feedback sanction: diagnostic errors may have no consequence because of a lack of immediate feedback or any feedback at all, particularly in acute care settings where there is no patient follow-up, which reinforces errors in diagnosis or knowledge.
    • Example: “I saw this girl with back pain due to a UTI.”
    • Consequence: Positive reinforcement of diagnostic errors (such as belief that UTIs are a common cause of back pain).
  • Framing effect: how outcomes or contingencies are framed (by family, nurses, residents, or even the patient) influences decision making and diagnostic processes.
    • Example: “The patients says that her ovary has been hurting for a week.”
    • Consequence: Focusing on ovarian or gynecological sources of pelvic pain rather than other more likely causes.
  • Fundamental attribution error: the tendency to blame patients for their illnesses (dispositional causes) rather than circumstances (situational factors).
    • Example: “Her glucose is messed up because she is noncompliant.”
    • Consequence: Ignoring other causes of the condition (e.g. infection leading to elevated glucose).
  • Gambler’s fallacy: the belief that prior unrelated events affect the outcome of the current event (such as a series of coin tosses all showing heads affecting the probability that the next coin toss will be heads).
    • Example: “My last three diabetics all had shoulder dystocias!!”
    • Consequence: Leads to inappropriate treatment of the current patient, based on facts that are irrelevant. 
  • Gender (racial) bias: the belief that gender or race affects the probability of a disease when no such link exists pathophysiologically.
    • Example: “We need to think about osteoporosis since she’s white.”
    • Consequence: Under- or over-diagnosing diseases. Two-thirds of racial predilections published in major textbooks, for example, are not supported.
  • Hindsight bias: knowledge of the outcome affects perception of past events and may lead to an illusion of failure or an illusion of control.
    • Example: “Last time I had this, she got better because I gave her ___.”
    • Consequence: Perpetuates error and encourages anecdotal medicine. For example, it is merely an assumption that the intervention affected the outcome, either positively or negatively. 
  • Multiple alternative bias: a multiplicity of diagnostic options leads to conflict and uncertainty and then regression to well-known diagnoses.
    • Example: “Well let’s just focus on what it probably is and not worry about all that other stuff for now.”
    • Consequence: Ignoring other important alternative diagnoses. 
  • Omission bias: the tendency towards inaction, the opposite of a commission bias, and more common than commission biases.
    • Example: “Group B strep infections in neonates are really rare, so I don’t see the point in the antibiotic for this mom.”
    • Consequence: May result in rare but serious harms.
  • Order effects: the tendency to focus on the beginning and the end and fill in the middle part of the story (creating a false narrative or constructing false associations), worsened by tendencies like anchoring. This bias is important to consider in patient hand-offs and presentations.
    • Example: “She had a fever but got better when we treated her for a UTI.”
    • Consequence: Leads to inappropriate causation biases, etc. (The patient got better. This may be due to antibiotics given for a possible UTI or the actual cause of her fever may still be unknown). 
  • Outcome bias: the tendency to pick a diagnosis that leads to good outcomes rather than a bad outcome, a form of a value bias.
    • Example: “I’m sure it’s just a panic attack and not a pulmonary embolism.”
    • Consequence: Missing potentially serious diagnoses.
  • Overconfidence bias: the belief that we know more than we do, leading to a tendency to act on incomplete information, hunches, or intuitions.
    • Example: “I see this all the time, just send her home.”
    • Consequence: Grave harm may occur because of missed diagnosis.
  • Playing the odds: the tendency to opt for more benign diagnoses or simpler courses of action when uncertainty in the diagnosis exists.
    • Example: “I’m sure that this ovarian cyst is benign; it’s gotta be.”
    • Consequence: Potentially devastating when coupled with an omission bias (e.g., not following-up with a repeat ultrasound in a few weeks for a questionable cyst). 
  • Posterior probability error: the tendency to believe that what has gone on before for a patient changes the probability for future events for the patient.
    • Example: “Every time she comes in her bleeding has just been from her vaginal atrophy.”
    • Consequence: Biases current evaluation and work-up (e.g., ignoring post-menopausal bleeding).
  • Premature closure: the tendency to accept a diagnosis before it has actually been confirmed when scant data supports the anchored diagnosis, often leading to treatment failures. 
    • Example: “We know what it is, she just hasn’t responded to treatment yet.”
    • Consequence: Ignoring alternative theories; evidenced by this famous phrase in medicine: When the diagnosis is made, the thinking stops.
  • Psych-out error: this occurs when serious medical problems (e.g., hypoxia, head injuries, etc) are misattributed to psychiatric diagnoses.
    • Example: “She just acts that way because she is bipolar.”
    • Consequence: Ignoring potentially catastrophic physical ailments (e.g, vascular disease or brain tumor). 
  • Search satisfying: the tendency to stop looking once something satisfying is found, whether on the patient or in the medical literature (a form of premature closure).
    • Example: “This article says I’m right!” or “That’s where she’s bleeding!”
    • Consequence: Ignoring other causes of symptoms or other contradictory evidence or literature.
  • Sutton’s slip: the tendency to go where the money is, that is, to diagnose the most obvious things, ignoring less-likely diagnoses.
    • Example: “I rob banks because that’s where the money is,” (Willie Sutton’s response when the judge asked him why he robbed banks).
    • Consequence: Under-utilizing System 2 thinking and ignoring diseases or presentations that are less common.
  • Sunk costs: the more investment that is made in a particular diagnosis or treatment, the less likely one is to release from it and consider alternatives.
    • Example: “I’ve just got to be right, I don’t know why this treatment isn’t working!”
    • Consequence: Further delay in pursuing the right treatment/diagnosis. This also results in a lot of false case reports (We present a case of such-and-such that was refractory to usual treatments but responded to some other crazy treatment or We present a case of such-and-such that presented in some weird nontypical way – in both cases, the diagnosis was likely wrong to begin with). 
  • Triage cueing: the biasing that results from self-triage or systematic triage of patients or presentations, creating tunnel vision.
    • Example: “I need a Gyn consult because she’s a female with pain.”
    • Consequence: Ignoring other organ systems or causes of pain.
  • Unpacking principle: the failure to elicit all relevant information in establishing a differential, particularly when a prototypical presentation leads to anchoring.
    • Example: “Anorexia and right lower quadrant pain is classic appendicitis.”
    • Consequence: Not considering all causes of each symptom individually (or collectively). As an aside, Occam’s razor and other cognitive processes that favor simplicity over complexity are usually wrong but feel comfortable to human imagination (it’s much simpler to blame the MMR vaccine for autism than it is to consider a polygenic, multifactorial causation theory). 
  • Vertical line failure: this results from routine, repetitive processes that emphasize economy, efficacy, and utility (as opposed to lateral thinking).
    • Example: “I always do a diabetes screen on all pregnant women” or “When I see x I always do y.”
    • Consequence: Deemphasizes lateral thinking (e.g., What else might this be?).
  • Visceral bias: Results from countertransference and other visceral arousal leading to poor decision making.
    • Example: “That patient is just a troll” or “She is so sweet.”
    • Consequence: Leads to cognitive distortion and augments biases.
  • Yin-yang out: the tendency to stop looking or to stop trying once efforts have seemingly been exhausted even though the case hasn’t been satisfied.
    • Example: “We’ve worked her up to the yin-yang.”
    • Consequence: Leads to errors of omission.

I’m sure you can think of many other examples for these biases, and there are many other biases that have been described apart from those on the list. There is an emerging scientific literature which is examining the effects of bias on diagnostic and therapeutic outcomes and on medical error. The 2015 Institute of Medicine Report, Improving Diagnosis in Healthcare, is a good place to start exploring some of the implications of bias in the diagnostic process.

Next we will explore some strategies to mitigate our bias.


Filed under Cognitive Bias

Trends…

Recently, I made a disparaging comment about data that was not statistically significant but rather was “trending” toward significance:

There was an apparent “trend” towards fewer cases of CP and less developmental problems. “Trends” are code for “We didn’t find anything but surely it’s not all a waste of time.”

This comment was in reference to the ACTOMgSO4 trial which studied prevention of cerebral palsy using magnesium sulfate. This study is often cited in support of this increasingly common but not evidence-based practice. To be clear, ACTOMgSO4 found the following: total pediatric mortality, cerebral palsy in survivors, and the combined outcome of death plus cerebral palsy were not statistically significantly different in the treatment group versus the placebo group.

Yet, this study is quoted time and time again as evidence supporting the practice of antenatal magnesium. The primary author of the study, Crowther, went on to write one of the most influential meta-analyses on the issue, which used the non-significant subset data from the BEAM Trial to re-envision the non-significant data from the ACTOMgSO4 Trial. Indeed, Rouse, author of the BEAM Trial study, was a co-author of this meta-analysis. If this seems like a conflict of interest, it is. But there is only so much that can be done to try to make a silk purse out of a sow’s ear. These authors keep claiming that the “trend” is significant (even though the data is not).

Keep in mind that all non-significant data has a “trend,” but the bottom line is it isn’t significant. Any data that is not exactly the same as the comparison data must by definition “trend” away. It means nothing. Imagine that I do a study with 50 people in each arm: in the intervention arm 21 people get better while in the placebo arm only 19 get better. My data is not significantly different. But I really, really believe in my hypothesis, and even though I said before I analyzed my data that a p value of < 0.05 would be used to determine statistical significance, I still would like to get my study published and provide support to my pet idea; so I make one or more of the following “true” conclusions:

  • “More patients in the treatment group got better than in the placebo group.”
  • “There was a trend towards statistical significance.”
  • “We observed no harms from the intervention.”
  • “The intervention may lead to better outcomes.”

All of those statements are true, or are at least half-truths. So is this one:

  • “The intervention was no better than placebo at treating the disease. There is no evidence that patients benefited from the intervention.”
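The hypothetical 21-versus-19 trial above can be checked directly. This sketch runs a standard two-sided two-proportion z-test (standard-library Python only) on those made-up numbers to show just how far from significance such a "trend" really is:

```python
# Two-sided two-proportion z-test for the hypothetical 21/50 vs 19/50 trial.
from math import sqrt, erf

def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p value for H0: the two group proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # standard error
    z = abs(p1 - p2) / se
    phi = 0.5 * (1 + erf(z / sqrt(2)))                   # standard normal CDF
    return 2 * (1 - phi)

p = two_proportion_p_value(21, 50, 19, 50)
print(f"p = {p:.2f}")  # -> p = 0.68: a "trend," nowhere near significance
```

A p value of 0.68 means a difference at least this large would appear in roughly two out of three trials even if the intervention did nothing at all; calling it a "trend toward significance" is pure spin.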

How did the authors of ACTOMgSO4 try to make a silk purse out of a sow’s ear? They said,

Total pediatric mortality, cerebral palsy in survivors, and combined death or cerebral palsy were less frequent for infants exposed to magnesium sulfate, but none of the differences were statistically significant.

That’s a really, really important ‘but’ at the end of that sentence. Their overall conclusions:

Magnesium sulfate given to women immediately before very preterm birth may improve important pediatric outcomes. No serious harmful effects were seen.

Sound familiar? It’s definitely a bit of doublespeak. It “may improve” outcomes and it may not. Why write it this way? Bias. The authors knew what they wanted to prove when designing the study, and despite every attempt to do so, they just couldn’t massage the data enough to make a significant p value. Readers are often confused by the careful use of the word ‘may’ in articles; if positive, affirmative data were discovered, the word ‘may’ would be omitted. Consider the non-conditional language in this study’s conclusion:

Although one-step screening was associated with more patients being treated for gestational diabetes, it was not associated with a decrease in large-for-gestational-age or macrosomic neonates but was associated with an increased rate of primary cesarean delivery.

No ifs, ands, or mays about it. But half-truth writing allows other authors to still claim some value in their work while not technically lying. But it is misleading, whether intentionally or unintentionally. Furthermore, it is unnecessary – there is value in the work. The value of the ACTOMgSO4 study was showing that magnesium was not better than placebo in preventing cerebral palsy; but that’s not the outcome the authors were expecting to find – thus the sophistry.

The Probable Error blog has compiled an amazing list of over 500 terms that authors have used to describe non-significant p values. Take a glance here; it’s truly extraordinary. The list includes gems like significant tendency (p = 0.09), possibly marginally significant (p = 0.116), not totally significant (p = 0.09), an apparent trend (p = 0.286), not significant in the narrow sense of the word (p = 0.29), and about 500 other Orwellian ways of saying the same thing: NOT SIGNIFICANT.

I certainly won’t pretend that p values are everything; I have made that abundantly clear. We certainly do need to focus on issues like numbers needed to benefit or harm. But we also need to make sure that those numbers are not derived from random chance. We need to use Bayesian inference to decide how probable or improbable a finding actually is. But the culture among scientific authors has crossed over to the absurd, as shown in the list of silly rationalizations noted above. If the designers of a study don’t care about the p value, then they shouldn’t publish it; but if they do, then they should respect it and not try to minimize the fact that the study did not disprove the null hypothesis. This type of intellectual dishonesty partly drives the p-hacking and manipulation that is so prevalent today.

If differences between groups in a study are truly important, we should be able to demonstrate differences without relying on faulty and misleading statistical analysis. Such misleading statements would not be allowed in a court of law. In fact, in this court decision which excluded the testimony of the epidemiologist Anick Bérard, who claimed that Zoloft caused birth defects, the Judge stated,

Dr. Bérard testified that, in her view, statistical significance is certainly important within a study, but when drawing conclusions from multiple studies, it is acceptable scientific practice to look at trends across studies, even when the findings are not statistically significant. In support of this proposition, she cited a single source, a textbook by epidemiologist Kenneth Rothman, and testified to an “evolution of the thinking of the importance of statistical significance.” Epidemiology is not a novel form of scientific expertise. However, Dr. Bérard’s reliance on trends in non-statistically significant data to draw conclusions about teratogenicity, rather than on replicated statistically significant findings, is a novel methodology.

These same statistical methods were used in the meta-analyses of magnesium to prevent CP, by combining the non-statistically significant findings of the ACTOMgSO4 study and the PreMAG studies with the findings from the BEAM trial; but, ultimately, not significant means not significant, no matter how it’s twisted.


Filed under Evidence Based Medicine

How Do I Know If A Study Is Valid?


If you torture data for long enough, it will confess to anything. – Ronald Harry Coase

Imagine that you’ve just read a study in the prestigious British Medical Journal that concludes the following:

Remote, retroactive intercessory prayer said for a group is associated with a shorter stay in hospital and shorter duration of fever in patients with a bloodstream infection and should be considered for use in clinical practice.

Specifically, the author randomized 3393 patients who had been hospitalized for sepsis up to ten years earlier to two groups: the author prayed for one group and did not pray for the other. He found that the group he prayed for was more likely to have had shorter hospital stays and a shorter duration of fever. Both of these findings were statistically significant, with p values of 0.01 and 0.04, respectively. So are you currently praying for the patients you hospitalized 10 years ago? If you aren’t, some “evidence-based medicine” practitioners (of the Frequentist school) might conclude that you are a nihilist, choosing to ignore science. But I suspect that even after reading the article, you are probably going to ignore it. But why? How do we know if a study is valid and useful? How can we justify ignoring one article while demanding universal adoption of another, when both have similar p values?

Let’s consider five steps for determining the validity of a study.

1. How good is the study?

Most published studies suffer from significant methodological problems, poor designs, bias, or other problems that may make the study fundamentally flawed. If you haven’t already, please read How to Read A Scientific Paper for a thorough approach to this issue. But if the paper looks like a quality study that made a statistically significant finding, then we must address how likely it is that this discovery is true.

2. What is the probability that the discovered association (or lack of an association) is true?

Of the many things to consider when reading a new study, this is often the hardest question to answer. The majority of published research is wrong. This doesn’t mean, however, that a scientific, evidence-based approach is not still the best way to determine how we should practice medicine; the fact that most published research is wrong is no reason to embrace anecdotal medicine, since almost every anecdotally-derived medical practice has been or will eventually be discredited.

It does mean, though, that we have to do a little more work as a scientific community, and as individual clinicians, to ascertain what the probability of a finding being “true” or not really is. Most papers that report something as “true” or “significant” based on a p value of less than 0.05 are in error. This fact was popularized in 2005 by John Ioannidis, whose paper has inspired countless studies and derivative works analyzing the impact of this assertion. For example,

  • This 2012 report, published in Nature, found that only 6 of 53 high-profile cancer studies could be replicated in repeat experiments.
  • This Frequentist paper, published in 2013, which attempted to rebut the methods of Ioannidis, still estimated that 14% of 5,322 papers published in the five largest medical journals contained false findings. This paper’s estimate is limited because it assumes that no study was biased and that authors did not manipulate p values, and because it represents only papers published in the premier journals (most poorer-quality papers, as well as most preliminary and exploratory studies, are presumably published elsewhere).
  • This paper, published in JAMA in 2005, found that of 49 highly cited articles, only 20 had results that were replicated, while 7 were explicitly contradicted by future studies. Others suffered from differences in the size of the reported effect in subsequent studies.
  • In 2015, Nature reported that only 39 of 100 psychology studies could be reproduced.
  • This paper, published in 2010, found that 10 of 18 genetic microarray studies could not be replicated.
  • RetractionWatch provides this overview of how many studies turn out to be untrue.
  • This article in The Economist and this YouTube video both offer excellent visual explanations of these problems.
  • Maybe the very best article about the scope of the problem is here, at fivethirtyeight.

P-hacking, fraud, retractions, lack of reproducibility, and just honest chance leading to wrong findings; those are some of the problems, but what’s the solution? Bayesian inference.

Bayes’ Theorem states:

P(A | B) = P(B | A) × P(A) / P(B)

This equation states that the probability of A given B is equal to the probability of B given A, times the probability of A, divided by the probability of B. Don’t be confused. Let’s restate this for how we normally use it in medicine. What we care about is the probability that our hypothesis, H, is true, whatever our hypothesis might be. We test our hypothesis with a test, T; this might be a study, an experiment, or even a lab test. So here we can substitute those terms:

P(H | T) = P(T | H) × P(H) / P(T)

Normally, when we do a test (T), the test result is reported as the probability that the test or study (or the data collected in the study) fits the hypothesis. This is P(T | H). It assumes that the hypothesis is true, and then tells us the probability that the observed data would make our test or study positive (typically taken as a p value < 0.05). But this isn’t helpful to us. I want to know if my hypothesis is true, not just whether the data collected and the tests performed line up with what I would expect to see if my hypothesis were true.
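To make the distinction concrete, here is a minimal numeric sketch. The numbers are hypothetical: a hypothesis with a 1% prior probability, tested by a study that comes up positive 95% of the time when the hypothesis is true and 5% of the time when it is false.

```python
# Hypothetical illustration of P(T|H) versus P(H|T).
p_h = 0.01              # prior probability the hypothesis is true, P(H)
p_t_given_h = 0.95      # P(T|H): chance of a positive result if H is true
p_t_given_not_h = 0.05  # P(T|not H): chance of a false positive

# Total probability of a positive result, P(T), by the law of total probability
p_t = p_t_given_h * p_h + p_t_given_not_h * (1 - p_h)

# Bayes' Theorem: P(H|T) = P(T|H) * P(H) / P(T)
p_h_given_t = p_t_given_h * p_h / p_t

print(f"P(T|H) = {p_t_given_h:.2f}")  # what the study reports
print(f"P(H|T) = {p_h_given_t:.2f}")  # what we actually want to know
```

Even with an impressive-looking P(T | H) of 0.95, the probability that the hypothesis is true given a positive result is only about 16%, because the prior was low.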

Frequentist statistics, the traditionally taught school of statistics mostly used in the biological sciences, is incapable of answering the question, “How probable is my hypothesis?” Confused by this technical point, most physicians and even many research scientists wrongly assume that when a test or study is positive or “statistically significant,” the hypothesis being tested is validated. This mistaken assumption is responsible for the vast majority of the mistakes in science and in the popular interpretation of scientific articles.


If Frequentism versus Bayesianism is confusing (I’m sure it probably is), let’s simplify it.

Imagine that you are standing in a room and you hear voices in the next room that sound exactly like Chuck Norris talking to Jean-Claude van Damme. Thrilled, you quickly record the voices. You develop a hypothesis that Chuck and van Damme are in the room next door to you. You use a computer to test whether the voices match known voice samples of the two martial artists. The computer test tells you that the voices are nearly a perfect match, with a p value of < 0.0001. This is what Frequentist statistical methods are good at. This statistical approach tells you the chance that the data you observed would exist if your hypothesis were true. Thus, it tells you the probability of the voice pattern you observed given that JCVD and Chuck are in the room next door; that is, the probability of the test given the hypothesis, or P(T | H).

But it doesn’t tell you if your hypothesis is true. What we really want to know is the probability that Chuck and the Muscles from Brussels are in the room next door, given the voices you have heard; that is, the probability of the hypothesis given the test, or P(H | T). This is the power of Bayesian inference, because it allows us to consider other information that is beyond the scope of the test, like the probability that the two men would be in the room in the first place compared to the probability that the TV is just showing a rerun of The Expendables 2.

Bayes tells us the probability of an event based on everything we know that might be related to that event, and it is updated as new knowledge is acquired.

Frequentism tells us the probability of an event based on the limit of its frequency in a test. It does not allow for the introduction of prior knowledge or future knowledge.

Our brains already work using Bayesian-like logic, so this should come naturally.


So we need Bayes to answer the question of how probable our hypothesis actually is. More specifically, Bayesian inference allows us to change our impression of how likely something is based on new data. So, at the start, we assume that a hypothesis has a certain probability of being true; then, we learn new information, usually from experiments or studies; then, we update our understanding of how likely the thing is to be true based on this new data. This is called Bayesian Updating. As I said, the good news is, you are already good at this. Our brains are Bayesian machines that continuously learn new information and update our understanding based on what we learn. This is the same process we are supposed to be using to interpret medical tests, but in the case of a scientific study, instead of deciding if a patient has a disease by doing a test, we must decide if a hypothesis is “true” (at least highly probable) based on an experimental study.
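Bayesian updating can be sketched numerically in odds form. This is a toy illustration, not from any study in this post: assume a 1% prior, and a series of independent positive studies, each with 90% power and alpha = 0.05, so each positive result multiplies the odds of the hypothesis by a likelihood ratio of (1 − beta)/alpha = 0.90/0.05 = 18.

```python
# A sketch of Bayesian updating in odds form (hypothetical numbers).
def update(prior_prob, likelihood_ratio):
    """Return the posterior probability after one piece of evidence."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

prob = 0.01  # start with a 1% prior that the hypothesis is true
for study in range(1, 4):
    prob = update(prob, 18)  # each positive study multiplies the odds by 18
    print(f"after study {study}: P(H) = {prob:.3f}")
```

A single positive study only moves a 1% prior to about 15%; it takes several independent replications before the hypothesis becomes probable. That is exactly the updating behavior described above.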

Let’s see how Bayesian inference helps us determine the true rates of Type I and Type II errors in a study.

Type I and Type II Errors

Scientific studies use statistical methods to test a hypothesis. If the study falsely rejects the null hypothesis (that there is no association between the two variables), then that is called a Type I Error, or a false positive (since it incorrectly lends credence to the alternative hypothesis). If there is an association that is not detected, then this is called a Type II Error, or false negative (since it incorrectly discounts the alternative hypothesis).

We generally accept that it’s okay to be falsely positive about 5% of the time in the biological sciences (a Type I Error). This rate is determined by alpha, usually set to 0.05; this is why we generally say that anything with a p value < 0.05 is significant. In other words, we are saying that it is okay to believe a lie 5% of the time.

The “power” of a study determines how many false negatives there will be. A study may not have been sufficiently “powered” to find something that was actually there (Type II Error). Power is defined as 1 – beta; most of the time, beta is 0.2, and this would mean that a study is likely to find 80% of true associations that exist. In other words, we miss something we wanted to find 20% of the time.

In order for the p value to work as intended, the study must be powered correctly. Too few or too many patients in the study creates a problem. So a power analysis should be performed before a study is conducted to make sure that the number of enrolled subjects (the n) is correct.
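As a sketch of what a power analysis estimates, the following simulation (with hypothetical event rates of 30% vs. 40%, not drawn from any study in this post) runs many two-arm trials and counts how often a simple two-proportion z-test reaches p < 0.05:

```python
import math
import random

def z_test_two_proportions(x1, n1, x2, n2):
    """Two-sided p value for H0: p1 == p2, normal approximation."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    # two-sided p value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulated_power(p1, p2, n_per_arm, trials=1000, alpha=0.05, seed=1):
    """Fraction of simulated trials that detect the true difference."""
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        x1 = sum(random.random() < p1 for _ in range(n_per_arm))
        x2 = sum(random.random() < p2 for _ in range(n_per_arm))
        if z_test_two_proportions(x1, n_per_arm, x2, n_per_arm) < alpha:
            hits += 1
    return hits / trials

for n in (100, 200, 400):
    print(f"n = {n} per arm: power ~ {simulated_power(0.30, 0.40, n):.2f}")
```

With too few patients per arm, most trials miss a real 10-percentage-point difference (a Type II error); the power climbs toward 80–90% only as the n grows.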

Most people who have taken an elementary statistics course understand these basic principles and most studies at least attempt to follow these rules. However, there is a big part still missing: not all hypotheses that we can test are equally likely, and that matters.

For example, compare these two hypotheses: ‘Cigarette smoking is associated with development of lung cancer’ versus ‘Listening to Elvis Presley music is associated with development of brain tumors.’ Which of these two hypotheses seems to be more likely, based on what we already know, either from prior studies on the subject or studies related to similar subject matter? Or if there are no studies, based on any knowledge we have, including our knowledge of basic sciences?

We need to understand the pre-study probability that the hypothesis is true in order to understand how the study itself affects the post-study probability. This is the same skill set we use when we interpret clinical tests: the test is used to modulate our assigned pretest probability and to generate a new posttest probability (what Bayesians call the posterior probability). So, too, a significant finding (or lack of a significant finding) is used to modulate the post-study probability that a particular hypothesis is true or not true.

Let’s say that we create a well-designed study to test our two hypotheses (the cigarette hypothesis and the Elvis hypothesis). It is most likely that the cigarette hypothesis will show a positive link and the Elvis hypothesis will not. But what if the cigarette hypothesis doesn’t find a link for some reason? Or the Elvis hypothesis does show a positive link? Do we allow these two studies to upend and overturn everything we currently know about the subjects? Not if we use Bayesian inference.

I’m not sure what the pre-study probabilities of these two hypotheses should be, but I would guess that there is a 99% chance, based on everything we know, that cigarette smoking causes lung cancer, and about a 1% chance that listening to Elvis causes brain tumors. Let’s see what this means in real life.

We can actually calculate how these pre-study probabilities affect our study, and how our study affects our assumptions that the hypotheses are true. Assuming a p value of 0.05 and a beta of 0.10, the positive predictive value (PPV, or post-study probability) that the cigarette hypothesis is true (assuming that our study found a p value of less than 0.05) is 99.94%. On the other hand, the PPV of the Elvis hypothesis, with same conditions, is only 15.38%. These probabilities assume that the studies were done correctly and that the resultant findings were not manipulated in any way. If we were to introduce some bias into the Elvis study (referred to as u), then the results change dramatically. For example, with u=0.3 (the author never did like Elvis anyway), the PPV becomes only 2.73%.
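These numbers can be checked with a short script. This is my sketch of the bias-adjusted PPV formula given in the math appendix at the end of this post; the 99% and 1% priors are the guesses above:

```python
def ppv(prior, alpha=0.05, beta=0.10, u=0.0):
    """Post-study probability that a 'significant' finding is true.

    prior: pre-study probability that the hypothesis is true
    u:     proportion of analyses that are biased (0 = no bias)
    """
    R = prior / (1 - prior)  # pre-study odds of a true relationship
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator

print(f"cigarettes (prior 99%):  PPV = {ppv(0.99):.2%}")         # ~99.94%
print(f"Elvis (prior 1%):        PPV = {ppv(0.01):.2%}")         # ~15.38%
print(f"Elvis with bias u = 0.3: PPV = {ppv(0.01, u=0.3):.2%}")  # ~2.73%
```

The same function, fed the same alpha and beta, reproduces all three figures quoted in the paragraph above.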


Bias comes in many forms and affects many parts of the production sequence of a scientific study, from design through implementation, analysis, and publication (or lack of publication). It is not always intentional on the part of the authors. Remember, intentionally or not, when people design studies, they often design the study in a way to show the effect they are expecting to find, rather than to disprove the effect (which is the more scientifically rigorous approach).

We might also conduct the same study over and over again until we get the finding we want (or more commonly, a study may be repeated by different groups with only one or two finding a positive association – knowing how many times a study is repeated is difficult since most negative studies are not published). If the Elvis study were repeated 5 times (assuming that there was no bias at all), then the PPV of a study showing an association would only be 4.27% (and add a bit of bias and that number drops dramatically, well below 1%). Note that these probabilities represent a really well done study, with 90% power. Most studies are underpowered, with estimates ranging in the 20-40% range, allowing for a lot more Type II errors.
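The repeated-testing arithmetic can be sketched with the multiple-studies formula from the appendix, assuming unbiased studies with 90% power, alpha = 0.05, and the 1% Elvis prior:

```python
def ppv_n_studies(prior, n, alpha=0.05, beta=0.10):
    """PPV when n independent, unbiased studies test the same hypothesis
    and at least one reports a 'significant' result."""
    R = prior / (1 - prior)  # pre-study odds of a true relationship
    numerator = R * (1 - beta ** n)
    denominator = R + 1 - (1 - alpha) ** n - R * beta ** n
    return numerator / denominator

for n in (1, 5, 10):
    print(f"n = {n:2d} independent studies: PPV = {ppv_n_studies(0.01, n):.2%}")
```

With one study the PPV is 15.38%, as before; with five studies it falls to the 4.27% quoted above, because each additional study adds another 5% chance of a false positive while the prior stays low.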


And remember, this type of analysis only applies to studies which found a legitimately significant p value, with no fraud or p-hacking or other issues.

What if our cigarette study didn’t show a positive result? What is the chance that a Type II error occurred? Well, the negative predictive value (NPV) would be only 8.76%, leaving a 91.24% chance that a Type II error occurred (a false negative), assuming that the study was adequately powered. If the power were only 30% (remember that the majority of studies are thought to be under-powered, in the 20-40% range), then the chance of a Type II error would be 98.65%. Conversely, if the Elvis study says that Elvis doesn’t cause brain tumors, its negative predictive value would be 99.89%.
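The corresponding negative predictive value can be sketched the same way; this reproduces the numbers in the paragraph above (cigarette prior 99%, Elvis prior 1%):

```python
def npv(prior, alpha=0.05, power=0.90):
    """Probability the hypothesis is really false, given a negative study."""
    beta = 1 - power
    true_negatives = (1 - alpha) * (1 - prior)  # correctly negative
    false_negatives = beta * prior              # missed true effects
    return true_negatives / (true_negatives + false_negatives)

print(f"cigarettes, 90% power: NPV = {npv(0.99):.2%}")          # ~8.76%
print(f"  chance of a Type II error = {1 - npv(0.99):.2%}")     # ~91.24%
print(f"cigarettes, 30% power: "
      f"chance of a Type II error = {1 - npv(0.99, power=0.30):.2%}")  # ~98.65%
print(f"Elvis, 90% power: NPV = {npv(0.01):.2%}")               # ~99.89%
```

A negative study of a highly probable hypothesis is almost certainly a false negative, while a negative study of an improbable hypothesis is almost certainly correct; the prior does most of the work.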

Thus, the real risk of a Type I or Type II error is dependent on study design (the alpha and beta boundaries), but it is also dependent upon the pretest probability of the hypothesis being tested.

Two more examples. First, let’s apply this first step to the previously mentioned retrospective prayer study. The author used a p value of 0.05 as a cut off for statistical significance and there was no power analysis, so we might assume the study was under-powered, with a beta of 0.5. We might also assume significant bias in the construction of the study (u=0.5). Lastly, how likely do we think that prayer 10 years in the future will affect present-day outcomes? Let’s pretend that this has a 1% chance of being true based on available data. We find that the positive predictive value of this study is only 1.42%, if our bias assumption is correct, and no better than 9.17% if there were no bias, p-hacking, or other manipulations of any kind.


When you read about this study, you knew already in your gut what Bayes tells you clearly: despite a statistically significant result published in a prestigious journal, there is an incredibly low chance that the hypothesis is true. The study makes the hypothesis slightly more likely than the 1% pretest probability we assigned, but it remains so unlikely that we should not change our practices, as the author suggested.

We can use as a second example the BEAM Trial. Recall that this trial is the basis of using magnesium for prevention of cerebral palsy. We previously demonstrated that the findings of the trial were not statistically significant in the first place and are the result of p-hacking; but what if the results had been significant? Let’s do the analysis. We know that the p value used to claim significance for reduction of non-anomalous babies was p = 0.0491, so we can use this actual number in the calculations. The power of the trial was supposed to be 80% but the “significant” finding came in an underpowered subset analysis, so we will set beta equal to 0.6. There were four previous studies which found no significant effect, so we will consider 5 studies in the field. Bias is a subjective determination, but the bias of the authors is clearly present and the p-hacking reinforces this. We will set bias to 0.4. Lastly, we must consider what the probability of the hypothesis was prior to the study. Since four prior studies had been done which concluded no effect, the probability that the alternative hypothesis was true must be considered low prior to the trial; in fact, prior data had uniformly suggested increased neonatal death. Let’s be generous and say that it was 10%. What do we get?


Since this was the fifth study (with the other four showing no effect), then the best post-study probability is 17.91%; that number doesn’t discount for bias. Incorporating the likely bias of the trial would push the PPV even lower, to about 3%. Of course, all of that assumes that the trial had a statistically significant finding in the first place (it did not). The negative predictive value, then, is the most precise number, which stands at 96.79%.

Either way, it is probably safe to say that as the data stands today, there is only about a ~3% chance that magnesium is associated with a reduced risk of CP in surviving preterm infants. Recall that Bayesian inference allows us to continually revise our estimate of the probability of the hypothesis based on new data. Since BEAM was published, two additional long-term follow-up studies of children exposed to antenatal magnesium have been published which showed no neurological benefit from the exposure in school-aged children. With Bayesian updating, we can use this data to further refine the probability that magnesium reduces the risk of cerebral palsy. If the estimate was ~3% after the BEAM Trial, it is now significantly lower based on this new information.

While we are discussing magnesium, how about its use as a tocolytic? Hundreds of studies have shown that it is ineffective, with at least 16 RCTs. What if a well-designed study were published tomorrow that showed a statistically-significant effect in reducing preterm labor? How would that change our understanding? With the overwhelming number of studies and meta-analyses that have failed to show an effect, let’s set the pre-study probability to about 1%:


So just counting the RCT evidence against magnesium as a tocolytic, a perfectly designed, unbiased study would have no more than a 1.71% positive predictive value. A potential PPV so low means that such a study should not take place (though people continue to publish and research in this dead-end field). Bayesian inference tells us that the belief that magnesium may work as a tocolytic or to prevent cerebral palsy has roughly the same evidence as the idea that remote intercessory prayer makes hospital stays of ten years ago shorter.

Here’s the general concept. Let’s graphically compare a high probability hypothesis (50%) to a low probability one (5%), using both an average test (lower power with questionable p values or bias) and a really good study (significant p value and well-powered).


Take a look at the positive and negative predictive values; they are clearly influenced by the likelihood of the hypothesis. Many tested hypotheses are nowhere near as likely as 5%; the hypotheses of most epidemiological studies and other “data-mining” type studies may carry more like a 1 in 1,000 or even 1 in 10,000 odds.

3. Rejecting the Null Hypothesis Does Not Prove the Alternate Hypothesis

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings. – Andrew Gelman

When the null hypothesis is rejected (in other words, when the p value is less than 0.05), that does not mean that THE alternative hypothesis is automatically accepted or that it is even probable. It means that AN alternative hypothesis is probable. But the alternative hypothesis espoused by the authors may not be the best (and therefore most probable) alternative hypothesis. For example, I may hypothesize that ice cream causes polio. This was a widely held belief in the 1940s. If I design a study in which the null hypothesis is that the incidence of polio is not correlated to the incidence of ice cream sales, and then I find that it is, then I reject the null hypothesis. But this does not then mean that ice cream causes polio. That is one of many alternative hypotheses and much more data is necessary to establish that any given alternative hypothesis is probable.


This is the “correlation equals causation” mistake, which is the most likely error with this type of data. I’m fairly sure that organic foods do not cause autism, though the two are strongly correlated (but admittedly, I could be wrong):


A study published just this week revealed a correlation between antenatal Tylenol use and subsequent behavioral problems in children. Aside from the fact that this was a poorly controlled, retrospective, epidemiological study (with almost no statistical relevance and incredibly weak pretest probability), even if it were better designed and did indeed determine that the two factors were significantly correlated, it still would be no more evidence that Tylenol causes behavioral problems in children than the above data is evidence that organic food causes autism. There are a myriad of explanations for the correlation. Perhaps mothers who have more headaches raise children with more behavioral problems? Or perhaps children with behavioral problems cause their mothers to have more headaches in subsequent pregnancies? Correlation usually does not equal causation.

But if a legitimate finding were discovered, and it appears likely that it is causative, we must next assess how big the observed effect is.

4. Is the magnitude of the discovered effect clinically significant?

There are an almost endless number of discoveries that are highly statistically significant and probable, but they fail to be clinically significant because the magnitude of the effect is so small. We must distinguish between what is significant and what is important. In other words, we may accept the hypothesis after these first steps of our analysis, but still decide not to make use of the intervention in our practice.

I will briefly mention again a previous example: while it is likely true that using a statin drug may reduce a particular patient’s risk of a non-fatal MI by 1 percentage point over ten years, is this intervention worth the $4.5M it costs to do so (let alone all of the side effects of the drug)? I’ll let you (and your patient) decide. Here is a well-written article explaining many of these concepts with better examples. I will steal an example from the author. He says that saying that a study is “significant” is sometimes like bragging about winning the lottery when you only won $25. “Significant” is a term used by Frequentists whenever the observed data is below the designated p value, but that doesn’t mean the observed association really means anything in practical terms.
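To make “significant vs. important” concrete, here is a back-of-the-envelope sketch converting an absolute risk reduction into a number needed to treat and a cost per event prevented. The $4,500-per-patient-per-year drug cost is purely hypothetical, chosen only so the arithmetic reproduces the $4.5M figure quoted above:

```python
def nnt(absolute_risk_reduction):
    """Number needed to treat to prevent one event."""
    return 1 / absolute_risk_reduction

def cost_per_event_prevented(arr, annual_cost_per_patient, years):
    """Total spend to prevent a single event over the treatment period."""
    return nnt(arr) * annual_cost_per_patient * years

arr = 0.01  # a 1-percentage-point absolute risk reduction over ten years
print(f"NNT = {nnt(arr):.0f} patients treated to prevent one event")
print(f"cost = ${cost_per_event_prevented(arr, 4500, 10):,.0f} per event prevented")
```

A 1-percentage-point reduction means treating 100 patients for ten years to prevent one non-fatal MI; whether that trade is worth it is a value judgment, not a p value.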

5. What cost is there for adopting the intervention into routine practice?

Cost comes in two forms, economic and noneconomic. The $4.5M cost of preventing one nonfatal MI in ten years is not likely to be considered cost-effective by any honest person. But noneconomic costs come in the form of unintended consequences, whether physical or psychological. Would, for example, an intervention that decreased the size of newborns by 3 ounces in mothers with gestational diabetes be worth implementation? Well, maybe. That 3 ounces might, over many thousands of women, save a few shoulder dystocias or cesareans. So if it were free, and had no significant unintended consequences, then yes, it probably would be worth it. But few interventions are free, and virtually all expose patients to potential unintended harms. So the cost must be worth the benefit.

Studies do not always account for all of the costs, intended and unintended. So often the burden of considering the costs of implementation falls upon the reader of the study. Ultimately, deciding whether to use any intervention is a shared decision between physician and patient and one which respects the values of the patient. The patient must be fully informed, understanding precisely both the risks and the benefits in terms she can understand.

Was Twain Right? 

Many are so overwhelmed by the seemingly endless problems with scientific papers that the Twain comment of “Lies, Damned Lies, and Statistics” becomes almost an excuse to abandon science-based medicine and return to the dark ages of anecdotal medicine. Most physicians have been frustrated by some study that feels wrong but they can’t explain why it’s wrong, as the authors shout about their significant p value.

But Twain’s quote should be restated: There are lies, damned lies, and Frequentism. The scientific process is saved by Bayesian inference and probabilistic reasoning. Bayes makes no claim as to what is true or not true, only to what is more probable or less probable; Bayesian inference allows us to quantify this probability in a useful manner. Bayes allows us to consider all evidence, as it comes, and constantly refine how confident we are about what we believe to be true. Imagine if rather than reporting p values, scientific papers concluded with an assessment of how probable the reported hypotheses are. This knowledge could be used to directly inform future studies and draw a contrast between Cigarettes and Elvis. But for now we must begin to do this work ourselves.


Why are Frequentists so resistant to Bayesian inference? We will discuss this in later posts; but, suffice it to say, it is mostly because there seems to be a lot of guessing about the information used to make these calculations. I don’t really know how likely it is that listening to Elvis causes brain cancer. I said 1% but, of course, it is probably more like 1 in a billion. This uncertainty bothers a lot of statisticians, so they don’t want to use a system that is dependent on so much uncertainty. Many statisticians are working on methods of improving the processes for making these calculations, which will hopefully make our guesses more and more accurate.

Yet, the fact remains that the statistical methods that Frequentists want to adhere to just cannot provide any type of answer to the question, Does listening to Elvis cause brain cancer? So even a poor answer is better than no answer. I have patients to treat and I need to do the best I can today. Still, when trying to determine the probability of the Elvis hypothesis, I erred on the side of caution; by estimating even a 1% probability of this hypothesis, I remained open-minded. But even then, my study only said that there was a tiny chance that it did. If I’m worried about it, I can repeat the study a few times or power it better and get an even lower posterior probability. I am not though, so Don’t Be Cruel.

Oh, also remember this: the father of Frequentism, Fisher, perversely used his statistical methods to argue that cigarette smoking did not cause lung cancer – a fact that had been established by Bayesian inference.


Don’t keep reading below unless you really care about the math…

Power is defined as the probability of finding a true relationship, if it exists:

Power = 1 − β
Similarly, the probability of claiming a false relationship is bounded by alpha, which is the threshold that authors select for statistical significance, usually 0.05:

P(claiming a false relationship) = α
Recall that the formula for determining the positive predictive value (PPV) is:

PPV = (true positives) / (true positives + false positives)
First, we need to determine how many true positives occur, and how many total positives occur. We must consider R, which is the ratio of true relationships to false relationships in a given field:

R = (true relationships) / (false relationships)

This can be based upon generalizations about the particular research field (such as genomics, epidemiology, etc.) or specific information already known about the hypothesis from previous evidence. This R ratio is the basis of the prior probability, π; that is, the probability that the alternative hypothesis is true.

This prior probability is calculated according to the formula,

π = R / (1 + R)
We can therefore express the PPV either in terms of R:

PPV = (1 − β)R / (R + α − βR)

Or we can express the PPV in terms of π (see here for source):

PPV = (1 − β)π / ((1 − β)π + α(1 − π))

If a coefficient of bias, u, is considered, the equation becomes (see here for derivation):

PPV = ((1 − β)R + uβR) / (R + α − βR + u − uα + uβR)

Selecting the degree of bias is usually subjective, unless methods are used for estimating bias in particular fields or for particular subjects (the prevailing bias). If there have been multiple, independent studies performed, where n is the number of independent studies, not accounting for any bias, the equation becomes:

PPV = R(1 − β^n) / (R + 1 − (1 − α)^n − Rβ^n)

The False Positive Report Probability (FPRP) can be calculated simply as 1 − PPV, or directly, in the simple case, in terms of π as:

FPRP = α(1 − π) / (α(1 − π) + (1 − β)π)

Filed under Evidence Based Medicine