It’s Either Normal … Or It Isn’t? Is this true?

Well, yes. This black-and-white statement sounds false to anyone with even a little bit of Bayesian thinking, but let me explain. Let’s imagine that we develop a test for an imaginary hormone called Mollynium. We study a “normal” population and determine that normal levels of Mollynium range from 3 to 12 units, with an average of 7.5. Essentially, we define the normal range of Mollynium as the mean ± 2 standard deviations, which in this fictional case is 3 to 12 units. We assume that patients with a Mollynium level greater than 12 units have hypermollynemia, and those with a level below 3 units have hypomollynemia. Let’s say that the purpose of Mollynium is to make people happy, so people with low levels tend towards sadness and people with high levels are giddy all the time.

If a patient presents with normal levels of happiness, we would expect to find her Mollynium level between 3 and 12 units. Nevertheless, some patients who are completely “normal” will have Mollynium levels above 12 or below 3. The mean ± 2 standard deviations doesn’t actually capture everyone who is “normal” (if levels are normally distributed, roughly 5% of healthy people fall outside it), but we have to draw a line in the sand somewhere.
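To put a rough number on how many “normal” patients land outside the range, here is a minimal sketch. It assumes Mollynium is normally distributed in healthy people, so that the 3 to 12 range corresponds to the 7.5 average plus or minus two standard deviations of 2.25 units. The function name is hypothetical and purely illustrative.

```python
from statistics import NormalDist

# Assumption: healthy Mollynium levels follow a normal distribution.
MEAN = 7.5               # average level given in the example
SD = (12 - 3) / 4        # the 3 to 12 range spans four standard deviations
LOW, HIGH = 3.0, 12.0    # limits of the "normal" range


def fraction_outside_range(mean=MEAN, sd=SD, low=LOW, high=HIGH):
    """Share of healthy people whose level falls outside [low, high]."""
    healthy = NormalDist(mean, sd)
    below = healthy.cdf(low)        # healthy, but flagged as "hypo"
    above = 1 - healthy.cdf(high)   # healthy, but flagged as "hyper"
    return below + above


frac = fraction_outside_range()
print(f"{frac:.1%} of healthy people fall outside the normal range")
```

Under that assumption, about 1 in 22 perfectly healthy people would be flagged as abnormal by the lab value alone.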

What if a sad patient presents? If we check the level and discover that it is only 1 unit, then we might reasonably conclude that the patient needs more Mollynium. Similarly, someone who can’t stop laughing and whose level is 15 might have too much. There are other reasons why a person might be too sad or too happy, though. Many sad or giddy patients will present with normal Mollynium levels and have a different problem – say, depression or a frontal lobe stroke.

The problem is that some patients will present with Mollynium levels that are abnormal compared with the reference population, but that “abnormal” level is not actually the reason that they are too happy or too sad. A patient might have a level of 14 units that happens to be normal for her, while the real cause of her uncontrollable laughter is a frontal lobe lesion. So from a Bayesian perspective, the diagnosis is never certain, just more likely or less likely; yet we are always open to the idea that sometimes when you hear hoofbeats, it actually is a zebra. Thus the Bayesian isn’t disturbed when the patient doesn’t fit the expected lab parameters. But there is a major pitfall that sometimes occurs in clinical thinking: diagnostic (or therapeutic) drift.

Diagnostic drift may have more than one meaning, but essentially it occurs when the parameters that define a diagnosis are expanded (for whatever reason), leading to more diagnoses. In the same way, I use the term “therapeutic drift” to indicate either over-treatment that results from over-diagnosis or excess treatment with an intervention (if some is good, more must be better).

Let’s talk about real examples. One method of determining amniotic fluid volume in pregnancy is to use the Amniotic Fluid Index, or AFI (it is probably not the best way). The AFI is the sum of the maximum vertical depths of amniotic fluid measured in each of four quadrants and is reported in centimeters. We define “normal” as between 5 and 25 cm. In theory, these values represent the mean ± 2 standard deviations. Outside of these values, we would expect to see an increase in abnormal outcomes or associated problems, and, generally speaking, we do. So if a pregnant woman has an AFI of 6 cm, is it “normal” or not? Well, it is below the average, certainly, but it is still normal. She should be treated no differently than if her fluid were 12 cm. Similarly, 23 cm is definitely above average, but is still “normal.” The patient should be reassured that her fluid is normal and should be treated no differently than if it were 12 cm.

But this isn’t what happens. Instead, the patient is often treated as abnormal. She is told that her fluid is “low normal” or “high normal” and offered interventions, either in the form of extra testing or other things believed to affect the fluid level. This over-intervention will in some cases lead to harm with no evidence of benefit (if there were evidence of benefit, we would change the definition of “normal”). Does an AFI of 6 cm mean that a patient doesn’t have ruptured membranes or uteroplacental insufficiency? Does an AFI of 24 cm mean that a patient doesn’t have uncontrolled diabetes or fetal tracheal atresia? Of course not. But from a Bayesian perspective, there should be more evidence that would factor into assessing the probabilities of those diagnoses. Normal is normal. Labelling a person with “low” or “high” amniotic fluid in these situations is an example of diagnostic drift, and it is harmful if it leads to more unnecessary interventions without benefit to the patient.

The problem with drift is, where does it end? How about an AFI of 7 cm? 8 cm? Where is the line in the sand? Well, it’s at 5 cm. Another example might be ultrasound estimated fetal weight. If we decide to do extra testing and/or monitoring on fetal weights below the 10th percentile or above the 90th percentile, what do we do with the patient whose baby is at the 11th or 89th percentile? Nothing. Tell her that her baby is normal (unless you have other data that would indicate a problem).

What about the margin of error of our tests, though? Ultrasound just isn’t that accurate, so that 11th-percentile fetus might actually be at the 8th percentile given the margin of error of the test. But this error is already reflected in our testing parameters. We don’t really care about small babies until they are below the 5th percentile (or even the 3rd). So by testing fetuses below the 10th percentile (and not the 11th), we capture this whole population, even given the errors in the test. Extending the workup to the 11th-percentile fetus on the grounds of measurement error is just drift.
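The buffer built into the cutoff can be illustrated with a small simulation. This is only a sketch under made-up assumptions: fetal size is expressed as a z-score on the reference chart, and ultrasound adds random error with a standard deviation of half a population standard deviation (a hypothetical figure, not a measured one). It compares how many fetuses truly below the 5th percentile are flagged when the screening cutoff is the 10th percentile versus the 5th.

```python
import random
from statistics import NormalDist


def capture_rates(n=100_000, error_sd=0.5, seed=0):
    """Share of truly small fetuses flagged by a 10th vs 5th percentile cutoff.

    error_sd is a hypothetical measurement error, in population SD units.
    """
    rng = random.Random(seed)
    chart = NormalDist()                 # reference chart as z-scores
    truly_small = chart.inv_cdf(0.05)    # the babies we actually care about
    cutoff_10th = chart.inv_cdf(0.10)    # screening threshold with a buffer
    cutoff_5th = chart.inv_cdf(0.05)     # screening threshold with no buffer

    small = caught_10th = caught_5th = 0
    for _ in range(n):
        true_size = rng.gauss(0, 1)
        if true_size >= truly_small:
            continue                     # not truly small, skip
        small += 1
        measured = true_size + rng.gauss(0, error_sd)  # noisy ultrasound estimate
        caught_10th += measured < cutoff_10th
        caught_5th += measured < cutoff_5th
    return caught_10th / small, caught_5th / small


rate_10th, rate_5th = capture_rates()
print(f"Flagged with 10th percentile cutoff: {rate_10th:.0%}")
print(f"Flagged with 5th percentile cutoff:  {rate_5th:.0%}")
```

The exact figures depend entirely on the error assumed, but the wider cutoff flags more of the truly small fetuses than a cutoff set at the 5th percentile itself, which is the sense in which the test’s error is already reflected in the threshold.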

Drift happens when we treat a temperature of 100.3°F as a fever. It happens when we treat a fatigued patient with a normal TSH for hypothyroidism anyway because her TSH is “high normal.” It happens in all fields of medicine and leads to over-diagnosis of everything from myocardial infarctions to cancer, with all of the unnecessary interventions and harms that one might expect.

Therapeutic drift is a natural consequence of diagnostic drift. The more we relax our standards for diagnosis, the more we apply treatments. When we treat healthy people, we usually have good outcomes. This, in turn, reinforces the behavior because the treatments seem to lead to good outcomes.

Therapeutic drift happens in other ways, too. If 600 IU of vitamin D per day is good for you (I’m not saying it is or isn’t), then 1200 or 1800 IU must be even better. If a vitamin D blood level of at least 20 ng/mL is healthy, then a level of at least 30 ng/mL must be even healthier. If a Pap smear every 3 years is good, then one every year must be amazing. You can probably think of dozens of other examples. More is not always better, particularly when there are harms associated with more.

Less isn’t necessarily better either, of course. If a Cesarean delivery rate of 15% is better than 35%, that doesn’t necessarily mean that a rate of 5% is better than 15%. This is really the point of evidence-based medicine: to study large populations and determine the parameters for diagnosis and treatment that maximize benefits while minimizing risks. One practitioner within the scope of one practice cannot see and interpret enough data to make these decisions. So we have to rely on studies of large populations and trust the parameters that the data indicate.

This never means that we blindly accept the results of our tests, either. Bayesian reasoning frees the practitioner to realize that sometimes a patient has the disease, even though the test says that the patient does not. But that cognitive process, rooted in analyzing the total data and information available for the patient, does not excuse diagnostic drift.
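The Bayesian point can be made concrete with a short calculation. The sketch below applies Bayes’ theorem to a single test result; the sensitivity, specificity, and prior probabilities are hypothetical numbers chosen only for illustration, and the function name is not from any library.

```python
def posterior_probability(prior, sensitivity, specificity, test_positive):
    """Probability of disease after one test result, by Bayes' theorem."""
    if test_positive:
        p_result_if_disease = sensitivity
        p_result_if_healthy = 1 - specificity
    else:
        p_result_if_disease = 1 - sensitivity
        p_result_if_healthy = specificity
    numerator = prior * p_result_if_disease
    return numerator / (numerator + (1 - prior) * p_result_if_healthy)


# Hypothetical test characteristics, chosen only for illustration.
SENSITIVITY = 0.90
SPECIFICITY = 0.95

# Strong clinical suspicion (80%), but the test comes back negative.
suspicious_but_negative = posterior_probability(
    0.80, SENSITIVITY, SPECIFICITY, test_positive=False
)
# Weak clinical suspicion (5%), but the test comes back positive.
unlikely_but_positive = posterior_probability(
    0.05, SENSITIVITY, SPECIFICITY, test_positive=True
)

print(f"{suspicious_but_negative:.2f}")  # 0.30
print(f"{unlikely_but_positive:.2f}")    # 0.49
```

With these assumed numbers, a negative test in a patient who looks very much like she has the disease still leaves roughly a 30% chance that she does, while a positive test in a patient with little else to suggest the disease leaves it slightly less likely than not. The rest of the clinical picture moves the probability; redefining “normal” does not.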