In human genetics, we measure the strength of statistical evidence using a variety of statistics: maximized likelihood ratios, LOD scores, and empirical p values. I argue here that these statistics have highly undesirable properties as evidence measures when applied to complex disorders. Among other deficiencies, I show that when following up on an interesting finding, they will tend to erroneously indicate diminished evidence as more data are considered (e.g., the LOD will tend to go down at a linked locus as the sample size increases). This violates a fundamental assumption underlying standard linkage and association designs, in which we first scan the genome for our best signals and then follow up at those genomic positions with additional data. I argue here for a coherent theoretical approach to formalizing statistical evidence measures, and I derive a set of minimal requirements that any evidence measure should meet, drawing heavily on an analogy with the thermometer. I speculate that measures of evidence that come closer to meeting these requirements will do a better job of finding and characterizing genes, and I propose an alternative evidence metric as a step in this direction.
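The follow-up phenomenon described above can be illustrated with a small simulation of the selection effect in a simple two-point linkage setting: when a locus is carried forward only because its first-stage LOD cleared a cutoff, the estimated linkage signal at that stage is inflated, and adding follow-up data can pull the maximized LOD back down even though the locus is genuinely linked. All parameters below (the true recombination fraction, sample sizes, cutoff, and replicate count) are illustrative assumptions chosen for this sketch, not values taken from the paper.

```python
import math
import random

def max_lod(r, n):
    """Two-point LOD for r recombinants in n informative meioses,
    maximized over the recombination fraction theta in [0, 0.5]."""
    theta = min(r / n, 0.5)
    if theta == 0.0:
        return n * math.log10(2.0)  # all non-recombinant: limit of the LR
    return r * math.log10(2.0 * theta) + (n - r) * math.log10(2.0 * (1.0 - theta))

# Illustrative (assumed) parameters: a truly but weakly linked locus.
random.seed(1)
TRUE_THETA = 0.35   # true recombination fraction (linked, since < 0.5)
N1, N2 = 60, 40     # first-stage and follow-up informative meioses
THRESH = 1.5        # LOD cutoff for calling the first-stage result "interesting"
REPS = 2000         # number of selected replicates to examine

picked = 0
went_down = 0
while picked < REPS:
    # First-stage sample; keep only replicates that clear the cutoff,
    # mimicking the practice of following up the best genome-scan signals.
    r1 = sum(random.random() < TRUE_THETA for _ in range(N1))
    l1 = max_lod(r1, N1)
    if l1 < THRESH:
        continue
    picked += 1
    # Follow-up sample at the same (truly linked) locus.
    r2 = sum(random.random() < TRUE_THETA for _ in range(N2))
    l12 = max_lod(r1 + r2, N1 + N2)
    if l12 < l1:
        went_down += 1

print(f"{went_down} of {picked} selected replicates saw the LOD decrease "
      f"({went_down / picked:.0%})")
```

Because only first-stage samples with fewer recombinants than expected clear the cutoff, the stage-one theta estimate is biased downward at the selected replicates; the follow-up data then regress toward the true (weaker) signal, and the printed fraction shows how often the combined maximized LOD falls below the first-stage LOD despite true linkage.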