Over at the “Uncommon Descent” blog, poster “niwrad” decided to dispute claims of high sequence similarity between human and chimpanzee genomes. “niwrad” posted a statistical test of human/chimp genome comparisons in September, 2010, and a follow-up post this week comparing two human genomes using the metric from the earlier post. These were brought to my attention by “CeilingCat” on the AtBC forum. What the pair of posts demonstrates is another instance of “intelligent design” creationism advocates engaging in mathematical hijinks. The “niwrad” performance has more to do with the style of illusionists than it does actual mathematical and statistical practice. What one has to do with these kinds of things is look for the sleight of hand. Because “niwrad” now has a pair of articles based on the same “trick”, it becomes easier to point out exactly where the prestidigitation happened and why it is reasonable to infer that “niwrad” knows full well that it is a trick.
Let’s review some highlights from “niwrad”’s initial post:
Supporters of the neo-Darwinian theory of evolution have a strong ideological motivation for minimizing the differences between humans and chimps, as they claim that these two species evolved from a common ancestor, as a result of random mutations filtered by natural selection. Now, I don’t personally believe that humans and chimps share a common ancestry, for a host of reasons that would take me too long to explain in this post. Nor do I attach much significance to the magnitude of the genetic differences between these two species, per se, because in my opinion, the fundamental differences between these creatures lie elsewhere. […]
[…] The comparison I performed was completely different from those usually performed by geneticists, because was purely statistical in nature. In a sense, it could be described as an application of the well-known Monte Carlo method. […]
[…] While there is only one possible method of comparing identity between strings of characters (the above pairwise comparison), there are many methods of comparing similarity. In other words, there are many measures of similarity, depending on the rules of pattern matching that we choose. […]
Any final result for a complete statistical similarity test (especially if it is a unique number) is meaningful only if: 1) the distance function is mathematically defined; 2) the rules for pattern matching and the formulas for calculating the result are explained in detail; 3) it is clearly stated which parts of the input strings are being examined; 4) in the event that computer programs were used to perform the comparison, the source codes and algorithms are provided. My explanations below have the goal to meet the three first constraints. To satisfy the fourth condition, the source file of the Perl script used for the test is freely downloadable here.
For each pair of homologous chromosomes A and B, a PRNG (pseudo-random number generator) generates 10,000 uniformly distributed pseudo-random numbers which specify the offset, or starting point, of 10,000 30-base patterns that are contained in source chromosome A. The 30BPM test involves searching for all 10,000 of these DNA sub-strings of chromosome A in our target chromosome B. Now let F be the number of patterns located (at least once) in chromosome B. The 30BPM similarity is simply defined as F/100 (minimum value = 0%, maximum value = 100%). The absolute difference between 10,000 and F (minimum 0, maximum 10,000) is the 30BPM distance. […] It can easily be seen that the 30BPM distance will be zero (30BPM similarity = 100%) if the two strings are identical. In an additional test which I performed on two random 100 million-base DNA strings, the 30-BPM distance was 10,000 (i.e. no patterns on A were located in B). […]
The results obtained are statistically valid. The same test was previously run on a sampling of 1,000 random 30-base patterns and the percentages obtained were almost identical with those obtained in the final test, with 10,000 random 30-base patterns. When human and chimp genomes are compared, the X chromosome is the one showing the highest degree of 30BPM similarity (72.37%), while the Y chromosome shows the lowest degree of 30BPM similarity (30.29%). On average the overall 30BPM similarity, when all chromosomes are taken into consideration, is approximately 62%. Here we have the classic case of the glass which some people perceive as being half-full, while others perceive it as being half-empty. When compared to two random strings which are 0% similar, 62% is a very large value, so nobody would deny that human and chimp genomes are quite similar! On the other end, 62% is a very low value when compared to the more than 95% similarity percentages which are published by bioinformatics evolutionary researchers. Now, I realize that it may seem somewhat arbitrary to choose 30-base-long patterns, as I did in my test, and indeed it is arbitrary to some degree. However, if the two genomes were really 95% similar or more, as is commonly claimed, also a 30BPM statistical test should produce 95% results, and it does not.
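Emphasis added to “niwrad”’s central claim, the closing assertion that genomes which are 95% similar should also score 95% on his “30BPM” test.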
The claim is, of course, poppycock. Anyone with the slightest pretension to an understanding of probability or statistics would recognize that the proposed “30BPM” metric is non-linear and not directly comparable to straight-up sequence similarity numbers. What’s truly ironic is that if “niwrad” were slightly more astute, he might have realized that his “30BPM” metric actually confirms the high sequence similarity results that he claims to have rebutted.
And that brings us to “niwrad”’s second post, the one that aims to apply his “30BPM” metric to intra-specific genome comparisons, this time done as a human-to-human comparison.
One reader suggested applying an identical test in order to compare two human genomes. That sounded like a very good idea to me, so I downloaded another human genome dataset from NCBI and performed a test.
Finally, the average number of pattern matches per chromosome, shown at the bottom of the table, was very different in the two cases: 9616 for human vs. human comparisons, but only 6173 for chimp vs. human comparisons. The average number of patterns without a match for human vs. human comparisons was (10000 – 9616) = 384, or in percentage terms, 384/10000 = 3.84%. The average number of patterns without a match in human vs. chimp comparisons was (10000 – 6173) = 3827, or in percentage terms, 3827/10000 = 38.27%, which is almost ten times greater.
So the bottom-line question is: if, as many evolutionists say, chimpanzee and human genomes are 99% identical, how “identical” are two human genomes?
“niwrad”’s final question is interesting for the very salient reason that he did not provide an answer for it, even though his whole trick depends on the conceit that he has developed a better metric for quantifying sequence similarity than that used by actual geneticists. There is a reason why “niwrad” failed to answer, though: claiming that there is only 96.16% sequence similarity between two human genomes is manifestly risible. We know that the “trick” involved here is to confuse genetic sequence similarity with the “30BPM” metric, and that when faced with an obviously nonsensical outcome, “niwrad” punted rather than make explicit the full ridiculousness of his claim.
Above, I mentioned that “niwrad”’s metric actually confirms high sequence similarity values. Here’s how that happens. First, one needs to realize that one doesn’t need “Monte Carlo” techniques to evaluate “niwrad”’s “30BPM” metric: we can develop its properties with the usual probabilistic equations. The parameters of interest to us are the rate of change per base (C), the length of the analysis sequence (K), and the probability of a match (p). If we assume a uniform distribution of changes, with each base changing independently of the others, then our model is simply the probability p that we do not observe a change anywhere within our analysis window of length K at a particular rate of change C. And that is simply expressed as

$$p = (1 - C)^{K}$$
Besides being simple, it is obviously also nonlinear. Notice that “niwrad” made quite a fuss about how his metric did what everyone expects for the endpoints of the distribution, where complete sequence identity happened and where complete randomness obtained. Notice that “niwrad” did not go anywhere near calibrating his metric against an expectation concerning a sequence with a known amount of similarity. There’s a reason for that: one can’t blather about greater-than-expected dissimilarity if one actually calibrates the technique against known amounts of sequence similarity.
For example, what is the expected “30BPM” result when sequence similarity is actually 99%? We just set C = 0.01 and K = 30 in the match probability p = (1 − C)^K to yield:

$$p = (1 - 0.01)^{30} = 0.99^{30} \approx 0.740$$
Similarly, when sequence similarity is 99.9% (C = 0.001), the “30BPM” expected result is:

$$p = (1 - 0.001)^{30} = 0.999^{30} \approx 0.970$$
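This is the calibration that “niwrad” skipped, and it is easy enough to sketch. The Python below is my own illustration, not “niwrad”’s Perl script: it builds a random sequence, makes a copy with independent point substitutions at a known rate (no insertions or deletions, which is an assumption of the simple model), and then counts how many randomly chosen 30-base windows from the first sequence turn up anywhere in the second. The function names and the small sequence length and sample count are hypothetical choices made to keep the run short.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Return a copy of seq with each base independently substituted
    by a different base with probability `rate` (no indels)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def bpm_similarity(a, b, k=30, samples=2000, rng=None):
    """Fraction of randomly placed k-base windows of `a` that occur
    at least once, anywhere, in `b`."""
    rng = rng or random.Random()
    # Index every k-base substring of b so each lookup is a set membership test.
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    found = 0
    for _ in range(samples):
        start = rng.randrange(len(a) - k + 1)
        if a[start:start + k] in kmers_b:
            found += 1
    return found / samples

def calibrate(similarity, length=100_000, k=30, samples=2000, seed=1):
    """Measure the k-base pattern-match score for a pair of sequences
    whose per-base similarity is known by construction."""
    rng = random.Random(seed)
    a = "".join(rng.choice(BASES) for _ in range(length))
    b = mutate(a, 1.0 - similarity, rng)
    return bpm_similarity(a, b, k, samples, rng)

if __name__ == "__main__":
    for s in (1.0, 0.999, 0.99, 0.95):
        # Compare the measured score with the model's (1 - C)^K.
        print(f"similarity {s}: measured {calibrate(s):.3f}, model {s ** 30:.3f}")
```

Under these assumptions the measured score should track (1 − C)^K rather than the underlying similarity, to within sampling noise. Real genomes also differ by insertions, deletions, and rearrangements, so this is a check on the arithmetic, not a model of genome comparison.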
So, what about “niwrad”’s “30BPM” numbers that he obtained empirically? We can convert those back into sequence similarity numbers, which are not the same thing as “30BPM” numbers at all. The equation is simply a rearrangement of the match probability p = (1 − C)^K:

$$C = 1 - p^{1/K}, \qquad \text{sequence similarity} = 1 - C = p^{1/K}$$

“niwrad”’s average “30BPM” value for the human-chimp comparison was 0.6173, giving a sequence similarity estimate of 0.984.
“niwrad”’s average “30BPM” value for the human-human comparison was 0.9616, giving a sequence similarity estimate of 0.9987.
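For anyone who wants to check those two conversions, here is a minimal Python sketch of the forward and inverse relationships. The function names are mine, and the model is the same simple one used throughout: independent, uniformly distributed single-base changes and a 30-base window.

```python
def expected_bpm(similarity, k=30):
    """Expected pattern-match fraction for a given per-base similarity,
    assuming independent, uniformly distributed changes: (1 - C)^K."""
    return similarity ** k

def similarity_from_bpm(bpm, k=30):
    """Invert the model: per-base similarity implied by a pattern-match
    fraction, 1 - C = p^(1/K)."""
    return bpm ** (1.0 / k)

if __name__ == "__main__":
    for label, bpm in (("human-chimp", 0.6173), ("human-human", 0.9616)):
        sim = similarity_from_bpm(bpm)
        print(f"{label}: 30BPM {bpm:.4f} -> sequence similarity {sim:.4f}")
```

The two functions are inverses of each other, so feeding a similarity through `expected_bpm` and back through `similarity_from_bpm` returns the starting value up to floating-point rounding.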
I should note that “niwrad”’s “30BPM” metric becomes bloody useless at a point far short of completely random sequences. What point is that? I’m glad that you asked. Given a sample of 10,000 analysis windows, a reasonable threshold of usability is the point where you expect only half a match among those 10,000 samples. That sets p at 0.5/10,000 = 0.00005 and gives C = 1 − p^(1/30) = 0.28116. That is, any sequence similarity of less than about 0.719 will typically look exactly the same in “30BPM” terms and be ranked as having 0% similarity.
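The threshold arithmetic can be reproduced in a few lines of Python. This is a sketch under the same simple model; the function names are mine. The first function uses p = 0.5/10,000, the value used above. The second is a stricter variant I include only for comparison: the p at which the chance of seeing at least one match in 10,000 independent samples is exactly one half.

```python
def floor_by_expected_matches(samples=10_000, k=30, expected=0.5):
    """Rate of change C at which the expected number of matching
    windows among `samples` drops to `expected`."""
    p = expected / samples           # per-window match probability
    return 1.0 - p ** (1.0 / k)      # invert p = (1 - C)^K

def floor_by_at_least_one(samples=10_000, k=30, chance=0.5):
    """Rate of change C at which the probability of at least one match
    among `samples` independent windows equals `chance`."""
    p = 1.0 - (1.0 - chance) ** (1.0 / samples)
    return 1.0 - p ** (1.0 / k)

if __name__ == "__main__":
    c = floor_by_expected_matches()
    print(f"expected-matches criterion: C = {c:.5f}, similarity = {1 - c:.3f}")
    c = floor_by_at_least_one()
    print(f"at-least-one criterion:     C = {c:.5f}, similarity = {1 - c:.3f}")
```

Either way the floor lands at a sequence similarity a little above 70%, which is nowhere near random.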
The “30BPM” metric deployment by “niwrad” does exactly what it was designed to do: exaggerate dissimilarity. It’s a magic trick intended to make an inconvenient fact disappear. It is a fundamentally dishonest exercise.
Update: Fixed the discrepancy between the symbols I defined and what I used in the equations. References to R should have been C, and now are.