<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Jones, Luskin, and Text</title>
	<atom:link href="http://austringer.net/wp/index.php/2007/01/31/jones-luskin-and-text/feed/" rel="self" type="application/rss+xml" />
	<link>http://austringer.net/wp/index.php/2007/01/31/jones-luskin-and-text/</link>
	<description>Wesley R. Elsberry&#039;s personal weblog, talking about falconry, science, antievolution, computation, and the broken body he lives in.</description>
	<lastBuildDate>Tue, 07 Feb 2012 21:25:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Austringer</title>
		<link>http://austringer.net/wp/index.php/2007/01/31/jones-luskin-and-text/comment-page-1/#comment-76392</link>
		<dc:creator>Austringer</dc:creator>
		<pubDate>Fri, 02 Feb 2007 13:24:23 +0000</pubDate>
		<guid isPermaLink="false">http://austringer.net/wp/?p=494#comment-76392</guid>
		<description>It has come to my attention that some people haven&#039;t kept track of the numbers for similarity for text comparisons. 

There are a few points to make.

First, in order to make a precise statement of the number of words copied between two texts, one would need to specify, at a minimum, the smallest run of words that will count toward that measure. In the trivial case, with a minimum run length of &lt;i&gt;one&lt;/i&gt;, every text is copied 100% verbatim from an unabridged dictionary (assuming correct spelling and no neologisms). We are not interested in the trivial case, though, and that makes things more interesting. If we search for a run length of, say, ten words (which is my default), we can be pretty confident that any two texts sharing one or more such runs of words do so because of a shared source: either one drew from the other or both drew from a third text. When one looks for such exact matches, one can generate a precise number that needs only one parameter to qualify it. Any two long but unrelated texts in the same language are likely to share a preponderance of runs of 1 word (it is like the dictionary comparison mentioned earlier), a substantial number of 2-word runs, and decreasing numbers of matches at runs of 3, 4, 5, or more words. By the time one gets to considering 5 words or more in a row, the absolute number of matches between unrelated texts should be close to zero. So even when considering completely verbatim copying, one can have analyses that report differing percentages of matches in a subject text simply because of the length of the run being considered to constitute a match.

Second, when one wants to find more than just simple verbatim copying, one will have to make more choices, and that means more potential variation in results. Words may be changed within a run, words may be inserted, or words may be deleted. &quot;Fourscore and seven long years ago&quot; should be recognized as having been derived from the Gettysburg Address despite having an inserted word not present in the original. One can choose how many words at a time may be changed, inserted, or deleted while the passage still counts as a match. This number has to be strictly less than the minimum run length being considered. When I analyse for 10-word runs, I allow up to 4 words to be changed, skipped, or deleted.

Third, making this a matter of algorithm rather than eyeballing removes the subjectivity from the analysis. I can choose my parameters, but once I&#039;ve done that, that&#039;s it. The number that pops out for one text being considered as the source of another is fixed given that choice of parameters. This is a Good Thing.

Fourth, when a percentage is reported, it depends critically upon &lt;i&gt;which way&lt;/i&gt; the analysis was run. The number one obtains to answer the question, &quot;How much of reference text A was copied from subject text B?&quot; is by no means the &lt;i&gt;same&lt;/i&gt; number as one gets by asking, &quot;How much of reference text B was copied from subject text A?&quot; At the extreme end, consider a chapter from &lt;i&gt;Moby Dick&lt;/i&gt; and the whole book as texts. The chapter is 100% copied from the book, but the book is only, say, 4% copied from the chapter.

Fifth, I like seeing a side-by-side comparison of texts as well as getting the numbers. That is why I have &lt;i&gt;always&lt;/i&gt; provided such views as supplements to the summary numbers when I&#039;ve done this sort of text comparison. The present instance is no exception, notwithstanding the apparent inability of some naysayers to notice and follow the provided links.

OK, so here is a list of numbers and what they mean for stuff being discussed here.

90.9% -- This is the number the Discovery Institute has settled on as representing how much of the KvD decision&#039;s section on whether ID is science (let&#039;s call it &quot;KvD-IDsci&quot; for short) was copied from the plaintiffs&#039; proposed findings of fact section dealing with the same topic (ppfof-IDsci). How did they get that? Somebody at the DI eyeballed it and said that&#039;s close enough. If they tasked someone else to do it again, there is no guarantee that the number would remain the same.

70% -- the proportion of copied text to uncopied text *100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 5 words in a run and up to 2 words being changed, skipped, or deleted. This is a liberal matching criterion.

66% -- the proportion of copied text to uncopied text *100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This is a conservative matching criterion, and the standard one I use for text matching.

48% -- the proportion of text copied by KvD-IDsci to text not copied there *100 in ppfof-IDsci, using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This one has confused at least one person, who seems to have thought that this was another number applied to the analysis of KvD-IDsci. Instead, this number indicates how much &lt;i&gt;of&lt;/i&gt; ppfof-IDsci was &lt;i&gt;used by&lt;/i&gt; Judge Jones, not how much &lt;i&gt;of&lt;/i&gt; KvD-IDsci came from there.

38% -- the proportion of copied text to uncopied text *100 in the KvD decision taken from the plaintiffs&#039; proposed findings of fact when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. The whole ruling has quite a bit of text that did not come from the PPFOF.</description>
		<content:encoded><![CDATA[<p>It has come to my attention that some people haven&#8217;t kept track of the numbers for similarity for text comparisons. </p>
<p>There are a few points to make.</p>
<p>First, in order to make a precise statement of the number of words copied between two texts, one would need to specify, at a minimum, the smallest run of words that will count toward that measure. In the trivial case, with a minimum run length of <i>one</i>, every text is copied 100% verbatim from an unabridged dictionary (assuming correct spelling and no neologisms). We are not interested in the trivial case, though, and that makes things more interesting. If we search for a run length of, say, ten words (which is my default), we can be pretty confident that any two texts sharing one or more such runs of words do so because of a shared source: either one drew from the other or both drew from a third text. When one looks for such exact matches, one can generate a precise number that needs only one parameter to qualify it. Any two long but unrelated texts in the same language are likely to share a preponderance of runs of 1 word (it is like the dictionary comparison mentioned earlier), a substantial number of 2-word runs, and decreasing numbers of matches at runs of 3, 4, 5, or more words. By the time one gets to considering 5 words or more in a row, the absolute number of matches between unrelated texts should be close to zero. So even when considering completely verbatim copying, one can have analyses that report differing percentages of matches in a subject text simply because of the length of the run being considered to constitute a match.</p>
<p>Second, when one wants to find more than just simple verbatim copying, one will have to make more choices, and that means more potential variation in results. Words may be changed within a run, words may be inserted, or words may be deleted. &#8220;Fourscore and seven long years ago&#8221; should be recognized as having been derived from the Gettysburg Address despite having an inserted word not present in the original. One can choose how many words at a time may be changed, inserted, or deleted while the passage still counts as a match. This number has to be strictly less than the minimum run length being considered. When I analyse for 10-word runs, I allow up to 4 words to be changed, skipped, or deleted.</p>
<p>Third, making this a matter of algorithm rather than eyeballing removes the subjectivity from the analysis. I can choose my parameters, but once I&#8217;ve done that, that&#8217;s it. The number that pops out for one text being considered as the source of another is fixed given that choice of parameters. This is a Good Thing.</p>
<p>Fourth, when a percentage is reported, it depends critically upon <i>which way</i> the analysis was run. The number one obtains to answer the question, &#8220;How much of reference text A was copied from subject text B?&#8221; is by no means the <i>same</i> number as one gets by asking, &#8220;How much of reference text B was copied from subject text A?&#8221; At the extreme end, consider a chapter from <i>Moby Dick</i> and the whole book as texts. The chapter is 100% copied from the book, but the book is only, say, 4% copied from the chapter.</p>
<p>Fifth, I like seeing a side-by-side comparison of texts as well as getting the numbers. That is why I have <i>always</i> provided such views as supplements to the summary numbers when I&#8217;ve done this sort of text comparison. The present instance is no exception, notwithstanding the apparent inability of some naysayers to notice and follow the provided links.</p>
<p>OK, so here is a list of numbers and what they mean for stuff being discussed here.</p>
<p>90.9% &#8212; This is the number the Discovery Institute has settled on as representing how much of the KvD decision&#8217;s section on whether ID is science (let&#8217;s call it &#8220;KvD-IDsci&#8221; for short) was copied from the plaintiffs&#8217; proposed findings of fact section dealing with the same topic (ppfof-IDsci). How did they get that? Somebody at the DI eyeballed it and said that&#8217;s close enough. If they tasked someone else to do it again, there is no guarantee that the number would remain the same.</p>
<p>70% &#8212; the proportion of copied text to uncopied text *100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 5 words in a run and up to 2 words being changed, skipped, or deleted. This is a liberal matching criterion.</p>
<p>66% &#8212; the proportion of copied text to uncopied text *100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This is a conservative matching criterion, and the standard one I use for text matching.</p>
<p>48% &#8212; the proportion of text copied by KvD-IDsci to text not copied there *100 in ppfof-IDsci, using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This one has confused at least one person, who seems to have thought that this was another number applied to the analysis of KvD-IDsci. Instead, this number indicates how much <i>of</i> ppfof-IDsci was <i>used by</i> Judge Jones, not how much <i>of</i> KvD-IDsci came from there.</p>
<p>38% &#8212; the proportion of copied text to uncopied text *100 in the KvD decision taken from the plaintiffs&#8217; proposed findings of fact when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. The whole ruling has quite a bit of text that did not come from the PPFOF.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: C.E. Petit</title>
		<link>http://austringer.net/wp/index.php/2007/01/31/jones-luskin-and-text/comment-page-1/#comment-76206</link>
		<dc:creator>C.E. Petit</dc:creator>
		<pubDate>Thu, 01 Feb 2007 03:17:52 +0000</pubDate>
		<guid isPermaLink="false">http://austringer.net/wp/?p=494#comment-76206</guid>
		<description>It&#039;s pretty obvious to me that Luskin is not a litigator. The whole point of a well-drafted set of proposed findings of fact and conclusions of law is that they persuasively -- but fairly -- summarize the evidence as it relates to the pleadings. Some lawyers do go overboard; they&#039;re the ones who have little credibility with judges... and get the favorable results they obtain at trial overturned on appeal.

What Luskin has apparently forgotten is the concept of restricted choice. Civil litigation begins with the language of the complaint. That begins to restrict what a judge can say in response, and how. As the parties put in routine motions, evidence, and posttrial motions, the judge&#039;s expression gets further restricted; after all, if a judge&#039;s opinion does not clearly come from the pleadings and evidence before him/her, the judge&#039;s opinion will be overturned on appeal (and judges don&#039;t like that very much).

Luskin is too used to being able to write in a factual vacuum. Commentators not bound to the advocacy of a client&#039;s position enjoy this freedom, one that is denied to litigators... and judges.</description>
		<content:encoded><![CDATA[<p>It&#8217;s pretty obvious to me that Luskin is not a litigator. The whole point of a well-drafted set of proposed findings of fact and conclusions of law is that they persuasively &#8212; but fairly &#8212; summarize the evidence as it relates to the pleadings. Some lawyers do go overboard; they&#8217;re the ones who have little credibility with judges&#8230; and get the favorable results they obtain at trial overturned on appeal.</p>
<p>What Luskin has apparently forgotten is the concept of restricted choice. Civil litigation begins with the language of the complaint. That begins to restrict what a judge can say in response, and how. As the parties put in routine motions, evidence, and posttrial motions, the judge&#8217;s expression gets further restricted; after all, if a judge&#8217;s opinion does not clearly come from the pleadings and evidence before him/her, the judge&#8217;s opinion will be overturned on appeal (and judges don&#8217;t like that very much).</p>
<p>Luskin is too used to being able to write in a factual vacuum. Commentators not bound to the advocacy of a client&#8217;s position enjoy this freedom, one that is denied to litigators&#8230; and judges.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mark</title>
		<link>http://austringer.net/wp/index.php/2007/01/31/jones-luskin-and-text/comment-page-1/#comment-76205</link>
		<dc:creator>mark</dc:creator>
		<pubDate>Thu, 01 Feb 2007 03:16:29 +0000</pubDate>
		<guid isPermaLink="false">http://austringer.net/wp/?p=494#comment-76205</guid>
		<description>The &lt;i&gt;York Dispatch&lt;/i&gt; carried a version of Luskin&#039;s rant (19 December 2006). A week earlier, the &lt;i&gt;Dispatch&lt;/i&gt; editorial was &quot;&#039;Design&#039; folks lose poorly&quot; and a number of other articles in the York and other Pennsylvania papers similarly spoke of the Discovery Institute in less-than-admirable terms. In a letter to the editor of the &lt;i&gt;Dispatch&lt;/i&gt; responding to Luskin&#039;s whine (reproduced &lt;a href=&quot;http://divineafflatus.blogspot.com/2006/12/letter-to-editor-york-dispatch.html&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;), I noted there were several things Luskin failed to mention in his opinion piece that were at least as important as what he did mention. Furthermore, I noted, it is heartening to see that many editors and reporters &quot;do indeed understand the Discovery Institute&#039;s tactics.&quot;</description>
		<content:encoded><![CDATA[<p>The <i>York Dispatch</i> carried a version of Luskin&#8217;s rant (19 December 2006). A week earlier, the <i>Dispatch</i> editorial was &#8220;&#8216;Design&#8217; folks lose poorly&#8221; and a number of other articles in the York and other Pennsylvania papers similarly spoke of the Discovery Institute in less-than-admirable terms. In a letter to the editor of the <i>Dispatch</i> responding to Luskin&#8217;s whine (reproduced <a href="http://divineafflatus.blogspot.com/2006/12/letter-to-editor-york-dispatch.html" rel="nofollow">here</a>), I noted there were several things Luskin failed to mention in his opinion piece that were at least as important as what he did mention. Furthermore, I noted, it is heartening to see that many editors and reporters &#8220;do indeed understand the Discovery Institute&#8217;s tactics.&#8221;</p>
]]></content:encoded>
	</item>
</channel>
</rss>

