Nutonian’s Eureqa and Concerns of Overfitting
I’ve been using Nutonian’s Eureqa symbolic regression product extensively since early 2013. Back in late 2013, there was an article about Nutonian’s Eureqa that elicited comments.
An “A.E. Bartholomew” weighed in with a comment that the title, “Nutonian raises $4M to extract ‘laws of physics’ from data”, was “hyperbolic and misleading”.
That led me to write a comment of my own in the thread there. I’ll quote it here:
Wesley Elsberry Friday, October 25, 2013
The Eureqa software is deserving of the enthusiasm of Mr. Harris. The 2009 publication in Science is precisely about discovering laws of physics in data, specifically a Hamiltonian for a double-pendulum system. What Eureqa got was the motion data from a double-pendulum, and the Hamiltonian describing the system was in the output. There are a total of eight “laws of physics” listed on p.83 of that article found only by analysis of the physical data. I have been using Eureqa intensely for several months in my job (my opinions here are, of course, my own), and what I am routinely seeing emerge are “rules of” our field. I’m not in physics, so Eureqa isn’t finding “laws of physics” in our data, but I can say that Eureqa consistently finds the relationships we know about, and is providing valuable insight concerning relationships we had not expected.
I hadn’t checked back on that thread until recently, when I found both that my comment had attracted a reply and that comments have since been turned off. Since I can no longer reply in thread, the best I can do is reply here.
First off, the comment responding to me:
Hans Wolters Sunday, November 3, 2013
I have no first hand experience with the software, but it seems to me that overfitting would be a constant problem with that approach
And my response…
Experience with the tool does show that Nutonian saw the potential for overfitting and took steps to address it. First, though, overfitting is endemic to just about any modeling approach, so saying “overfitting would be a constant problem” is true, but exceedingly banal: you could say the same of any modeling tool. Second, his lack of familiarity with the Eureqa software means that Hans missed how Eureqa provides excellent means of figuring out when overfitting may be an issue, and how it uses principles of operation that mitigate the danger of overfitting.

Eureqa produces mathematical equations that represent relations in the input data. One can directly analyze a model that Eureqa proposes to determine whether it passes the smell test, a feature that is lacking in the usual multiple linear regression or many other machine learning approaches. The outputs of those are usually pretty darn opaque.

Eureqa also doesn’t report just one equation, unless the relation is utterly trivial and the solution is exact. What Eureqa provides to the researcher is the Pareto frontier of solutions: the solutions for which no other solution found is both simpler and more accurate. A typical run of Eureqa might yield a dozen or more possible equations to describe the data. One of those is likely to be as small as possible, a simple constant reflecting the mean of the modeled variable’s values. Then perhaps comes an equation with a constant term and a single explanatory variable, perhaps multiplied by a factor.

The size or complexity of the equations grows along the Pareto frontier as the error drops. This is expected: adding terms to a model improves fit, but may introduce overfitting, the problem Hans is concerned about. Eureqa graphs the Pareto frontier, making it visually simple to pick out where major reductions in error occur with small changes in equation complexity.
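To make the Pareto frontier idea concrete, here is a minimal sketch in Python. It is not Eureqa’s code or API. The candidate equations, the complexity scores, the error values, and the function names `pareto_frontier` and `best_tradeoff` are all made up for illustration. The sketch keeps only the candidates that no other candidate beats on both complexity and error, then looks for the step along the frontier where error falls the most per unit of added complexity.

```python
from typing import List, Tuple

# (equation text, complexity score, error on the data); all values are made up.
Model = Tuple[str, int, float]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep the models that no other model beats on both complexity and error."""
    frontier = []
    for name, size, err in models:
        dominated = any(
            s <= size and e <= err and (s < size or e < err)
            for _, s, e in models
        )
        if not dominated:
            frontier.append((name, size, err))
    # Order from simplest to most complex, as on a frontier plot.
    return sorted(frontier, key=lambda m: (m[1], m[2]))


def best_tradeoff(frontier: List[Model]) -> Model:
    """Return the model reached by the largest error drop per unit of complexity."""
    best = frontier[0]
    best_gain = 0.0
    for prev, cur in zip(frontier, frontier[1:]):
        step = cur[1] - prev[1]
        if step == 0:
            continue  # identical complexity: no trade-off to measure
        gain = (prev[2] - cur[2]) / step
        if gain > best_gain:
            best, best_gain = cur, gain
    return best


# Hypothetical candidates of the kind a symbolic regression run might report.
candidates: List[Model] = [
    ("y = 4.2", 1, 10.0),
    ("y = 1.1 + 2.0*x", 5, 3.0),
    ("y = 1.1 + 2.0*x + 0.01*z", 9, 2.9),
    ("y = 0.9 + 2.1*x + sin(z)", 9, 3.5),
    ("y = 1.1 + 2.0*x + 0.01*z + 0.002*x*z*z", 21, 2.8),
]

frontier = pareto_frontier(candidates)
knee = best_tradeoff(frontier)

for name, size, err in frontier:
    print(f"complexity={size:2d}  error={err:5.2f}  {name}")
print("largest error drop per unit of complexity:", knee[0])
```

In this toy data the `sin(z)` candidate drops out because another equation of the same complexity has lower error, and the two largest equations buy almost no error reduction for their added size. Always picking the last entry on the frontier is the overfitting habit; the single-variable equation is where the frontier bends.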
It is at these points on the Pareto frontier that one is likely to find the solution that is most informative about what the data has to convey. The overall goal of using Eureqa is finding precisely those models that are both simple and accurate. Eureqa can even express the error metric as the Akaike Information Criterion, which handily combines the concerns of good model fit and retaining model simplicity to avoid overfitting.

The Eureqa Desktop also tracks performance on the part of the data that has been reserved for validation. A reduction in performance on the validation set is a strike against a model, so such models tend not to be retained by Eureqa to report later. And Eureqa, by default, will put aside a random sample of your data for validation if you don’t specify a validation set yourself.

It’s almost like Nutonian engineered Eureqa to avoid delivering overfit models. Certainly, one can overfit using Eureqa, if one simply always chooses the most complex equation that Eureqa reports. But one is not obliged to overfit with Eureqa, and the software is engineered to make not overfitting easy.
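For readers who have not met these two safeguards, here is a rough sketch of each in Python. I am not claiming this is how Eureqa computes either one. The AIC function uses the textbook least-squares form, n·ln(RSS/n) + 2k, and the split is a plain random holdout. The function names, the 20% validation fraction, and the numbers are all assumptions chosen for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def aic_least_squares(rss: float, n: int, k: int) -> float:
    """Textbook AIC for a least-squares fit, up to an additive constant.

    rss: residual sum of squares, n: number of data points,
    k: number of fitted parameters. Lower is better.
    """
    return n * math.log(rss / n) + 2 * k


def holdout_split(
    rows: Sequence, validation_fraction: float = 0.2, seed: int = 0
) -> Tuple[List, List]:
    """Randomly set aside a fraction of the rows for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_val
    return shuffled[:cut], shuffled[cut:]


# Two hypothetical models fit to the same 100 points: the larger one
# lowers the residual only slightly while tripling the parameter count.
n_points = 100
aic_simple = aic_least_squares(rss=50.0, n=n_points, k=2)
aic_complex = aic_least_squares(rss=49.0, n=n_points, k=6)
print(f"AIC simple:  {aic_simple:.2f}")
print(f"AIC complex: {aic_complex:.2f}")

# A random holdout: fit on `train`, then judge each candidate on `validation`.
train, validation = holdout_split(list(range(n_points)))
print(len(train), "training rows,", len(validation), "validation rows")
```

With these made-up numbers the simpler model gets the lower (better) AIC, because the small gain in fit does not pay for four extra parameters. The holdout does a similar job from another direction: a model that has merely memorized the training rows has nothing to help it on rows it never saw.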