Plotting

How does one plot 5.2 million XY data points?

I ran into this question while working on a paper submission. One thing you never lack for when doing evolutionary computation is sheer size of data sets.

Matlab seems to become dog slow and unstable when trying to plot large numbers of data points. The interface bogs down so badly that even re-labeling axes is a real chore.

I tried out the GNU R package, and it crashed while trying to read in the data set.

Then I started going through plotting packages in the FreeBSD ports system. That’s where I came across the GRI package, an open source, GPL-licensed graph plotting language. It has simple examples available online. As its documentation notes, it is a package with a fairly shallow learning curve. Its interface is entirely command-line, and its output format is PostScript. When run interactively, it writes one PostScript graphic after another to the current working directory, each named “gri-nn.ps”. It can import data from an ASCII file whose columns are separated by whitespace.

So that’s exactly what I did. 5.2 million data points in, one 151 MB PostScript graphic out. Ghostscript can convert that to PDF, which can then be converted to all sorts of raster-based graphics formats. It’s not a perfect solution, but it is a working solution.
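
For the PostScript-to-PDF step, here is a minimal sketch in Python of how one might drive Ghostscript. It is an illustration rather than the exact command used for this plot. It assumes the Ghostscript executable is installed under the name "gs", and the file names are hypothetical placeholders that follow the “gri-nn.ps” pattern.

```python
import shutil
import subprocess


def ps_to_pdf_command(ps_path, pdf_path, gs_binary="gs"):
    """Build a Ghostscript command line that converts PostScript to PDF."""
    return [
        gs_binary,
        "-dBATCH",            # exit when done instead of waiting at a prompt
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=pdfwrite",  # Ghostscript's PDF output device
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]


def convert(ps_path, pdf_path):
    """Run the conversion if Ghostscript is on PATH; return True if it ran."""
    cmd = ps_to_pdf_command(ps_path, pdf_path)
    if shutil.which(cmd[0]) is None:
        return False  # Ghostscript not found, nothing was converted
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    print(" ".join(ps_to_pdf_command("gri-00.ps", "gri-00.pdf")))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect the command before pointing it at a 151 MB file.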

Back to the grind for me…

Wesley R. Elsberry

Falconer. Interdisciplinary researcher: biology and computer science. Data scientist in real estate and econometrics. Blogger. Speaker. Photographer. Husband. Christian. Activist.

13 thoughts on “Plotting”

  • 2008/01/29 at 7:51 am

    If the issue is data visualisation, then you’d want to plot the whole data set. But if the issue is publication, why not plot a random sample – maybe 1% or 0.1% of your data?

  • 2008/01/29 at 9:42 am

    When you say that it has a shallow learning curve, do you mean that one must go far out the learning curve to have a significant rise? Or do you mean that the slope is gentle, making learning easy?

  • 2008/01/29 at 10:04 am

    TomS: The second of the two.

    IanR: Yes, I had done the sample graph as a stand-in while exploring other options. That might have sufficed, but I’d prefer to provide the plot as the complete data set.

  • 2008/01/29 at 11:12 am

    Why would you want to plot 5.2 million data points? You’ll just have a black smudge. Or a very large graph.

    Why not plot the density? I use some R code which calculates and plots contours. If you want the code, email me.

    Or you could bin the data (you have enough!), and plot those as contours, or as 3D plots.

    Bob

  • 2008/01/29 at 2:38 pm

    As Bob O’H said: exactly, why so many data points on a plot? I am fascinated by what the need could be.

  • 2008/01/29 at 3:08 pm

    The basic aim is to show the correlation between the x and y variables. This does come through well enough with a scatterplot. 5.2M is the total population size.

  • 2008/01/30 at 8:33 am

    (sigh)

    Now I know how nonscientists feel when a defender of science takes the bait and responds to a creationist sound bite with a technical refutation.

    If you weren’t on my short list of heroes defending science education, I’d swear that you were part of the Evil Computer Geek branch of the nonexistent EAC.

    If you want a laugh, I use Excel to plot XY graphs, and yes, I did overload it a few times.

  • 2008/01/31 at 12:10 am

    Actually, I’m finding that my install of StatView, a Windows 3.1 program originally, handles my dataset okay under Windows Vista. It is slow, but there hasn’t been any tendency to crash.

  • 2008/02/02 at 10:37 am

    Still seems like you’re going for thud factor. Statistics might be your best friend. I would suspect that a graph made from a sample would properly make your point.

    I would love to see the graph – any chance you can post it for show and tell?

  • 2008/02/02 at 11:08 am

    Nice, thanks for that tip.

    On why plot the points? Well, I agree with Austringer. If you use a method where you plot the points if there are 50, or 500, etc., then it is appropriate to plot them out even if there are 5 million.

    You must promise us this, though: Let’s see the plot!!!

  • 2008/02/04 at 5:29 pm

    You have to be careful with that many points. Chances are, you’re *looking* at those 5M points spread out over, say, 800×800 pixels = 0.64M. There *will* be overlapping points. You say you can see the correlation—well, it’s possible that the correlation you see is just the outliers, the ones that don’t overlap.

    For example, imagine that you’re plotting points between 0

  • 2008/02/05 at 8:16 am

    Zarquon: Yes, FreeBSD does have gnuplot. For some reason, the gnuplot package didn’t install properly for 6.3-release. I have since installed from ports, but I got “gri” via pkg_add quicker.

    Ben M: Thanks, yes, it pays to take care. I also took the correlation test from Zar and wrote a Perl script to implement it via various machine formulae for the terms. The correlation is both reasonably strong and significant. I wouldn’t rely upon a graph for more than a visual confirmation of what the statistics tell.

    Greg Laden: My PI is not in favor of pre-publishing data or results, so I’ll have to defer putting the plot up for a bit. I did like the demo graphs you had for “gri” in your post, and I think those show capabilities nicely.
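
The correlation check mentioned in the thread can be sketched as follows. This is not the Perl script described above. It is a minimal Python illustration of the textbook one-pass ("machine formula") computation of Pearson's r from running sums, plus the usual t statistic for testing r against zero. The function names are hypothetical.

```python
import math


def pearson_r(pairs):
    """Return (r, n) for an iterable of (x, y) pairs, using running sums."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denominator == 0:
        raise ValueError("r is undefined when x or y has zero variance")
    return numerator / denominator, n


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    r, n = pearson_r([(1, 1), (2, 3), (3, 2), (4, 4)])
    print(r, n, t_statistic(r, n))
```

Because it only keeps running sums, this approach never needs all 5.2 million points in memory at once. One caveat: with very large n the subtractions in the formula can lose floating-point precision, so a two-pass or incremental-mean variant may be safer in practice.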

Comments are closed.
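
Two suggestions from the thread, binning the data and watching for overlapping points, come down to the same bookkeeping: count how many points land in each cell of a grid. With the 800×800 example given above, 5.2 million points over 0.64 million pixels averages about 8 points per pixel even if every pixel is used. Below is a minimal Python sketch of that counting. The function names and grid parameters are hypothetical, and it is not the R contour code offered in the comments.

```python
from collections import Counter


def bin_points(points, x_min, x_max, y_min, y_max, nx, ny):
    """Count points per cell of an nx-by-ny grid over the given ranges."""
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the plotted range
        # Scale into [0, nx] and [0, ny]; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        row = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        counts[(col, row)] += 1
    return counts


def overlap_summary(counts, nx, ny):
    """Summarise how crowded the grid is."""
    return {
        "points": sum(counts.values()),
        "occupied_cells": len(counts),
        "fraction_of_cells_occupied": len(counts) / (nx * ny),
        "max_points_in_one_cell": max(counts.values(), default=0),
    }


if __name__ == "__main__":
    demo = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9), (2.0, 2.0)]
    grid = bin_points(demo, 0.0, 1.0, 0.0, 1.0, 2, 2)
    print(overlap_summary(grid, 2, 2))
```

The per-cell counts can feed a density or contour plot, and the summary gives a quick sense of how much of a scatterplot is points drawn on top of other points.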