Computation Austringer on 29 Jan 2008 06:59 am

Plotting

How does one plot 5.2 million XY data points?

I ran into this while working on a paper submission. One thing one does not lack for when doing evolutionary computation is large data sets.

Matlab seems to become dog slow and unstable when trying to plot large numbers of data points. The interface bogs down such that trying to re-label axes is a real chore.

I tried out the GNU R package, and had it crash on trying to read in the data set.

Then I started going through plotting packages in the FreeBSD ports system. That’s where I came across the GRI package. This is an open source, GPL-licensed graph plotting language. It has simple examples available online. As its documentation notes, it is a package with a fairly shallow learning curve. Its interface is entirely command-line, and its output format is PostScript. In interactive use, it simply writes one PostScript graphic after another, each named “gri-nn.ps”, to the current working directory. One can import data from an ASCII file in which columns are separated by white space.

So that’s exactly what I did. 5.2 million data points in, one 151 MB PostScript graphic out. Ghostscript can convert that to PDF, which can then be converted to all sorts of raster-based graphics formats. It’s not a perfect solution, but it is a working solution.
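To illustrate the "ASCII columns in, PostScript out" idea, here is a minimal Python sketch. It is not how GRI works internally. It only shows what a vector scatterplot amounts to when every point is stored individually. The function name `write_scatter_ps`, the page geometry, and the assumption that every non-blank line holds two numeric columns are all choices made for this example.

```python
import io


def write_scatter_ps(lines, out, xmin, xmax, ymin, ymax,
                     width=500.0, height=500.0, margin=50.0):
    """Stream whitespace-separated "x y" lines into a minimal PostScript file.

    The data bounds are passed in so that the input is read only once.
    Returns the number of points written.
    """
    out.write("%!PS\n")
    # "d" draws a filled dot of radius 0.5 pt at the x y found on the stack.
    out.write("/d { 0.5 0 360 arc fill } bind def\n")
    count = 0
    for line in lines:
        fields = line.split()          # split on any run of white space
        if len(fields) < 2:
            continue                   # skip blank or short lines
        x, y = float(fields[0]), float(fields[1])
        # Map data coordinates onto the page, in PostScript points.
        px = margin + (x - xmin) / (xmax - xmin) * width
        py = margin + (y - ymin) / (ymax - ymin) * height
        out.write(f"{px:.2f} {py:.2f} d\n")
        count += 1
    out.write("showpage\n")
    return count


if __name__ == "__main__":
    demo = io.StringIO()
    n = write_scatter_ps(["0 0", "1.0   2.0", "0.5 1"], demo, 0, 1, 0, 2)
    print(n, "points,", len(demo.getvalue()), "bytes")
```

Because each point costs one line of output, the file grows linearly with the number of points. The 151 MB reported for 5.2 million points works out to roughly 30 bytes per point, which is at least consistent with a file that stores every point separately.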

Back to the grind for me…

13 Responses to “Plotting”

  1. on 29 Jan 2008 at 7:51 am IanR said …

    If the issue is data visualisation, then you’d want to plot the whole data set. But if the issue is publication, why not plot a random sample – maybe 1% or 0.1% of your data?
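The random-sample suggestion can be sketched in Python with reservoir sampling, which draws a uniform sample of fixed size from a file in one pass without loading it all into memory. This is an illustration only, not something used in the post. The name `reservoir_sample` and the `seed` parameter are assumptions of this example.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable (Algorithm R).

    If the iterable holds fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = line       # replace with probability k / (i + 1)
    return sample


if __name__ == "__main__":
    rows = (f"{i} {2 * i}" for i in range(100000))
    print(len(reservoir_sample(rows, 1000, seed=42)))
```

For 5.2 million points, a 1% sample is 52,000 points and a 0.1% sample is 5,200.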

  2. on 29 Jan 2008 at 9:42 am TomS said …

    When you say that it has a shallow learning curve, do you mean that one must go far out the learning curve to have a significant rise? Or do you mean that the slope is gentle, making learning easy?

  3. on 29 Jan 2008 at 10:04 am Austringer said …

    TomS: The second of the two.

    IanR: Yes, I had done the sample graph as a stand-in while exploring other options. That might have sufficed, but I’d prefer to provide the plot as the complete data set.

  4. on 29 Jan 2008 at 11:12 am Bob O'H said …

    Why would you want to plot 5.2 million data points? You’ll just have a black smudge. Or a very large graph.

    Why not plot the density? I use some R code which calculates and plots contours. If you want the code, email me.

    Or you could bin the data (you have enough!), and plot those as contours, or as 3D plots.

    Bob
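The binning idea can be sketched in Python as follows. This is not the R code mentioned in the comment. It is a plain illustration, and the function name `bin_counts`, the grid size, and the choice to drop points outside the stated bounds are assumptions of this example. The resulting grid of counts is what a contour or 3D surface plot would then be drawn from.

```python
def bin_counts(points, xmin, xmax, ymin, ymax, nx=100, ny=100):
    """Count (x, y) points falling into each cell of an nx-by-ny grid.

    Returns a list of ny rows, each a list of nx counts.
    Points outside the given bounds are ignored.
    """
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        # Points exactly on the upper edge go into the last cell.
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9)]
    for row in bin_counts(pts, 0, 1, 0, 1, nx=2, ny=2):
        print(row)
```

Since `points` can be any iterable, the counts can be accumulated while streaming through a large file, so memory use depends on the grid size and not on the number of points.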

  5. on 29 Jan 2008 at 2:38 pm George said …

    As Bob O’H says: exactly, why so many data points on a plot? I am fascinated by what the need could be.

  6. on 29 Jan 2008 at 3:08 pm Austringer said …

    The basic aim is to show the correlation between the x and y variables. This does come through well enough with a scatterplot. 5.2M is the total population size.

  7. on 30 Jan 2008 at 8:33 am Frank J said …

    (sigh)

    Now I know how nonscientists feel when a defender of science takes the bait and responds to a creationist sound bite with a technical refutation.

    If you weren’t on my short list of heroes defending science education, I’d swear that you were part of the Evil Computer Geek branch of the nonexistent EAC.

    If you want a laugh, I use Excel to plot XY graphs, and yes, I did overload it a few times.

  8. on 31 Jan 2008 at 12:10 am Austringer said …

    Actually, I’m finding that my install of StatView, a Windows 3.1 program originally, handles my dataset okay under Windows Vista. It is slow, but there hasn’t been any tendency to crash.

  9. on 02 Feb 2008 at 10:37 am George said …

    Still seems like you’re going for thud factor. Statistics might be your best friend. I would suspect that a graph made from a sample would properly make your point.

    I would love to see the graph – any chance you can post it for show and tell?

  10. on 02 Feb 2008 at 11:08 am Greg Laden said …

    Nice, thanks for that tip.

    On why plot the points? Well, I agree with Austringer. If you use a method where you plot the points if there are 50, or 500, etc., then it is appropriate to plot them out even if there are 5 million.

    You must promise us this, though: Let’s see the plot!!!

  11. on 03 Feb 2008 at 3:12 am Zarquon said …

    FreeBSD ports doesn’t have gnuplot?

  12. on 04 Feb 2008 at 5:29 pm Ben M said …

    You have to be careful with that many points. Chances are, you’re *looking* at those 5M points spread out over, say, 800×800 pixels = 0.64M. There *will* be overlapping points. You say you can see the correlation—well, it’s possible that the correlation you see is just the outliers, the ones that don’t overlap.

    For example, imagine that you’re plotting points between 0

  13. on 05 Feb 2008 at 8:16 am Austringer said …

    Zarquon: Yes, FreeBSD does have gnuplot. For some reason, the gnuplot package didn’t install properly for 6.3-release. I have since installed from ports, but I got “gri” via pkg_add quicker.

    Ben M: Thanks, yes, it pays to take care. I also took the correlation test from Zar and wrote a Perl script to implement it via various machine formulae for the terms. The correlation is both reasonably strong and significant. I wouldn’t rely upon a graph for more than a visual confirmation of what the statistics tell.

    Greg Laden: My PI is not in favor of pre-publishing data or results, so I’ll have to defer putting the plot up for a bit. I did like the demo graphs you had for “gri” in your post, and I think those show capabilities nicely.
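The correlation check mentioned in the last response can be sketched in Python. The original was a Perl script that is not shown, so this is a stand-in. It assumes that the "machine formulae" are the usual running-sum (computational) formulas for the Pearson correlation coefficient, and that the test is the standard t test with n − 2 degrees of freedom. The function name `pearson_r_and_t` is made up for this example.

```python
import math


def pearson_r_and_t(pairs):
    """One-pass Pearson r from running sums, plus the t statistic for r.

    Returns (r, t, degrees_of_freedom). Assumed test: t = r * sqrt((n-2)/(1-r^2)).
    """
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    if n < 3:
        raise ValueError("need at least 3 pairs")
    ss_x = sxx - sx * sx / n           # corrected sum of squares for x
    ss_y = syy - sy * sy / n           # corrected sum of squares for y
    sp_xy = sxy - sx * sy / n          # corrected sum of cross products
    if ss_x <= 0 or ss_y <= 0:
        raise ValueError("x or y has no variance")
    r = sp_xy / math.sqrt(ss_x * ss_y)
    r = max(-1.0, min(1.0, r))         # guard against rounding past +/-1
    denom = 1.0 - r * r
    if denom == 0:
        t = math.copysign(math.inf, r)
    else:
        t = r * math.sqrt((n - 2) / denom)
    return r, t, n - 2


if __name__ == "__main__":
    print(pearson_r_and_t([(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]))
```

The running-sum form needs only one pass over the data, which suits a 5.2 million row file, but it can lose precision when the values are large relative to their spread. The t value is compared against a t distribution with the returned degrees of freedom.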
