Some Data Analysis and Visualization

As noted here before, I’m working through refreshing archived data, mostly from CD-ROM media. I’ve run into a whole batch of CD-ROM disks that are in good physical condition, but which mostly cannot be read. I’m trying some tools that I’ve seen recommended, but would be open to suggestions.

But the whole point of getting the archived data refreshed is to do something with it. And that’s what I will aim to discuss here this time.

Over several years, there were a number of different technologies I was using to collect bioacoustic data. This means that I don’t have one single type of data of interest. I have data that was recorded on audio cassette tape. I have data from a Racal Store V data recorder that was transferred to cassette tape. I have digital data from Keithley-Metrabyte DAS-1800 DAQ, Tucker Davis Technologies DAQ, and a couple of different National Instruments DAQ boards multiplied by at least two different multichannel scenarios. Plus, there’s digital data transferred off of a Racal Storeplex unit via SCSI. There’s mixed endian byte order issues, among other things.

I have a good software solution for two of these particular data acquisition scenarios. I wrote that between 1999 and 2001 using Borland’s Delphi 5. In all, there’s about 60,000 lines of code for data acquisition, reduction, analysis, and visualization. The original can handle multi-channel recordings taken from a single National Instruments board. A variant works on digitized audio recordings. That includes interactive data reduction with an automated click-picker whose choices can be refined with changes in parameters or by interaction with an oscillogram graph.

That still leaves a lot of data waiting for analysis. During my time at Michigan State University, I got into Python programming. There are a number of nice things about going after the rest of the data with Python. A big one is that Python is free, open-source software. I can have colleagues install it and not have to worry about breaking their budgets, which is a concern when one considers the well-established science and engineering scripting platform, MATLAB. While Python doesn’t yet have all the “toolbox” capability of MATLAB, it has enough to move ahead with. For the scientific programmer, there are the Numpy, Scipy, and Pylab modules (I installed the Python(x,y) package on my Windows laptop, which includes those and more besides.) Numpy extends Python with a fast array and matrix manipulation capability. Scipy includes a variety of analysis tools. Pylab looks to put a wrapper on those two, plus the Matplotlib graphics module and the Ipython interactive shell.

I recently wanted to extract spectral information about dolphin clicks from one of the datasets that I hadn’t previously examined. So I turned to Python to do that. The data was stored as raw binary, 16 bit signed integer samples. Reading that data was simply:

fd = open(fn, ‘rb’)
read_data = np.fromfile(file=fd, dtype=np.int16)

where “fn” is a filename pulled from the directory of interest. The “np” reference above resolves to “numpy”. The three lines say to get an open file object, fd, by opening a file, fn, for binary read. Then, a Numpy array containing the data is returned by the Numpy static method, fromfile, given the file object and the specification of the data type as signed 16 bit integers. The third line closes the file object. If I had a problem with endian issues, there’s at least a couple of ways to address that in Numpy. (Getting the wrong byte order should be obvious on visualization, but I’ve seen a professor merrily tout a new processing method for dolphin clicks when his slides clearly showed that he had a byte-order problem with his dataset.)

While it is better to handle DC offset problems at the time of data collection, sometimes you just have to deal with it at analysis time. This dataset handed me that problem. This problem is one where a time-varying signal should be centered at zero volts input, but instead centers at some non-zero voltage. Fortunately, it was a fixed offset, so a pretty simple approach worked nicely: find the mean value across the dataset, and subtract that value from each sample.

shiftdata = read_data + ([-np.average(read_data)])

The use of a Numpy array for the data means that the one line above handles the element-wise addition operation. The Numpy array on the left is now a floating-point array instead of an integer array.

My Delphi program had a click-picking algorithm that took a while to craft. I haven’t ported it yet, so I just went with a very simple approach in Python. That looks at chunks of the data, where the chunksize was selected to be a bit larger than the maximum click width, but a good deal smaller than the interval between clicks. Within each chunk, the maximum value and minimum value are found. If the maximum and minimum are outside a defined noise level, consider it a found feature.

chunkmin = np.min(cary)
chunkmax = np.max(cary)
if (chunkmin < -noiseband) and (chunkmax > noiseband):
# Found a click! Or a transient, at least.
chunkmaxloc = cary.argmax()

Using the Numpy routines to find the min, max, and max location is pretty snappy.

Then, for each “click” located, I ran an FFT to get a power spectral density, and plotted that. I just used example code to add this functionality. (For underwater acoustics where pressure is measured, though, the conversion to decibels uses a factor of 20 rather than 10.)

So, for a quick and dirty script of less than three hundred lines total, I was able to:

* get a directory listing
* match to filename features to identify files to analyze
* remove DC offsets
* save new versions of the data
* scale the data according to field notes
* locate “clicks” in the data
* generate a PSD for each “click”
* collect PSD data
* generate and save oscillogram/PSD plots
* rank “clicks” on spectral features
* copy off plots of the highest-ranked clicks to a directory

My 2.4GHz dual-core Ubuntu workstation ran this script on 230 megabytes of data, producing over 1,400 graphs, and did it in eight minutes time. I’ve just located a calibration sheet on the hydrophone used, so once I’ve digitized that and applied it, I’ll post an example with real dB numbers on the axis.

Wesley R. Elsberry

Falconer. Interdisciplinary researcher: biology and computer science. Data scientist in real estate and econometrics. Blogger. Speaker. Photographer. Husband. Christian. Activist.

2 thoughts on “Some Data Analysis and Visualization

  • 2012/02/20 at 10:20 am

    I do have one suggestion. While it’s true that we are moving to computers and iPods and so on, old-fashioned technology has also improved. So, probably the best way to transfer audio data to, say, iTunes, is to get an microphone recorder attached to a tape player or something, record the sounds from the CD, and then record them with your laptop’s microphone, perhaps through Skype technology. You’d be surprised how high the quality is after doing this. It’s a little tedious… but that’s what grad students are for!

    Hope this helps,

  • 2012/02/20 at 6:37 pm

    Notedscholar, I think that I owe you an apology. I run into so many people who speak gibberish about biology and mean it that I was treating your comments as if you meant them. But this comment where you speak utter nonsense about acoustics shows me that you intend comedy rather than commentary. My apologies for misconstruing your intent earlier.

Comments are closed.