Manatee County offers tax certificates to bidders. When property owners fail to pay their taxes, and that is happening a lot right now, the county gets other people to pay the taxes and gives them a tax certificate, which is a lien against the property. Each year, the county holds an auction where people can bid on these certificates. Bids are stated as a percent interest rate, ranging from 18% at the high end down to 0%. The person bidding the lowest interest rate gets the tax certificate, after, of course, paying the county the outstanding taxes.
Today, there was a practice auction. This is all handled online now. The page included the option to download data on the 9,000+ properties in CSV, XLS, or XML formats.
Diane is interested in the process and specifically in the land just to the south of our property. It currently has unpaid taxes, and if the executors of the former owner’s estate don’t pay up by June 1st, it will be included in the tax certificate auction. But she is also interested in what else is available out there.
That brings up an interesting problem. The downloaded data is minimal, giving just a parcel ID, outstanding tax balance, and some auction-related attributes. On the other hand, Diane would like information that is available online from another county office, that of the Property Appraiser.
I worked on a Python script to handle the job of getting additional information on acreage, zoning, the address, and bits like that. I hadn’t done anything with Python regular expressions to date, and started looking into them, getting less enthused by the minute. The issue is getting data out of an HTML page downloaded from the Property Appraiser. I could have done it in Perl right offhand, but wanted to develop my Python skills a bit.
Still, getting the job done is the top priority, and while looking things up, I ran across the BeautifulSoup module for Python. The web site sounded promising, and a number of other people seemed to have found it useful. Very useful.
BeautifulSoup is an HTML/XML parser. It aims to not only handle clean XHTML, but also to do reasonable things with the sort of HTML people were writing when the Web was young, in other words, bad HTML.
I downloaded the module distribution and uncompressed it. Setup is simply:
python setup.py install
My usage so far is to pluck values out of adjacent cells in a table. I can load a BeautifulSoup object with the HTML in question, then ask it to find the label I’m looking for in text. Then I just ask it to retrieve the next text in the document, and that is the stuff I’m looking for.
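For illustration, here is a minimal sketch of that label-then-next-text lookup. It is not my actual script: it assumes the current bs4 packaging of BeautifulSoup (older releases are imported and called a little differently), and the sample HTML, the labels, and the helper name value_after_label are all made up for the example rather than taken from the Property Appraiser's pages.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a profile page; the real layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td class=label>Acreage:</td> <td>4.75</td></tr>
  <TR><TD>Zoning:</TD> <TD>A-1</TD></TR>
  <tr><td>Situs Address:</td> <td>123 EXAMPLE RD</td></tr>
</table>
"""


def value_after_label(soup, label):
    """Return the first non-blank text that follows the label text, or None."""
    # Find the text node whose content, ignoring surrounding whitespace, is the label.
    label_node = soup.find(string=lambda s: s is not None and s.strip() == label)
    if label_node is None:
        return None
    # Walk forward through the document's text nodes, skipping whitespace-only
    # strings such as the gaps between table cells.
    for text in label_node.find_all_next(string=True):
        if text.strip():
            return text.strip()
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
acreage = value_after_label(soup, "Acreage:")
zoning = value_after_label(soup, "Zoning:")
address = value_after_label(soup, "Situs Address:")
```

Skipping whitespace-only strings matters because the "next text" after a label is often just the gap between two cells. The lookup depends only on the order of text in the document, not on the exact table structure, which is why it holds up on sloppy markup.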
Getting started with a new library can take a while. BeautifulSoup let me get my job done without much effort on the initial learning curve. Right now, my script is about halfway through getting the additional data wanted for those 9,000+ properties. We’ll be able to look it over in the morning. The whole script is less than a hundred lines of code. It reads in the CSV file, gets the associated profile page from the Property Appraiser for each property, parses that with BeautifulSoup, adds the additional fields to the original row, and writes out a new CSV file with the more complete data set.
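The overall shape of such a script can be sketched with just the standard csv module. This is a rough outline under stated assumptions, not the script itself: the column names (parcel_id, balance, acreage, zoning) are hypothetical, and fake_lookup stands in for the step that downloads and parses a profile page, so nothing here touches the network.

```python
import csv
import io


def enrich_rows(reader, lookup, extra_fields):
    """Yield each input row with the extra fields from lookup(parcel_id) added."""
    for row in reader:
        details = lookup(row["parcel_id"])  # "parcel_id" is an assumed column name
        for field in extra_fields:
            # Leave the cell blank when the lookup has nothing for this field.
            row[field] = details.get(field, "")
        yield row


def enrich_csv(infile, outfile, lookup, extra_fields):
    """Copy a CSV from infile to outfile, appending the extra columns."""
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames) + list(extra_fields)
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in enrich_rows(reader, lookup, extra_fields):
        writer.writerow(row)


def fake_lookup(parcel_id):
    """Hypothetical stand-in for fetching and parsing one profile page."""
    canned = {"1001": {"acreage": "4.75", "zoning": "A-1"}}
    return canned.get(parcel_id, {})


# In-memory files keep the example self-contained; real use would open files on disk.
source = io.StringIO("parcel_id,balance\n1001,1532.10\n1002,88.00\n")
result = io.StringIO()
enrich_csv(source, result, fake_lookup, ["acreage", "zoning"])
output_lines = result.getvalue().splitlines()
```

Passing the lookup in as a function keeps the CSV handling separate from the page fetching, so the file logic can be tried out on a handful of rows before turning it loose on 9,000+ properties.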