Friday, July 30, 2010

sqlite3 and the World Ocean Database

I'm working on getting the World Ocean Database (WOD) directly accessible from my svp_tool (sound velocity processing toolkit) and wanted to document some of the procedure. Here's a map view of a query area near the Mariana Trench; the green squares show the locations of the 125 CTD casts in the WOD for the month of August.


The resulting sound speed, temperature and salinity vertical profiles are shown below (left to right, all plotted versus depth on the vertical axis).


The WOD is a database of oceanographic data that records physical/biological/chemical oceanographic measurements; the 2009 release holds roughly 10 million sets of measurements. The data are organized spatially by WMO squares and by instrument type, resulting in 12,960 files. I'm mainly interested in XBT and CTD measurements, so these are the files I need to dig into to extract temperature and/or salinity profiles based on a user-selected geographic region and range of months. The files are in an awkward ASCII format and take up about 6 GB on disk even in gzip form. I'd like to make the lookups as quick as possible, so:

0) Wrap up the sample I/O C program as a C library (wod.{c,h}). The sample program that comes with the database uses lots of global variables, so I wrapped them all up in a structure and wrote functions to initialize, open/close files, and read data. These neatly hide away the details and keep the programmer (me) from stepping on my own toes with all those globals and the obscure sequence of calls needed just to read a single cast.

1) Build a database. The raw files contain header information (position, date, cruise number, etc.) and then the measurements themselves. Having to unzip files just to search for casts that intersect the area of interest is slow and painful. I took the sample C program from their website and my new C library (wod.{c,h}) and morphed them into a program that writes out a sqlite3 database containing only a small subset of the header info, so that I can quickly query for casts. The database also holds enough info to determine which of the 12,960 files an individual cast falls within, along with its file offset. Based on some advice from Kurt, I went with sqlite3 and googled my way through the problem. Here are a few juicy bits from the code; it was pretty simple in the end.

#include <sqlite3.h>

int main()
{
...
// declaration
sqlite3 *db;
...
// open a database with (char * outname)
sqlite3_open(outname, &db);
...
// Create a table, will puke if the table has been added already
sqlite3_exec(db, "create table tbl1 (instrument integer, observed integer, quadrant integer, wmo_lat integer, wmo_lon integer, latitude real, longitude real, year integer, month integer, day integer, file_offset integer)", callback, 0, &zErrMsg);
...
// Now insert some records
sprintf(sql_string,"INSERT INTO tbl1 (instrument, observed, quadrant, wmo_lat, wmo_lon, \
latitude, longitude, year, month, day, file_offset) VALUES (%d,%d,%d,%d,%d,%.3f,%.3f,%4d,%d,%d,%d)",
wod_reader.instrument,wod_reader.observed,wod_reader.quadrant,
wod_reader.wod_lat,wod_reader.wod_lon,
wod_reader.latitude,wod_reader.longitude,
wod_reader.year,wod_reader.month,wod_reader.day,file_offset);
rc = sqlite3_exec(db, sql_string, callback, 0, &zErrMsg);

// Build indices based on latitude and longitude as I expect to be querying against
// these most often
sqlite3_exec(db, "CREATE INDEX tbl1_idx_longitude ON tbl1(longitude)", callback, 0, &zErrMsg);
sqlite3_exec(db, "CREATE INDEX tbl1_idx_latitude ON tbl1(latitude)", callback, 0, &zErrMsg);

// Now close the database.
sqlite3_close(db);

...
}


Kurt had suggested using the SQLite Manager extension in Firefox to have a peek at what's in the database. I did so, and everything seemed okay.



Next steps?

- add sqlite3 querying ability into svp_tool
- add functionality to wod.{c,h} to read casts based on query results. Right now it searches the entire WOD namespace and quickly rejects files that can't possibly intersect the spatial query bounds. Those that do intersect are then unzipped and searched sequentially to find the observations that fall within a specified area (and month). ALL records are read through in their entirety even though MOST aren't desired.
- modify wod.{c,h} to read the .gz files directly through zlib instead of gunzipping and then re-gzipping the files for each extraction.