Tuesday, December 6, 2011

ARGO data

I've been playing with ARGO data to get a sense of how well the RTOFS Global model predicts vertical temperature and salinity structure. I eventually got a script running that does a daily download of the current day's real-time data for use in my comparison analysis with the RTOFS Global model. Keep in mind that I plan on using this script at sea over a limited-bandwidth internet connection, so I chose to run wget once per file instead of passing it a list of files with the -i and --base options.

The script starts by downloading the "directory file" that lists all available profile files. It looks like this:

# Title : Profile directory file of the Argo Global Data Assembly Center
# Description : The directory file describes all individual profile files of the argo GDAC ftp site.
# Project : ARGO
# Format version : 2.0
# Date of update : 20111206174544
# FTP root number 1 : ftp://ftp.ifremer.fr/ifremer/argo/dac
# FTP root number 2 : ftp://usgodae.usgodae.org/pub/outgoing/argo/dac
# GDAC node : FNMOC
file,date,latitude,longitude,ocean,profiler_type,institution,date_update
aoml/13857/profiles/R13857_001.nc,19970729200300,0.267,-16.032,A,845,AO,20080918131927
aoml/13857/profiles/R13857_002.nc,19970809192112,0.072,-17.659,A,845,AO,20080918131929
aoml/13857/profiles/R13857_003.nc,19970820184544,0.543,-19.622,A,845,AO,20080918131931
aoml/13857/profiles/R13857_004.nc,19970831193905,1.256,-20.521,A,845,AO,20080918131933
aoml/13857/profiles/R13857_005.nc,19970911185807,0.720,-20.768,A,845,AO,20080918131934
aoml/13857/profiles/R13857_006.nc,19970922195701,1.756,-21.566,A,845,AO,20080918131936
aoml/13857/profiles/R13857_007.nc,19971003191549,2.595,-21.564,A,845,AO,20080918131938
aoml/13857/profiles/R13857_008.nc,19971014183934,1.761,-21.587,A,845,AO,20080918131940
aoml/13857/profiles/R13857_009.nc,19971025193234,1.804,-21.774,A,845,AO,20080918131941
aoml/13857/profiles/R13857_010.nc,19971105185142,1.642,-21.362,A,845,AO,20080918131943
aoml/13857/profiles/R13857_011.nc,19971116194909,1.708,-20.758,A,845,AO,20080918131945
aoml/13857/profiles/R13857_012.nc,19971127190705,2.048,-20.224,A,845,AO,20080918131947
aoml/13857/profiles/R13857_013.nc,19971208183912,2.087,-19.769,A,845,AO,20080918131948
aoml/13857/profiles/R13857_014.nc,19971219192355,2.674,-20.144,A,845,AO,20080918131950
aoml/13857/profiles/R13857_015.nc,19971230184421,2.890,-20.433,A,845,AO,20080918131952
aoml/13857/profiles/R13857_016.nc,19980110194140,2.818,-20.699,A,845,AO,20080918131954
aoml/13857/profiles/R13857_017.nc,19980121190033,2.940,-20.789,A,845,AO,20080918131956
aoml/13857/profiles/R13857_018.nc,19980201195831,3.224,-20.757,A,845,AO,20080918131957

I parse the directory file looking for a date match in the date/time field (2nd field). You could easily modify this to limit it to a specific lat/lon bounding box or any other criteria.
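
As a rough illustration of the bounding-box idea, here is a minimal Python 3 sketch (the scripts in this post are bash and Python 2.6). It assumes the column names shown in the directory file header above; the function name `select_casts` and the `(south, north, west, east)` tuple are hypothetical, and the box test does not handle regions that cross the date line.

```python
import csv

def select_casts(index_text, date_prefix, bbox=None):
    """Return file paths whose cast date starts with date_prefix (yyyymmdd).

    bbox is an optional (south, north, west, east) tuple in decimal degrees.
    """
    # Drop the '#' comment lines; the first remaining line is the column header
    data_lines = [ln for ln in index_text.splitlines()
                  if ln and not ln.startswith("#")]
    selected = []
    for row in csv.DictReader(data_lines):
        if not row["date"].startswith(date_prefix):
            continue
        if bbox is not None:
            try:
                lat = float(row["latitude"])
                lon = float(row["longitude"])
            except ValueError:
                continue  # position missing or not a number
            south, north, west, east = bbox
            if not (south <= lat <= north and west <= lon <= east):
                continue
        selected.append(row["file"])
    return selected

sample = """\
# Title : Profile directory file of the Argo Global Data Assembly Center
file,date,latitude,longitude,ocean,profiler_type,institution,date_update
aoml/13857/profiles/R13857_001.nc,19970729200300,0.267,-16.032,A,845,AO,20080918131927
aoml/13857/profiles/R13857_002.nc,19970809192112,0.072,-17.659,A,845,AO,20080918131929
"""

# Casts from July 1997, anywhere
print(select_casts(sample, "199707"))
# Casts from 1997 inside a small box in the equatorial Atlantic
print(select_casts(sample, "1997", bbox=(0.0, 1.0, -17.0, -16.0)))
```

Both calls select only the first sample row: the second row is from August and lies west of the box.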

Here's the script:

#!/bin/bash

base_argo_url=ftp://usgodae.org/pub/outgoing/argo

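# Download the profile index, but only if the server copy is newer than ours.
# Note: stat -f "%m" is the BSD/macOS syntax (GNU/Linux uses stat -c "%Y").
# On the first run the file does not exist yet, so fall back to 0.
time1=`stat -f "%m" ar_index_global_prof.txt.gz 2>/dev/null || echo 0`
wget --timestamping $base_argo_url/ar_index_global_prof.txt.gz
time2=`stat -f "%m" ar_index_global_prof.txt.gz 2>/dev/null || echo 0`

if [ "$time1" -eq "$time2" ]
then
        echo "Nothing to do...no changes since last run"
        exit
fi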

# Get today's date
today=`date -u '+%Y%m%d'`
echo "today is" $today

# -p avoids an error when the directory already exists from an earlier run
mkdir -p $today

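# Skip the 9 header lines, then keep the files whose cast date (2nd field) is today.
# gunzip -c is used instead of zcat because zcat on some systems only accepts .Z files.
gunzip -c ar_index_global_prof.txt.gz | awk -F, -v today="$today" 'NR > 9 && substr($2,1,8) == today { print $1 }' > $today/todays_casts.txt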

cd $today

num_files=`cat todays_casts.txt | wc -l`

if [ $num_files -eq 0 ]
then
        echo "Nothing to do...no files to download yet for" $today
        exit
fi

echo "Going to check" $num_files "files"

for f in `cat todays_casts.txt`; do
        echo "Doing file" $f

        if [ -e `basename $f` ]
        then
                # Skip files that have already been downloaded
                continue
        fi

        # Don't need time stamping here, we check locally for existence of the
        # .nc file so don't need to waste time requesting a listing from the FTP server
        wget $base_argo_url/dac/$f
done



What comes out of this is a directory for the current day (named yyyymmdd) with a set of netCDF files in it (.nc file extension). Each file represents a cast from a given instrument, for example 20111206/R1900847_089.nc.
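
The file names themselves carry some information. As I understand the Argo naming convention, the leading letter is R for real-time or D for delayed-mode data, followed by the float's number, an underscore, the cycle number, and an optional trailing D for a descending profile. Here is a minimal Python 3 sketch that splits a name apart under that assumption; `parse_cast_name` and `CastName` are hypothetical names for illustration.

```python
import os
import re
from collections import namedtuple

CastName = namedtuple("CastName", "mode float_id cycle descending")

# Assumed pattern: <R|D><float number>_<cycle number>[D].nc
_PATTERN = re.compile(r"^([RD])(\d+)_(\d+)(D?)\.nc$")

def parse_cast_name(path):
    """Split an Argo profile file name into its parts, or return None."""
    match = _PATTERN.match(os.path.basename(path))
    if match is None:
        return None
    mode, float_id, cycle, desc = match.groups()
    return CastName(
        mode="real-time" if mode == "R" else "delayed",
        float_id=float_id,
        cycle=int(cycle),
        descending=(desc == "D"),
    )

print(parse_cast_name("20111206/R1900847_089.nc"))
```

This makes it easy to group the day's casts by float or to spot a delayed-mode file that has slipped into the directory.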

I then use a Python script to read the .nc files and turn them into OMG/UNB format so that I can run comparisons against casts from RTOFS Global.

#!/usr/bin/env python2.6

import glob
import netCDF4
import numpy as np
import math
import datetime as dt
import matplotlib.pyplot as plt
import os

do_plot = True

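if do_plot:
    # One figure with two panels: temperature and salinity versus pressure.
    # Repeated plt.plot() calls add to the current axes, so all casts are overlaid.
    plt.figure()
    plt.subplot(1,2,1)
    plt.xlabel("Temperature, deg C")
    plt.ylabel("Pressure, dbar")
    plt.subplot(1,2,2)
    plt.xlabel("Salinity, psu")
    plt.ylabel("Pressure, dbar")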

for name in glob.glob('*.nc'):
    file = netCDF4.Dataset(name)

    latitude = file.variables['LATITUDE'][0]
    longitude = file.variables['LONGITUDE'][0]

    if math.isnan(latitude) or math.isnan(longitude):
        print "    skipping NAN lat/lon"
        continue

    juld = file.variables['JULD'][0]

    # TODO: the reference date is stored in 'REFERENCE_DATE_TIME'
    refdate = dt.datetime(1950,1,1,0,0,0,0,tzinfo=None)
    castdate = refdate + dt.timedelta(days=juld)

    print name + " " + str(latitude) + " " + str(longitude) + " " + str(castdate)

    try:
        # Only deal with casts that have ALL the data we need
        p = file.variables['PRES'][0][:]
        t = file.variables['TEMP'][0][:]
        t_fill_value = file.variables['TEMP']._FillValue
        t_qc = file.variables['TEMP_QC'][0][:]
        s = file.variables['PSAL'][0][:]
        s_fill_value = file.variables['PSAL']._FillValue
        s_qc = file.variables['PSAL_QC'][0][:]
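    except (KeyError, AttributeError):
        # A required variable or attribute is missing, so skip this cast
        file.close()
        continue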


    # Replace masked data with NaN. netCDF4 returns a masked numpy array only
    # when some values are masked and a regular numpy array otherwise;
    # np.ma.filled() handles both cases and always returns a regular array.
    t = np.ma.filled(t, np.nan)
    s = np.ma.filled(s, np.nan)
    p = np.ma.filled(p, np.nan)

    # Now filter based on quality control flags: keep 1 (good),
    # 2 (probably good) or 5 (value changed)
    t_ind = (t_qc == '1') | (t_qc == '2') | (t_qc == '5')
    s_ind = (s_qc == '1') | (s_qc == '2') | (s_qc == '5')

    # We only want to consider valid concurrent observations of T and S
    pair_ind = t_ind & s_ind

    if do_plot:
        # Plot against negative pressure so that depth increases downwards
        plt.subplot(1,2,1)
        plt.plot(t[pair_ind],-p[pair_ind])
        plt.subplot(1,2,2)
        plt.plot(s[pair_ind],-p[pair_ind])

    t_filt = t[pair_ind]
    s_filt = s[pair_ind]
    p_filt = p[pair_ind]
    num_samples = t_filt.size

    file.close()

    if num_samples == 0:
        print "    Skipping " + name + " due to lack of data!"
        continue

if do_plot:
    plt.show()
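
The TODO in the script points out that the reference date is hard-coded even though the file stores it in 'REFERENCE_DATE_TIME'. Here is a minimal sketch of the conversion, assuming that value has already been read from the file and joined into a 14-character yyyymmddhhmmss string (that step is left out here). The function name `juld_to_datetime` is hypothetical, and the default matches the 1950-01-01 reference used in the script.

```python
import datetime as dt

def juld_to_datetime(juld, reference_date_time="19500101000000"):
    """Convert a JULD value (days since the reference date) to a datetime.

    reference_date_time is assumed to be a yyyymmddhhmmss string.
    """
    ref = dt.datetime.strptime(reference_date_time, "%Y%m%d%H%M%S")
    return ref + dt.timedelta(days=float(juld))

# 22619.5 days after 1950-01-01 00:00 is noon on 2011-12-06
print(juld_to_datetime(22619.5))
```

Passing the reference as a parameter keeps the arithmetic unchanged if a file ever declares a different reference date.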


Here's a plot of data from 2011-12-06 at 1:30PM, EST.





Here's a map showing the geographic distribution of the casts for this particular run (2011-12-06).


Still to do?  Read up more about the various QC procedures applied to ARGO data and try to automate detection of casts that will mess up my comparison analysis (large chunks of missing data, etc).  Here's a bit of light reading to get me started. 
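
One possible starting point for the "large chunks of missing data" check is to look at the spacing between the pressure samples that survive QC filtering. This is only a sketch: the function names and the `max_gap` and `min_samples` thresholds are made-up placeholders, not values taken from any ARGO QC document.

```python
import math

def largest_pressure_gap(pressures):
    """Largest gap (dbar) between consecutive valid pressure samples."""
    valid = sorted(p for p in pressures if not math.isnan(p))
    if len(valid) < 2:
        return float("inf")  # too few samples to define a gap
    return max(b - a for a, b in zip(valid, valid[1:]))

def is_usable_cast(pressures, max_gap=100.0, min_samples=10):
    """True if the cast has enough samples and no gap wider than max_gap."""
    valid = [p for p in pressures if not math.isnan(p)]
    return len(valid) >= min_samples and largest_pressure_gap(valid) <= max_gap

# A cast sampled every 50 dbar, and the same cast with its middle removed
good = [float(p) for p in range(0, 1000, 50)]
gappy = good[:5] + good[15:]

print(is_usable_cast(good), largest_pressure_gap(good))
print(is_usable_cast(gappy), largest_pressure_gap(gappy))
```

In the Python script above, the natural input would be p_filt, so that samples rejected by the QC flags count as missing.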

Until next time...