2bitBrain

Thursday, March 14, 2013

Mass conversion of .wma files to .mp3 files on Mac OSX command line...with Ruby and ffmpeg.

I recently discovered that iTunes hadn't imported my CD library when I brought in my music collection because it doesn't like .wma files. Typical of Apple, they don't like that file format....sigh.

Here's a 2-liner Ruby script I used to convert everything from wma to mp3 format:

#!/usr/bin/env ruby

ext = ".wma"
Dir.glob("**/*#{ext}").each {|f| m = f.sub(/#{Regexp.escape(ext)}\z/, '.mp3'); system('ffmpeg', '-i', f, '-ab', '192k', '-ac', '2', '-ar', '44100', m) }

I got this from https://ariejan.net/2010/09/11/mass-convert-wma-to-mp3-using-ffmpeg-and-ruby but found out that it only worked on .wma files in the directory from which you ran the script. Google and StackOverflow saved the day and helped me to convert the Dir.glob to match all subdirectory levels.

I used:

Dir.glob("**/*#{ext}")

instead of

Dir.glob("*#{ext}")

I've never used Ruby before and probably could've done the same thing in Python, but this was the first command line example that came up in Google so I dove in and gave it a try...
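For the curious, here's roughly what the Python version might look like. This is only a sketch that I haven't run against a real music collection; the function names ffmpeg_command and convert_tree are made up for illustration, and the ffmpeg flags are simply copied from the Ruby script. It assumes Python 3 and that ffmpeg is on your PATH.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src):
    """Build the ffmpeg argument list for one .wma file (same flags as the Ruby script)."""
    dst = src.with_suffix(".mp3")
    return ["ffmpeg", "-i", str(src), "-ab", "192k", "-ac", "2", "-ar", "44100", str(dst)]


def convert_tree(root):
    """Convert every .wma file under root, at any directory depth."""
    for src in sorted(root.rglob("*.wma")):
        # Passing a list (no shell) sidesteps quoting trouble with
        # apostrophes and spaces in song names.
        subprocess.run(ffmpeg_command(src), check=False)
```

Calling convert_tree(Path(".")) from the top of the music directory would do the same job as the Ruby two-liner; rglob plays the role of the "**/" in Dir.glob.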

Thursday, November 29, 2012

Splitting Kongsberg water column data into .all and .wcd files.

I goofed up on my previous post about this. I'm learning blog formatting as I go and apparently my code got a little funky when I cut and pasted it into the blog. Here's another attempt that should be safer (I think).

Have fun!!

 #!/usr/bin/env python2.6  
   
 import os  
 import struct  
 import time  
 import sys  
   
 file_count=0  
 debug=False  
   
 dir="split"  
   
 for filename in sys.argv:  
   
   file_count += 1  
   
   if (file_count == 1):  
     # I'm too lazy to parse command line args so just skipping over the   
     # script name (which is arg zero in the list)  
     continue  
   
   file = open(filename, 'rb')  
   filesize = os.path.getsize(filename)  
   
   # What is the path to the input file without the filename?  
   filepath=os.path.dirname(filename)  
   fileprefix=os.path.basename(filename)  
   if debug or True:  
     print "Doing file",filename  
     print "Got file path",filepath  
     print "Got file basename",fileprefix  
   
   # Join the file's directory path with the usual output subdirectory name  
   outdir=os.path.join(filepath,dir)  
   
   if not os.path.exists(outdir):  
     os.makedirs(outdir)  
   
   split_allname = os.path.join(outdir, fileprefix)  
   split_wcdname = split_allname.replace(".all",".wcd")  
   
   if debug or True:  
     print filename, "will split into", split_allname, split_wcdname  
   
   if not os.path.exists(split_allname):  
     split_allfile = open(split_allname,"wb")  
     split_wcdfile = open(split_wcdname,"wb")  
   else:  
     print "Skipping", filename, "since it's already split!"  
     file.close()  
     continue  
   
   last_percent = 0  
   while True:  
   
     # Make sure we don't try to read beyond the EOF  
     if (file.tell() + 6 > filesize):  
       break  
   
     line = file.read(6)  
   
     header = struct.unpack("<IBB",line)  
   
     rawlength=line[0:3]  
     length = header[0]  
     stx = header[1]  
     id = header[2]  
   
     if (stx != 2):  
       if debug:  
         print 'STX not found, trying next datagram at position',file.tell()-5  
       file.seek(-5,1)  
       continue  
   
     if debug:  
       print 'STX found, going to try for ETX now'  
   
     # Make sure we don't try to read beyond the EOF  
     if (file.tell() + (length-5) > filesize):  
       file.seek(-5,1)  
       continue  
   
     file.seek(length-5,1)  
   
     # Make sure we don't try to read beyond the EOF  
     if (file.tell() + 3 > filesize):  
       break  
   
     line = file.read(3)  
     footer = struct.unpack("<BH",line)  
     etx = footer[0]  
     checksum = footer[1]  
   
     if (etx != 3):  
       if debug:  
         print 'ETX not found, trying next datagram at position',file.tell()-(length+3)  
       file.seek(-(length+3),1)  
       continue  
   
     # Rewind to very beginning of the datagram, including the length field  
     file.seek(-(length+4),1)  
     data = file.read(length+4)  
   
     if debug:  
       print "Got id", id, "and length", length  
   
     if (id == 0x49 or id == 0x69 or id == 0x52 or id == 0x55):  
       # Stuff for both files  
       split_allfile.write(data)  
       split_wcdfile.write(data)  
     elif (id == 0x6B):  
       # Just for the watercolumn file  
       split_wcdfile.write(data)  
     else:  
       # Everything else goes into the raw file  
       split_allfile.write(data)  
   
     percent=int(100.0 * file.tell()/filesize)  
   
     if (percent%5 == 0 and percent != last_percent):  
       print percent, "% done, ALL:",split_allfile.tell()," WCD:",split_wcdfile.tell()  
       last_percent = percent  
   
     if file.tell() >= filesize:  
       break  
   
   file.close()  
   split_allfile.close()  
   split_wcdfile.close()  
   
 print 'All done!'
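If you just want the gist of the datagram framing that the script leans on, here's a small sketch. It's written for Python 3 (the script above is Python 2.6), the function name parse_datagram is made up for illustration, and it only checks the same things the script does: a 4-byte little-endian length that doesn't count itself, STX (2) right after it, and ETX (3) followed by a 2-byte checksum at the end. The checksum itself is not verified here either.

```python
import struct


def parse_datagram(buf, offset=0):
    """Return (type_id, total_size) for the datagram at offset, or None if the framing is wrong."""
    if offset + 6 > len(buf):
        return None
    # 4-byte little-endian length, then STX (0x02) and the datagram type byte.
    length, stx, type_id = struct.unpack_from("<IBB", buf, offset)
    total = length + 4  # the length field does not count itself
    if stx != 2 or length < 5 or offset + total > len(buf):
        return None
    # The last three bytes are ETX (0x03) and a 2-byte checksum (not verified).
    etx, _checksum = struct.unpack_from("<BH", buf, offset + total - 3)
    if etx != 3:
        return None
    return type_id, total
```

A type_id of 0x6B is the water column datagram that the script routes to the .wcd file only.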

Six degrees of Kevin Bacon...in my music collection

I used to play a game with making music playlists SEVERAL years ago when I used to have time to do such things...anyway...the game involved picking a word and then finding a song in my collection with that word. Then you'd pick another word from the song you just found and then you'd have to find another song with that word. You'd carry on doing this until you got bored and then you'd hit play and go drink beer. Here's an example:

Word 1: sun

Song 1: "House of the Rising Sun", by the Animals.

Word 2: house

Song 2: "Burning Down the House", by the Talking Heads.

Word 3: burning

Song 3: "Burning Love", by Elvis Presley

Word 4: love

Song 4: "Love Hurts", by Nazareth

...and so on...

It was kind of interesting since sometimes you'd pick a bad song and end up down a blind alley.

So, I started doing that the other day but I thought of a variant: you choose TWO words and then you use the method above to find one song and then you need to follow the same method until you come to a song with the second chosen word in it. Sort of like "Six Degrees of Kevin Bacon" but with your music collection. You can make an example from the scenario explored above:

Connect the words "sun" and "hurts" in six songs or less.

That's easy, we just did it and here's the answer:

Song 1: "House of the Rising Sun", by the Animals.

Song 2: "Burning Down the House", by the Talking Heads.

Song 3: "Burning Love", by Elvis Presley

Song 4: "Love Hurts", by Nazareth

Give another a try on your own. Connect the words "sun" and "money" in six songs or less. Go ahead, try it! I'll wait.

If you're insane, you can try to find the minimum number of songs required to do this. This, naturally, got me thinking about how I would script my way out of this problem, and it reminded me of some computer science courses I took many years ago where the ideas of connected graphs were explored, including algorithms to traverse them. I can't remember much of the traversal theory, but my naive guess is that in order to jump from one randomly chosen word to another it would be best to try to connect through "hub" words that appear in a lot of songs. This would involve figuring out which words appear the most and then we'd need to compute the graph topology to represent how words are connected.
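For what it's worth, the "minimum number of songs" question maps onto a breadth-first search, which always finds a shortest chain when one exists. Here's a toy sketch in Python 3, not something I've run on the real collection. The function names are made up, a "word" is just a whitespace-separated chunk of the title, and any shared word counts as a link (including boring ones like "the").

```python
from collections import deque


def words_in(title):
    """Lowercased words of a song title."""
    return set(title.lower().split())


def shortest_chain(titles, start_word, end_word):
    """Breadth-first search for the shortest list of songs linking two words."""
    word_sets = {t: words_in(t) for t in titles}
    # Start from every song that contains the first word.
    queue = deque([t] for t in titles if start_word in word_sets[t])
    seen = {path[0] for path in queue}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if end_word in word_sets[current]:
            return path
        for other in titles:
            # Two songs are neighbours when their titles share a word.
            if other not in seen and word_sets[current] & word_sets[other]:
                seen.add(other)
                queue.append(path + [other])
    return None
```

Fed the four songs from the example with "sun" and "hurts", this walks the same chain: Animals, Talking Heads, Elvis, Nazareth. It returns None when the two words can't be connected.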

I think the most sensible way to start breaking down the problem would be to:

1) build a list of ALL words in my song collection

2) figure out the count for each word, i.e. how many songs does the word "Love" appear in? (913 songs in my collection).

So, here's my first attempt at figuring out the preliminary steps above.

find . -type f \( -name \*.mp3 -o -name \*.MP3 -o -name \*.m4a -o -name \*.m4v -o -name \*.wma \) | awk -F/ '{print $NF}'| awk -F. '{print $(NF-1)}' | sed -e 's/_/ /g' -e 's/(/ /g' -e 's/)/ /g' -e 's/,/ /g' -e s/"'"s//     | sed -e 's/ /\
/g'| perl -nle 'print if m{^[[:ascii:]]+$}' | aspell clean | awk '{ if (length($1)>2) print $1;   }' | tr '[:upper:]' '[:lower:]' | tee all_words.txt | sort -u > unique_words.txt

Yeah, I know it's a big nasty one liner. Here's the same thing broken down:

Step 1: use the find command to "find" regular files (-type f) whose file extensions match a set of music file types:

find . -type f \( -name \*.mp3 -o -name \*.MP3 -o -name \*.m4a -o -name \*.m4v -o -name \*.wma \)

Step 2: Pipe output of 1 to awk, use the "/" separator to split each file that's found into directories and filenames and print the last one (the filename without the directory structure). Do a similar trick to split the filename into name and extension using the period as the field separator, this time print the second last entry to get the filename without leading directory structure and without the file extension.

awk -F/ '{print $NF}'| awk -F. '{print $(NF-1)}'

Step 3: use the sed command to strip out weird characters that will just get in our way, like parentheses and underscores, etc, etc.

sed -e 's/_/ /g' -e 's/(/ /g' -e 's/)/ /g' -e 's/,/ /g' -e s/"'"s//

Step 4: use the sed command again to turn all spaces into newlines so that each word of a song name ends up on a line by itself.  This one was tricky since you need to embed a literal newline in the replacement portion of sed.


sed -e 's/ /\
/g'

Step 5: use perl to drop any words containing non-ASCII characters, like those French songs you downloaded after watching Amelie, you know, the ones with funny accents in the file names:

perl -nle 'print if m{^[[:ascii:]]+$}'

Step 6: Run the whole stream through aspell's "clean" filter to weed out entries that aren't valid words:

aspell clean

Step 7: Use awk to figure out which words are very short and exclude them, like 'a', 'as', 'is', 'to':

awk '{ if (length($1)>2) print $1;   }'

Step 8: Now convert all uppercase characters to lower case using the 'tr' command.

tr '[:upper:]' '[:lower:]'

Step 9:  Dump all the words into a file using the 'tee' command and then sort the same stream into a second file which contains only unique instances of the words (i.e. you won't see the word "love" in this file 913 times, it will only appear once).

tee all_words.txt | sort -u > unique_words.txt

It turns out that my music collection has 6,743 unique words in it.  In alphabetical order, the first word is "aaa" from Big Sugar's "AAA Aardvark Hotel" and the last word is "zorba" from the Gipsy Kings' "Zorba the Greek".
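If the one-liner makes your eyes water, steps 2 through 8 can be approximated in a few lines of Python 3. This is a rough sketch, not a drop-in replacement: title_words is a made-up name, the aspell step is left out entirely, and the extension is stripped with os.path.splitext rather than the awk trick, so file names with extra periods will come out a little differently.

```python
import os
import re


def title_words(path):
    """Split one music file path into lowercase words, roughly following steps 2 to 8."""
    name = os.path.splitext(os.path.basename(path))[0]  # drop directories and extension
    name = name.replace("'s", "")                       # drop possessives, as the sed step does
    name = re.sub(r"[_(),]", " ", name)                 # underscores, parentheses, commas -> spaces
    words = []
    for word in name.split():
        if not word.isascii():  # skip words with accented characters
            continue
        if len(word) <= 2:      # skip very short words
            continue
        words.append(word.lower())
    return words
```

For a file named something like "AAA_Aardvark_Hotel.mp3" this gives "aaa", "aardvark" and "hotel".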

Okay, now how do we figure out how many instances there are of each word?  We spin up a loop that visits each unique word and then uses 'grep' to find ENTIRE word matches (-w) in the file containing all the words in the song collection.  This gets dumped into a file which holds the count for each word.

rm -f word_counts.txt

for word in `cat unique_words.txt`;
do
    count=`grep -w -- "$word" all_words.txt | wc -l`
    echo $count $word >> word_counts.txt
done

You can use the sort command to determine which words show up the most:

sort -rn word_counts.txt

Here are the top ten words in my music collection, along with their counts:

2993 the
1214 you
913 love
569 and
328 your
304 don
288 don't
286 for
216 man
214 with
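One thing worth knowing about that list: grep -w treats an apostrophe as a word boundary, so the search for "don" also matches every "don't" line, which is probably why those two counts sit so close together. A Python 3 Counter sidesteps that by counting exact lines, and it does the whole job in one pass over the file instead of one grep per word. A minimal sketch (count_words is a made-up name):

```python
from collections import Counter


def count_words(lines):
    """Count exact occurrences of each non-empty line (one word per line)."""
    return Counter(line.strip() for line in lines if line.strip())


# Usage, assuming all_words.txt from the pipeline above is in the current directory:
# with open("all_words.txt") as handle:
#     for word, count in count_words(handle).most_common(10):
#         print(count, word)
```

Expect the numbers to differ a little from the grep version for words like "don".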

Okay, that's a good spot to stop for tonight. Maybe I'll pick this up again someday, maybe I won't. It was fun crafting the mega one-liner to come up with this though.

Friday, July 6, 2012

Cruise specific sea surface temperature/salinity animations

I'm currently sitting offshore south of Iceland about to pass over the Mid-Atlantic Ridge. We're shooting EM302 multibeam and, as always, I'm curious about what to expect from the ocean's temperature and salinity.

The RTOFS grids that I've been playing with for the past while have come in handy again. I wrote a script to pull down the imagery that's generated on a daily basis on the RTOFS website and a bit of sorcery with ImageMagick's "convert" program massages it into a little animation for me to see the sea surface temperature and salinity every day.

Here's an example:

And here's the script:

#!/bin/bash

dir=`date '+%Y%m%d'`

rm -rf $dir
mkdir $dir
cd $dir

for field in temperature salinity; do

    echo "Doing field" $field

    for hour in 024 048; do
        echo "Doing nowcast" $hour
        wget http://polar.ncep.noaa.gov/global/nctest/images/large/rtofs_arctic_$field\_n$hour\_0.png
    done

    for hour in 024 048 072 096 120 144; do
        echo "Doing forecast" $hour
        wget http://polar.ncep.noaa.gov/global/nctest/images/large/rtofs_arctic_$field\_f$hour\_0.png
    done
done

for file in *.png; do

prefix=`basename $file .png`
convert -crop 1840x120+0+1131 $file scale.ppm

# I have to convert to .ppm to toss out rotation info in png header such that
# the crop operation works? Otherwise, the crop geometry is in the original unrotated
# image geometry.
convert -rotate 135 $file temp.ppm
convert -crop 450x360+600+1000 temp.ppm temp2.ppm

label=`echo $file | awk -F_ '{print '$dir',$3,$4}' `

convert -draw "image Over 0,310 450,52 scale.ppm" -fill white -stroke black -pointsize 24 -draw "text 10,30 '$label'" temp2.ppm $file

done

for temperature_file in *temperature*png; do

output_file=`echo $temperature_file | sed -e 's/temperature/TS/' `

salinity_file=`echo $temperature_file | sed -e 's/temperature/salinity/' `

convert -extent 900x360 -draw "image Over 450,0 450,360 $salinity_file" $temperature_file $output_file

done

convert -loop 0 -delay 50 rtofs_arctic_temperature_f*png temperature_animation.gif

convert -loop 0 -delay 50 rtofs_arctic_salinity_f*png salinity_animation.gif

convert -loop 0 -delay 50 rtofs_arctic_TS_f*png TS_animation.gif

rm temp.ppm temp2.ppm scale.ppm

Thursday, July 5, 2012

Splitting up Kongsberg Watercolumn files

Sometimes you just forget to save Kongsberg multibeam water column data into a separate file. This just happened to me on a cruise and I found that the >2GB file sizes made my 32-bit software puke. Luckily, the file sizes didn't exceed 4GB so I decided to write a file splitter in Python that pulls apart the original .all file and outputs a new .all file purged of water column datagrams AND a separate .wcd file. Here's my first cut at it. It's a script that you feed a list of filenames; it creates a "split" subdirectory next to each file and writes the split .all/.wcd combination into that subdirectory. For example, the file:

20120702/0003_20120702_122222_FK_EM710.all

will split into:

20120702/split/0003_20120702_122222_FK_EM710.all
20120702/split/0003_20120702_122222_FK_EM710.wcd

Give it a try, I hope it doesn't nuke your data.

#!/usr/bin/env python2.6

import os
import struct
import time
import sys

file_count=0
debug=False

dir="split"

for filename in sys.argv:

  file_count += 1

  if (file_count == 1):
    # I'm too lazy to parse command line args so just skipping over the
    # script name (which is arg zero in the list)
    continue

  file = open(filename, 'rb')
  filesize = os.path.getsize(filename)

  # What is the path to the input file without the filename?
  filepath=os.path.dirname(filename)
  fileprefix=os.path.basename(filename)
  if debug or True:
    print "Doing file",filename
    print "Got file path",filepath
    print "Got file basename",fileprefix

  # Join the file's directory path with the usual output subdirectory name
  outdir=os.path.join(filepath,dir)

  if not os.path.exists(outdir):
    os.makedirs(outdir)

  split_allname = os.path.join(outdir, fileprefix)
  split_wcdname = split_allname.replace(".all",".wcd")

  if debug or True:
    print filename, "will split into", split_allname, split_wcdname

  if not os.path.exists(split_allname):
    split_allfile = open(split_allname,"wb")
    split_wcdfile = open(split_wcdname,"wb")
  else:
    print "Skipping", filename, "since it's already split!"
    file.close()
    continue

  last_percent = 0
  while True:

    # Make sure we don't try to read beyond the EOF
    if (file.tell() + 6 > filesize):
      break

    line = file.read(6)

    header = struct.unpack("<IBB",line)

    rawlength=line[0:3]
    length = header[0]
    stx = header[1]
    id = header[2]

    if (stx != 2):
      if debug:
        print 'STX not found, trying next datagram at position',file.tell()-5
      file.seek(-5,1)
      continue

    if debug:
      print 'STX found, going to try for ETX now'

    # Make sure we don't try to read beyond the EOF
    if (file.tell() + (length-5) > filesize):
      file.seek(-5,1)
      continue

    file.seek(length-5,1)

    # Make sure we don't try to read beyond the EOF
    if (file.tell() + 3 > filesize):
      break

    line = file.read(3)
    footer = struct.unpack("<BH",line)
    etx = footer[0]
    checksum = footer[1]

    if (etx != 3):
      if debug:
        print 'ETX not found, trying next datagram at position',file.tell()-(length+3)
      file.seek(-(length+3),1)
      continue

    # Rewind to very beginning of the datagram, including the length field
    file.seek(-(length+4),1)
    data = file.read(length+4)

    if debug:
      print "Got id", id, "and length", length

    if (id == 0x49 or id == 0x69 or id == 0x52 or id == 0x55):
      # Stuff for both files
      split_allfile.write(data)
      split_wcdfile.write(data)
    elif (id == 0x6B):
      # Just for the watercolumn file
      split_wcdfile.write(data)
    else:
      # Everything else goes into the raw file
      split_allfile.write(data)

    percent=int(100.0 * file.tell()/filesize)

    if (percent%5 == 0 and percent != last_percent):
      print percent, "% done, ALL:",split_allfile.tell()," WCD:",split_wcdfile.tell()
      last_percent = percent

    if file.tell() >= filesize:
      break

  file.close()
  split_allfile.close()
  split_wcdfile.close()

print 'All done!'

Friday, May 25, 2012

SVP Weather Map daily forecast

I got the IT guys at work to find a dusty old PC and to install Linux on it. Tucked safely away in a corner somewhere, it's running a daily script that downloads the RTOFS 144 hour forecast and then produces SVP Weather Maps. These are then dumped on the FTP server where they are publicly available after an anonymous FTP login.

You can find the daily forecast here: Today's SVP Weather Map.

If you're a command line scripter, you might be interested in the details of the script:

#!/bin/bash

source ~/.bash_profile

# http://nomads.ncep.noaa.gov/pub/data/nccf/com/rtofs/prod/rtofs.20120212/rtofs_glo_3dz_f024_daily_3zsio.nc
base_url="http://nomads.ncep.noaa.gov/pub/data/nccf/com/rtofs/prod/rtofs"

# This forces us to download yesterday's grids since we want the full 144 hour forecast and that won't be
# available until the end of today.
#day=`date -v -1d '+%Y%m%d'`
day=`date --date="yesterday" '+%Y%m%d'`

file_prefix="rtofs_glo_3dz_f"
file_suffix_salinity="_daily_3zsio.nc"
file_suffix_temperature="_daily_3ztio.nc"

cd /home/jbeaudoin/data/RTOFS
mkdir -p $day
cd $day

for hour in 024 048 072 096 120 144;
do

        echo "Doing hour" $hour
        name=$base_url.$day/$file_prefix$hour$file_suffix_temperature
        echo "Retrieving" $name
        wget -N $name

        name=$base_url.$day/$file_prefix$hour$file_suffix_salinity
        echo "Retrieving" $name
        wget -N $name

        temperature_file=$file_prefix$hour$file_suffix_temperature
        salinity_file=$file_prefix$hour$file_suffix_salinity

        echo "Doing " $hour $temperature_file $salinity_file


        # Do the uncertainty analysis, it dumps out a results_YYYYMMDD.dat file
        svp_rtofs \
                -angular_sector 120 \
                -draft 5 \
                -t $temperature_file \
                -s $salinity_file

        # Daily download footprint is 24GB. No need to hang on
        # to these so clearing them out.
        rm $temperature_file
        rm $salinity_file
done

rm .gmt*
makecpt -Crainbow -T0/0.5/0.1 -Z -D > uncertainty.cpt
gmtset CHAR_ENCODING ISOLatin1Encoding

today=`date '+%Y%m%d'`

for f in *.dat;
do

        prefix=`get_prefix $f`

        # results_20120524.dat
        datestamp=`basename $prefix | awk -F_ '{print $2}' `

        rm $prefix.grd
        cat $f | awk '{if (NF == 3) {printf("%.7f %.7f %.2f\n", $2,$1,$3);} }' | nearneighbor -I3m -S10m -R-180/180/-75/80 -N8/1 -G$prefix.grd
        grdimage $prefix.grd -R -JM8i -K -Xc -Y1.75i -Q -Cuncertainty.cpt -K > $prefix.ps

        pscoast -R -J -O -K -Gblack -W0.1p -Dl -A500/0/1 -B30g30/20g20 -Wthinner,black -U >> $prefix.ps

        echo "60 50 20 0 1 BL $datestamp" | pstext -R -J -O -Gyellow -K >> $prefix.ps

        psscale --LABEL_FONT_SIZE=15p -D4.0i/-0.5i/6i/0.3ih -O -Cuncertainty.cpt -B0.1:"Depth Uncertainty, 60\260 beam angle (%w.d., 2@~s@~)": >> $prefix.ps

        ps2raster -A -Tg -E600 $prefix.ps

        convert -rotate 90 $prefix.png $prefix.png

        if [ $datestamp -eq $today ]; then
                # We always provide a "today" file so that people can
                # bookmark the today file on the FTP site.
                cp $prefix.png today.png
        fi

        gzip -f $f

done

At the end of the script, the .png files and the .grd files are uploaded to the CCOM FTP site.
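If bash isn't your thing, the download half of the script (working out yesterday's date and the twelve file URLs) can be sketched in Python 3 like this. The function name forecast_urls and the constant names are made up for illustration; the base URL, file prefix and suffixes are copied straight from the script above, and I make no promises that the server layout stays the same.

```python
from datetime import date, timedelta

BASE_URL = "http://nomads.ncep.noaa.gov/pub/data/nccf/com/rtofs/prod/rtofs"
HOURS = ["024", "048", "072", "096", "120", "144"]
SUFFIXES = {"temperature": "_daily_3ztio.nc", "salinity": "_daily_3zsio.nc"}


def forecast_urls(today):
    """List (hour, field, url) for yesterday's model run, relative to the given date."""
    day = (today - timedelta(days=1)).strftime("%Y%m%d")
    urls = []
    for hour in HOURS:
        for field, suffix in SUFFIXES.items():
            urls.append((hour, field, f"{BASE_URL}.{day}/rtofs_glo_3dz_f{hour}{suffix}"))
    return urls
```

Calling forecast_urls(date.today()) gives the same list of files that the wget calls in the loop go after, temperature first and then salinity for each forecast hour.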

Tuesday, February 28, 2012

More Mac Mini wrestling...

Okay, so I didn't quite give up when I said I did. More notes on this install.

1) Downloaded Xcode 4.3 and then tried a fink install but it whined about not finding a C compiler. After some googling, I found out that you need to download the command line tools from the "Downloads" tab within Xcode's preferences. This is pretty complicated for us "hello_world.c" folks.

2) After doing that, I tried a fink installation again. Jumped into the fink install directory and did "./bootstrap" and ran with all the defaults. It attempted to make a new /sw2 directory so I killed the job and did 'sudo mv /sw /sw_fail' to get that out of the way, then launched the bootstrap script again.

After a few minutes:

3) After fink installed, I typed:

/sw/bin/pathsetup.sh

This set up the fink path by creating a .profile file in my ~ (one wasn't there already, not sure what it would have done if one was).

Then typed:

fink selfupdate-rsync
fink index -f

That's it and probably a good spot to stop for tonight...