
Thursday, November 29, 2012

Six degrees of Kevin Bacon...in my music collection

I used to play a game when making music playlists, SEVERAL years ago when I still had time for such things...anyway...the game involved picking a word and then finding a song in my collection with that word in its title.  Then you'd pick another word from the title of the song you just found and find another song containing that word.  You'd carry on doing this until you got bored, then you'd hit play and go drink beer.  Here's an example:


Word 1: sun
Song 1: "House of the Rising Sun", by the Animals.
Word 2: house
Song 2: "Burning Down the House", by the Talking Heads.
Word 3: burning
Song 3: "Burning Love", by Elvis Presley
Word 4: love
Song 4: "Love Hurts", by Nazareth

...and so on...

It was kind of interesting since sometimes you'd pick a bad song and end up down a blind alley.

So, I started doing that the other day, but then I thought of a variant: you choose TWO words, use the method above to find a song containing the first word, and then follow the same chain until you come to a song containing the second word.  Sort of like "Six Degrees of Kevin Bacon" but with your music collection.  You can make an example from the scenario explored above:

Connect the words "sun" and "hurts" in six songs or less.

That's easy, we just did it and here's the answer:

Song 1: "House of the Rising Sun", by the Animals.
Song 2: "Burning Down the House", by the Talking Heads.
Song 3: "Burning Love", by Elvis Presley
Song 4: "Love Hurts", by Nazareth

Give another a try on your own.  Connect the words "sun" and "money" in six songs or less.  Go ahead, try it!  I'll wait.

If you're insane, you can try to find the minimum number of songs required to do this.  This, naturally, got me thinking about how I would go about scripting my way out of this problem, and it reminded me of some computer science courses I took many years ago where the idea of connected graphs was explored, including algorithms to traverse them.  I can't remember much of the traversal theory, but my naive guess is that in order to jump from one randomly chosen word to another it would be best to try to connect through "hub" words that appear in a lot of songs.  That would mean figuring out which words appear the most, and then computing the graph topology that represents how words are connected.
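Just to sketch where this might eventually go (I haven't actually run this; it assumes a hypothetical file called titles.txt with one cleaned-up song title per line), the edges of that word graph could be built by pairing up every two words that appear in the same title:

# Hypothetical sketch: titles.txt holds one cleaned song title per line.
# Print every pair of words that share a title -- those pairs are the
# edges of the word graph.  sort -u collapses duplicate edges.
awk '{ for (i = 1; i < NF; i++) {
         for (j = i + 1; j <= NF; j++) {
           print tolower($i), tolower($j)
         }
       }
     }' titles.txt | sort -u > word_edges.txt

A breadth-first walk over that edge list would then give the shortest chain between two words, but that's getting way ahead of myself.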

I think the most sensible way to start breaking down the problem would be to:

1) build a list of ALL words in my song collection
2) figure out the count for each word, i.e. how many songs does the word "Love" appear in?  (913 songs in my collection).

So, here's my first attempt at figuring out the preliminary steps above.

find . -type f \( -name \*.mp3 -o -name \*.MP3 -o -name \*.m4a -o -name \*.m4v -o -name \*.wma \) | awk -F/ '{print $NF}' | awk -F. '{print $(NF-1)}' | sed -e 's/_/ /g' -e 's/(/ /g' -e 's/)/ /g' -e 's/,/ /g' -e s/"'"s// | sed -e 's/ /\
/g' | perl -nle 'print if m{^[[:ascii:]]+$}' | aspell clean | awk '{ if (length($1)>2) print $1 }' | tr '[:upper:]' '[:lower:]' | tee all_words.txt | sort -u > unique_words.txt

Yeah, I know it's a big nasty one-liner.  Here's the same thing broken down:

Step 1: use the find command to "find" regular files (-type f) whose file extensions match a set of music file types (the escaped parentheses group the -name tests together so that -type f applies to all of them):


find . -type f \( -name \*.mp3 -o -name \*.MP3 -o -name \*.m4a -o -name \*.m4v -o -name \*.wma \)
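As an aside, if your version of find supports -iname (the GNU and BSD ones do), you can match the extensions case-insensitively and drop the separate .MP3 pattern; something like this should be equivalent:

# Alternative (not what I ran): -iname matches file names
# case-insensitively, so one test covers both .mp3 and .MP3.
find . -type f \( -iname '*.mp3' -o -iname '*.m4a' -o -iname '*.m4v' -o -iname '*.wma' \)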


Step 2: Pipe the output of step 1 to awk, using "/" as the field separator to split each path that's found into directories and a filename, and print the last field (the filename without the leading directory structure).  Then do a similar trick with the period as the field separator to split the filename into name and extension, this time printing the second-last field to get the filename without the extension.


awk -F/ '{print $NF}'| awk -F. '{print $(NF-1)}'
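For example, feeding it a made-up path for one of the songs above:

# The first awk keeps only the filename, the second drops the extension.
echo "./Animals/House of the Rising Sun.mp3" | awk -F/ '{print $NF}' | awk -F. '{print $(NF-1)}'

which prints just "House of the Rising Sun".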

Step 3: use the sed command to strip out characters that will just get in our way: underscores, parentheses and commas get turned into spaces, and the "'s" in possessives gets dropped.

sed -e 's/_/ /g' -e 's/(/ /g' -e 's/)/ /g' -e 's/,/ /g' -e s/"'"s//
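The same thing can be written a bit more compactly with a character class, if you prefer (should be equivalent, just less typing):

# Equivalent form: one bracket expression for the junk characters,
# plus the same "'s" removal.
sed -e 's/[_(),]/ /g' -e "s/'s//"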

Step 4: use the sed command again to turn all spaces into newlines so that a song name gets split up into its individual words, with each word on a line by itself.  This one was tricky since you need to embed a literal newline in the replacement portion of the sed expression.

sed -e 's/ /\
/g'
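For what it's worth, tr can do the same space-to-newline conversion without the embedded newline trick (the -s flag squeezes runs of spaces so you don't end up with blank lines):

# Alternative to the embedded-newline sed: translate spaces to newlines,
# squeezing repeated spaces down to a single newline.
tr -s ' ' '\n'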

Step 5: use perl to strip out any words containing non-ASCII characters, like those French songs you downloaded after watching Amelie, you know, the ones with funny accents in the file names:

perl -nle 'print if m{^[[:ascii:]]+$}' 


Step 6: Run the whole stream through a spell checker and remove words that don't appear in the dictionary:


aspell clean


Step 7: Use awk to exclude very short words (two characters or fewer), like 'a', 'as', 'is', 'to':


awk '{ if (length($1)>2) print $1;   }' 


Step 8: Now convert all uppercase characters to lowercase using the 'tr' command.


tr '[:upper:]' '[:lower:]'


Step 9:  Dump all the words into a file using the 'tee' command and then sort the same stream into a second file which contains only one instance of each word, i.e. you won't see the word "love" in this file 913 times; it will only appear once.


tee all_words.txt | sort -u > unique_words.txt
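A quick sanity check at this point is to compare the sizes of the two files:

# How many words in total, and how many unique words?
wc -l all_words.txt unique_words.txt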


It turns out that my music collection has 6,743 unique words in it.  In alphabetical order, the first word is "aaa" from Big Sugar's "AAA Aardvark Hotel" and the last word is "zorba" from the Gypsy King's "Zorba the Greek".


Okay, now how do we figure out how many instances there are of each word?  We spin up a loop that visits each unique word and then uses 'grep' to find ENTIRE word matches (-w) in the file containing all the words in the song collection.  This gets dumped into a file which holds the count for each word.


rm -f word_counts.txt
for word in `cat unique_words.txt`;
do
    count=`grep -w "$word" all_words.txt | wc -l`
    echo $count $word >> word_counts.txt
done


You can use the sort command to determine which words show up the most:

sort -rn word_counts.txt
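As an aside, that grep loop gets pretty slow on a big collection; if all you want is the counts, sort and uniq will produce the same numbers in one shot (the formatting differs slightly since uniq -c pads the counts):

# Count how many times each word appears, most frequent first.
sort all_words.txt | uniq -c | sort -rn > word_counts.txt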

Here are the top ten words in my music collection, along with their counts:


2993 the
1214 you
913 love
569 and
328 your
304 don
288 don't
286 for
216 man
214 with



(In case you're wondering why "don" and "don't" both show up: the -w flag treats the apostrophe as a word boundary, so a search for "don" also matches every "don't".)

Okay, that's a good spot to stop for tonight.  Maybe I'll pick this up again someday, maybe I won't.  It was fun crafting the mega one-liner to come up with this though.


