CS67 Multimedia Information Retrieval Demo
Douglas Turnbull & Rich Wicentowski
Fall 2009

--------------------------------------------
INTRODUCTION

Up to now we have been analyzing text documents
-> can be broken down into a "bag of tokens"
-> each token has a (clear) semantic meaning

Multimedia documents are less easy to break down
-> images, videos, sounds
-> we focus on music since we are building a music search engine
-> many ideas transfer to other media (image, video) and domains (sound effects, speech)

----------------------------------------------
AUDIO SIGNAL PROCESSING

What is sound?
-> vibration (pressure waves) transmitted through matter, usually air
   (e.g., a plucked guitar string, a speaker diaphragm)
-> the rate of the wave is called the frequency

TODO: audioPlay.m (look at a wav file in MATLAB)

How are complex sounds created?
-> you can think of sound as a sum of sine waves
-> with an infinite number of sine waves we can reconstruct any sound
-> Fourier theorem

TODO: sineExample.m (a related sketch appears at the end of these notes)

How do we "hear" sound?
-> we use our ears to perceive these frequencies
-> the cochlea in the inner ear has hair cells that "wiggle" when certain frequencies are encountered
-> this causes certain neurons in our brain to fire
-> we learn to interpret certain sounds in certain ways
   (e.g., music genre classification in less than a second)

What is the frequency domain?
-> a waveform is in the TIME domain
   -> "sample" the amplitude at regular TIME intervals
   -> time vs. amplitude
-> we use a Fourier transform to project into the FREQUENCY domain
   -> frequency vs. amplitude

TODO: periodogramExample.m (a related sketch appears at the end of these notes)
-> play with two frequencies
-> see them in the frequency domain
-> look at an audio example
-> a periodogram represents the spectral shape of a short audio sample (23 ms)
-> a short "snapshot" is used because we need the signal to be (approximately) stationary

TODO: Look at the spectrogram (aka Short-Time Fourier Transform)
-> a "time series" of periodograms (with some scaling)
-> each sample represents a spectral shape
-> high-dimensional (d = 256+) feature vectors

Mel-Frequency Cepstral Coefficients (MFCC)
-> we reduce the dimensionality of the periodograms (d = 13)
-> lots of signal processing
-> but we can "reconstruct" the spectrum from a small number of MFCCs

MFCC pipeline (see Logan, ISMIR 2000, for a great introduction):
1) Waveform (discretized and quantized)
2) STFT (d = 256)
3) Scale amplitudes
   -> log()
   -> matches human perception of loudness
4) Scale frequencies
   -> Mel scaling
   -> rescale frequency bands to match human perception
   -> a mapping between frequency and perceived pitch
   -> closer spacing for low frequencies
   -> d = 40
5) Project spectra to a lower-dimensional space
   -> d = between 5 and 20; 13 is common
   -> more coefficients give a better reconstruction
   -> the projection is done with the discrete cosine transform (DCT)
   -> "decorrelates" the bins produced by mel scaling

TODO: check out mfccExample.m (a related sketch appears at the end of these notes)
-> next time you will be using ma_mfcc() to calculate MFCCs for 80 songs

--------------------------------------------------------------------
KEY POINTS:

1) In the end, we have a "time series" of 13-dimensional vectors, each of which is a compact representation of a short-term spectral shape.

2) Each spectrum encodes information about the "timbre" of the music.
-> timbre is the color of a sound
-> the signature of an instrument or a human voice
-> it also encodes information about the harmonic and inharmonic nature of a sound
   (e.g., the broadband "noise" of a snare drum)

3) Often, we ignore the temporal component of the time series so that we can think of the data as a "bag of feature vectors".

--------------------------------------------------------------------
NEXT TIME

We will look at two ways to classify music by genre using MFCCs.
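
--------------------------------------------------------------------
CODE SKETCHES (illustrative only; these are not the course scripts)

A minimal MATLAB sketch in the spirit of sineExample.m: approximate a square-ish wave by summing odd harmonics, illustrating the Fourier-theorem idea that a complex sound is a sum of sine waves. The sample rate, fundamental frequency, and number of harmonics are arbitrary choices for illustration.

    fs = 8000;                      % sample rate (Hz), chosen arbitrarily
    t  = 0:1/fs:1;                  % one second of time samples
    f0 = 220;                       % fundamental frequency (Hz)
    x  = zeros(size(t));
    for k = 1:2:15                  % odd harmonics 1, 3, 5, ..., 15
        x = x + sin(2*pi*k*f0*t)/k; % higher harmonics get weaker
    end
    x = x / max(abs(x));            % normalize to avoid clipping
    sound(x, fs);                   % listen: "buzzier" than a pure tone
    plot(t(1:400), x(1:400));       % plot a few periods of the waveform
    xlabel('time (s)'); ylabel('amplitude');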
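
A minimal sketch in the spirit of periodogramExample.m: take a short window of a two-frequency signal, project it from the TIME domain into the FREQUENCY domain with an FFT, then repeat over successive windows to build a spectrogram-like "time series" of spectra. The window and hop sizes here are illustrative, not the values used in the course scripts.

    fs  = 8000;
    t   = 0:1/fs:1;
    x   = sin(2*pi*440*t) + 0.5*sin(2*pi*1000*t);  % two frequencies mixed

    N   = 256;                        % window length (~32 ms at 8 kHz)
    w   = x(1:N);                     % one short, roughly stationary snapshot
    X   = abs(fft(w));                % magnitude spectrum of that snapshot
    f   = (0:N/2-1) * fs / N;         % frequencies of the first N/2 bins
    plot(f, X(1:N/2));                % expect peaks near 440 Hz and 1000 Hz
    xlabel('frequency (Hz)'); ylabel('magnitude');

    hop = 128;                        % spectrogram: one spectrum per hop
    nF  = floor((length(x) - N) / hop) + 1;
    S   = zeros(N/2, nF);             % d = N/2 dimensional feature vectors
    for i = 1:nF
        win     = x((i-1)*hop + (1:N));
        X       = abs(fft(win));
        S(:, i) = X(1:N/2)';
    end
    imagesc(S); axis xy; xlabel('frame'); ylabel('frequency bin');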
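
A sketch of the MFCC pipeline above (waveform -> STFT -> mel filterbank -> log -> DCT), in the spirit of mfccExample.m; in the assignment you will use ma_mfcc() instead. One common ordering applies the mel filterbank to the linear magnitude spectrum and then takes the log of the band energies, which is what this sketch does; all parameter values are illustrative, and hamming() and dct() come from the Signal Processing Toolbox.

    fs   = 22050;
    x    = sin(2*pi*440*(0:1/fs:1)) + 0.3*randn(1, fs+1);  % toy "sound"

    N    = 512;                     % ~23 ms frames at 22.05 kHz
    hop  = 256;
    nMel = 40;                      % step 4: 40 mel bands
    nCep = 13;                      % step 5: keep 13 coefficients

    % triangular mel filterbank (nMel filters spanning 0..fs/2)
    mel   = @(f) 2595*log10(1 + f/700);            % Hz -> mel
    imel  = @(m) 700*(10.^(m/2595) - 1);           % mel -> Hz
    edges = imel(linspace(0, mel(fs/2), nMel+2));  % filter edge freqs (Hz)
    fftHz = (0:N/2) * fs / N;                      % FFT bin centre freqs (Hz)
    fb    = zeros(nMel, N/2+1);
    for k = 1:nMel
        lo = edges(k); mid = edges(k+1); hi = edges(k+2);
        fb(k,:) = max(0, min((fftHz-lo)/(mid-lo), (hi-fftHz)/(hi-mid)));
    end

    % frame-by-frame: spectrum -> mel band energies -> log -> DCT
    nFrames = floor((length(x) - N) / hop) + 1;
    mfcc    = zeros(nCep, nFrames);
    for i = 1:nFrames
        frame      = x((i-1)*hop + (1:N)) .* hamming(N)';  % windowed frame
        spec       = abs(fft(frame));                      % step 2: STFT
        melE       = fb * spec(1:N/2+1)';                  % step 4: mel bands
        logMelE    = log(melE + eps);                      % step 3: log loudness
        ceps       = dct(logMelE);                         % step 5: DCT
        mfcc(:, i) = ceps(1:nCep);                         % 13-dim vector
    end
    imagesc(mfcc); axis xy; xlabel('frame'); ylabel('MFCC coefficient');

The result is the "time series" of 13-dimensional vectors from the key points above: ignore the frame order and you have a bag of feature vectors.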