CS67 Multimedia Information Retrieval Demo
Douglas Turnbull & Rich Wicentowski
Fall 2009

--------------------------------------------
INTRODUCTION

Up to now we have been analyzing text documents
-> can be broken down into a "bag of tokens"
-> each token has a (clear) semantic meaning

Multimedia documents are less easy to break down
-> images, videos, sounds
-> we focus on music since we are building a music search engine
-> many ideas transfer to other media (image, video) and domains (sound effects, speech)

----------------------------------------------
AUDIO SIGNAL PROCESSING

What is sound?
-> vibration (pressure waves) transmitted through matter, usually air
   (e.g., a plucked guitar string, a speaker diaphragm)
-> the rate of the wave is called the frequency

TODO: audioPlay.m (look at a wav file in MATLAB)

How are complex sounds created?
-> you can think of sound as a sum of sine waves
-> with an infinite number of sine waves we can reconstruct any sound
-> Fourier theorem

TODO: sineExample.m (a related sketch appears at the end of these notes)

How do we "hear" sound?
-> we use our ears to perceive these frequencies
-> the cochlea in the inner ear has hair cells that "wiggle" when certain frequencies are encountered
-> this causes certain neurons in our brain to fire
-> we learn to interpret certain sounds in certain ways
   (e.g., music genre classification in less than a second)

What is the frequency domain?
-> a waveform is in the TIME domain
   -> "sample" the amplitude at regular TIME intervals
   -> time vs. amplitude
-> we use a Fourier transform to project into the FREQUENCY domain
   -> frequency vs. amplitude

TODO: periodogramExample.m (a related sketch appears at the end of these notes)
-> play with two frequencies
-> see them in the frequency domain
-> look at an audio example
-> a periodogram represents the spectral shape of a short audio sample (23 ms)
-> a short "snapshot" is used because we need the signal to be (approximately) stationary

TODO: Look at the spectrogram (aka Short-Time Fourier Transform)
-> a "time series" of periodograms (with some scaling)
-> each sample represents a spectral shape
-> high-dimensional (d = 256+) feature vectors

Mel-Frequency Cepstral Coefficients (MFCC)
-> we reduce the dimensionality of the periodograms (d = 13)
-> lots of signal processing
-> but we can "reconstruct" the spectrum from a small number of MFCCs

MFCC pipeline (see Logan, ISMIR 2000, for a great introduction):
1) Waveform (discretized and quantized)
2) STFT (d = 256)
3) Scale amplitudes
   -> log()
   -> matches human perception of loudness
4) Scale frequencies
   -> Mel scaling
   -> rescale frequency bands to match human perception
   -> a mapping between frequency and perceived pitch
   -> closer spacing for low frequencies
   -> d = 40
5) Project spectra to a lower-dimensional space
   -> d = between 5 and 20; 13 is common
   -> more coefficients give a better reconstruction
   -> the projection is done with the discrete cosine transform (DCT)
   -> "decorrelates" the bins produced by mel scaling

TODO: check out mfccExample.m (a related sketch appears at the end of these notes)
-> next time you will be using ma_mfcc() to calculate MFCCs for 80 songs

--------------------------------------------------------------------
KEY POINTS:

1) In the end, we have a "time series" of 13-dimensional vectors, each of which is a compact representation of a short-term spectral shape.

2) Each spectrum encodes information about the "timbre" of the music.
-> timbre is the color of a sound
-> the signature of an instrument or a human voice
-> it also encodes information about the harmonic and inharmonic nature of a sound
   (e.g., the broadband "noise" of a snare drum)

3) Often, we ignore the temporal component of the time series so that we can think of the data as a "bag of feature vectors".

--------------------------------------------------------------------
NEXT TIME

We will look at two ways to classify music by genre using MFCCs.
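
--------------------------------------------------------------------
CODE SKETCHES (illustrative only; these are not the course scripts)

A minimal MATLAB sketch in the spirit of sineExample.m: approximate a square-ish wave by summing odd harmonics, illustrating the Fourier-theorem idea that a complex sound is a sum of sine waves. The sample rate, fundamental frequency, and number of harmonics are arbitrary choices for illustration.

    fs = 8000;                      % sample rate (Hz), chosen arbitrarily
    t  = 0:1/fs:1;                  % one second of time samples
    f0 = 220;                       % fundamental frequency (Hz)
    x  = zeros(size(t));
    for k = 1:2:15                  % odd harmonics 1, 3, 5, ..., 15
        x = x + sin(2*pi*k*f0*t)/k; % higher harmonics get weaker
    end
    x = x / max(abs(x));            % normalize to avoid clipping
    sound(x, fs);                   % listen: "buzzier" than a pure tone
    plot(t(1:400), x(1:400));       % plot a few periods of the waveform
    xlabel('time (s)'); ylabel('amplitude');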
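
A minimal sketch in the spirit of periodogramExample.m: take a short window of a two-frequency signal, project it from the TIME domain into the FREQUENCY domain with an FFT, then repeat over successive windows to build a spectrogram-like "time series" of spectra. The window and hop sizes here are illustrative, not the values used in the course scripts.

    fs  = 8000;
    t   = 0:1/fs:1;
    x   = sin(2*pi*440*t) + 0.5*sin(2*pi*1000*t);  % two frequencies mixed

    N   = 256;                        % window length (~32 ms at 8 kHz)
    w   = x(1:N);                     % one short, roughly stationary snapshot
    X   = abs(fft(w));                % magnitude spectrum of that snapshot
    f   = (0:N/2-1) * fs / N;         % frequencies of the first N/2 bins
    plot(f, X(1:N/2));                % expect peaks near 440 Hz and 1000 Hz
    xlabel('frequency (Hz)'); ylabel('magnitude');

    hop = 128;                        % spectrogram: one spectrum per hop
    nF  = floor((length(x) - N) / hop) + 1;
    S   = zeros(N/2, nF);             % d = N/2 dimensional feature vectors
    for i = 1:nF
        win     = x((i-1)*hop + (1:N));
        X       = abs(fft(win));
        S(:, i) = X(1:N/2)';
    end
    imagesc(S); axis xy; xlabel('frame'); ylabel('frequency bin');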
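
A sketch of the MFCC pipeline above (waveform -> STFT -> mel filterbank -> log -> DCT), in the spirit of mfccExample.m; in the assignment you will use ma_mfcc() instead. One common ordering applies the mel filterbank to the linear magnitude spectrum and then takes the log of the band energies, which is what this sketch does; all parameter values are illustrative, and hamming() and dct() come from the Signal Processing Toolbox.

    fs   = 22050;
    x    = sin(2*pi*440*(0:1/fs:1)) + 0.3*randn(1, fs+1);  % toy "sound"

    N    = 512;                     % ~23 ms frames at 22.05 kHz
    hop  = 256;
    nMel = 40;                      % step 4: 40 mel bands
    nCep = 13;                      % step 5: keep 13 coefficients

    % triangular mel filterbank (nMel filters spanning 0..fs/2)
    mel   = @(f) 2595*log10(1 + f/700);            % Hz -> mel
    imel  = @(m) 700*(10.^(m/2595) - 1);           % mel -> Hz
    edges = imel(linspace(0, mel(fs/2), nMel+2));  % filter edge freqs (Hz)
    fftHz = (0:N/2) * fs / N;                      % FFT bin centre freqs (Hz)
    fb    = zeros(nMel, N/2+1);
    for k = 1:nMel
        lo = edges(k); mid = edges(k+1); hi = edges(k+2);
        fb(k,:) = max(0, min((fftHz-lo)/(mid-lo), (hi-fftHz)/(hi-mid)));
    end

    % frame-by-frame: spectrum -> mel band energies -> log -> DCT
    nFrames = floor((length(x) - N) / hop) + 1;
    mfcc    = zeros(nCep, nFrames);
    for i = 1:nFrames
        frame      = x((i-1)*hop + (1:N)) .* hamming(N)';  % windowed frame
        spec       = abs(fft(frame));                      % step 2: STFT
        melE       = fb * spec(1:N/2+1)';                  % step 4: mel bands
        logMelE    = log(melE + eps);                      % step 3: log loudness
        ceps       = dct(logMelE);                         % step 5: DCT
        mfcc(:, i) = ceps(1:nCep);                         % 13-dim vector
    end
    imagesc(mfcc); axis xy; xlabel('frame'); ylabel('MFCC coefficient');

The result is the "time series" of 13-dimensional vectors from the key points above: ignore the frame order and you have a bag of feature vectors.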