Proc. of the 10th Int. Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, September 10-15, 2007
A MATLAB TOOLBOX FOR MUSICAL FEATURE EXTRACTION FROM AUDIO
Olivier Lartillot, Petri Toiviainen
University of Jyväskylä
Finland
lartillo@campus.jyu.fi
Abstract: We present MIRtoolbox, an integrated set of functions written in Matlab, dedicated to the extraction of musical features from audio files. The design is based on a modular framework: the different algorithms are decomposed into stages, formalized using a minimal set of elementary mechanisms, and integrating different variants proposed by alternative approaches (including new strategies we have developed), which users can select and parametrize.
This paper offers an overview of the set of features, related, among others, to timbre, tonality, rhythm or form, that can be extracted with MIRtoolbox. Four particular analyses are provided as examples. The toolbox also includes functions for statistical analysis, segmentation and clustering. Particular attention has been paid to the design of a syntax that offers both simplicity of use and transparent adaptiveness to a multiplicity of possible input types. Each feature extraction method can accept as argument an audio file, or any preliminary result from intermediary stages of the chain
of operations. Also, the same syntax can be used for analyses of single audio files, batches of files, series of audio segments, multichannel signals, etc. For that purpose, the data and methods of the toolbox are organised in an object-oriented architecture.

1. MOTIVATION AND APPROACH
MIRtoolbox is a Matlab toolbox dedicated to the extraction of musically-related features from audio recordings. It has been designed in particular with the objective of enabling the computation of a large range of features from databases of audio files, which can then be subjected to statistical analyses.
Few software tools have been proposed in this area. The most important one, Marsyas [1], provides a general architecture for connecting audio, soundfiles, signal processing blocks and machine learning (see section 5 for more details). One particularity of our own approach lies in the use of the Matlab computing environment, which offers good visualisation capabilities and gives access to a large variety of other toolboxes. In particular, MIRtoolbox makes use of functions available in recommended public-domain toolboxes such as the Auditory Toolbox [2], NetLab [3], or the SOM Toolbox [4]. Other toolboxes, such as the Statistics Toolbox or the Neural Network Toolbox from MathWorks, can be directly used for further analyses of the features extracted by MIRtoolbox without having to export the data from one software environment to another.
Such a computational framework, because of its general objectives, could be useful to the research community in Music Information Retrieval (MIR), but also for educational purposes. For that reason, particular attention has been paid to the ease of use of the toolbox. In particular, complex analytic processes can be designed using a very simple syntax, whose expressive power comes from the use of an object-oriented paradigm.
The different musical features extracted from the audio files are highly interdependent: in particular, as can be seen in figure 1, some features are based on the same initial computations. In order to improve computational efficiency, it is important to avoid redundant computations of these common components. Each of these intermediary components, and the final musical features, are therefore considered as building blocks that can be freely articulated with one another. Besides, in keeping with the objective of optimal ease of use of the toolbox, each building block has been conceived in such a way that it can adapt to the type of input data.
For instance, the computation of the MFCCs can be based on the waveform of the initial audio signal, or on intermediary representations such as the spectrum or the mel-scale spectrum (see Fig. 1). Similarly, autocorrelation is computed for different ranges of delays depending on the type of input data (audio waveform, envelope, spectrum). This decomposition of the whole set of feature extraction algorithms into a common set of building blocks has the advantage of offering a synthetic overview of the different approaches studied in this domain of research.
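The adaptive-input idea can be illustrated outside of Matlab. The following Python sketch (not MIRtoolbox code; all names are ours) shows how a feature extractor can accept either a raw waveform or an already-computed spectrum, reusing the intermediate result when it is available:

```python
# Illustrative sketch of type-adaptive building blocks, in plain Python.
import numpy as np

class Audio:
    """A waveform with its sampling rate."""
    def __init__(self, samples, sr):
        self.samples, self.sr = np.asarray(samples, float), sr

class Spectrum:
    """A magnitude spectrum."""
    def __init__(self, mag):
        self.mag = np.asarray(mag, float)

def spectrum(x):
    """Compute a magnitude spectrum, or pass an existing one through."""
    if isinstance(x, Spectrum):
        return x                      # already computed: reuse, don't recompute
    return Spectrum(np.abs(np.fft.rfft(x.samples)))

def centroid(x):
    """Spectral centroid; adapts to Audio or Spectrum input."""
    s = spectrum(x)                   # shared building block
    bins = np.arange(len(s.mag))
    return float(np.sum(bins * s.mag) / np.sum(s.mag))
```

Calling `centroid` on an `Audio` object triggers the full chain, while calling it on a `Spectrum` skips the FFT stage, which is the redundancy-avoidance behaviour described above.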
2. FEATURE EXTRACTION
2.1. Feature overview
Figure 1 shows an overview of the main features implemented in the toolbox. All the different processes start from the audio signal (on the left) and form a chain of operations proceeding to the right. The vertical disposition of the processes indicates an increasing order of complexity of the operations, from the simplest computations (top) to more detailed auditory modelling (bottom).
Each musical feature is related to one of the musical dimensions traditionally defined in music theory. Boldface characters highlight features related to pitch, to tonality (chromagram, key strength and key Self-Organising Map, or SOM) and to dynamics (Root Mean Square, or RMS, energy). Bold italics indicate features related to rhythm, namely tempo, pulse clarity and fluctuation. Simple italics highlight a large set of features that can be associated with timbre. Among them, all the operators in grey italics can in fact be applied to many other representations: for instance, statistical moments such as centroid, kurtosis, etc., can be applied not only to spectra and envelopes, but also to histograms based on any given feature.
One of the simplest features, zero-crossing rate, is based on a simple description of the audio waveform itself: it counts the number of sign changes of the waveform. Signal energy is computed using root mean square, or RMS [5]. The envelope of the audio signal offers timbral characteristics of isolated sonic events. The FFT-based spectrum can be computed in the frequency domain or along Mel-bands, with a linear or decibel energy scale, and applying various windowing methods. The results can be multiplied with diverse resonance curves in order to highlight different aspects such as metrical pulsation (when computing the FFT of envelopes) or fluctuation [6].
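The two simplest descriptors mentioned above can be sketched in a few lines of NumPy (the function names `zcr` and `rms` are ours, not MIRtoolbox's):

```python
# Illustrative implementations of zero-crossing rate and RMS energy.
import numpy as np

def zcr(x):
    """Zero-crossing rate: fraction of adjacent sample pairs with a sign change."""
    x = np.asarray(x, float)
    return float(np.mean(np.signbit(x[:-1]) != np.signbit(x[1:])))

def rms(x):
    """Root-mean-square energy of the signal."""
    x = np.asarray(x, float)
    return float(np.sqrt(np.mean(x ** 2)))
```

A noisy signal yields a high `zcr`, a sustained pitched tone a much lower one, which is why the zero-crossing rate is often used as a crude noisiness indicator.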
Many features can be derived from the FFT:
• Basic statistics of the spectrum give some timbral characteristics (such as spectral centroid, roll-off [5], brightness, flatness, etc.).
• The temporal derivative of the spectrum gives the spectral flux.
• An estimation of roughness, or sensory dissonance, can be assessed by adding up the beating provoked by each couple of energy peaks in the spectrum [7].
• A conversion of the spectrum to a Mel scale can lead to the computation of Mel-Frequency Cepstral Coefficients (MFCC) (cf. example 2.2), and of fluctuation [6].
• Tonality can also be estimated (cf. example 2.3).
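Three of these FFT-derived descriptors can be sketched using their standard textbook definitions (a hedged illustration; parameter names and the 0.85 roll-off threshold are common conventions, not necessarily MIRtoolbox's defaults):

```python
# Textbook definitions of spectral centroid, roll-off, and flux.
import numpy as np

def spectral_centroid(mag, freqs):
    """Amplitude-weighted mean frequency of a magnitude spectrum."""
    return float(np.sum(freqs * mag) / np.sum(mag))

def spectral_rolloff(mag, freqs, fraction=0.85):
    """Frequency below which `fraction` of the cumulative magnitude lies."""
    cum = np.cumsum(mag)
    idx = np.searchsorted(cum, fraction * cum[-1])
    return float(freqs[idx])

def spectral_flux(mag_prev, mag_curr):
    """Euclidean distance between two successive magnitude spectra."""
    return float(np.linalg.norm(mag_curr - mag_prev))
```

Flux is computed frame to frame, so a sequence of spectra yields a flux curve whose peaks indicate abrupt spectral change.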
The computation of the autocorrelation can use diverse normalization strategies, and integrates the improvement proposed by Boersma [8] in order to compensate for the side-effects due to the windowing. Resonance curves are also available here. Autocorrelation can be generalized through a compression of the spectral representation [9]. The estimation of pitch is usually based on the spectrum, the autocorrelation, or the cepstrum, or a mixture of these strategies [10].
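The autocorrelation route to pitch can be sketched as follows. This is a minimal, unnormalized version: Boersma's window compensation and the spectral compression mentioned above are deliberately omitted, and the frequency bounds are illustrative defaults of ours.

```python
# Minimal autocorrelation-based F0 estimation (no window compensation).
import numpy as np

def autocorr_pitch(x, sr, fmin=80.0, fmax=800.0):
    """Return the F0 (Hz) whose lag maximizes the autocorrelation."""
    x = np.asarray(x, float) - np.mean(x)
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0 .. N-1
    lo, hi = int(sr / fmax), int(sr / fmin)            # admissible lag range
    lag = lo + int(np.argmax(ac[lo:hi + 1]))           # strongest periodicity
    return sr / lag
```

Restricting the search to an admissible lag range is what prevents the trivial maximum at lag 0 (and sub-octave peaks) from dominating.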
A distinct approach consists of designing a complete chain of processes based on the modelling of the auditory perception of sound and music [2] (circled in Figure 1). This approach can be used in particular for the computation of rhythmic pulsation (cf. example 2.4).

2.2. Example: Timbre analysis
One common way of describing timbre is based on MFCCs [11, 2]. Figure 2 shows the diagram of operations. First, the audio sequence is loaded (1) and decomposed into successive frames (2), which are then converted into the spectral domain, using the mirspectrum function (3). The spectra are converted from the frequency domain to the Mel-scale domain: the frequencies are rearranged into 40 frequency bands called Mel-bands (4). The envelope of the Mel-scale spectrum is described with the MFCCs, which are obtained by applying the Discrete Cosine Transform to the Mel-scale spectrum. Usually only a restricted number of
them (for instance the first 13) are selected (5).
a = miraudio('audiofile.wav')   (1)
f = mirframe(a)                 (2)
s = mirspectrum(f)              (3)
m = mirspectrum(s, 'Mel')       (4)
c = mirmfcc(m, 'Rank', 1:13)    (5)

The computation can be carried out in a window sliding through the audio signal (cf. code line (2)), resulting in a series of MFCC vectors, one for each successive frame, that can be represented column-wise in a matrix. Figure 2 shows an example of such a matrix. The MFCCs do not convey a very intuitive meaning per se, but are generally applied to distance computation between frames, and therefore to segmentation tasks (cf. paragraph 2.5).
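For readers outside Matlab, the same chain can be sketched in NumPy following the standard textbook MFCC recipe (40 triangular mel filters, 13 coefficients). This is a rough analogue, not a reimplementation of mirmfcc, and the helper names are ours:

```python
# Textbook MFCC recipe for a single frame: FFT -> mel filterbank -> log -> DCT.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, float) / 2595.0) - 1.0)

def dct2(x):
    """Plain DCT-II, enough for illustration."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (n + 0.5) / N))
                     for k in range(N)])

def mfcc(frame, sr, n_bands=40, n_coef=13):
    mag = np.abs(np.fft.rfft(frame))                 # spectrum of one frame
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_bands + 2))
    fb = np.zeros((n_bands, len(freqs)))
    for i in range(n_bands):                         # triangular mel filters
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                   (hi - freqs) / (hi - c)), 0.0, None)
    mel_spec = np.log(fb @ mag + 1e-10)              # log mel-scale spectrum
    return dct2(mel_spec)[:n_coef]                   # keep the first 13
```

Applying `mfcc` to every frame of a sliding-window decomposition yields the column-wise matrix of MFCC vectors described above.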
The whole process can be executed in a single line by calling the mirmfcc function directly with the frame decomposition as argument:
mirmfcc(f, 'Rank', 1:13)        (6)

2.3. Example: Rhythm analysis
One common way of estimating the rhythmic pulsation, described in figure 6, is based on auditory modelling [5]. The audio signal is first decomposed into auditory channels using a bank of filters. Diverse types of filterbanks are proposed and the number of channels can be changed, for instance to 20 (8). The envelope of each channel is extracted (9). As pulsation is generally related to increases of energy only, the envelopes are differentiated and half-wave rectified, before being finally summed together again (10). This gives a precise description of the variation of energy produced by each note event in the different auditory channels.
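The differentiate-and-rectify step can be sketched as follows. For brevity a single full-band RMS envelope stands in for the 20-channel filterbank described above; the function name and hop size are illustrative choices of ours:

```python
# Onset-detection sketch: envelope -> differentiation -> half-wave rectification.
import numpy as np

def onset_curve(x, sr, hop=256):
    """Half-wave-rectified frame-to-frame envelope difference."""
    x = np.asarray(x, float)
    n = len(x) // hop
    env = np.array([np.sqrt(np.mean(x[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n)])              # crude RMS envelope
    diff = np.diff(env)                              # energy variation
    return np.maximum(diff, 0.0)                     # keep increases only
```

In the full model, one such curve is computed per auditory channel and the rectified curves are summed, so that onsets reinforcing each other across channels produce the strongest peaks.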
After this onset detection, the periodicity is estimated through autocorrelation (12). However, if the tempo varies throughout the piece, an autocorrelation of the whole sequence will not show clear periodicities. In such cases it is better to compute the autocorrelation on a frame decomposition (11). This yields a periodogram that highlights the different periodicities, as shown in figure 6. In order to focus on the periodicities that are more perceptible, the