periodogram is filtered using a resonance curve [16] (12), after which the best tempos are estimated through peak picking (13), and the results are converted into beats per minute (14). Because it is difficult to choose among the possible multiples of the tempo, several candidates (three, for instance) may be selected for each frame, and a histogram of all the candidates over all the frames, called a periodicity histogram, can be drawn (15).
fb=mirfilterbank(a,20) (8)
e=mirenvelope(fb,'Diff','Halfwave') (9)
s=mirsum(e) (10)
fr=mirframe(s,3,1) (11)
ac=mirautocor(fr,'Resonance') (12)
p=mirpeaks(ac,'Total',1,'NoEnd') (13)
t=mirtempo(p) (14)
h=mirhisto(t) (15)
The whole process can be executed in a single line by calling the mirtempo function directly with the audio input as argument:
mirtempo(a,'Frame') (16)
In this case, the different options available throughout the process can be specified directly as arguments of the tempo function. For instance, a frame-based tempo estimation that selects the 3 best tempo candidates in each frame, restricts the admissible tempi to between 60 and 120 beats per minute, and uses an estimation strategy based on a mixture of spectrum and autocorrelation applied to the spectral flux is executed with the syntax:
mirtempo(a,'Frame','Total',3,'Min',60,'Max',120,'Spectrum','Autocor','SpectralFlux') (17)
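The core of the tempo estimation described above (autocorrelation of an onset envelope, peak picking, conversion of the best lag into beats per minute) can be illustrated with a minimal sketch in base Matlab. This is not the MIRtoolbox implementation: the envelope is a synthetic pulse train, the resonance-curve weighting is omitted, and the names sr, env, minBpm and maxBpm are assumptions made for the illustration.

```matlab
% Sketch only: base Matlab, synthetic data, no MIRtoolbox calls.
sr = 100;                         % assumed sampling rate of the envelope (Hz)
period = 50;                      % one pulse every 0.5 s, i.e. 120 beats per minute
env = zeros(1, 10 * sr);          % ten seconds of onset envelope
env(1:period:end) = 1;            % idealized onsets

% Restrict the search to an admissible tempo range (assumed values).
minBpm = 70;
maxBpm = 200;
minLag = floor(60 / maxBpm * sr); % shortest admissible beat period, in samples
maxLag = ceil(60 / minBpm * sr);  % longest admissible beat period, in samples
lags = minLag:maxLag;

% Unnormalized autocorrelation evaluated at the admissible lags only.
ac = zeros(size(lags));
for k = 1:numel(lags)
    L = lags(k);
    ac(k) = sum(env(1:end-L) .* env(1+L:end));
end

% Peak picking reduced to its simplest form: keep the strongest lag.
[~, best] = max(ac);
bpm = 60 * sr / lags(best);       % convert the lag into beats per minute
```

Restricting the lag range plays the role of the 'Min' and 'Max' options: with a wider range, integer multiples of the beat period would produce peaks of comparable height, which is the ambiguity that motivates keeping several candidates per frame.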
2.4. Segmentation
More elaborate tools have also been implemented that carry out higher-level analyses and transformations. In particular, audio files can be automatically segmented into a series of homogeneous sections through the estimation of temporal discontinuities along diverse alternative features, such as timbre in particular [17]. First, the audio signal is decomposed into frames (18) and one chosen feature, such as MFCC (19), is computed along these frames. The feature-based distances between all possible frame pairs are stored in a similarity matrix (20). Convolution along the main diagonal of the similarity matrix using a Gaussian checkerboard kernel yields a novelty curve that indicates the temporal locations of significant textural changes (21). Peak detection applied to the novelty curve returns the temporal positions of feature discontinuities (22), which can be used for the actual segmentation of the audio sequence (23).⁵
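The similarity-matrix and novelty steps described above can be sketched in base Matlab as follows. This is an illustration under stated assumptions, not the code of mirsimatrix or mirnovelty: the features are synthetic, the similarity is taken to be the exponential of the negative Euclidean distance, and the kernel half-width h and its Gaussian spread are arbitrary choices.

```matlab
% Sketch only: base Matlab, synthetic features, no MIRtoolbox calls.
nFrames = 40;
feat = [zeros(2, 20), ones(2, 20)];   % 2-D feature with one abrupt change

% Similarity between all frame pairs (assumed measure: exp(-distance)).
sim = zeros(nFrames);
for i = 1:nFrames
    for j = 1:nFrames
        sim(i, j) = exp(-norm(feat(:, i) - feat(:, j)));
    end
end

% Gaussian checkerboard kernel: positive on the two within-segment
% quadrants, negative on the two cross-segment quadrants.
h = 4;                                % assumed half-width, in frames
[x, y] = meshgrid([-h:-1, 1:h]);
kern = sign(x) .* sign(y) .* exp(-(x.^2 + y.^2) / (2 * (h / 2)^2));

% Slide the kernel along the main diagonal. nov(t) measures the contrast
% between frames t-h..t-1 and frames t..t+h-1.
nov = zeros(1, nFrames);
for t = h+1 : nFrames-h+1
    idx = t-h : t+h-1;
    nov(t) = sum(sum(kern .* sim(idx, idx)));
end

[~, tMax] = max(nov);                 % frame at which the new section starts
```

In a homogeneous region the positive and negative quadrants cancel, so the curve stays near zero; it rises only where the frames before and after position t differ from each other while being similar among themselves.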
fr=mirframe(a) (18)
fe=mirmfcc(fr) (19)
sm=mirsimatrix(fe) (20)
nv=mirnovelty(sm) (21)
ps=mirpeaks(nv) (22)
sg=mirsegment(a,ps) (23)
The whole segmentation process can be executed in a single line by calling the mirsegment function directly with the audio input as argument:
mirsegment(a,'Novelty') (25)
By default, the novelty curve is based on MFCC, but other features can be selected using an additional option:
mirsegment(a,'Novelty','Spectrum') (26)
A second similarity matrix can be computed in order to show the distance, according to the same feature as the one used for the segmentation, between all possible segment pairs (28).⁶
fesg=mirmfcc(sg) (27)
smsg=mirsimatrix(fesg) (28)
2.5. Data analysis
The toolbox includes diverse tools for data analysis, such as a peak extractor and functions that compute histograms, entropy, zero-crossing rates, irregularity, or various statistical moments (centroid, spread, skewness, kurtosis, flatness) on data of various types, such as spectrum, envelope or histogram.
The mirpeaks function accepts any data returned by any other function of the MIRtoolbox and adapts to different kinds of data of any number of dimensions. In the graphical representation of the results, the peaks are automatically located on the corresponding curves (for 1D data) or bit-map images (for 2D data).
The mirpeaks function offers several alternative heuristics. It is possible to define a global threshold that peaks must exceed in order to be selected. We have designed a new strategy of peak selection, based on a notion of contrast, which discards peaks that are not sufficiently contrastive (based on a certain threshold) with respect to the neighbouring peaks. This adaptive filtering strategy hence adapts to the local particularities of the curves. Its articulation with other, more conventional thresholding strategies leads to an efficient peak-picking module that can be applied throughout the MIRtoolbox.
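The combination of a global threshold and a contrast criterion can be illustrated with the following base Matlab sketch. It is one possible reading of the contrast idea, not the mirpeaks algorithm itself: here a local maximum is kept only if it exceeds the global threshold thr and rises by more than cthr above the lowest value separating it from each neighbouring local maximum (or from the curve boundary). The data and both threshold values are hypothetical.

```matlab
% Sketch only: base Matlab, hypothetical curve and thresholds.
x = [0 1 0.9 1.2 0 0.3 0 2 0];
thr = 0.5;                         % global threshold (assumed value)
cthr = 0.2;                        % contrast threshold (assumed value)
n = numel(x);

% Candidate peaks: strict local maxima, end points excluded.
isMax = [false, (x(2:end-1) > x(1:end-2)) & (x(2:end-1) > x(3:end)), false];
cand = find(isMax);

% Neighbouring candidates, padded with the curve boundaries.
bounds = [1, cand, n];

keep = false(size(cand));
for k = 1:numel(cand)
    p = cand(k);
    leftMin = min(x(bounds(k):p));        % lowest value back to the previous candidate
    rightMin = min(x(p:bounds(k+2)));     % lowest value up to the next candidate
    keep(k) = x(p) > thr ...
        && (x(p) - leftMin) > cthr ...
        && (x(p) - rightMin) > cthr;
end
peaks = cand(keep);                % indices of the selected peaks
```

On this curve the small bump at index 2 is rejected because it is barely separated from the higher maximum at index 4, and the maximum at index 6 is rejected by the global threshold, so only indices 4 and 8 remain.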
Supervised classification of musical samples can also be performed, using techniques such as K-Nearest Neighbours or Gaussian Mixture Models. One possible application is the classification of audio recordings into musical genres.
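As an illustration of the K-Nearest Neighbours principle mentioned above, the following base Matlab sketch classifies one sample from a tiny labelled set. The feature values, the labels, and the choice k = 3 are hypothetical, and the code does not use or reproduce the toolbox's own classification functions.

```matlab
% Sketch only: base Matlab, hypothetical two-feature training set.
train = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];   % one row per sample, one column per feature
labels = [1 1 1 2 2 2];                   % class of each training sample
query = [4.5 5.2];                        % sample to classify
k = 3;                                    % number of neighbours (assumed value)

% Euclidean distance from the query to every training sample.
d = sqrt(sum((train - repmat(query, size(train, 1), 1)).^2, 2));

% Majority vote among the k closest training samples.
[~, order] = sort(d);
pred = mode(labels(order(1:k)));
```

In a genre classification setting, each row would hold features extracted from one recording and each label would identify a genre.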
3. DESIGN OF THE TOOLBOX
3.1. Data encapsulation
All the data returned by the functions in the toolbox are encapsulated into typed objects. The default display method associated with all these objects is a graphical display of the corresponding curves. In this way, when the display of the values of a given analysis is requested, what is printed is not a listing of long vectors or matrices, but rather a correctly formatted graphical representation. The actual data matrices associated with those data can be obtained by calling a method called mirgetdata, which constructs the simplest possible data structure associated with the data (cf. paragraph 4.1).
3.2. Frame analysis
Frame-based analyses (i.e., based on the use of a sliding window) can be specified using two alternative methods. The first method is based on the use of the mirframe function, which decomposes an audio signal into successive frames. Optional arguments can specify the frame size (in seconds, by default) and the hop factor (between 0 and 1, by default). For instance, in the following code (line 29), the frames have a size of 50 milliseconds and are half overlapped. The results of that function can then be sent directly as input to any other function of the toolbox (30):
f=mirframe(a,.05,.5) (29)
mirtempo(f) (30)
Yet this first method does not work correctly when dealing, for instance, with tempo estimation as described in section 2.4. Following this first method, as shown in figure 7, the frame decomposition is the first step performed in the chain of processes. As a result, the input of the filterbank decomposition is a series of short frames, which induces two main difficulties. Firstly, in order to avoid undesirable transitory states at the beginning of each filtered frame, the initial state of each filter would need to be tuned depending on the state of the filter at one particular instant of the previous frame (depending on the overlapping factor). Secondly, the multiplication of the redundancies of the frame decomposition (if the frames are overlapped) throughout the multiple channels of the filterbank would require a considerable amount of memory. The technical difficulties and waste of memory induced by this first method can be immediately overcome if the frame decomposition is performed after the filterbank decomposition and recomposition, as shown in figure 8.
This second method, more successful in this context, cannot be expressed using the previous syntax, as the input of the mirtempo function should not be frame-decomposed yet. The alternative syntax consists in proposing the frame decomposition as a possible option ('Frame') of the mirtempo function (31). This corresponds to what was presented in section 2.4 (code lines 16 and 17).
mirtempo(a,'Frame',.05,.5) (31)
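The meaning of the two frame parameters used in (29) and (31), a frame size of 0.05 seconds and a hop factor of 0.5, can be made concrete with a short base Matlab sketch that cuts a signal into a matrix with one frame per column. It is an illustration only, not the code of mirframe: the sampling rate and the placeholder signal are assumptions, and samples that do not fill a complete final frame are simply dropped here.

```matlab
% Sketch only: base Matlab, placeholder signal, no MIRtoolbox calls.
sr = 1000;                             % assumed sampling rate (Hz)
a = (1:sr)';                           % one second of placeholder samples
frameSec = 0.05;                       % frame size in seconds
hopFactor = 0.5;                       % hop as a fraction of the frame size

frameLen = round(frameSec * sr);       % frame size in samples
hop = round(hopFactor * frameLen);     % distance between frame starts, in samples
nFrames = floor((numel(a) - frameLen) / hop) + 1;

% One frame per column, so that successive columns overlap by half.
frames = zeros(frameLen, nFrames);
for k = 1:nFrames
    start = (k - 1) * hop + 1;
    frames(:, k) = a(start : start + frameLen - 1);
end

% A vector-based computation extends to all frames by operating column-wise.
rmsPerFrame = sqrt(mean(frames.^2, 1));
```

With these values each frame holds 50 samples and starts 25 samples after the previous one, and the last line shows how a per-signal measure becomes a per-frame measure without any explicit loop.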
The frame decomposition option is available as a possible argument to most of the functions of the toolbox. Each function can then specify the exact position of the frame decomposition within its chain of operations. Besides, if not specified, the default parameters of the frame decomposition (i.e., frame size and hop factor) can be adapted to each specific function. Hence, from a user's point of view, the execution and chaining of the different operators of the MIRtoolbox follow the same syntax, whether or not there is a frame decomposition, apart from the additional use of either the command mirframe or the option 'Frame' for frame decomposition. Of course, from a developer's point of view, this requires that each feature extraction algorithm adapt to frame-decomposed input. More precisely, as will be explained in section 4.1, input can be either a single vector or a matrix whose columns represent the successive frames. Conveniently enough, in the Matlab environment, the generalization of vector-based algorithms to matrix-based versions is generally effortless.
3.3. Adaptive syntax
As explained previously, the diverse functions of the toolbox can accept alternative input:
• The name of a particular audio file (in either wav or au format) can be directly specified as input:
mirspectrum('myfile') (32)