Measuring “pressedness” of a timbre by Mel freq cepstrum difference

We have been searching for some way of extracting perceptually significant timbral qualities. One such quality could be the “pressedness” or “tension” of a sound. One could think of this similar to the amount of effort or energy put in by a performer of the sound. Some sort of musical intensity of intention. The term is a bit vague, but we assume it could be musically useful to be able extract such a timbral pressedness.

When playing around with the display of the first few bands (4-5 band) of the Mel frequency cepstrum coefficients (MFCC), I noticed that certain sounds would make certain unique distributions (shapes one could say) between these bands. Lets look at for example the first few MFCCs of a relaxed vocal “a”, and compare it to the same image of a pressed “a”.

Relaxed (left) and pressed (right) “a”

Then similarly for a relaxed and a pressed vocal “i”:

Relaxed (left) and pressed (right) “i”

As we can see the relaxed sounds are relativelly flat in the first few MFCC bands (except for the first one), but the more pressed sounds are more peaky. Even though the peaks will fall in different locations on different pressed sounds (and also fall differently is we change the MFCC analysis parameters), there is a clear indication that spiky shapes follow more pressed sounds. A more scientific way of explaining it would be that the formants are more pronounced, and thus creating peaks in the MF cepstrum. Now, lets look at a noisy “shhh” sound:

Shh sound

Here we also notice a certain peakyness, but more notably, the first MFCC coefficient is very low. With the first MFCC indicating the first harmonic of the MF cepstrum, it correlates well with the general balance between low and high frequencies (higher values when we have more energy in the lows). Since the “shh” sound is quite flat, this will give a low MFCC1 value. Still, disregarding the actual placement of the peaks, and generalizing the “pressedness” of the sound, we could state that a noisy shh is a more pressed timbre than a tonal “a” for example.

So, as a simple way of getting a measure of the peakyness we can just simple sum the absolute difference of the first few MFCC bands and use this as our measure. This is what is shown in the images above as “mfccdiff”. For clarity, here’s the formula:

Now, this has so far only been tested on vocal input. I am confident that the differences will not be so clear on other instrumental signals. But it still seems a reasonable feature extraction method to include in further experimentation.

Documentation as debugging

This may come as no surprise to some readers, but I thought it Documentation_pencil appropriate to note anyway. During the blog writing about the rhythmic analysis – part 1 , I noticed I would tighten up the definition of terms, and also the actual implementation of the algorithm significantly. I would start writing something, seeing what I had just written and thinking “this just does not make sense” or “this implementation must be off, or just plainly wrong”.  Then, to be able to write sensibly about the ideas, I went back and tidied up my own mess. In the process, making a much more reliable method for analysing the features I wanted to extract. What was surprising was not that this happened but the degree to which it happened.  In the context of artistic research and the reflection embedded in that process, similar events may occur.