We were a combination of physically present and online contributors to the seminar. Joshua Reiss and Victor Lazzarini participated via online connection; present together in Trondheim were Maja S.K. Ratkje, Siv Øyunn Kjenstad, Andreas Bergsand, Trond Engum, Sigurd Saue and Øyvind Brandtsegg.
Instrumental extension, monitoring, bleed
We started the seminar by hearing from the musicians how it felt to perform during Wednesday’s session. Generally, Siv and Maja expressed that the analysis and modulation felt like an extension of their original instrument. Issues were raised about the listening conditions, and how it can be difficult to balance the treatments with the acoustic sound. One additional issue in this respect when working cross-adaptively (as compared to e.g. the live processing setting) is that there is no musician controlling the processing, so the processed sound is perhaps a little bit more “wild”. In a live processing setting, the musician controlling the processing will attempt to ensure a musically shaped phrasing and control that is at the current stage not present in the crossadaptive situation. Maja also reported acoustic bleed from the headphones into the feature extraction for her sound. With this kind of sensitivity to crosstalk, the need for source separation (as discussed earlier) is again brought to our attention. Adjusting the signal gating (the noise floor threshold for the analyzer) is not sufficient in many situations, and raising the threshold also lowers the analyzer’s sensitivity to soft playing. Some analysis methods are more vulnerable than others, but it is safe to say that none of them are really robust against noise or bleed from external signals.
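To make the gating trade-off concrete, here is a minimal sketch (not our actual analyzer code; the threshold and signal levels are invented for illustration) of an RMS-based noise gate in front of the feature extraction. A soft phrase and headphone bleed can occupy similar levels, so a threshold high enough to reject the bleed will also mute soft playing:

```python
import math

def rms_db(block):
    """RMS level of a sample block in dBFS."""
    rms = math.sqrt(sum(x * x for x in block) / len(block))
    return 20 * math.log10(max(rms, 1e-12))

def gate_open(block, threshold_db=-40.0):
    """Only pass blocks above the noise-floor threshold to the analyzer."""
    return rms_db(block) > threshold_db

# Hypothetical levels: a soft phrase around -46 dBFS,
# headphone bleed around -48 dBFS -- only 2 dB apart.
soft_phrase = [0.005] * 64
bleed       = [0.004] * 64
```

With a threshold of -50 dB both signals open the gate (the bleed pollutes the analysis); raising it to -40 dB rejects the bleed, but the soft phrase is now silenced too.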
Interaction scenarios as “assignments” for the performers
We discussed the different types of mapping (of features to modulators), which the musicians also called “assignments”, as each was experienced as a given task: to perform utilizing certain expressive dimensions in specific ways to control the timbre of the other instrument. This is of course true. Maja expressed that she was most intrigued by the mappings that felt “illogical”, and that illogical was good. By illogical, she means mappings that do not feel natural, or do not follow intuitive musical energy flows; things that break up the regular musical attention, and break up the regular flow from idea to expression. One example mentioned was the use of pitch to control reverberation size. For Maja (for many, perhaps for most musicians), pitch is such a prominent parameter in the shaping of a musical statement that it is hard to play when some external process interferes with the intentional use of pitch. The linking of pitch to some timbral control is such an interference, because it creates a potential conflict between the musically intended pitch trajectory and the musically intended timbral control trajectory. An interaction scenario (or in effect, a mapping from features to modulators to effects) can in some respects be viewed as a composition, in that it sets a direction for the performance. It is in many ways similar to the text scores of the experimental music of the 60s, where the actual actions or events unfolding are perhaps not described, but rather an idea of how the musicians may approach the piece. For some, this may be termed a composition; others might use the term score. In any case it dictates or suggests some aspects of what the performers might do, and as such constitutes an external influence on the performance.
Analysis, feature extraction
Some of our analysis methods are still a bit flaky, i.e. we see spurious outliers in their output that are not necessarily caused by perceptible changes in the signal being analyzed. One example of this is the rhythm consonance feature, where we try to extract a measure of rhythmic complexity by measuring neighboring delta times between events and looking for simple ratios between these. The general idea is that the simpler the ratio, the simpler the rhythm. The errors sneak in because of the tolerance for human deviation in rhythmic performance: one may clearly identify an intended pattern, while the actual measured delta times deviate by more than a musical time division (for example, when playing a jazzy “swing 8ths” triplet pattern, which may be performed somewhere between equal 8th notes, a triplet pattern, or even towards a dotted 8th plus a 16th, and in special cases a double dotted 8th plus a 32nd). When looking for simple integer relationships, small deviations in phrasing may lead to large changes in the algorithm’s output: for example 1:1 for a straight repeating 8th pattern, 2:1 for a triplet pattern, and 3:1 for a dotted 8th plus 16th pattern, cf. the jazz swing 8ths. See also this earlier post for more details if needed. As an extreme example, think of a whole note at a slow tempo (1:1) versus an accent on the last 16th note of a measure (giving a 15:16 ratio). These deviations create single values with a high error. Some common ways of dampening the effect of such errors would be lowpass filtering, exponential moving average, or median filtering. One problem in the case of rhythmic analysis is that we only get new values for each new event, so the “sampling rate” is variable and also very low. This means that any filtering makes the feature extraction respond very slowly (and also with a latency that varies with the number of events played), something that we would ideally try to avoid.
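As a rough sketch of why this quantization is so sensitive (this is an illustration, not our actual implementation), consider quantizing the ratio of two neighboring delta times to the nearest fraction with a small denominator, and scoring complexity from the size of that fraction:

```python
from fractions import Fraction

def nearest_simple_ratio(delta_a, delta_b, max_den=4):
    """Quantize the ratio of two neighboring inter-onset intervals
    to the nearest fraction with a small denominator."""
    return Fraction(delta_a / delta_b).limit_denominator(max_den)

def complexity(ratio):
    """Simpler ratios (small numerator + denominator) -> lower complexity."""
    return ratio.numerator + ratio.denominator

# Straight 8ths: equal deltas give 1:1.
# A swing 8ths pair performed as 0.33 + 0.17 s quantizes to 2:1 (triplet feel),
# but the same musical intention played as 0.30 + 0.20 s quantizes to 3:2 --
# a small change in phrasing flips the output to a different complexity.
```

Here the complexity jumps from 3 (for 2:1) to 5 (for 3:2) even though the performer arguably played “the same” swing pattern both times.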
Stabilizing extraction methods
In light of the above, we discussed possible methods for stabilizing the feature extraction to avoid the spurious large errors. One conceptual difficulty is differentiating between a mistake and an intended expressive deviation, and more importantly, differentiating between different possible intended phrasings. How do we know what the intended phrasing is without embedding too many assumptions in our analysis algorithm? For rhythm, it seems we could do much better if we first deduced the pulse and the meter, but we need to determine what our (performed) phrasings are to be sure of the pulse and meter, so it turns into a chicken-and-egg problem. Some pulse and meter detection algorithms maintain several potential alternatives, giving each alternative a score for how well it fits the observed data. This is a good approach, assuming we want to find the pulse and meter. Much of the music we will deal with in this project does not have a strong regular pulse, and it most surely does not have a stable meter. Let’s put the nitty-gritty details of this aside for a moment, and just assume that we need some kind of stabilizing mechanism. As Josh put it, a restricted form of the algorithm.
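A minimal sketch of the multiple-candidates idea (a made-up scoring function, not any specific published algorithm): keep several candidate pulse periods alive, and score each by how close the observed inter-onset intervals fall to integer multiples of that period.

```python
def score_pulse(candidate_period, onsets, tol=0.05):
    """Score a candidate pulse period against a list of onset times.

    Each inter-onset interval contributes more the closer it lies to an
    integer multiple of the period; dividing by the multiple favors the
    slowest period that still explains the data (so a period of 0.25 s
    does not win over 0.5 s just by halving everything).
    """
    score = 0.0
    for a, b in zip(onsets, onsets[1:]):
        delta = b - a
        multiple = round(delta / candidate_period)
        if multiple > 0:
            error = abs(delta - multiple * candidate_period) / candidate_period
            score += max(0.0, 1.0 - error / tol) / multiple
    return score

def best_pulse(onsets, candidates=(0.25, 0.375, 0.5, 0.75)):
    """Maintain several candidate periods; return the best-scoring one."""
    return max(candidates, key=lambda p: score_pulse(p, onsets))
```

For a steady quarter-note pattern this picks the expected period, but for the pulseless material we mostly work with, all candidates would score low and the estimate would be unreliable, which is exactly the reservation above.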
Disregarding the actual technical difficulties, let’s say we want some method of learning what the musician might be up to: what is intended, what is a mistake, and what are expressive deviations from the norm. Some sort of calibration to the normal, average, or regularly occurring behavior. We could then track change during the session, relative to the norm established in the training. Now, assuming we could actually build such an algorithm, when should it be calibrated (put into learn mode)? Should we continuously update the norm, or just train once and for all? If training before the performance (as some sort of sound check), we might fail miserably because the musician might do wildly different things in the “real” musical situation compared to when “just testing”. On the other hand, if we continuously update the norm, then our measures are always drifting, so something that was measured as “soft” at the beginning of the performance might be indicated as something else entirely by the end of the performance. Even though we listen to musical form as relative change, we might also as listeners recognize when “the same” happens again later, e.g. activity in the same pitch register, the same kind of rhythmic density, the same kind of staccato abrupt complex rhythms etc. With a continuously updated norm, we might classify the same as something different. Regarding the attempt to define something as “the same”, see also the earlier post on philosophical implications. Still, with all these reservations, it seems necessary to attempt creating methods for measuring relative change. This can perhaps be used as a restricted form, as Joshua suggests, or in any case as an interesting variation on the extracted features we already have. It would extend the feature output formats of absolute value, dynamic range, and crest. In some ways it is related to dynamic range (f.ex. the relative change would to some degree be high when the dynamic range is high, but the relative change would have a more slowly moving reference, and it would also be able to go negative). As a reference for the relative change, we could use a long-time average, a model of expectation, assumption of the current estimate (maintaining several candidates, as with pulse and meter induction), or a normal distribution and standard deviation. Such long-term estimates have been used with success in A-DAFx (adaptive audio effects) for live applications.
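The simplest variant of such a relative-change feature, sketched here with a slowly moving exponential-average reference (the class name and the alpha value are arbitrary choices for illustration):

```python
class RelativeChange:
    """Track a feature relative to a slowly moving long-term reference.

    The reference is an exponential moving average of the feature; the
    output is the deviation from that reference. Unlike dynamic range,
    this can go negative (playing softer than the established norm).
    A small alpha gives a slow-moving reference, i.e. a long memory.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.reference = None

    def update(self, value):
        if self.reference is None:
            self.reference = value      # first value defines the initial norm
        relative = value - self.reference
        self.reference += self.alpha * (value - self.reference)
        return relative
```

Note that this sketch also exhibits the drift problem discussed above: if the performer stays loud for a long time, the reference creeps upward and the same loudness gradually reads as “normal”.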
Josh mentioned the possibility of doing A/B comparison of the extraction methods with some tests at QMUL. We’ll discuss this further.
Display of extracted features
When learning (an instrument), multimodal feedback can significantly reduce the time taken to gain proficiency. When learning how the different feature extraction methods work, and how they react to intentional expressive changes in the signal, visual feedback might be useful. Another means of feedback is sonifying the analysis (which is what we do already, but perhaps we could make it more pointed and simple). This could be especially useful when working with the gate mix method, as the gate gives no indication to the performer that it will soon open, whereas a sonification of the signal could aid the performer in learning the range of the feature and getting an intuitive feel for when the gate will open. Yet another means of feedback is giving the performer the ability to adjust the scaling of the feature-to-modulator mapping. In this way, it would act somewhat like giving the performer the ability to tune the instrument, ensuring that it reacts dynamically to the ranges of expression utilized by that performer. Though not strictly a feedback technique, we could still treat it as a kind of familiarization aid, in that it acts as a two-way process between performer and instrument. The visualization and the performer-adjustable controls could be implemented as a small GUI component running on a portable device like a cellphone or tablet. The signals can be communicated from the analyzer and MIDIator via OSC, and the selection of features to display can be controlled by the assignments of features in the MIDIator. A minimal set of controls can be exposed, with the effects of these mirrored in the plugins (MIDIator). Some musicians may prefer not to have such visual feedback; Siv voiced concern that she would not be so interested in focusing visually on the signals. This is a very valid concern for performance. Let us assume it will not be used during actual performance, but as a tool for familiarization during early stages of practice with the instrument.
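A sketch of what the performer-adjustable scaling could look like (the function name and parameters are hypothetical, not the actual MIDIator interface): the performer tunes the input range to match the dynamics they actually use, and the feature is mapped into a normalized modulator value.

```python
def feature_to_modulator(value, in_low, in_high, shape=1.0):
    """Map an extracted feature value onto a 0..1 modulator value.

    in_low/in_high are the performer-adjustable range: tuning them to
    the range a player actually uses makes the mapping respond fully
    to that player's dynamics. shape != 1 applies a power curve for
    finer control at one end of the range.
    """
    if in_high <= in_low:
        return 0.0
    x = (value - in_low) / (in_high - in_low)
    x = min(1.0, max(0.0, x))   # clip out-of-range input
    return x ** shape
```

Such a function could run in the GUI component and have its `in_low`/`in_high` settings mirrored in the plugin via OSC, so the tuning travels with the performer rather than being fixed in the mapping.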