Oslo, First Session, October 18, 2016

First Oslo Session. Documentation of process

Gyrid Kaldestad, vocal
Bernt Isak Wærstad, guitar
Bjørnar Habbestad, flute

Observer and Video
Mats Claesson

The session took place in one of the sound studios at the Norwegian Academy of Music, Oslo, Norway.

Gyrid Kaldestad (vocal) and Bernt Isak Wærstad (guitar) had one technical/setup meeting beforehand, and numerous emails about technical issues went back and forth before the session.
Bjørnar Habbestad (flute) was invited into the session.

The observer decided to make a video documentation of the session.
I’m glad I did, because I think it gives a good insight into the process. And a process it was!
The whole session lasted almost 8 hours, and it was not until the very last 30 minutes that playing started.

I (Mats Claesson) am not going to comment on the performative musical side of the session. The only reason for this is that the music making happened at the very end of the session, was very short, and was not recorded in a way that would let me evaluate it “in depth”. However, just watch the comments from the participants at the end of the video. They are very positive.
I think that from the musicians’ side it was rewarding and highly interesting. I am confident that the next session will generate a musical outcome that is substantial enough to be commented on, from both a performative and a musical side.

In the video there is no processed sound from the very last playing because headphones were used, but you can listen to excerpts posted below the video.

Here is a link to the video

Reflections on the process given from the perspective of the musicians:

We agreed to make a limited setup to have better control over the processing, starting with basic sounds and basic processing tools so that we could more easily control the system in a musical way. We started with a tuning analysis for each instrument (voice, flute, guitar).

Instead of choosing analysis parameters up front, we analysed different playing techniques, e.g. non-tonal sounds (sss, shhh), multiphonics etc., and saw how the analyser responded. We also recorded short samples of the different techniques that each of us usually plays, so that we could investigate the analysis several times.

These are the analysis results we got:


Since we’re all musicians experienced with live processing, we made a setup based on effects that we already know well and use in our live-electronic setup (reverb, filter, compression, ring modulation and distortion).

To set up meaningful mappings, we chose an approach that we called “spectral ducking”, where a certain musical feature on one instrument would reduce the same musical feature on the other. For example, a sustained tonal sound produced by the vocalist would reduce harmonic musical features of the flute by applying ring modulation. Here is a complete list of the mappings used:


Excerpt #1 – Vocal and flute

Excerpt #2 – Vocal and flute

Excerpt #3 – Vocal and flute

Excerpt #4 – Vocal and flute

Because of a lack of concise and precise analysis results from the guitar, in combination with the time limitation, it wasn’t possible to set up mappings for the guitar and flute. We did, however, test the guitar and flute in the last minutes of the session, where the guitar simply took the role of the vocal in terms of processing and mapping. Knowledge of the vocal analysis and mapping made it possible to perform with the same setup even though the input instrument had changed. Some short excerpts from this performance can be heard below.

Excerpt #5 – Guitar and flute

Excerpt #6 – Guitar and flute

Excerpt #7 – Guitar and flute

 Reflections and comments:

  • We experienced the importance of exploring new tools like this on a known system. Since none of us knew Reaper beforehand, we spent quite a lot of time learning a new system (both while preparing and during the session).
  • Could the meters analyser be turned the other way around? It is a bit difficult to read sideways.
  • It would be nice to be able to record and export control data from the analyser tool, which would make it possible to use the data later in a synthesis.
  • Could it be an idea to have more analyzer sources per channel? The Keith McMillen SoftStep mapping software could possibly be something to look at for inspiration?
  • The output is surprisingly musical – maybe this is a result of all the discussions and reflections we did before we did the setup and before we played?
  • The outcome is something other than playing with live electronics: it is immediate and you can actually focus on the listening, which is very liberating from a live electronics point of view!
  • The system is merging the different sounds in a very elegant way.
  • Knowing that you have an influence on your fellow musicians’ output forces you to think in new ways when working with live electronics.
  • Our experience is that this is similar to working acoustically.


Seminar 21. October

We were a combination of physically present and online contributors to the seminar. Joshua Reiss and Victor Lazzarini participated via online connection; present together in Trondheim were Maja S.K. Ratkje, Siv Øyunn Kjenstad, Andreas Bergsand, Trond Engum, Sigurd Saue and Øyvind Brandtsegg.

Instrumental extension, monitoring, bleed

We started the seminar by hearing from the musicians how it felt to perform during Wednesday’s session. Generally, Siv and Maja expressed that the analysis and modulation felt like an extension to their original instrument. There were issues raised about the listening conditions, and how it can be difficult to balance the treatments with the acoustic sound. One additional issue in this respect when working cross-adaptively (as compared to e.g. the live processing setting) is that there is not a musician controlling the processing, so the processed sound is perhaps a little bit more “wild”. In a live processing setting, the musician controlling the processing will attempt to ensure a musically shaped phrasing and control that is at the current stage not present in the crossadaptive situation. Maja also reported acoustic bleed from the headphones to the feature extraction for her sound. With this kind of sensitivity to crosstalk, the need for source separation (as discussed earlier) is again brought to our attention. Adjusting the signal gating (noise floor threshold for the analyzer) is not sufficient in many situations, and raising the threshold also lowers the analyzer sensitivity to soft playing. Some analysis methods are more vulnerable than others, but it is safe to say that none of the analysis methods are really robust against noise or bleed from external signals.

Interaction scenarios as “assignments” for the performers

We discussed the different types of mapping (of features to modulators), which the musicians also called “assignments”, as it was experienced as a given task to perform utilizing certain expressive dimensions in specific ways to control the timbre of the other instrument. This is of course true. Maja expressed that she was most intrigued by the mappings that felt “illogical”, and that illogical was good. By illogical, she means mappings that do not feel natural or like intuitive musical energy flows: things that break up the regular musical attention, and break up the regular flow from idea to expression. One example mentioned was the use of pitch to control reverberation size. For Maja (for many, perhaps for most musicians), pitch is such a prominent parameter in the shaping of a musical statement that it is hard to play when some external process interferes with the intentional use of pitch. The linking of pitch to some timbral control is such an interference, because it creates potential conflict between the musically intended pitch trajectory and the musically intended timbral control trajectory. An interaction scenario (or in effect, a mapping from features to modulators to effects) can in some respects be viewed as a composition, in that it sets a direction for the performance. In many ways it is similar to the text scores of the experimental music of the 60’s, where the actual actions or events unfolding are perhaps not described, but rather an idea of how the musicians may approach the piece. For some, this may be termed a composition; others might use the term score. In any case it dictates or suggests some aspects of what the performers might do, and as such constitutes an external implication on the performance.

Analysis, feature extraction

Some of our analysis methods are still a bit flaky, i.e. we see spurious outliers in their output that are not necessarily caused by perceptible changes in the signal being analyzed. One example of this is the rhythm consonance feature, where we try to extract a measure of rhythmic complexity by measuring neighboring delta times between events and looking for simple ratios between these. The general idea is that the simpler the ratio, the simpler the rhythm. The errors sneak in as part of the tolerance for human deviation in rhythmic performance, where one may clearly identify one intended pattern, while the actual measured delta times can deviate by more than a musical time division (for example, when playing a jazzy “swing 8ths” triplet pattern, which may be performed somewhere between equal 8th notes, a triplet pattern, or even towards a dotted 8th plus a 16th and in special cases a double dotted 8th plus a 32nd). When looking for simple integer relationships, small deviations in phrasing may lead to large changes in the algorithm’s output: for example 1:1 for a straight repeating 8th pattern, 2:1 for a triplet pattern and 3:1 for a dotted 8th plus 16th pattern, as in the jazz swing 8ths. See also this earlier post for more details if needed. As an extreme example, think of a whole note at a slow tempo (1:1) versus an accent on the last 16th note of a measure (giving a 15:16 ratio). These deviations create single values with a high error. Some common ways of dampening the effect of such errors would be lowpass filtering, exponential moving average, or median filtering. One problem in the case of rhythmic analysis is that we only get new values for each new event, so the “sampling rate” is variable and also very low. This means that any filtering has the consequence of making the feature extraction respond very slowly (and also with a latency that varies with the number of events played), something that we would ideally try to avoid.
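To make the sensitivity described above concrete, here is a minimal sketch of the ratio search and of median filtering over events. It is not the Analyzer's actual code: the function names, the `max_denominator` limit and the filter size are assumptions chosen for illustration.

```python
from fractions import Fraction
from statistics import median

def nearest_simple_ratio(delta_a, delta_b, max_denominator=4):
    """Approximate delta_a / delta_b by a fraction with a small denominator.
    max_denominator is a hypothetical tuning parameter."""
    return Fraction(delta_a / delta_b).limit_denominator(max_denominator)

def neighbour_ratios(onset_times, max_denominator=4):
    """Simple-ratio estimate for each pair of neighbouring delta times."""
    deltas = [b - a for a, b in zip(onset_times, onset_times[1:])]
    return [nearest_simple_ratio(d1, d2, max_denominator)
            for d1, d2 in zip(deltas, deltas[1:])]

def median_filter(values, size=3):
    """Median of the last `size` values, one output per input event.
    The output only updates when a new event arrives, hence the slow response."""
    return [median(values[max(0, i - size + 1):i + 1])
            for i in range(len(values))]

# A swung pattern close to triplets, then a slightly "straighter" swing.
triplet_like = neighbour_ratios([0.0, 0.66, 1.0, 1.66, 2.0])
straighter = neighbour_ratios([0.0, 0.6, 1.0, 1.6, 2.0])
```

With these inputs the first long/short pair is read as 2:1 and the second as 3:2, even though the onsets differ by only 60 ms, which is the kind of jump in output the text refers to.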

Stabilizing extraction methods

In the light of the above, we discussed possible methods for stabilizing the feature extraction methods to avoid the spurious large errors. One conceptual difficulty is differentiating between a mistake and an intended expressive deviation. More importantly, we need to differentiate between different possible intended phrasings. How do we know what the intended phrasing is without embedding too many assumptions in our analysis algorithm? For rhythm, it seems we could do much better if we first deduce the pulse and the meter, but we need to determine what our (performed) phrasings are to be sure of the pulse and meter, so it turns into a chicken and egg problem. Some pulse and meter detection algorithms maintain several potential alternatives, giving each alternative a score for how well it fits in light of the observed data. This is a good approach, assuming we want to find the pulse and meter. Much of the music we will deal with in this project does not have a strong regular pulse, and it most surely does not have a stable meter. Let’s put the nitty gritty details of this aside for a moment, and just assume that we need some kind of stabilizing mechanism. As Josh put it, a restricted form of the algorithm.
Disregarding the actual technical difficulties, let’s say we want some method of learning what the musician might be up to: what is intended, what is a mistake, and what are expressive deviations from the norm. This would be some sort of calibration to the normal, average, or regularly occurring behavior. We would then track change during the session, where change is measured relative to the norm as established in the training. Now, assuming we could actually build such an algorithm, when should it be calibrated (put into learn mode)? Should we continuously update the norm, or just train once and for all? If training before performance (as some sort of sound check), we might fail miserably because the musician might do wildly different things in the “real” musical situation compared to when “just testing”. Also, if we continuously update the norm, then our measures are always drifting, so something that was measured as “soft” in the beginning of the performance might be indicated as something else entirely by the end of the performance. Even though we listen to musical form as a relative change, we might also as listeners recognize when “the same” happens again later, e.g. activity in the same pitch register, the same kind of rhythmic density, the same kind of staccato abrupt complex rhythms etc. With a continuously updated norm, we might classify the same as something different. Regarding the attempt to define something as the same, see also the earlier post on philosophical implications. Still, with all these reservations, it seems necessary to attempt creating methods for relative change. This can perhaps be used as a restricted form, as Joshua suggests, or in any case as an interesting variation on the extracted features we already have. It would extend the feature output formats of absolute value, dynamic range, crest. In some ways it is related to dynamic range (for example, the relative change would to some degree be high when the dynamic range is high, but then again the relative change would have a more slowly moving reference, and it would also be able to go negative). As a reference for the relative change, we could use a long time average, a model of expectation, assumption of the current estimate (maintaining several candidates as with pulse and meter induction), or normal distribution and standard deviation. These long term estimates have been used with success in A-DAFx (adaptive audio effects) for live applications.
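One of the references mentioned above, a long time average combined with a standard deviation, could be sketched roughly as follows. This is only an illustration of the continuously updated variant: the class name, the `alpha` update rate and the choice of an exponentially weighted mean and variance are assumptions, not something the project has implemented.

```python
import math

class RelativeChange:
    """Reports a feature value relative to a slowly moving norm.

    The norm is an exponentially weighted mean and variance; alpha is a
    hypothetical update rate (smaller alpha gives a slower moving reference).
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, value):
        if self.mean is None:
            # The first value defines the norm, so there is no change yet.
            self.mean = value
            return 0.0
        deviation = value - self.mean
        std = math.sqrt(self.var)
        # Signed output: positive above the norm, negative below it.
        relative = deviation / std if std > 0 else 0.0
        # Update the norm after measuring against it.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return relative
```

Because the norm keeps updating, the drift problem discussed above applies directly: the same input value will report a different relative change late in a performance than it did early on. Freezing `mean` and `var` after a training period would give the "train once" alternative.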

Josh mentioned the possibility of doing A/B comparison of the extraction methods with some tests at QMUL. We’ll discuss this further.

Display of extracted features

When learning (an instrument), multimodal feedback can significantly reduce the time taken to gain proficiency. When learning how the different feature extraction methods work, and how they react to intentional expressive changes in the signal, visual feedback might be useful. Other means of feedback could be sonifying the analysis (which is what we do already, but perhaps made more pointed and simple). This could be especially useful when working with the gate mix method, as the gate will give no indication to the performer that it will soon open, whereas a sonification of the signal could aid the performer in learning the range of the feature and getting an intuitive feel for when the gate will open. Yet another means of feedback is giving the performer the ability to adjust the scaling of the feature-to-modulator mapping. In this way, it would act somewhat like giving the performer the ability to tune the instrument, ensuring that it reacts dynamically to the ranges of expression utilized by this performer. Though not strictly a feedback technique, we could still treat it as a kind of familiarization aid in that it acts as a two-way process between performer and instrument. The visualization and the performer-adjustable controls could be implemented as a small GUI component running on portable devices like a cellphone or touchpad. The signals can be communicated from the analyzer and MIDIator via OSC, and the selection of features to display can be controlled by the assignments of features in the MIDIator. A minimal set of controls can be exposed, with the effects of these being mirrored in the plugins (MIDIator). Some musicians may prefer not to have such visual feedback. Siv voiced concern that she would not be so interested in focusing visually on the signals. This is a very valid concern for performance. Let us assume it will not be used during actual performance, but as a tool for familiarization during early stages of practice with the instrument.

Brief system overview and evaluation

As preparation for upcoming discussions about technical needs in the project, it seems appropriate to briefly describe the current status of the software developed so far.

The Analyzer

The plugins

The two main plugins developed are the Analyzer and the MIDIator. The Analyzer extracts perceptual features from a live audio signal and transmits signals representing these features over a network protocol (OSC) to the MIDIator. The job of the MIDIator is to combine different analyzed features (scaling, shaping, mixing, gating) into a controller signal that we will ultimately use to control some effect parameter. The MIDIator can run on a different track in the same DAW, it can run on another DAW, or on another computer entirely.

Strong points

The feature extraction generally works reasonably well for the signals it has been tested on. Since a limited set of signals is readily available during implementation, some overfitting to these signals can be expected. Still, a large set of features is extracted, and these have been selected and tweaked for use as intentional musical controllers. This can sometimes differ from the more pure mathematical and analytical descriptions of a signal. The quality of our feature extraction can best be measured by how well a musician can utilize it to intentionally control the output. No quantitative measurement of that sort has been done so far. The MIDIator contains a selection of methods to shape and filter the signals, and to combine them in different ways. Until recently, the only way to combine signals (features) was by adding them together. As of the past two weeks, mix methods for absolute difference, gating, and sample/hold have been added.
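As a rough illustration of the four mix methods named above, the sketch below combines two normalized feature signals `a` and `b`. The exact semantics in the MIDIator are not described here, so the function names, the use of `b` as the gate or trigger signal, and the `threshold` parameter are all assumptions.

```python
def mix_add(a, b):
    """Sum of the two features (the original mix method)."""
    return a + b

def mix_abs_diff(a, b):
    """Absolute difference: large when the two features disagree."""
    return abs(a - b)

def mix_gate(a, b, threshold=0.5):
    """Pass feature a only while feature b is above the threshold (assumed semantics)."""
    return a if b > threshold else 0.0

class SampleHold:
    """Sample feature a while b is above the threshold, hold the value otherwise."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.held = 0.0

    def __call__(self, a, b):
        if b > self.threshold:
            self.held = a
        return self.held
```

The sample/hold variant needs state between calls, which is why it is written as a small class rather than a plain function.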

MIDIator modules

Weak points

The signal chain transmission from Analyzer to MIDIator, and then again from the MIDIator to the control signal destination each incurs at least one sample block latency. The size of a sample block can vary from system to system, but regardless of the size used our system will have 3 times this latency before an effect parameter value changes in response to a change in the audio input. For many types of parameter changes this is not critical, still it is a notable limitation of the system.
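The arithmetic behind this limitation is simple enough to write out. The sketch below assumes the three sample blocks mentioned above; the function name and the example block sizes and sample rates are only illustrations.

```python
def control_latency_ms(block_size, sample_rate, blocks=3):
    """Minimum latency in milliseconds for a given number of sample blocks."""
    return 1000.0 * blocks * block_size / sample_rate

# Example settings (hypothetical): 512 samples at 48 kHz, 64 samples at 44.1 kHz.
large_block = control_latency_ms(512, 48000)
small_block = control_latency_ms(64, 44100)
```

With a 512-sample block at 48 kHz this gives 32 ms, while a 64-sample block at 44.1 kHz gives a little over 4 ms, so the practical impact depends heavily on the audio settings in use.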

The signal transmission latency points at another general problem: interfacing between technologies. Each time we transfer signals from one paradigm to another we have the potential for degraded performance, less stability and/or added latency. In our system, the interface from the DAW to our plugins will incur a sample block of latency, and the interface between Csound and Python can sometimes incur performance penalties if large chunks of data need to be transmitted from one to the other. Likewise, the communication between the Analyzer and MIDIator is such an interface.

Some (many) of the feature extraction methods create somewhat noisy signals. With noise, we mean here that the analyzer output can intermittently deviate from the value we perceptually assume to be “correct”. We can also look at this deviation statistically, if we feed it relatively (perceptually) consistent signals and look at how stable the output of each feature extraction method is. Many of the features show activity generally in the right register, and a statistical average of the output corresponds with general perceptual features. While the average values are good, we will oftentimes see spurious values with relatively high deviation from the general trend. From this, we can assume that the feature extraction model generally works, but intermittently fails. Sometimes, filtering is used as an inherent part of the analysis method, and in all cases, the MIDIator has a moving exponential average filter with separate rise and fall times. Filtering can be used to cover up the problem, but better analysis methods would give us more precise and faster response from the system.
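A moving exponential average with separate rise and fall times, of the kind mentioned above, can be sketched like this. It is not the MIDIator's code: the function names, the one-pole coefficient formula and the `update_rate` parameter (control values per second) are assumptions.

```python
import math

def smoothing_coefficient(time_seconds, update_rate):
    """One-pole coefficient for a given time constant; 0 means no smoothing."""
    if time_seconds <= 0:
        return 0.0
    return math.exp(-1.0 / (time_seconds * update_rate))

def rise_fall_filter(values, rise_time, fall_time, update_rate):
    """Exponential average that uses rise_time when the input is above the
    current state and fall_time when it is below."""
    rise = smoothing_coefficient(rise_time, update_rate)
    fall = smoothing_coefficient(fall_time, update_rate)
    state = 0.0
    output = []
    for value in values:
        coefficient = rise if value > state else fall
        state = coefficient * state + (1.0 - coefficient) * value
        output.append(state)
    return output

# A single spurious spike in an otherwise steady feature signal.
noisy = [0.2] * 10 + [1.0] + [0.2] * 10
smoothed = rise_fall_filter(noisy, rise_time=0.5, fall_time=0.05, update_rate=100)
```

A long rise time suppresses the upward spike, but it also makes the output slow to follow genuine increases, which is the trade-off the paragraph above describes.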

Audio separation between instruments can sometimes be poor. In the studio, we can isolate each musician, but if we want them to be able to play together naturally in the same room, a significant bleed from one instrument to the other will occur. For live performance this situation is obviously even worse. The bleed gives rise to two kinds of problems: signal analysis is disturbed by the signal bleed, and signal processing is cluttered. For the analysis, it does not matter if we have perfect analysis methods if the signal to be analyzed is a messy combination of opposing perceptual dimensions. For the effect processing, controlling an effect parameter for one instrument leads to a change in the processing of the other instrument, just because the other instrument’s sound bleeds into the first instrument’s microphones.

Useful parameters (features extracted)

In many of the sessions up until now, the most used features have been amplitude (rms) and transient density. One reason for this is probably that they are conceptually easy to understand; another is that their output is relatively stable and predictable in relation to the perceptual quality of the sound analyzed. Here are some suggestions of other parameters that we expect can be utilized effectively in the current implementation:

  • envelope crest (env_crest): the peakiness of the amplitude envelope; for sustained sounds this will be low, for percussive onsets with silence between events it will be high
  • envelope dynamic range (env_dyn): goes low for signals operating at a stable dynamic level, high for signals with a high degree of dynamic variation.
  • pitch: well known
  • spectral crest (s_crest): goes low for tonal sounds, medium for pressed tones, high for noisy sounds.
  • spectral flux (s_flux): goes high for noisy sounds, low for tonal sounds
  • mfccdiff: measure of tension or pressedness, described here
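As an illustration of the envelope crest entry in the list above, the sketch below uses the traditional definition of crest (peak value divided by rms) on a short window of amplitude envelope values. The Analyzer's own scaling and windowing may differ; the function name and the example envelopes are assumptions.

```python
import math

def crest(envelope):
    """Peak divided by rms for a window of envelope values."""
    rms = math.sqrt(sum(v * v for v in envelope) / len(envelope))
    if rms == 0:
        return 0.0
    return max(abs(v) for v in envelope) / rms

sustained = crest([0.5] * 8)            # steady tone
percussive = crest([1.0] + [0.0] * 7)   # one onset followed by silence
```

The sustained envelope gives the minimum crest of 1.0, while the single onset followed by silence gives a clearly higher value, matching the low/high behaviour described for env_crest.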

There is also another group of extracted features that is potentially useful but still has some stability issues:

  • rhythmic consonance (rhythm_cons) and rhythmic irregularity (rhythm_irreg): described here
  • rhythm autocorr crest (ra_crest) and rhythm autocorr flux (ra_flux): described here

The rest of the extracted features can be considered more experimental. In some cases they might yield effective controllers, especially when combined with other features in reasonable proportions.

Rhythm analysis, part 2

As mentioned in the rhythm analysis part 1, one of our goals at this point has been to try to find methods of rhythmical analysis that work without assumptions about pulse and meter, and also as far as possible without assumptions about musical style. As our somewhat minimal rhythm definition, we look at rhythm as time ratio constellations, patterns of time ratios. Here we will make an assumption (yes, something must be assumed) regarding patterns of time durations: recurring or repeated patterns have another perceptual quality than constantly shifting combinations of time durations. We could also assume that recurring or repeated patterns have stronger perceptual influence, but technically, it does not matter so much. The main issue is that we can measure some difference in quality. Quality here does not imply that something is better, just that something is different.

FFT of modified amplitude envelope

To measure how much recurrence there is in a signal, we can use autocorrelation. Repeated patterns will show up as peaks in the autocorrelation, and the period of repetition will be shown as the position of the peaks. Longer repetition periods give peaks further away (commonly further to the right when graphing the autocorrelation). To calculate the autocorrelation, we could use the FFT of the amplitude envelope as our basis. However, the unmodified envelope can have many variations at frequencies not related to the actual rhythms. For example, the amplitude envelope of a signal with fast transients (e.g. percussion, piano) shows much more high frequency content than a signal with slow transients (e.g. many wind instruments). For this reason, we opted to use a modified envelope for the FFT in connection with the rhythm autocorrelation measure. The modified envelope is generated by using transient detection, triggering a short gaussian envelope scaled to the current amplitude of the signal. This way, we achieve a consistent envelope across different instruments, preserving the relative amplitude differences between transients (since we assume dynamics to be relevant to rhythm, for example in using accents to signify grouping of events).
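The idea of the modified envelope can be sketched as follows. Note two simplifications: transient detection is assumed to have happened already (the input is a list of frame positions and amplitudes), and the autocorrelation is computed as a direct sum rather than via the FFT, which gives the same kind of result for short signals but is slower. The names and the `width` parameter are assumptions.

```python
import math

def modified_envelope(transients, length, width=2.0):
    """Place a short gaussian at each (frame_index, amplitude) transient."""
    envelope = [0.0] * length
    reach = int(4 * width)  # truncate each gaussian at four standard deviations
    for position, amplitude in transients:
        for i in range(max(0, position - reach), min(length, position + reach + 1)):
            envelope[i] += amplitude * math.exp(-0.5 * ((i - position) / width) ** 2)
    return envelope

def autocorrelation(signal):
    """Direct-sum autocorrelation, normalized so that lag 0 equals 1."""
    n = len(signal)
    zero_lag = sum(v * v for v in signal)
    if zero_lag == 0:
        return [0.0] * n
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag)) / zero_lag
            for lag in range(n)]

# A steady pulse: one transient every 20 frames, all at the same amplitude.
pulse = modified_envelope([(p, 1.0) for p in (10, 30, 50, 70, 90)], length=100)
ac = autocorrelation(pulse)
```

For this steady pulse the autocorrelation has a strong peak at lag 20 (the repetition period) and almost nothing at lag 10, independent of how sharp or soft the original instrument's attacks were.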

Time spans and latency

One question to ask when analyzing for recurring patterns is “how long are the patterns we are looking for?”. One could say that a musical pattern sometimes will repeat relatively quickly, say once every second. Other times we can have arbitrarily long patterns (sometimes very long), but for practical purposes, let’s assume a maximum length of somewhere around 10 seconds. Now, if we want to analyze for such long patterns (10 seconds), an inherent limitation of any technique used is that it takes at least that many seconds to give an answer to whether there is a repeating pattern of that duration. Analyzing for 1-second patterns, we can have an indication after 1 second, and we can be sure after 2 seconds. For a musically responsive analysis, we’d like as low latency as possible, and in any practical case a latency of 10 seconds or more is not particularly responsive. Still, if the analysis window is shorter, we will not be able to detect longer patterns. One common method in FFT is to use overlapping windows, meaning that we update our analysis several times (with new data) within the time span defined by one analysis window. This will give us updated data more frequently, but still the longer patterns will only partially be influencing the output until a full period has passed. To alleviate this, we used 3 analysis durations running in parallel layers (each with overlapping windows as previously described). The longest time span is set to 10 seconds, layered onto this is a time span of 5 seconds, and layered on top of this again a 2.5 second time span. The layers are mixed down using a weighted sum that gives precedence to more recent data while retaining the larger time span context. The shortest time span layer will be updated twice as often as the medium time span layer, and this again is updated twice as often as the longest time span layer.
When we have a new frame of the longer time span, we will also have a new frame of the short time span; these are then weighted 2/3 and 1/3 respectively. At the next available frame for the shorter time span, we weigh the new frame 2/3, and the longer frame 1/3. This is combined similarly for all the 3 layers to form the final autocorrelation coefficients. These are shown as a graph in the GUI.
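For two layers, the weighting described above might look like the sketch below. It follows one reading of the text (the longer layer gets 2/3 when both frames are fresh, the shorter layer gets 2/3 when only it has been updated); how exactly the scheme is extended to all three layers is not specified here, so that part is left out. The function name is hypothetical.

```python
def mix_two_layers(long_frame, short_frame, long_is_fresh):
    """Weighted sum of two autocorrelation frames of equal length.

    long_is_fresh: True when the longer layer has just produced a new frame.
    """
    long_weight = 2.0 / 3.0 if long_is_fresh else 1.0 / 3.0
    short_weight = 1.0 - long_weight
    return [long_weight * l + short_weight * s
            for l, s in zip(long_frame, short_frame)]

both_fresh = mix_two_layers([3.0, 0.0], [0.0, 3.0], long_is_fresh=True)
only_short_fresh = mix_two_layers([3.0, 0.0], [0.0, 3.0], long_is_fresh=False)
```

In this reading the weights alternate each time the shorter layer updates, so recent material gradually takes over until the longer layer delivers its next frame.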

3-layer rhythm autocorrelation. The red line is the shortest FFT, light blue is the medium, green is the longest. The yellow line is the weighted sum. This figure shows a steady rhythm played statically over the full duration of the longest FFT layer (10 seconds).
3 layer rhythm autocorrelation, as above. Here we see the recently played faster rhythms, with the long time span showing the static slower pulse. The combined correlation (yellow) has both features.




3 layer rhythm autocorrelation, as above. Showing a more randomly spaced non-repeating rhythm.
As above. Recently played steady fast rhythms after a period of randomly spaced non-repeating activity.

Features under development

We have assumed that we could use the exact position of the peaks to detect the duration of repeated patterns. This is partly true, since fast rhythms (short repetition duration) give tightly spaced peaks, slow rhythms give sparsely spaced peaks. It seems so far, however, that the exact position and amplitude of each peak is not really stable enough to just use these values for that purpose. We use a peak picking algorithm to detect the highest peak, the second and third highest peak, and also the peak closest in time to the maximum peak and the first (earliest, leftmost) peak. In situations where the rhythm is quite static, these values correspond well to the repetition durations and the relative strength (in dynamics) of different repeating patterns. For most practical situations however, this specific application of the rhythmic autocorrelation does not really work that well (yet!). Another way of using these values could be to determine the degree to which the rhythmic events can be placed on a grid. We have termed this gridness. It is calculated by taking the maximum peak as the reference duration, then looking at the other detected peaks, and checking if we can find some relatively simple integer ratio between the max and each other peak. For example, if we have a max peak at 6 seconds, and then a second highest peak at 4 seconds, the ratio would be 3:2. In this case we construct a grid of 1/3 durations of the max peak across the whole time segment (1/3, 2/3, 3/3, 4/3, 5/3 … and so on as far as we can go), and see how many of all detected peaks correspond to grid locations. The process is repeated for each integer ratio found between the max peak and the lesser peaks. The grid with the maximum number of corresponding events wins, and is reported as the current rhythmic subdivision. The number of peaks that fall on this grid is divided by the number of detected events, resulting in the gridness value. 
If all detected events fall on grid locations, the gridness will be 1.0; if half of the events fall on the grid, the gridness will be 0.5. The gridness measure does not currently work very well, due to the instability of the peaks as described above. This is an area of the analyzer where substantial refinement can take place. The resulting metrics (if working), however, seem intuitively to make sense: measuring periods of repetition, subdivisions of these, and the degree of consistency with the assumed grid of subdivisions.
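The gridness calculation described above can be sketched as follows, working on a list of detected peak positions (in seconds) from the autocorrelation. This is an interpretation of the description rather than the Analyzer's code: the function names, `max_denominator` (how simple a ratio must be) and `tolerance` (how close to a grid line counts as a hit, in grid steps) are assumptions.

```python
from fractions import Fraction

def on_grid(position, step, tolerance):
    """True if position lies within tolerance (in grid steps) of a grid line."""
    index = round(position / step)
    return index >= 1 and abs(position / step - index) <= tolerance

def gridness(peak_positions, max_peak, max_denominator=4, tolerance=0.05):
    """Return (gridness, grid_step) for the best-fitting grid."""
    best_count, best_step = 0, None
    for peak in peak_positions:
        if peak <= 0 or peak == max_peak:
            continue
        # Simple integer ratio between the max peak and this peak, e.g. 6 s : 4 s = 3:2.
        ratio = Fraction(max_peak / peak).limit_denominator(max_denominator)
        if ratio.numerator == 0:
            continue
        # Grid of 1/numerator durations of the max peak (1/3 of 6 s = 2 s in the example).
        step = max_peak / ratio.numerator
        count = sum(on_grid(p, step, tolerance) for p in peak_positions)
        if count > best_count:
            best_count, best_step = count, step
    if best_step is None:
        return 0.0, None
    return best_count / len(peak_positions), best_step

all_on_grid = gridness([6.0, 4.0, 2.0, 8.0], max_peak=6.0)
one_off_grid = gridness([6.0, 4.0, 2.0, 3.1], max_peak=6.0)
```

With the 6 second and 4 second peaks from the example in the text, the winning grid has a 2 second step. One caveat of this sketch: allowing larger denominators produces finer grids that almost any peak will fall on, so the limit on ratio simplicity strongly affects the result.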

So what can we actually use it for now?

Following up on a suggestion from Miller Puckette, we tried to look at more general features of the correlation graph, to extract some global descriptors of the current rhythmic activity. We turned to well known statistics like crest and flux.
The crest factor describes how “peaky” the signal is, that is, the relation between the highest peak and the overall effective amplitude. Traditionally the crest would be calculated as the peak value divided by the rms (root-mean-square) amplitude. However, our use here is somewhat different from both the regular envelope crest and the spectral crest. For this specific purpose, we found it better to divide the rms value by the number of transients detected. This may at first seem counterintuitive, but can be explained by the fact that repeated patterns of the same duration create many events that fall on the same peak locations in the autocorrelation, effectively making those peaks stronger. The division by the number of peaks detected avoids the excessively high crest values we could get if there was a single peak in the correlation. As such it gives a measure of the amount of repetition. It could be discussed whether we should use another term for this feature, to avoid confusion between this and the other crest measures.
The flux generally describes the amount of change from frame to frame, so static patterns will have low flux while constantly shifting rhythms will have high flux. The flux is calculated similarly to how it would be for spectral flux (multiplying each value in a frame with the corresponding value in the previous frame, accumulating the result and then normalizing it). This is done with a slight twist in our implementation: because of the instability in the peaks’ location, we don’t just multiply each value with the corresponding value of another frame, but we check if any neighbouring values have higher amplitude, and then use the maximum (this value or its neighbour on either side). We can think of this as the minimum possible flux. Empirical testing has shown it to be a more stable measure than the simpler variation of the flux measure.
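The neighbour-tolerant flux can be sketched as below. The text does not spell out the normalization or how the accumulated product is turned into a flux value, so this sketch assumes a normalized product (similarity) that is subtracted from 1 and clipped at 0; the function names are also assumptions.

```python
import math

def neighbour_max(frame, index):
    """Largest of the value at index and its neighbour on either side."""
    return max(frame[max(0, index - 1):index + 2])

def tolerant_flux(current, previous):
    """Flux between two autocorrelation frames, tolerant of one-bin peak shifts."""
    norm = math.sqrt(sum(v * v for v in current) * sum(v * v for v in previous))
    if norm == 0:
        return 0.0
    similarity = sum(value * neighbour_max(previous, i)
                     for i, value in enumerate(current)) / norm
    # Taking the neighbour maximum can push similarity above 1, so clip at 0.
    return max(0.0, 1.0 - similarity)

same = tolerant_flux([0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0])
shifted_one_bin = tolerant_flux([0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 1])
shifted_two_bins = tolerant_flux([0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0])
```

Peaks that jitter by a single bin between frames are treated as unchanged, while a shift of two bins or more registers as full change, which is one way to read "the minimum possible flux".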

All results related to the rhythmic autocorrelation. The crest and flux are relatively reliable measures. The other ones are subject to refinement. Note that the AC peaks in the lower half of the figure currently refer only to the long time span (green) autocorrelation.

Audio effect: Liveconvolver3

The convolution audio effect is traditionally used to sample a room to create artificial reverb. Others have used it extensively for creative purposes, for example convolving guitars with angle grinders and trains. The technology normally requires recording a sound, then analyzing it, and finally loading the analyzed impulse response (IR) into an effect to use it. Liveconvolver3 lets you live sample the impulse response and start convolving even before the recording is finished.

In the context of the crossadaptive project, convolution can be a nice way of imprinting the characteristics of one audio source on another. The live sampling of the IR is necessary to facilitate using it in an improvised manner, reacting immediately to what is played here and now.

There are some aesthetic challenges, namely how to avoid everything turning into a (somewhat beautiful) mush. This is because in convolution every sample of one sound is multiplied with every sample of the other sound. If we sample a long melodic line as the IR, a mere click of the tongue on the other audio channel will fire the whole melodic segment once. Several clicks will create separate echoes of the melody, and a continuous sound will create literally thousands of echoes. What is nice is that only frequencies that the two signals have in common will come out of the process. So a light whisper will create a high frequency whispering melody (with the long IR described above), while a deep and resonant drone will just let those (spectral) parts of the IR through. Since the IR contains a recording not only of spectral content but also of its evolution over time, it can lend spectrotemporal morphing features from one sound to another. To reduce the mushiness of the processed sound, we can enhance the transients and reduce the sustained parts of the input sound. Even though this kind of (exaggerated) transient designer processing might sound artificial on its own, it can work well in the context of convolution. The current implementation, Liveconvolver3, does not include this kind of transient processing, but we have done this earlier so it will be easy to add.
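The "one click fires the whole melody" behaviour described above can be seen in a small direct (time domain) convolution. This Python sketch is only an illustration with made-up sample values; the actual effect uses partitioned convolution in Csound.

```python
def convolve(signal, ir):
    """Direct convolution: every input sample triggers a scaled copy of the IR."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, s in enumerate(signal):
        if s == 0.0:
            continue  # silent samples trigger nothing
        for k, h in enumerate(ir):
            out[n + k] += s * h
    return out

melody = [1.0, 0.5, 0.25]                      # stand-in for a recorded IR
one_click = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]     # a single tongue click
two_clicks = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]    # two clicks, four samples apart

print(convolve(one_click, melody))   # the melody appears once
print(convolve(two_clicks, melody))  # the melody appears twice, as an echo
```

A continuous input has a non-zero value on nearly every sample, so it triggers a copy of the IR on nearly every sample, which is where the mush comes from.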

There are also some technical challenges to using this technique in a live setting. These are related to amplitude control, and to the risk of feedback when playing on larger speaker systems. The feedback risk occurs because we are taking a spectral snapshot (the impulse response) of the room we are currently playing in (well, of an instrument in that room, but nevertheless, the room is there), then we process sound coming from (another source in) the same room. The output of the process will enhance those frequencies that the two sources have in common, hence the characteristics of the room (and the speaker system) will be amplified, which generally creates a risk of feedback. Once we have unwanted feedback with convolution, it will also generally take a while (a few seconds) to get rid of, since the nature of the process creates a reverb-like tail to every sound. To reduce the risk of feedback we use a very small frequency shift of the convolver output. This is not usually perceptible, but it disturbs the feedback chain sufficiently to significantly reduce the feedback potential.

The challenge of the overall amplitude control can be tackled by using the sum of all amplitudes in the IR as a normalization factor. This works reasonably well, and is how we do it in the liveconvolver. One obvious exception is the case where the IR and the input sound contain overlapping strong resonances (or single lone notes). Then we will get a lot of energy in those overlapping frequency regions, and very little else. We will work on algorithms to attempt normalization in these cases as well.
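A minimal Python sketch of normalizing by the sum of amplitudes, assuming "sum of all amplitudes" means the sum of absolute sample values in the IR. The function names and sample values are illustrative, not taken from the liveconvolver code.

```python
def normalize_ir(ir):
    """Scale the IR so that the absolute values of its samples sum to 1."""
    total = sum(abs(h) for h in ir)
    if total == 0:
        return list(ir)  # a silent IR is left unchanged
    return [h / total for h in ir]

def convolve(signal, ir):
    """Direct convolution, used here only to check the output level."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(ir):
            out[n + k] += s * h
    return out

ir = normalize_ir([2.0, -1.0, 1.0])
loud_input = [1.0, -1.0, 1.0, -1.0]       # full-scale input
output = convolve(loud_input, ir)
print(max(abs(v) for v in output))        # does not exceed the input peak
```

With this scaling the output peak cannot exceed the input peak, but it says nothing about how loud a typical signal will be, which is in line with the resonance exception noted above.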

The effect

Liveconvolver3 in an example setup in Reaper. Note the routing of the source signals to the two inputs of the effect (aux sends with pan).

The effect uses two separate audio inputs, one for the impulse response sampling, and one for the live input to be convolved. We have made it as a stereo effect, but do not expect it to convolve a stereo input. It also creates a mono output in the current implementation (the same signal on both stereo outputs). In the figure we see two input sources. Track 1 receives external audio, and routes it to an aux send to the liveconvolver track, panned left so that it will enter only input 1 of the effect. Track 2 receives external audio and similarly routes it to an aux send to the liveconvolver track, but panned right so the audio is only sent to input 2 of the effect.

The effect itself has controls for input level, highpass frequency (hpFreq), lowpass frequency (lpFreq) and output volume (convVolume). These controls basically do what the control name says. Then we have controls to set the start time (IR_start) of the impulse response (allowing us to skip a certain number of seconds into the recording), and the impulse response length (IR_length), determining how many seconds of the IR recording we want to use. There are also controls for fading the IR in and out. Without fading, we might experience clicks and pops in the output. The partition length sets the partition size for the partitioned convolution; higher settings require less CPU but also make the effect respond more slowly. Usually just leave this at the default 2048.

The big green button IR_record enables recording of an impulse response. The current max duration is 5.9 seconds at 44.1 kHz sampling rate. If the maximum duration is exceeded during recording, the recording simply stops and is treated as complete. The convolution process will keep running while recording, using parts of the newly recorded IR as they become available. The IR_release knob controls the amount of overlap between the new instances of convolution created during recording. When recording is done, we fall back to using just one instance again.

The switch_inputs button lets us (surprise!) switch the two inputs, so that input 1 will be the IR record and input 2 will be the convolver input. If you want to convolve a source with itself, you would first record an IR, then switch the inputs so that the same source would be convolved with its own (previously recorded) IR. Finally, to reduce the potential of audio feedback, the f_shift control can be adjusted. This shifts the entire output upwards by the amount selected. Usually around 1 Hz is sufficient. Extreme settings will create artificial sounding effects and cascading delays.
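To make the IR_start, IR_length and fade controls more concrete, here is a small Python sketch of how a section of a recording could be cut out and faded. The function name prepare_ir, the fade_in and fade_out parameters, and the linear fade shape are assumptions for illustration; the actual Csound code may do this differently.

```python
def prepare_ir(recording, sample_rate, ir_start, ir_length, fade_in, fade_out):
    """Cut a section out of a recording and fade its ends (times in seconds)."""
    start = int(ir_start * sample_rate)
    end = min(len(recording), start + int(ir_length * sample_rate))
    segment = list(recording[start:end])
    n = len(segment)
    n_in = min(n, int(fade_in * sample_rate))
    n_out = min(n, int(fade_out * sample_rate))
    for i in range(n_in):
        segment[i] *= i / n_in            # linear ramp up from silence
    for i in range(n_out):
        segment[n - 1 - i] *= i / n_out   # linear ramp down to silence
    return segment

# Toy example: 10 "seconds" of constant signal at a sample rate of 10.
recording = [1.0] * 100
ir = prepare_ir(recording, sample_rate=10, ir_start=2.0, ir_length=4.0,
                fade_in=0.5, fade_out=0.5)
print(len(ir), ir[:6], ir[-6:])
```

Because the segment starts and ends at zero amplitude, it avoids the abrupt edges that would otherwise be heard as clicks and pops.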


The effect is written in the audio programming language Csound, and compiled into a VST plugin using a tool called Cabbage. The actual program code is just a small text file (a csd) that you can download here.

You will need to download Cabbage (the bleeding edge version can be found here), then open the csd file in Cabbage and export it as a plugin effect. Put the exported plugin somewhere in your VST path so that your favourite DAW can find it. Then you’re all set.

Export as plugin effect in Cabbage


Routing in other hosts

As a short update, it occurred to me that some users might find it complicated to translate the Reaper routing setup to other hosts. I know a lot of people are using Ableton Live, so here’s a screenshot of how to route for the liveconvolver in Live:

Example setup with the liveconvolver in Live

Note that

  • The aux sends are “post” (otherwise the sound would not go through the pan pot, and we need that).
  • Because the sends are post, the volume fader has to be up. We will probably not want to hear the direct unprocessed sound, so the “Audio To” selector on the channels is set to “Sends only”
  • Both input channels send to the same effect
  • The two input channels are panned hard left (ch 1) and hard right (ch 2)
  • The monitor selector for the channels is set to “in”, activating the input regardless of arm/recording

With all that set up, you can hit “IR_record” and record an IR (of the sound you have on channel 1). The convolver effect will be applied to the sound on channel 2.