development blog

project details

As the subject of Android audio application development is well represented on the web, this page will not go in-depth into the ins and outs of the technology used, but rather explain why certain choices were made and what the architecture of the audio engine looks like.

Note that MikroWave is a constantly evolving project and many features are as such still undocumented or hidden in present releases. This page will therefore not contain many code examples, but for those curious to peek under the hood, the engine's source code is publicly available on GitHub.

Project languages : Java, C++

application interface / initial audio engine

As the Google Android SDK provides an elegant API for developing in the high level Java language, the interface and model of the application were developed entirely in Java. The initial audio engine was written in Java as well, relying on the AudioTrack class to provide a resource for audio output.

However, while this gave satisfying results when used with the sequencer, the latency ( perceived delay between an audio-triggering action and hearing the actual output ) was too high when using the on-screen keyboard or when adjusting audio properties during playback, rendering it pretty much useless as it lacked an instantaneous response. Latency was measured at around 250 ms, while anything below 70 ms should feel "acceptable" - ideally below 5 ms of course, but let's remain realistic for a moment and realise we're talking about utilising consumer appliances like mobile phones and tablets here...

going native / OpenSL

This led to porting the audio engine to C++ to run natively on the Android device ( i.e. outside of the Dalvik virtual machine, bypassing the AudioTrack API and avoiding the hit of garbage collection ).

Using OpenSL, all audio buffers are written directly to the audio hardware, greatly reducing latency on newer Android versions ( on older Androids the performance will be at least on par with AudioTrack, while certain hardware configurations can still gain a lot from this native workaround... and read on, there is plenty of benefit to be had on older devices / operating systems as well ! ).
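
Stripped down to its essence ( and omitting the engine / output mix / player setup for brevity ), the OpenSL approach revolves around a buffer queue callback : each time a queued buffer has been consumed, the next one is rendered and enqueued. The sketch below merely illustrates that shape; renderAudio is a made-up stand-in for the engine's actual render cycle :

// a stripped-down OpenSL ES sketch : the buffer queue callback renders and
// enqueues the next buffer whenever a previously queued one has been consumed
#include <SLES/OpenSLES.h>
#include <SLES/OpenSLES_Android.h>

static short _outputBuffer[ 512 * 2 ]; // interleaved stereo, 512 frames per buffer

// assumed to exist elsewhere in the engine : fills the buffer with rendered audio
extern void renderAudio( short* buffer, int amountOfSamples );

void bufferQueueCallback( SLAndroidSimpleBufferQueueItf queue, void* context )
{
    renderAudio( _outputBuffer, 512 * 2 );

    // hand the buffer to the audio hardware, size is given in bytes
    (*queue)->Enqueue( queue, _outputBuffer, sizeof( _outputBuffer ));
}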

Using SWIG, all audio modules are still available on the Java side ( i.e. for use within the user interface and the application model ), but run natively. This has the added benefit that C++ code remains self-contained and requires no reference to Java, with the exception of a single JNI interface class that acts as a mediator with a single Java Object for processing state updates.
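
To give an idea of what that mediator boils down to ( the class and method names below are made up for this example and do not reflect the actual MWEngine code ), the native layer only needs a handle to a single Java class to report its state changes :

// a minimal JNI "mediator" sketch : the native engine reports state updates to a
// single ( hypothetical ) Java class nl.example.MWEngineObserver which declares
// a static method : public static void handleNotification( int notificationId )
#include <jni.h>

static JavaVM*   _vm            = nullptr;
static jclass    _observerClass = nullptr;
static jmethodID _notifyMethod  = nullptr;

// invoked once, e.g. from JNI_OnLoad, to cache the VM and the observer class
void registerObserver( JavaVM* vm, JNIEnv* env )
{
    _vm             = vm;
    jclass localRef = env->FindClass( "nl/example/MWEngineObserver" );
    _observerClass  = ( jclass ) env->NewGlobalRef( localRef );
    _notifyMethod   = env->GetStaticMethodID( _observerClass, "handleNotification", "(I)V" );
}

// invoked by the native engine whenever its state changes
void notifyJava( int notificationId )
{
    JNIEnv* env;

    // the render thread is a native thread and must be attached to the VM first
    if ( _vm->AttachCurrentThread( &env, nullptr ) == JNI_OK )
        env->CallStaticVoidMethod( _observerClass, _notifyMethod, notificationId );
}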

so OpenSL meant instant low latency?

Not really... this is a topic on which Google provides little information... and one that has most developers stumped. There are several considerations to take into account before reaching low latency. Some of these are:

  • render audio in a non-locking thread
    using a circular / ring buffer avoids locking the render thread and prevents other threads from being scheduled at a higher priority. This allows for buffer queueing and a continuous read / write cycle ( a minimal sketch follows after this list ).

    While Android is essentially a Linux platform, the FIFO scheduling priority is unavailable as it might interfere with keeping battery consumption to a minimum!
  • the right sample rate
    certain Android devices have a native sample rate of 48 kHz, while you were perhaps synthesizing your audio at 44.1 kHz. This means that audio output is routed through the system resampler for upsampling to 48 kHz, and this added step in the output path results in an increase in overall latency!
  • the right buffer size
    duh. As you might know, larger values improve stability while lower values reduce latency. However, it's not just a matter of choosing the right size for the minimum amount of latency and maximum stability : the device's native buffer size should also be an exact multiple of the value you choose.

    For instance : some devices may report a recommended buffer size of 4800 samples, which, apart from being quite large, is also an unusual number. By choosing a value that divides evenly into 4800 we can however reach a low, stable buffer size of 75 samples per buffer. Other devices may report a recommended buffer size of 512 samples; keeping the above rule in mind, the lowest usable buffer size could then be 64 samples ( a small helper applying this rule follows below ). Using a value outside of this range ( for instance using 75 samples per buffer on a device that requires 64 samples per buffer ) may cause glitches, as occasionally the buffer callback is called twice per timeslice. This can go by unnoticed if you have CPU cycles to spare, but it's more likely that this will slowly but surely accumulate into a clusterfuck.
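
To sketch the first point from the list : a basic single-producer / single-consumer ring buffer passes samples between threads without locks. The version below is simplified to per-sample reads and writes for brevity ( a real engine would move whole buffers at once, and this is not the actual MikroWave implementation ) :

// a minimal single-producer / single-consumer ring buffer : atomic indices
// instead of locks, so the render thread is never blocked
#include <atomic>
#include <vector>

class RingBuffer
{
    public:
        explicit RingBuffer( int capacity ) :
            _buffer( capacity ), _head( 0 ), _tail( 0 ) {}

        // called from the producer thread ( e.g. the synthesis routine )
        bool enqueue( float sample )
        {
            int head = _head.load( std::memory_order_relaxed );
            int next = ( head + 1 ) % ( int ) _buffer.size();

            if ( next == _tail.load( std::memory_order_acquire ))
                return false; // buffer is full, do not overwrite unread samples

            _buffer[ head ] = sample;
            _head.store( next, std::memory_order_release );
            return true;
        }

        // called from the consumer thread ( e.g. the output callback )
        bool dequeue( float& sample )
        {
            int tail = _tail.load( std::memory_order_relaxed );

            if ( tail == _head.load( std::memory_order_acquire ))
                return false; // buffer is empty

            sample = _buffer[ tail ];
            _tail.store(( tail + 1 ) % ( int ) _buffer.size(), std::memory_order_release );
            return true;
        }

    private:
        std::vector<float> _buffer;
        std::atomic<int>   _head, _tail;
};

Because only the producer advances _head and only the consumer advances _tail, no mutex is required and the render thread never has to wait.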

You can query the device's native sample rate and buffer size using the AudioManager class. There are more optimizations available to ensure a smooth performance and these will be discussed in the sections below. However, keeping the above in mind might lead to a lot of "AHA"-moments during development.
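
To apply the buffer size rule described above, a small helper could round a requested buffer size down to the nearest value that divides evenly into the device's reported buffer size ( a sketch, assuming the reported value has been passed down from the Java side ) :

// rounds a requested buffer size down to the nearest value that divides evenly
// into the device's reported frames-per-buffer value
int roundToValidBufferSize( int requestedSize, int nativeFramesPerBuffer )
{
    for ( int candidate = requestedSize; candidate > 1; --candidate )
    {
        if ( nativeFramesPerBuffer % candidate == 0 )
            return candidate;
    }
    return nativeFramesPerBuffer; // no divisor found, fall back to the native size
}

Using the earlier examples : requesting 75 samples on a device reporting 4800 frames per buffer keeps the 75 ( 4800 / 75 = 64 ), while requesting 75 on a device reporting 512 rounds down to 64.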

synthesizing audio programmatically... for dummies

When Googling how to synthesize audio by applying the math you never bothered to pay attention to in high school, one of the basic examples is bound to look like this snippet 'ere:

#include <cmath>

const float PI          = atan( 1 ) * 4;
const float TWO_PI      = PI * 2;
const int   SAMPLE_RATE = 44100;

WaveForm::WaveForm()
{
    // phase and increment are float members of the WaveForm class
    phase     = 0.0f;
    increment = TWO_PI / ( SAMPLE_RATE / 440.0f ); // 440 being the pitch in Hz
}

void WaveForm::render( int bufferSize )
{
    float sampleBuffer[ bufferSize ];

    for ( int i = 0; i < bufferSize; ++i )
    {
        sampleBuffer[ i ] = sin( phase );
        phase += increment;

        // a full sine cycle spans 2 PI radians : wrap around to repeat the waveform
        if ( phase > TWO_PI )
            phase -= TWO_PI;
    }
    someGloballyDefinedFunctionForActualOutput( sampleBuffer ); // make audible
}

...which will leave you feeling awesome when you first hear the sinewave sounding the "A"-note above middle C at 440 Hz. Until you start to twitch at the fact that it's going on forever. And in mono. And it's not a very spectacular kind of sound. And you might want to turn off the example now. AND: the tutorial probably supplied a bare-bones function for the actual writing of the buffer into the audio hardware with no hint of how to manage it efficiently with multiple objects, wait ? Multiple objects ? We need some kind of memory management too, come to think of it, and and...

Back up for a minute. The non-spectacular sound issue shouldn't present too many problems. There are plenty of informative documents available that describe the characteristics of a particular sound. You can look them up on Wikipedia and steal the math for convenience. Though it's nice to challenge yourself and come up with your own interpretations for getting THAT one unique sound. Hell, it's probably why you're doing this anyways.

But we're getting ahead of things here : going from the minimum amount of instructions necessary to generate sound to creating an audio engine that allows you to synthesize and sequence audio with multiple channels and effects, AND doing it efficiently, is where documentation gets pretty scarce. And on top of it all : not all implementations are the be-all and end-all, depending on the context.

building a sequenced audio engine

Just to reiterate what was mentioned before : this is not the be-all and end-all implementation. It's how MikroWave works. It should give you some pointers on how to approach your own engine; keep in mind what you want from your program and evaluate what it needs and what is (un)necessary.

Writing MikroWave has led to the creation of a nice list of classes and objects. However, the important ones for this exercise ( the ones that make it possible to actually sequence audio in a musical context! ) are just these four:

Engine, Sequencer, AudioChannel and AudioEvent.

Engine

The engine actually contains very little logic. It starts the output process, which continually enqueues audio buffers for the audio hardware to output. Each cycle of the thread is executed after a previously queued buffer has been processed, and each cycle writes only the amount of samples that fit in a single buffer. The engine has no rendering logic of its own, apart from mixing the available AudioChannels into a single output. In other words : the engine only reads buffers and writes them into a single output buffer ( i.e. the "master"-strip on a mixer ). The AudioChannels actually containing the ( rendered ) buffers for reading are requested from the sequencer.
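
As an illustration, mixing the channel buffers into the single output buffer boils down to something like this ( a simplified sketch, not the actual MikroWave code ) :

// sums the ( already rendered ) channel buffers into the master output buffer,
// hard-limiting the result to the -1.0 / +1.0 sample range
#include <algorithm>
#include <vector>

void mixChannelsIntoOutput( const std::vector<std::vector<float>>& channelBuffers,
                            std::vector<float>& outputBuffer )
{
    // silence the master buffer before summing
    std::fill( outputBuffer.begin(), outputBuffer.end(), 0.0f );

    for ( const std::vector<float>& channelBuffer : channelBuffers )
    {
        for ( std::size_t i = 0; i < outputBuffer.size(); ++i )
        {
            outputBuffer[ i ] += channelBuffer[ i ];

            if ( outputBuffer[ i ] > 1.0f )
                outputBuffer[ i ] = 1.0f;
            else if ( outputBuffer[ i ] < -1.0f )
                outputBuffer[ i ] = -1.0f;
        }
    }
}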

Sequencer

Each request "steps" the sequencer position by the buffer size. Rather than relying on timers ( DON'T! ), sequences are calculated at the buffer level. In musical terms : say we're looping a single measure at 120 beats per minute in 4/4 time. There are 4 beats per measure, so the entire measure lasts for 4 beats / ( 120 beats per minute / 60 seconds ) = 2 seconds.

In programmatic terms, we calculate time in buffer samples. Let's say we are rendering audio at 44.1 kHz.
The amount of samples for a single measure is : round(( 44100 Hz * 60 seconds ) / 120 bpm ) * 4 beats = 88200 samples.

Let's say that after 10 steps and using a buffer size of 512 samples, the sequencer position is at 5120 samples. The current sequencer-step is looking to gather AudioEvents that are audible in the 5120 - 5632 range of the current measure. Just to return to a musical context : the second sixteenth note of the measure starts at 5512 samples.

The sequencer's job is to return each sequenced instrument ( represented by an AudioChannel ) as well as the calculated buffer range to the engine, so it can output the AudioEvents that are eligible for playing. If the step exceeds the end position of the available ( or looping ) measure(s), the sequencer begins counting from the first available position of the current loop ( i.e. after 172 steps the sequencer will request the 88064 - 88576 range, which exceeds the measure length of 88200 samples by 376 samples. When looping, this means the sequencer performs a request for the 88064 - 88200 range and an additional request for the 0 - 376 range for seamless audio. The new sequencer position is now 376 ).
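
In code, a single sequencer step could look something like the sketch below ( hypothetical names and heavily simplified; the actual implementation lives in the MWEngine source ) :

// a simplified buffer-driven sequencer step : collect the events of a single
// channel that are audible within the range about to be rendered, including
// the wrap back to the loop start when the end of the measure is exceeded
#include <vector>

struct SimpleAudioEvent
{
    int sampleStart; // offset within the measure at which the note starts
    int sampleEnd;   // offset at which it stops sounding
};

// amount of samples in a single measure, e.g. 88200 for 4/4 at 120 BPM / 44.1 kHz
int samplesPerMeasure( int sampleRate, float tempo, int beatsPerMeasure )
{
    return ( int )((( float ) sampleRate * 60.0f ) / tempo ) * beatsPerMeasure;
}

class SimpleSequencer
{
    public:
        int position   = 0;                                  // current offset in samples
        int loopLength = samplesPerMeasure( 44100, 120, 4 ); // 88200 samples

        // collects events audible within the next buffer and advances the position
        void collectEvents( const std::vector<SimpleAudioEvent*>& channelEvents,
                            int bufferSize, std::vector<SimpleAudioEvent*>& out )
        {
            int start = position;
            int end   = position + bufferSize; // exclusive

            for ( SimpleAudioEvent* event : channelEvents )
            {
                // audible when the event range overlaps the requested buffer range
                if ( event->sampleStart < end && event->sampleEnd >= start )
                    out.push_back( event );
                // also audible when the range wraps around the loop point
                else if ( end > loopLength && event->sampleStart < ( end - loopLength ))
                    out.push_back( event );
            }
            position = end % loopLength; // e.g. 88576 % 88200 == 376
        }
};

Note how the position simply wraps around the loop length, which is what keeps the output seamless when a requested range crosses the loop point.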

AudioChannel

An AudioChannel can be thought of as a track on a mixer. In MikroWave each channel represents an instrument, either a synthesizer or the drum machine. The AudioChannel holds a vector of AudioEvents, which represent musical notes at a given pitch and time. The channel also contains mixer properties ( for instance panning position and output volume ) as well as a processing chain. The processing chain contains effect processors ( let's say oscillated filters or delays for echo generation, etc. ) which apply to the instrument. When queried by the engine, all AudioEvents that were made eligible for output by the sequencer are written into a single buffer ( the channel strip on a mixer ), to which the processors in the chain are applied in series. This buffer is then written by the engine into the combined, single output.
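
As a rough outline of what such a channel could look like ( heavily simplified, and BaseProcessor is a made-up name here, not the actual MikroWave class ) :

// a bare outline of an AudioChannel : its events, mixer properties and a
// processing chain that is applied in series to the channel buffer
#include <vector>

class AudioEvent; // see the next section

class BaseProcessor
{
    public:
        virtual ~BaseProcessor() {}
        // processors operate in place on the channel buffer
        virtual void process( std::vector<float>& buffer ) = 0;
};

class AudioChannel
{
    public:
        float pan    = 0.0f; // -1 = fully left, +1 = fully right
        float volume = 1.0f;

        std::vector<AudioEvent*>    audioEvents;     // the sequenced notes
        std::vector<BaseProcessor*> processingChain; // e.g. filters, delays

        // after the eligible events have mixed their audio into channelBuffer,
        // the processors are applied to it one after the other
        void applyProcessingChain( std::vector<float>& channelBuffer )
        {
            for ( BaseProcessor* processor : processingChain )
                processor->process( channelBuffer );
        }
};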

AudioEvent

Basically a musical instruction for the instrument it corresponds to. An AudioEvent contains :

  • a method for mixing its audio input into the channel buffer
  • a property describing the event's length ( duration ) in samples
  • properties describing the event's start and end offsets in samples. These describe, in a musical context, at what part of a measure the note starts sounding and for what duration it lasts.

Depending on the instrument the AudioEvent corresponds to, additional properties are available. For instance an AudioEvent for a synthesizer holds a reference to instrument properties for frequency ( pitch ) and ADSR envelopes to shape the sound.

Certain events are also cacheable ( more on this in the optimization section below ) and contain their own cached buffer. When queried by the sequencer to mix their audio into the channel buffer, the requested segment is simply copied from the cached buffer, omitting the need to resynthesize a static sound. These events have additional methods to invalidate their contents ( and rendered buffer! ) when the corresponding instrument's properties change or a global sequencer setting such as tempo is altered.
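
Conceptually, the mixing of such a cacheable event could look like this ( hypothetical names, simplified from the description above ) :

// a sketch of a cacheable event : the expensive synthesis happens once,
// subsequent mixes simply copy from the cached buffer
#include <vector>

class CacheableAudioEvent
{
    public:
        int sampleStart  = 0; // offset within the measure at which the note starts
        int sampleLength = 0; // duration in samples

        // mixes this event's samples into the channel buffer for the range that
        // starts at sequencerPos and spans channelBuffer.size() samples
        void mixBuffer( std::vector<float>& channelBuffer, int sequencerPos )
        {
            if ( !_cached )
            {
                synthesize(); // expensive : only performed once
                _cached = true;
            }

            for ( std::size_t i = 0; i < channelBuffer.size(); ++i )
            {
                int readPos = ( sequencerPos + ( int ) i ) - sampleStart;

                if ( readPos >= 0 && readPos < ( int ) _cachedBuffer.size() )
                    channelBuffer[ i ] += _cachedBuffer[ readPos ];
            }
        }

        // called when the instrument's properties or the sequencer tempo change
        void invalidate() { _cached = false; }

    private:
        bool _cached = false;
        std::vector<float> _cachedBuffer;

        void synthesize()
        {
            _cachedBuffer.assign( sampleLength, 0.0f ); // actual synthesis omitted
        }
};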

However, these additional methods aren't of any interest to the sequencer and are disregarded in this example. For those interested in a more elaborate description of the engine, you can view the MWEngine Wiki on GitHub.

CPU / memory optimizations

To reiterate the opening paragraphs : while mobile devices are getting more and more powerful, with increased memory size and processing power, you can't expect your audience to all own the latest and baddest one. The following pointers are strongly encouraged for optimizing your engine, though it can be argued that it is always good practice to minimize overhead, regardless of how luxurious the environment's capabilities are!

  • object pooling
    Pre-initialize objects whose initialization is expensive and keep them ready in a pool. If an object is required, it is retrieved from the pool and its properties are updated to fit the new usage context. After the object's usage has run its course, it is returned to the pool instead of destroyed ( a bare-bones pool sketch follows after this list ). On the Java side of things this is especially convenient, as it avoids the hit of the garbage collector, though in MikroWave Java is only used for rendering the graphics and the application logic, not for audio output.
  • caching
    If throughout your program you need a value which is calculated at runtime, and that value remains constant ( either throughout the entire program lifetime or for n iterations ), do not keep recalculating it. For instance : each time a single A note lasting a full quaver of a measure is sequenced for output, only have the instrument render the corresponding audio buffer once, and keep the buffer intact for subsequent triggers. Only invalidate the cached buffer when the properties of its environment have changed. Another example is the drum machine, where each part of the "drum kit" has a single, distinct sound. Render the sound only once, and use pointers to have similar events reference the original buffer; this prevents allocating memory to hold the exact same values that are already present elsewhere in memory!
  • memory allocations / deallocations in output thread
    Initialize all objects and variables used by the engine outside of its thread loop, and dispose of them outside of the thread as well.
  • do work in advance
    Keeping the previous suggestion about memory (de)allocation in mind, you can use parallel threads to prepare objects that will eventually end up in the engine's output thread. For instance : if a sequence has four measures and the sequencer is currently playing the second measure, you can pre-cache the buffers of the audio events present in the third measure.
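
A bare-bones object pool could look like the sketch below ( hypothetical, single-threaded and not the actual MikroWave code; a real implementation needs to take the engine's threading into account ) :

// objects are created up front and recycled instead of destroyed, so no
// allocations take place while the engine is running
#include <vector>

template <typename T>
class ObjectPool
{
    public:
        explicit ObjectPool( size_t size )
        {
            for ( size_t i = 0; i < size; ++i )
                _available.push_back( new T() );
        }

        ~ObjectPool()
        {
            for ( T* object : _available )
                delete object;
        }

        // retrieve an object from the pool ( returns nullptr when depleted )
        T* acquire()
        {
            if ( _available.empty() )
                return nullptr;

            T* object = _available.back();
            _available.pop_back();
            return object;
        }

        // return an object once its usage has run its course
        void release( T* object )
        {
            _available.push_back( object );
        }

    private:
        std::vector<T*> _available;
};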

open source

Still curious ? You may browse, use and comment freely on the code base, as it has been made public and is available to developers under the MIT license.