Removing human voices from recordings to eliminate privacy issues

Removing voices from birdsong.

One of the problems for the Cacophony Project to resolve is stopping the device from recording human voices (to avoid any issues with privacy). I've been performing some backyard experiments with some promising results!

Sound is made when things vibrate - the faster the vibration (or frequency), the higher the perceived pitch of the sound. To make sense of the sound we have to look at the frequencies and make assumptions as to what created them. Small, light things (like violin strings) vibrate fast, making higher pitch sounds, and larger, heavier things (like bass drums) vibrate slowly, making lower pitch sounds.

Birds are relatively small, and most birdsong contains vibrations between about 1,000 and 10,000 vibrations per second (1,000 Hz to 10,000 Hz). The image below is a spectrogram of some birdsong. The vertical axis is frequency for each of the stereo channels, and the horizontal axis is time.

People are relatively large, and human vocal cords vibrate between 100 and 350 times per second, depending on age and gender. However, there is the slight complication that human speech has a lot of wispy sibilant sounds like 's' and 'sh', where the sound is made by the air whistling through the mouth and lips. These sounds sit at much higher frequencies than the vocal cords produce, so they overlap with the birdsong range.

For my conceptual model I am assuming that each recording contains four components:
1. Background noise: a relatively constant random mix of noise, perhaps wind in the trees, or waves on a beach.
2. Birdsong: sounds between 1,000 Hz and 10,000 Hz
3. Voices: intermittent sounds between 100 Hz and 3,000 Hz
4. Other environmental noise: both natural noises like rain, wind, thunder, and human-made noises like cars and planes.

The background noise is pretty much consistent in any given recording, but in the long term some way needs to be developed to account for different recording locations, weather and times of day. This can be done by identifying the quieter part of the recording and using that as a reference level.
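As a rough illustration of using the quieter part of a recording as a reference level, here is a minimal Python sketch. It is not the project's code: the function names, the use of RMS as the loudness measure, and the choice to treat the quietest 10% of blocks as "quiet" are all assumptions made for the example.

```python
import math

def block_rms(samples, block_size):
    """Return the RMS level of each full block of `block_size` samples."""
    levels = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        levels.append(math.sqrt(sum(x * x for x in block) / block_size))
    return levels

def reference_level(samples, block_size, quiet_fraction=0.1):
    """Estimate the background level as the mean RMS of the quietest blocks.

    quiet_fraction is an assumed value: the share of blocks treated as "quiet".
    """
    levels = sorted(block_rms(samples, block_size))
    if not levels:
        raise ValueError("recording is shorter than one block")
    n_quiet = max(1, int(len(levels) * quiet_fraction))
    return sum(levels[:n_quiet]) / n_quiet
```

A level estimated this way could then be used to scale thresholds for a particular location or time of day, although how best to do that is still an open question.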

To locate parts of the recording that might include voices, the signal is analysed in 0.1 second blocks. For each block, the power of the sound in the 100 Hz to 800 Hz range is calculated and compared to the power of the total signal. When that share is greater than a threshold (around 25%), it is assumed that the block could include voices and should be muted.
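The block-by-block test described above could be sketched in Python with NumPy roughly as follows. This is an illustration under assumptions rather than the actual implementation: the function names are made up, the power is measured from a plain FFT of each block, and the DC and Nyquist bins are not re-weighted, which is a small simplification.

```python
import numpy as np

def voice_band_ratio(block, sample_rate, low_hz=100.0, high_hz=800.0):
    """Fraction of a block's spectral power that lies between low_hz and high_hz."""
    spectrum = np.fft.rfft(block)
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0.0:
        return 0.0  # silent block: nothing to flag
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power[in_band].sum() / total)

def flag_blocks(samples, sample_rate, block_seconds=0.1, threshold=0.25):
    """Return one boolean per full block: True where the block may contain voice."""
    block_size = int(round(sample_rate * block_seconds))
    n_blocks = len(samples) // block_size
    return [
        voice_band_ratio(samples[i * block_size:(i + 1) * block_size], sample_rate) > threshold
        for i in range(n_blocks)
    ]
```

With 0.1 second blocks the frequency resolution is 10 Hz, which is fine enough to separate the 100 Hz to 800 Hz band from the rest of the spectrum.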

To deal with words like "six" and "sighs", which contain very little low-frequency sound, a section of audio before and after each already identified section is also muted (just in case). As a final bit of finessing, any section of un-muted audio that is too short to be useful is also muted.
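These two clean-up steps can be expressed as operations on a list of per-block flags (True meaning "mute this block"). The sketch below is a hypothetical illustration, not the project's code; the function names are invented, and it assumes that short un-muted runs at the very start or end of a recording are treated the same as any other short run.

```python
def pad_mute_mask(flags, pre_blocks, post_blocks):
    """Extend every flagged block backwards and forwards by the given block counts."""
    n = len(flags)
    muted = [False] * n
    for i, flagged in enumerate(flags):
        if flagged:
            start = max(0, i - pre_blocks)
            end = min(n, i + post_blocks + 1)
            for j in range(start, end):
                muted[j] = True
    return muted

def mute_short_gaps(muted, min_keep_blocks):
    """Mute any run of un-muted blocks shorter than min_keep_blocks."""
    result = list(muted)
    n = len(result)
    i = 0
    while i < n:
        if result[i]:
            i += 1
            continue
        # Find the end of this un-muted run.
        j = i
        while j < n and not result[j]:
            j += 1
        if j - i < min_keep_blocks:
            for k in range(i, j):
                result[k] = True
        i = j
    return result
```

With 0.1 second blocks, muting 0.5 seconds before and 2 seconds after a trigger corresponds to 5 and 20 blocks, and a 3 second minimum for useful audio corresponds to 30 blocks.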

This gives a simple model with six parameters that can be tweaked. The parameters, along with the empirically derived settings I have been using, are:
1. The high frequency for the voice band (800 Hz)
2. The low frequency for the voice band (100 Hz)
3. The threshold for where muting will be triggered (25%)
4. How long to mute before a section that triggers muting (0.5 seconds)
5. How long to mute after a section that triggers muting (2 seconds)
6. The minimum length of un-muted audio that is of value (3 seconds)
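One way to keep these six settings together so they are easy to tweak is a small settings object. The Python sketch below is only an illustration: the class name, field names and the in_blocks helper are hypothetical, while the default values are the ones listed above and the 0.1 second block length is the analysis block size mentioned earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MuteSettings:
    voice_band_high_hz: float = 800.0   # 1. top of the voice band
    voice_band_low_hz: float = 100.0    # 2. bottom of the voice band
    trigger_threshold: float = 0.25     # 3. share of total power that triggers muting
    pre_mute_seconds: float = 0.5       # 4. extra muting before a trigger
    post_mute_seconds: float = 2.0      # 5. extra muting after a trigger
    min_keep_seconds: float = 3.0       # 6. shortest un-muted stretch worth keeping

    def in_blocks(self, seconds, block_seconds=0.1):
        """Convert a duration to a whole number of analysis blocks."""
        return int(round(seconds / block_seconds))

settings = MuteSettings()
pre_blocks = settings.in_blocks(settings.pre_mute_seconds)
post_blocks = settings.in_blocks(settings.post_mute_seconds)
min_keep_blocks = settings.in_blocks(settings.min_keep_seconds)
```

Trying a different setting is then a one-line change, for example MuteSettings(trigger_threshold=0.3).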

This has proven relatively effective. On a 25-minute audio file without voices, 5% of the audio was muted because of false positives, and most of these false positives were due to vehicle noise. On a second, 14-minute recording that contained a significant amount of voice, 22% of the audio was muted, and only a tiny fraction of the voice remained.

Here is one of the original audio files:
And the processed output:

The technique is simple enough that it can be implemented while recording, and could significantly reduce the amount of voice accidentally recorded. That would go a long way towards avoiding privacy issues when there are potentially hundreds or thousands of Cacophonometers recording the trends of birdsong around New Zealand.

If you would like to help us to refine our device recordings, or are interested in buying a device, click here to contact us.