In my recent project, I was tasked with developing an algorithm to detect speech in an audio file and eliminate any background noise or silence. The goal was to take an audio clip containing noise, silence, and a spoken phrase, and isolate just the speech. This is useful for many speech processing applications where you only want to analyze the speech portions.
The core challenge was that simple noise filters would not work here, because the noise and speech overlap in time. I needed a more sophisticated approach called energy endpoint detection. The key insight is that speech typically has much higher energy than silence or ambient noise. By computing the energy in short time frames, I could distinguish speech frames from noisy or silent frames.
The first step was to break the signal into short 20-millisecond frames and compute the energy in each frame. A high-energy frame likely contains speech. I then searched forward from the beginning of the signal to find the first high-energy frame, which marks the start of speech, and searched backwards from the end of the signal to find the last high-energy frame, which marks the end of speech. With the start and end points, I simply extracted the frames between them into a new audio file containing just the speech.
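The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the code from the report: the function names are hypothetical, the frames are assumed not to overlap, and the rule for what counts as "high energy" (here, a fixed fraction of the peak frame energy) is an assumption, since the post does not say how the threshold was chosen.

```python
import numpy as np


def frame_energies(signal, sample_rate, frame_ms=20):
    """Split a mono signal into non-overlapping frames and return
    (energy of each frame, frame length in samples)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # Drop the trailing partial frame so the signal reshapes cleanly.
    samples = np.asarray(signal[: n_frames * frame_len], dtype=np.float64)
    frames = samples.reshape(n_frames, frame_len)
    # Short-time energy: sum of squared samples in each frame.
    return np.sum(frames ** 2, axis=1), frame_len


def detect_endpoints(energies, threshold_ratio=0.1):
    """Return (first, last) indices of the high-energy frames, or None.

    ASSUMPTION: a frame is "high energy" if it exceeds
    threshold_ratio * (peak frame energy).
    """
    if energies.size == 0:
        return None
    threshold = threshold_ratio * energies.max()
    high = np.flatnonzero(energies > threshold)
    if high.size == 0:
        return None
    # high[0] is what a forward search from the start would find;
    # high[-1] is what a backward search from the end would find.
    return int(high[0]), int(high[-1])


def trim_to_speech(signal, sample_rate, frame_ms=20, threshold_ratio=0.1):
    """Keep only the samples between the detected start and end frames."""
    energies, frame_len = frame_energies(signal, sample_rate, frame_ms)
    endpoints = detect_endpoints(energies, threshold_ratio)
    if endpoints is None:
        return signal[:0]  # nothing above the threshold
    start, end = endpoints
    return signal[start * frame_len : (end + 1) * frame_len]
```

Calling `trim_to_speech(signal, sample_rate)` on a mono array of samples returns the slice between the first and last high-energy frames. Everything between those two frames is kept, including any quiet pauses inside the phrase. The threshold ratio would need tuning for a real recording.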
The results were great: the algorithm accurately detected the phrase “Let us go then, you and I” spoken by a woman, eliminating all background noise before and after it. My graphs visualize how the energy endpoint detection works. By quantifying signal energy over time, I could clearly see when speech begins and ends.
In the future, I could improve accuracy further by incorporating additional techniques like zero crossing rate analysis. But overall, I demonstrated how effective energy endpoint detection is for isolating speech from raw audio in a simple and efficient way. My code and documentation could serve as a template for others working on speech processing applications.
To see more details, check out the paper.
Disclaimer: This project was completed as part of my BSc in Mathematics at Manchester Metropolitan University. The project was supervised by Dr. Jon Borresen. This blog post is LLM-generated text, based upon the hand-written report.