By comparing sound files, we can generate a mathematical score to determine call quality.
The Problem
As a service provider, we needed to understand service availability across our suite of Data Centres. We also needed to understand the quality of the broadband connection that our customers had. Scheduling a series of automatic calls was the easy bit. Were all the calls of sufficient quality, or were there customer local connection issues degrading the call quality? Let’s find out.
The Solution
Record a call at the edge and compare it to the original using spectrograms. The initial call and the recorded call (uploaded from the edge to the server) are each translated into spectrograms using a fast Fourier transform. The process divides each sound wave into its component frequencies; once the sound waves are broken into their distinct frequencies, the algorithm computes a discrete Fourier transform (DFT) of those component frequencies. This allows a mathematical comparison to be made between the original sound file and the recorded sound file, producing a similarity score for the recorded call against its original. The user can then decide the level of similarity that they require.
We developed a similarity scale where a score of 70/100 or above meant no audible difference to the human ear. Scores between 60/100 and 70/100 had very marginal distinctions. Anything below this shows a noticeable difference to the human ear and therefore a call of unacceptable quality. We were able to create alarms and call logs for these calls; the recorded calls were available for download so that the validity of the scoring system could be manually checked if required. Essentially, we are creating a digital fingerprint of the original call and of the recorded call.
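The scale above can be expressed as a small helper. The 60 and 70 boundaries come from the text; the label strings are illustrative:

```python
def classify_quality(score: float) -> str:
    """Map a 0-100 similarity score to the quality bands described above.

    The band boundaries (70 and 60) are taken from the article;
    the labels themselves are illustrative.
    """
    if score >= 70:
        return "no audible difference"
    if score >= 60:
        return "marginal difference"
    return "unacceptable quality"
```

A call scoring, say, 65 would be flagged as marginal and could trigger an alarm and a call-log entry for manual review.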
To generate fingerprints, the following process is performed:
- Both files are resampled to a fixed rate of 12kHz (this helps to compare like for like). Resampling is performed using simple linear interpolation of the amplitudes in the WAV file.
- We build a spectrogram of the WAV file. To do this, we:
- First, split the samples into frames of 1/6 of a second each, applying an overlap factor so that adjacent frames have no large discontinuities
- We then apply a window function (Hamming) which, along with the overlapping above, smooths each frame to prevent large distortions in our transformation below.
- Next, we apply a one-dimensional discrete Fourier transform to each frame.
- Lastly, we normalise all signals in all frames to be between 0 and 1.
- We use this normalised spectrogram and divide each frame into a number of filter banks. We calculate the robust points (those with the highest spectrogram value) in each filter bank.
Each robust point is scaled by the maximum value of a signed 32-bit integer (2,147,483,647) and appended to the fingerprint as an intensity value (along with its coordinates), resulting in a complete fingerprint of the WAV file.
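The fingerprinting steps above can be sketched in Python roughly as follows. The exact overlap factor, number of filter banks, and normalisation details are not specified in the text, so the values below (2x overlap, 32 banks, normalising against the global maximum) are assumptions for illustration:

```python
import numpy as np

INT32_MAX = 2_147_483_647
TARGET_RATE = 12_000
FRAME_LEN = TARGET_RATE // 6   # 1/6 of a second of samples per frame
OVERLAP = 2                    # overlap factor (assumed)
N_BANKS = 32                   # number of filter banks (assumed)

def resample(samples: np.ndarray, src_rate: int) -> np.ndarray:
    """Resample to TARGET_RATE by simple linear interpolation of amplitudes."""
    n_out = int(len(samples) * TARGET_RATE / src_rate)
    old_t = np.arange(len(samples)) / src_rate
    new_t = np.arange(n_out) / TARGET_RATE
    return np.interp(new_t, old_t, samples)

def fingerprint(samples: np.ndarray, src_rate: int):
    """Return a list of (frame, freq_bin, intensity) robust points."""
    x = resample(np.asarray(samples, dtype=float), src_rate)
    hop = FRAME_LEN // OVERLAP
    window = np.hamming(FRAME_LEN)  # Hamming window smooths each frame
    frames = [x[i:i + FRAME_LEN] * window
              for i in range(0, len(x) - FRAME_LEN + 1, hop)]
    # Magnitude spectrum of each windowed frame (1-D DFT), normalised to [0, 1].
    spec = np.abs(np.fft.rfft(frames, axis=1))
    spec /= spec.max() or 1.0
    # Take the peak ("robust point") in each filter bank of each frame,
    # scaled by the maximum signed 32-bit integer as its intensity value.
    bank_edges = np.linspace(0, spec.shape[1], N_BANKS + 1, dtype=int)
    fp = []
    for t, frame in enumerate(spec):
        for b in range(N_BANKS):
            lo, hi = bank_edges[b], bank_edges[b + 1]
            if hi <= lo:
                continue
            f = lo + int(np.argmax(frame[lo:hi]))
            fp.append((t, f, int(frame[f] * INT32_MAX)))
    return fp
```

Because fingerprinting is deterministic, running it twice over the same WAV data yields identical fingerprints, which is what allows the comparison step below to use exact pair matching.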
To compare two fingerprints, the following process is performed:
- We build the pair position list table for each fingerprint. This is built by collecting all the positions for each pair. A pair is defined as a hash (a simple, one-way function) of two different (x,y) positions, subject to certain constraints, within the fingerprint.
- We generate an offset score by searching each fingerprint’s pair position list table for the other’s pairs. When we find a matching pair, we count the occurrences of each difference between the pairs’ positions (their offsets) and keep a tally of the number of times each offset occurs.
- We then determine which offset occurred the most times. We use this number (the number of times the most common offset occurred) as our similarity score. We also add to our score half the count of the offset values immediately before and after the most common offset value.
We divide this value by the number of frames in our fingerprints (or, if each fingerprint has a different number of frames, the smaller of the two fingerprints’ frame counts), and multiply our value by 100 to scale it to be between 0 and 100.
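A minimal sketch of the comparison steps, assuming a simple pairing constraint (each robust point is paired with the next few points, and the pair “hash” is the tuple of the two frequency bins plus their frame gap); the real constraints and hash function are not specified in the text:

```python
from collections import Counter, defaultdict

FAN_OUT = 5  # pair each point with the next few points (assumed constraint)

def pair_table(fp):
    """Map each pair hash to the list of frame positions where it occurs.

    fp is a list of (frame, freq_bin, intensity) robust points. The hash
    here is simply (f1, f2, frame gap) -- an assumed, illustrative scheme.
    """
    table = defaultdict(list)
    for i, (t1, f1, _) in enumerate(fp):
        for t2, f2, _ in fp[i + 1:i + 1 + FAN_OUT]:
            table[(f1, f2, t2 - t1)].append(t1)
    return table

def similarity(fp_a, fp_b, n_frames):
    """Score two fingerprints between 0 and 100 via the offset tally."""
    tallies = Counter()
    table_b = pair_table(fp_b)
    for pair, positions in pair_table(fp_a).items():
        for pos_b in table_b.get(pair, ()):
            for pos_a in positions:
                tallies[pos_b - pos_a] += 1
    if not tallies:
        return 0.0
    # Most common offset, plus half the counts of its two neighbours.
    best, count = tallies.most_common(1)[0]
    score = count + 0.5 * (tallies.get(best - 1, 0) + tallies.get(best + 1, 0))
    # Normalise by frame count and scale to 0-100 (clamping is an assumption).
    return min(100.0, 100.0 * score / n_frames)
```

Comparing a fingerprint with itself lines every matching pair up at offset zero, so the tally concentrates there and the score saturates at 100; a degraded recording spreads matches across offsets and scores lower.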