Understanding Forensic Voice Identification & Speaker Comparison
FORENSIC SPEAKER IDENTIFICATION
The science of voice identification, now referred to as speaker identification, has progressed since World War II. It was first accepted in court in the 1960s. In recent years, biometric speaker comparison software systems have increasingly been used. Both the aural/spectrographic and biometric methods are explained below.
Aural/Spectrographic
Aural/spectrographic voice identification combines aural (listening) and instrumental comparison of one or more known voices with an unknown voice for the purpose of identification or elimination. The fundamental premise of voice identification is that every voice is individually characteristic enough to distinguish it from others through voiceprint analysis.
Factors contributing to voice uniqueness lie in the size and configuration of the vocal cavities, such as the throat, nasal and oral cavities, and in the shape, length and tension of the individual’s vocal cords. Another factor in voice uniqueness is the manner in which the articulator muscles are manipulated during speech.
Other factors that are compared include resonance quality, pitch, temporal factors, inflection, dialect, articulation, syllable grouping and peculiar speech characteristics.
SELECTION OF WORDS FOR COMPARISON
After the evidence recording is received, the recorded speech is captured to a lab computer. The recording is reviewed to identify the words best suited for comparison. The words sought are those that are most clearly articulated and are not slurred, truncated or run together with other words. Each group of words selected is then placed into what is referred to as a short-term memory file. The desired number of words per comparison is 20 or more.
COLLECTION OF EXEMPLAR
It is recommended that the exemplar of the known voice be collected in a manner as close as possible to the way the unknown voice was recorded. For example, if the unknown voice was recorded over the phone, the exemplar of the known voice should also be collected over the phone. When the exemplar is collected, the examiner asks the suspect to say the same words in the same way as they were spoken by the unknown person, that is, in a normal, natural voice.
AURAL AND WAVEFORM ANALYSIS
Analysis is conducted through aural (listening) and visual comparison of the words through graphical (waveform) display. Each recording is transferred onto the computer using a digital sound card to ensure the best quality capture. A graphical display of the recorded material, called the waveform, can then be viewed as the recording is played and reviewed. The configuration of the individual words can be seen as they are played.
SPECTROGRAPHIC ANALYSIS
A computerized spectrographic analysis is conducted. This facilitates visual comparison of the features of each word spoken. The spectrogram displays the speech in three dimensions: time, frequency and amplitude. It shows time along the horizontal axis and frequency along the vertical axis, with amplitude indicated by varying degrees of gray or colored shading. The spectrogram serves as a permanent visual record of the words spoken and facilitates visual comparison of similar words spoken by an unknown speaker’s voice with a known speaker’s voice.
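The time, frequency and amplitude layout described above can be illustrated with a minimal Python sketch of a short-time Fourier transform. This is an illustration only, not the software used in the lab: the function name `spectrogram`, the frame length, the hop size and the synthetic 1000 Hz test tone are all assumptions chosen for the example, and a plain discrete Fourier transform is used so the code needs nothing beyond the standard library.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n, used to taper each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_len//2) per time frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Hypothetical test signal: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]

spec = spectrogram(signal, frame_len=frame_len, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print(len(spec), "frames; strongest frequency in first frame:", peak_hz, "Hz")
```

Each inner list corresponds to one column of the display (a moment in time), each position in it to a frequency band, and the stored magnitude is the quantity a spectrogram renders as gray or colored shading.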
STANDARDS FOR COMPARISON DETERMINATION
The following are the standards accepted nationally by all professional organizations involved with voice identification, including the Audio Engineering Society International and the American Board of Recorded Evidence.
IDENTIFICATION: At least 90% of all comparable words must be very similar aurally and spectrally, producing not less than twenty (20) matching words.
PROBABLE IDENTIFICATION: At least 80% of the comparable words must be very similar aurally and spectrally, producing not less than fifteen (15) matching words.
POSSIBLE IDENTIFICATION: At least 80% of comparable words must be very similar aurally and spectrally, producing not less than ten (10) matching words.
INCONCLUSIVE: Falls below either the Possible Identification or Possible Elimination confidence levels and/or the examiner does not believe a meaningful decision is obtainable due to various limiting factors.
POSSIBLE ELIMINATION: At least 80% of comparable words must be very dissimilar aurally and spectrally, producing not less than ten (10) words that do not match.
PROBABLE ELIMINATION: At least 80% of the comparable words must be dissimilar aurally and spectrally, producing not less than fifteen (15) words that do not match.
ELIMINATION: At least 90% of the comparable words must be very dissimilar aurally and spectrally, producing not less than twenty (20) words that do not match.
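The numeric thresholds in the standards above can be written out as a small Python sketch. It is only an illustration of the arithmetic: the function name `classify` and its parameters are hypothetical, the judgment of whether a word is "very similar" or "very dissimilar" is assumed to have been made already by the examiner, and the examiner's discretion to call a comparison inconclusive because of limiting factors is not modeled.

```python
def classify(comparable, similar, dissimilar):
    """Map word counts to one of the seven decision levels.

    comparable: number of comparable words examined
    similar:    words judged very similar aurally and spectrally
    dissimilar: words judged very dissimilar aurally and spectrally
    """
    if min(comparable, similar, dissimilar) < 0 or similar + dissimilar > comparable:
        raise ValueError("counts must be non-negative and fit within the comparable words")
    if comparable == 0:
        return "INCONCLUSIVE"

    def meets(count, percent, minimum):
        # Integer arithmetic avoids rounding problems exactly at 80% or 90%.
        return count * 100 >= percent * comparable and count >= minimum

    if meets(similar, 90, 20):
        return "IDENTIFICATION"
    if meets(dissimilar, 90, 20):
        return "ELIMINATION"
    if meets(similar, 80, 15):
        return "PROBABLE IDENTIFICATION"
    if meets(dissimilar, 80, 15):
        return "PROBABLE ELIMINATION"
    if meets(similar, 80, 10):
        return "POSSIBLE IDENTIFICATION"
    if meets(dissimilar, 80, 10):
        return "POSSIBLE ELIMINATION"
    return "INCONCLUSIVE"

print(classify(25, 23, 0))   # 92% similar, 23 matching words
print(classify(20, 16, 1))   # 80% similar, 16 matching words
print(classify(20, 8, 6))    # neither threshold reached
```

For example, 23 matching words out of 25 comparable words is 92% with more than twenty matches, which satisfies the Identification standard, while 16 out of 20 is exactly 80% with at least fifteen matches and so reaches only Probable Identification.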
BIOMETRIC SPEAKER IDENTIFICATION
Biometrics refers to the quantifiable data (or metrics) related to human characteristics and traits. For more than forty years, the court-accepted method of voice identification analysis has been the aural/spectrographic method. In recent years, major advances in biometric speaker identification analysis have occurred.
Biometrics is used to identify humans by their unique characteristics. It is applied in DNA testing, fingerprints, facial recognition, palm prints, iris recognition and voice/speaker identification analysis. A sample of at least 16 seconds of pure speech from both the known voice and the unknown voice is necessary. The longer the samples are, the greater the likelihood percentage of identification or elimination. Multiple voices can be compared in a single analysis. Biometric voice identification is being used by federal agencies, including the FBI, NSA and CIA. The system uses the following three methods to compare the voices:
Pitch Statistics: Pitch refers to how high or low a person’s voice sounds. The pitch of the voice can be measured scientifically. The Pitch Statistics Method (PSM) contains 16 different pitch parameters, including average pitch value, maximum, minimum, median, percent of areas with rising pitch, pitch logarithm variation, pitch logarithm asymmetry, pitch logarithm excess and 8 additional parameters.
Spectral-Formant Method: The spectral maxima of the speech signal are called formants. They are formed by the resonances that occur in the vocal tract during the speech generation process. The formants (resonance frequencies) depend on the geometrical size and shape of the vocal tract (the head with all its cavities and organs). In general, only four formants can be found in the frequency band of a phone line (300-3400 Hz). The instantaneous values and dynamic traces of those four formants are extracted from the dynamic spectrogram and compared using a Support Vector Machine (SVM) classifier.
Gaussian Mixture Models based Method (GMM): This approach is more statistical and requires computing power, so it cannot be accomplished manually. In simple terms, not only the spectral maxima (values of the resonance frequencies) are measured and compared, but also their shape and the energy distribution across the frequencies.
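Of the three methods above, the pitch statistics are the simplest to illustrate. The Python sketch below computes the eight parameters that are named in the description from a pitch contour. The exact definitions used by the system are not given, so several interpretations here are assumptions: "asymmetry" is taken to mean skewness, "excess" to mean excess kurtosis, and "percent of areas with rising pitch" to mean the share of frame-to-frame steps in which the pitch increases. The function name `pitch_statistics` and the sample contour are hypothetical.

```python
import math
import statistics

def pitch_statistics(f0_hz):
    """Summarize a pitch contour: fundamental frequency in Hz, voiced frames only."""
    if len(f0_hz) < 2:
        raise ValueError("need at least two voiced frames")
    log_f0 = [math.log(f) for f in f0_hz]
    mu = statistics.fmean(log_f0)
    var = statistics.pvariance(log_f0, mu)
    # Third and fourth central moments of log-pitch (assumed definitions).
    m3 = sum((x - mu) ** 3 for x in log_f0) / len(log_f0)
    m4 = sum((x - mu) ** 4 for x in log_f0) / len(log_f0)
    # Count frame-to-frame steps where the pitch goes up.
    rising = sum(1 for a, b in zip(f0_hz, f0_hz[1:]) if b > a)
    return {
        "mean_hz": statistics.fmean(f0_hz),
        "max_hz": max(f0_hz),
        "min_hz": min(f0_hz),
        "median_hz": statistics.median(f0_hz),
        "percent_rising": 100.0 * rising / (len(f0_hz) - 1),
        "log_variance": var,
        "log_asymmetry": m3 / var ** 1.5 if var > 0 else 0.0,
        "log_excess": m4 / var ** 2 - 3.0 if var > 0 else 0.0,
    }

# Hypothetical contour: pitch rises from 100 Hz to 120 Hz and falls back.
contour = [100.0, 110.0, 120.0, 110.0, 100.0]
stats = pitch_statistics(contour)
for name, value in stats.items():
    print(name, round(value, 4))
```

Two recordings could then be compared parameter by parameter. How the system weighs the individual parameters against each other is not described and is not modeled here.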
After the analysis is conducted, the results are displayed in likelihood percentages as well as other statistical data.
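How the likelihood figures are derived is not described. One common way to score a Gaussian mixture comparison, shown here purely as an assumption, is the average log-likelihood ratio between a model of the known speaker and a background model of other speakers. In the Python sketch below every model parameter and feature vector is invented for illustration, the models have diagonal covariances, and the function names are hypothetical. Real systems fit such models to spectral features extracted from the recordings.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log likelihood of x under a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    return top + math.log(sum(math.exp(t - top) for t in terms))

def average_llr(frames, speaker_gmm, background_gmm):
    """Mean per-frame log-likelihood ratio; positive values favor the speaker model."""
    total = sum(
        gmm_log_likelihood(f, speaker_gmm) - gmm_log_likelihood(f, background_gmm)
        for f in frames
    )
    return total / len(frames)

# Invented two-dimensional models and feature frames, for illustration only.
speaker_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
background_gmm = [(1.0, [10.0, 10.0], [4.0, 4.0])]

same_score = average_llr([[0.1, -0.2], [2.9, 3.1], [0.3, 0.1]], speaker_gmm, background_gmm)
different_score = average_llr([[9.5, 10.2], [10.4, 9.8]], speaker_gmm, background_gmm)
print("frames near the speaker model:", round(same_score, 2))
print("frames near the background model:", round(different_score, 2))
```

A positive score means the frames are better explained by the known speaker's model than by the background model, and a negative score means the opposite. Converting such a score into a percentage requires a calibration step that is outside the scope of this sketch.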