ABSTRACT

Forensic speaker identification is the most important task within the field that is known as Forensic Phonetics and Acoustics or Forensic Speech and Audio Analysis. The former term corresponds to the name of the International Association for Forensic Phonetics and Acoustics (IAFPA; see www.iafpa.net), which hosts an annual international meeting and is represented in the International Journal of Speech, Language and the Law. Although the IAFPA has made it very clear that phoneticians are not privileged among its membership over, for example, speech engineers, phoneticians and linguists are in fact traditionally the most strongly represented groups in the organisation. The second term, Forensic Speech and Audio Analysis, is entirely neutral with respect to any underlying academic field (such as phonetics), and it is the name of a working group within ENFSI (European Network of Forensic Science Institutes; see www.enfsi.eu). In that group, engineers and computer scientists are at least as strongly represented as phoneticians and linguists. This is partly because forensic speech and audio analysis comprises many activities other than speaker identification that can benefit from a speech and audio engineering perspective. Such activities include both audio enhancement, i.e. the attempt to increase the intelligibility of poor-quality speech through advanced filtering and other signal processing procedures, and audio authentication, i.e. detecting indications that an audio recording has been manipulated. Since these activities outside of speaker identification are not excluded from the scope of the IAFPA, it makes little sense to see a difference in meaning between the two terms Forensic Phonetics and Acoustics and Forensic Speech and Audio Analysis, so they will be treated as synonymous here. A third term, which should also be seen as synonymous, is Forensic Speech Science.
This is the name of the first academic programme that was established in that field at the University of York, UK in 2007 (see www.york.ac.uk/depts/lang/postgrad/forensic.htm).

Forensic speaker identification can be divided into several sub-tasks. A classification
that has proven useful in forensic practice is shown in Table 25.1.

If audio recordings exist of both the unknown speaker (i.e. the offender in situations such as kidnapping, stalking or drug dealing) and a suspect, it is possible to conduct a
speaker comparison and use it as evidence in court. An alternative term for speaker comparison is voice comparison, which means the same. If the suspect is cooperative, a recording can be made of his speech and the forensic expert can have a large amount of control over this recording. For example, it is possible to make a transcript of the unknown speaker’s utterances and then ask the suspect to read them or repeat them in appropriate chunks of speech. Such a procedure results in text identity between the two recordings, which can be useful for some subsequent activities such as the measurement of vowel formants. (Formants are resonance frequencies that result from the shape of the vocal tract and they are measured in hertz (Hz); the lowest resonance frequency is called the first formant – commonly abbreviated as F1 – and the highest resonance frequency used for most forensic applications is the third formant, F3.) However, text identity is not a requirement in forensic speaker identification, and reading or (less so) repeating can result in unnatural prosody, which creates its own problems. Therefore, the recording of a suspect should also contain speech that is uttered as spontaneously as possible. If, however, the suspect is not cooperative and does not agree to have his voice recorded, it depends on the legal system of the country and the circumstances of the case whether prior recordings of the suspect, perhaps taken from police interviews or from telephone surveillance, can be used. Another form of uncooperative behaviour occurs when a suspect agrees to a recording, but then tries to disguise his voice in an obvious or subtle way. In such a case, the expert has to decide from a forensic-phonetic perspective whether this evidence can still be used. The methodology used in speaker comparisons involves a wide variety of both auditory and acoustic parameters and will be addressed in Section 3.

If an audio recording exists of the unknown speaker, but no suspect has been found, it
is still possible to create a speaker profile based on the recording. Synonymous terms for this activity are voice analysis and voice profiling. Speaker profiles are usually requested by the police in an ongoing investigation for the purpose of finding a suspect. Information useful for that purpose includes age, sex, region, social status and foreign language background. Speaker profiling is addressed in more detail in Section 2. In the same situation in which a speaker profile is requested, it is also possible to present audio samples of the unknown speaker to the general public, using mass media such as TV, radio or the internet. This is usually only done in high-profile cases, partly because the subsequent expert work required in evaluating all the responses from the public (including conducting many subsequent speaker comparisons) can be substantial.

Some forensic cases begin with a speaker profiling stage and end with a speaker
comparison stage. Perhaps the most remarkable example is the Yorkshire Ripper case, where these two stages lay 30 years apart. In the early stages, a speaker profile was provided of a caller claiming to be the Yorkshire Ripper, who between 1975 and 1980 had murdered 13 women in Leeds, Bradford, Huddersfield and Manchester. Later it was discovered that the calls had been made by a hoaxer. A suspect hoaxer was eventually found through DNA analysis in 2005. A speaker comparison between the voice of the suspect and the voice from the calls in the 1970s revealed strong indications that these two voices belonged to the same individual. The Yorkshire Ripper case is described in detail in Ellis (1994) and French et al. (2006).

In some situations, no recording of the unknown speaker is available, but a witness has
heard the person speaking. In some cases, such as robbery or rape, the witness may also be the victim. In these situations, it makes a difference, both scientifically and legally, whether or not the witness knew the offender before the crime. If the witness did, the task required of the witness is called familiar-speaker identification; if not, it is called unfamiliar-speaker identification. Familiar-speaker identification enters the evidential process in the form of a regular witness statement. Here the challenge might be to ascertain – based on scientific knowledge about human speaker perception in general – whether such a witness statement is reliable or whether adverse conditions occurred that cast doubt on its reliability. Such adverse conditions include short utterances, distance, additive noise and unusual utterance modes such as shouting (see Blatchford and Foulkes 2006 for a recent case study and further references). Cases with unfamiliar-speaker identification require a different methodology and can be addressed by means of a voice line-up, also referred to as a voice parade (see Nolan 2003).

In the fourth possibility shown in Table 25.1, somebody has witnessed the crime
but neither a suspect nor a recording exists. Although this scenario occurs frequently in reality, experts are only rarely asked for their involvement (at least this holds for Germany). Perhaps this is because there is no established forensic methodology for such a scenario. What would be very useful here is some way of creating the auditory equivalent of what in the visual domain are known as phantom pictures or photofit pictures (see Nolan 1983: 208 for that suggestion). Current technologies in speech synthesis developed under the terms ‘voice transformation’ and ‘voice conversion’ are very promising in this respect (see Stylianou 2008 for an overview).

Speaker comparison and speaker profiling, which were shown on the left-hand side of
Table 25.1, fall into the province mentioned in the title of this chapter, i.e. speaker identification by experts (see Künzel 1995; Broeders 2001 for the term). The term speaker identification by experts is opposed to naïve speaker identification, which denotes the situations shown on the right-hand side of Table 25.1. To be more precise, although the identification process in naïve speaker identification is performed by individuals who are not trained in speech analysis, the framework in which these perceptions by naïve listeners are elicited is a professional one, in which experts are involved in the planning and execution of procedures such as voice line-ups. An alternative term for speaker identification by experts is technical speaker identification (Nolan 1983, 1997). As Nolan (1997) points out, the adjective technical has to be understood in a broad sense – as not only covering the use of instruments such as spectral analysers, but also as referring to non-instrumental methods such as auditory-based phonetic transcription. In this chapter the former term, speaker identification by experts, will be kept, which presents an opportunity to think more closely about the kind of qualifications that are needed by an expert in forensic speaker identification. Since this issue depends on the different methods that are used, its discussion is postponed until the end of this chapter.
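
As a compact summary, the four situations distinguished above can be read as the outcome of two questions: does a recording of the unknown speaker exist, and is there a suspect? The following Python sketch encodes that reading. It is an illustration only: the names `CaseSituation` and `classify_task` are hypothetical, the cell layout is reconstructed from the prose rather than copied from Table 25.1, and the assumption that the familiar/unfamiliar distinction matters only when a suspect exists is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseSituation:
    """Hypothetical description of a case, based on the prose above."""
    recording_of_unknown_speaker: bool
    suspect_available: bool
    # Only consulted when no recording exists (assumption of this sketch).
    witness_knew_offender: bool = False


def classify_task(case: CaseSituation) -> tuple[str, str]:
    """Return (type of identification, task) for a case situation."""
    if case.recording_of_unknown_speaker:
        # Recording available: analysis is carried out by an expert.
        kind = "speaker identification by experts"
        if case.suspect_available:
            task = "speaker comparison"
        else:
            task = "speaker profiling"
    else:
        # No recording: the evidence rests on what a witness heard.
        kind = "naive speaker identification"
        if case.suspect_available:
            if case.witness_knew_offender:
                task = "familiar-speaker identification"
            else:
                task = "unfamiliar-speaker identification (voice line-up)"
        else:
            task = "no established forensic methodology"
    return kind, task


if __name__ == "__main__":
    for rec in (True, False):
        for sus in (True, False):
            print(rec, sus, classify_task(CaseSituation(rec, sus)))
```

Keeping the two questions as separate boolean fields mirrors the way the text walks through the possibilities one at a time; a real casework system would need far more context than this sketch captures.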