The spectral features imposed on sound pressure by the human outer ear, head, and torso are vital for sound localization in humans, especially for sounds above 5 kHz. Many studies have shown that spectral notches contribute significantly as cues for this purpose. These spectral notches and their center frequencies depend on the shape and geometry of each individual, so their characteristics differ between individuals. We used wavelet multi-resolution analysis to automatically detect the center frequencies of spectral notches in head-related transfer functions (HRTFs). Auto-detection of these notches at certain locations for an individual helps in selecting a suitable, complete 3D HRTF set with similar notch characteristics from existing HRTF databases. Symlet and Daubechies wavelets were successfully used at different decomposition levels for this purpose. Symlet2 gave the best performance in terms of auto-detection sensitivity.


## 1. Introduction

A Head-Related Transfer Function (HRTF) is defined as the ratio of the frequency
response at the ear drum to that at a sound source for both ears. HRTFs comprise all
acoustical cues to a sound-source location that are available from that location ^{[1]}. Together with the interaural time difference (ITD) and interaural level difference
(ILD), HRTFs are considered the main cues for sound-source localization. They are
usually measured for individuals at certain limited locations in terms of azimuth
and elevation.

HRTFs are required to create 3D virtual auditory displays (VADs) using headphones.
VADs have many applications, including psychoacoustic and physiological research,
industry, virtual training, driving training simulation ^{[2]}, virtual aviation ^{[3]}, and battle environment simulation ^{[4]}. There are also many other applications for VADs in communication, multimedia, mobile
products, and clinical auditory evaluations ^{[5]}.

A Head-Related Impulse Response (HRIR) is the time-domain counterpart of the HRTF. One common method to measure it is to generate a Dirac delta impulse at a sound source and measure the output at a microphone located at a subject’s eardrum in an anechoic room. This must be done for each direction in 3D space because the result depends significantly on the direction. Different techniques and algorithms have been proposed and implemented to build HRTFs at locations other than the measured ones or at finer resolution [6-8].

Scattering and reflections from the torso and shoulders of a subject shape the HRTF at frequencies below 3 kHz ^{[9]}. Accordingly, the geometry of these body parts does not affect the shape of the HRTF above 3 kHz. Previous studies on humans show that there are prominent spectral ``notches'' and ``peaks'' in HRTFs above 4-5 kHz. These are dominant cues for the elevation and azimuth angles of a sound-source location, which are essential for sound localization, especially for elevation and for determining whether a sound is in front of or behind an observer ^{[10]}. Many of these spectral features are caused by pinna reflections and diffractions, which act as a filter in the frequency domain.

The absence or presence of the peaks gives a strong indication of the sound-source elevation ^{[10]}. For example, a one-octave peak at 7-9 kHz was presented as an indication of elevations around 90$^{\circ}$ ^{[11]}. However, spectral peaks do not show as smooth a trend with changes in elevation as the spectral notches do ^{[10]}.

The spectral location of the first prominent spectral minimum is called the first notch. Human data have shown that its center frequency changes from around 6 kHz to around 12 kHz as the elevation of a sound source varies from -15$^{\circ}$ to 45$^{\circ}$ at a fixed azimuth angle ^{[12]}. The first notch is due to the ear concha and is considered one of the important features for elevation perception of a sound source ^{[11]}.

HRTFs depend on the shape and geometry of the head, external ears, and body parts,
which interact with received sound waves. Because of this, HRTFs can be quite different
for different individuals for a given location in space ^{[13]}. In order to have a full implementation of a complete VAD for a certain individual,
the HRTFs need to be measured or synthesized for all directions (i.e., all elevation
and all azimuth angles). Higher directional resolution results in smoother and more
effective directional hearing for VADs. The most popular far-field HRTF databases
use a directional resolution of 5$^{\circ}$ to 15$^{\circ}$ for both azimuths and
elevations. However, measuring HRTFs in all directions for every subject is expensive and impractical, and it requires extensive preparation.

One of the solutions to this problem is using a structural model of a subject
in order to build an individualized HRTF. These models are based on synthesizing the
HRTFs based on the anthropometry of the subject, especially the geometry of the pinna,
head, torso, and shoulders ^{[14]}. Therefore, we hypothesize that if the notch and peak frequencies of one individual's HRTF are close to those of another's, then using the first individual's HRTF for the second is more suitable than using HRTFs of individuals whose spectral notch and peak frequencies differ significantly.

Taking a few measurements at certain locations for an individual can be used to
auto-detect the frequencies of the spectral notches and peaks and to compare their
values to the notch and peak frequencies in currently available HRTF databases. Those with closer notch and peak frequencies at the same measured locations can
be used as indications for suitable HRTFs for an individual. To do this, we need to
automatically detect the main notches in a measured HRTF for an individual for comparison.
Wavelet multi-resolution analysis has been successfully used for auto-detection of
events, including notches and peaks in non-stationary signals ^{[15,16]}. It was used in this study for the auto-detection of main spectral notches in measured
HRTFs.

The rest of the paper is organized as follows. Section 2 describes the database used, the 3D reference coordinate system, data pre-processing, the role of spectral notches in direction estimation, and the discrete wavelet transform. Section 3 discusses the results of applying wavelet multi-resolution analysis to the HRTFs to auto-detect the spectral notches. Section 4 concludes the paper.

## 2. Methods

### 2.1 Database and Coordinate System

An interaural polar coordinate system was used in this study. The elevation (EL) represents the latitude, and the source azimuth (AZ) represents the longitude. The location at (AZ=0$^{\circ}$, EL=0$^{\circ}$) corresponds to the direction in front of the subject. Negative elevations are below the horizontal plane, and positive elevations are above it. EL=90$^{\circ}$ corresponds to the direction directly above the subject’s head, and (AZ=180$^{\circ}$, EL=0$^{\circ}$) corresponds to the direction directly behind it. Negative azimuth angles are to the left side, and positive ones are to the right of the subject.

In this study, we used HRIRs from the Center for Image Processing and Integrated
Computing-University of California (CIPIC) database ^{[17]}. It contains HRIRs for 43 subjects with 27 anthropometric measurements for subjects’
heads, torsos, and pinnae. For each subject, HRIRs are measured at azimuth angles
between -80$^{\circ}$ and 80$^{\circ}$ and elevation angles between -45$^{\circ}$
and 230.625$^{\circ}$. There are a total of 1250 directions for each subject, and
the sampling frequency is 44.1 kHz. HRTFs have been calculated in this study from
HRIRs by taking the Fourier Transform using 512 points with a frequency resolution
Δf of 86.13 Hz.
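The HRIR-to-HRTF conversion described above (a 512-point FFT of a 44.1-kHz HRIR) can be sketched in Python as follows; the HRIR here is a synthetic stand-in, since a real one would be loaded from the CIPIC database:

```python
import numpy as np

FS = 44100          # CIPIC sampling frequency (Hz)
NFFT = 512          # FFT length used in the paper

# Hypothetical HRIR stand-in: a decaying noise burst (a real HRIR
# would be read from the CIPIC database instead).
rng = np.random.default_rng(0)
hrir = rng.standard_normal(200) * np.exp(-np.arange(200) / 30.0)

# Magnitude spectrum of the HRTF from a 512-point FFT (phase ignored).
hrtf_mag = np.abs(np.fft.rfft(hrir, n=NFFT))
hrtf_db = 20 * np.log10(hrtf_mag + 1e-12)

# Frequency resolution: 44100 / 512 ≈ 86.13 Hz, matching the paper.
delta_f = FS / NFFT
freqs = np.arange(len(hrtf_mag)) * delta_f
```

The frequency axis `freqs` then spans 0 Hz to 22.05 kHz in steps of Δf.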

For the purpose of this study, we used directions at elevations between -45$^{\circ}$ and 45$^{\circ}$ in the median plane as an example (i.e., at an azimuth angle of 0$^{\circ}$ for the right ear of some randomly selected subjects from the CIPIC database). Subjects 3, 8, 9, 10, 11, and 12 were selected in this study. Matlab® 2014 was used for reading the data, pre-processing, wavelet multi-resolution analysis, and notch auto-detection of the HRTFs’ spectral notches.

### 2.2 Pre-processing of Data

HRIRs were windowed with a 2-ms Hanning window in order to remove echoes in the raw data, including reflections caused by the torso, shoulders, and knees.
This causes indirect smoothing for the HRTFs. Smoothing in the frequency domain does
not affect the localization capability given that the main spectral features are kept
^{[18]}. Phase responses are ignored, and only magnitudes of the HRTFs are considered because
many studies have proven that HRTFs can be accurately represented by their minimum
phase spectra. The reason is that the auditory system is not sensitive to the absolute
phase of a sound applied to a single ear ^{[18,19]}.
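A minimal sketch of this windowing step, assuming a 44.1-kHz HRIR and the 2-ms Hanning window described above (the raw HRIR below is synthetic):

```python
import numpy as np

FS = 44100                           # CIPIC sampling frequency (Hz)
WIN_MS = 2.0                         # window length in milliseconds
win_len = int(FS * WIN_MS / 1000)    # 88 samples at 44.1 kHz

rng = np.random.default_rng(1)
raw_hrir = rng.standard_normal(512)  # hypothetical raw measurement

# Apply the 2-ms Hanning window from the onset to suppress late
# reflections (torso, shoulders, knees) in the raw HRIR.
window = np.hanning(win_len)
windowed = raw_hrir[:win_len] * window
```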

### 2.3 HRTF Spectral Notches

Notches and peaks in the HRTFs are direction-dependent, so they indirectly provide information about the direction of a sound. In addition, they depend on the shape and size of the pinna, which are different among individuals. Fig. 1 presents an example of the notches and peaks in the right-ear HRTF at the location of (AZ=0$^{\circ}$, EL=-45$^{\circ}$) for subject 10 of the CIPIC database.

### 2.4 Discrete Wavelet Transform

A wavelet is a limited-duration waveform that is irregular and often non-symmetrical.
Its average value equals zero, and it has the capability to describe abnormalities,
pulses, and other events. Wavelet analysis involves the decomposition of a signal
using an orthonormal group of basis functions, such as the sines and cosines in a
Fourier series. Scaling or dilation in wavelet terminology means stretching the wavelet
in time, which is related to the frequency in Fourier series terminology. Translation
in wavelet terminology is the shifting of the wavelet to the right or left in the
time domain. A ``mother wavelet'' refers to an unstretched wavelet. A Continuous Wavelet
Transform (CWT) uses all possible continuous shifts and stretches of the wavelet,
while a Discrete Wavelet Transform (DWT) stretches and shifts the wavelet on a dyadic
scale using powers of 2 (e.g., 2, 4, 8, 16, etc.) ^{[20]}.

Wavelet decomposition splits a signal into two parts using high-pass and low-pass filters. Using more filters splits the signal into more parts. A low-pass filter (scaling function filter) gives a smoothed version and approximation of the signal, while a high-pass filter (wavelet filter) gives the details. When details and approximations are added together, they can reconstruct the original signal.
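This split-and-reconstruct property can be illustrated with the PyWavelets library (assumed available); the input signal is arbitrary noise:

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(2)
x = rng.standard_normal(256)

# One decomposition step: low-pass -> approximation cA,
# high-pass -> detail cD (each branch decimated by 2).
cA, cD = pywt.dwt(x, 'sym2')

# Combining the two branches (inverse DWT) recovers the signal.
x_rec = pywt.idwt(cA, cD, 'sym2')
```

`np.allclose(x, x_rec)` holds, illustrating that details plus approximation reconstruct the original signal.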

Usually, each approximation is split into further approximations and details, and so on. Selecting certain levels of details or approximations can be used to isolate certain events or parts of a signal within a certain range of frequencies. Convolution of the wavelet function ψ(t) with the signal x(t) gives the wavelet transform, T, while convolution of x(t) with the scaling function ϕ(t) produces the approximation coefficient, S.

The discrete wavelet transform (DWT) can be expressed as:

$$T_{m,n}=\int_{-\infty}^{\infty}x(t)\,\psi_{m,n}(t)\,dt,\qquad \psi_{m,n}(t)=2^{-m/2}\,\psi\!\left(2^{-m}t-n\right)$$

The coefficient of the signal approximation at scale m and location n can be expressed as:

$$S_{m,n}=\int_{-\infty}^{\infty}x(t)\,\phi_{m,n}(t)\,dt$$

Fig. 2. A 3-level discrete wavelet transform. Each filter (high pass or low pass) is followed by decimation or down-sampling by 2. cA1 represents the first-level approximation coefficients, cD2 represents the second-level detail coefficients, cA3 represents the third-level approximation coefficients, etc.

For a discrete input signal of finite length and a range of scales 0 < m < M, a discrete approximation of the signal can be expressed as ^{[21]}:

$$x_{0}(t)=x_{M}(t)+\sum_{m=1}^{M}d_{m}(t)$$

where $x_{M}(t)=\sum_{n}S_{M,n}\,\phi_{M,n}(t)$ is the signal approximation at scale M, and the signal detail at scale m is expressed as:

$$d_{m}(t)=\sum_{n}T_{m,n}\,\psi_{m,n}(t)$$

Usually, approximations are repeatedly divided into low frequencies (approximations) and high frequencies (details) to find the next level of wavelet analysis using more filters, as shown in Fig. 2. This figure shows a three-level wavelet decomposition as an example. The low and high pass filters’ impulse responses are dependent on the chosen wavelet.
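The repeated splitting in Fig. 2 corresponds to a multi-level decomposition; a sketch using PyWavelets, with an arbitrary test signal standing in for an HRTF:

```python
import numpy as np
import pywt

rng = np.random.default_rng(3)
signal = rng.standard_normal(512)

# Three-level decomposition as in Fig. 2: each approximation is
# split again, leaving coefficients [cA3, cD3, cD2, cD1].
coeffs = pywt.wavedec(signal, 'sym5', level=3)
cA3, cD3, cD2, cD1 = coeffs

# waverec adds the approximation and details back together.
reconstructed = pywt.waverec(coeffs, 'sym5')
```

The coefficient list is ordered coarsest-first, matching the bottom-up structure of Fig. 2.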

There are many kinds of wavelets, such as Haar, Daubechies, Biorthogonal, Symlet, and Coiflet wavelets. Symlets 2 through 8 and Daubechies 2 and 3 wavelets were tested in this study because of their similarity in shape to the HRTF spectral notches at different directions among the subjects, which makes them suitable for the auto-detection problem. The decomposition analysis was done up to level 6 for each wavelet. Fig. 3 shows examples of some of the Symlet wavelets used in this study. The proposed algorithm determines the most suitable HRTF set for an individual from a database, as presented in Fig. 4.

## 3. Results and Discussion

Fig. 5 shows an example of an HRTF wavelet decomposition using one of the tested wavelets, Symlet5, up to level 6. Low-level details represent the highest frequencies in the HRTF. The energy of the main notches was noticed in all detail levels from level 1 to level 5 (i.e., $\textit{D}$$_{1}$ to $\textit{D}$$_{5}$). The first three levels ($\textit{D}$$_{1}$, $\textit{D}$$_{2}$, and $\textit{D}$$_{3}$) have a clear resemblance to the spectral notches compared to other signal information in these levels. Therefore, these three levels were used for the auto-detection of the main notches in the HRTFs.

Fig. 5. HRTF at 0° azimuth and -45° elevation (AZ=0°, EL= -45°) for subject 10 and its wavelet decomposition up to level 6 using Symlet5.

To give more significance to the highest frequency components, detail $\textit{D}$$_{1}$ coefficients were multiplied by a higher factor. The reconstructed signal from wavelet levels $\textit{D}$$_{1}$, $\textit{D}$$_{2}$, and $\textit{D}$$_{3}$ was used according to the following proposed equation, which gives higher weight to the lower-level details. The weights of each level were selected empirically for the notches of the database subjects to maximize the auto-detection sensitivity:

$$R=\mathrm{IDWT}\left(a_{1}D_{1},\;a_{2}D_{2},\;a_{3}D_{3}\right),\qquad a_{1}>a_{2}>a_{3}$$

where R represents the reconstructed signal using the inverse discrete wavelet transform (IDWT) of weighted $\textit{D}$$_{1}$, $\textit{D}$$_{2}$, and $\textit{D}$$_{3}$ details.
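A sketch of this weighted-detail reconstruction using PyWavelets; the paper's fitted weights are not reproduced here, so the values below are illustrative placeholders, and the input is a synthetic stand-in for an HRTF magnitude curve:

```python
import numpy as np
import pywt

rng = np.random.default_rng(4)
hrtf_mag = rng.standard_normal(512)  # stand-in for an HRTF magnitude curve

# Decompose to level 6 with Symlet5, as in Fig. 5.
coeffs = pywt.wavedec(hrtf_mag, 'sym5', level=6)
# coeffs = [cA6, cD6, cD5, cD4, cD3, cD2, cD1]

# Keep only D1-D3, weighted so lower levels (higher frequencies)
# dominate. These weights are illustrative placeholders, NOT the
# values fitted in the paper.
w = {1: 3.0, 2: 2.0, 3: 1.0}
kept = [np.zeros_like(c) for c in coeffs]
kept[-1] = w[1] * coeffs[-1]   # D1
kept[-2] = w[2] * coeffs[-2]   # D2
kept[-3] = w[3] * coeffs[-3]   # D3

# R: reconstruction (IDWT) from the weighted details only.
R = pywt.waverec(kept, 'sym5')
detector = np.abs(R) ** 2      # squared-absolute signal for peak picking
```

The squared-absolute signal `detector` is what the local-peak search operates on.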

The reconstructed signal from these details for the HRTF example in Fig. 5 is shown in Fig. 6. Locations of the spectral notches are simply auto-detected and marked as local peaks of the squared-absolute reconstructed signal in Figs. 6 and 7. These figures show examples of notch auto-detection at two different directions using Symlet5 for subjects 10 and 11, respectively.

Local peaks of the squared-absolute reconstructed signals were selected simply as frequency samples larger than their neighboring samples, restricted to one peak per 1-kHz window, because it is unusual to have more than one main spectral notch within this frequency range. All peaks higher than 1% of the maximum of the squared-absolute reconstructed signal, which served as the peaks' amplitude threshold, were detected.
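The peak-picking rule above (one peak per 1-kHz window, 1% amplitude threshold) can be sketched with `scipy.signal.find_peaks`; the detector signal below is synthetic, with two artificial bumps standing in for notch locations:

```python
import numpy as np
from scipy.signal import find_peaks

DELTA_F = 44100 / 512            # ≈ 86.13 Hz per frequency bin

# Hypothetical detector signal: squared-absolute reconstruction with
# two synthetic "notch" bumps near 6 kHz and 9 kHz.
freqs = np.arange(257) * DELTA_F
detector = (np.exp(-((freqs - 6000) / 200) ** 2)
            + 0.5 * np.exp(-((freqs - 9000) / 200) ** 2))

# One peak per 1-kHz window, with a 1% amplitude threshold.
min_dist = int(1000 / DELTA_F)   # ≈ 11 bins between accepted peaks
peaks, _ = find_peaks(detector,
                      distance=min_dist,
                      height=0.01 * detector.max())
detected_freqs = freqs[peaks]
```

Both synthetic bumps are recovered to within one frequency bin of their true centers.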

According to the Fourier transform applied to the HRIRs, the frequency resolution
for the processed data, Δf, is 86.13 Hz. An analysis was done on CIPIC data subjects
3, 8, 9, 10, 11, and 12. Spectral notches located between 4 kHz and 16 kHz were considered
for the analysis in this study because pinna cues usually lie in this range of frequencies
^{[22]}. Furthermore, this range contains essential cues for sound localization ^{[12]}. The total number of notches in the selected subjects at the stated locations is 238.

Fig. 6. (a) HRTF at (AZ=0°, EL=-45°) direction for Subject 10; (b) Reconstruction signal according to Eq. (5) using Symlet5; (c) Absolute square of signal in (b) with the auto-detected local peaks as small red circles.

Fig. 7. (a) HRTF at (AZ=0°, EL=-39.375°) direction for Subject 11; (b) Reconstruction signal according to Eq. (5) using Symlet5; (c) Absolute square of signal in (b) with the auto-detected local peaks as small red circles.

The performance of the auto-detection capability of the selected wavelets is presented in Table 1. The results are sorted by auto-detection sensitivity. The sensitivity $\textit{S}$ is defined as:

$$S=\frac{TP}{TP+FN}\times 100\%$$

where $\textit{TP}$ and $\textit{FN}$ represent the number of true positives (correctly detected notches) and the number of false negatives (missed notches), respectively.
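The sensitivity computation is straightforward; a minimal sketch:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Detection sensitivity in percent: TP / (TP + FN) * 100."""
    return 100.0 * tp / (tp + fn)

# Example: if all 238 notches in the analyzed set are detected
# (as with sym2), the sensitivity is 100%.
s_all = sensitivity(238, 0)   # -> 100.0
```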

Table 1. Performance of wavelets on HRTF spectral notches auto-detection.

| Wavelet | Sensitivity (%) |
|---------|-----------------|
| sym2    | 100             |
| db2     | 99.6            |
| sym3    | 92.9            |
| db3     | 92.4            |
| sym4    | 90.8            |
| sym5    | 89.1            |
| sym6    | 87.4            |
| sym7    | 86.1            |
| sym8    | 86.1            |

Around 70% of the auto-detected notches were detected with the exact central frequency of the manually examined ones. Around 28% were auto-detected within ${\pm}$Δf of the actual notch frequency, and 2% differed by ${\pm}$2Δf from the actual central frequency. These values are almost the same among all wavelets tested. A slight difference occurred between the actual notch frequency and the auto-detected one when the actual notch was very shallow and not deep enough to be auto-detected accurately. However, these shallow spectral notches do not play an important role in sound-source localization compared to the deep spectral notches, because they are not associated with significant reflections.

All deep spectral notches were auto-detected accurately, without any difference from the original notches' central frequencies. Usually, the measured HRIRs and the calculated HRTFs are normalized, so when the depth of a notch is discussed, we refer to the relative attenuation in the frequency response. A steeper slope and lower relative amplitude of the spectral notch result in a higher amplitude in the squared-absolute reconstructed signal, which gives a direct indication of the depth of the notch.

Many studies have proposed different algorithms to find individualized HRTFs.
Some of these studies describe the relation between anthropometric parameters of the
subjects, especially their pinnae, and the HRTF features at different locations ^{[14]}. HRTFs are modeled accordingly, given that the HRTF describes the interaction between sound waves and the geometry of the human head, torso, and shoulders. This approach is complicated and requires accurate estimation of the anthropometric parameters and a clear understanding of these parameters and how they characterize the HRTF.

Other studies model the HRTFs measured at certain directions and then estimate
HRTFs at all other locations using different interpolation methods ^{[6,8]}. Most of these studies validated the interpolation over a limited range of azimuth and elevation angles, and some of the models have high computational complexity. Although the proposed algorithm does not create or model an individualized set of HRTFs for a subject, it can be used to find the closest set of HRTFs among
available HRTF databases that have already been measured in different institutions
and labs around the world. Thus, it can be used for a subject to save time and effort
and to provide a good approximation for individualized HRTFs.

## 4. Conclusions

Spectral notches of HRTFs play an important role as spectral cues for sound-source localization in humans. Accurate auto-detection and estimation of the spectral notches is an important step in checking the similarity between the HRTFs of a given subject and those in databases in order to find a suitable HRTF set for that subject. Wavelet multi-resolution analysis using the first three detail levels of Symlet2 through Symlet8, Daubechies2, and Daubechies3 wavelets was used successfully to auto-detect the frequencies of spectral notches in the HRTFs.

Symlet2 outperformed the other tested wavelets in terms of auto-detection capability, detecting all spectral notches in all tested HRTFs. Most of the auto-detected notches were detected at the exact central frequency. Future work remains to validate the proposed method with subjective listening tests, as well as to test more directions and more subjects.

### REFERENCES

## Author

Bahaa Al-Sheikh received a B.Sc. degree in electronics engineering from Yarmouk University, Jordan, an MSc in electrical engineering from Colorado State University, Colorado, USA, and a PhD in biomedical engineering from the University of Denver, Colorado, USA, in 2000, 2005, and 2009, respectively. Between 2009 and 2015, he worked for Yarmouk University as an assistant professor in the department of Biomedical Systems and Medical Informatics Engineering and served as the department chairman between 2010 and 2012. He served as a part-time consultant for Sand-hill Scientific Inc., Highlands Ranch, Colorado, USA, in biomedical signal processing between 2009 and 2014. Currently, he is an associate professor at the Electrical Engineering Department at the American University of the Middle East in Kuwait. His research interests include digital signal and image processing, biomedical systems modeling, medical instrumentation, and sound-source localization systems.

Mohammad Shukri Salman received B.Sc., M.Sc. and Ph.D. degrees in electrical and electronics engineering from Eastern Mediterranean University (EMU) in 2006, 2007, and 2011, respectively. From 2006 to 2010, he was a teaching assistant in the Electrical and Electronics Engineering department at EMU. In 2010, he joined the Department of Electrical and Electronic Engineering at European University of Lefke (EUL) as a senior lecturer. For the period of 2011-2015, he has worked as an assist. prof. in the Department of Electrical and Electronics Engineering, Mevlana (Rumi) University, Turkey. Currently, he is an Assoc. Prof. at the Electrical Engineering Department at the American University of Middle East in Kuwait. He has served as a general chair, program chair, and TPC member for many international conferences. His research interests include signal processing, adaptive filters, image processing, sparse representation of signals, control systems, and communications systems.

Alaa Eleyan received B.Sc. and M.Sc. degrees in electrical & electronics engineering from Near East University, Northern Cyprus, in 2002 and 2004, respectively. In 2009, he finished his PhD degree in electrical and electronics engineering at Eastern Mediterranean University, Northern Cyprus. Dr. Eleyan did his post-doctorate studies at Bilkent University in 2010. He has nearly two decades of working experience as both a research assistant and faculty member in different universities in Northern Cyprus and Turkey. Currently, he is working as an associate professor at Ankara Science University in Turkey. His current research interests are computer vision, signal & image processing, pattern recognition, machine learning, and robotics. He has more than 60 published journal articles and conference papers in these research fields. Dr. Eleyan has served as a general chair of many international conferences, such as ICDIPC2019, DIPECC2018, TAEECE2018, and DICTAP2016.