Research Article | NEUROSCIENCE

Speaker-independent auditory attention decoding without access to clean speech sources


Science Advances, 15 May 2019:
Vol. 5, no. 5, eaav6134
DOI: 10.1126/sciadv.aav6134
  • Fig. 1 Schematic of the proposed brain-controlled assistive hearing device.

A brain-controlled assistive hearing device can automatically amplify one speaker among many. To accomplish this, a deep neural network automatically separates each speaker from the mixture and compares each separated speaker with the neural data recorded from the user’s brain. The speaker that best matches the neural data is then amplified to assist the user.
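    The decode-and-amplify loop of Fig. 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper’s implementation: it assumes the separated speaker spectrograms and a spectrogram reconstructed from the user’s neural data are already available and time-aligned, and the function and parameter names (`select_and_amplify`, `gain_db`) are hypothetical.

    ```python
    import numpy as np

    def select_and_amplify(separated, reconstructed, gain_db=12.0):
        """Pick the separated speaker whose spectrogram best matches the
        spectrogram reconstructed from the listener's neural data, then
        amplify that speaker relative to the rest of the mixture.

        separated     -- list of (T, F) magnitude spectrograms, one per speaker
        reconstructed -- (T, F) spectrogram reconstructed from neural recordings
        """
        # Correlate each separated speaker with the neural reconstruction.
        r = [np.corrcoef(s.ravel(), reconstructed.ravel())[0, 1]
             for s in separated]
        attended = int(np.argmax(r))

        # Amplify the attended speaker by `gain_db` relative to the others
        # and remix; a real device would resynthesize audio from this.
        gain = 10.0 ** (gain_db / 20.0)
        remixed = sum(s * (gain if i == attended else 1.0)
                      for i, s in enumerate(separated))
        return attended, remixed
    ```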

  • Fig. 2 Speaker-independent speech separation with ODAN.

(A) Flowchart of ODAN speech separation. (B) The T-F representation of the mixture sound is projected into a high-dimensional embedding space in which T-F points belonging to the same speaker cluster together. (C) The centers of the speaker representations in the embedding space are referred to as attractors. The distance between the embedded T-F points and the attractors defines a mask for each speaker, which multiplies the T-F representation of the mixture to extract that speaker. (D) The locations of the attractors are updated at each time step: first, the previous attractor locations are used to determine the speaker assignment of the current frame. (E) Then, the attractors are updated as a weighted average of the previous attractors and the centers of the current frame defined by the speaker assignments.
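    A minimal sketch of the causal, per-frame attractor step in panels (D) and (E), assuming the embedding network has already mapped the current frame into the embedding space. The softmax assignment and the fixed mixing weight `alpha` are simplifications standing in for the paper’s learned update parameter q; all names here are illustrative.

    ```python
    import numpy as np

    def odan_frame_step(emb_t, mix_t, attractors, alpha=0.1):
        """One causal time step of attractor-based masking (illustrative).

        emb_t      -- (F, K) embeddings of the F frequency bins of this frame
        mix_t      -- (F,) magnitude spectrum of the mixture at this frame
        attractors -- (C, K) current attractor for each of the C speakers
        """
        # (D) Soft-assign each T-F bin to the nearest attractor.
        scores = emb_t @ attractors.T                        # (F, C)
        scores -= scores.max(axis=1, keepdims=True)          # stability
        assign = np.exp(scores)
        assign /= assign.sum(axis=1, keepdims=True)

        # Distances to attractors define the masks; masking extracts speakers.
        separated = assign.T * mix_t                         # (C, F)

        # (E) Move each attractor toward the weighted center of the bins
        # currently assigned to it (weighted average with previous attractor).
        w = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
        centers = w.T @ emb_t                                # (C, K)
        attractors = (1.0 - alpha) * attractors + alpha * centers
        return separated, attractors
    ```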

  • Fig. 3 Evaluating the accuracy of speech separation and attention decoding methods.

(A) Comparison of the separation between the representations of the two speakers in the T-F space (left) and the embedding space (right). The axes represent the first two principal components of the data, used to allow visualization. Each dot represents one T-F bin (left) or one embedded T-F bin (right), colored by the relative power of the two speakers in that bin. (B) Separation accuracy as a function of time. The dashed line shows the time at which the speakers in the mixture are switched. (C) Correlation values between the reconstructed spectrograms (from neural data) and the attended/unattended spectrograms. Correlation values were significantly higher for the attended speaker (paired t test, P < 0.001; Cohen’s d = 0.8), confirming the effect of attention in the neural data. The correlation with the clean spectrograms was slightly higher than that with the ODAN outputs, but the differences between the attended and unattended speakers were the same for both clean and ODAN outputs. (D) Attention decoding: the percentage of segments in which the attended speaker was correctly identified, for varying correlation window lengths, when using ODAN and the actual clean spectrograms. There was no significant difference between using the clean and the ODAN spectrograms (Wilcoxon rank sum test, P = 0.9). (E) Dynamic switching of attention was simulated by segmenting and concatenating the neural data into alternating 60-s bins. The dashed line indicates switching attention. The average correlation values from one subject are shown using a 4-s window size for both ODAN and the actual clean spectrograms. The shaded regions denote SE. (F) The transition time in detecting a switch of attention was calculated as the time at which the correlation difference between the two speakers crossed zero. The average transition time across subjects increased with larger window sizes; however, there was no significant difference between the transition time of ODAN and that of the actual clean spectrograms (Wilcoxon rank sum test, P > 0.6).
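    Panels (D) and (F) both reduce to windowed correlations between the neural reconstruction and the two candidate spectrograms. The following is a simplified sketch (non-overlapping windows, hypothetical names), not the authors’ analysis code:

    ```python
    import numpy as np

    def decode_attention(recon, spk1, spk2, win):
        """Windowed attention decoding (illustrative).

        recon, spk1, spk2 -- aligned (T, F) spectrograms: the neural
        reconstruction and the two candidate (ODAN or clean) spectrograms.
        win -- correlation window length in frames.
        Returns the per-window correlation difference r1 - r2; positive
        values decode speaker 1 as attended.
        """
        diffs = []
        for t0 in range(0, recon.shape[0] - win + 1, win):
            sl = slice(t0, t0 + win)
            r1 = np.corrcoef(recon[sl].ravel(), spk1[sl].ravel())[0, 1]
            r2 = np.corrcoef(recon[sl].ravel(), spk2[sl].ravel())[0, 1]
            diffs.append(r1 - r2)
        return np.array(diffs)

    def transition_time(diffs, switch_win):
        """Windows elapsed after a known attention switch (at window index
        switch_win >= 1) before the correlation difference crosses zero."""
        pre = np.sign(diffs[switch_win - 1])
        for i in range(switch_win, len(diffs)):
            if np.sign(diffs[i]) != pre:
                return i - switch_win
        return None  # no crossing detected
    ```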

  • Fig. 4 Improved subjective quality and objective quality and intelligibility of the ODAN-AAD system.

(A) Subjective listening test to determine the ease of attending to the target speaker. Twenty healthy subjects rated the difficulty of attending to the target speaker when listening to (i) the raw mixture, (ii) the ODAN-AAD amplified target speaker, and (iii) the clean-AAD amplified target speaker. The detected target speakers in (ii) and (iii) were amplified by 12 dB relative to the interfering speakers. Difficulty was rated on a scale of 1 to 5 [mean opinion score (MOS)]. The bar plots show the median MOS ± SE for each condition. The enhancement of the target speaker for the ODAN-AAD and clean-AAD systems was 100 and 118%, respectively (P < 0.001). (B and C) Objective quality [perceptual evaluation of speech quality (PESQ)] and intelligibility [extended short-time objective intelligibility (ESTOI)] improvement of the target speech in the same three conditions as in (A). ****P < 0.0001, t test.
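    The objective metrics in (B) and (C) can be computed with the third-party `pesq` and `pystoi` Python packages (`pip install pesq pystoi`); this is a generic usage sketch, not the paper’s evaluation code:

    ```python
    from pesq import pesq     # perceptual evaluation of speech quality
    from pystoi import stoi   # (extended) short-time objective intelligibility

    def objective_scores(clean, enhanced, fs=16000):
        """Score an enhanced waveform against the clean target waveform."""
        return {
            "PESQ": pesq(fs, clean, enhanced, "wb"),            # wide-band mode
            "ESTOI": stoi(clean, enhanced, fs, extended=True),  # ESTOI variant
        }
    ```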

  • Table 1 Comparison of speech separation accuracy of ODAN with two other methods for separating two-speaker mixtures (WSJ0-mix2 dataset) and three-speaker mixtures (WSJ0-mix3 dataset).

The separation accuracy of ODAN, a causal system, is slightly worse than but comparable to that of the other, noncausal methods.

| Number of speakers | Model            | Causal | SI-SNRi (dB) | SDRi (dB) | PESQ | ESTOI |
|--------------------|------------------|--------|--------------|-----------|------|-------|
| Two speakers       | Original mixture | —      | 0            | 0         | 2.02 | 0.56  |
|                    | DAN-LSTM (11)    | No     | 9.1          | 9.5       | 2.73 | 0.77  |
|                    | uPIT-LSTM (15)   | Yes    | —            | 7.0       | —    | —     |
|                    | ODAN             | Yes    | 9.0          | 9.4       | 2.70 | 0.77  |
| Three speakers     | Original mixture | —      | 0            | 0         | 1.66 | 0.39  |
|                    | DAN-LSTM (11)    | No     | 7.0          | 7.4       | 2.13 | 0.56  |
|                    | uPIT-BLSTM (15)  | No     | —            | 7.4       | —    | —     |
|                    | DPCL++ (50)      | No     | —            | 7.1       | —    | —     |
|                    | ODAN             | Yes    | 6.7          | 7.2       | 2.03 | 0.55  |
  • Table 2 Speech separation accuracy of ODAN in separating one-, two-, and three-speaker mixtures (WSJ0-mix2 and WSJ0-mix3 datasets).

ODAN was trained on both the WSJ0-mix2 and WSJ0-mix3 datasets, and the same trained model was used in all cases.

| Number of speakers | Causal | SI-SNRi (dB) | SDRi (dB) | PESQ | ESTOI |
|--------------------|--------|--------------|-----------|------|-------|
| 3                  | Yes    | 7.0          | 7.5       | 2.08 | 0.56  |
| 2                  | Yes    | 8.9          | 9.3       | 2.63 | 0.76  |
| 1                  | Yes    | 24.4*        | 25.0*     | 4.14 | 0.98  |

*For the one-speaker condition, absolute SI-SNR and SDR (dB) are reported rather than improvements.
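For reference, the SI-SNRi reported in Tables 1 and 2 is the scale-invariant signal-to-noise ratio of the separated output minus that of the unprocessed mixture, both measured against the clean reference. Below is a standard numpy sketch of this metric, not the authors’ code:

```python
import numpy as np

def si_snr(est, ref):
    """Scale-invariant signal-to-noise ratio in dB."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    target = (np.dot(est, ref) / np.dot(ref, ref)) * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```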

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/5/5/eaav6134/DC1

    Fig. S1. Electrode coverage and speech responsiveness for each subject.

Fig. S2. The change in the update parameter of the attractors (parameter q in Methods) when the speakers in the mixture switch.

    Movie S1. The full demo of the proposed ODAN-AAD system.

