Research Article | Seismology

Earthquake detection through computationally efficient similarity search


Science Advances  04 Dec 2015:
Vol. 1, no. 11, e1501057
DOI: 10.1126/sciadv.1501057
  • Fig. 1 Comparison of earthquake detection methods in terms of three qualitative metrics: Detection sensitivity, general applicability, and computational efficiency.

    STA/LTA scores high on general applicability because it finds unknown sources, scores high on computational efficiency because it detects earthquakes in real time, but scores low on detection sensitivity because it can miss low-SNR seismic events. Template matching rates high on detection sensitivity because cross-correlation can find low-SNR events, rates high on computational efficiency because we only need to cross-correlate continuous data with a small set of template waveforms, but rates low on general applicability because template waveforms need to be determined in advance. Autocorrelation has high detection sensitivity because it cross-correlates waveforms, and high general applicability because it can find unknown similar sources, but has very low computational efficiency that scales poorly with the size of the continuous data set. FAST performs well with respect to all three metrics, combining the detection sensitivity and general applicability of correlation-based detection with high computational efficiency and scalability.

  • Fig. 2 Map with locations of catalog earthquakes on the Calaveras Fault and seismic station with data.

    Double-difference catalog locations of the 8 January 2011 Mw 4.1 earthquake (red star) and NCSN catalog events (dots) between 8 and 15 January 2011 on the Calaveras Fault, and station CCOB.EHN (white triangle) from which we processed 1 week of data from 8 to 15 January 2011. Blue dots indicate the 21 catalog events detected by FAST, and black dots indicate the 3 catalog events missed by FAST. (Inset) Map location within California (red box).

  • Fig. 3 FAST event detections plotted on 1 week of continuous data.

    Data are from station CCOB.EHN (bandpass, 4 to 10 Hz) starting on 8 January 2011 (00:00:00). FAST detected a total of 89 earthquakes, including 21 of 24 catalog events (blue) and 68 new events (red).

  • Fig. 4 FAST scaling properties as a function of continuous data duration up to 6 months.

    (A) Memory usage for the database generated by locality-sensitive hashing (LSH). (B) FAST total runtime (red) subdivided into runtime for feature extraction (blue) and similarity search (green). Autocorrelation runtimes (purple) for continuous data longer than 1 week are extrapolated based on quadratic scaling (dashed line). These results are from running FAST with the parameters in Table 1, except that the number of hash functions r was increased from 5 to 7, which decreased the total runtime for 1 week of continuous data to under an hour.
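The quadratic extrapolation mentioned in this caption can be illustrated with a short sketch. It assumes the 1-week autocorrelation runtime from Table 2 (9 days 13 hours) as the reference point. The function name and the chosen durations are hypothetical, and the values plotted in Fig. 4 may differ.

```python
def extrapolate_quadratic(ref_runtime_hours, ref_duration_days, duration_days):
    """Scale a measured runtime by the square of the data-duration ratio."""
    ratio = duration_days / ref_duration_days
    return ref_runtime_hours * ratio ** 2

# Reference point: 1 week of data took autocorrelation 9 days 13 hours (Table 2).
REF_HOURS = 9 * 24 + 13  # 229 hours

# Hypothetical durations in days; 182 days stands in for "6 months".
estimates = {
    days: extrapolate_quadratic(REF_HOURS, 7, days) for days in (7, 14, 28, 182)
}

for days, hours in estimates.items():
    print(f"{days:4d} days of data -> about {hours / 24:,.0f} days of runtime")
```

Doubling the data duration quadruples the estimated runtime, which is why autocorrelation becomes impractical well before 6 months of continuous data.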

  • Fig. 5 Feature extraction steps in FAST.

    (A) Continuous time series data. (B) Spectrogram: amplitude on log scale. (C) Spectral images from two similar earthquakes at 1267 and 1629 s. (D) Haar wavelet coefficients: amplitude on log scale. (E) Sign of top standardized Haar coefficients after data compression. (F) Binary fingerprint: output of feature extraction. Notice that similar spectral images result in similar fingerprints.
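Steps (D) to (F) can be sketched in a few lines, assuming the Haar wavelet coefficients of each spectral image are already available as a list of numbers. This is a simplified illustration, not the authors' implementation. The per-coefficient z-score standardization, the two-bits-per-coefficient sign encoding, and all function names are assumptions made for the example.

```python
from statistics import mean, pstdev

def standardize_columns(coeffs):
    """Z-score each coefficient (column) across all windows (rows)."""
    cols = list(zip(*coeffs))
    mus = [mean(c) for c in cols]
    # Guard against constant columns: a zero deviation is replaced by 1.0.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in coeffs]

def fingerprint(z_row, k):
    """Keep the k largest-magnitude standardized coefficients, store only signs.

    Assumed encoding: two bits per coefficient, (1, 0) for positive,
    (0, 1) for negative, (0, 0) for discarded or zero coefficients.
    """
    order = sorted(range(len(z_row)), key=lambda i: abs(z_row[i]), reverse=True)
    bits = [0] * (2 * len(z_row))
    for i in order[:k]:
        if z_row[i] > 0:
            bits[2 * i] = 1
        elif z_row[i] < 0:
            bits[2 * i + 1] = 1
    return bits

# Toy input: 3 windows with 3 hypothetical Haar coefficients each.
coeffs = [[1.0, 5.0, -2.0], [3.0, 5.0, 0.0], [2.0, 5.0, 2.0]]
z = standardize_columns(coeffs)
fp_first = fingerprint(z[0], k=2)
fp_last = fingerprint(z[2], k=2)
```

Because only the signs of the strongest coefficients survive, two windows with similar spectral images tend to share most of their set bits, which is the property the similarity search relies on.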

  • Fig. 6 Example of how LSH groups fingerprints together in the database.

    (A) Example of Min-Hash signatures (MHS) for two similar fingerprints A and B, with MHS length p = 6. (B) LSH decides how to place two similar fingerprints A (blue) and B (green) into hash buckets (ovals) in each hash table (red boxes); waveforms are shown for easy visualization. The MHS length is p = 6, and there are b = 3 hash tables, so each hash table gets a different subset of the MHS of each fingerprint that is 6/3 = 2 integers long: the output of r = 2 Min-Hash functions. Taking each hash table separately: if the MHS subsets of A and B are equal, then A and B enter the same hash bucket in the database; this is true in hash tables 1 and 3, where h(A) = h(B) = [155, 64] and h(A) = h(B) = [110, 21], respectively. In hash table 2, however, the MHS subsets of A and B are not equal, because h(A) = [231, 35] and h(B) = [207, 35], so A and B enter different hash buckets.
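The bucket assignment in this caption can be reproduced with a minimal sketch. It assumes the full signatures are the concatenation of the per-table subsets quoted above, and it uses each subset directly as a dictionary key, whereas a real database would typically hash the subset to a bucket index. Function names are hypothetical.

```python
def split_signature(mhs, b):
    """Split an MHS of length p into b consecutive subsets of r = p // b integers."""
    r = len(mhs) // b
    return [tuple(mhs[t * r:(t + 1) * r]) for t in range(b)]

def build_tables(signatures, b):
    """Place each fingerprint ID into one bucket per hash table."""
    tables = [dict() for _ in range(b)]
    for name, mhs in signatures.items():
        for t, key in enumerate(split_signature(mhs, b)):
            tables[t].setdefault(key, []).append(name)
    return tables

# Signatures assembled from the values in the caption (p = 6, b = 3, r = 2).
signatures = {
    "A": [155, 64, 231, 35, 110, 21],
    "B": [155, 64, 207, 35, 110, 21],
}
tables = build_tables(signatures, b=3)

for t, table in enumerate(tables, start=1):
    print(f"hash table {t}: {table}")
```

A and B share a bucket in hash tables 1 and 3 but land in different buckets in hash table 2, because a single differing integer in a subset changes the key.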

  • Fig. 7 LSH database and similarity search example.

    (A) Database generated using LSH, with b = 3 hash tables (red boxes); each hash table has many hash buckets (ovals). LSH groups similar fingerprints into the same hash bucket with high probability; earthquake signals (colors) are likely to enter the same bucket, whereas noise (black) is grouped into different buckets. (B) Search for waveforms in database similar to query waveform (blue). First, LSH determines which bucket in each hash table has a waveform that matches the query. Next, we take all other waveforms in the same bucket in each hash table and calculate the FAST similarity between each (query, database) waveform pair: the fraction of hash tables containing the pair in the same bucket. The red waveform is in the same bucket as the blue query waveform in all three hash tables, so their similarity is 1; the green waveform is in the same bucket in two of three hash tables; and so on. This figure displays waveforms for easy visualization, but the database stores references to fingerprints in the hash buckets, and a search query requires converting the waveform to its fingerprint.
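The similarity measure described in (B), the fraction of hash tables in which a database item shares a bucket with the query, can be sketched as follows. The bucket keys, item names, and function name are hypothetical placeholders chosen to mirror the red and green waveforms in the caption.

```python
from collections import Counter

def fast_similarity(tables, query_keys, query_id=None):
    """Return {item: fraction of hash tables where item shares the query's bucket}."""
    counts = Counter()
    for table, key in zip(tables, query_keys):
        for item in table.get(key, []):
            if item != query_id:  # do not pair the query with itself
                counts[item] += 1
    b = len(tables)
    return {item: n / b for item, n in counts.items()}

# Toy database with b = 3 hash tables; keys "k1".."k6" are made-up bucket labels.
tables = [
    {"k1": ["blue", "red", "green"], "k2": ["noise1"]},
    {"k3": ["blue", "red"], "k4": ["green", "noise2"]},
    {"k5": ["blue", "red", "green"], "k6": ["noise1", "noise2"]},
]
query_keys = ["k1", "k3", "k5"]  # buckets holding the blue query in each table

similarity = fast_similarity(tables, query_keys, query_id="blue")
```

Only items that share at least one bucket with the query are ever touched, so noise items in other buckets cost nothing at search time.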

  • Table 1 FAST input parameters.

    These were used for detection in synthetic data (except the event detection threshold) and in 1 week of CCOB.EHN data.

    FAST parameter | Value
    Time series window length for spectrogram generation | 200 samples (10 s)
    Time series window lag for spectrogram generation | 2 samples (0.1 s)
    Spectral image window length | 100 samples (10 s)
    Spectral image window lag = fingerprint sampling period | 10 samples (1 s)
    Number of top k amplitude standardized Haar coefficients | 800
    LSH: number of hash functions per hash table r | 5
    LSH: number of hash tables b | 100
    Initial pair threshold: number v (fraction) of tables, pair in same bucket | 4 (4/100 = 0.04)
    Event detection threshold: number v (fraction) of tables, pair in same bucket | 19 (19/100 = 0.19)
    Similarity search: near-repeat exclusion parameter | 5 samples (5 s)
    Near-duplicate pair and event elimination time window | 21 s
    Autocorrelation and catalog comparison time window | 19 s
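The sample counts and durations in Table 1 can be cross-checked with a little arithmetic. The sketch below assumes a 20-Hz sampling rate, inferred from "200 samples (10 s)", and assumes that spectral image samples are spaced by the spectrogram lag. The MHS length p = r × b follows the relation illustrated in Fig. 6. Variable names are hypothetical.

```python
FS_HZ = 20  # assumed sampling rate, inferred from "200 samples (10 s)"

SPEC_WINDOW = 200    # time series samples per spectrogram window
SPEC_LAG = 2         # time series samples between spectrogram windows
IMAGE_WINDOW = 100   # spectrogram columns per spectral image
IMAGE_LAG = 10       # spectrogram columns between spectral images
R_HASH_FUNCS = 5     # hash functions per hash table (r)
B_HASH_TABLES = 100  # number of hash tables (b)

spec_window_s = SPEC_WINDOW / FS_HZ                # 10 s
spec_lag_s = SPEC_LAG / FS_HZ                      # 0.1 s
image_window_s = IMAGE_WINDOW * SPEC_LAG / FS_HZ   # 10 s
image_lag_s = IMAGE_LAG * SPEC_LAG / FS_HZ         # 1 s, the fingerprint sampling period

mhs_length = R_HASH_FUNCS * B_HASH_TABLES          # p = r * b
initial_threshold = 4 / B_HASH_TABLES              # fraction of tables
detection_threshold = 19 / B_HASH_TABLES           # fraction of tables
```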
  • Table 2 Summary of performance comparison between autocorrelation and FAST for several metrics.

    The numbers for metrics 3 to 5 should sum to the number in metric 1.

    Metric | Autocorrelation | FAST
    1. Total number of detected events | 86 | 89
    2. Number of false detections (false positives) | 0 | 12
    3. Number and percentage of catalog detections | 24/24 = 100% | 21/24 = 87.5%
    4. Number of new detections from both algorithms | 43 | 43
    5. Number of new detections from one, missed by the other | 19 | 25
    6. Number of missed detections (false negatives) | 25 | 22
    7. Runtime | 9 days 13 hours | 1 hour 36 min
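The internal consistency of Table 2 and the runtime ratio can be checked with a short sketch. The dictionary layout and names are hypothetical. The reading of metric 6 used here (events found only by the other algorithm, plus any missed catalog events) is an assumption that happens to reproduce the tabulated values.

```python
CATALOG_EVENTS = 24

table2 = {
    "autocorrelation": {"total": 86, "catalog": 24, "new_both": 43, "new_only": 19,
                        "missed": 25, "runtime_min": (9 * 24 + 13) * 60},
    "fast": {"total": 89, "catalog": 21, "new_both": 43, "new_only": 25,
             "missed": 22, "runtime_min": 1 * 60 + 36},
}

def parts_match_total(m):
    """Metrics 3 to 5 should sum to metric 1."""
    return m["catalog"] + m["new_both"] + m["new_only"] == m["total"]

def expected_missed(own, other):
    """Assumed reading of metric 6: the other's unique events plus missed catalog events."""
    return other["new_only"] + (CATALOG_EVENTS - own["catalog"])

auto, fast = table2["autocorrelation"], table2["fast"]
consistent = {name: parts_match_total(m) for name, m in table2.items()}
missed_ok = (expected_missed(auto, fast) == auto["missed"]
             and expected_missed(fast, auto) == fast["missed"])
speedup = auto["runtime_min"] / fast["runtime_min"]

print(consistent, missed_ok, f"FAST is about {speedup:.0f}x faster")
```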

Supplementary Materials

  • Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/1/11/e1501057/DC1

    Continuous data time gaps

    Detection on synthetic data

    Reference code: Autocorrelation

    Near-repeat exclusion of similar pairs

    Postprocessing and thresholding

    Fig. S1. Illustration of comparison between many-to-many search methods for similar pairs of seismic events.

    Fig. S2. Twenty-second catalog earthquake waveforms, ordered by event time in 1 week of continuous data from CCOB.EHN (bandpass, 4 to 10 Hz).

    Fig. S3. Catalog events missed by FAST, detected by autocorrelation.

    Fig. S4. Twenty-second new (uncataloged) earthquake waveforms detected by FAST, ordered by event time in 1 week of continuous data from CCOB.EHN (bandpass, 4 to 10 Hz); FAST found a total of 68 new events.

    Fig. S5. FAST detection errors.

    Fig. S6. Example of uncataloged earthquake detected by FAST, missed by autocorrelation.

    Fig. S7. Histogram of similar fingerprint pairs output from FAST.

    Fig. S8. Schematic illustration of FAST output as a similarity matrix for one channel of continuous seismic data.

    Fig. S9. CC and Jaccard similarity for two similar earthquakes.

    Fig. S10. Theoretical probability of a successful search as a function of Jaccard similarity.

    Fig. S11. Synthetic data generation.

    Fig. S12. Hypothetical precision-recall curves from three different algorithms.

    Fig. S13. Synthetic test results for three different scaling factors c: 0.05 (top), 0.03 (center), 0.01 (bottom), with SNR values provided.

    Table S1. Autocorrelation input parameters.

    Table S2. NCSN catalog events.

    Table S3. Scaling test days.

    Table S4. Example of near-duplicate fingerprint pairs detected by FAST, which represent the same pair with slight time offsets.

    Reference (44)
