Earthquake detection through computationally efficient similarity search

Data mining techniques applied to continuous waveforms reveal many previously unknown small earthquakes.


The PDF file includes:
Continuous data time gaps
Detection on synthetic data
Reference code: Autocorrelation
Near-repeat exclusion of similar pairs
Postprocessing and thresholding
Fig. S1. Illustration of comparison between many-to-many search methods for similar pairs of seismic events.
Fig. S2. Twenty-second catalog earthquake waveforms, ordered by event time in 1 week of continuous data from CCOB.EHN (bandpass, 4 to 10 Hz).
Fig. S3. Catalog events missed by FAST, detected by autocorrelation.
Fig. S4. Twenty-second new (uncataloged) earthquake waveforms detected by FAST, ordered by event time in 1 week of continuous data from CCOB.EHN (bandpass, 4 to 10 Hz); FAST found a total of 68 new events.
Fig. S5. FAST detection errors.
Fig. S6. Example of uncataloged earthquake detected by FAST, missed by autocorrelation.
Fig. S7. Histogram of similar fingerprint pairs output from FAST.
Fig. S8. Schematic illustration of FAST output as a similarity matrix for one channel of continuous seismic data.
Fig. S9. CC and Jaccard similarity for two similar earthquakes.
Fig. S10. Theoretical probability of a successful search as a function of Jaccard similarity.
Fig. S11. Synthetic data generation.
Fig. S12. Hypothetical precision-recall curves from three different algorithms.

Continuous Data Time Gaps
The selected week of continuous data from the NCEDC contained 7 time gaps, with the longest time gap around 14 minutes in duration. We stitched together the time series data, and placed uncorrelated white Gaussian noise in the time gaps, scaled by the mean and standard deviation of 1000 data samples on either end of the time gap. We confirmed that FAST did not detect any spurious events in or near the time gaps filled with synthetic noise.
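The gap-filling procedure above can be sketched as follows (a minimal illustration, not the code used for the paper; the function and argument names are ours):

```python
import numpy as np

def fill_gap_with_noise(before, after, gap_len, n_edge=1000, seed=0):
    """Fill a time gap with uncorrelated white Gaussian noise scaled by
    the mean and standard deviation of n_edge data samples on either
    end of the gap, as described above.

    before, after: 1-D arrays of samples adjacent to the gap
    gap_len: number of missing samples to synthesize
    """
    edge = np.concatenate([before[-n_edge:], after[:n_edge]])
    rng = np.random.default_rng(seed)
    # White noise matching the local mean and standard deviation
    return rng.normal(edge.mean(), edge.std(), gap_len)
```

The synthesized segment is then spliced into the stitched time series in place of the gap.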

Detection on Synthetic Data
We performed a synthetic test to compare objectively the detection performance of FAST against a reference autocorrelation code. Our synthetic data consisted of scaled-down earthquake waveforms inserted at known times into noisy seismic data (7,8), which provides ground truth.
We extracted broadband noise from the N-S component of CCOB (Figure S11A) during the first 12 hours of 2011-01-09. To simulate a repeating earthquake signal that FAST and autocorrelation should detect, we took a 10-second catalog earthquake waveform at 553-563 s (Figure S11B) in the CCOB data, multiplied it by a scaling factor c, and inserted it at 24 different times into the noisy data. We also added a non-repeating earthquake signal, which we do not expect either algorithm to detect, by taking a different 10-second earthquake waveform (Figure S11C), scaling it by the same factor c, and planting it once into the noisy data.
We define the snr for the synthetic data as the ratio of the signal power to the noise power:

$$\mathrm{snr} = \frac{A_{\mathrm{signal}}^2}{A_{\mathrm{noise}}^2}$$

where the signal amplitude $A_{\mathrm{signal}}$ and the noise amplitude $A_{\mathrm{noise}}$ are root-mean-square (rms) values calculated from 10-second time windows $(a_1, a_2, \ldots, a_n)$, with n = 200 samples (for 20 sps data):

$$A_{\mathrm{rms}} = \sqrt{\frac{1}{n}\left(a_1^2 + a_2^2 + \cdots + a_n^2\right)}$$

To compute $A_{\mathrm{signal}}$, we used the 10-second catalog earthquake waveform at 553-563 s, scaled by the factor c. For $A_{\mathrm{noise}}$, we computed the average of the $A_{\mathrm{rms}}$ values for 10-second time windows at the 24 insertion times in the noisy data, prior to inserting the repeating earthquake signal.
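The snr computation above can be sketched directly (function names are illustrative, not from the FAST code):

```python
import numpy as np

def rms(window):
    """Root-mean-square amplitude of a 1-D window of samples."""
    window = np.asarray(window, dtype=float)
    return np.sqrt(np.mean(window ** 2))

def snr(signal_window, noise_windows):
    """snr as defined above: signal power over noise power, where the
    noise amplitude is the average of the rms values of the windows at
    the planting times (before insertion)."""
    a_signal = rms(signal_window)
    a_noise = np.mean([rms(w) for w in noise_windows])
    return a_signal ** 2 / a_noise ** 2
```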
Since we have ground truth for the synthetic data, we can quantify detection errors made by autocorrelation and FAST as we adjust their detection thresholds. At a given threshold, classifications are correct when the algorithm detects an event where one truly was planted (true positive: TP), or reports no event where none was planted (true negative: TN).
Classification errors occur when we falsely detect an event where one was not planted (false positive: FP), or we fail to detect an event where one actually was planted (false negative: FN).
Precision is the fraction of identified detections that are true event detections (44):

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

and recall is the fraction of true events that were correctly identified as event detections (44):

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

Changing the detection threshold results in different values for TP and FP, which can be used to trace out a curve called the precision-recall curve. Figure S12 shows hypothetical precision-recall curves from three different algorithms. With perfect performance, where both FP and FN are zero, precision and recall both equal one, at the upper-right corner; however, there is usually a trade-off between these metrics, depending on the detection threshold. If the threshold is too low, recall will be high since we do not miss any events, but precision will be low since there will be more false detections. If the threshold is too high, precision will be high since false detections are rejected, but recall may be low since we may also fail to detect actual events. The threshold should be set to the best compromise for the particular application.
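Tracing out a precision-recall curve by sweeping the detection threshold can be sketched as follows (a simplified illustration that assumes ground-truth labels are known, as in the synthetic test; names are ours):

```python
def precision_recall_curve(scores, is_true_event, thresholds):
    """For each threshold, count TP and FP among candidates scoring at
    or above it, then compute precision = TP/(TP+FP) and
    recall = TP/(number of true events), as defined above.

    scores: detection score for each candidate (CC or FAST similarity)
    is_true_event: parallel booleans giving ground truth
    """
    n_true = sum(is_true_event)
    curve = []
    for t in thresholds:
        tp = sum(1 for s, ok in zip(scores, is_true_event) if s >= t and ok)
        fp = sum(1 for s, ok in zip(scores, is_true_event) if s >= t and not ok)
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / n_true if n_true > 0 else 1.0
        curve.append((t, precision, recall))
    return curve
```

Each (precision, recall) point corresponds to one threshold, as in Figure S12.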
Figure S13 displays synthetic test detection results, where the repeating earthquake signal was scaled by three different factors c: 0.05, 0.03, and 0.01, with the associated snr values computed as defined above. We show synthetic data with planted, scaled waveforms, and detection results as precision-recall curves for both autocorrelation and FAST. Table 1 contains the FAST parameter values that we used, and Table S1 displays parameters for the reference autocorrelation code.
These are the same values used for detection in real data, except that here we varied the event detection thresholds to compute precision-recall curves; these thresholds are in units of CC for autocorrelation and FAST similarity for FAST. For c = 0.05 (snr = 7.37), both autocorrelation and FAST achieve perfect precision and recall (Figure S13B), finding all 24 planted events without any false positives. For c = 0.03 (snr = 2.65), autocorrelation still has perfect precision and recall, but FAST starts to trade off precision against recall (Figure S13D). For c = 0.01 (snr = 0.29), neither autocorrelation nor FAST detects any of the planted waveforms (Figure S13F).
As expected, neither autocorrelation nor FAST detected the non-repeating planted waveform.
We conclude from the synthetic test results that autocorrelation is a more sensitive detector than FAST for detecting low snr signals. But FAST performs on par with autocorrelation for moderate values of snr, with a significantly reduced computational cost. This is consistent with the fact that FAST makes approximations in both the feature extraction and similarity search steps, while autocorrelation performs a comprehensive, precise comparison.

Reference Code: Autocorrelation
Autocorrelation partitions the continuous data into N short overlapping time windows and computes N(N-1)/2 normalized CC values between all possible window pairs using Eq. 1. Using a time window length of M = 200 samples (10 s) and τ_w = 2 samples (0.1 s) as the lag between adjacent time windows (Table S1), we found that N = 6,047,901, which is about 10 times greater than N_fp = 604,781. This factor of 10 difference is attributed to the use of a 1 s lag between FAST fingerprints, compared with a 0.1 s lag for autocorrelation.
We use a "sliding window" implementation of autocorrelation that avoids redundant computation of the dot product in the CC (Eq. 1). The dot product can be broken up into three parts: two small boundary segments and one large interior segment, since there is a large amount of overlap between adjacent time windows. We compute the dot product for the next window pair by reusing the computation from the previous interior segment and adjusting for boundary segments. Our autocorrelation implementation is configured to run in parallel on up to 1000 processors with the Message Passing Interface (MPI), although we report runtime results on 1 processor for a fair comparison with FAST.
Our autocorrelation code outputs only time window pairs with CC above an initial threshold βd, where we used β = 5 and d is the median absolute deviation (MAD), defined with the L1 norm as:

$$d = \underset{a,b}{\mathrm{median}}\,\Big|\,\mathrm{CC}(a,b) - \underset{a,b}{\mathrm{median}}\,\mathrm{CC}(a,b)\,\Big|$$

where CC(a, b) is the CC between windows a and b in Eq. 1. From the central limit theorem, we assume a normal distribution of correlation coefficients for our large value of N, but use the MAD as a detection statistic because it is more robust to outliers (8).
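The MAD-based initial threshold can be sketched as follows (a simplified version operating on a flat array of CC values; names are illustrative):

```python
import numpy as np

def mad_threshold(cc_values, beta=5.0):
    """Initial detection threshold beta * d, where d is the median
    absolute deviation (MAD) of the CC values, as described above."""
    cc = np.asarray(cc_values, dtype=float)
    d = np.median(np.abs(cc - np.median(cc)))
    return beta * d
```

Only window pairs whose CC exceeds this threshold are written to the output.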

Near-Repeat Exclusion of Similar Pairs
A query fingerprint will always identify itself as a similar fingerprint in the database, but this match is trivial and of no interest for detection. Also, we are not interested in "near-repeat" pairs, where a fingerprint is reported as similar to itself but offset by a few time samples. Therefore, we use a "near-repeat exclusion" parameter, n_r = 5, to avoid returning any fingerprint within n_r samples of the search query fingerprint as a matching similar fingerprint. Since the fingerprint lag is 1 s, n_r = 5 samples is equivalent to 5 seconds (Table 1).
In the same way, we want to avoid autocorrelation output pairs that consist of a time window correlated with itself, so we use a near-repeat exclusion parameter, n_r = 50, to avoid returning any time window within n_r samples of the current window as a similar waveform. For a 0.1 s time window lag, n_r = 50 samples is equivalent to 5 seconds (Table S1).
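Near-repeat exclusion amounts to a simple filter on candidate index pairs (a sketch; the pair representation and names are ours):

```python
def filter_near_repeats(pairs, n_r):
    """Drop self-matches and near-repeat matches: any pair whose two
    window (or fingerprint) indices lie within n_r samples of each
    other, as described above."""
    return [(i, j) for (i, j) in pairs if abs(i - j) > n_r]
```

For example, with n_r = 5 the trivial pair (10, 10) and the near-repeat pair (10, 12) are both excluded, while (10, 100) is kept.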

Postprocessing and Thresholding
Postprocessing and thresholding are required to convert a list of pairs of similar fingerprint times into a list of earthquake detection times. First, the list of pairs can contain near-duplicate pairs with FAST similarity above the event detection threshold, which really represent the same pair with slight time offsets (red squares in Figure S8). For example, Table S4 shows a list of near-duplicate pairs that actually represent a single pair, so we keep only the pair with the highest similarity (0.57) and remove the remaining pairs whose times fall within 21 seconds of the times of the highest-similarity pair (Table 1).
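The near-duplicate removal step can be sketched as a greedy pass over the pairs in decreasing order of similarity (an illustration under our reading of the procedure; names are ours):

```python
def remove_near_duplicates(pairs, window_s=21.0):
    """Keep only the highest-similarity representative of each cluster
    of near-duplicate pairs: a pair is dropped if both of its times lie
    within window_s seconds of the times of an already-kept pair.

    pairs: list of (t1, t2, similarity) tuples, times in seconds
    """
    kept = []
    for t1, t2, sim in sorted(pairs, key=lambda p: -p[2]):
        if all(abs(t1 - k1) > window_s or abs(t2 - k2) > window_s
               for k1, k2, _ in kept):
            kept.append((t1, t2, sim))
    return kept
```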
After removing near-duplicate pairs, we create a list of event detection times: we sort the pairs in decreasing order of similarity, then add each event in the pair to the detection list.

… found by autocorrelation, confirmed by visual inspection to be legitimate earthquakes, so these are missed detections.

Figure S6: Example of uncataloged earthquake detected by FAST, missed by autocorrelation. A new uncataloged event (11296 s, blue) was detected by FAST to be similar to a catalog event (988 s, red), but was not detected by autocorrelation. The overall waveform shapes are similar, but their alignment is not precise.

Figure S7: Histogram of similar fingerprint pairs output from FAST. It is binned by FAST similarity, with bin size 0.04. The number of pairs in each bin is on a log scale, and near-duplicate pairs are included. For N_fp = 604,781 fingerprints, there are N_fp(N_fp - 1)/2 ≈ 1.8 × 10^11 possible fingerprint pairs (red). FAST outputs 978,904 candidate pairs (blue, green) with similarity of at least the initial threshold of 0.04 (Table 1), which constitute only 0.0005% of the total number of possible pairs, although the database stores additional pairs in memory. After applying the higher event detection threshold of 0.19 (Table 1) …

Figure S8: Schematic illustration of FAST output as a similarity matrix for one channel of continuous seismic data. As in Figure S1, each square represents a pair of fingerprints (which can be mapped back to waveforms) from two different times (blue, green) in the continuous data. This symmetric matrix is very sparse, since LSH restricts our search to highly similar fingerprint pairs; black squares with high similarity indicate when similar waveforms occur. The FAST similarity metric is the fraction of hash tables containing each fingerprint pair in the same hash bucket. Further processing and thresholding, including removal of near-duplicate pairs (red squares), is required to obtain a list of event detection times.
As v increases, the curve moves to the right with a steeper slope: Jaccard similarity must be higher for a successful search, and there is a sharper cutoff for detections.

Figure S11 (caption, continued): To simulate a repeating earthquake signal, we took a 10-second catalog earthquake waveform at 553-563 s, starting from 2011-01-08 00:00:00 in the CCOB.EHN data, multiplied it by a scaling factor c, and inserted it 24 times into the noisy data, every 30 minutes starting at 900 s: 900 s, 2700 s, …, 42300 s. (C) We also added a non-repeating earthquake signal by taking a 10-second earthquake waveform at 314075-314085 s, scaling it by the same factor c, and planting it once into the noisy data at 19800 s.

Figure S12: Hypothetical precision-recall curves from three different algorithms. Each point along a curve represents a different detection threshold. Ideally, if there are no FP or FN errors, both precision and recall equal 1, so the curve touches the upper-right corner; but there is usually a trade-off between these metrics, depending on the detection threshold. The algorithm for the blue curve has the best detection performance, since we can set a threshold such that both precision and recall are close to 1. The algorithm for the green curve is not as good: we can have high precision or high recall, but not both at the same time. The red curve displays the worst performance: both precision and recall are low, regardless of threshold.