Abstract
EEG signal processing is a fundamental method for neurophysiology research and clinical neurology practice. Historically, the classification of EEG into physiological, pathological, or artifactual signals has been performed by expert visual review of the recordings. However, the size of EEG recordings is rapidly increasing, with a trend toward higher channel counts, greater sampling frequencies, and longer recording durations, so complete reliance on visual data review is not sustainable. In this study, we publicly share annotated intracranial EEG data clips from two institutions: Mayo Clinic, MN, USA and St. Anne’s University Hospital Brno, Czech Republic. The dataset contains intracranial EEG clips that are labeled into three groups: physiological activity, pathological/epileptic activity, and artifactual signals. The dataset published here should support and facilitate training of generalized machine learning and digital signal processing methods for intracranial EEG and promote research reproducibility. Along with the data, we also propose a statistical method that is recommended for comparing the performance of candidate classifiers using out-of-institution/out-of-patient testing.
Measurement(s) | brain measurement • physiological activity • epileptic seizure AE • Artifact • Annotation |
Technology Type(s) | electroencephalography (EEG) • intracranial electroencephalography • data transformation |
Factor Type(s) | institution |
Sample Characteristic - Organism | Homo sapiens |
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12326471
Background & Summary
Intracranial electroencephalography (iEEG) is an invasive procedure commonly used for localization of epileptic seizure onset zones in patients with drug resistant epilepsy. The iEEG signals are measured directly from cortical and deep brain structures, e.g. hippocampus, amygdala, etc. Currently, visual inspection and artifact rejection of the data is a standard pre-processing procedure that must be done prior to evaluation of epileptic seizure onset zones. Improvements in EEG acquisition systems, data storage, and surgical techniques allow for large scale data collection spanning multiple days to weeks, recording from hundreds of electrodes with sampling rates reaching up to 32 kHz in research settings1. In addition to their clinical utility for mapping the epileptic brain by localization of seizures and interictal epileptiform transients, these data enable a wide range of neuroscience research activities. The amount of collected data is rapidly increasing, and advancement in data compression, storage, visualization, and automated processing is important and has received significant attention2,3. The acquisition of large datasets has driven the development of improved data preprocessing tools that enable extraction of important application specific data segments, i.e. tools that focus on the data that are clinically important, such as seizures, or on other brain states of interest for research purposes. It is very burdensome for a human operator to manually classify hundreds of channels of data spanning days to weeks. If reliable automated methods can be developed, they can mine the data and crop the segments containing the iEEG features of interest. The automatic classification of artifacts and segmentation of iEEG recordings is recognized as a challenging task, and many interesting studies have been published addressing these challenges4,5.
In recent years, a variety of methods using machine learning and deep learning techniques for iEEG processing have emerged with impressive results6,7,8. Generalizability of an automated artifact detection method would enable its broader use and its extension to any retrospective or prospective iEEG dataset, but this has received little attention. We have recently demonstrated robust generalization of automated detection algorithms for artifact classification using training and testing datasets collected at different institutions, with different acquisition systems, and under different measurement conditions7,8.
Automated processing and data mining with Convolutional Neural Networks (CNN) are powerful, but interpreting particular classifications and correlating them with known iEEG waveforms is difficult. The inability to dissect the CNN decision process makes the approach less useful for understanding fundamental neurophysiology and, ultimately, for use in clinical practice. We recently demonstrated the ability to temporally localize the graphoelements that drive the final classification, which makes visual review and interpretation of raw EEG recordings possible8. This approach could also prove useful for supervised adaptive retraining in active learning and expert-in-the-loop scenarios, based on an expert’s review of data that yielded false positive or false negative classifications. Common iEEG graphoelements of physiological activity (e.g. delta, alpha and beta band oscillations) can be identified and characterized. Further, data can be contaminated with several types of artifactual signals from artificial or physiological sources. The most common artifact is powerline noise (50 Hz or 60 Hz), which is usually induced in the acquisition system. Other artifacts originally received less attention or were incorrectly assumed not to contaminate iEEG. However, it was later shown that eye movements and muscle artifacts can distort iEEG recordings9,10,11, so affected signals need to be either cleaned or discarded from analysis. In many recent electrophysiological studies across several domains (neurology, cognition, etc.), research focuses more and more on subtle attributes of the iEEG signal, such as power in high frequency bands12,13,14,15. In such analyses, it is critical to recognize and control for subtle power changes in the signal that might be caused, for example, by high frequency harmonics of power line noise and other artificial generators, which would previously have been excluded by restricting the analysis to a limited frequency band (e.g. analyzing data in low frequencies only).
In general, iEEG often contains artifactual signals in electrodes that are spatially closer to the scalp or to cranial nerve foramina (movement, muscle artifacts, eye movements). The ECG signal and weak scalp signals might also propagate to the measurement system through a common reference. In addition, natural pulsation of the brain tissue driven by respiration, cerebrospinal fluid pulsation, and heart rhythm (blood flow and pulsation in vessels) might cause motion distortion artifacts. Patients undergoing iEEG monitoring have electrodes implanted into the brain structures that are assumed to generate epileptic/pathological activity, such as interictal epileptiform spikes and high-frequency oscillations13. The problematic aspect of automated iEEG classification, which can bias study results, is that artifactual signals (such as muscle artifacts) caused by patient movement or other physiological sources commonly share features with pathological signals, e.g. power in the 200–600 Hz band.
The purpose of publicly sharing this dataset is to advance the field and the progress of generalized machine learning and iEEG processing techniques in neurophysiology. In particular, machine learning techniques capable of processing data from multiple institutions without performance degradation and without the need for retraining will be extremely useful. We anticipate that these methods will boost the creation of new, large gold standard datasets from multiple institutions. The generalized pre-trained models should be re-trainable (transfer-learning)16,17 to adapt to new datasets without requiring the collection of new annotated gold standards, which should significantly decrease the time needed for manual annotation and therefore advance iEEG utilization in clinical practice and research. For this reason, we believe that public sharing of such datasets is a cornerstone for further advancing iEEG research.
Methods
Data collection
The iEEG dataset published in this study was collected from two institutions: St. Anne’s University Hospital (Brno, Czech Republic) and Mayo Clinic (Rochester, Minnesota, United States of America). The data acquisition methods and signal annotation techniques described below are an adapted and expanded version of the descriptions in our related work7,8. Here, for purposes of data sharing, we significantly extended the datasets that were used in our previous studies7,8. We also provide additional information that is clinically relevant and might extend the usability of the dataset. Each data segment is described by clinically useful features: classification category (power line interference; high frequency noise; pathological activity; physiological activity), seizure onset zone (True, False), anatomical location, electrode type, reviewer identification number, and patient number. The provided information allows for various statistical testing scenarios. The format of the data and meta-data is described in detail in the section Data format description.
The St. Anne’s University Hospital (FNUSA) dataset consists of iEEG data collected in the awake resting state from 14 patients diagnosed with drug resistant epilepsy (DRE) who underwent pre-surgical monitoring for localization of the seizure onset zone, as is standard before epilepsy surgery. The acquisition system used for the measurement in the hospital was a BrainScope system (M&I, BrainScope, Czech Republic). This system allows recording of up to 192 channels at a maximum sampling rate of 25 kHz with a common reference montage. Here, the system was used to record 30 minutes of awake resting interictal iEEG at a 25 kHz sampling rate. Raw data were low-pass filtered at 2 kHz to avoid aliasing and then down-sampled to 5 kHz. The electrodes used in all patients in the dataset were standard intracranial depth electrodes: semi-flexible multi-contact platinum electrodes with 5, 10 or 15 contacts (ALCIS - Temis Health, France), a diameter of 0.8 mm, a contact length of 2 mm, a contact surface area of 5.02 mm², and an inter-contact distance of 1.5 mm.
Mayo Clinic data were recorded during the first night after electrode implantation and consist of two-hour long iEEG recordings. The data were collected between 1 AM and 3 AM from 25 patients with DRE undergoing evaluation for epilepsy surgery. The Neuralynx Cheetah system (Neuralynx Inc., Bozeman MT, USA) was used to acquire the data at a sampling rate of 32 kHz with a hardware filter bandwidth of DC to 9 kHz. Similar to the St. Anne’s recordings, all data were filtered by an antialiasing filter, in this case with a cutoff frequency of 1 kHz, and subsequently down-sampled to 5 kHz. Patients were implanted with depth electrodes, grids and strips, or a combination of these. An illustrative example (Fig. 1) shows co-registered electrode placement on fused MRI and CT scans of a patient undergoing invasive EEG monitoring with stereotactic depth electrodes. The depth electrodes used in the dataset were AD-Tech electrodes (AD-Tech Medical Instrument Corp., Racine, WI or PMT, Chanhassen, MN, USA) and consisted of 4 or 8 Platinum/Iridium contacts (2.3 mm long, 1 mm diameter, spaced 5 or 10 mm center-to-center). AD-Tech subdural grid and strip electrodes had 4.0 mm diameter Platinum/Iridium discs (2.3 mm exposed) with 10 mm center-to-center distance.
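The filtering and down-sampling applied to the recordings from both institutions can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the Butterworth design, the filter order, and the function name `lowpass_and_downsample` are assumptions, and only the sampling rates (25 kHz and 32 kHz in, 5 kHz out) and the cutoff frequencies (2 kHz and 1 kHz) come from the text. The input is synthetic noise, not iEEG.

```python
from math import gcd

import numpy as np
from scipy.signal import butter, resample_poly, sosfiltfilt

TARGET_FS = 5000  # Hz, sampling rate of the shared data


def lowpass_and_downsample(x, fs_in, cutoff_hz, fs_out=TARGET_FS, order=8):
    """Zero-phase low-pass filter, then rational-ratio resampling to fs_out.

    The Butterworth filter and its order are illustrative assumptions.
    """
    sos = butter(order, cutoff_hz, btype="lowpass", fs=fs_in, output="sos")
    filtered = sosfiltfilt(sos, np.asarray(x, dtype=float))
    # Reduce fs_out / fs_in to the smallest integer up/down factors
    common = gcd(int(fs_out), int(fs_in))
    up, down = int(fs_out) // common, int(fs_in) // common
    # resample_poly also applies its own FIR anti-aliasing filter
    return resample_poly(filtered, up, down)


# One second of synthetic noise at each site's native sampling rate
rng = np.random.default_rng(0)
fnusa = lowpass_and_downsample(rng.standard_normal(25_000), 25_000, 2_000)
mayo = lowpass_and_downsample(rng.standard_normal(32_000), 32_000, 1_000)
```

For the 25 kHz recordings the ratio reduces to a plain decimation by 5, while the 32 kHz recordings need a rational ratio of 5/32, which is why a polyphase resampler is used here instead of simple sample dropping.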
Ethics declaration
All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Mayo Clinic Institutional Review Board and St. Anne’s University Hospital Research Ethics Committee and the Ethics Committee of Masaryk University.
Data annotation
All data were reviewed in SignalPlant18, a free software tool for signal processing, inspection and annotation. The dataset was annotated by 3 reviewers, and each recording was reviewed by a single reviewer. Each recording was visually reviewed in the time domain alongside power distribution matrices (PDM) for manual artifact detection13. The PDM method estimates the signal power envelope by filtering the data in specific frequency bands and then computing the absolute value of the analytic signal (a complex signal whose real part is the original signal and whose imaginary part is its Hilbert transform). Visual inspection of the PDM allows fast localization of high-power events that appear across all channels and are therefore likely to be artifacts. The signal regions with a high power envelope were subsequently reviewed in the time domain to classify signals into the pathological or artifactual group. Standard clinical evaluation of pathology of the brain substrate requires a two-year post-surgical follow-up evaluating the reduction of epileptic seizures. In this study, however, we target identification of signal graphoelements and thus define the pathological signal group as signals with epileptiform graphoelements, e.g. HFOs, spikes, or epileptiform discharges, that were visually reviewed and predominantly extracted from electrodes implanted in brain structures such as the hippocampus. Generation of the PDM is a time consuming process; however, SignalPlant supports CUDA GPU accelerated signal filtering19, which significantly speeds up the process. Annotated events were segmented with constant-length segmentation into 3-second (15000 samples) data clips. The length of the segmentation window was estimated empirically, taking into account that muscle artifacts span multiple seconds. This iEEG data window provides sufficient context to reliably differentiate between all classes of the data (physiological activity, pathological/epileptic activity, power-line noise, and other non-cerebral artifacts).
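The envelope computation behind the PDM and the constant-length segmentation can be sketched as below. This is an illustrative approximation, not SignalPlant's implementation: the Butterworth band-pass filter, the band edges, and all function names are assumptions. Only the Hilbert-based envelope, the 5 kHz sampling rate, and the 15000-sample clip length are taken from the text.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

FS = 5000          # Hz, sampling rate of the shared clips
CLIP_LEN = 3 * FS  # 3 s -> 15000 samples


def band_envelope(x, low_hz, high_hz, fs=FS, order=4):
    """Band-pass filter x, then take |analytic signal| as the envelope."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, np.asarray(x, dtype=float))
    return np.abs(hilbert(filtered))


def power_distribution_matrix(channels, bands, fs=FS):
    """Envelopes for every band and channel.

    channels: array of shape (n_channels, n_samples)
    returns:  array of shape (n_bands, n_channels, n_samples)
    """
    return np.stack(
        [np.stack([band_envelope(ch, lo, hi, fs) for ch in channels])
         for lo, hi in bands]
    )


def fixed_length_clips(x, clip_len=CLIP_LEN):
    """Cut a 1-D signal into non-overlapping clips, dropping the remainder."""
    x = np.asarray(x)
    n_clips = x.size // clip_len
    return x[: n_clips * clip_len].reshape(n_clips, clip_len)


# Synthetic 6-second, 100 Hz sinusoid with amplitude 2 (not real iEEG)
t = np.arange(2 * CLIP_LEN) / FS
demo = 2.0 * np.sin(2 * np.pi * 100.0 * t)
demo_envelope = band_envelope(demo, 80.0, 120.0)  # hypothetical band edges
demo_clips = fixed_length_clips(demo)
```

Away from the edges of the recording, the envelope of the sinusoid stays close to its amplitude, which is the property that makes high-power events stand out when the matrix is inspected across channels.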
Data Records
The datasets from St. Anne’s University Hospital and Mayo Clinic consist of 155182 and 193118 data clips, respectively. A basic overview of the segment distribution for each class and dataset is given in Table 1. More comprehensive descriptive statistics can be derived from the dataset metadata files. In general, the datasets contain data clips from four groups of distinctive events: powerline noise (50 Hz or 60 Hz, depending on the power line frequency at the clinic’s location); muscle and machine artifacts; physiological iEEG activity in different behavioral states of the subject (sleep/wake/wake-relax); and pathophysiological activity. For example, Fig. 2a shows signals recorded at FNUSA contaminated by 50 Hz noise, Fig. 2b shows a movement artifact, and Fig. 2c illustrates baseline jumps caused by instrumentation. The muscle, movement and machine artifacts group contains iEEG recordings that most often include high frequency components caused by movements, muscle activity of the subject, or instrumentation. Figure 2d shows normal wake-relaxed state iEEG activity from the FNUSA dataset. Pathological/epileptiform activity can consist of interictal epileptiform activity such as spikes or high frequency oscillations (Fig. 2e). The datasets are publicly available under a CC0 license and can be downloaded from the figshare20 repository.
Technical Validation
To validate the reliability of the gold standard annotations, we used cross-validation statistics in which the class is predicted by a model trained on annotations from other reviewers. For example, a model was trained on data classified by reviewers 1 and 2 and subsequently tested on out-of-sample data segments classified by reviewer 3 (Fig. 3, Table 2). This procedure was repeated for each reviewer. Moreover, we provide out-of-institution testing statistics to show that annotations are consistent across institutions (Fig. 4, Table 3). These methods provide a measure of data labeling quality. To validate the annotations, we used the Convolutional LSTM neural network that was previously described8. The model processes z-score normalized spectrograms of the data and provides a probability for each classification group. Here, we used standard metrics commonly applied in the evaluation of model classification performance: the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). For class-imbalanced datasets, as in our case, it is important to report both AUROC and AUPRC, because AUROC alone can give an overly optimistic picture of performance on rare classes.
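The leave-one-reviewer-out procedure and the two metrics can be sketched as follows. This is a hedged illustration that does not reproduce the authors' evaluation code: the function names are hypothetical, AUPRC is approximated by average precision, and the toy labels and scores are made up. The pairwise AUROC below needs memory proportional to the number of positive-negative pairs, so for the full dataset a library implementation would be the more practical choice.

```python
import numpy as np


def leave_one_reviewer_out(reviewers):
    """Yield (held_out, train_idx, test_idx) once per reviewer id."""
    reviewers = np.asarray(reviewers)
    for held_out in np.unique(reviewers):
        test_mask = reviewers == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)


def auprc(labels, scores):
    """Average precision: mean precision at the rank of each positive.

    Assumes there are no tied scores.
    """
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return precision_at_k[hits].mean()


# Toy one-vs-rest example for a single class (made-up values)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)  # 0.75
toy_auprc = auprc(toy_labels, toy_scores)  # 5/6

# Toy reviewer assignment for six segments
toy_splits = list(leave_one_reviewer_out([1, 1, 2, 3, 3, 2]))
```

For the multi-class setting described here, each class would be scored one-vs-rest from the model's per-class probabilities, and the same split function could be reused with institution labels in place of reviewer ids for out-of-institution testing.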
Data format description
We publish this dataset in a format that is easily accessible to the general machine learning community and allows fast machine learning workflows. For this reason, the datasets are stored in two separate zip archives, where each archive consists of data records from one institution. Each data segment is saved in the .mat file format to allow processing in commonly used computing tools such as Matlab and Python. This format was found to be very easy to work with in several machine learning competitions for classification of electrophysiological signals (mostly from the cardiology domain), e.g. the Computing in Cardiology Challenges21. Data segments are saved as data vectors (1 × 15000 float vector). Each dataset contains a comma-separated values document (segments.csv) describing the metadata for each segment, i.e. segment_id, channel, category, reviewer, seizure onset zone (SOZ), anatomy, electrode type, anonymized patient_id, and institution. In addition, we published the datasets in iEEG-BIDS format to comply with the neuroscience data sharing standard22,23. In that version, datasets and annotations are stored in the multi-scale electrophysiology file format24 (.mef), which is supported23 by BIDS22. Official C code libraries and documentation for .mef usage are publicly available at https://github.com/msel-source/meflib.
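Reading the .mat clips together with segments.csv in Python might look like the sketch below. Several details are assumptions that should be checked against the downloaded archives: the variable name inside each .mat file (`data` here), the file naming scheme (`<segment_id>.mat` in the same folder as segments.csv), and the exact CSV column headers. The function names are hypothetical.

```python
import csv
from pathlib import Path

import numpy as np
from scipy.io import loadmat

CLIP_SAMPLES = 15000  # 3 s at 5 kHz, as stated in the text


def read_metadata(csv_path):
    """Return the rows of segments.csv as a list of dicts keyed by column name."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


def load_clip(mat_path, key="data"):
    """Load one clip as a flat float vector.

    The variable name `key` is an assumption about the .mat contents.
    """
    clip = np.asarray(loadmat(str(mat_path))[key], dtype=float).ravel()
    if clip.size != CLIP_SAMPLES:
        raise ValueError(f"expected {CLIP_SAMPLES} samples, got {clip.size}")
    return clip


def iter_clips(root, key="data"):
    """Yield (metadata_row, clip) pairs.

    Assumes each clip is stored as <segment_id>.mat next to segments.csv.
    """
    root = Path(root)
    for row in read_metadata(root / "segments.csv"):
        yield row, load_clip(root / f"{row['segment_id']}.mat", key=key)
```

Iterating lazily keeps memory use low, which matters because each institution's archive holds well over one hundred thousand clips.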
To promote data sharing and reproducibility of results, we also publish example training code for neural network models along with the dataset. We also publish Python pipelines together with the requirements for the Python environment. This should allow for smooth data handling and help with using the dataset. Example code can be downloaded from figshare20 or GitHub (https://github.com/xnejed07/NoiseDetectionCNN-GRU).
Usage Notes
To streamline the data segmentation and machine learning process and to spare potential data users extra work (direct manipulation of .mef files and compilation of the supporting C code libraries), we decided to also publish the datasets segmented into 3-second segments (15,000 samples) as Matlab files (.mat). The 3-second segment length was chosen empirically based on electrophysiological characteristics of iEEG, experiments, tests, and results of our previous study7. Each segment is labeled with its assigned class, and all other meta-data are provided. We encourage using this dataset for training deep-learning methods for processing new intracranial EEG data. The datasets might be used as a pretraining step. We assume that this will significantly increase the speed of the automated annotation process for new data. We have previously published and described the transfer learning method7, which produces probability matrices for each classification group. The organization of the dataset into small data clips allows for rapid model development: users do not need to spend an extensive amount of time on data annotation and other machine learning preprocessing steps, which usually consume a major part of the time in development and testing of machine learning methods.
References
Stead, M. et al. Microseizures and the spatiotemporal scales of human partial epilepsy. Brain 133, 2789–2797 (2010).
Brinkmann, B. H., Bower, M. R., Stengel, K. A., Worrell, G. A. & Stead, M. Multiscale Electrophysiology Format: An Open-source Electrophysiology Format Using Data Compression, Encryption, and Cyclic Redundancy Check. Conf. Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. IEEE Eng. Med. Biol. Soc. Conf. 2009, 7083–7086 (2009).
Brinkmann, B. H., Bower, M. R., Stengel, K. A., Worrell, G. A. & Stead, M. Large-scale Electrophysiology: Acquisition, Compression, Encryption, and Storage of Big Data. J. Neurosci. Methods 180, 185–192 (2009).
Gliske, S. V. et al. Universal automated high frequency oscillation detector for real-time, long term EEG. Clin. Neurophysiol. 127, 1057–1066 (2016).
Hu, S., Stead, M. & Worrell, G. A. Automatic Identification and Removal of Scalp Reference Signal for Intracranial EEGs Based on Independent Component Analysis. IEEE Trans. Biomed. Eng. 54, 1560–1572 (2007).
Cimbalnik, J. et al. Multi-feature localization of epileptic foci from interictal, intracranial EEG. Clin. Neurophysiol. 130, 1945–1953 (2019).
Nejedly, P. et al. Intracerebral EEG Artifact Identification Using Convolutional Neural Networks. Neuroinformatics 17, 225–234 (2018).
Nejedly, P. et al. Exploiting Graphoelements and Convolutional Neural Networks with Long Short Term Memory for Classification of the Human Electroencephalogram. Sci. Rep. 9 (2019).
Ball, T. Signal quality of simultaneously recorded invasive and non-invasive EEG. NeuroImage 46, 708–716 (2009).
Kovach, C. Manifestation of ocular-muscle EMG contamination in human intracranial recordings. NeuroImage 54, 213–233 (2011).
Jerbi, K. et al. Saccade Related Gamma-Band Activity in Intracerebral EEG: Dissociating Neural from Ocular Muscle Activity. Brain Topogr. 22, 18–23 (2009).
Cimbalnik, J., Kucewicz, M. T. & Worrell, G. Interictal high-frequency oscillations in focal human epilepsy. Current Opinion in Neurology 29, 175–181 (2016).
Brázdil, M. et al. Very high-frequency oscillations: Novel biomarkers of the epileptogenic zone. Annals of Neurology 82, 299–310 (2017).
Worrell, G. & Gotman, J. High-frequency oscillations and other electrophysiological biomarkers of epilepsy: clinical studies. Biomarkers in Medicine 5, 557–566 (2011).
Kucewicz, M. T. et al. High frequency oscillations are associated with cognitive processing in human recognition memory. Brain 137, 2231–2244 (2014).
Tan, C. et al. A Survey on Deep Transfer Learning. Artificial Neural Networks and Machine Learning – ICANN 2018 Lecture Notes in Computer Science 270–279 (2018).
Yang, L., Hanneke, S. & Carbonell, J. A theory of transfer learning with applications to active learning. Mach. Learn. 90, 161–189 (2013).
Plesinger, F., Jurco, J., Halamek, J. & Jurak, P. SignalPlant: an open signal processing software platform. Physiol. Meas. 37, N38–48 (2016).
Nejedly, P., Plesinger, F., Halamek, J. & Jurak, P. CudaFilters: A SignalPlant library for GPU-accelerated FFT and FIR filtering. Softw. Pract. Exp. 48, 3–9 (2017).
Nejedly, P. et al. Multicenter intracranial EEG dataset for classification of graphoelements and artifactual signals. figshare https://doi.org/10.6084/m9.figshare.c.4681208 (2020).
Clifford, G. et al. AF Classification from a Short Single Lead ECG Recording: the Physionet Computing in Cardiology Challenge 2017. 2017 Computing in Cardiology Conference (CinC) (2017).
Pernet, C. R. et al. EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Sci. Data 6 (2019).
Holdgraf, C. et al. iEEG-BIDS, extending the Brain Imaging Data Structure specification to human intracranial electrophysiology. Sci. Data 6 (2019).
Stead, M., Bower, M., Brinkmann, B., Warren, C. & Worrell, G. Large-Scale Electrophysiology. Epilepsy 409–416 (2011).
Acknowledgements
This research was supported by NIH R01-NS92882, NIH UH2-NS095495, by the Institutional Resources for Research by Czech Technical University in Prague, Czech Republic, Projects LO1212, LQ1605 National Program of Sustainability II (MEYS CR), the CAS project RVO:68081731 and LTAUSA18056. This research has been financially supported by grant AZV NV 19-04-00343.
Contributions
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication. P.N., V.K. and V.S. designed the study and data science pipelines. J.C., P.K., M.F., T.V. and B.H.B. managed the data and created the annotations for St. Anne’s University Hospital and Mayo Clinic. I.V., J.H. and F.P. designed the statistical analysis, signal processing pipelines and methods. M.P. and M.B. conducted patient care and data acquisition at St. Anne’s University Hospital. P.J. and G.W. were principal investigators of the proposed project, obtained funding, and provided technical and clinical input.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.
About this article
Cite this article
Nejedly, P., Kremen, V., Sladky, V. et al. Multicenter intracranial EEG dataset for classification of graphoelements and artifactual signals. Sci Data 7, 179 (2020). https://doi.org/10.1038/s41597-020-0532-5
DOI: https://doi.org/10.1038/s41597-020-0532-5