BCI IV 2a Data Postprocessing
This notebook makes the dataset lineage explicit. It downloads the official BCI Competition IV data set 2a archives, recreates the six packaged processed .npy files, writes a manifest, and optionally compares the regenerated files with the packaged processed data.
What The Files Mean
The raw source is the public BCI IV-2a GDF release. The project subset stores train/validation candidate trials, held-out test trials, labels, and subject IDs in six NumPy files. The labels are still the original cue IDs (769-772) at this stage; conversion to class IDs happens during preprocessing.
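As an illustration of the later label conversion, the sketch below maps the four cue IDs to zero-based class IDs. The cue meanings (769 left hand, 770 right hand, 771 feet, 772 tongue) come from the BCI IV-2a event coding. The function name `cue_ids_to_class_ids` and the 0-3 ordering are assumptions for illustration; the package's preprocessing step may use a different convention.

```python
import numpy as np

# BCI IV-2a cue event IDs and the motor-imagery class each one marks.
CUE_TO_NAME = {769: "left hand", 770: "right hand", 771: "feet", 772: "tongue"}


def cue_ids_to_class_ids(cue_ids):
    """Map cue IDs 769..772 to class IDs 0..3 (assumed ordering, hypothetical helper)."""
    cue_ids = np.asarray(cue_ids)
    # Reject anything outside the four known cue IDs instead of mapping it silently.
    if not np.isin(cue_ids, list(CUE_TO_NAME)).all():
        raise ValueError("unexpected cue ID; expected values in 769..772")
    return cue_ids - 769


labels = np.array([769, 772, 770, 771])
class_ids = cue_ids_to_class_ids(labels)
print(class_ids)
```

Validating the input first means a stray event code fails loudly rather than turning into a negative or out-of-range class ID.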
from pathlib import Path
import json
from eegclassify.bci2a import BCI2AConversionConfig, convert_bci2a_training_subset, download_bci2a
from eegclassify.data import compare_processed_dirs, summarize_processed_dir, write_json_report
RAW_DIR = Path("../data/raw")
PROCESSED_DIR = Path("../data/processed")
REFERENCE_DIR = Path("../data_temp")
REPORT_DIR = Path("../artifacts")
DOWNLOAD = True
REGENERATE_FROM_RAW = True
COMPARE_WITH_REFERENCE = True
Download Official Archives
Set DOWNLOAD = False when the archives are already available under data/raw/. Raw files are reproducible from the official URLs and should not be committed to Git.
if DOWNLOAD:
    paths = download_bci2a(RAW_DIR)
    print(json.dumps({key: str(value) for key, value in paths.items()}, indent=2))
else:
    print("Skipping download")
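Before setting DOWNLOAD = False, it can help to confirm that the nine training files are actually present. The following is a minimal sketch that assumes the GDF files sit directly inside the raw directory; the helper `missing_training_files` is hypothetical and not part of `eegclassify`, and the real layout produced by `download_bci2a` may differ.

```python
from pathlib import Path


def missing_training_files(raw_dir, subjects=range(1, 10)):
    """Return the expected training GDF names that are absent from raw_dir.

    Assumes a flat layout (A01T.gdf ... A09T.gdf directly under raw_dir).
    """
    expected = [f"A{subject:02d}T.gdf" for subject in subjects]
    return [name for name in expected if not (Path(raw_dir) / name).is_file()]
```

An empty return value for `missing_training_files("../data/raw")` would indicate that, under the assumed layout, the download step can be skipped.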
Convert Raw GDF Files To The Processed Dataset
The converter uses the training GDF files A01T.gdf through A09T.gdf, the first 22 EEG channels, 1000 samples per trial, and the built-in package split map.
config = BCI2AConversionConfig(
    window_samples=1000,
    window_offset_samples=1,
    test_trial_start=200,
    test_trial_stop=250,
)
if REGENERATE_FROM_RAW:
    bundle = convert_bci2a_training_subset(RAW_DIR, PROCESSED_DIR, config, PROCESSED_DIR / "manifest.json")
    print("X_train_valid:", bundle.X_train_valid.shape)
    print("X_test:", bundle.X_test.shape)
else:
    print("Skipping raw-to-subset conversion")
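To make the windowing concrete, here is a rough sketch of cutting fixed-length trials from a continuous recording. It is not the converter's implementation: it assumes the offset is counted from each cue onset sample and that the EEG channels are the first 22 rows of a (channels, samples) array. The function `extract_trials` and its arguments are hypothetical.

```python
import numpy as np


def extract_trials(signal, cue_onsets, n_channels=22, window_samples=1000, offset=1):
    """Cut one window per cue from a (channels, samples) array.

    Returns an array of shape (n_trials, n_channels, window_samples).
    The offset is assumed to be relative to the cue onset sample.
    """
    trials = []
    for onset in cue_onsets:
        start = onset + offset
        stop = start + window_samples
        if stop > signal.shape[1]:
            continue  # skip windows that would run past the end of the recording
        trials.append(signal[:n_channels, start:stop])
    if not trials:
        return np.empty((0, n_channels, window_samples), dtype=signal.dtype)
    return np.stack(trials)


# Synthetic stand-in for one recording: 25 channels (22 EEG + 3 EOG), 3000 samples.
signal = np.arange(25 * 3000).reshape(25, 3000)
trials = extract_trials(signal, cue_onsets=[0, 500, 2500])
print(trials.shape)
```

At the dataset's 250 Hz sampling rate, a 1000-sample window covers 4 seconds of signal per trial.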
Manifest And Comparison
The manifest records source URLs, shapes, dtypes, label counts, subject counts, file sizes, and SHA256 hashes. If data_temp/ exists, the comparison report checks whether the regenerated files match the local cache.
manifest = summarize_processed_dir(PROCESSED_DIR)
write_json_report(manifest, PROCESSED_DIR / "manifest.json")
print(json.dumps({name: info.get("shape") for name, info in manifest["files"].items()}, indent=2))
if COMPARE_WITH_REFERENCE and REFERENCE_DIR.exists():
    report = compare_processed_dirs(PROCESSED_DIR, REFERENCE_DIR)
    write_json_report(report, REPORT_DIR / "data_comparison.json")
    print("all_files_match:", report["all_files_match"])
else:
    print("Skipping reference-cache comparison")
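For readers who want to see what a per-file manifest entry and a match check might involve, the sketch below computes the shape, dtype, size, and SHA256 hash of a .npy file and compares two files on those fields. The helpers `describe_npy` and `files_match` are hypothetical; the fields and logic used by `summarize_processed_dir` and `compare_processed_dirs` may differ.

```python
import hashlib
from pathlib import Path

import numpy as np


def describe_npy(path):
    """Summarize one .npy file: shape, dtype, size in bytes, and SHA256 of the raw bytes."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large arrays are not read into memory for hashing.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    array = np.load(path)
    return {
        "shape": list(array.shape),
        "dtype": str(array.dtype),
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }


def files_match(path_a, path_b):
    """Treat two files as matching when every summarized field is identical."""
    return describe_npy(path_a) == describe_npy(path_b)
```

Comparing hashes of the file bytes is stricter than comparing array values: two files holding equal arrays can still differ in their .npy header, so a hash mismatch is a prompt to inspect the arrays rather than proof that the data differ.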