# Database Structure and Access# Downloading Dataset ## Overview## Initializing the Database SPIDB extends the SONICDB framework to provide specialized data models for stored product insect research. The database uses SQLAlchemy ORM to manage relationships between sensors, subjects, materials, events, records, samples, and audio files.The SPIDB package uses the `SONICDB` package to interact with the SQLite database. The database can be initialized using the following command: ## Database Schema```python from spidb import spidb The SPIDB database consists of the following core models: db = spidb.Database("data/spi.db") ### Core Models (from SONICDB)``` - **File** - Audio file metadata and locations - **Sensor** - Recording equipment specifications## Stored Product Insect Dataset - **Channel** - Individual sensor channels - **Event** - Recording sessionsThe Stored Product Insect Dataset (SPID) is available on Kaggle at the following links: - **Sample** - Time-windowed data segments - **Record** - Fixed-duration event segments- A-SPIDS Dataset[^aspids]: [https://www.kaggle.com/dkadyrov/a-spids](https://www.kaggle.com/dkadyrov/a-spids) - M-SPIDS Dataset[^mspids]: [https://www.kaggle.com/dkadyrov/m-spids](https://www.kaggle.com/dkadyrov/m-spids) ### Extended Models (SPIDB-specific) The datasets can be manually downloaded from the Kaggle website or by using the Kaggle API. An [example script](https://github.com/dkadyrov/spidb/tree/main/examples) for downloading the datasets using the Kaggle API is provided in the `examples` folder. - **Subject** - Insect subjects with biological measurements - **Material** - Substrate materials (grains, seeds, etc.)The datasets should be downloaded and extracted into the `data` directory of the SPIDB package. The following is the directory structure for the SPIDB package to work out of the box with the examples provided in the package: ## Initializing the Database```none spidb ### Opening an Existing Database+---data | |---spi.db ```python| +---aspids from spidb import spidb | \---mspids ``` db = spidb.Database("data/spi.db") ```The data can be downloaded into a seperate directory a new database will need to be generated and populated with the updated filepaths. An example of this script is available in the `examples` folder. ###Creating a New Database## BugBytes Dataset ```pythonThe BugBytes dataset[^bugbytes] is a collection of recordings made using the Acoustic Stored Product Insect Detection System (A-SPIDS) and the Microwave Stored Product Insect Detection System (M-SPIDS) of several stored product insects at different life stages within various materials under different levels of artificial and natural noise. The dataset is available on Kaggle at the following link: [BugBytes](https://www.kaggle.com/dkadyrov/bugbytes). # Creates a new SQLite database db = spidb.Database("my_new_database.db")```none spidb # Or use PostgreSQL+---data db = spidb.Database("postgresql://user:password@localhost/spidb")| |---bugbytes.db | \---bugbytes # Or MySQL``` db = spidb.Database("mysql://user:password@localhost/spidb") ``` [^aspids]: Daniel Kadyrov, Alexander Sutin, Alexander Sedunov, Nikolay Sedunov, and Hady Salloum, “Stored Product Insect Dataset (SPID) - ASPIDS.” Kaggle. doi: 10.34740/KAGGLE/DS/4982480. ## Database Relationships [^mspids]: Daniel Kadyrov, Alexander Sutin, Alexander Sedunov, Nikolay Sedunov, and Hady Salloum, “Stored Product Insect Dataset (SPID) - MSPIDS.” Kaggle. doi: 10.34740/KAGGLE/DSV/8618204. ### Hierarchy [^bugbytes]: R. Mankin, “Bug Bytes Sound Library: Stored Product Insect Pest Sounds.” Ag Data Commons, 2019. doi: 10.15482/USDA.ADC/1504600. ``` Sensor ├── Channel │ └── File └── Event ├── Subject ├── Material ├── Record │ └── Sample │ ├── Subject │ ├── Material │ ├── Channel │ └── File ``` ### Key Relationships - **Sensor** has many **Channels** - **Sensor** records many **Events** - **Event** has one **Subject** and one **Material** - **Event** contains many **Records** - **Record** generates many **Samples** (one per channel) - **Sample** links to one **File**, **Channel**, **Subject**, and **Material** ## Model Details ### Subject Model Stores information about insect specimens: ```python subject = spidb.Subject( name="Cowpea Beetle Adult", scientific_name="Callosobruchus maculatus", common_name="Cowpea Beetle", life_stage="Adult", length=3.5, # mm width=1.5, # mm height=1.5, # mm weight=0.003, # grams volume=None, # cm³ density=None # g/cm³ ) ``` **Attributes:** - `name` - Descriptive identifier - `scientific_name` - Taxonomic binomial name - `common_name` - Common/vernacular name - `life_stage` - Development stage (egg, larva, pupa, adult) - `length`, `width`, `height` - Physical dimensions in mm - `weight` - Mass in grams - `volume` - Volume in cm³ - `density` - Density in g/cm³ ### Material Model Stores information about substrate materials: ```python material = spidb.Material( name="rice", scientific_name="Oryza sativa", common_name="Rice", density=1.4 # g/cm³ ) ``` **Attributes:** - `name` - Material identifier - `scientific_name` - Taxonomic name (for organic materials) - `common_name` - Common name - `density` - Material density in g/cm³ ### Event Model Records a recording session: ```python from datetime import datetime event = spidb.Event( start=datetime(2023, 5, 24, 19, 41, 19), end=datetime(2023, 5, 24, 19, 51, 19), description="Single cowpea beetle in rice", noise="Silence", sensor=sensor, subject=subject, material=material ) ``` **Attributes:** - `start`, `end` - Recording timestamps - `description` - Text description - `noise` - Noise level/type (e.g., "Silence", "60dBA", "Ambient") - `sensor` - Associated Sensor object - `subject` - Associated Subject object - `material` - Associated Material object ### Record Model Fixed-duration segments of events: ```python record = spidb.Record( start=datetime(2023, 5, 24, 19, 41, 19), end=datetime(2023, 5, 24, 19, 42, 19), # 60 seconds event=event, sensor=sensor, subject=subject, material=material, noise="Silence" ) ``` **Attributes:** - `start`, `end` - Time window - `event` - Parent Event object - `sensor`, `subject`, `material`, `noise` - Inherited from event ### Sample Model Individual channel recordings: ```python sample = spidb.Sample( datetime=datetime(2023, 5, 24, 19, 41, 19), event=event, record=record, sensor=sensor, channel=channel, subject=subject, material=material, file=audio_file, noise="Silence" ) ``` **Attributes:** - `datetime` - Sample timestamp - `event`, `record` - Parent objects - `sensor`, `channel` - Recording equipment - `subject`, `material`, `noise` - Context - `file` - Associated audio File object ### File Model Audio file metadata: ```python file = spidb.File( filepath="/path/to/file.wav", filename="2023_05_24_19_41_19.559_t00100_Ch0_00001.wav", extension="wav", sample_rate=44100, start=datetime(2023, 5, 24, 19, 41, 19), end=datetime(2023, 5, 24, 19, 42, 19), duration=60.0, channel_number=0, channel=channel, sensor=sensor ) ``` ## Downloading Datasets ### Using CLI (Recommended) ```bash # Download and build database spidb download --build # Download specific dataset spidb download --dataset aspids ``` See [CLI Guide](cli_guide.md) for details. ### Using Kaggle API The datasets are available on Kaggle: - **A-SPID Dataset**: [Acoustic Stored Product Insect Dataset](https://www.kaggle.com/datasets/dkadyrov/stored-product-insect-database-spidb-aspids) - **M-SPID Dataset**: [Microwave Stored Product Insect Dataset](https://www.kaggle.com/datasets/dkadyrov/stored-product-insect-database-spidb-mspids) Manual download: ```bash # Install kaggle pip install kaggle # Download A-SPID kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-aspids -p data/aspids --unzip # Download M-SPID kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-mspids -p data/mspids --unzip ``` ### Building Database from Downloaded Data ```python from spidb.build import populate_db, generate_records, generate_samples from spidb import spidb # Create database db = spidb.Database("data/spi.db") # Populate from dataset populate_db( db=db, file_directory="data/aspids", deck="data/aspids/metadata.json" ) # Generate records (60-second windows) generate_records(db, duration=60, overwrite=True) # Generate samples generate_samples(db, overwrite=True) ``` ## Directory Structure Expected structure after downloading datasets: ``` data/ ├── aspids/ │ ├── metadata.json │ └── 2023/ │ ├── 04_13/ │ ├── 04_24/ │ └── 05_09/ │ └── 2023_05_09_HH_MM_SS.fff_descriptor/ │ ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch0_00001.wav │ ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch1_00001.wav │ └── ... ├── mspids/ │ ├── metadata.json │ └── 2024/ │ └── ... └── spi.db ``` ## Metadata Structure Each dataset includes a `metadata.json` file with: ```json { "sensor": { "name": "A-SPIDS", "subname": "v1.0", "manufacturer": "Custom", "number_of_channels": 8, "type_class": "multi-sensor" }, "channels": [ {"type_class": "piezoelectric"}, {"type_class": "piezoelectric"} ], "subjects": [ { "name": "Cowpea Beetle Adult", "scientific_name": "Callosobruchus maculatus", "common_name": "Cowpea Beetle", "life_stage": "Adult", "length": 3.5, "width": 1.5, "height": 1.5, "weight": 0.003, "volume": null, "density": null } ], "materials": [ { "name": "rice", "scientific_name": "Oryza sativa", "common_name": "Rice", "density": 1.4 } ], "events": [ { "start": "2023-05-24T19:41:19.559", "end": "2023-05-24T19:51:19.559", "target": "Cowpea Beetle Adult", "material": "rice", "description": "Single specimen in center", "noise": "Silence" } ] } ``` ## Database Statistics Query database statistics: ```python from sqlalchemy import func # Count records by model n_subjects = db.session.query(func.count(spidb.Subject.id)).scalar() n_materials = db.session.query(func.count(spidb.Material.id)).scalar() n_events = db.session.query(func.count(spidb.Event.id)).scalar() n_records = db.session.query(func.count(spidb.Record.id)).scalar() n_samples = db.session.query(func.count(spidb.Sample.id)).scalar() print(f"Subjects: {n_subjects}") print(f"Materials: {n_materials}") print(f"Events: {n_events}") print(f"Records: {n_records}") print(f"Samples: {n_samples}") ``` ## Next Steps - See [Usage](usage.md) for query examples - See [Models](models.md) for complete model reference - See [CLI Guide](cli_guide.md) for command-line tools