Database Structure and Access# Downloading Dataset¶

Overview## Initializing the Database¶

SPIDB extends the SONICDB framework to provide specialized data models for stored product insect research. The database uses SQLAlchemy ORM to manage relationships between sensors, subjects, materials, events, records, samples, and audio files.The SPIDB package uses the SONICDB package to interact with the SQLite database. The database can be initialized using the following command:

Database Schema```python¶

from spidb import spidb

The SPIDB database consists of the following core models:

db = spidb.Database(“data/spi.db”)

Core Models (from SONICDB)```¶

File - Audio file metadata and locations
Sensor - Recording equipment specifications## Stored Product Insect Dataset
Channel - Individual sensor channels
Event - Recording sessionsThe Stored Product Insect Dataset (SPID) is available on Kaggle at the following links:
Sample - Time-windowed data segments
Record - Fixed-duration event segments- A-SPIDS Dataset[1]: https://www.kaggle.com/dkadyrov/a-spids
M-SPIDS Dataset[2]: https://www.kaggle.com/dkadyrov/m-spids

Extended Models (SPIDB-specific)¶

The datasets can be manually downloaded from the Kaggle website or by using the Kaggle API. An example script for downloading the datasets using the Kaggle API is provided in the examples folder.

Subject - Insect subjects with biological measurements
Material - Substrate materials (grains, seeds, etc.)The datasets should be downloaded and extracted into the data directory of the SPIDB package. The following is the directory structure for the SPIDB package to work out of the box with the examples provided in the package:

Initializing the Database```none¶

spidb

Opening an Existing Database+—data¶

| |—spi.db


from spidb import spidb |   \---mspids

db = spidb.Database(“data/spi.db”)

```The data can be downloaded into a seperate directory a new database will need to be generated and populated with the updated filepaths. An example of this script is available in the examples folder.

###Creating a New Database## BugBytes Dataset

# Creates a new SQLite database

db = spidb.Database("my_new_database.db")```none

spidb 

# Or use PostgreSQL+---data

db = spidb.Database("postgresql://user:password@localhost/spidb")|   |---bugbytes.db

|   \---bugbytes

# Or MySQL```

db = spidb.Database("mysql://user:password@localhost/spidb")

Database Relationships¶

Hierarchy¶

Sensor
├── Channel
│   └── File
└── Event
    ├── Subject
    ├── Material
    ├── Record
    │   └── Sample
    │       ├── Subject
    │       ├── Material
    │       ├── Channel
    │       └── File

Key Relationships¶

Sensor has many Channels
Sensor records many Events
Event has one Subject and one Material
Event contains many Records
Record generates many Samples (one per channel)
Sample links to one File, Channel, Subject, and Material

Model Details¶

Subject Model¶

Stores information about insect specimens:

subject = spidb.Subject(
    name="Cowpea Beetle Adult",
    scientific_name="Callosobruchus maculatus",
    common_name="Cowpea Beetle",
    life_stage="Adult",
    length=3.5,      # mm
    width=1.5,       # mm
    height=1.5,      # mm
    weight=0.003,    # grams
    volume=None,     # cm³
    density=None     # g/cm³
)

Attributes:

name - Descriptive identifier
scientific_name - Taxonomic binomial name
common_name - Common/vernacular name
life_stage - Development stage (egg, larva, pupa, adult)
length, width, height - Physical dimensions in mm
weight - Mass in grams
volume - Volume in cm³
density - Density in g/cm³

Material Model¶

Stores information about substrate materials:

material = spidb.Material(
    name="rice",
    scientific_name="Oryza sativa",
    common_name="Rice",
    density=1.4  # g/cm³
)

Attributes:

name - Material identifier
scientific_name - Taxonomic name (for organic materials)
common_name - Common name
density - Material density in g/cm³

Event Model¶

Records a recording session:

from datetime import datetime

event = spidb.Event(
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 51, 19),
    description="Single cowpea beetle in rice",
    noise="Silence",
    sensor=sensor,
    subject=subject,
    material=material
)

Attributes:

start, end - Recording timestamps
description - Text description
noise - Noise level/type (e.g., “Silence”, “60dBA”, “Ambient”)
sensor - Associated Sensor object
subject - Associated Subject object
material - Associated Material object

Record Model¶

Fixed-duration segments of events:

record = spidb.Record(
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 42, 19),  # 60 seconds
    event=event,
    sensor=sensor,
    subject=subject,
    material=material,
    noise="Silence"
)

Attributes:

start, end - Time window
event - Parent Event object
sensor, subject, material, noise - Inherited from event

Sample Model¶

Individual channel recordings:

sample = spidb.Sample(
    datetime=datetime(2023, 5, 24, 19, 41, 19),
    event=event,
    record=record,
    sensor=sensor,
    channel=channel,
    subject=subject,
    material=material,
    file=audio_file,
    noise="Silence"
)

Attributes:

datetime - Sample timestamp
event, record - Parent objects
sensor, channel - Recording equipment
subject, material, noise - Context
file - Associated audio File object

File Model¶

Audio file metadata:

file = spidb.File(
    filepath="/path/to/file.wav",
    filename="2023_05_24_19_41_19.559_t00100_Ch0_00001.wav",
    extension="wav",
    sample_rate=44100,
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 42, 19),
    duration=60.0,
    channel_number=0,
    channel=channel,
    sensor=sensor
)

Downloading Datasets¶

Using CLI (Recommended)¶

# Download and build database
spidb download --build

# Download specific dataset
spidb download --dataset aspids

See CLI Guide for details.

Using Kaggle API¶

The datasets are available on Kaggle:

A-SPID Dataset: Acoustic Stored Product Insect Dataset
M-SPID Dataset: Microwave Stored Product Insect Dataset

Manual download:

# Install kaggle
pip install kaggle

# Download A-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-aspids -p data/aspids --unzip

# Download M-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-mspids -p data/mspids --unzip

Building Database from Downloaded Data¶

from spidb.build import populate_db, generate_records, generate_samples
from spidb import spidb

# Create database
db = spidb.Database("data/spi.db")

# Populate from dataset
populate_db(
    db=db,
    file_directory="data/aspids",
    deck="data/aspids/metadata.json"
)

# Generate records (60-second windows)
generate_records(db, duration=60, overwrite=True)

# Generate samples
generate_samples(db, overwrite=True)

Directory Structure¶

Expected structure after downloading datasets:

data/
├── aspids/
│   ├── metadata.json
│   └── 2023/
│       ├── 04_13/
│       ├── 04_24/
│       └── 05_09/
│           └── 2023_05_09_HH_MM_SS.fff_descriptor/
│               ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch0_00001.wav
│               ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch1_00001.wav
│               └── ...
├── mspids/
│   ├── metadata.json
│   └── 2024/
│       └── ...
└── spi.db

Metadata Structure¶

Each dataset includes a metadata.json file with:

{
  "sensor": {
    "name": "A-SPIDS",
    "subname": "v1.0",
    "manufacturer": "Custom",
    "number_of_channels": 8,
    "type_class": "multi-sensor"
  },
  "channels": [
    {"type_class": "piezoelectric"},
    {"type_class": "piezoelectric"}
  ],
  "subjects": [
    {
      "name": "Cowpea Beetle Adult",
      "scientific_name": "Callosobruchus maculatus",
      "common_name": "Cowpea Beetle",
      "life_stage": "Adult",
      "length": 3.5,
      "width": 1.5,
      "height": 1.5,
      "weight": 0.003,
      "volume": null,
      "density": null
    }
  ],
  "materials": [
    {
      "name": "rice",
      "scientific_name": "Oryza sativa",
      "common_name": "Rice",
      "density": 1.4
    }
  ],
  "events": [
    {
      "start": "2023-05-24T19:41:19.559",
      "end": "2023-05-24T19:51:19.559",
      "target": "Cowpea Beetle Adult",
      "material": "rice",
      "description": "Single specimen in center",
      "noise": "Silence"
    }
  ]
}

Database Statistics¶

Query database statistics:

from sqlalchemy import func

# Count records by model
n_subjects = db.session.query(func.count(spidb.Subject.id)).scalar()
n_materials = db.session.query(func.count(spidb.Material.id)).scalar()
n_events = db.session.query(func.count(spidb.Event.id)).scalar()
n_records = db.session.query(func.count(spidb.Record.id)).scalar()
n_samples = db.session.query(func.count(spidb.Sample.id)).scalar()

print(f"Subjects: {n_subjects}")
print(f"Materials: {n_materials}")
print(f"Events: {n_events}")
print(f"Records: {n_records}")
print(f"Samples: {n_samples}")

Next Steps¶

See Usage for query examples
See Models for complete model reference
See CLI Guide for command-line tools