Database Structure and Access# Downloading Dataset

Overview## Initializing the Database

SPIDB extends the SONICDB framework to provide specialized data models for stored product insect research. The database uses SQLAlchemy ORM to manage relationships between sensors, subjects, materials, events, records, samples, and audio files.The SPIDB package uses the SONICDB package to interact with the SQLite database. The database can be initialized using the following command:

Database Schema```python

from spidb import spidb

The SPIDB database consists of the following core models:

db = spidb.Database(“data/spi.db”)

Core Models (from SONICDB)```

  • File - Audio file metadata and locations

  • Sensor - Recording equipment specifications## Stored Product Insect Dataset

  • Channel - Individual sensor channels

  • Event - Recording sessionsThe Stored Product Insect Dataset (SPID) is available on Kaggle at the following links:

  • Sample - Time-windowed data segments

  • Record - Fixed-duration event segments- A-SPIDS Dataset[1]: https://www.kaggle.com/dkadyrov/a-spids

  • M-SPIDS Dataset[2]: https://www.kaggle.com/dkadyrov/m-spids

Extended Models (SPIDB-specific)

The datasets can be manually downloaded from the Kaggle website or by using the Kaggle API. An example script for downloading the datasets using the Kaggle API is provided in the examples folder.

  • Subject - Insect subjects with biological measurements

  • Material - Substrate materials (grains, seeds, etc.)The datasets should be downloaded and extracted into the data directory of the SPIDB package. The following is the directory structure for the SPIDB package to work out of the box with the examples provided in the package:

Initializing the Database```none

spidb

Opening an Existing Database+—data

| |—spi.db


from spidb import spidb |   \---mspids

db = spidb.Database(“data/spi.db”)

```The data can be downloaded into a seperate directory a new database will need to be generated and populated with the updated filepaths. An example of this script is available in the examples folder.

###Creating a New Database## BugBytes Dataset


# Creates a new SQLite database

db = spidb.Database("my_new_database.db")```none

spidb 

# Or use PostgreSQL+---data

db = spidb.Database("postgresql://user:password@localhost/spidb")|   |---bugbytes.db

|   \---bugbytes

# Or MySQL```

db = spidb.Database("mysql://user:password@localhost/spidb")

Database Relationships

Hierarchy

Sensor
├── Channel
│   └── File
└── Event
    ├── Subject
    ├── Material
    ├── Record
    │   └── Sample
    │       ├── Subject
    │       ├── Material
    │       ├── Channel
    │       └── File

Key Relationships

  • Sensor has many Channels

  • Sensor records many Events

  • Event has one Subject and one Material

  • Event contains many Records

  • Record generates many Samples (one per channel)

  • Sample links to one File, Channel, Subject, and Material

Model Details

Subject Model

Stores information about insect specimens:

subject = spidb.Subject(
    name="Cowpea Beetle Adult",
    scientific_name="Callosobruchus maculatus",
    common_name="Cowpea Beetle",
    life_stage="Adult",
    length=3.5,      # mm
    width=1.5,       # mm
    height=1.5,      # mm
    weight=0.003,    # grams
    volume=None,     # cm³
    density=None     # g/cm³
)

Attributes:

  • name - Descriptive identifier

  • scientific_name - Taxonomic binomial name

  • common_name - Common/vernacular name

  • life_stage - Development stage (egg, larva, pupa, adult)

  • length, width, height - Physical dimensions in mm

  • weight - Mass in grams

  • volume - Volume in cm³

  • density - Density in g/cm³

Material Model

Stores information about substrate materials:

material = spidb.Material(
    name="rice",
    scientific_name="Oryza sativa",
    common_name="Rice",
    density=1.4  # g/cm³
)

Attributes:

  • name - Material identifier

  • scientific_name - Taxonomic name (for organic materials)

  • common_name - Common name

  • density - Material density in g/cm³

Event Model

Records a recording session:

from datetime import datetime

event = spidb.Event(
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 51, 19),
    description="Single cowpea beetle in rice",
    noise="Silence",
    sensor=sensor,
    subject=subject,
    material=material
)

Attributes:

  • start, end - Recording timestamps

  • description - Text description

  • noise - Noise level/type (e.g., “Silence”, “60dBA”, “Ambient”)

  • sensor - Associated Sensor object

  • subject - Associated Subject object

  • material - Associated Material object

Record Model

Fixed-duration segments of events:

record = spidb.Record(
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 42, 19),  # 60 seconds
    event=event,
    sensor=sensor,
    subject=subject,
    material=material,
    noise="Silence"
)

Attributes:

  • start, end - Time window

  • event - Parent Event object

  • sensor, subject, material, noise - Inherited from event

Sample Model

Individual channel recordings:

sample = spidb.Sample(
    datetime=datetime(2023, 5, 24, 19, 41, 19),
    event=event,
    record=record,
    sensor=sensor,
    channel=channel,
    subject=subject,
    material=material,
    file=audio_file,
    noise="Silence"
)

Attributes:

  • datetime - Sample timestamp

  • event, record - Parent objects

  • sensor, channel - Recording equipment

  • subject, material, noise - Context

  • file - Associated audio File object

File Model

Audio file metadata:

file = spidb.File(
    filepath="/path/to/file.wav",
    filename="2023_05_24_19_41_19.559_t00100_Ch0_00001.wav",
    extension="wav",
    sample_rate=44100,
    start=datetime(2023, 5, 24, 19, 41, 19),
    end=datetime(2023, 5, 24, 19, 42, 19),
    duration=60.0,
    channel_number=0,
    channel=channel,
    sensor=sensor
)

Downloading Datasets

Using Kaggle API

The datasets are available on Kaggle:

Manual download:

# Install kaggle
pip install kaggle

# Download A-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-aspids -p data/aspids --unzip

# Download M-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-mspids -p data/mspids --unzip

Building Database from Downloaded Data

from spidb.build import populate_db, generate_records, generate_samples
from spidb import spidb

# Create database
db = spidb.Database("data/spi.db")

# Populate from dataset
populate_db(
    db=db,
    file_directory="data/aspids",
    deck="data/aspids/metadata.json"
)

# Generate records (60-second windows)
generate_records(db, duration=60, overwrite=True)

# Generate samples
generate_samples(db, overwrite=True)

Directory Structure

Expected structure after downloading datasets:

data/
├── aspids/
│   ├── metadata.json
│   └── 2023/
│       ├── 04_13/
│       ├── 04_24/
│       └── 05_09/
│           └── 2023_05_09_HH_MM_SS.fff_descriptor/
│               ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch0_00001.wav
│               ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch1_00001.wav
│               └── ...
├── mspids/
│   ├── metadata.json
│   └── 2024/
│       └── ...
└── spi.db

Metadata Structure

Each dataset includes a metadata.json file with:

{
  "sensor": {
    "name": "A-SPIDS",
    "subname": "v1.0",
    "manufacturer": "Custom",
    "number_of_channels": 8,
    "type_class": "multi-sensor"
  },
  "channels": [
    {"type_class": "piezoelectric"},
    {"type_class": "piezoelectric"}
  ],
  "subjects": [
    {
      "name": "Cowpea Beetle Adult",
      "scientific_name": "Callosobruchus maculatus",
      "common_name": "Cowpea Beetle",
      "life_stage": "Adult",
      "length": 3.5,
      "width": 1.5,
      "height": 1.5,
      "weight": 0.003,
      "volume": null,
      "density": null
    }
  ],
  "materials": [
    {
      "name": "rice",
      "scientific_name": "Oryza sativa",
      "common_name": "Rice",
      "density": 1.4
    }
  ],
  "events": [
    {
      "start": "2023-05-24T19:41:19.559",
      "end": "2023-05-24T19:51:19.559",
      "target": "Cowpea Beetle Adult",
      "material": "rice",
      "description": "Single specimen in center",
      "noise": "Silence"
    }
  ]
}

Database Statistics

Query database statistics:

from sqlalchemy import func

# Count records by model
n_subjects = db.session.query(func.count(spidb.Subject.id)).scalar()
n_materials = db.session.query(func.count(spidb.Material.id)).scalar()
n_events = db.session.query(func.count(spidb.Event.id)).scalar()
n_records = db.session.query(func.count(spidb.Record.id)).scalar()
n_samples = db.session.query(func.count(spidb.Sample.id)).scalar()

print(f"Subjects: {n_subjects}")
print(f"Materials: {n_materials}")
print(f"Events: {n_events}")
print(f"Records: {n_records}")
print(f"Samples: {n_samples}")

Next Steps

  • See Usage for query examples

  • See Models for complete model reference

  • See CLI Guide for command-line tools