Database Structure and Access# Downloading Dataset¶
Overview## Initializing the Database¶
SPIDB extends the SONICDB framework to provide specialized data models for stored product insect research. The database uses SQLAlchemy ORM to manage relationships between sensors, subjects, materials, events, records, samples, and audio files.The SPIDB package uses the SONICDB package to interact with the SQLite database. The database can be initialized using the following command:
Database Schema```python¶
from spidb import spidb
The SPIDB database consists of the following core models:
db = spidb.Database(“data/spi.db”)
Core Models (from SONICDB)```¶
File - Audio file metadata and locations
Sensor - Recording equipment specifications## Stored Product Insect Dataset
Channel - Individual sensor channels
Event - Recording sessionsThe Stored Product Insect Dataset (SPID) is available on Kaggle at the following links:
Sample - Time-windowed data segments
Record - Fixed-duration event segments- A-SPIDS Dataset[1]: https://www.kaggle.com/dkadyrov/a-spids
M-SPIDS Dataset[2]: https://www.kaggle.com/dkadyrov/m-spids
Extended Models (SPIDB-specific)¶
The datasets can be manually downloaded from the Kaggle website or by using the Kaggle API. An example script for downloading the datasets using the Kaggle API is provided in the examples folder.
Subject - Insect subjects with biological measurements
Material - Substrate materials (grains, seeds, etc.)The datasets should be downloaded and extracted into the
datadirectory of the SPIDB package. The following is the directory structure for the SPIDB package to work out of the box with the examples provided in the package:
Initializing the Database```none¶
spidb
Opening an Existing Database+—data¶
| |—spi.db
from spidb import spidb | \---mspids
db = spidb.Database(“data/spi.db”)
```The data can be downloaded into a seperate directory a new database will need to be generated and populated with the updated filepaths. An example of this script is available in the examples folder.
###Creating a New Database## BugBytes Dataset
# Creates a new SQLite database
db = spidb.Database("my_new_database.db")```none
spidb
# Or use PostgreSQL+---data
db = spidb.Database("postgresql://user:password@localhost/spidb")| |---bugbytes.db
| \---bugbytes
# Or MySQL```
db = spidb.Database("mysql://user:password@localhost/spidb")
Database Relationships¶
Hierarchy¶
Sensor
├── Channel
│ └── File
└── Event
├── Subject
├── Material
├── Record
│ └── Sample
│ ├── Subject
│ ├── Material
│ ├── Channel
│ └── File
Key Relationships¶
Sensor has many Channels
Sensor records many Events
Event has one Subject and one Material
Event contains many Records
Record generates many Samples (one per channel)
Sample links to one File, Channel, Subject, and Material
Model Details¶
Subject Model¶
Stores information about insect specimens:
subject = spidb.Subject(
name="Cowpea Beetle Adult",
scientific_name="Callosobruchus maculatus",
common_name="Cowpea Beetle",
life_stage="Adult",
length=3.5, # mm
width=1.5, # mm
height=1.5, # mm
weight=0.003, # grams
volume=None, # cm³
density=None # g/cm³
)
Attributes:
name- Descriptive identifierscientific_name- Taxonomic binomial namecommon_name- Common/vernacular namelife_stage- Development stage (egg, larva, pupa, adult)length,width,height- Physical dimensions in mmweight- Mass in gramsvolume- Volume in cm³density- Density in g/cm³
Material Model¶
Stores information about substrate materials:
material = spidb.Material(
name="rice",
scientific_name="Oryza sativa",
common_name="Rice",
density=1.4 # g/cm³
)
Attributes:
name- Material identifierscientific_name- Taxonomic name (for organic materials)common_name- Common namedensity- Material density in g/cm³
Event Model¶
Records a recording session:
from datetime import datetime
event = spidb.Event(
start=datetime(2023, 5, 24, 19, 41, 19),
end=datetime(2023, 5, 24, 19, 51, 19),
description="Single cowpea beetle in rice",
noise="Silence",
sensor=sensor,
subject=subject,
material=material
)
Attributes:
start,end- Recording timestampsdescription- Text descriptionnoise- Noise level/type (e.g., “Silence”, “60dBA”, “Ambient”)sensor- Associated Sensor objectsubject- Associated Subject objectmaterial- Associated Material object
Record Model¶
Fixed-duration segments of events:
record = spidb.Record(
start=datetime(2023, 5, 24, 19, 41, 19),
end=datetime(2023, 5, 24, 19, 42, 19), # 60 seconds
event=event,
sensor=sensor,
subject=subject,
material=material,
noise="Silence"
)
Attributes:
start,end- Time windowevent- Parent Event objectsensor,subject,material,noise- Inherited from event
Sample Model¶
Individual channel recordings:
sample = spidb.Sample(
datetime=datetime(2023, 5, 24, 19, 41, 19),
event=event,
record=record,
sensor=sensor,
channel=channel,
subject=subject,
material=material,
file=audio_file,
noise="Silence"
)
Attributes:
datetime- Sample timestampevent,record- Parent objectssensor,channel- Recording equipmentsubject,material,noise- Contextfile- Associated audio File object
File Model¶
Audio file metadata:
file = spidb.File(
filepath="/path/to/file.wav",
filename="2023_05_24_19_41_19.559_t00100_Ch0_00001.wav",
extension="wav",
sample_rate=44100,
start=datetime(2023, 5, 24, 19, 41, 19),
end=datetime(2023, 5, 24, 19, 42, 19),
duration=60.0,
channel_number=0,
channel=channel,
sensor=sensor
)
Downloading Datasets¶
Using CLI (Recommended)¶
# Download and build database
spidb download --build
# Download specific dataset
spidb download --dataset aspids
See CLI Guide for details.
Using Kaggle API¶
The datasets are available on Kaggle:
A-SPID Dataset: Acoustic Stored Product Insect Dataset
M-SPID Dataset: Microwave Stored Product Insect Dataset
Manual download:
# Install kaggle
pip install kaggle
# Download A-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-aspids -p data/aspids --unzip
# Download M-SPID
kaggle datasets download -d dkadyrov/stored-product-insect-database-spidb-mspids -p data/mspids --unzip
Building Database from Downloaded Data¶
from spidb.build import populate_db, generate_records, generate_samples
from spidb import spidb
# Create database
db = spidb.Database("data/spi.db")
# Populate from dataset
populate_db(
db=db,
file_directory="data/aspids",
deck="data/aspids/metadata.json"
)
# Generate records (60-second windows)
generate_records(db, duration=60, overwrite=True)
# Generate samples
generate_samples(db, overwrite=True)
Directory Structure¶
Expected structure after downloading datasets:
data/
├── aspids/
│ ├── metadata.json
│ └── 2023/
│ ├── 04_13/
│ ├── 04_24/
│ └── 05_09/
│ └── 2023_05_09_HH_MM_SS.fff_descriptor/
│ ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch0_00001.wav
│ ├── 2023_05_09_HH_MM_SS.fff_descriptor_Ch1_00001.wav
│ └── ...
├── mspids/
│ ├── metadata.json
│ └── 2024/
│ └── ...
└── spi.db
Metadata Structure¶
Each dataset includes a metadata.json file with:
{
"sensor": {
"name": "A-SPIDS",
"subname": "v1.0",
"manufacturer": "Custom",
"number_of_channels": 8,
"type_class": "multi-sensor"
},
"channels": [
{"type_class": "piezoelectric"},
{"type_class": "piezoelectric"}
],
"subjects": [
{
"name": "Cowpea Beetle Adult",
"scientific_name": "Callosobruchus maculatus",
"common_name": "Cowpea Beetle",
"life_stage": "Adult",
"length": 3.5,
"width": 1.5,
"height": 1.5,
"weight": 0.003,
"volume": null,
"density": null
}
],
"materials": [
{
"name": "rice",
"scientific_name": "Oryza sativa",
"common_name": "Rice",
"density": 1.4
}
],
"events": [
{
"start": "2023-05-24T19:41:19.559",
"end": "2023-05-24T19:51:19.559",
"target": "Cowpea Beetle Adult",
"material": "rice",
"description": "Single specimen in center",
"noise": "Silence"
}
]
}
Database Statistics¶
Query database statistics:
from sqlalchemy import func
# Count records by model
n_subjects = db.session.query(func.count(spidb.Subject.id)).scalar()
n_materials = db.session.query(func.count(spidb.Material.id)).scalar()
n_events = db.session.query(func.count(spidb.Event.id)).scalar()
n_records = db.session.query(func.count(spidb.Record.id)).scalar()
n_samples = db.session.query(func.count(spidb.Sample.id)).scalar()
print(f"Subjects: {n_subjects}")
print(f"Materials: {n_materials}")
print(f"Events: {n_events}")
print(f"Records: {n_records}")
print(f"Samples: {n_samples}")