WilData Architecture

WilData is the data management foundation of the WildDetect ecosystem, providing a unified pipeline for importing, transforming, and exporting object detection datasets.

Overview

Purpose: Unified data pipeline and management system for computer vision datasets

Key Responsibilities:

Multi-format dataset import/export
Data transformations and augmentation
ROI dataset creation
DVC integration for versioning
REST API for programmatic access

Architecture Diagram

Core Components

Dataset Adapters

Adapters provide the logic to convert various annotation formats into the WilData Master Format:

COCO Adapter: Processes COCO JSON annotations.
YOLO Adapter: Processes YOLO TXT annotations and YAML configuration.
Label Studio Adapter: Processes exports from the Label Studio annotation platform.

2. Master Format

Internal unified representation for all datasets.

A unified internal representation used to store images, annotations, categories, and metadata consistently across the toolkit.

3. Transformation Pipeline

Apply transformations to datasets.

Reusable processing steps applied to datasets during import or export:

Bbox Clipping: Ensures bounding boxes stay within image limits.
Tiling: Splits large images into smaller tiles while preserving and adjusting annotations.
Augmentation: Generates synthetic training data through rotations and flips.

4. ROI Adapter

Extract regions of interest for classification datasets.

Specialized tool for extracting sub-images (ROIs) from detection datasets to create classification datasets for species identification.

**Use Cases**:
- Hard sample mining
- Error analysis
- Training ROI-based classifiers
- Creating balanced classification datasets

### 5. Data Pipeline

Main orchestrator for data operations.

```python
The central coordination layer that handles dataset lifecycle: loading, validation, transformation, and storage.

### 6. DVC Manager

Handle data versioning with DVC.

```python
Integrates Data Version Control for tracking large image files and synchronizing them with cloud or remote storage.

## REST API

FastAPI-based API for remote operations.

### API Structure

```python
WilData provides a REST API built with FastAPI that exposes endpoints for dataset management and job status monitoring.

### Background Jobs

Long-running operations handled asynchronously:

```python
Long-running operations (like large transformations) are handled in a background job queue to ensure responsiveness.

## CLI Interface

Command-line interface built with Typer.

```python
All core functionalities are exposed through a comprehensive CLI built with Typer. Detailed command documentation is available in the [CLI Reference](../api-reference/wildata-cli.md).

## Configuration System

### Import Configuration

```yaml
# configs/import-config-example.yaml
source_path: "annotations.json"
source_format: "coco"  # coco, yolo, ls
dataset_name: "my_dataset"

root: "data"
split_name: "train"  # train, val, test
processing_mode: "batch"  # streaming, batch

# Transformations
transformations:
  enable_bbox_clipping: true
  bbox_clipping:
    tolerance: 5
    skip_invalid: false
  
  enable_augmentation: false
  augmentation:
    rotation_range: [-45, 45]
    probability: 1.0
    num_transforms: 2
  
  enable_tiling: true
  tiling:
    tile_size: 800
    stride: 640
    min_visibility: 0.7
    max_negative_tiles_in_negative_image: 2

# ROI Configuration
roi_config:
  random_roi_count: 10
  roi_box_size: 128
  min_roi_size: 32
  background_class: "background"
  save_format: "jpg"
  quality: 95

Data Storage

Directory Structure

data/
├── datasets/
│   ├── dataset_name/
│   │   ├── images/
│   │   │   ├── train/
│   │   │   ├── val/
│   │   │   └── test/
│   │   └── annotations/
│   │       ├── train.json        # Master format
│   │       ├── val.json
│   │       └── test.json
│   └── ...
├── exports/
│   ├── coco/
│   └── yolo/
└── .dvc/                         # DVC metadata

Master Format Storage

Datasets are stored in an extended COCO-like format:

{
  "info": {
    "dataset_name": "my_dataset",
    "created_at": "2024-01-01T00:00:00",
    "source_format": "coco",
    "transformations_applied": ["tiling", "clipping"]
  },
  "images": [...],
  "annotations": [...],
  "categories": [...]
}

Validation

All data is validated at import:

Automatic data integrity checks are performed during import to ensure all fields, coordinates, and image references are valid.

## Performance Optimization

### Streaming Mode

For large datasets:

```python
The pipeline supports efficient processing of large datasets using streaming modes and parallel I/O operations.


### 3. DVC Workflow

```bash
# Setup DVC
wildata dvc setup --storage-type s3 --storage-path s3://bucket/data

# Import with tracking
wildata import-dataset data.json --format coco --name ds --track-dvc

# Push to remote
wildata dvc push

# On another machine
wildata dvc pull ds

Overview​

Architecture Diagram​

Core Components​

Dataset Adapters​

2. Master Format​

3. Transformation Pipeline​

4. ROI Adapter​

Data Storage​

Directory Structure​

Master Format Storage​

Validation​

Next Steps​