Data Flow

This document describes how data flows through the WildDetect ecosystem, from raw annotations to final detection results and analysis.

Complete Pipeline Overview

flowchart TB
    subgraph "Stage 1: Data Collection"
        A[Raw Images]
        B[Annotation Tools<br/>Label Studio/CVAT]
        C[Annotations<br/>COCO/YOLO/LS]
    end

    subgraph "Stage 2: Data Preparation (WilData)"
        D[Import & Validate]
        E[Transformations<br/>Tile/Augment/Clip]
        F[Master Format<br/>Storage]
        G[Export<br/>Train/Val/Test]
    end

    subgraph "Stage 3: Model Training (WildTrain)"
        H[DataLoader]
        I[Training Loop]
        J[Validation]
        K[MLflow Tracking]
        L[Model Registry]
    end

    subgraph "Stage 4: Deployment (WildDetect)"
        M[Load Model]
        N[Detection Pipeline]
        O[Detections]
    end

    subgraph "Stage 5: Analysis"
        P[Census Statistics]
        Q[Geographic Analysis]
        R[Visualizations]
        S[Reports]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    I --> L
    L --> M
    M --> N
    A --> N
    N --> O
    O --> P
    O --> Q
    Q --> R
    P --> S

    style F fill:#e3f2fd
    style L fill:#fff3e0
    style O fill:#e8f5e9
    style S fill:#f3e5f5

Stage 1: Data Collection

Raw Image Acquisition

Aerial images captured from drones or aircraft:

Input: Raw aerial images
Format: JPG, PNG, TIFF, GeoTIFF
Metadata: GPS coordinates (EXIF), flight parameters
Size: Varies (100MB - 10GB per image for rasters)

Annotation Process

Images are annotated using labeling tools:

Supported Tools:

- Label Studio (recommended for collaboration)
- CVAT (Computer Vision Annotation Tool)
- Manual COCO/YOLO annotation

Output Formats:

- COCO JSON: annotations.json
- YOLO: labels/*.txt + data.yaml
- Label Studio: Export JSON

Example: Label Studio Workflow

# 1. Setup Label Studio project
# 2. Upload images
# 3. Annotate with bounding boxes
# 4. Export annotations

# Example export structure
{
  "annotations": [
    {
      "id": 1,
      "image": "drone_001.jpg",
      "annotations": [
        {
          "result": [{
            "value": {
              "x": 10, "y": 20, "width": 50, "height": 60,
              "rectanglelabels": ["elephant"]
            }
          }]
        }
      ]
    }
  ]
}

Stage 2: Data Preparation (WilData)

Import Process

flowchart LR
    A[Source<br/>Annotations] --> B[Format<br/>Adapter]
    B --> C[Master<br/>Format]
    C --> D[Validation]
    D --> E{Valid?}
    E -->|Yes| F[Save]
    E -->|No| G[Error Report]
    F --> H[Master<br/>Storage]

Format Conversion

All formats converted to unified master format:

# COCO Input
{
  "images": [...],
  "annotations": [...],
  "categories": [...]
}

# ↓ Converted to ↓

# Master Format
{
  "info": {
    "dataset_name": "my_dataset",
    "source_format": "coco",
    "created_at": "2024-01-01T00:00:00"
  },
  "images": [
    {
      "id": 1,
      "file_name": "image.jpg",
      "width": 1920,
      "height": 1080,
      "path": "data/images/train/image.jpg"
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [x, y, width, height],
      "area": 12000,
      "confidence": 1.0
    }
  ],
  "categories": [
    {"id": 1, "name": "elephant"}
  ]
}

Transformation Pipeline

Data transformations applied sequentially:

1. Bbox Clipping

# Before
bbox = [x=-10, y=20, width=100, height=80]  # Outside image bounds

# After clipping
bbox = [x=0, y=20, width=90, height=80]  # Clipped to image
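
A minimal sketch of the clipping step (the function name and signature are illustrative, not WilData's actual API):

def clip_bbox(bbox, img_w, img_h):
    """Clip an [x, y, width, height] box to the image bounds."""
    x, y, w, h = bbox
    x1, y1 = max(x, 0), max(y, 0)
    x2, y2 = min(x + w, img_w), min(y + h, img_h)
    return [x1, y1, max(x2 - x1, 0), max(y2 - y1, 0)]

# clip_bbox([-10, 20, 100, 80], 1920, 1080) -> [0, 20, 90, 80]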

2. Tiling

For large images:

# Original: 8000x6000 image with 5 animals
# ↓
# Tiles: 80 tiles of 800x800 (10 x 8 grid; edge tiles shifted to stay in bounds)
# - Tile (0,0): 1 animal
# - Tile (1,0): 2 animals
# - Tile (0,1): 1 animal
# - etc.
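
A sketch of how tile origins can be laid out on a grid (illustrative only; WilData's actual tiling transformation may use different parameters such as overlap):

def tile_origins(img_w, img_h, tile_size, overlap=0):
    """Yield (x, y) top-left corners covering the image with a tile grid."""
    step = tile_size - overlap
    for y in range(0, img_h, step):
        for x in range(0, img_w, step):
            # Shift edge tiles back inside the image so every tile is full-size
            yield min(x, img_w - tile_size), min(y, img_h - tile_size)

# len(list(tile_origins(8000, 6000, 800))) -> 80 tiles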

3. Augmentation

Create variations for training:

# Original image
# ↓
# Augmented versions:
# - Rotated +15°
# - Rotated -15°
# - Brightness adjusted
# - etc.
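
For example, rotation and brightness augmentations with box-aware handling could be expressed with Albumentations (a sketch of the kind of transforms involved; WilData's actual augmentation configuration may differ):

import albumentations as A

transform = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),          # random rotation in [-15°, +15°]
        A.RandomBrightnessContrast(p=0.5),   # brightness/contrast jitter
    ],
    bbox_params=A.BboxParams(format="coco", label_fields=["class_labels"]),
)

augmented = transform(image=image, bboxes=bboxes, class_labels=labels)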

Export for Training

Convert to framework-specific format:

YOLO Format:

dataset/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml

COCO Format:

dataset/
├── images/
├── train.json
└── val.json
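
In code, the export is a single call (this mirrors the WilData ↔ WildTrain integration example later in this document; the "coco" variant is assumed to follow the same pattern):

# Export to a framework-specific layout
wildata.export_dataset("my_dataset", format="yolo", output="data/yolo")
wildata.export_dataset("my_dataset", format="coco", output="data/coco")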

Stage 3: Model Training (WildTrain)

Data Loading

# DataLoader creates batches
for batch in dataloader:
    images, targets = batch
    # images: tensor [B, 3, H, W]
    # targets: list of dicts with 'boxes', 'labels'

Training Loop

flowchart LR
    A[Load Batch] --> B[Forward Pass]
    B --> C[Calculate Loss]
    C --> D[Backpropagation]
    D --> E[Update Weights]
    E --> F{Epoch End?}
    F -->|No| A
    F -->|Yes| G[Validation]
    G --> H[Log Metrics]
    H --> I{Training Done?}
    I -->|No| A
    I -->|Yes| J[Save Model]
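
The same loop in minimal PyTorch-style pseudocode (WildTrain's real trainer is configuration-driven; `evaluate` and the loss handling here are illustrative assumptions):

import mlflow

for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:
        loss_dict = model(images, targets)      # forward pass (detection models return losses)
        loss = sum(loss_dict.values())          # calculate loss
        optimizer.zero_grad()
        loss.backward()                         # backpropagation
        optimizer.step()                        # update weights
    metrics = evaluate(model, val_loader)       # validation (hypothetical helper)
    mlflow.log_metrics(metrics, step=epoch)     # log metrics to MLflow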

Model Versioning

# Training produces:
1. Model weights: model.pt
2. Training metrics: logged to MLflow
3. Model artifacts: configs, preprocessing params
4. Model metadata: framework, version, dataset

# Registered to MLflow:
models:/detector_name/version

Data Flow in Training

# Epoch 1:
train_images → model → predictions → loss → optimizer → updated_model
val_images → updated_model → predictions → metrics → log

# Epoch 2:
train_images → updated_model → predictions → loss → optimizer → updated_model
val_images → updated_model → predictions → metrics → log

# ...

# Epoch N:
train_images → final_model → predictions → loss → optimizer → best_model
val_images → best_model → predictions → metrics → save

Stage 4: Deployment (WildDetect)

Model Loading

# Load from MLflow
model = mlflow.pytorch.load_model("models:/detector/production")

# Or from file
model = torch.load("detector.pt")

# Model ready for inference

Detection Pipeline

flowchart TB
    A[Input Image] --> B{Large Raster?}
    B -->|Yes| C[Tile Image]
    B -->|No| D[Preprocess]
    C --> E[Process Tiles]
    E --> F[Detect on Each Tile]
    F --> G[Stitch Results]
    G --> H[Apply NMS]
    D --> I[Detect]
    I --> H
    H --> J[Format Results]
    J --> K[Output Detections]
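
The "Apply NMS" step removes duplicate boxes where tiles overlap. A minimal sketch using torchvision (assumes the boxes have already been stitched back into full-image [x1, y1, x2, y2] coordinates):

import torch
from torchvision.ops import nms

def merge_tile_detections(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    keep = nms(
        torch.as_tensor(boxes, dtype=torch.float32),
        torch.as_tensor(scores, dtype=torch.float32),
        iou_threshold,
    )
    return keep.tolist()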

Detection Output Format

{
  "image_path": "drone_001.jpg",
  "image_size": [1920, 1080],
  "processing_time": 0.5,
  "detections": [
    {
      "class_name": "elephant",
      "confidence": 0.95,
      "bbox": [100, 200, 150, 180],
      "bbox_normalized": [0.052, 0.185, 0.078, 0.167]
    },
    {
      "class_name": "giraffe",
      "confidence": 0.89,
      "bbox": [500, 300, 120, 200]
    }
  ],
  "metadata": {
    "model_name": "detector_v1",
    "model_version": "3",
    "timestamp": "2024-01-01T12:00:00"
  }
}

Batch Processing

# Process multiple images
images = ["img1.jpg", "img2.jpg", "img3.jpg"]

# Run detection on each image (sequential loop)
results = []
for image in images:
    result = pipeline.detect(image)
    results.append(result)

# Save all results
save_results(results, "batch_results.json")
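
If throughput matters and the pipeline is thread-safe, the same loop can be parallelised with a thread pool (a sketch; `pipeline.detect` and `save_results` are the same calls as above):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(pipeline.detect, images))

save_results(results, "batch_results.json")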

Stage 5: Analysis

Census Statistics

Aggregate detections across campaign:

# Input: All detection results
total_detections = 1523  # animals across all images

# Statistics by species:
{
  "elephant": {
    "count": 423,
    "percentage": 27.8,
    "avg_confidence": 0.93
  },
  "giraffe": {
    "count": 612,
    "percentage": 40.2,
    "avg_confidence": 0.89
  },
  "zebra": {
    "count": 488,
    "percentage": 32.0,
    "avg_confidence": 0.91
  }
}

# Density analysis (survey area = 25 km²):
survey_area_km2 = 25
density_per_km2 = {
  "elephant": 16.9,
  "giraffe": 24.5,
  "zebra": 19.5
}
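
A sketch of how the per-species summary above can be derived from raw detection results (the result structure follows the Stage 4 output format; the survey area is an assumed input):

from collections import defaultdict

counts = defaultdict(int)
confidence_sums = defaultdict(float)
for result in results:                      # one result dict per image
    for det in result["detections"]:
        counts[det["class_name"]] += 1
        confidence_sums[det["class_name"]] += det["confidence"]

total = sum(counts.values())
survey_area_km2 = 25
stats = {
    species: {
        "count": n,
        "percentage": round(100 * n / total, 1),
        "avg_confidence": round(confidence_sums[species] / n, 2),
        "density_per_km2": round(n / survey_area_km2, 1),
    }
    for species, n in counts.items()
}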

Geographic Analysis

# Extract GPS coordinates from images
image_locations = [
  (lat1, lon1),  # Image 1 location
  (lat2, lon2),  # Image 2 location
  ...
]

# Map detections to geographic space
detection_map = {
  (lat1, lon1): ["elephant", "giraffe"],
  (lat2, lon2): ["zebra", "elephant", "elephant"],
  ...
}

# Analyze distribution
hotspots = identify_hotspots(detection_map)
coverage = calculate_coverage(image_locations)
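
The GPS coordinates typically come from EXIF metadata written by the drone. A minimal sketch using Pillow (assumes JPEG images with embedded GPS EXIF tags; conversion to decimal degrees is omitted):

from PIL import Image
from PIL.ExifTags import GPSTAGS

def read_gps_tags(path):
    """Return the named GPS EXIF tags for an image, or {} if none exist."""
    exif = Image.open(path)._getexif() or {}
    gps_ifd = exif.get(34853, {})  # 34853 is the GPSInfo EXIF tag
    return {GPSTAGS.get(tag, tag): value for tag, value in gps_ifd.items()}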

Visualization Pipeline

flowchart LR
    A[Detections] --> B[Statistics]
    A --> C[Geographic Data]
    B --> D[Charts & Graphs]
    C --> E[Maps]
    D --> F[Report]
    E --> F
    A --> G[FiftyOne]
    G --> H[Interactive Viewer]

Data Storage and Persistence

Directory Structure

project/
├── data/
│   ├── raw/                    # Original images
│   ├── annotations/            # Original annotations
│   └── datasets/               # Processed datasets
│       └── my_dataset/
│           ├── images/
│           │   ├── train/
│           │   └── val/
│           └── annotations/
│               ├── train.json  # Master format
│               └── val.json
├── models/
│   ├── checkpoints/            # Training checkpoints
│   └── trained/                # Final models
├── results/
│   ├── detections/             # Detection outputs
│   ├── census/                 # Census reports
│   └── visualizations/         # Maps, charts
└── mlruns/                     # MLflow tracking data

Data Versioning with DVC

# Track data with DVC
dvc add data/datasets/my_dataset

# Creates .dvc file
data/datasets/my_dataset.dvc

# Push to remote storage
dvc push

# On another machine
dvc pull

Performance Considerations

Data Loading Optimization

# Efficient data loading
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,      # Parallel worker processes for loading
    pin_memory=True,    # Faster GPU transfer
    prefetch_factor=2   # Prefetch batches
)

Memory Management

# For large images
with rasterio.open(large_image) as src:
    # Process in windows
    for window in tile_windows:
        tile = src.read(window=window)
        process(tile)
        del tile  # Free memory

Caching

# Cache loaded models
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model(model_path):
    return torch.load(model_path)

# Cache detection results keyed by an image content hash
import hashlib

results_cache = {}

def cached_detect(image_path):
    image_hash = hashlib.md5(open(image_path, "rb").read()).hexdigest()
    if image_hash not in results_cache:
        results_cache[image_hash] = detect(image_path)
    return results_cache[image_hash]

Error Handling and Recovery

Validation Checkpoints

flowchart TB
    A[Input Data] --> B{Valid Format?}
    B -->|No| C[Error Report]
    B -->|Yes| D{Images Exist?}
    D -->|No| C
    D -->|Yes| E{Bboxes Valid?}
    E -->|No| C
    E -->|Yes| F[Process Data]
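
A sketch of the bbox check in that chain, using the master-format field names from Stage 2 (the error-report structure is illustrative):

def validate_bboxes(data):
    """Return a list of error strings for out-of-bounds or degenerate boxes."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in data["images"]}
    errors = []
    for ann in data["annotations"]:
        if ann["image_id"] not in sizes:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
            continue
        w, h = sizes[ann["image_id"]]
        x, y, bw, bh = ann["bbox"]
        if bw <= 0 or bh <= 0 or x < 0 or y < 0 or x + bw > w or y + bh > h:
            errors.append(f"annotation {ann['id']}: bbox outside image bounds")
    return errors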

Recovery Mechanisms

# Checkpoint-based recovery
for i, image in enumerate(images):
    try:
        result = detect(image)
        save_checkpoint(i, result)
    except Exception as e:
        logger.error(f"Failed on image {i}: {e}")
        if should_continue:
            continue  # Skip the failed image and keep processing
        else:
            break     # Stop now; a later run resumes via resume_from_checkpoint(i)

Integration Points

WilData ↔ WildTrain

# WilData exports dataset
wildata.export_dataset("my_dataset", format="yolo", output="data/yolo")

# WildTrain loads dataset
datamodule = DataModule(data_root="data/yolo")
trainer.fit(model, datamodule)

WildTrain ↔ WildDetect

# WildTrain registers model
mlflow.pytorch.log_model(model, "model")
mlflow.register_model("runs:/.../model", "detector")

# WildDetect loads model
pipeline = DetectionPipeline(mlflow_model_name="detector")

WildDetect ↔ FiftyOne

# WildDetect creates FiftyOne dataset
fo_dataset = create_fiftyone_dataset(detections)

# Launch viewer
session = fo.launch_app(fo_dataset)

Example: Complete Workflow

# 1. Import annotations (WilData)
from wildata import DataPipeline

pipeline = DataPipeline("data")
pipeline.import_dataset(
    source_path="annotations.json",
    source_format="coco",
    dataset_name="training_data",
    transformations={"enable_tiling": True}
)

# 2. Train model (WildTrain)
from wildtrain import Trainer

trainer = Trainer.from_config("configs/yolo.yaml")
model = trainer.train()
model_uri = trainer.register_model("wildlife_detector")

# 3. Run detection (WildDetect)
from wildetect import DetectionPipeline

detector = DetectionPipeline(mlflow_model_uri=model_uri)
results = detector.detect_batch("survey_images/")

# 4. Analyze results (WildDetect)
from wildetect import CensusEngine

census = CensusEngine.from_detections(results)
census.generate_report("census_report.pdf")

Next Steps