WilData Scripts Reference

This page documents all batch scripts available in the WilData package for data management operations.

Overview

All scripts are located in the wildata/scripts/ directory.

Quick Reference

| Script | Purpose | Config File |
| --- | --- | --- |
| import-dataset-example.bat | Import single dataset | configs/import-config-example.yaml |
| bulk-import-dataset.bat | Bulk import datasets | configs/bulk-import-*.yaml |
| create-roi-dataset.bat | Create ROI dataset | configs/roi-create-config.yaml |
| bulk-roi-create-config.bat | Bulk create ROI datasets | configs/bulk-roi-create-config.yaml |
| update-gps-example.bat | Update GPS from CSV | configs/gps-update-config-example.yaml |
| visualize_data.bat | Visualize dataset | None |
| dvc-setup.bat | Setup DVC | None |
| launch_api.bat | Launch REST API | .env |
| running_tests.bat | Run tests | None |

Data Import Scripts

import-dataset-example.bat

Purpose: Import a single dataset from COCO, YOLO, or Label Studio format.

Location: wildata/scripts/import-dataset-example.bat

Command:

uv run wildata import-dataset --config configs\import-config-example.yaml

Configuration: wildata/configs/import-config-example.yaml

Key Parameters:

source_path: "path/to/annotations.json"
source_format: "coco" # coco, yolo, ls
dataset_name: "my_dataset"

root: "data"
split_name: "train" # train, val, test

transformations:
  enable_bbox_clipping: true
  enable_tiling: true
  tiling:
    tile_size: 800
    stride: 640

roi_config:
  roi_box_size: 384
  random_roi_count: 2
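
With tile_size: 800 and stride: 640, neighbouring tiles overlap by 160 pixels. The sketch below shows one common way to lay out tile start offsets along a single image axis. The helper name tile_origins and the edge handling (one extra tile placed flush with the image border) are assumptions made for illustration; WilData's own tiling may treat edges differently.

```python
def tile_origins(length: int, tile_size: int = 800, stride: int = 640) -> list[int]:
    """Return tile start offsets along one axis (hypothetical helper, not WilData API)."""
    if tile_size <= 0 or stride <= 0:
        raise ValueError("tile_size and stride must be positive")
    if length <= tile_size:
        return [0]  # the whole axis fits in a single tile
    origins = list(range(0, length - tile_size + 1, stride))
    # Assumption: cover the trailing edge with one extra tile flush to the border.
    if origins[-1] + tile_size < length:
        origins.append(length - tile_size)
    return origins


if __name__ == "__main__":
    # Offsets for a 2000-pixel-wide image with the settings shown above.
    print(tile_origins(2000, tile_size=800, stride=640))
```

A stride smaller than tile_size is what produces the overlap, so objects cut by one tile boundary are likely to appear whole in a neighbouring tile.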

Example Usage:

cd wildata

# Edit config file first
notepad configs\import-config-example.yaml

# Run import
scripts\import-dataset-example.bat

Output:

  • Master format dataset in data/datasets/
  • Processed images (tiled if enabled)
  • ROI dataset (if configured)

bulk-import-dataset.bat

Purpose: Import multiple datasets in batch mode.

Location: wildata/scripts/bulk-import-dataset.bat

Command:

uv run wildata bulk-import-datasets --config configs\bulk-import-config-example.yaml -n 2

Configuration: a bulk import config in wildata/configs/, such as bulk-import-train.yaml or bulk-import-val.yaml (the command above uses bulk-import-config-example.yaml)

Parameters:

  • -n 2: Number of parallel workers (uses threading on Windows)
  • --config: Path to bulk import config

Example Config:

# configs/bulk-import-train.yaml
source_paths:
  - "D:/annotations/dataset1.json"
  - "D:/annotations/dataset2.json"
  - "D:/annotations/dataset3.json"

source_format: "coco"
root: "D:/data"
split_name: "train"

# Shared transformation settings
transformations:
  enable_tiling: true
  tiling:
    tile_size: 800
    stride: 640

Example Usage:

cd wildata
scripts\bulk-import-dataset.bat

Features:

  • Parallel processing (thread-based)
  • Progress tracking
  • Error handling per dataset
  • Summary report
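
The combination of thread-based workers, per-dataset error handling, and a final summary can be sketched with the standard library. This is an illustration of the pattern only: run_bulk and the import_one callback are hypothetical names, and the real bulk-import-datasets command may be organised differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def run_bulk(paths: Iterable[str], import_one: Callable[[str], object],
             max_workers: int = 2) -> dict:
    """Run import_one on every path in a thread pool and collect a summary.

    A failure in one dataset is recorded and does not stop the others.
    """
    summary = {"ok": [], "failed": {}}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(import_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                summary["ok"].append(path)
            except Exception as exc:
                summary["failed"][path] = str(exc)
    return summary


if __name__ == "__main__":
    def demo_import(path: str) -> None:
        # Stand-in for a real import; rejects anything that is not JSON.
        if not path.endswith(".json"):
            raise ValueError("unsupported file type")

    report = run_bulk(["dataset1.json", "dataset2.txt"], demo_import, max_workers=2)
    print(f"{len(report['ok'])} succeeded, {len(report['failed'])} failed")
```

Here max_workers plays the role of the -n option: lowering it reduces peak memory use at the cost of throughput.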

ROI Dataset Scripts

create-roi-dataset.bat

Purpose: Create a Region of Interest (ROI) classification dataset from detection annotations.

Location: wildata/scripts/create-roi-dataset.bat

Command:

uv run wildata create-roi-dataset --config configs\roi-create-config.yaml

Configuration: wildata/configs/roi-create-config.yaml

Key Parameters:

source_path: "annotations.json"
source_format: "coco"
dataset_name: "roi_dataset"

roi_config:
  roi_box_size: 128 # Size of extracted ROI
  min_roi_size: 32 # Min object size to extract
  random_roi_count: 10 # Background samples per image
  background_class: "background"
  save_format: "jpg"
  quality: 95
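
To make roi_box_size and min_roi_size concrete, the sketch below computes a fixed-size crop window centred on a COCO bounding box and skips objects that are too small. The function roi_crop_box is hypothetical, and two choices in it are assumptions rather than documented WilData behaviour: the size test uses the longer side of the box, and windows that would leave the image are shifted back inside it.

```python
from typing import Optional, Sequence, Tuple


def roi_crop_box(bbox: Sequence[float], image_w: int, image_h: int,
                 roi_box_size: int = 128,
                 min_roi_size: int = 32) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of a square crop around a COCO bbox.

    bbox is [x, y, width, height]. Returns None for objects below min_roi_size.
    """
    x, y, w, h = bbox
    if max(w, h) < min_roi_size:
        return None  # assumption: the longer side decides whether to extract
    center_x = x + w / 2
    center_y = y + h / 2
    left = int(center_x - roi_box_size / 2)
    top = int(center_y - roi_box_size / 2)
    # Shift the window back inside the image instead of padding.
    left = max(0, min(left, image_w - roi_box_size))
    top = max(0, min(top, image_h - roi_box_size))
    right = min(left + roi_box_size, image_w)
    bottom = min(top + roi_box_size, image_h)
    return left, top, right, bottom


if __name__ == "__main__":
    print(roi_crop_box([100, 100, 50, 50], image_w=1000, image_h=1000))
```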

Use Cases:

  • Hard sample mining
  • Error analysis
  • Training classification models
  • Creating balanced datasets

Example Usage:

cd wildata
scripts\create-roi-dataset.bat

Output:

  • ROI image crops
  • Classification labels
  • Class mapping JSON
  • Statistics file

bulk-roi-create.bat

Purpose: Create multiple ROI datasets in batch.

Location: The script itself is not shown here; it is referenced in the configs.

Configuration: wildata/configs/bulk-roi-create-config.yaml

Example Config:

source_paths:
  - "dataset1.json"
  - "dataset2.json"

source_format: "coco"
split_name: "val"

roi_config:
  roi_box_size: 128
  random_roi_count: 5

GPS Management Scripts

update-gps-example.bat

Purpose: Update image EXIF GPS data from a CSV file.

Location: wildata/scripts/update-gps-example.bat

Command:

uv run wildata update-gps-from-csv --config configs\gps-update-config-example.yaml

Configuration: wildata/configs/gps-update-config-example.yaml

Key Parameters:

image_folder: "path/to/images"
csv_path: "gps_coordinates.csv"
output_dir: "output/images"

skip_rows: 0
filename_col: "filename"
lat_col: "latitude"
lon_col: "longitude"
alt_col: "altitude"

CSV Format:

filename,latitude,longitude,altitude
image1.jpg,40.7128,-74.0060,10.5
image2.jpg,40.7589,-73.9851,15.2
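
EXIF stores GPS positions as degrees, minutes, and seconds plus a hemisphere reference, while the CSV above holds signed decimal degrees. The sketch below reads a CSV with the column names from the config and converts one coordinate. The helpers read_gps_rows and to_dms are hypothetical, and skip_rows is assumed to mean "lines to drop before the header row"; none of this is WilData's confirmed implementation.

```python
import csv

SAMPLE_CSV = """filename,latitude,longitude,altitude
image1.jpg,40.7128,-74.0060,10.5
image2.jpg,40.7589,-73.9851,15.2
"""


def read_gps_rows(csv_text: str, filename_col: str = "filename",
                  lat_col: str = "latitude", lon_col: str = "longitude",
                  alt_col: str = "altitude", skip_rows: int = 0) -> dict:
    """Map each filename to a (latitude, longitude, altitude) tuple of floats."""
    lines = csv_text.splitlines()[skip_rows:]  # assumption: skipped lines precede the header
    rows = {}
    for row in csv.DictReader(lines):
        rows[row[filename_col]] = (
            float(row[lat_col]), float(row[lon_col]), float(row[alt_col])
        )
    return rows


def to_dms(value: float, is_latitude: bool) -> tuple:
    """Convert signed decimal degrees to (degrees, minutes, seconds, reference)."""
    if is_latitude:
        ref = "N" if value >= 0 else "S"
    else:
        ref = "E" if value >= 0 else "W"
    value = abs(value)
    degrees = int(value)
    minutes_full = (value - degrees) * 60
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60
    return degrees, minutes, seconds, ref


if __name__ == "__main__":
    for name, (lat, lon, alt) in read_gps_rows(SAMPLE_CSV).items():
        print(name, to_dms(lat, True), to_dms(lon, False), alt)
```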

Example Usage:

cd wildata

# Prepare CSV with GPS data
# Edit config
notepad configs\gps-update-config-example.yaml

# Run update
scripts\update-gps-example.bat

Output:

  • Images with updated EXIF GPS
  • Summary report
  • Error log (if any)

Visualization Scripts

visualize_data.bat

Purpose: Launch FiftyOne visualization for datasets.

Location: wildata/scripts/visualize_data.bat

Command:

uv run wildata visualize-dataset --dataset my_dataset --split train

Example Usage:

cd wildata

# Visualize training set
uv run wildata visualize-dataset --dataset my_dataset --split train

# Or use script
scripts\visualize_data.bat

Features:

  • Interactive dataset viewer
  • Annotation visualization
  • Filtering and search
  • Statistics display

DVC Scripts

dvc-setup.bat

Purpose: Initialize and configure DVC for data versioning.

Location: wildata/scripts/dvc-setup.bat

Command:

# Initialize DVC
dvc init

# Add remote storage
dvc remote add -d myremote <storage_path>

Storage Options:

Local Storage:

dvc remote add -d local D:\dvc-storage

AWS S3:

dvc remote add -d s3remote s3://bucket/path
# --local stores credentials in .dvc/config.local, which is not committed to Git
dvc remote modify --local s3remote access_key_id YOUR_KEY
dvc remote modify --local s3remote secret_access_key YOUR_SECRET

Google Cloud:

dvc remote add -d gcs gs://bucket/path
set GOOGLE_APPLICATION_CREDENTIALS=path\to\credentials.json

Example Usage:

cd wildata
scripts\dvc-setup.bat

# Track data
dvc add data\datasets\my_dataset

# Commit DVC file
git add data\datasets\my_dataset.dvc
git commit -m "Add dataset"

# Push to remote
dvc push

DVC Workflow:

# On another machine
git pull
dvc pull # Downloads data

API Scripts

launch_api.bat

Purpose: Launch WilData REST API server.

Location: wildata/scripts/launch_api.bat

Command:

uv run python -m wildata.api.main

Default Port: 8441

Example Usage:

cd wildata
scripts\launch_api.bat

Access:

  • API: http://localhost:8441
  • Docs: http://localhost:8441/docs
  • Redoc: http://localhost:8441/redoc

API Endpoints:

Import Dataset

POST /api/v1/datasets/import
Content-Type: application/json

{
  "source_path": "/path/to/data.json",
  "source_format": "coco",
  "dataset_name": "my_dataset",
  "root": "data"
}

List Datasets

GET /api/v1/datasets?root=data

Create ROI Dataset

POST /api/v1/roi/create
Content-Type: application/json

{
  "source_path": "/path/to/data.json",
  "source_format": "coco",
  "dataset_name": "roi_dataset",
  "roi_config": {
    "roi_box_size": 128,
    "random_roi_count": 10
  }
}

Job Status

GET /api/v1/jobs/{job_id}
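
The endpoints above can be called from any HTTP client. The sketch below builds the import request and a job status URL with the Python standard library, without sending anything. The helper names are hypothetical, and the response format (for example, which field carries the job ID) is not documented on this page, so check http://localhost:8441/docs before relying on specific response fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8441"  # default host and port from this page


def build_import_request(source_path: str, dataset_name: str,
                         source_format: str = "coco", root: str = "data",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for /api/v1/datasets/import."""
    payload = {
        "source_path": source_path,
        "source_format": source_format,
        "dataset_name": dataset_name,
        "root": root,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v1/datasets/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_status_url(job_id: str, base_url: str = BASE_URL) -> str:
    """Return the URL of the job status endpoint for a given job ID."""
    return f"{base_url}/api/v1/jobs/{job_id}"


if __name__ == "__main__":
    request = build_import_request("/path/to/data.json", "my_dataset")
    print(request.get_method(), request.full_url)
    # To send it while the API server is running:
    #   with urllib.request.urlopen(request) as response:
    #       print(response.read().decode("utf-8"))
```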

Environment Variables:

# In .env
WILDATA_API_HOST=0.0.0.0
WILDATA_API_PORT=8441
WILDATA_API_DEBUG=false
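
For scripts that need the same settings, the variables can be read from the process environment with the defaults shown above. This is a minimal sketch: ApiSettings and load_api_settings are hypothetical names, and the code reads variables that are already set in the environment; it does not parse the .env file itself, and WilData may load its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ApiSettings:
    host: str
    port: int
    debug: bool


def load_api_settings(env: Mapping[str, str] = os.environ) -> ApiSettings:
    """Read the WILDATA_API_* variables, falling back to the values on this page."""
    debug_raw = env.get("WILDATA_API_DEBUG", "false")
    return ApiSettings(
        host=env.get("WILDATA_API_HOST", "0.0.0.0"),
        port=int(env.get("WILDATA_API_PORT", "8441")),
        # Assumption: common truthy spellings count as "debug on".
        debug=debug_raw.strip().lower() in {"1", "true", "yes"},
    )


if __name__ == "__main__":
    print(load_api_settings())
```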

Testing Scripts

running_tests.bat

Purpose: Run WilData test suite.

Location: wildata/scripts/running_tests.bat

Command:

uv run pytest tests/ -v

Example Usage:

cd wildata
scripts\running_tests.bat

Test Categories:

  • Format adapter tests
  • Transformation tests
  • Validation tests
  • API tests
  • Integration tests

Run Specific Tests:

# Test imports
uv run pytest tests/test_coco_import.py -v

# Test transformations
uv run pytest tests/test_transformations.py -v

# Test API
uv run pytest tests/api/ -v

# With coverage
uv run pytest --cov=wildata tests/

Common Workflows

Dataset Preparation Workflow

# 1. Import dataset
cd wildata
scripts\import-dataset-example.bat

# 2. Visualize
scripts\visualize_data.bat

# 3. Export for training
uv run wildata dataset export my_dataset --format yolo

ROI Extraction Workflow

# 1. Import detection dataset
scripts\import-dataset-example.bat

# 2. Create ROI dataset
scripts\create-roi-dataset.bat

# 3. Visualize ROI dataset
uv run wildata visualize-dataset --dataset roi_dataset

GPS Management Workflow

# 1. Extract GPS from images
# (using WildDetect extract_gps.bat)

# 2. Update GPS if needed
cd wildata
scripts\update-gps-example.bat

# 3. Verify GPS data
# Check EXIF data in images

DVC Workflow

# Setup (once)
cd wildata
scripts\dvc-setup.bat

# After each dataset import
dvc add data\datasets\new_dataset
git add data\datasets\new_dataset.dvc
git commit -m "Add new dataset"
dvc push

# On other machines
git pull
dvc pull

Configuration Examples

Complete Import Config

# configs/import-config-example.yaml
source_path: "D:/annotations/dataset.json"
source_format: "coco"
dataset_name: "wildlife_train"

root: "D:/data"
split_name: "train"
processing_mode: "batch"

# Label Studio integration
ls_xml_config: "configs/label_studio_config.xml"
ls_parse_config: false

# ROI extraction
disable_roi: false
roi_config:
  random_roi_count: 2
  roi_box_size: 384
  min_roi_size: 32
  background_class: "background"
  sample_background: true

# Transformations
transformations:
  enable_bbox_clipping: true
  bbox_clipping:
    tolerance: 5
    skip_invalid: false

  enable_tiling: true
  tiling:
    tile_size: 800
    stride: 640
    min_visibility: 0.7
    max_negative_tiles_in_negative_image: 2
    dark_threshold: 0.7

  enable_augmentation: false
  augmentation:
    rotation_range: [-45, 45]
    probability: 1.0
    num_transforms: 2

Troubleshooting

Import Fails

Issue: Dataset import fails with validation errors

Solutions:

  1. Check that the source file format is correct
  2. Verify that all image paths are valid
  3. Check that bbox coordinates are within image bounds
  4. Use the --verbose flag for detailed errors
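
The bounding-box check in step 3 can be done before running an import. The sketch below scans a COCO-style dictionary and returns the IDs of annotations whose boxes fall outside their image, have a non-positive size, or point to a missing image. The function find_out_of_bounds is a hypothetical helper, and its tolerance parameter only loosely mirrors the bbox_clipping tolerance setting; it is not WilData's validator.

```python
def find_out_of_bounds(coco: dict, tolerance: float = 0.0) -> list:
    """Return IDs of annotations whose [x, y, width, height] bbox is invalid."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco.get("images", [])}
    bad_ids = []
    for ann in coco.get("annotations", []):
        size = sizes.get(ann["image_id"])
        if size is None:
            bad_ids.append(ann["id"])  # annotation refers to an unknown image
            continue
        width, height = size
        x, y, w, h = ann["bbox"]
        outside = (
            x < -tolerance
            or y < -tolerance
            or x + w > width + tolerance
            or y + h > height + tolerance
        )
        if outside or w <= 0 or h <= 0:
            bad_ids.append(ann["id"])
    return bad_ids


if __name__ == "__main__":
    example = {
        "images": [{"id": 1, "width": 640, "height": 480}],
        "annotations": [
            {"id": 10, "image_id": 1, "bbox": [10, 10, 100, 100]},
            {"id": 11, "image_id": 1, "bbox": [600, 400, 100, 100]},
        ],
    }
    print(find_out_of_bounds(example))
```

To check a real file, load it with json.load and pass the resulting dictionary to the function.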

DVC Push Fails

Issue: Data cannot be pushed to the remote

Solutions:

  1. Verify the remote credentials
  2. Check the network connection
  3. Verify that the remote storage path exists
  4. Run dvc remote list to check the configuration

API Won't Start

Issue: API server fails to start

Solutions:

  1. Check that port 8441 is not in use
  2. Verify the .env file configuration
  3. Check that all dependencies are installed
  4. Check the error logs
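
One quick way to test whether port 8441 is already taken is to try connecting to it. The sketch below uses only the standard library; port_in_use is a hypothetical helper, and it reports only whether something accepts TCP connections on that port, not which program owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8441):
        print("Port 8441 is in use; stop the other process or change WILDATA_API_PORT.")
    else:
        print("Port 8441 is free.")
```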

Out of Memory

Issue: Import fails with a memory error

Solutions:

  1. Set processing_mode: "streaming" in the import config
  2. Reduce the number of parallel workers (-n)
  3. Process datasets one at a time
  4. Disable transformations temporarily

Next Steps