
Computer Vision Research & Engineering

Real-World Vision
Systems That Deploy.

Lightweight deep learning for semantic segmentation, object detection, and edge AI across intelligent transportation, UAV analytics, medical imaging, industrial inspection, and aviation safety.


11+ Years of Research Experience

29+ Research Contributions

15+ Vision Models Developed

Multiple Application Domains

Inference Demos

Model Inference in Action.

Selected Projects

Published Architectures.

Research-driven lightweight architectures designed for deployment-aware computer vision systems where accuracy, efficiency, and real-world constraints matter.

View All Selected Projects

Applications

Vision AI Across
Real-World Domains.

Some of the application domains where deployment-aware computer vision systems can support real-world analysis, monitoring, and automation.

Computer Vision Research & Engineering

Lightweight Vision Systems
From Architecture Design to Inference.

Semantic Segmentation

Lightweight encoder-decoder systems for pixel-level analysis across medical imaging, infrastructure monitoring, aerial analytics, and vision-driven applications.

LiteFusionNet · DFF-UNet

Object Detection

Efficient detection systems for small objects, visual monitoring, and constrained computer vision applications.

VDXNet · LiteFODNet

Lightweight Architecture

Parameter-efficient architectures balancing accuracy, latency, computational cost, and memory constraints for efficient vision systems.

Parameter Efficiency · Low GFLOPs

Inference Optimization

TensorRT optimization, ONNX deployment workflows, FP32/FP16 benchmarking, and hardware-aware performance evaluation for embedded vision systems.

TensorRT · Jetson

Blog

Technical Articles
Insights, Engineering, & Experiments.

Debugging & Failure Analysis

Why Your Segmentation Model Predicts Only Background (And How to Fix It)

Root cause analysis of foreground collapse under cross-entropy on imbalanced datasets. CVC-ClinicDB experiment over 100 epochs.

Practical Guide

How to Train an Object Detection Model on a Custom Dataset (What Tutorials Don't Tell You)

The decisions tutorials skip: dataset format, model scale, and evaluation methodology. YOLOv12 on VisDrone.

Architecture

U-Net Explained Clearly (With a Practical Training Example)

Encoder, skip connections, and decoder from first principles. CVC-ClinicDB training results across 100 epochs.

Debugging & Failure Analysis

Common Mistakes in Object Detection Training That Kill Performance

Wrong annotations, bad augmentation, poor evaluation, and overfitting, each with symptom patterns and fixes.

Read All Articles

Books

Beginner to Advanced.
Two Tracks.

Semantic Segmentation Track

A Step-by-Step Guide for Beginners • Beginner

Training Models on Real-World Data • Intermediate

Advanced Architecture Design and Experiments • Advanced

Object Detection Track

A Step-by-Step Guide for Beginners • Beginner

Training Models on Real-World Datasets • Intermediate

Advanced Architecture Design and Experiments • Advanced

Browse Books

Discuss a Computer Vision Project

Whether you are improving a lightweight architecture, optimizing deployment performance, designing a segmentation system, or exploring a computer vision solution, Occulins provides focused technical consultation and development support.

Schedule Consultation Start a Discussion

Services

Computer Vision Research & Engineering.

Lightweight architecture development, deployment-focused engineering, and technical consultation for real-world computer vision systems.

Computer Vision Research & Engineering

Core Services.

Semantic Segmentation Systems

Precise Pixel-Level Analysis

Custom segmentation systems for research and applied computer vision tasks requiring accurate region, boundary, or object-level understanding.

Efficient encoder-decoder architectures
Context-aware feature learning
Lightweight segmentation pipelines
Deployment-aware experimentation

Object Detection Systems

Lightweight Detection for Real Environments

Efficient detection systems for visual recognition tasks where accuracy, speed, and computational cost must be balanced.

Small-object and multi-scale detection
YOLO-family customization
Backbone, neck, and head design
Deployment-aware evaluation

Lightweight Model Design

Architecture Under Computational Constraints

Design and refinement of compact deep learning architectures for scenarios where parameter count, GFLOPs, latency, and memory matter.

Parameter-efficient modules
Multi-scale feature fusion
Attention and context modules
Accuracy-efficiency trade-off analysis
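
As a rough illustration of why parameter count matters for memory, the sketch below estimates weight storage from a parameter count and a numeric precision. It is a minimal sketch, not a measurement: `weight_memory_mb` is a hypothetical helper, it counts weights only (activations, buffers, and runtime overhead are excluded), and the parameter counts are the ones listed for the published models on the Selected Projects page.

```python
# Bytes needed to store one parameter at common inference precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: float, precision: str = "fp32") -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes).

    Hypothetical helper: counts weights only, not activations or runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e6


# Parameter counts as listed on the Selected Projects page.
MODELS = {
    "DFF-UNet": 0.190e6,
    "LiteFusionNet": 0.493e6,
    "VDXNet": 1.608e6,
    "LiteFODNet": 2.515e6,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        fp32 = weight_memory_mb(params, "fp32")
        fp16 = weight_memory_mb(params, "fp16")
        print(f"{name:14s} fp32 ~ {fp32:.2f} MB   fp16 ~ {fp16:.2f} MB")
```

Halving the precision halves the weight footprint, which is one reason FP16 is commonly evaluated on embedded targets; whether accuracy holds at lower precision has to be checked per model.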

Deployment Support

Model Optimization & Inference Acceleration

Support for moving trained vision models toward efficient runtime behavior through export, optimization, and deployment-aware testing.

ONNX export support
TensorRT optimization
Inference acceleration & latency analysis
Runtime benchmarking and deployment review
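
A latency benchmark of the kind listed above can be sketched in a few lines. This is a minimal, framework-agnostic sketch under stated assumptions: `benchmark` and `fake_model` are hypothetical names, and `fake_model` is a CPU stand-in for a real inference call rather than an actual model.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 10, iters: int = 100) -> Dict[str, float]:
    """Time a zero-argument callable and return latency statistics in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not timed

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    median = statistics.median(samples)
    # Simple index-based 95th percentile over the sorted samples.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    fps = 1000.0 / median if median > 0 else float("inf")
    return {"median_ms": median, "p95_ms": p95, "fps": fps}


def fake_model() -> int:
    """Hypothetical stand-in workload; replace with a real inference call."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = benchmark(fake_model)
    print({key: round(value, 3) for key, value in stats.items()})
```

Reporting the median alongside a tail percentile keeps a single slow iteration from distorting the result. When the workload runs asynchronously on a GPU, the device has to be synchronized before each timer read, otherwise the numbers only reflect kernel launch time.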

Technical AI Consultation

Focused Technical Sessions

Scheduled, topic-focused sessions for researchers, engineers, and organizations working on computer vision systems.

Architecture selection & review
Debugging training issues
Deployment planning & optimization
Evaluation methodology
Experimentation strategy

Basic • 1-Hour Session

Targeted discussion around a specific challenge, deployment issue, architecture decision, or experimental problem.

Standard • 2-Hour Session

Extended technical review covering model design, debugging, deployment considerations, or experimental analysis.

Consultation scope and engagement structure are determined based on project requirements.

Start a Project Discussion

Books

Detection & Segmentation.
Structured Learning Tracks.

Choose a structured track for object detection or semantic segmentation from foundations to real-world experiments and architecture design.

Semantic Segmentation Track

Three books covering segmentation foundations, training workflows, debugging, architecture design, and experiment analysis.

Explore Track

Object Detection Track

Three books covering YOLO workflows, dataset preparation, training, evaluation, architecture design, and deployment-oriented experimentation.

Explore Track

Books

Semantic Segmentation
Book Track.

A three-book path covering segmentation foundations, real-world training, debugging, architecture design, and experiment analysis.

★★★★★ Beginner

Practical Semantic Segmentation with Deep Learning

A Step-by-Step Guide for Beginners

By the end of this book, you'll be able to:

Understand the fundamentals of semantic segmentation and pixel-wise prediction
Prepare segmentation datasets, masks, and training pipelines correctly
Train and evaluate U-Net models using standard segmentation metrics
Interpret prediction errors and common segmentation failure cases
Build a complete semantic segmentation project from dataset to inference

PDF • 46 pages • Companion Resources Included


★★★★★ Intermediate

Practical Semantic Segmentation with Deep Learning

Training Models on Real-World Data

By the end of this book, you'll be able to:

Analyze real-world segmentation problems before model development
Build reliable training pipelines for domain-specific datasets
Diagnose and resolve common training and evaluation issues
Improve segmentation performance through practical experimentation
Develop segmentation workflows suitable for real-world applications

PDF • 50 pages • Companion Resources Included


★★★★★ Advanced

Practical Semantic Segmentation with Deep Learning

Advanced Architecture Design and Experiments

By the end of this book, you'll be able to:

Analyze modern segmentation architectures and their design principles
Design efficient feature extraction and feature fusion modules
Evaluate architectural improvements through systematic experimentation
Balance accuracy, efficiency, and computational complexity
Develop research-oriented segmentation models for real-world deployment

PDF • 54 pages • Companion Resources Included


Books

Object Detection
Book Track.

A three-book path covering YOLO foundations, custom dataset training, evaluation workflows, architecture design, and deployment-oriented experimentation.

★★★★★ Beginner

Practical Object Detection with YOLO

A Step-by-Step Guide for Beginners

By the end of this book, you'll be able to:

Understand modern object detection and the YOLO workflow
Prepare datasets and annotations for reliable model training
Train and evaluate YOLO models using standard detection metrics
Interpret prediction errors and improve detection performance
Build a complete object detection project from dataset to inference

PDF • 44 pages • Companion Resources Included


★★★★★ Intermediate

Practical Object Detection with YOLO

Training Models on Real-World Datasets

By the end of this book, you'll be able to:

Analyze real-world detection problems and domain-specific datasets
Build reliable training pipelines for practical detection tasks
Adapt pretrained models to new domains through transfer learning
Improve detection performance for challenging real-world scenarios
Develop end-to-end detection workflows for deployment

PDF • 59 pages • Companion Resources Included


★★★★★ Advanced

Practical Object Detection with YOLO

Advanced Detection Architecture Design and Experiments

By the end of this book, you'll be able to:

Analyze the design principles of modern object detectors
Design lightweight modules for efficient feature extraction and fusion
Evaluate architectural improvements through rigorous experimentation
Balance detection accuracy with computational efficiency
Develop research-oriented detection models for practical deployment

PDF • 76 pages • Companion Resources Included


Blog

Technical Articles
Insights, Engineering, & Experiments.

Research-driven articles on semantic segmentation, object detection, architecture design, and deployment-aware engineering.

BCE vs BCE+Dice sensitivity training curve

Debugging & Failure Analysis

Why Your Segmentation Model Predicts Only Background

Root-cause analysis of foreground collapse, class imbalance, loss selection, and prediction diagnostics in segmentation systems.
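
A toy example makes the collapse concrete: on an imbalanced mask, a model that predicts only background still scores high pixel accuracy while its Dice score is zero. The sketch below is an illustration under stated assumptions, not the article's experiment: the 5% foreground ratio and both helper functions are hypothetical.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels where the prediction matches the target."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def dice_score(pred, target):
    """Dice coefficient for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * intersection / denominator


# Hypothetical flattened mask: 1000 pixels, 5% foreground.
target = [1] * 50 + [0] * 950
collapsed_pred = [0] * 1000  # the model predicts background everywhere

if __name__ == "__main__":
    print("pixel accuracy:", pixel_accuracy(collapsed_pred, target))
    print("dice:", dice_score(collapsed_pred, target))
```

Accuracy rewards the majority class, so it hides the failure; an overlap metric on the foreground class exposes it immediately.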

Read Article

Practical Guide

How to Train an Object Detection Model on a Custom Dataset

Practical decisions behind dataset setup, configuration, evaluation, and common failure points in detection experiments.
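
One dataset-setup decision that often goes wrong is the box format. The sketch below converts a pixel-space box given as top-left corner plus size into the normalized center format used by YOLO-style text labels. It is a minimal sketch under assumptions: the input convention (left, top, width, height in pixels) and the helper names are hypothetical, so check the actual annotation spec of your dataset before relying on it.

```python
def to_yolo_box(left, top, width, height, img_w, img_h):
    """Convert a pixel box (left, top, width, height) to normalized (xc, yc, w, h)."""
    x_center = (left + width / 2) / img_w
    y_center = (top + height / 2) / img_h
    return x_center, y_center, width / img_w, height / img_h


def format_label(class_id, box):
    """Render one label line: class id followed by four normalized values."""
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in box)


if __name__ == "__main__":
    # Hypothetical annotation: a 200x100 px box at (100, 50) in a 1000x500 image.
    box = to_yolo_box(100, 50, 200, 100, img_w=1000, img_h=500)
    print(format_label(3, box))
```

Mixing up corner and center conventions, or forgetting to normalize by image size, produces labels that load without errors but train a model on shifted boxes.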

Read Article

Architecture Explained

U-Net Explained Clearly (With a Practical Training Example)

Why U-Net is shaped the way it is, how skip connections solve the spatial precision problem, and what training on CVC-ClinicDB actually looks like across 100 epochs.

Read Article

Debugging & Failure Analysis

Common Mistakes in Object Detection Training That Kill Performance

Four training mistakes that are invisible in your loss curves: wrong annotations, harmful augmentation, misleading evaluation, and hidden overfitting, with specific symptoms and fixes for each.
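
Annotation problems of this kind can often be caught before training with a simple label check. The sketch below validates YOLO-style label lines (class id plus four normalized values) held in memory; it is a hedged illustration, not the article's companion code. `check_label_line` and `check_dataset` are hypothetical helpers, and reading the label files from disk is left out.

```python
def check_label_line(line, num_classes):
    """Return a list of problems found in one YOLO-style label line."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(value) for value in parts[1:]]
    except ValueError:
        return ["non-numeric field (class id must be an integer)"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside [0, {num_classes - 1}]")
    if any(not 0.0 <= value <= 1.0 for value in coords):
        problems.append("coordinate outside [0, 1]")
    if coords[2] <= 0 or coords[3] <= 0:
        problems.append("non-positive width or height")
    return problems


def check_dataset(labels, num_classes):
    """labels maps image name -> list of label lines; returns only images with issues."""
    report = {}
    for image_name, lines in labels.items():
        if not lines:
            report[image_name] = ["no labels (confirm this is a true background image)"]
            continue
        problems = []
        for number, line in enumerate(lines, start=1):
            problems += [f"line {number}: {p}" for p in check_label_line(line, num_classes)]
        if problems:
            report[image_name] = problems
    return report


if __name__ == "__main__":
    # Hypothetical in-memory labels for three images.
    labels = {
        "a.jpg": ["0 0.5 0.5 0.2 0.2"],
        "b.jpg": [],
        "c.jpg": ["3 0.5 0.5 1.2 0.2"],
    }
    for name, issues in check_dataset(labels, num_classes=2).items():
        print(name, issues)
```

An empty label file is not always an error, so the check flags it for review instead of rejecting it.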

Read Article

Selected Projects

Selected Projects.

A curated portfolio of computer vision research and engineering projects built from peer-reviewed work, lightweight deep learning, and deployment-aware vision systems.

IEEE Trans. Instrumentation & Measurement · Vol. 74 · 2025 · Segmentation

DFF-UNet

DFF-UNet: A Lightweight Deep Feature Fusion U-Net Model for Skin Lesion Segmentation

RFRE encoder with adaptive 3×3 and 1×1 convolutions, PDCP bridge with parallel dilated convolutions at rates (1,3,5), and BSFD decoder using bottleneck blocks and depthwise separable convolutions.
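
Two of the building blocks named here have simple closed-form properties that explain their appeal in lightweight designs. The sketch below computes the effective kernel size of a dilated convolution and compares the weight count of a standard convolution with a depthwise separable one. It is a minimal sketch under assumptions not stated in the source: 3×3 kernels in the dilated branches, a hypothetical 64-channel layer, and no bias terms.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Dilation rates (1, 3, 5) as listed; the 3x3 kernel size is an assumption.
    for rate in (1, 3, 5):
        size = effective_kernel(3, rate)
        print(f"dilation {rate}: effective kernel {size}x{size}")

    # Hypothetical layer: 64 input channels, 64 output channels, 3x3 kernel.
    standard = conv_params(64, 64, 3)
    separable = separable_conv_params(64, 64, 3)
    print(f"standard {standard} vs separable {separable} weights "
          f"({separable / standard:.1%} of standard)")
```

Parallel branches at different dilation rates widen the receptive field without adding weights per branch, and the separable form cuts the weight count of each 3×3 layer by roughly a factor of eight at this channel width.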

mIoU 0.7926 · Dice 0.8843 · 0.190M params · 0.109 GFLOPs · 1.30 ms
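
For readers comparing the two overlap metrics: on a single binary mask, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU). The sketch below implements that identity with two hypothetical helpers. The identity is exact per mask; for scores averaged over a dataset it is only approximate, although the IoU and Dice values listed above happen to be consistent with it.

```python
def dice_from_iou(iou: float) -> float:
    """Dice coefficient implied by an IoU value for a single binary mask."""
    return 2 * iou / (1 + iou)


def iou_from_dice(dice: float) -> float:
    """Inverse mapping: IoU implied by a Dice coefficient."""
    return dice / (2 - dice)


if __name__ == "__main__":
    reported_iou = 0.7926  # value listed for DFF-UNet above
    print(f"IoU {reported_iou} -> Dice {dice_from_iou(reported_iou):.4f}")
```

Because Dice is always at least as large as IoU for the same prediction, the two numbers should not be compared across papers without checking which metric each one reports.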

97.55% fewer parameters and 99.20% fewer GFLOPs than U-Net. Best on ISIC2018. Validated on 4 public datasets.

IEEE Xplore → View Full Details →

Advanced Engineering Informatics · Vol. 65 · 2025 · Segmentation

LiteFusionNet

Advancing Road Safety: A Lightweight Feature Fusion Model for Robust Road Crack Segmentation

Lightweight encoder with residual blocks, DGA, and DVM modules; EASPP bridge for multi-scale crack feature extraction; Dual-Level Decoder with 3×3 convolutions and channel attention.

Dice 0.7828 · mIoU 0.6431 · 0.493M params · 0.397 GFLOPs

Outperforms SegNet (29.44M params) with 98.3% fewer parameters. Best Dice on Crack500, DeepCrack, and RCD.
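
The "fewer parameters" figure is a relative reduction, which can be reproduced from the two parameter counts quoted above. A minimal sketch; `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (1 - value / baseline) * 100


if __name__ == "__main__":
    segnet_params_m = 29.44        # SegNet, millions of parameters (as quoted above)
    litefusionnet_params_m = 0.493  # LiteFusionNet, millions of parameters
    reduction = percent_reduction(segnet_params_m, litefusionnet_params_m)
    print(f"parameter reduction vs SegNet: {reduction:.1f}%")
```

The same helper applies to GFLOPs or latency comparisons, provided both numbers are measured at the same input resolution and on the same hardware.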

ScienceDirect → View Full Details →

IEEE Geoscience & Remote Sensing Letters · Vol. 22 · 2025 · Detection

VDXNet

VDXNet: A Novel Lightweight Deep Learning Model for Vehicle Detection With Aerial Images

RxDF module with asymmetric depthwise separable convolutions replaces C3K2; LiteFPP with selective pooling at 5×5 and 9×9 replaces SPPF; CRDown replaces Conv at P5.

mAP 96.3% (UCAS-AOD) · mAP 98.9% (UAVDT) · 1.608M params · 539 FPS

37.72% fewer parameters than YOLO11n while improving mAP by 0.52%. Best across all four aerial datasets.

IEEE Xplore → View Full Details →

Intelligent Data Analysis · SAGE/IOS · 2025 · Detection

LiteFODNet

LiteFODNet: A Lightweight Deep Learning Model for Intelligent Detection of Small Objects in Runway Surveillance Data

CMSP with parallel atrous convolutions replaces SPPF; SCR replaces Conv at P5; FFM for channel-wise calibration; SPA with decoupled height/width attention for small object localization.

mAP@50-95 0.681 · 2.515M params · 1.3 ms

+0.89% mAP vs YOLOv8n with −16.4% parameters and −27.8% inference time.

SAGE → View Full Details →

Expert Systems with Applications · Elsevier · Vol. 267 · 2025 · Review & Dataset

CigDet · SURRONE

Deep Learning-Based Smoker Classification and Detection: An Overview and Evaluation

Comprehensive survey introducing CigDet — a novel annotated dataset for cigarette localization (open-source, Mendeley Data). Benchmarks YOLOv1 through YOLO11. Proposes the SURRONE drone-based outdoor surveillance framework.

YOLOv9 mAP 83.50% · YOLO11 mAP 81.50% · 11 YOLO variants · 557 annotated images

First open-source annotated dataset for cigarette localization. First benchmark of all YOLO variants (v1–v11).

ScienceDirect → View Full Details →

Discuss Computer Vision Work

Resources

Technical Resources.

Protected book companion files, blog code utilities, and curated engineering resources for computer vision workflows.

Book companion resources are organized by track. Resources are provided via post-purchase email.

Semantic Segmentation Track

Companion files for the semantic segmentation book series: loss functions, metric helpers, mask utilities, U-Net references, architecture modules, and experiment support scripts.

Book Resources

View Track Resources

Object Detection Track

Companion files for the object detection book series: dataset YAML templates, annotation checks, prediction helpers, evaluation utilities, ONNX export examples, and configuration references.

Book Resources

View Track Resources

Companion resources accompanying published Occulins articles, including selected implementations, utilities, configurations, and supporting materials for practical experimentation and workflow understanding.

Blog 1 Companion Resources

Supporting resources for segmentation debugging and foreground-collapse analysis, including U-Net implementation, loss functions, metric utilities, visualization helpers, foreground diagnostics, and experiment configuration.

Companion Assets

View Resources Read Article

Blog 2 Companion Resources

Supporting resources for custom YOLO training workflows on VisDrone, including dataset validation, YAML configuration, training utilities, prediction workflows, and evaluation guidance.

Companion Assets

View Resources Read Article

Blog 3 Companion Resources

Supporting resources for the U-Net architecture explanation, including model implementation, DoubleConv block, architecture configuration, skip-connection reference, and selected plotting utilities.

Companion Assets

View Resources Read Article

Blog 4 Companion Resources

Supporting resources for object detection failure analysis, including missing-label checks, YAML split verification, mAP comparison logic, overfitting diagnostics, and evaluation sanity checks.

Companion Assets

View Resources Read Article

Curated engineering resources for computer vision research, deep learning experimentation, edge AI deployment, robotics, UAV systems, and applied AI workflows.

Some links on this page may be affiliate links in the future. Occulins may earn a small commission at no additional cost to you. Recommendations are selected based on technical relevance, practical workflow value, and suitability for computer vision or edge AI experimentation.

Compute & Training Hardware

Resource	Best For	Technical Notes	Status
RTX 4090 24GB / RTX 5090 32GB	Heavy segmentation and detection training	Higher VRAM capacity supports larger input resolutions, heavier architectures, and larger batch sizes for deep learning experiments.	Coming Soon
RTX 4080/5080 16GB	Mid-to-high range experimentation	Useful for moderate segmentation and detection workloads with lower cost and power requirements than flagship GPUs.	Coming Soon
RTX 4070/5070 12GB	Lightweight to moderate AI research	Suitable for lightweight computer vision experimentation, architecture debugging, and deployment-focused model development.	Coming Soon
NVMe SSD 2TB	Dataset storage and fast experiment loading	Useful when working with image datasets, checkpoints, logs, and repeated training runs.	Coming Soon

Edge AI & Deployment Devices

Resource	Best For	Technical Notes	Status
Jetson Orin Nano	TensorRT FP16 deployment experiments	Useful for lightweight object detection and segmentation inference testing on embedded AI hardware.	Coming Soon
Jetson Orin NX	Higher-performance edge AI deployment	Suitable for heavier real-time computer vision workloads where embedded inference performance matters.	Coming Soon
Raspberry Pi AI Kit	Low-power AI prototyping	Useful for small embedded AI demonstrations and lightweight inference experiments.	Coming Soon

Vision Sensors, Robotics & UAV Hardware

Resource	Best For	Technical Notes	Status
USB / RGB Camera Module	Real-time detection and segmentation demos	Useful for prototyping image acquisition pipelines and live computer vision experiments.	Coming Soon
Thermal Camera Module	Multimodal and environmental monitoring	Relevant for RGB-thermal object detection, fire monitoring, surveillance, and low-light perception workflows.	Coming Soon
AI Robotic Car Kit	Autonomous driving demonstrations	Useful for lane detection, obstacle detection, small-scale perception, and robotics vision experiments.	Coming Soon
UAV / Drone Platform	Aerial analytics and remote sensing experiments	Useful for UAV-based crack detection, aerial object detection, forest monitoring, and environmental inspection workflows.	Coming Soon

Books & Technical References

Resource	Best For	Technical Notes	Status
Deep Learning Reference Book	Foundational understanding	Useful for building stronger intuition around neural networks, optimization, and representation learning.	Coming Soon
Computer Vision Reference Book	Computer vision fundamentals	Useful for understanding classical and modern vision concepts before moving into applied deep learning systems.	Coming Soon
PyTorch Reference Book	Implementation and experimentation	Useful for researchers and engineers implementing models, training loops, debugging, and deployment pipelines.	Coming Soon

Resource Policy

Protected book resources may include selected model files, configuration files, modules, architecture references, and companion assets. Full proprietary repositories, commercial training frameworks, and complete production pipelines are not publicly distributed unless explicitly stated otherwise. Blog companion code may be released separately as open source where applicable.

Recommendations and external resource references are included for educational, research, and workflow guidance purposes. Some links may be affiliate links.

Affiliate Disclosure

About

About Occulins.

Computer vision research and engineering for deployment-aware systems built for real-world constraints.

Occulins is a computer vision research and engineering lab specializing in lightweight deep learning systems for semantic segmentation, object detection, and deployment-aware vision engineering.

Our work focuses on designing efficient computer vision architectures that balance accuracy, inference speed, parameter efficiency, and deployment constraints across medical imaging, aerial analytics, intelligent transportation, infrastructure monitoring, industrial inspection, aviation safety, and edge AI systems.

Built on years of research experience in deep learning and computer vision, Occulins focuses on translating research into practical engineering workflows. Every architecture is developed with practical constraints in mind, including latency, memory usage, computational efficiency, embedded inference, and real-world deployment requirements.

Occulins is evolving toward a full computer vision research and engineering lab focused on intelligent vision systems, scalable AI infrastructure, and practical computer vision technologies for industry and research-driven applications.

We collaborate on computer vision problems requiring both research depth and practical deployment awareness across constrained and real-world environments.

Lightweight Architecture Design

Parameter-efficient architectures designed for balanced accuracy, computational efficiency, reduced memory footprint, and deployment-aware inference.

Semantic Segmentation

Encoder-decoder systems for pixel-level scene understanding across medical, infrastructure, aerial, and industrial imaging domains.

Object Detection

Real-time detection systems for visual monitoring, small object analysis, dense scenes, and deployment-constrained computer vision applications.

Inference Optimization & Deployment

Model optimization, deployment workflows, TensorRT acceleration, ONNX export, FP32/FP16 benchmarking, and hardware-aware performance evaluation.

Computer Vision Research & Engineering

Computer vision research, architecture design, and system development for domain-specific applications across research and industry.

Contact

Start a Project Discussion.

Focused technical consultation and development support for applied computer vision systems.

Get in Touch

CigDet dataset · YOLO benchmarking · Drone-based framework · Environmental monitoring


Application Domain

Infrastructure Monitoring.

Automated visual analysis of civil infrastructure using lightweight computer vision, enabling detection and segmentation of surface conditions, structural elements, and anomalies at scale.

Infrastructure AI

Computer vision for infrastructure monitoring.

Infrastructure assets degrade continuously through weather, load cycles, and material fatigue. Lightweight vision models enable automated detection of road cracks, bridge spalling, corrosion, and structural deformation from standard camera and drone imagery, providing consistent coverage across large asset portfolios where manual inspection is impractical. Models are designed to operate across varied surface types and imaging conditions, supporting deployment in both fixed-camera monitoring systems and UAV-based inspection workflows.

Surface analysis · Structural inspection · Defect localization · Automated monitoring

Infrastructure Monitoring segmentation output

Object Detection

Detection and localization of surface defects, damage types, and structural anomalies for infrastructure health assessment and maintenance prioritization.

Semantic Segmentation

Pixel-level segmentation of road surfaces, bridge decks, and structural facades, enabling precise crack boundary delineation, spalling localization, and defect extent quantification for maintenance prioritization.

Benefits

Why Computer Vision Matters
in Infrastructure Monitoring.

01 — Detect defects before they become failures

Automated crack and spalling detection identifies surface deterioration weeks or months before it reaches the threshold requiring emergency repair, reducing the likelihood of costly structural failures and unplanned road closures.

02 — Cut field survey time significantly

Image-based inspection of roads, bridges, and facades processes in seconds what takes field teams hours on-site. Coverage that previously required days of manual survey can be completed from drone or vehicle-mounted camera footage in a fraction of the time.

03 — Scale across entire road and asset networks

Lightweight models handle thousands of images per day without proportional increases in cost or staffing, making city-scale crack mapping and bridge monitoring operationally feasible for the first time.

04 — Reach assets that are unsafe to inspect manually

Elevated structures, active carriageways, and confined spaces that carry significant risk for inspection personnel can be assessed remotely using UAV-deployed vision systems with no access constraints.


Application Domain

Medical Imaging.

Lightweight AI for clinical image analysis, enabling automated detection and segmentation of regions of interest across a range of medical imaging modalities.

Medical AI

Computer vision for medical imaging.

Medical image analysis requires high sensitivity to subtle boundaries and strong generalization across patient populations and imaging conditions. Lightweight architectures designed for clinical deployment provide lesion detection, organ segmentation, and pathology localization across dermatology, histology, and radiology imaging without GPU-heavy infrastructure. Models are built to generalize across imaging protocols and to integrate into clinical review workflows without specialized hardware.

Clinical image analysis · Region of interest detection · Diagnostic support · Multi-modality analysis

Object Detection

Detection and localization of cellular structures, clinical regions of interest, and pathological markers within medical image analysis workflows.

Semantic Segmentation

Pixel-level delineation of lesion boundaries, organ contours, and pathological regions, supporting precise area measurement, clinical grading, and multi-dataset generalization across imaging modalities.

Benefits

Why Computer Vision Matters
in Medical Imaging.

01 — Reduce variability in clinical image assessment

Automated segmentation and detection provide a consistent baseline across clinicians, imaging sessions, and patient cohorts, reducing the inter-observer disagreement that affects manual region-of-interest identification in dermoscopy, pathology, and radiology.

02 — Quantify lesion boundaries with pixel precision

Segmentation models delineate lesion margins at the pixel level, enabling objective area measurement and boundary characterization that supports clinical grading, treatment response tracking, and longitudinal comparison across patient visits.

03 — Deploy without specialist infrastructure

Architectures with fewer than one million parameters run on standard clinical workstation hardware without dedicated GPU infrastructure, making AI-assisted image analysis accessible outside large academic medical centers.


Application Domain

UAV & Aerial Analytics.

Real-time detection and segmentation for UAV platforms, satellite imagery, and aerial imaging systems, enabling automated analysis of scenes, objects, and land cover from elevated viewpoints.

Aerial Vision

Computer vision for aerial analytics.

Aerial imaging introduces unique challenges: small object size, altitude-dependent scale variation, dense scene clutter, and the need for onboard real-time inference under tight power and compute constraints. Lightweight architectures address vehicle detection in satellite imagery, foreign object localization on airport runways, and dense scene segmentation from drone platforms, and are designed for edge-deployable aerial perception across a range of operational altitudes and imaging conditions.

Aerial object detection · Scene segmentation · Remote sensing · Small object detection

UAV & Aerial Analytics segmentation output

Object Detection

Real-time localization of objects, vehicles, and structures in aerial and satellite imagery, enabling surveillance, monitoring, and remote sensing at scale.

Semantic Segmentation

Dense pixel classification of aerial scenes into land cover categories, structures, and surfaces, supporting change detection, area estimation, and environment mapping from UAV and satellite platforms.

Benefits

Why Computer Vision Matters
in UAV & Aerial Analytics.

01 — Cover in minutes what ground teams take days to survey

UAV-mounted detection systems survey large areas (transmission corridors, pipelines, coastlines, agricultural land) in a single flight, compressing inspection timelines that would otherwise require days of ground-level access.

02 — Detect objects too small for conventional analysis

Specialized lightweight architectures address the small object detection challenge inherent to aerial imagery, reliably localizing pedestrians, vehicles, and foreign objects that occupy only a handful of pixels at operational altitude.

03 — Process onboard without ground link dependency

Edge-optimized models run inference directly on UAV hardware, eliminating the need to transmit raw video to ground stations for processing, reducing bandwidth requirements and enabling real-time decision-making in the field.

04 — Operate consistently across altitude and conditions

Models trained on varied aerial data generalize across imaging altitude, lighting conditions, and scene density, maintaining reliable detection performance from low-altitude close inspection through to high-altitude wide-area surveillance.


Application Domain

Intelligent Transportation.

Computer vision for road scene understanding, traffic analysis, and transportation monitoring, supporting safer and smarter infrastructure through automated visual analysis.

Transportation AI

Computer vision for intelligent transportation.

Transportation environments are fast-moving, visually complex, and safety-critical. Automated vision systems detect vehicles, pedestrians, road markings, and hazard conditions from fixed cameras, dashcams, and roadside sensors, providing continuous monitoring without human operators. Lightweight architectures designed for edge deployment enable real-time scene analysis on in-vehicle hardware and roadside compute units, supporting applications from traffic management to road condition assessment.

Road scene analysis · Traffic monitoring · Object detection · Autonomous systems

Intelligent Transportation segmentation output

Object Detection

Detection of vehicles, road users, and objects of interest in real-time transportation monitoring and road scene analysis applications.

Semantic Segmentation

Pixel-wise labeling of road surfaces, lane markings, vehicles, and pedestrians, providing the dense scene understanding required for road condition assessment and autonomous vehicle perception pipelines.

Benefits

Why Computer Vision Matters
in Intelligent Transportation.

01 — Monitor traffic continuously without human operators

Automated detection and scene analysis run 24/7 across fixed camera networks, providing continuous coverage of junctions, motorways, and urban corridors without the staffing cost of manual video monitoring.

02 — Respond to incidents faster

Real-time detection of stopped vehicles, pedestrian incursions, and road hazards enables faster alert generation for traffic management centers, reducing the window between incident occurrence and operator response.

03 — Assess road condition at scale

Dense segmentation of road surfaces and markings from vehicle-mounted cameras provides structured condition data across entire road networks, supporting evidence-based maintenance planning without dedicated inspection campaigns.

← Back to Home

Application Domain

AI for Environmental Monitoring.

Computer vision for ecological surveillance, hazard detection, and environmental analysis across outdoor and natural settings.

Environmental AI

Computer vision for environmental monitoring.

Environmental monitoring at scale requires processing large volumes of aerial, satellite, and ground-level imagery to detect hazards, track ecological change, and support emergency response. Vision models trained on outdoor and natural scene data identify fire fronts, smoke plumes, flood boundaries, vegetation loss, and industrial anomalies, providing automated alerts and spatial analysis for agencies operating across geography too large for ground-based survey.

Hazard detection | Anomaly detection | Ecological surveillance | Environmental analysis

AI for Environmental Monitoring segmentation output

Object Detection

Detection and localization of environmental hazards, anomalies, and objects of interest in outdoor scenes for ecological surveillance and safety monitoring.

Semantic Segmentation

Pixel-level delineation of fire fronts, smoke plumes, flood extent, and vegetation coverage, enabling precise area quantification for hazard mapping, ecological assessment, and environmental change tracking.

Benefits

Why Computer Vision Matters
in AI for Environmental Monitoring.

01 — Detect wildfires earlier from aerial imagery

Automated smoke and fire detection in UAV and satellite imagery identifies fire ignition and spread earlier than ground-based observation, extending the time available to deploy suppression resources before a fire becomes uncontrollable.

02 — Map hazard extent with spatial precision

Segmentation models delineate fire fronts, flood boundaries, and erosion zones at the pixel level, providing accurate area estimates and spatial maps that support evacuation planning, damage assessment, and resource allocation.

03 — Monitor large environments continuously

Lightweight models applied to satellite and drone imagery enable ongoing surveillance of forests, wetlands, and coastlines at a geographic scale that ground-based observation cannot match, detecting gradual ecological change alongside acute hazard events.

← Back to Home

Legal

Privacy Policy.

Information Collection

Occulins may collect limited information through contact forms, analytics tools, newsletter subscriptions, external integrations, or services used on this website.

Collected information may include names, email addresses, browser information, device information, interaction data, and technical usage information related to website activity.

Use of Information

Collected information may be used to improve website functionality, respond to inquiries, maintain platform security, analyze website performance, process requested resources, support communication related to services and technical content, and provide updates regarding Occulins resources and offerings.

Cookies & Analytics

This website may use cookies, analytics services, or related technologies to understand website usage, improve user experience, monitor website performance, and maintain platform functionality.

Third-Party Services

Third-party services such as analytics providers, payment processors, embedded media platforms, affiliate platforms, or external integrations may use cookies, tracking technologies, or related scripts as part of their own services.

External Links

External websites linked through this website operate under their own policies, terms, and privacy practices. Occulins is not responsible for external platforms, third-party websites, or their policies and services.

Data Protection

Reasonable technical and administrative measures are used to protect collected information; however, no online platform, network, or digital system can guarantee absolute security.

Policy Updates

Occulins reserves the right to update, modify, or revise this Privacy Policy at any time without prior notice.

Agreement

By using this website, you acknowledge and agree to this Privacy Policy.

For legal, privacy, or policy-related inquiries, please use the contact page provided on this website.

Disclosure

Affiliate Disclosure.

Affiliate Relationships

Some links, resources, tools, books, hardware references, or recommended products presented on Occulins may be affiliate links. If purchases are made through these links, Occulins may earn a small commission at no additional cost to the user.

Recommendation Philosophy

Resources referenced on this website are selected based on technical relevance, engineering workflow value, deployment considerations, research practicality, or applicability to computer vision, deep learning, edge AI, robotics, UAV systems, or applied AI engineering workflows.

Affiliate relationships do not influence technical opinions, research discussions, engineering evaluations, or educational content presented on this website.

External Products & Services

Occulins does not manufacture, control, or guarantee third-party products, platforms, services, or external resources referenced through affiliate or external links. Users are responsible for evaluating products, compatibility, pricing, availability, and suitability before making purchases or technical decisions.

External Platforms

Third-party websites, marketplaces, payment systems, or affiliate platforms operate under their own policies, terms, and privacy practices. Occulins is not responsible for the content, availability, security, or practices of external services or websites.

Disclosure Updates

Occulins reserves the right to modify, update, or revise this Affiliate Disclosure at any time without prior notice.

Agreement

By using this website, you acknowledge and agree to this Affiliate Disclosure.

For policy-related inquiries, please use the contact page provided on this website.

Legal

Terms of Service.

Website Use

By accessing or using Occulins, you agree to comply with these Terms of Service. If you do not agree with any part of these terms, please do not use this website.

Occulins provides research content, engineering resources, technical articles, digital materials, and computer vision-related information for educational, informational, and professional purposes.

Content & Intellectual Property

All content, visuals, resources, models, documents, books, branding, engineering assets, and materials published on this website remain the intellectual property of Occulins unless otherwise stated.

Protected books, digital products, premium resources, technical documents, architecture designs, configuration files, code modules, and companion materials distributed through Occulins are licensed for authorized personal or organizational use only unless explicitly stated otherwise.

Occulins reserves all rights related to its published materials, engineering resources, research content, technical assets, branding, and proprietary workflows.

Restrictions

Unauthorized reproduction, redistribution, public sharing, reselling, mirroring, commercial redistribution, repackaging, re-uploading, or unauthorized distribution of Occulins content, books, resources, source code, or proprietary engineering materials is strictly prohibited.

Users may not reproduce, redistribute, resell, commercially exploit, or claim ownership of protected content, digital resources, or proprietary materials without permission from Occulins.

Technical & Informational Disclaimer

Information presented on this website, including technical discussions, code references, deployment suggestions, research insights, architecture explanations, benchmark analyses, engineering recommendations, and educational materials, is provided for informational and educational purposes only and does not constitute professional, legal, financial, medical, or other specialized advice.

While reasonable efforts are made to maintain accurate and up-to-date information, Occulins makes no representations or warranties regarding the accuracy, completeness, reliability, suitability, availability, or performance of any content, resources, downloads, tools, code, or materials provided through this website.

Any use of information, downloads, resources, external tools, code snippets, implementation examples, or referenced workflows is undertaken solely at the user's own discretion and risk. Users are responsible for independently evaluating the suitability, safety, and applicability of any information or materials before use in research, development, commercial, or operational environments.

External Services & Links

External links, third-party platforms, payment providers, affiliate resources, embedded services, or recommended products are provided for convenience only. Occulins is not responsible for the content, availability, security, policies, or practices of external services or websites.

Limitation of Liability

Under no circumstances shall Occulins be liable for any direct, indirect, incidental, consequential, technical, financial, or operational damages arising from the use of this website, its content, downloads, resources, or referenced external services.

Policy & Content Updates

Occulins reserves the right to modify, update, remove, or discontinue content, resources, services, features, policies, or website sections at any time without prior notice.

Agreement

By continuing to use this website, you acknowledge and agree to these Terms of Service.

For legal, privacy, or policy-related inquiries, please use the contact page provided on this website.

Companion Resources

Semantic Segmentation
Companion Resources.

Companion resources for the semantic segmentation book track. Book PDFs are delivered through Gumroad after purchase, while companion resources are delivered separately by Occulins via email after purchase verification.

📩 Resource delivery: Companion resources are delivered separately by Occulins via email after purchase verification. Resource delivery is typically completed within 1–2 business days. If resources are not received within the expected timeframe, contact contact@occulins.com with your purchase receipt.

Book 1 Resources

Companion materials for segmentation fundamentals, U-Net architecture references, BCE-Dice loss examples, segmentation metrics, mask overlay tools, training configuration examples, and prediction visualization utilities.

Beginner

Get Book

Book 2 Resources

Companion materials for real-world segmentation training, dataset validation utilities, evaluation scripts, training diagnostics, debugging tools, prediction analysis utilities, and failure-case visualization examples.

Intermediate

Get Book

Book 3 Resources

Advanced companion materials for segmentation architecture design, DFF-UNet architecture references, RFRE, PDCP, and BSFD module descriptions, ablation study templates, experiment configuration examples, evaluation utilities, and published errata.

Advanced

Get Book

Resource Delivery Note

Companion resources are intended exclusively for verified book owners. After purchasing a book through Gumroad, the PDF is delivered by Gumroad, while the related companion resources are delivered separately by Occulins via email after purchase verification. Resource delivery is typically completed within 1–2 business days.

If resources are not received within the expected timeframe, contact us at contact@occulins.com with your purchase receipt.

← Back to Resources

Companion Resources

Object Detection
Companion Resources.

Companion resources for the object detection book track. Book PDFs are delivered through Gumroad after purchase, while companion resources are delivered separately by Occulins via email after purchase verification.

Book 1 Resources

Companion materials for object detection fundamentals, dataset configuration examples, label validation utilities, dataset integrity checking tools, training and validation templates, inference examples, and reproducibility utilities.

Beginner

Get Book

Book 2 Resources

Companion materials for transfer learning, domain adaptation, training configuration examples, evaluation checklists, benchmarking utilities, validation scripts, and real-world deployment workflows.

Intermediate

Get Book

Book 3 Resources

Advanced companion materials for detection architecture design, selected model configuration files, custom module references, ablation study templates, efficiency analysis utilities, deployment examples, benchmarking templates, and published errata.

Advanced

Get Book

Resource Delivery Note

If resources are not received within the expected timeframe, contact us at contact@occulins.com with your purchase receipt.

← Back to Resources

Research Blog

Why Your Segmentation Model
Predicts Only Background

By Dr. Ali Khan | Occulins

You train a segmentation model. Loss decreases. Validation accuracy looks stable. You feel good about where things are heading.

Then you check the actual predictions. Everything is black. No polyps detected. No boundaries. Just empty masks where the objects should be.

Or a more dangerous version of the same problem: the predictions look reasonable after 100 epochs of full training, but if you had stopped at epoch 20, which many practitioners do, especially when time or compute is limited, your model would have been missing 15% of the polyp pixels it was supposed to find. Stop after the first couple of epochs and it would have been missing more than 70%.

Both of these are symptoms of the same root cause. Understanding it precisely is what allows you to fix it rather than just adjust training until something works.

What Is Actually Happening

In semantic segmentation, the model assigns a class label to every pixel in the image. When the foreground class occupies a small fraction of pixels, the model can achieve high overall accuracy simply by predicting background everywhere. From the perspective of a pixel-wise loss function, this is a perfectly rational solution.

This is called a degenerate solution. The model is not broken; it found a local minimum that satisfies the training objective without learning to detect anything useful. The problem is that the training objective did not make foreground detection valuable enough to pull the model out of that minimum.

The Real Cause: Class Imbalance in Your Dataset

Figure 1

Foreground pixel fraction distribution across CVC-ClinicDB. Mean foreground: 9.2%. Median: 6.8%. 229 images (37%) have foreground below 5%. When foreground occupies this small a fraction of pixels, a pixel-wise loss function that treats background and foreground equally will consistently underweight the foreground signal.

The CVC-ClinicDB colonoscopy dataset has the following pixel distribution across its 612 images:

Mean foreground (polyp):   9.2%
Median foreground:         6.8%
Images below 5% foreground: 229 (37% of dataset)
Images below 10% foreground: 404 (66% of dataset)

At these fractions, a model that predicts background everywhere achieves pixel accuracy of 90 to 93%. That number will appear in your training logs and look completely reasonable. The foreground IoU will be near zero, but if you are only checking overall accuracy, or if your framework reports it prominently, you may not notice until you look at the actual predictions.
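The arithmetic behind that range can be checked directly. The sketch below is only an illustration, assuming a hypothetical predictor that outputs background for every pixel; the function name is made up for this example and the inputs are the mean and median foreground fractions quoted above.

```python
def all_background_scores(fg_fraction):
    """Pixel accuracy and foreground IoU when every pixel is predicted as background."""
    tp = 0.0                  # no foreground pixel is ever predicted
    fp = 0.0                  # so there are no false positives either
    fn = fg_fraction          # every true foreground pixel is missed
    tn = 1.0 - fg_fraction    # every background pixel is correct
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fg_iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return accuracy, fg_iou


for name, frac in [("mean", 0.092), ("median", 0.068)]:
    acc, iou = all_background_scores(frac)
    print(f"{name} foreground {frac:.1%}: accuracy={acc:.3f}, foreground IoU={iou:.3f}")
```

Accuracy is simply one minus the foreground fraction, which is why it looks healthy while foreground IoU sits at zero.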

Why Cross-Entropy Makes This Worse

Cross-entropy computes the loss at every pixel independently and takes the mean across all pixels. This means the total gradient signal is a weighted average, weighted by pixel count, of the per-pixel gradients. At 9% foreground fraction, background pixels contribute 91% of the gradient and foreground pixels contribute 9%.

The model receives ten times more information about how to classify background than how to classify foreground. It learns background first, fast, and confidently. It learns foreground slowly, noisily, and only after background is fully saturated.

This is not a flaw in cross-entropy. It is exactly how an unweighted average behaves on an imbalanced distribution. The flaw is applying it without modification when the imbalance is this severe.

Cross-entropy does not fail on imbalanced data because it is badly designed. It fails because it treats every pixel equally and equal treatment of an unequal distribution produces biased learning.
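A small calculation makes the 91/9 split concrete. This is a simplified sketch, not a measurement from the experiment: it assumes mean-reduced BCE on sigmoid outputs, where the gradient with respect to each logit is (p - y) / N, and an untrained model that predicts 0.5 for every pixel. The function name and arguments are illustrative.

```python
def foreground_gradient_share(fg_fraction, p_fg=0.5, p_bg=0.5):
    """Share of total BCE gradient magnitude (w.r.t. logits) that comes from foreground pixels.

    For mean-reduced BCE on sigmoid outputs, the per-pixel gradient with respect
    to the logit is (p - y) / N, so its magnitude is |p - y| / N.
    """
    fg_grad = fg_fraction * abs(p_fg - 1.0)          # foreground pixels have y = 1
    bg_grad = (1.0 - fg_fraction) * abs(p_bg - 0.0)  # background pixels have y = 0
    return fg_grad / (fg_grad + bg_grad)


share = foreground_gradient_share(0.09)
print(f"foreground share of gradient: {share:.2f}")
print(f"background share of gradient: {1 - share:.2f}")
```

With identical predictions everywhere, each pixel contributes the same gradient magnitude, so the split is set purely by pixel count. Passing different values for p_fg and p_bg lets you explore how the split shifts at other stages of training.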

The Experiment: Two Identical Models, Two Loss Functions

To make this concrete rather than theoretical, I ran two identical experiments on CVC-ClinicDB polyp segmentation. Same model (U-Net trained from scratch with no pretrained weights), same optimizer, same hyperparameters, same 80/20 train-test split with fixed seed. Only the loss function changed.

Experiment A used binary cross-entropy only. Experiment B used BCE combined with Dice loss.

Figure 2

Validation sensitivity across 100 training epochs for Experiment A (BCE only, red) and Experiment B (BCE + Dice, green) on CVC-ClinicDB. At epoch 2, BCE sensitivity drops to 0.27: the model is predicting almost entirely background. BCE + Dice reaches 0.68 sensitivity at epoch 2 and maintains high sensitivity throughout training.

Early Training — Where the Problem Is Most Visible

The sensitivity metric, the fraction of true polyp pixels the model correctly identifies, tells the most important part of the story:

Epoch	BCE Sensitivity	BCE Specificity	BCE+Dice Sensitivity	BCE+Dice Specificity
1	0.424	0.896	0.606	0.906
2	0.268	0.981	0.684	0.918
5	0.541	0.982	0.795	0.971
10	0.747	0.986	0.889	0.951
20	0.850	0.990	0.899	0.975
30	0.899	0.992	0.913	0.990

Look at epoch 2. BCE sensitivity drops to 0.268 while BCE specificity climbs to 0.981. The model is predicting background with increasing confidence, exactly the degenerate solution described above. BCE + Dice shows sensitivity of 0.684 at the same epoch. It is finding polyps from the very start because the Dice component makes foreground detection non-negotiable for the loss.

By epoch 10, BCE has recovered somewhat to 0.747 sensitivity, but BCE + Dice is already at 0.889. The gap is 14 percentage points at epoch 10, and it does not close until around epoch 30 to 40.

The early stopping danger: if you stop training at epoch 20 with BCE only, a common decision when compute is limited or when loss appears to have plateaued, your model has sensitivity of 0.85. It is missing 15% of true polyp pixels. With BCE + Dice at the same epoch, sensitivity is 0.899. The difference is not dramatic on paper. In a clinical colonoscopy screening context, it represents real missed pathology.

Final Results at Epoch 100

Configuration	Best Dice	Final IoU	Final Sensitivity	Final Specificity
BCE only	0.9245	0.8508	0.9177	0.9929
BCE + Dice	0.9277	0.8574	0.9220	0.9951

At full convergence after 100 epochs, the two models reach similar performance. BCE + Dice is slightly higher on every reported metric (Dice, IoU, sensitivity, and specificity), but the margins are small. BCE only reaches comparable sensitivity, though it took roughly 30 epochs longer to get there. Both are viable at epoch 100. The question is what happens if you stop earlier, and what happens during the 30 epochs where BCE is still catching up.

Figure 3

Best checkpoint predictions, same model, same data, different loss function. BCE only (red border): Dice=0.924, Sensitivity=0.917. BCE + Dice (green border): Dice=0.928, Sensitivity=0.922. Three representative colonoscopy frames showing both models detect polyps at convergence, with BCE + Dice achieving marginally better scores.

The Fix: Combined BCE and Dice Loss

The single most effective change is replacing cross-entropy alone with a combination of BCE and Dice loss. The code is straightforward:

import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    def __init__(self, eps=1.0):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        # pred is already sigmoid-activated
        num   = 2 * (pred * target).sum() + self.eps
        denom = pred.sum() + target.sum() + self.eps
        return 1 - num / denom

bce_loss  = nn.BCELoss()
dice_loss = DiceLoss()

def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)

BCE provides stable gradients throughout training, which is particularly important in early epochs when predictions are still near random. Dice ensures the foreground class cannot be overwhelmed by the background gradient signal. Together they are consistently more robust than either alone on imbalanced segmentation tasks.
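To see why the Dice term resists foreground collapse, it helps to evaluate both losses on a collapsed prediction and on a prediction that finds the foreground. The sketch below uses plain Python lists instead of tensors and reuses the Dice formula from DiceLoss above with eps=1.0. The 10,000-pixel image, the 9% foreground fraction, and the probability values are hypothetical numbers chosen for illustration.

```python
import math


def bce_and_dice(pred, target, eps=1.0):
    """Mean BCE and Dice loss for flat lists of probabilities and 0/1 labels."""
    n = len(pred)
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
               for p, t in zip(pred, target)) / n
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return bce, dice


# Hypothetical image: 10,000 pixels, 9% foreground
target = [1] * 900 + [0] * 9100
collapsed = [0.01] * 10000                 # confident background everywhere
detecting = [0.9] * 900 + [0.01] * 9100    # foreground found, background kept

for name, pred in [("collapsed", collapsed), ("detecting", detecting)]:
    bce, dice = bce_and_dice(pred, target)
    print(f"{name}: BCE={bce:.3f}  Dice loss={dice:.3f}")
```

For the collapsed prediction the Dice loss stays close to its maximum of 1 no matter how many background pixels are correct, because only overlap with the foreground reduces it. That is the pressure the combined loss adds on top of BCE.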

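One note on implementation: if your model applies torch.sigmoid internally in its forward method, as many U-Net implementations do, use nn.BCELoss, which expects probabilities in [0, 1]. If your model outputs raw logits, use nn.BCEWithLogitsLoss, which applies sigmoid internally. Mixing them up breaks training in two different ways. Feeding already-sigmoided outputs to nn.BCEWithLogitsLoss applies sigmoid twice, which squeezes the values the loss sees into a narrow band and yields very small gradients throughout training. Feeding raw logits to nn.BCELoss gives it values outside [0, 1], which it is not defined for.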
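The effect of applying sigmoid twice is easy to see numerically. This is a standalone sketch using only the standard library, with a handful of arbitrary logit values chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


for logit in (-10.0, -2.0, 0.0, 2.0, 10.0):
    once = sigmoid(logit)
    twice = sigmoid(once)   # what the loss sees if sigmoid is applied a second time
    print(f"logit={logit:+6.1f}  sigmoid={once:.4f}  double sigmoid={twice:.4f}")
```

Because the first sigmoid outputs values in (0, 1), the second one can only produce values between sigmoid(0) = 0.5 and sigmoid(1) ≈ 0.731. A confident background pixel and a confident foreground pixel end up barely distinguishable.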

Before You Change the Loss: Diagnose First

Not every blank prediction is a loss function problem. Before changing your training configuration, verify these three things.

Is your model actually in foreground collapse?

After 20 epochs, compute foreground IoU and sensitivity specifically, not overall pixel accuracy. Foreground IoU below 0.10 combined with overall accuracy above 88% is the signature of foreground underweighting. Overall accuracy above 90% on a dataset with 9% foreground is a warning sign, not a success signal.
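That check can be scripted. The sketch below is one possible way to do it, assuming predictions and masks have already been thresholded and flattened into 0/1 sequences. The function name and the dictionary layout are made up for this example, and the default thresholds are the 0.10 IoU and 88% accuracy figures quoted above.

```python
def foreground_report(pred, target, iou_floor=0.10, acc_ceiling=0.88):
    """Foreground IoU, sensitivity, and accuracy from flat 0/1 sequences, plus a collapse flag."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    eps = 1e-6  # avoids division by zero on empty masks
    iou = tp / (tp + fp + fn + eps)
    sensitivity = tp / (tp + fn + eps)
    accuracy = (tp + tn) / (tp + tn + fp + fn + eps)
    # Low foreground IoU together with high accuracy is the collapse signature
    collapsed = iou < iou_floor and accuracy > acc_ceiling
    return {"iou": iou, "sensitivity": sensitivity,
            "accuracy": accuracy, "collapsed": collapsed}


target = [1] * 9 + [0] * 91      # hypothetical image with 9% foreground
all_background = [0] * 100       # model predicts background everywhere
print(foreground_report(all_background, target))
```

In a real validation loop you would accumulate the four counts over the whole validation set before computing the ratios, rather than averaging per-image scores.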

Are your masks correct?

Open five random image-mask pairs and look at them directly. Confirm the masks contain the objects you expect, that they are not inverted, and that filenames sort in the same order for images and masks. Mismatched pairs are more common than expected and they corrupt the training signal silently.

Are your mask values what you think they are?

import numpy as np
from PIL import Image
import os

mask_dir = 'CVC-ClinicDB/masks'

for fname in sorted(os.listdir(mask_dir))[:5]:
    mask = np.array(
        Image.open(
            os.path.join(mask_dir, fname)
        ).convert('L')
    )
    fg = (mask > 127).sum() / mask.size
    print(f"{fname}: unique={np.unique(mask)} "
          f"fg_fraction={fg:.4f}")

Binary masks should have values 0 and 255 (before normalisation) or 0.0 and 1.0 (after). If the unique values are something unexpected (all zeros, all 255, or a range of intermediate values), fix the mask loading before adjusting anything else.

Monitoring Predictions During Training

Do not wait until epoch 100 to look at predictions. Save a prediction image at fixed intervals during training:

import os

import matplotlib.pyplot as plt
import torch

def save_prediction_sample(model, loader, epoch,
                            save_dir, device):
    model.eval()
    os.makedirs(save_dir, exist_ok=True)
    with torch.no_grad():
        images, masks = next(iter(loader))
        images = images.to(device)
        preds  = model(images)
        pred_mask = (preds[0].cpu().squeeze() > 0.5
                     ).float().numpy()
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    axes[0].imshow(images[0].cpu().permute(1,2,0))
    axes[0].set_title('Input'); axes[0].axis('off')
    axes[1].imshow(masks[0].squeeze(), cmap='gray')
    axes[1].set_title('Ground Truth'); axes[1].axis('off')
    axes[2].imshow(pred_mask, cmap='gray')
    axes[2].set_title(f'Epoch {epoch}'); axes[2].axis('off')
    plt.tight_layout()
    plt.savefig(f'{save_dir}/epoch_{epoch:03d}.png', dpi=120, bbox_inches='tight')
    plt.close()
    model.train()

Figure 4

BCE + Dice prediction progression on CVC-ClinicDB at epochs 1, 20, 50, and 100. Epoch 1: rough initial detection, partial polyp outline. Epoch 20: recognisable polyp shape with noisy boundaries. Epoch 50: clean boundaries, consistent detection. Epoch 100: precise segmentation matching ground truth closely.

The One Metric That Reveals Foreground Collapse

Sensitivity, the fraction of true foreground pixels correctly identified, is the metric that exposes this failure mode most clearly. Always track foreground sensitivity alongside your primary metrics. For binary medical segmentation especially, it is the number that tells you whether the model is clinically useful or not.

Summary

Symptom	Cause	Fix
All-black predictions	Complete foreground collapse under BCE	Replace with BCE + Dice combined loss
Low sensitivity in early training	BCE gradient dominated by background pixels	BCE + Dice — Dice component protects foreground signal
Good accuracy, poor IoU	Model predicting background accurately	Report foreground IoU and sensitivity, not overall accuracy
Training looks normal, predictions wrong	Wrong activation-loss pairing	Match BCELoss to sigmoid model, BCEWithLogitsLoss to logit model
Cannot tell what is happening	Only tracking aggregate metrics	Visualize predictions every 10 epochs, track sensitivity separately

The pattern in this experiment is consistent across datasets with moderate to severe foreground imbalance. BCE + Dice does not always produce dramatically higher final metrics after full convergence. What it consistently produces is faster, more reliable foreground detection, particularly in the first 30 epochs where BCE is still learning to find the foreground at all.

Selected implementations, supporting utilities, experiment configurations, and companion resources related to this article are available through the Blog 1 Companion Resources page.

Companion Resources Included

U-Net architecture implementation
Dice Loss implementation
BCE + Dice combined loss function
Metric calculation utilities
Prediction visualization utilities
Foreground imbalance inspection utilities
Example experiment configuration

Working on segmentation systems where metrics and predictions do not align?
Reach out through Occulins Contact for deployment-aware computer vision research and engineering support.

Tags: Semantic Segmentation Deep Learning Loss Functions Class Imbalance CVC-ClinicDB U-Net Dice Loss Medical Imaging Polyp Segmentation

Reference implementation of the standard U-Net architecture used during experiments.


import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(DoubleConv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels,
                      kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class UNET(nn.Module):
    def __init__(self, num_classes=1,
                 input_channels=3, **kwargs):
        super().__init__()
        nb_filter = [32, 64, 128, 256, 512]
        self.pool = nn.MaxPool2d(2, 2)

        # Encoder
        self.conv0_0 = DoubleConv(input_channels,
                                   nb_filter[0])
        self.conv1_0 = DoubleConv(nb_filter[0],
                                   nb_filter[1])
        self.conv2_0 = DoubleConv(nb_filter[1],
                                   nb_filter[2])
        self.conv3_0 = DoubleConv(nb_filter[2],
                                   nb_filter[3])
        self.conv4_0 = DoubleConv(nb_filter[3],
                                   nb_filter[4])

        # Bottleneck
        self.bottleneck = DoubleConv(nb_filter[4],
                                      nb_filter[4] * 2)

        # Decoder
        self.upconv4 = nn.ConvTranspose2d(
            nb_filter[4] * 2, nb_filter[4],
            kernel_size=2, stride=2)
        self.conv4_1 = DoubleConv(
            nb_filter[4] * 2, nb_filter[4])

        self.upconv3 = nn.ConvTranspose2d(
            nb_filter[4], nb_filter[3],
            kernel_size=2, stride=2)
        self.conv3_2 = DoubleConv(
            nb_filter[3] * 2, nb_filter[3])

        self.upconv2 = nn.ConvTranspose2d(
            nb_filter[3], nb_filter[2],
            kernel_size=2, stride=2)
        self.conv2_3 = DoubleConv(
            nb_filter[2] * 2, nb_filter[2])

        self.upconv1 = nn.ConvTranspose2d(
            nb_filter[2], nb_filter[1],
            kernel_size=2, stride=2)
        self.conv1_4 = DoubleConv(
            nb_filter[1] * 2, nb_filter[1])

        self.upconv0 = nn.ConvTranspose2d(
            nb_filter[1], nb_filter[0],
            kernel_size=2, stride=2)
        self.conv0_5 = DoubleConv(
            nb_filter[0] * 2, nb_filter[0])

        self.final = nn.Conv2d(
            nb_filter[0], num_classes, kernel_size=1)

    def forward(self, x):
        x0_0 = self.conv0_0(x)
        x1_0 = self.conv1_0(self.pool(x0_0))
        x2_0 = self.conv2_0(self.pool(x1_0))
        x3_0 = self.conv3_0(self.pool(x2_0))
        x4_0 = self.conv4_0(self.pool(x3_0))
        x5_0 = self.bottleneck(self.pool(x4_0))

        x4_1 = self.conv4_1(
            torch.cat([self.upconv4(x5_0), x4_0], dim=1))
        x3_2 = self.conv3_2(
            torch.cat([self.upconv3(x4_1), x3_0], dim=1))
        x2_3 = self.conv2_3(
            torch.cat([self.upconv2(x3_2), x2_0], dim=1))
        x1_4 = self.conv1_4(
            torch.cat([self.upconv1(x2_3), x1_0], dim=1))
        x0_5 = self.conv0_5(
            torch.cat([self.upconv0(x1_4), x0_0], dim=1))

        return torch.sigmoid(self.final(x0_5))

Resources

Loss Functions.


class DiceLoss(nn.Module):
    def __init__(self, eps=1.0):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        num = 2 * (pred * target).sum() + self.eps
        den = pred.sum() + target.sum() + self.eps
        return 1 - num / den

bce_loss = nn.BCELoss()
dice_loss = DiceLoss()

def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)

Resources

Metric Utilities.


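# pred and target are binary {0, 1} tensors or arrays of the same shape
tp = (pred * target).sum()
fp = (pred * (1 - target)).sum()
fn = ((1 - pred) * target).sum()
tn = ((1 - pred) * (1 - target)).sum()

iou = tp / (tp + fp + fn + 1e-6)
dice = 2 * tp / (2 * tp + fp + fn + 1e-6)
sensitivity = tp / (tp + fn + 1e-6)
specificity = tn / (tn + fp + 1e-6)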

Resources

Foreground Diagnostics.


mask = np.array(mask)

foreground_ratio = (mask > 127).sum() / mask.size

print(foreground_ratio)

Resources

Visualization Utilities.


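# save_prediction_sample is defined in the article with the signature
# (model, loader, epoch, save_dir, device). It fetches a batch and runs the
# model itself. The save_dir value below is a placeholder path.
save_prediction_sample(
    model,
    loader,
    epoch,
    save_dir='predictions',
    device=device
)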

Resources

Experiment Configuration.


epochs: 100
batch_size: 8
optimizer: Adam
learning_rate: 1e-4
image_size: 256
dataset: CVC-ClinicDB

Practical Guide

How to Train an Object Detection Model
on a Custom Dataset

By Dr. Ali Khan | Occulins

Most object detection tutorials follow the same pattern.

Install the framework. Download a pre-prepared dataset. Run the training command. Look at the predictions. Done.

That pattern works perfectly for the tutorial. It almost never works for your actual dataset.

The gap between running a tutorial successfully and training a model on your own data is where most people get stuck, and it is not because they are missing a command or a library. It is because tutorials teach you the steps, not the decisions. And in object detection, the decisions are what determine whether your model learns anything useful.

This post covers those decisions. We will use YOLOv12 trained on the VisDrone dataset as the running example throughout. VisDrone is an aerial drone detection dataset with real challenges that make the decisions matter, which is exactly what we need to learn from.

Why VisDrone and Why YOLOv12

VisDrone is a drone-captured dataset for detecting pedestrians, cars, vans, trucks, bicycles, and other objects in aerial imagery. It has roughly 6,500 training images and 548 validation images, with objects that are small, densely packed, and photographed from varying altitudes.

Figure 1 — Three sample VisDrone validation images with ground truth annotations. Left: sparse parking lot scene with cars and pedestrians at moderate altitude. Centre: dense night-time street scene with motors, tricycles, and pedestrians. Right: high-altitude view with hundreds of densely packed objects across all ten classes.

It is not a beginner dataset in the sense that the problem is easy; it is a beginner dataset in the sense that it is publicly available, well-structured, and reflects the kinds of real detection challenges you will face in any applied project.

YOLOv12 is one of the newest generations in the YOLO family, introducing architectural changes aimed at improving detection performance while maintaining real-time inference capability. It introduces an attention-centric architecture that improves detection accuracy while maintaining real-time inference speed. This blog uses YOLOv12 while our detection book series covers YOLOv11. The workflow and dataset preparation process remain highly similar across both versions, making the core ideas transferable.

For custom training, the nano variant, yolo12n, is the right starting point. It trains fastest, uses the least memory, and gives you a quick feedback loop on whether your configuration and data are set up correctly before committing to a larger model.

What YOLO Actually Needs From Your Dataset

Before writing a single line of training code, your dataset needs to be in the exact format YOLO expects. Getting this wrong produces errors that look like model failures but are actually data failures.

The Directory Structure

YOLO expects images and labels in parallel directories, with training and validation splits clearly separated:

visdrone/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/

The image filename and its label filename must match exactly: frame_0001.jpg must have a corresponding frame_0001.txt in the labels directory. A missing label file does not raise an error. Ultralytics treats an image with no matching label file as a background image containing no objects, which means you can appear to be training on your full dataset while the model actually sees annotations for only a fraction of it.
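Because a missing label file never stops training, it is worth listing the mismatches explicitly before the first run. The following is a minimal sketch, not part of any framework: the function name find_unlabelled, the extension set, and the example paths are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumption: extend this set if your dataset uses other image formats
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def find_unlabelled(image_dir, label_dir):
    """Return sorted image stems that have no matching .txt label file."""
    image_stems = {p.stem for p in Path(image_dir).iterdir()
                   if p.suffix.lower() in IMAGE_EXTS}
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}
    return sorted(image_stems - label_stems)


if __name__ == "__main__":
    img_dir = Path("visdrone/images/train")
    lbl_dir = Path("visdrone/labels/train")
    # Only run the report when the example directories actually exist
    if img_dir.is_dir() and lbl_dir.is_dir():
        missing = find_unlabelled(img_dir, lbl_dir)
        print(f"Images without labels: {len(missing)}")
        for stem in missing[:20]:
            print("  ", stem)
```

A non-zero count is not automatically a bug, since genuinely empty frames need no label file, but every entry should be one you expected.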

The Label Format

Each label file contains one line per object in the image. The format is:

class_id  x_center  y_center  width  height

All values except class_id are normalised to the range [0, 1] relative to image dimensions. A bounding box that starts at pixel (100, 50) and has width 200, height 80 in a 640×480 image becomes:

0  0.3125  0.1875  0.3125  0.1667
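The arithmetic behind that line can be sketched in a few lines of Python. The helper name to_yolo is hypothetical, and it assumes the source box is given as a top-left corner plus width and height in pixels, as in the example above.

```python
def to_yolo(x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert a top-left pixel box to normalised YOLO (xc, yc, w, h)."""
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return x_center, y_center, box_w / img_w, box_h / img_h


# The worked example from the text: box at (100, 50), 200x80, in a 640x480 image
xc, yc, w, h = to_yolo(100, 50, 200, 80, 640, 480)
print(f"0  {xc:.4f}  {yc:.4f}  {w:.4f}  {h:.4f}")
# prints: 0  0.3125  0.1875  0.3125  0.1667
```

The most common conversion bug is writing the top-left corner where the centre belongs, which shifts every box up and to the left by half its size.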

If your annotations are in COCO JSON format, Pascal VOC XML, or any other format, you need to convert them before training. Do not skip this verification step:

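import os

label_dir = 'visdrone/labels/train'
error_count = 0

for fname in sorted(os.listdir(label_dir)):
    if not fname.endswith('.txt'):
        continue  # ignore caches and other non-label files
    with open(os.path.join(label_dir, fname)) as f:
        for line_num, line in enumerate(f, 1):
            parts = line.split()
            if not parts:
                continue  # blank lines are harmless
            if len(parts) != 5:
                print(f"Bad line in {fname}:{line_num}"
                      f" → {line.strip()}")
                error_count += 1
                continue
            try:
                vals = [float(p) for p in parts[1:]]
            except ValueError:
                print(f"Non-numeric value in {fname}:{line_num}")
                error_count += 1
                continue
            # Coordinates must be normalised to [0, 1]
            if not all(0.0 <= v <= 1.0 for v in vals):
                print(f"Out-of-range in {fname}:{line_num}")
                error_count += 1

print(f"Checked. Errors found: {error_count}")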

The Dataset YAML File

YOLO reads dataset configuration from a YAML file. For VisDrone:

# visdrone.yaml
path: /path/to/visdrone
train: images/train
val: images/val

nc: 10
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor

The class indices in your label files must match the order in this YAML exactly. A mismatch here, even by one class, will not produce an error during training. It will produce a model that quietly labels everything wrong.
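Since a class mismatch produces no error, a quick histogram of class IDs is a cheap safeguard. This is a sketch under stated assumptions: the helper names are hypothetical and the four sample lines are made up for illustration, not taken from VisDrone.

```python
from collections import Counter

NUM_CLASSES = 10  # matches nc in visdrone.yaml


def class_histogram(label_lines):
    """Count class IDs across YOLO label lines (class_id is the first field)."""
    counts = Counter()
    for line in label_lines:
        parts = line.split()
        if parts:
            counts[int(parts[0])] += 1
    return counts


def out_of_range(counts, num_classes=NUM_CLASSES):
    """Class IDs that the YAML has no name for."""
    return sorted(c for c in counts if not 0 <= c < num_classes)


# Made-up sample: the last line uses class 10, which does not exist for nc=10
sample = [
    "3 0.50 0.50 0.10 0.08",
    "3 0.20 0.30 0.05 0.04",
    "0 0.70 0.60 0.02 0.05",
    "10 0.40 0.40 0.10 0.10",
]
counts = class_histogram(sample)
print(dict(counts))          # {3: 2, 0: 1, 10: 1}
print(out_of_range(counts))  # [10]
```

The range check only catches IDs outside the YAML. An off-by-one shift that stays in range shows up in the histogram instead: if the class you expect to dominate is not the most frequent ID, the mapping deserves a second look.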

Detection Book 1 covers the complete annotation workflow in depth, from raw images through to verified, training-ready labels, including common format conversion pitfalls for each major annotation tool. Explore Detection Books

The Four Decisions That Determine Success Before Training Starts

These four decisions collectively matter more than any hyperparameter tuning you do after training begins. Most tutorials present them as fixed values. They are not: each depends on your specific dataset.

Decision 1 — Model Scale

YOLOv12 comes in five scales: nano (n), small (s), medium (m), large (l), and extra-large (x). Start with nano. Not because nano is the best model, but because it gives you the fastest feedback loop. Train for 20 epochs with nano. If the model is learning (mAP increasing, losses decreasing), you have confirmed your dataset and configuration are correct. Then scale up if needed.

Decision 2 — Image Size

The Ultralytics YOLO framework uses 640×640 as its standard default image size, and this is what we use here. It is the right starting point for most datasets: well-tested, memory-efficient, and fast to train. On VisDrone at 640 resolution, large and medium objects are detected reliably. Very small objects at high altitude become challenging, which is a characteristic of the dataset rather than a failure of the image size setting.

Decision 3 — Number of Epochs

The right number of epochs is not a fixed number. It is whenever the validation mAP stops improving. For VisDrone with a nano model at 640 resolution, this typically happens somewhere between 80 and 150 epochs. Use early stopping with the patience parameter. Set patience=20 and let the model stop itself once 20 consecutive epochs pass without improvement. In our run, the model converged at around 115 epochs.
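The idea behind patience is simple enough to sketch in plain Python. This is an illustration of the general technique only, not the Ultralytics implementation, which tracks its own fitness metric; the function name and the validation curve are made up.

```python
def early_stop_epoch(val_map, patience=20):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without a new best.
    """
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_map, start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None


# Made-up curve: improves until epoch 5, then stays flat
curve = [0.10, 0.15, 0.20, 0.25, 0.30] + [0.30] * 10
print(early_stop_epoch(curve, patience=3))  # 8
```

A small patience value stops on noise, while a large one wastes compute. With a noisy validation curve such as VisDrone's, a value around 20 is a reasonable compromise.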

Decision 4 — Pretrained Weights

Always start from COCO pretrained weights, not random initialization. The pretrained weights give the model basic visual feature detectors from the start (edges, textures, shapes) that would otherwise take tens of epochs to learn from your custom data alone. The performance difference between pretrained and random initialization is typically 5 to 15 mAP points on a custom dataset of this size.

The Training Command and What It Actually Does

from ultralytics import YOLO

model = YOLO('yolo12n.pt')

results = model.train(
    data='visdrone.yaml',
    epochs=150,
    imgsz=640,
    batch=8,
    patience=20,
    device=0,
    project='visdrone_runs',
    name='yolo12n_baseline'
)

Reading the Training Output

Every epoch produces a line that looks like this:

Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
    1/150     5.37G      1.946      2.388      1.032        895        640: 100% ━━━━━━━━━━━━ 809/809 9.9it/s 1:22
              Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100% ━━━━━━━━━━━━ 35/35 17.1it/s 2.0s
                all        548      38759      0.393      0.177      0.123     0.0641

Three numbers tell you whether training is going correctly:

box_loss — how accurately the predicted bounding boxes align with ground truth. Should decrease steadily over the first 30 to 50 epochs.
cls_loss — how accurately the model classifies detected objects. Should also decrease, though typically more slowly than box_loss for datasets with many classes like VisDrone.
dfl_loss — distribution focal loss, related to bounding box precision. Should decrease gradually throughout training.

If any of these three values is not decreasing after 20 epochs, something is wrong. The most common causes are a learning rate that is too high, label errors, or image-label filename mismatches.

Figure 2 — Training curves for YOLOv12n on VisDrone at 640 resolution over 115 epochs. Box loss and class loss decrease steadily, with validation closely tracking training and no sign of overfitting. Validation mAP50 reaches 0.32 and mAP50-95 reaches 0.19 at convergence.

In our training run, box_loss dropped from 1.946 at epoch 1 to 1.41 at convergence. Class loss fell from 2.388 to 0.95. The model converged at epoch 115, reaching a validation mAP50 of 0.32 and mAP50-95 of 0.19.

One Mistake Per Stage That Will Kill Your Results

At the Annotation Stage

Mistake: Inconsistent class definitions. For example, a "person" is annotated as the standing figure alone in one image and as the figure plus a bicycle in another. The model learns inconsistent boundaries between classes and never converges on clean class separation. Define your classes precisely before annotating, not after.

At the Dataset Preparation Stage

Mistake: Random train/validation split from a video or sequence source. VisDrone images come from drone flight sequences. Consecutive frames look almost identical. If you split randomly, the same visual content appears in both training and validation sets, producing inflated validation mAP numbers that collapse completely when you test on genuinely unseen footage. Split at the sequence level.
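A sequence-level split can be sketched as follows. The filename convention is an assumption made for illustration: the text before the first underscore is treated as the sequence identifier, so adapt sequence_id to whatever your own naming scheme encodes.

```python
import random
from collections import defaultdict


def sequence_id(filename):
    # Assumption: the text before the first underscore identifies the sequence
    return filename.split("_")[0]


def split_by_sequence(filenames, val_fraction=0.2, seed=0):
    """Assign whole sequences to train or val so no sequence spans both."""
    groups = defaultdict(list)
    for name in filenames:
        groups[sequence_id(name)].append(name)

    seq_ids = sorted(groups)
    random.Random(seed).shuffle(seq_ids)
    n_val = max(1, round(len(seq_ids) * val_fraction))
    val_ids = set(seq_ids[:n_val])

    train = [f for s in seq_ids if s not in val_ids for f in groups[s]]
    val = [f for s in seq_ids if s in val_ids for f in groups[s]]
    return train, val


# Hypothetical filenames: five sequences with three frames each
files = [f"seq{s:02d}_{i:04d}.jpg" for s in range(5) for i in range(3)]
train, val = split_by_sequence(files, val_fraction=0.2)
print(len(train), len(val))  # 12 3
```

Shuffling sequence IDs rather than frames is the whole point: near-duplicate frames always land on the same side of the split.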

At the Training Stage

Mistake: Evaluating on the training set. The Ultralytics framework saves a best.pt checkpoint based on validation mAP. Some users accidentally point their evaluation script at the training data rather than the validation data. Always verify which split your evaluation is running on before reporting any number.

At the Evaluation Stage

Mistake: Reporting only mAP50 and ignoring mAP50-95. mAP50 measures detection at a single IoU threshold of 0.5. mAP50-95 averages across IoU thresholds from 0.5 to 0.95 in steps of 0.05, a far stricter measure. In our VisDrone run, mAP50 reached 0.32 while mAP50-95 reached only 0.19. Always report both.
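To see why the two metrics diverge so sharply on small objects, it helps to compute IoU by hand. The boxes below are made up for illustration: a 20×20 ground-truth box and a prediction shifted by three pixels in each direction.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95
thresholds = [0.50 + 0.05 * i for i in range(10)]

ground_truth = (0, 0, 20, 20)
prediction = (3, 3, 23, 23)

score = iou(ground_truth, prediction)
passed = [t for t in thresholds if score >= t]
print(f"IoU = {score:.3f}, counts as a match at {len(passed)} of 10 thresholds")
# prints: IoU = 0.566, counts as a match at 2 of 10 thresholds
```

A three-pixel error on a 20-pixel object still counts as a hit for mAP50 but fails eight of the ten thresholds that make up mAP50-95.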

What Good Predictions Look Like — And What Failure Looks Like

Figure 3 — YOLOv12n predictions on two VisDrone validation images. Left: a successful case, with cars detected at high confidence (0.81–0.88), a van at 0.91, and a small pedestrian correctly identified at 0.45. Right: a class confusion failure, where vehicles are correctly located but misclassified as bus (0.32, 0.76) and truck (0.35).

from ultralytics import YOLO
import glob

model = YOLO(
    'visdrone_runs/yolo12n_baseline/weights/best.pt'
)

# Sort so the same ten images are picked on every run
val_images = sorted(glob.glob(
    'visdrone/images/val/*.jpg'))[:10]

# Passing the whole list in one call keeps the
# saved outputs together in a single run directory
results = model.predict(
    val_images,
    conf=0.25,
    iou=0.45,
    save=True,
    project='predictions',
    name='val_sample'
)

When you look at the prediction images, you are checking for three things:

Are the boxes finding the right objects? Correct class labels on the majority of clearly visible objects means the model has learned the classes.
Is class confusion occurring? On VisDrone specifically, visually similar classes (car, van, truck, bus) are commonly confused, especially when objects are small or partially visible.
Are there obvious false positives? Background regions being detected as objects suggests the confidence threshold is too low, or the model has seen insufficient negative examples during training.

A model that produces reasonable-looking predictions on a handful of validation images is not the same as a model that generalises. The mAP on your held-out test set is the only number that tells you whether training succeeded.

Where Most People Stop — And Why That Is the Problem

The steps above will get most people to a working baseline model. For many applications, that is enough.

But a baseline model on VisDrone is not a deployable model. A nano model at 640 resolution reaching mAP50 of 0.32 is a respectable starting point: it confirms your pipeline works and your data is correctly formatted. The harder work is understanding which objects are still missed or misclassified, and which of those failures are addressable through better training strategy versus which are fundamental limitations of the model scale. Note that mAP50 is precision averaged over recall levels and classes, so a value of 0.32 does not translate directly into a percentage of objects found.

VisDrone has characteristics that a standard training run does not fully address: severe small-object density, large variations in altitude and scale, class imbalance between common classes like cars and rare ones like tricycles, and class confusion between visually similar vehicle types. That is exactly where the baseline ends and the real work begins.

Even after improving accuracy, deployment constraints still matter. Parameter count, latency, inference throughput, memory usage, and hardware limitations ultimately determine whether a detector remains useful outside controlled experimentation.

Continue Beyond the Baseline

This article focuses on building a reliable baseline detection pipeline. The next challenge is improving robustness, handling domain-specific failures, optimizing deployment constraints, and designing stronger experiments for real-world datasets.

📖 The Occulins detection book track expands these topics through structured workflows, deployment-oriented experimentation, and practical case studies.

Explore Detection Books

Selected implementations, supporting utilities, dataset templates, and companion resources related to this article are available through the Blog 2 Companion Resources page.

Companion Resources Included

Dataset directory structure reference
Label format verification utility
Dataset YAML configuration example
Baseline training workflow
Inference and prediction utilities
Evaluation checklist

Quick Reference

Stage	Key Decision	Common Mistake
Dataset preparation	Verify label format and filename matching	Silently missing label files
Split strategy	Split at sequence level, not image level	Random split inflates validation mAP
Model scale	Start with nano, scale up after confirming setup	Training large model on misconfigured data
Image size	Use the 640 default — understand what it gives you	Changing image size before confirming data is correct
Pretrained weights	Always use COCO pretrained initialization	Training from scratch loses 5–15 mAP
Training monitoring	Watch box_loss, cls_loss, dfl_loss per epoch	Waiting until end to check predictions
Evaluation	Report both mAP50 and mAP50-95	mAP50 alone hides localisation error (0.32 vs 0.19 in our run)

Working on a detection project and running into challenges beyond what this post covers?
Feel free to reach out through occulins.com/contact

Tags: Object Detection YOLOv12 Custom Dataset VisDrone Ultralytics Deep Learning Aerial Detection mAP

Metrics and sanity checks for validating training success.

Evaluation


Resource Notes

Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.



dataset/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/


import os

label_dir = 'labels/train'

for fname in os.listdir(label_dir):
    with open(os.path.join(label_dir, fname)) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                print(fname)


path: dataset/
train: images/train
val: images/val

nc: 10
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor


from ultralytics import YOLO

model = YOLO('yolo12n.pt')

model.train(
    data='visdrone.yaml',
    epochs=150,
    imgsz=640,
    batch=8,
    patience=20
)


results = model.predict(
    img_path,
    conf=0.25,
    iou=0.45,
    save=True
)


Check:

✓ box_loss decreasing
✓ cls_loss decreasing
✓ dfl_loss decreasing
✓ mAP50 improving
✓ mAP50-95 improving
✓ prediction quality

Architecture Explained

U-Net Explained Clearly
With a Practical Training Example

By Dr. Ali Khan | Occulins

U-Net is the most widely used architecture in medical image segmentation. If you have worked in this space for more than a week, you have encountered it.

But most explanations of U-Net either go straight to diagrams without explaining why the architecture is shaped the way it is, or they go so deep into the mathematics that the core idea gets buried.

This post takes a different approach. Before showing you the architecture, it explains the problem that forced the architecture into existence. Once you understand the problem, the design makes complete sense, and you will remember it in a way that a diagram alone cannot achieve.

We will use CVC-ClinicDB polyp segmentation as the practical example throughout. It is the same dataset and model used in Blog 1 of this series. The training results shown here come from the U-Net trained from scratch with BCE + Dice loss in that post. If you have not read Blog 1, you do not need to, but if your model is predicting only background, that post addresses it directly.

The Problem U-Net Was Designed to Solve

Before U-Net existed, the standard approach to segmentation was to take a classification network, something like VGG or AlexNet, and adapt it to produce pixel-level output instead of a single class label.

This sounds reasonable. Classification networks are good at recognising what is in an image. Surely that knowledge can be extended to recognise what is in each pixel.

The problem is what happens to spatial information inside a classification network.

A classification network progressively reduces the spatial dimensions of its feature maps through pooling and strided convolutions. A 256×256 input passes through five downsampling stages and arrives at the bottleneck as an 8×8 feature map. The network then applies a global pooling operation and produces a single class prediction.

At that 8×8 stage, each position in the feature map corresponds to a 32×32 region of the original image. The model knows that something is present somewhere in that region. It does not know where exactly within the region.

For classification, that is fine. For segmentation, where you need to know the exact boundary of a polyp at pixel level, that spatial uncertainty is fatal.

Downsampling builds understanding. It destroys location. Segmentation needs both. That tension is the problem U-Net solves.

The Encoder: Building Understanding by Destroying Location

Figure 1 — The encoder pathway. A 256×256 input passes through five downsampling stages, arriving at the bottleneck as an 8×8 feature map with 1024 channels. Spatial resolution shrinks at each stage while channel depth grows: the network trades location precision for semantic understanding.

The encoder is the left half of U-Net. It applies two convolutional layers at each stage, followed by a max-pooling operation that halves the spatial dimensions before the next stage.

For a 256×256 input image, the five-stage encoder used in our CVC-ClinicDB model produces feature maps at the following resolutions:

Encoder Stage	Feature Map Size	Channels	Theoretical Receptive Field
Input	256 × 256	3	1 pixel
Stage 1 output	256 × 256	32	5 × 5 region
Stage 2 output	128 × 128	64	14 × 14 region
Stage 3 output	64 × 64	128	32 × 32 region
Stage 4 output	32 × 32	256	68 × 68 region
Stage 5 output	16 × 16	512	140 × 140 region
Bottleneck	8 × 8	1024	284 × 284 region (wider than the input)

Notice two things happening simultaneously. The spatial dimensions shrink, from 256×256 to 8×8, while the channel count grows, from 3 to 1024. The network is trading spatial resolution for representational depth. At the bottleneck, each of the 8×8 positions carries a rich 1024-dimensional description of a large region of the original image. It knows what is there. It has lost the fine-grained where.
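For readers who want to check the sizes and receptive fields themselves, here is a short sketch using the standard receptive-field recurrence. It assumes each stage is two 3×3 convolutions with padding 1 and that stages are separated by 2×2 max-pooling with stride 2, matching the model code later in this post. The function name trace_encoder is hypothetical, and the value it reports is the theoretical receptive field, which is an upper bound on what the network effectively uses.

```python
def trace_encoder(input_size=256, widths=(32, 64, 128, 256, 512, 1024)):
    """Track feature-map size, channels, and theoretical receptive field."""
    size, rf, jump = input_size, 1, 1  # jump = input pixels per feature-map step
    rows = []
    for stage, channels in enumerate(widths):
        if stage > 0:              # 2x2 max-pooling between stages
            rf += (2 - 1) * jump
            jump *= 2
            size //= 2
        for _ in range(2):         # two 3x3 convolutions per stage
            rf += (3 - 1) * jump
        rows.append((size, channels, rf))
    return rows


for size, channels, rf in trace_encoder():
    print(f"{size:>3} x {size:<3}  channels={channels:<5} receptive field={rf}")
```

The last row is the bottleneck: an 8×8 map whose theoretical receptive field already exceeds the 256×256 input, which is why each bottleneck position can draw on context from the whole image.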

This is by design, not by accident. The large receptive field at the bottleneck is what allows the network to understand global context: whether the overall image looks like it contains a large central polyp, or scattered small ones, or nothing unusual at all.

The Decoder: Trying to Recover What Was Lost

The decoder is the right half of U-Net. Its job is to take the bottleneck feature map, rich in semantic content but poor in spatial detail, and progressively restore spatial resolution until the output matches the original image dimensions.

It does this through transposed convolution operations that reverse the encoder's pooling. At each stage, the feature map is spatially enlarged by a factor of two, and a pair of convolutional layers refines the upsampled features.

But here is the fundamental problem with a decoder operating alone.

Upsampling is not the inverse of downsampling. When max-pooling reduces a region to a single value, it retains the maximum and discards everything else. No upsampling operation can recover what was discarded. The decoder can produce a spatially large output, but that output will be blurry and imprecise at boundaries because the precise boundary information was lost during encoding and cannot be reconstructed from the bottleneck alone.
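A toy example makes the loss concrete. The sketch below uses nearest-neighbour upsampling as a simple stand-in; the U-Net decoder uses learned transposed convolutions, but no upsampling method can recover values that pooling has already discarded. The 4×4 mask and both helper names are made up for illustration.

```python
def max_pool_2x2(grid):
    """2x2 max-pooling with stride 2 on a list-of-lists grid."""
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]


def upsample_nearest_2x(grid):
    """Nearest-neighbour upsampling by a factor of two."""
    return [[grid[r // 2][c // 2] for c in range(2 * len(grid[0]))]
            for r in range(2 * len(grid))]


# A 4x4 mask with a diagonal edge
mask = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
restored = upsample_nearest_2x(max_pool_2x2(mask))
for row in restored:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 1]
# [1, 1, 1, 1]
```

The output has the right size and roughly the right shape, but the diagonal edge has become a staircase. Two of the sixteen pixels are now wrong, and both sit on the boundary.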

Analogy: Imagine taking a high-resolution photograph, shrinking it to thumbnail size, and then enlarging it back to the original dimensions. The enlarged image has the right overall composition: you can see where the subject is and roughly what shape it has. But the fine detail (sharp edges, precise boundaries) is gone. No enlargement algorithm can invent detail that was not preserved.

This is exactly the problem that forced the skip connection into existence.

Skip Connections: The Actual Innovation

Figure 2 — Skip connections in U-Net. Each encoder stage passes its feature map directly to the corresponding decoder stage at the same spatial resolution. Encoder 1 (128×128) connects to Decoder 5 (128×128), Encoder 2 (64×64) to Decoder 4 (64×64), and so on. The bottleneck feeds into Decoder 1 (8×8), the first decoder stage. Decoder stages then upsample progressively toward the final output.

The skip connection is U-Net's defining contribution. Rather than requiring the decoder to reconstruct fine spatial detail from the bottleneck alone, it gives the decoder direct access to the encoder's feature maps at each spatial scale, before those maps were downsampled.

At each decoder stage, two sources of information are concatenated:

The upsampled feature map from the previous decoder stage: semantically rich, spatially coarse
The encoder feature map from the corresponding spatial scale: spatially precise, semantically shallow

The convolutional layers that follow the concatenation learn to integrate these two sources. The semantic information from the decoder path tells the network what the region is. The spatial information from the encoder path tells it exactly where the boundary is.

This is why U-Net produces sharp, precise segmentation boundaries when a decoder-only architecture produces blurry ones. The boundary precision does not come from clever upsampling. It comes from having direct access to the original encoder features that contained that precision before it was lost to downsampling.

Why Concatenation and Not Addition

Skip connections in U-Net use concatenation: the encoder and decoder feature maps are stacked along the channel dimension, doubling the channel count before the next convolution. ResNets use addition instead. The choice matters.

Addition requires the two tensors to have the same meaning for the operation to make sense, because you are combining them into a single representation. Concatenation preserves both representations independently and lets the following convolution learn how to use each one. For segmentation, where the encoder and decoder features carry fundamentally different types of information (spatial precision versus semantic depth), concatenation is the right choice.
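The channel bookkeeping this implies can be tabulated without running a network. The sketch below assumes the widths used in this post, [32, 64, 128, 256, 512] with a 1024-channel bottleneck. The function name is hypothetical, and the last value in each row is only a reminder of what element-wise addition would have produced.

```python
def decoder_channels(widths=(32, 64, 128, 256, 512), bottleneck=1024):
    """Channel counts at each decoder stage, deepest stage first."""
    stages = []
    incoming = bottleneck
    for skip in reversed(widths):
        upsampled = skip                 # transposed conv: incoming -> skip channels
        concatenated = upsampled + skip  # concatenation stacks both along channels
        added = skip                     # element-wise addition would keep `skip` channels
        stages.append((incoming, upsampled, skip, concatenated, added))
        incoming = skip                  # the double conv reduces back to `skip`
    return stages


for incoming, up, skip, cat, add in decoder_channels():
    print(f"in {incoming:>4} -> up {up:>3} | skip {skip:>3} | "
          f"concat {cat:>4} | add would give {add:>3}")
```

The doubled input width is the price of concatenation: the first convolution of every decoder block sees twice as many channels as it would with addition, and it gets to weight encoder and decoder features separately.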

The Complete Architecture

Figure 3 — Complete U-Net architecture as used in the CVC-ClinicDB polyp segmentation experiment. Five encoder stages compress a 256×256 input down to 8×8 at the bottleneck. Five decoder stages restore spatial resolution back to 256×256. Dashed amber arrows show the five skip connections transferring encoder feature maps directly to matching decoder stages. The 1×1 conv with sigmoid at the top of the decoder produces the final binary segmentation mask at full input resolution.

The complete U-Net has a symmetric structure: five encoder stages on the left, a bottleneck at the bottom, five decoder stages on the right, with skip connections bridging each encoder-decoder pair at the same spatial resolution.

The U shape is not an accident of diagram layout. It is a direct consequence of the architecture's function: compress spatial information as you go down, expand it as you go up, and maintain direct connections between the corresponding levels on each side.

📖 Segmentation Book 1 covers the U-Net architecture in full mathematical detail, including the precise equations for each operation, the role of batch normalization, and the design rationale behind the channel progression at each stage. Explore Segmentation Books

Training U-Net on CVC-ClinicDB — What It Actually Looks Like

The model used throughout this post is a U-Net trained from scratch on the CVC-ClinicDB colonoscopy polyp dataset. The channel widths follow the pattern [32, 64, 128, 256, 512] across the five encoder stages, with a 1024-channel bottleneck. The loss function is BCE + Dice combined, which was shown in Blog 1 to produce reliable foreground detection from the first epochs.

import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels,
                      kernel_size=3, padding=1,
                      bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=3, padding=1,
                      bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class UNET(nn.Module):
    def __init__(self, num_classes=1,
                 input_channels=3, **kwargs):
        super().__init__()
        nb_filter = [32, 64, 128, 256, 512]
        self.pool = nn.MaxPool2d(2, 2)

        # Encoder
        self.conv0_0 = DoubleConv(input_channels, nb_filter[0])
        self.conv1_0 = DoubleConv(nb_filter[0],   nb_filter[1])
        self.conv2_0 = DoubleConv(nb_filter[1],   nb_filter[2])
        self.conv3_0 = DoubleConv(nb_filter[2],   nb_filter[3])
        self.conv4_0 = DoubleConv(nb_filter[3],   nb_filter[4])

        # Bottleneck
        self.bottleneck = DoubleConv(nb_filter[4], nb_filter[4] * 2)

        # Decoder
        self.upconv4 = nn.ConvTranspose2d(nb_filter[4] * 2, nb_filter[4], 2, 2)
        self.conv4_1 = DoubleConv(nb_filter[4] * 2, nb_filter[4])

        self.upconv3 = nn.ConvTranspose2d(nb_filter[4], nb_filter[3], 2, 2)
        self.conv3_2 = DoubleConv(nb_filter[3] * 2, nb_filter[3])

        self.upconv2 = nn.ConvTranspose2d(nb_filter[3], nb_filter[2], 2, 2)
        self.conv2_3 = DoubleConv(nb_filter[2] * 2, nb_filter[2])

        self.upconv1 = nn.ConvTranspose2d(nb_filter[2], nb_filter[1], 2, 2)
        self.conv1_4 = DoubleConv(nb_filter[1] * 2, nb_filter[1])

        self.upconv0 = nn.ConvTranspose2d(nb_filter[1], nb_filter[0], 2, 2)
        self.conv0_5 = DoubleConv(nb_filter[0] * 2, nb_filter[0])

        self.final = nn.Conv2d(nb_filter[0], num_classes, kernel_size=1)

    def forward(self, x):
        x0_0 = self.conv0_0(x)
        x1_0 = self.conv1_0(self.pool(x0_0))
        x2_0 = self.conv2_0(self.pool(x1_0))
        x3_0 = self.conv3_0(self.pool(x2_0))
        x4_0 = self.conv4_0(self.pool(x3_0))
        x5_0 = self.bottleneck(self.pool(x4_0))

        x4_1 = self.conv4_1(
            torch.cat([self.upconv4(x5_0), x4_0], dim=1))
        x3_2 = self.conv3_2(
            torch.cat([self.upconv3(x4_1), x3_0], dim=1))
        x2_3 = self.conv2_3(
            torch.cat([self.upconv2(x3_2), x2_0], dim=1))
        x1_4 = self.conv1_4(
            torch.cat([self.upconv1(x2_3), x1_0], dim=1))
        x0_5 = self.conv0_5(
            torch.cat([self.upconv0(x1_4), x0_0], dim=1))

        return torch.sigmoid(self.final(x0_5))


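# Assumed definitions: the exact loss code from Blog 1 is not shown here.
# The model already applies sigmoid, so plain BCELoss fits
# (BCEWithLogitsLoss would apply sigmoid a second time).
bce_loss = nn.BCELoss()


def dice_loss(pred, mask, eps=1.0):
    # Soft Dice over the batch; eps avoids division by zero on empty masks
    inter = (pred * mask).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + mask.sum() + eps)


def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)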

What Predictions Look Like at Different Training Stages

Figure 4 — U-Net predictions on CVC-ClinicDB test images at epochs 1, 20, 50, and 100. Epoch 1: rough initial detection with noisy boundaries and false positives. Epoch 20: recognisable polyp shapes with improved boundaries, Dice 0.827. Epoch 50: clean segmentation, Dice 0.920. Epoch 100: precise boundaries closely matching ground truth, Dice 0.922.

The progression of predictions across epochs reflects what the model is learning in sequence. It learns the global presence of a polyp before it learns its extent, and it learns approximate extent before it learns precise boundaries. The skip connections are what enable the final stage: precise boundaries require the spatial detail that the encoder preserved, and that detail only becomes useful to the decoder after it has learned the semantic context from the bottleneck.

Training Curves

Figure 5 — Validation metrics over 100 training epochs. IoU and Dice converge steadily, reaching best values of 0.8664 and 0.9277 respectively at epoch 80. Sensitivity peaks at 0.9234 at epoch 25. Specificity reaches 0.9959 at epoch 53. All metrics plateau after epoch 60, indicating full convergence.

The training curves confirm the pattern established in Blog 1. BCE + Dice produces reliable sensitivity from the first epochs: the model finds polyps early and refines boundary precision over subsequent epochs. The IoU and Dice curves show steady improvement without collapse, which is the signature of a well-functioning loss function on an imbalanced dataset.
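For reference, the four metrics in the curves can all be derived from the same pixel-level confusion counts. This is a minimal sketch on flat binary masks, not the evaluation code used for the figures. It assumes both masks contain at least one foreground and one background pixel, and the ten-pixel example is made up.

```python
def segmentation_metrics(pred, target):
    """Dice, IoU, sensitivity, and specificity for flat binary masks (0/1)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, target))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, target))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, target))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, target))
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),   # fraction of polyp pixels found
        "specificity": tn / (tn + fp),   # fraction of background kept clean
    }


# Made-up example: 4 polyp pixels, 3 found, 1 false alarm
target = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
for name, value in segmentation_metrics(pred, target).items():
    print(f"{name}: {value:.3f}")
# dice: 0.750
# iou: 0.600
# sensitivity: 0.750
# specificity: 0.833
```

Specificity counts true negatives, so on an imbalanced dataset it stays high almost regardless of how well the polyp itself is segmented. Dice, IoU, and sensitivity ignore true negatives, which makes them the more informative curves to watch.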

What U-Net Does Not Do Well

Understanding the architecture's limitations is as important as understanding its strengths.

Very small objects. When a polyp occupies only 1 to 2% of the image area, it covers roughly one of the 64 positions in the 8×8 bottleneck. The global context is dominated by the surrounding healthy tissue. The model has very little signal to work with for objects this small. Higher input resolution and specialized loss functions help, but the fundamental constraint is architectural.

Computational cost at high resolution. The five-stage U-Net with channel widths [32, 64, 128, 256, 512] and a 1024-channel bottleneck, as implemented above, has approximately 31 million parameters, most of them in the bottleneck and the deepest decoder stage. At 512×512 input, a batch of eight images requires substantial GPU memory. Scaling to higher resolutions quickly hits hardware limits.

Multiscale feature capture at the bottleneck. The standard U-Net bottleneck applies the same double-convolution block used throughout the encoder. It has no dedicated mechanism for capturing contextual information at multiple scales simultaneously, a limitation that more recent architectures address explicitly through modules like ASPP or dilated convolution pyramids.

U-Net's skip connections solve the spatial precision problem elegantly. The limitations that remain (small object detection, computational cost, and multiscale context at the bottleneck) are the problems that drive architecture research beyond the baseline.
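The parameter count can be tallied directly from the layer shapes in the UNET class shown earlier, which is a useful sanity check before committing GPU time. The sketch below mirrors that class under its stated layout: bias-free 3×3 convolutions followed by BatchNorm, 2×2 transposed convolutions with bias, and a final 1×1 convolution. The helper names are hypothetical.

```python
def double_conv_params(c_in, c_out):
    """Two bias-free 3x3 convs, each followed by BatchNorm (weight + bias)."""
    return 9 * c_in * c_out + 9 * c_out * c_out + 2 * (2 * c_out)


def up_conv_params(c_in, c_out):
    """2x2 transposed convolution with bias."""
    return 4 * c_in * c_out + c_out


def unet_params(widths=(32, 64, 128, 256, 512), in_channels=3, num_classes=1):
    bottleneck = widths[-1] * 2
    total, prev = 0, in_channels
    for w in widths:                           # encoder stages
        total += double_conv_params(prev, w)
        prev = w
    total += double_conv_params(prev, bottleneck)
    prev = bottleneck
    for w in reversed(widths):                 # decoder stages
        total += up_conv_params(prev, w)
        total += double_conv_params(2 * w, w)  # input is upsampled + skip
        prev = w
    total += prev * num_classes + num_classes  # final 1x1 conv with bias
    return total


print(f"{unet_params():,} learnable parameters")
# prints: 31,100,513 learnable parameters
```

Nearly half of that total sits in the bottleneck block alone, so the bottleneck width is the first place to look when the model needs to shrink.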

📖 Segmentation Book 1 covers all three limitations in detail, with practical strategies for each. The following two books in the series address the architectural solutions that have emerged from research on these specific failure modes. Explore Segmentation Books

Selected implementations, supporting utilities, and companion resources related to this article are available through the Blog 3 Companion Resources page.

Companion Resources Included

U-Net architecture implementation
DoubleConv block reference
U-Net training configuration
Skip connection logic reference
Training curve plotting utility
Architecture figure notes

The One Thing to Remember

U-Net works because it gives the decoder direct access to the spatial detail that the encoder preserved before discarding it through downsampling. The skip connections are not a regularization trick or an optimization convenience. They are the solution to a specific, fundamental problem: the irrecoverable loss of spatial information during encoding.

Many modern segmentation architectures that outperform U-Net do so by addressing one of the limitations listed above while keeping the core encoder-decoder-skip-connection structure intact. Understanding why U-Net is designed the way it is makes those improvements immediately comprehensible, because you can see exactly which problem each one is solving.

Training a segmentation model and running into issues this post does not cover?
Feel free to reach out through occulins.com/contact

Tags: U-Net Semantic Segmentation Skip Connections Encoder Decoder CVC-ClinicDB Polyp Segmentation Medical Imaging Deep Learning Architecture

Minimal example showing how encoder features are fused with decoder features at matching spatial resolutions.


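# Upsample the decoder feature map to the encoder's spatial resolution.
decoder_feature = upsample(decoder_feature)

# Concatenate along the channel dimension (dim=1 for NCHW tensors).
fusion = torch.cat([encoder_feature, decoder_feature], dim=1)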

Resources

Training Curve Plotting.

Utility for visualizing validation metrics during training.


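import matplotlib.pyplot as plt

# `history` is assumed to be a dict holding one value per epoch for each metric.
epochs = range(1, len(history['val_dice']) + 1)

plt.plot(epochs, history['val_dice'], label='Dice')
plt.plot(epochs, history['val_iou'], label='IoU')

plt.xlabel('Epoch')
plt.ylabel('Metric')
plt.legend()
plt.grid(True)
plt.show()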

Resources

Figure Notes.

Design philosophy behind the explanatory architecture figures.

Architecture figures were intentionally designed to progressively explain encoder behavior, skip connections, feature hierarchy, and information flow while maintaining visual consistency across diagrams.

Visual complexity was reduced by emphasizing resolution transitions, channel growth, and feature propagation instead of implementation-level detail.

Figures prioritize conceptual understanding and visual clarity rather than framework-specific implementation details.

Return to Blog 3 Resources

Debugging & Failure Analysis

Common Mistakes in Object Detection Training
That Kill Performance

By Dr. Ali Khan | Occulins

Object detection training failures share an uncomfortable characteristic: they are often invisible until you look in the right place.

A model can train for a hundred epochs, produce a respectable mAP number, and still be fundamentally broken: predicting the wrong classes, failing on the objects that matter most, or performing well only because the validation set leaked into training. The loss curves look fine. The metrics look acceptable. The predictions on casual inspection look plausible.

After years of working on detection problems across aerial imagery, medical imaging, and industrial inspection, I have seen the same mistakes appear repeatedly, not because the people making them are careless, but because the mistakes are genuinely subtle and the feedback signals that would reveal them are easy to overlook.

This post covers four of them. Each one has a specific symptom pattern that tells you it is present, and a specific fix that resolves it. None of them require changing your architecture or your hardware. All of them require paying closer attention to things you may currently be skipping.

Wrong or Inconsistent Annotations

This is the most common root cause of poor detection performance, and the one least often investigated, because it requires looking at data rather than adjusting training parameters.

Annotation errors come in several forms, and they have different effects on training. Understanding which type you have determines what to do about it.

Tight vs Loose Bounding Boxes

Different annotators, or even the same annotator on different days, draw bounding boxes with different tightness. One person draws the box flush to the visible object edges. Another adds a few pixels of margin. A third includes part of the background context.

When a model is trained on this mixed data, it learns an inconsistent definition of where an object ends and background begins. During inference, predicted boxes are evaluated against ground truth boxes using IoU. If your ground truth boxes are inconsistently sized, your IoU measurements are measuring annotator inconsistency as much as model performance.

Figure 1 — Three annotation styles on the same object: tight (flush to visible edges), loose (2–5px margin), and inconsistent (partial background included). A training set mixing all three styles produces a model that learns no consistent boundary definition.
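
To put a rough number on what a few pixels of margin cost, the sketch below computes IoU between a tight box and a looser box drawn around the same object. The `iou` and `with_margin` helpers and the two box sizes are illustrative assumptions, not values from any dataset discussed in this post.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixels."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def with_margin(box, margin):
    """Grow a box by `margin` pixels on every side."""
    x1, y1, x2, y2 = box
    return (x1 - margin, y1 - margin, x2 + margin, y2 + margin)


# Hypothetical objects: a small 20x20 box and a large 200x200 box.
small = (100, 100, 120, 120)
large = (100, 100, 300, 300)

for name, box in [("small", small), ("large", large)]:
    loose = with_margin(box, 3)  # 3 px margin on each side
    print(f"{name}: IoU(tight, loose) = {iou(box, loose):.3f}")
# small: IoU(tight, loose) = 0.592
# large: IoU(tight, loose) = 0.943
```

With these numbers, the same 3-pixel margin costs the small box about 0.41 IoU and the large box under 0.06, which is why tightness inconsistency hurts most on small-object datasets.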

Inconsistent Class Definitions

This is harder to detect and more damaging than box tightness variation. It happens when the same visual object is labelled differently depending on context, annotator, or a class definition that was never written down precisely.

Common examples: a partially occluded car labelled as "car" in one image and ignored in another. A person on a bicycle labelled as "person" in some images and "cyclist" in others when cyclist is not a defined class. A van labelled as "car" by one annotator and "truck" by another.

The symptom in training is a class loss that decreases slowly or plateaus early, and a confusion matrix where two specific classes consistently swap with each other. If you see that pattern, look at the annotations for those two classes directly.

The Silent Problem: Missing Labels

In YOLO format, an image with no objects should have an empty label file. An image with objects but a missing label file is treated as a negative example — the model is taught that the objects in that image are background. This is one of the most damaging annotation errors because it actively teaches the model wrong information, not just inconsistent information.

Verify this before training:

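import os

img_dir   = 'dataset/images/train'
label_dir = 'dataset/labels/train'

# Only check image files, so stray non-image files are not reported.
# The extension list is an assumption; adjust it to your dataset.
image_exts = {'.jpg', '.jpeg', '.png', '.bmp'}

missing = []
for img_file in sorted(os.listdir(img_dir)):
    stem, ext = os.path.splitext(img_file)
    if ext.lower() not in image_exts:
        continue
    # Each image <stem>.<ext> should have a matching <stem>.txt label file.
    label_file = os.path.join(label_dir, stem + '.txt')
    if not os.path.exists(label_file):
        missing.append(img_file)

print(f"Images with no label file: {len(missing)}")
for f in missing[:10]:
    print(f"  {f}")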

If this returns more than zero for a dataset that should have objects in every image, the missing label files are training your model to ignore those objects.

Symptom pattern: Class loss plateaus early. Confusion matrix shows two classes swapping frequently. Model misses objects that are clearly visible in validation images. mAP is lower than expected given the apparent quality of your images.

Fix: Define class boundaries in writing before annotating, not after. Spot-check 50 random image-label pairs visually before training. Run the missing label verification script above. If you find inconsistent annotations, correct them rather than hoping the model learns through them.
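
For the visual spot-check, a minimal sketch of the two pieces you need: a reproducible random sample of image stems, and a conversion from a YOLO label line to pixel coordinates that you can draw on the image. The function names, the sample size default, and the example label line are assumptions for illustration.

```python
import random


def yolo_to_pixels(line, img_w, img_h):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels.

    YOLO lines are: class_id x_center y_center width height,
    with the four box values normalized to [0, 1].
    """
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, x1, y1, x2, y2


def sample_for_review(stems, k=50, seed=0):
    """Pick up to k image stems at random, reproducibly, for visual review."""
    rng = random.Random(seed)
    return rng.sample(sorted(stems), min(k, len(stems)))


# Hypothetical label line on a 640x640 image.
print(yolo_to_pixels("2 0.5 0.5 0.25 0.125", 640, 640))
# (2, 240.0, 280.0, 400.0, 360.0)
```

Fixing the seed means two people reviewing the same dataset look at the same images, which makes disagreements about class definitions easier to discuss.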

Augmentation That Hurts More Than It Helps

Data augmentation is universally recommended, and for good reason: it is one of the most effective tools for improving generalization on limited datasets. But augmentation is not a dial you turn up for better performance. The wrong augmentation strategy can actively damage your model, and the damage is difficult to diagnose because the training metrics often look fine while it is happening.

Augmentation That Destroys Small Objects

The most common augmentation mistake in detection is applying aggressive random cropping or zooming out on datasets where objects are small relative to the image. When you randomly crop away 40% of a 640×640 image, a pedestrian that was 20 pixels tall may fall outside the crop entirely. If the pipeline does not update the labels to match, the label file still says a pedestrian is present, and the model is trained on images where the labelled object is no longer visible.

For aerial datasets with small objects, the safe augmentation operations are flipping, rotation, and mild color jitter. Heavy cropping, large-scale mosaic augmentation with significant zoom-out, and perspective transforms that shrink objects further are all candidates for removal.
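
As a sketch of the bookkeeping a crop requires, the function below maps pixel boxes into a crop window and drops boxes that are no longer meaningfully visible. The `min_visible` and `min_side` thresholds and the example boxes are hypothetical; the point is only that labels have to follow the image through the transform.

```python
def crop_boxes(boxes, crop, min_visible=0.5, min_side=4):
    """Map pixel boxes into a crop window and drop boxes that no longer count.

    boxes: list of (x1, y1, x2, y2); crop: (cx1, cy1, cx2, cy2).
    A box is kept only if at least `min_visible` of its area is inside the
    crop and both clipped sides are at least `min_side` pixels.
    """
    cx1, cy1, cx2, cy2 = crop
    kept = []
    for x1, y1, x2, y2 in boxes:
        nx1, ny1 = max(x1, cx1), max(y1, cy1)
        nx2, ny2 = min(x2, cx2), min(y2, cy2)
        w, h = nx2 - nx1, ny2 - ny1
        if w < min_side or h < min_side:
            continue  # fully outside the crop, or clipped to a sliver
        if (w * h) / ((x2 - x1) * (y2 - y1)) < min_visible:
            continue  # mostly cut off by the crop edge
        # Shift into the crop's coordinate frame.
        kept.append((nx1 - cx1, ny1 - cy1, nx2 - cx1, ny2 - cy1))
    return kept


# Hypothetical 640x640 image cropped to its top-left 384x384 region.
boxes = [
    (500, 300, 508, 320),  # small pedestrian, entirely outside the crop
    (370, 100, 390, 140),  # straddles the crop edge, 70% still visible
    (100, 100, 130, 160),  # fully inside
]
print(crop_boxes(boxes, crop=(0, 0, 384, 384)))
# [(370, 100, 384, 140), (100, 100, 130, 160)]
```

If your augmentation pipeline does not do something equivalent, the first box would stay in the label file even though the pedestrian is gone from the image.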

Color Augmentation on Domain-Specific Imagery

Aggressive color jitter is appropriate for natural photography where color balance varies widely between cameras and lighting conditions. It is not appropriate for datasets acquired under controlled conditions, certain medical imaging formats, thermal imagery, or nighttime surveillance footage where color characteristics are fixed by the acquisition protocol.

If your deployment images will always look roughly the same in terms of color and exposure, training on aggressively color-jittered images teaches the model to be robust to variation it will never encounter, while reducing its precision on the actual color characteristics of your domain.

Mosaic Augmentation and Dense Small Objects

YOLO's mosaic augmentation combines four images into one training sample. For most datasets this improves performance by exposing the model to more objects per forward pass. For datasets with very dense small objects (aerial imagery, crowd detection, microscopy), mosaic can produce images where object density exceeds anything in the real deployment distribution. The model learns to expect densities it will never see, which affects both detection thresholds and confidence calibration.

Augmentation is not universally beneficial. Every augmentation operation should be evaluated on your specific dataset. The default augmentation settings in any framework were tuned on benchmark datasets that may not resemble yours. Check whether disabling or reducing specific augmentations improves your validation mAP, not just your training stability.

Symptom pattern: Model detects objects of normal size but consistently misses small ones. Confidence scores are poorly calibrated: very high or very low, with few in the middle range. Disabling augmentation temporarily and retraining produces higher validation mAP than with augmentation enabled.

Fix: Start with minimal augmentation: horizontal flip and mild color jitter only. Add operations one at a time and measure their effect on validation mAP. If an augmentation does not improve validation performance after 20 epochs, remove it. For small-object datasets, be especially conservative with any operation that reduces object size.

📖 Detection Book 2 covers domain-specific augmentation strategy in detail, including which operations help and which hurt for aerial, medical, and industrial detection tasks. Explore Detection Books

Poor Evaluation — The Metric You Report Is Not What You Think

This mistake does not affect how your model trains. It affects whether you know what your model can actually do. Reporting the wrong metric, or computing the right metric on the wrong data, produces numbers that look good while hiding real failures.

mAP50 vs mAP50-95: A Gap That Changes Everything

mAP50 evaluates detection at a single IoU threshold of 0.5. A predicted box counts as a correct detection when its intersection with the ground truth box, divided by their union, is at least 0.5. This is a lenient standard: a box that covers the correct general region but is significantly larger or smaller than the actual object still passes.

mAP50-95 averages detection performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05. At the 0.75 threshold, a predicted box needs an IoU of at least 0.75 with the ground truth to count. At the 0.95 threshold, the box needs to be almost pixel-perfect.

For applications where precise bounding box location matters (measuring object size, feeding detections into a tracking system, or using boxes to guide downstream processing), mAP50-95 is the metric that tells you whether your model is accurate enough. mAP50 can look 15 to 25 points higher than mAP50-95 on the same model.

Figure 2 — mAP50 vs mAP50-95 across training epochs for YOLOv12n on VisDrone. mAP50 reaches 0.325 while mAP50-95 plateaus at 0.185, a gap that reflects the model's ability to find objects in the right location but not localise them precisely. The shaded region between the two curves is what mAP50 alone hides. Always report both metrics.
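
To see why loosely localised boxes hold up under mAP50 but not under mAP50-95, it helps to count how many of the ten thresholds a single matched prediction clears. This is a deliberately simplified sketch: real mAP also integrates precision over recall across confidence scores, and the IoU values below are hypothetical.

```python
# The ten IoU thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_passed(iou):
    """Count at how many mAP50-95 thresholds a matched box counts as correct."""
    return sum(iou >= t for t in THRESHOLDS)


# Hypothetical IoU values for three predictions of the same object.
for iou in (0.52, 0.77, 0.96):
    n = thresholds_passed(iou)
    print(f"IoU {iou:.2f}: correct at {n} of {len(THRESHOLDS)} thresholds")
# IoU 0.52: correct at 1 of 10 thresholds
# IoU 0.77: correct at 6 of 10 thresholds
# IoU 0.96: correct at 10 of 10 thresholds
```

All three predictions count as correct for mAP50. Only the last one contributes at every threshold that mAP50-95 averages over.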

Evaluating on Training Data

This sounds too basic to mention. It is not. It happens more often than it should, in two forms.

The first form is accidental: the validation data path in the YAML file points to the training directory due to a typo or copy-paste error. The model evaluates on data it has already memorised. Validation mAP is inflated by 10 to 30 points depending on how long the model has trained.

The second form is subtle: the validation split was created by random sampling at the image level from a dataset where multiple images came from the same video sequence or the same scene. Frames from the same scene look nearly identical. A model that has seen other frames from the same scene during training will perform well on the validation frames, not because it generalizes, but because it has effectively memorised the scene.
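
For the second form, a minimal sketch of a scene-level split is shown below. It assumes a hypothetical file naming scheme of `<scene>_<frame>.jpg`; the `split_by_scene` and `scene_of` names are illustrative, and you would replace `scene_of` with whatever identifies a scene or video sequence in your dataset.

```python
import random
from collections import defaultdict


def split_by_scene(filenames, scene_of, val_fraction=0.2, seed=0):
    """Split files so that every scene lands entirely in train or in val."""
    scenes = defaultdict(list)
    for name in filenames:
        scenes[scene_of(name)].append(name)

    # Shuffle scenes, not images, then carve off a fraction for validation.
    scene_ids = sorted(scenes)
    random.Random(seed).shuffle(scene_ids)
    n_val = max(1, round(len(scene_ids) * val_fraction))
    val_scenes = set(scene_ids[:n_val])

    train = [f for s in scene_ids if s not in val_scenes for f in scenes[s]]
    val = [f for s in scene_ids if s in val_scenes for f in scenes[s]]
    return train, val


# Hypothetical naming scheme: "<scene>_<frame>.jpg", e.g. "sceneA_0001.jpg".
def scene_of(name):
    return name.split("_")[0]


files = [f"scene{s}_{i:04d}.jpg" for s in "ABCDE" for i in range(4)]
train, val = split_by_scene(files, scene_of)

# No scene may appear on both sides of the split.
assert not {scene_of(f) for f in train} & {scene_of(f) for f in val}
print(len(train), "train /", len(val), "val")
# 16 train / 4 val
```

The first form is simpler to catch, with a direct check on the paths in the dataset YAML: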

import yaml

with open('dataset.yaml') as f:
    cfg = yaml.safe_load(f)

print("Train path:", cfg.get('train'))
print("Val path:",   cfg.get('val'))

assert cfg['train'] != cfg['val'], \
    "Train and val paths are identical — check your YAML"

Ignoring Per-Class Performance

Mean Average Precision averages across all classes. A model that achieves 0.70 mAP on a ten-class dataset may be achieving 0.90 on five common classes and 0.50 on five rare ones. If the rare classes are the ones that matter in your application, the aggregate number is hiding a failure.

Always look at per-class AP alongside the mean. In Ultralytics, the per-class results are printed at the end of validation. Read them, not just the summary line.
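
A small helper can make the per-class check harder to skip. The sketch below assumes you have copied the per-class AP values into a dict by hand; the class names, AP values, and the 0.15 margin are all hypothetical.

```python
def flag_weak_classes(per_class_ap, margin=0.15):
    """Return the mean AP and the classes whose AP is `margin` or more below it."""
    mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
    weak = {c: ap for c, ap in per_class_ap.items() if ap <= mean_ap - margin}
    return mean_ap, weak


# Hypothetical per-class AP values, copied by hand from a validation printout.
per_class_ap = {
    "car": 0.88,
    "van": 0.81,
    "truck": 0.79,
    "bicycle": 0.42,
    "tricycle": 0.35,
}

mean_ap, weak = flag_weak_classes(per_class_ap)
print(f"mean AP = {mean_ap:.2f}")
for name, ap in sorted(weak.items(), key=lambda kv: kv[1]):
    print(f"  weak class: {name} (AP {ap:.2f})")
# mean AP = 0.65
#   weak class: tricycle (AP 0.35)
#   weak class: bicycle (AP 0.42)
```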

Symptom pattern: Validation mAP looks strong but model fails visually on many predictions. Performance in deployment is significantly worse than validation numbers suggested. Per-class AP shows large variance across classes.

Fix: Always report both mAP50 and mAP50-95. Verify your train and val paths are different before every training run. Split datasets with sequential or scene-grouped images at the scene level, not the image level. Read per-class AP after every evaluation.

📖 Detection Book 1 covers evaluation metrics in depth, including when each metric is appropriate, how to interpret per-class results, and what the numbers actually tell you about deployment readiness. Explore Detection Books

Overfitting — And Why It Is Harder to Spot Than You Think

Overfitting in detection is not always the obvious case where training loss goes to zero and validation loss spikes. In practice it is often subtler: a model that performs well on the validation set but fails on genuinely new data from a slightly different source.

The Classic Pattern

The textbook version of overfitting is easy to diagnose from the loss curves: training loss decreases steadily while validation loss plateaus or begins to rise. If you see this pattern, the model is memorising the training set rather than learning generalizable features.

Figure 3 — Left: healthy training, training and validation loss decrease together and converge. Right: overfitting, training loss continues decreasing while validation loss plateaus and begins rising. The gold dotted line marks the best checkpoint, the model should be saved here, not at the end of training.

The Hidden Pattern: Dataset Overfitting

The more dangerous form of overfitting does not show up in your loss curves at all. The model generalises well to your validation set, but your validation set is not representative of deployment conditions.

This happens when training and validation data come from the same source, same time period, same camera, or same geographic location, while deployment data comes from a different source. The model has learned the specific visual characteristics of your dataset (a particular camera's color profile, the typical lighting conditions of a specific location, the image quality of a specific acquisition protocol) rather than the underlying object appearance.

The only way to detect this is to evaluate on truly independent test data from a different source than your training data. If performance drops significantly on that data, your model has overfit to your dataset distribution.

How to Actually Fix Overfitting

The standard advice (more data, more dropout, stronger augmentation) is correct but incomplete. The more important question is why the model is overfitting, because the answer determines the right fix.

Model is too large for the dataset: switch to a smaller model variant before adding regularization. A nano model on a 500-image dataset overfits less than a large model with dropout.
Dataset lacks diversity: augmentation helps but is not a substitute for genuinely diverse data. If all your training images come from one camera at one location, no augmentation strategy will teach the model to handle a different camera at a different location.
Training too long: use early stopping based on validation mAP. Save the best checkpoint, not the final one. The final epoch is almost never the best model.
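
The early-stopping rule in the last item can be sketched as follows. This is a minimal illustration under assumed names (`best_checkpoint`, `patience`) and a hypothetical validation mAP curve, not the implementation of any particular framework.

```python
def best_checkpoint(val_map_history, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation mAP.

    Training stops once `patience` epochs have passed without beating the
    best value so far. Epochs are 1-indexed.
    """
    best_epoch, best_value = 0, float("-inf")
    stop_epoch = len(val_map_history)
    for epoch, value in enumerate(val_map_history, start=1):
        if value > best_value:
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch


# Hypothetical validation mAP curve: improves, peaks at epoch 4, then drifts down.
history = [0.10, 0.18, 0.24, 0.27, 0.26, 0.27, 0.25, 0.24]
print(best_checkpoint(history, patience=3))
# (4, 7)
```

Here training would stop at epoch 7, but the checkpoint worth keeping is the one from epoch 4.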

Symptom pattern: Validation mAP is strong but performance drops noticeably when tested on images from a different source, time period, or camera. Training loss is significantly lower than validation loss. Per-class performance is strong on common classes and weak on rare ones.

Fix: Use early stopping. Save the best checkpoint based on validation mAP, not the final epoch. Evaluate on a held-out test set from a different source than your training data. Start with the smallest model that achieves acceptable performance rather than the largest.

📖 Detection Book 2 covers domain shift and dataset diversity in depth, including how to evaluate whether your model will hold up in deployment and what to do when it does not. Explore Detection Books

The Mistake That Ties All of Them Together

The four mistakes above have one thing in common: they are all invisible if you only look at your final training metrics.

Wrong annotations produce acceptable loss curves. Bad augmentation produces acceptable training stability. Poor evaluation produces numbers that look fine. Overfitting produces good validation metrics right up until you test on new data.

The habit that prevents all four is simple, and almost nobody does it consistently: look at your actual data and your actual predictions at every stage of the pipeline, not just the numbers that summarise them.

Open five random training images and verify the labels visually before training starts. Save prediction images every ten epochs during training and look at them. Read per-class AP after every evaluation. Test on data from a different source before declaring success. These four habits cost fifteen minutes per training run and prevent the majority of detection failures.

Selected implementations, supporting utilities, and companion resources related to this article are available through the Blog 4 Companion Resources page.

Companion Resources Included

Missing label verification utility
Annotation inspection checklist
Train / validation path verification utility
mAP interpretation reference notes
Overfitting diagnostics guide
Pre-deployment evaluation checklist

Quick Reference

Mistake	Symptom	Fix
Wrong annotations	Class loss plateaus, classes swap in confusion matrix	Define classes in writing, spot-check 50 pairs, verify no missing labels
Bad augmentation	Small objects missed, poor confidence calibration	Start minimal, add one operation at a time, measure each addition
Poor evaluation	Strong metrics, weak real-world performance	Report mAP50 and mAP50-95, verify val path, split at scene level
Overfitting	Val loss rises, deployment performance drops	Early stopping, smallest sufficient model, independent test data

These mistakes are not signs of inexperience; they appear in research projects and production systems alike. What separates teams that catch them quickly from those that spend weeks on the wrong problem is the habit of looking at data and predictions directly, rather than trusting that metrics alone will surface the issue.

The metrics will not surface the issue. They will hide it.

Working on a detection project and running into performance problems that this post does not fully resolve?
Feel free to reach out through occulins.com/contact

Tags: Object Detection Training Mistakes Annotations Data Augmentation Overfitting mAP Evaluation Deep Learning YOLOv12

Resources

Blog 4 Companion Resources.

Supporting assets for debugging object detection training failures, dataset verification, evaluation sanity checks, and failure analysis workflows.

Missing Label Verification

Utility for identifying training images without corresponding annotation files.

Dataset Verification

View Code

Annotation Inspection Workflow

Simple workflow for visually validating image-label pairs before training begins.

Quality Control

View Checklist

Train / Validation Verification

Utility for checking dataset splits and preventing train-validation leakage.

Configuration Check

View Code

mAP Interpretation

Reference helper for understanding the difference between mAP50 and mAP50-95.

Evaluation

View Notes

Overfitting Diagnostics

Guidelines for identifying memorization and poor generalization behavior.

Failure Analysis

View Guide

Evaluation Checklist

Pre-deployment sanity checklist before trusting validation metrics.

Checklist

Open Checklist

Resource Notes

Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.

Return to Blog 4

Resources

Missing Label Verification.

Check whether each training image has a matching YOLO annotation file.

import os

image_dir = "images/train"
label_dir = "labels/train"

images = {
    os.path.splitext(f)[0]
    for f in os.listdir(image_dir)
}

labels = {
    os.path.splitext(f)[0]
    for f in os.listdir(label_dir)
}

missing = images - labels

print(f"Missing labels: {len(missing)}")

for item in sorted(missing):
    print(item)

Resources

Annotation Inspection Workflow.

A practical checklist for visually validating image-label pairs before training.

Verify image-label alignment visually
Inspect bounding box placement
Check class assignments
Check empty labels
Inspect duplicate images
Check class imbalance

Resources

Train / Validation Verification.

Check dataset paths to avoid train-validation leakage.

import yaml

with open("dataset.yaml", "r") as f:
    cfg = yaml.safe_load(f)

print("Train path:", cfg["train"])
print("Validation path:", cfg["val"])

assert cfg["train"] != cfg["val"], \
    "Train and validation paths are identical. Check your YAML file."

Resources

mAP Interpretation.

Quick notes for interpreting mAP50 and mAP50-95 together.

High mAP50 with lower mAP50-95 often indicates localization weakness rather than complete detection failure.

Large gaps suggest predictions may detect objects correctly but produce poorly aligned bounding boxes.

Always inspect qualitative predictions and per-class AP instead of relying on one aggregate metric.

Resources

Overfitting Diagnostics.

Practical signs that a detector is memorizing instead of generalizing.

Training loss decreases while validation loss rises
Large train-validation metric gap
Performance drops on images from unseen sources
Validation metrics become unstable across epochs
Rare classes perform much worse than common classes

Resources

Evaluation Checklist.

Pre-deployment sanity checklist before trusting validation metrics.

Inspect prediction visualizations
Check per-class metrics
Review failure cases
Compare mAP50 and mAP50-95
Inspect false positives and false negatives
Verify train and validation paths
Validate deployment constraints
