Lightweight deep learning for semantic segmentation, object detection, and edge AI across intelligent transportation, UAV analytics, medical imaging, industrial inspection, and aviation safety.
Research-driven lightweight architectures designed for deployment-aware computer vision systems where accuracy, efficiency, and real-world constraints matter.
Some of the application domains where deployment-aware computer vision systems can support real-world analysis, monitoring, and automation.
Lightweight encoder-decoder systems for pixel-level analysis across medical imaging, infrastructure monitoring, aerial analytics, and vision-driven applications.
Efficient detection systems for small objects, visual monitoring, and constrained computer vision applications.
Parameter-efficient architectures balancing accuracy, latency, computational cost, and memory constraints for efficient vision systems.
TensorRT optimization, ONNX deployment workflows, FP32/FP16 benchmarking, and hardware-aware performance evaluation for embedded vision systems.
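The FP32/FP16 benchmarking mentioned above generally comes down to timing the same inference call under each precision after a warm-up phase. The following is a minimal sketch of such a timing harness, not the Occulins benchmarking workflow. The names `benchmark_latency` and `dummy_inference` are hypothetical, and the workload is a pure-Python stand-in rather than a real TensorRT or ONNX engine call.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_latency(run_once: Callable[[], object],
                      warmup: int = 10,
                      iters: int = 50) -> Dict[str, float]:
    """Time a zero-argument inference callable and summarize latency in ms."""
    if iters < 1:
        raise ValueError("iters must be at least 1")
    # Warm-up runs are discarded so one-time costs (lazy initialization,
    # cache population) do not distort the measurement.
    for _ in range(warmup):
        run_once()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    median_ms = statistics.median(samples_ms)
    return {
        "runs": float(len(samples_ms)),
        "median_ms": median_ms,
        "mean_ms": statistics.fmean(samples_ms),
        "max_ms": max(samples_ms),
        # Throughput is derived from the median, which is robust to outliers.
        "fps": 1000.0 / median_ms if median_ms > 0 else float("inf"),
    }


def dummy_inference() -> int:
    # Hypothetical stand-in workload; swap in a real FP32 or FP16
    # inference call when measuring on target hardware.
    return sum(i * i for i in range(20_000))


stats = benchmark_latency(dummy_inference, warmup=5, iters=20)
print({name: round(value, 3) for name, value in stats.items()})
```

To compare precisions, the same harness would be run once per engine variant on the same input. On GPU backends, execution is typically asynchronous, so the callable should block until the result is ready; otherwise the timer measures only the launch overhead.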
Root cause analysis of foreground collapse under cross-entropy on imbalanced datasets. CVC-ClinicDB experiment over 100 epochs.
The decisions tutorials skip: dataset format, model scale, and evaluation methodology. YOLOv12 on VisDrone.
Encoder, skip connections, and decoder from first principles. CVC-ClinicDB training results across 100 epochs.
Wrong annotations, bad augmentation, poor evaluation, and overfitting, each with symptom patterns and fixes.
Whether you are improving a lightweight architecture, optimizing deployment performance, designing a segmentation system, or exploring a computer vision solution, Occulins provides focused technical consultation and development support.
Lightweight architecture development, deployment-focused engineering, and technical consultation for real-world computer vision systems.
Custom segmentation systems for research and applied computer vision tasks requiring accurate region, boundary, or object-level understanding.
Efficient detection systems for visual recognition tasks where accuracy, speed, and computational cost must be balanced.
Design and refinement of compact deep learning architectures for scenarios where parameter count, GFLOPs, latency, and memory matter.
Support for moving trained vision models toward efficient runtime behavior through export, optimization, and deployment-aware testing.
Scheduled, topic-focused sessions for researchers, engineers, and organizations working on computer vision systems.
Targeted discussion around a specific challenge, deployment issue, architecture decision, or experimental problem.
Extended technical review covering model design, debugging, deployment considerations, or experimental analysis.
Choose a structured track for object detection or semantic segmentation from foundations to real-world experiments and architecture design.
Three books covering segmentation foundations, training workflows, debugging, architecture design, and experiment analysis.
Three books covering YOLO workflows, dataset preparation, training, evaluation, architecture design, and deployment-oriented experimentation.
A three-book path covering segmentation foundations, real-world training, debugging, architecture design, and experiment analysis.
A three-book path covering YOLO foundations, custom dataset training, evaluation workflows, architecture design, and deployment-oriented experimentation.
Research-driven articles on semantic segmentation, object detection, architecture design, and deployment-aware engineering.
Root-cause analysis of foreground collapse, class imbalance, loss selection, and prediction diagnostics in segmentation systems.
Practical decisions behind dataset setup, configuration, evaluation, and common failure points in detection experiments.
Why U-Net is shaped the way it is, how skip connections solve the spatial precision problem, and what training on CVC-ClinicDB actually looks like across 100 epochs.
Four training mistakes that are invisible in your loss curves: wrong annotations, harmful augmentation, misleading evaluation, and hidden overfitting, with specific symptoms and fixes for each.
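The foreground-collapse failure mode named in the first article above can be illustrated with a toy calculation. The sketch below is not the CVC-ClinicDB experiment from the article. It uses an invented 1000-pixel mask with 2% foreground and a hypothetical "collapsed" prediction that outputs the same low probability everywhere, to show why pixel-averaged cross-entropy and pixel accuracy can look healthy while an overlap metric such as Dice does not.

```python
import math
from typing import Sequence


def mean_bce(probs: Sequence[float], targets: Sequence[int],
             eps: float = 1e-7) -> float:
    """Pixel-averaged binary cross-entropy."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def soft_dice(probs: Sequence[float], targets: Sequence[int],
              smooth: float = 1e-6) -> float:
    """Soft Dice coefficient computed on probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return (2.0 * intersection + smooth) / (denominator + smooth)


def pixel_accuracy(probs: Sequence[float], targets: Sequence[int],
                   threshold: float = 0.5) -> float:
    """Fraction of pixels classified correctly after thresholding."""
    correct = sum(int(p >= threshold) == t for p, t in zip(probs, targets))
    return correct / len(targets)


# Toy data (assumption): 20 foreground pixels out of 1000.
targets = [1] * 20 + [0] * 980
# Collapsed prediction (assumption): the foreground prior at every pixel.
collapsed = [0.02] * 1000

print("BCE      :", round(mean_bce(collapsed, targets), 4))
print("accuracy :", round(pixel_accuracy(collapsed, targets), 4))
print("soft Dice:", round(soft_dice(collapsed, targets), 4))
```

On this toy mask the collapsed prediction reaches 98% pixel accuracy and a cross-entropy of roughly 0.098, yet its soft Dice is only about 0.02. That gap is one reason overlap-aware losses such as a BCE-Dice combination are often considered for imbalanced segmentation data.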
A curated portfolio of computer vision research and engineering projects built from peer-reviewed work, lightweight deep learning, and deployment-aware vision systems.
RFRE encoder with adaptive 3×3 and 1×1 convolutions, PDCP bridge with parallel dilated convolutions at rates (1,3,5), and BSFD decoder using bottleneck blocks and depthwise separable convolutions.
Lightweight encoder with residual blocks, DGA, and DVM modules; EASPP bridge for multi-scale crack feature extraction; Dual-Level Decoder with 3×3 convolutions and channel attention.
RxDF module with asymmetric depthwise separable convolutions replaces C3K2; LiteFPP with selective pooling at 5×5 and 9×9 replaces SPPF; CRDown replaces Conv at P5.
CMSP with parallel atrous convolutions replaces SPPF; SCR replaces Conv at P5; FFM for channel-wise calibration; SPA with decoupled height/width attention for small object localization.
Comprehensive survey introducing CigDet, a novel annotated dataset for cigarette localization (open-source, Mendeley Data). Benchmarks YOLOv1 through YOLO11. Proposes the SURRONE drone-based outdoor surveillance framework.
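Two of the building blocks named in the project summaries above, parallel dilated convolutions at rates (1, 3, 5) and depthwise separable convolutions, can be understood through simple arithmetic. The sketch below is illustrative only: it assumes 3×3 kernels in the dilated branches and a 64-to-64 channel layer, neither of which is stated in the summaries, and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def standard_conv_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights in a standard convolution (bias terms ignored)."""
    return kernel_size * kernel_size * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights in a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel_size * kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumption: 3x3 kernels in each parallel dilated branch.
for rate in (1, 3, 5):
    extent = effective_kernel_size(3, rate)
    print(f"dilation {rate}: covers {extent}x{extent} pixels with 9 weights")

# Assumption: a 64 -> 64 channel layer with a 3x3 kernel.
standard = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.2f}x")
```

Under these assumptions the three branches cover 3×3, 7×7, and 11×11 regions at the same per-branch weight count, and the separable layer uses 4,672 weights against 36,864 for the standard one, which is why both techniques are common in lightweight designs.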
Protected book companion files, blog code utilities, and curated engineering resources for computer vision workflows.
Book companion resources are organized by track. Resources are provided via post-purchase email.
Companion files for the semantic segmentation book series: loss functions, metric helpers, mask utilities, U-Net references, architecture modules, and experiment support scripts.
Companion files for the object detection book series: dataset YAML templates, annotation checks, prediction helpers, evaluation utilities, ONNX export examples, and configuration references.
Companion resources accompanying published Occulins articles, including selected implementations, utilities, configurations, and supporting materials for practical experimentation and workflow understanding.
Supporting resources for segmentation debugging and foreground-collapse analysis, including U-Net implementation, loss functions, metric utilities, visualization helpers, foreground diagnostics, and experiment configuration.
Supporting resources for custom YOLO training workflows on VisDrone, including dataset validation, YAML configuration, training utilities, prediction workflows, and evaluation guidance.
Supporting resources for the U-Net architecture explanation, including model implementation, DoubleConv block, architecture configuration, skip-connection reference, and selected plotting utilities.
Supporting resources for object detection failure analysis, including missing-label checks, YAML split verification, mAP comparison logic, overfitting diagnostics, and evaluation sanity checks.
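A missing-label check of the kind listed above can be as small as comparing image filenames against label filenames. The sketch below is not the companion utility itself. It assumes a YOLO-style layout in which each image has a same-named `.txt` label file in a separate directory; the function name, directory arguments, and suffix list are hypothetical.

```python
from pathlib import Path
from typing import List, Union

# Assumption: these are the image formats present in the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def find_images_without_labels(images_dir: Union[str, Path],
                               labels_dir: Union[str, Path]) -> List[str]:
    """Return image filenames that have no matching .txt label file."""
    images_dir = Path(images_dir)
    labels_dir = Path(labels_dir)
    missing = []
    for image_path in sorted(images_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files such as notes or caches
        label_path = labels_dir / (image_path.stem + ".txt")
        if not label_path.is_file():
            missing.append(image_path.name)
    return missing
```

A call such as `find_images_without_labels("images/train", "labels/train")` would list the unlabeled images for one split. In YOLO-style pipelines an image without a label file is commonly treated as background rather than raising an error, so a check like this may be the only place the problem becomes visible.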
Curated engineering resources for computer vision research, deep learning experimentation, edge AI deployment, robotics, UAV systems, and applied AI workflows.
| Resource | Best For | Technical Notes | Status |
|---|---|---|---|
| RTX 4090 (24GB) / RTX 5090 (32GB) | Heavy segmentation and detection training | Higher VRAM capacity supports larger input resolutions, heavier architectures, and larger batch sizes for deep learning experiments. | Coming Soon |
| RTX 4080/5080 16GB | Mid-to-high range experimentation | Useful for moderate segmentation and detection workloads with lower cost and power requirements than flagship GPUs. | Coming Soon |
| RTX 4070/5070 12GB | Lightweight to moderate AI research | Suitable for lightweight computer vision experimentation, architecture debugging, and deployment-focused model development. | Coming Soon |
| NVMe SSD 2TB | Dataset storage and fast experiment loading | Useful when working with image datasets, checkpoints, logs, and repeated training runs. | Coming Soon |
| Resource | Best For | Technical Notes | Status |
|---|---|---|---|
| Jetson Orin Nano | TensorRT FP16 deployment experiments | Useful for lightweight object detection and segmentation inference testing on embedded AI hardware. | Coming Soon |
| Jetson Orin NX | Higher-performance edge AI deployment | Suitable for heavier real-time computer vision workloads where embedded inference performance matters. | Coming Soon |
| Raspberry Pi AI Kit | Low-power AI prototyping | Useful for small embedded AI demonstrations and lightweight inference experiments. | Coming Soon |
| Resource | Best For | Technical Notes | Status |
|---|---|---|---|
| USB / RGB Camera Module | Real-time detection and segmentation demos | Useful for prototyping image acquisition pipelines and live computer vision experiments. | Coming Soon |
| Thermal Camera Module | Multimodal and environmental monitoring | Relevant for RGB-thermal object detection, fire monitoring, surveillance, and low-light perception workflows. | Coming Soon |
| AI Robotic Car Kit | Autonomous driving demonstrations | Useful for lane detection, obstacle detection, small-scale perception, and robotics vision experiments. | Coming Soon |
| UAV / Drone Platform | Aerial analytics and remote sensing experiments | Useful for UAV-based crack detection, aerial object detection, forest monitoring, and environmental inspection workflows. | Coming Soon |
| Resource | Best For | Technical Notes | Status |
|---|---|---|---|
| Deep Learning Reference Book | Foundational understanding | Useful for building stronger intuition around neural networks, optimization, and representation learning. | Coming Soon |
| Computer Vision Reference Book | Computer vision fundamentals | Useful for understanding classical and modern vision concepts before moving into applied deep learning systems. | Coming Soon |
| PyTorch Reference Book | Implementation and experimentation | Useful for researchers and engineers implementing models, training loops, debugging, and deployment pipelines. | Coming Soon |
Protected book resources may include selected model files, configuration files, modules, architecture references, and companion assets. Full proprietary repositories, commercial training frameworks, and complete production pipelines are not publicly distributed unless explicitly stated otherwise. Blog companion code may be released separately as open source where applicable.
Recommendations and external resource references are included for educational, research, and workflow guidance purposes. Some links may be affiliate links.
Computer vision research and engineering for deployment-aware systems built for real-world constraints.
Occulins is a computer vision research and engineering lab specializing in lightweight deep learning systems for semantic segmentation, object detection, and deployment-aware vision engineering.
Our work focuses on designing efficient computer vision architectures that balance accuracy, inference speed, parameter efficiency, and deployment constraints across medical imaging, aerial analytics, intelligent transportation, infrastructure monitoring, industrial inspection, aviation safety, and edge AI systems.
Built on years of research experience in deep learning and computer vision, Occulins focuses on translating research into practical engineering workflows. Every architecture is developed with practical constraints in mind, including latency, memory usage, computational efficiency, embedded inference, and real-world deployment requirements.
Occulins is evolving toward a full computer vision research and engineering lab focused on intelligent vision systems, scalable AI infrastructure, and practical computer vision technologies for industry and research-driven applications.
We collaborate on computer vision problems requiring both research depth and practical deployment awareness across constrained and real-world environments.
Parameter-efficient architectures designed for balanced accuracy, computational efficiency, reduced memory footprint, and deployment-aware inference.
Encoder-decoder systems for pixel-level scene understanding across medical, infrastructure, aerial, and industrial imaging domains.
Real-time detection systems for visual monitoring, small object analysis, dense scenes, and deployment-constrained computer vision applications.
Model optimization, deployment workflows, TensorRT acceleration, ONNX export, FP32/FP16 benchmarking, and hardware-aware performance evaluation.
Computer vision research, architecture design, and system development for domain-specific applications across research and industry.
Focused technical consultation and development support for applied computer vision systems.
Occulins works with researchers, startups, and technical teams on lightweight computer vision systems, model optimization, and deployment-aware AI solutions.
A lightweight road-crack segmentation architecture designed for efficient infrastructure monitoring under strict model-size and computation constraints.
LiteFusionNet is designed for road-crack segmentation where thin structures, cluttered road textures, and deployment constraints make dense prediction difficult. The architecture keeps a compact encoder–decoder structure while strengthening feature extraction through residual blocks, dilated gated attention, and dual vision mamba components. Its EASPP bridge captures crack-related context across multiple receptive fields, while the Dual-Level Decoder combines low-level spatial details with high-level contextual cues for sharper pixel-level localization. The result is an efficient segmentation model intended for infrastructure monitoring scenarios where accuracy, parameter count, and GFLOPs must be balanced rather than optimized in isolation.

A compact skin-lesion segmentation model with RFRE encoder, PDCP bridge, and BSFD decoder for dermoscopic image analysis.
DFF-UNet targets skin-lesion segmentation with a compact U-shaped design that focuses on boundary precision and computational efficiency. The RFRE encoder refines feature extraction through residual connections and adaptive convolutional processing, while the PDCP bridge captures contextual lesion patterns using parallel dilated convolutions. The BSFD decoder then fuses semantic and spatial information through bottleneck-based skip fusion, allowing the model to preserve important lesion details without increasing complexity. This makes the architecture suitable for dermoscopic image analysis where precise segmentation and lightweight deployment are both important.

A real-time aerial vehicle detection architecture using RxDF, LiteFPP, and CRDown components for UAV and aerial imagery.
VDXNet is a lightweight aerial vehicle detection model built for intelligent transportation and remote-sensing scenarios. Its design replaces heavier feature aggregation components with RxDF modules that combine spatial and depthwise information efficiently. LiteFPP supports multiscale contextual representation, while CRDown reduces spatial dimensions with lower computational cost. This architecture is shaped around the practical difficulty of detecting small vehicles with different orientations and background clutter, while keeping the model efficient enough for real-time aerial monitoring workflows.

A lightweight FOD detection model for airport runway safety and small-object detection in cluttered visual conditions.
LiteFODNet is designed for small foreign object debris detection in runway surveillance imagery, where targets are visually diverse, sparse, and often difficult to separate from complex backgrounds. The architecture introduces CMSP for compact multiscale context, SCR for efficient downsampling, FFM for channel-level feature calibration, and SPA for spatial attention along separate axes. Together, these modules guide the detector toward small, ambiguous objects while controlling computational burden. The model is positioned for aviation safety monitoring where real-time inference and reliable small-object localization are critical.

A smoker/cigarette detection dataset and surveillance-oriented research framework benchmarking YOLO variants for environmental and safety monitoring.
CigDet / SURRONE represents the environmental and safety monitoring direction of Occulins. The work combines dataset development, YOLO-based benchmarking, and a drone-oriented conceptual framework for cigarette and smoker detection. Instead of focusing only on model accuracy, the project highlights the practical deployment challenges of monitoring open environments, including small target localization, scene variability, and operational constraints. It provides a foundation for future lightweight detection systems in public-health, surveillance, and environmental monitoring scenarios.

Automated visual analysis of civil infrastructure using lightweight computer vision, enabling detection and segmentation of surface conditions, structural elements, and anomalies at scale.
Infrastructure assets degrade continuously through weather, load cycles, and material fatigue. Lightweight vision models enable automated detection of road cracks, bridge spalling, corrosion, and structural deformation from standard camera and drone imagery, providing consistent coverage across large asset portfolios where manual inspection is impractical. Models are designed to operate across varied surface types and imaging conditions, supporting deployment in both fixed-camera monitoring systems and UAV-based inspection workflows.


Detection and localization of surface defects, damage types, and structural anomalies for infrastructure health assessment and maintenance prioritization.

Pixel-level segmentation of road surfaces, bridge decks, and structural facades, enabling precise crack boundary delineation, spalling localization, and defect extent quantification for maintenance prioritization.
Automated crack and spalling detection identifies surface deterioration weeks or months before it reaches the threshold requiring emergency repair, reducing the likelihood of costly structural failures and unplanned road closures.
Image-based inspection of roads, bridges, and facades processes in seconds what takes field teams hours on-site. Coverage that previously required days of manual survey can be completed from drone or vehicle-mounted camera footage in a fraction of the time.
Lightweight models handle thousands of images per day without proportional increases in cost or staffing, making city-scale crack mapping and bridge monitoring operationally feasible for the first time.
Elevated structures, active carriageways, and confined spaces that carry significant risk for inspection personnel can be assessed remotely using UAV-deployed vision systems, without requiring physical access by field teams.
Lightweight AI for clinical image analysis, enabling automated detection and segmentation of regions of interest across a range of medical imaging modalities.
Medical image analysis requires high sensitivity to subtle boundaries and strong generalization across patient populations and imaging conditions. Lightweight architectures designed for clinical deployment provide lesion detection, organ segmentation, and pathology localization across dermatology, histology, and radiology imaging, operating without GPU-heavy infrastructure. Models are built to generalize across imaging protocols as well, supporting integration into clinical review workflows on existing hardware.


Detection and localization of cellular structures, clinical regions of interest, and pathological markers within medical image analysis workflows.

Pixel-level delineation of lesion boundaries, organ contours, and pathological regions, supporting precise area measurement, clinical grading, and multi-dataset generalization across imaging modalities.
Automated segmentation and detection provide a consistent baseline across clinicians, imaging sessions, and patient cohorts, reducing the inter-observer disagreement that affects manual region-of-interest identification in dermoscopy, pathology, and radiology.
Segmentation models delineate lesion margins at the pixel level, enabling objective area measurement and boundary characterization that supports clinical grading, treatment response tracking, and longitudinal comparison across patient visits.
Architectures with fewer than one million parameters run on standard clinical workstation hardware without dedicated GPU infrastructure, making AI-assisted image analysis accessible outside large academic medical centers.
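The objective area measurement mentioned above reduces to counting foreground pixels in a predicted mask and scaling by the physical size of one pixel. The sketch below assumes a binary mask stored as nested lists and a known pixel spacing in millimetres. Both the function name and the spacing values are hypothetical, and many image sources, dermoscopy included, may not provide calibrated spacing at all.

```python
from typing import Sequence, Tuple


def mask_area_mm2(mask: Sequence[Sequence[int]],
                  pixel_spacing_mm: Tuple[float, float] = (0.1, 0.1)) -> float:
    """Physical area of the foreground region of a binary mask in mm^2."""
    row_mm, col_mm = pixel_spacing_mm
    if row_mm <= 0 or col_mm <= 0:
        raise ValueError("pixel spacing must be positive")
    # Any non-zero value counts as foreground.
    foreground_pixels = sum(1 for row in mask for value in row if value)
    return foreground_pixels * row_mm * col_mm


# Toy 4x4 mask (assumption) with six foreground pixels.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]
print(mask_area_mm2(toy_mask, pixel_spacing_mm=(0.5, 0.5)))
```

Tracking this value across visits for the same lesion, with the same spacing and segmentation model, is one simple way to support the longitudinal comparison described above.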
Real-time detection and segmentation for UAV platforms, satellite imagery, and aerial imaging systems, enabling automated analysis of scenes, objects, and land cover from elevated viewpoints.
Aerial imaging introduces unique challenges: small object size, altitude-dependent scale variation, dense scene clutter, and the need for onboard real-time inference under tight power and compute constraints. Lightweight architectures address vehicle detection in satellite imagery, foreign object localization on airport runways, and dense scene segmentation from drone platforms, designed for edge-deployable aerial perception across a range of operational altitudes and imaging conditions.


Real-time localization of objects, vehicles, and structures in aerial and satellite imagery, enabling surveillance, monitoring, and remote sensing at scale.

Dense pixel classification of aerial scenes into land cover categories, structures, and surfaces, supporting change detection, area estimation, and environment mapping from UAV and satellite platforms.
UAV-mounted detection systems survey large areas (transmission corridors, pipelines, coastlines, agricultural land) in a single flight, compressing inspection timelines that would otherwise require days of ground-level access.
Specialized lightweight architectures address the small object detection challenge inherent to aerial imagery, reliably localizing pedestrians, vehicles, and foreign objects that occupy only a handful of pixels at operational altitude.
Edge-optimized models run inference directly on UAV hardware, eliminating the need to transmit raw video to ground stations for processing, reducing bandwidth requirements and enabling real-time decision-making in the field.
Models trained on varied aerial data generalize across imaging altitude, lighting conditions, and scene density, maintaining reliable detection performance from low-altitude close inspection through to high-altitude wide-area surveillance.
Computer vision for road scene understanding, traffic analysis, and transportation monitoring, supporting safer and smarter infrastructure through automated visual analysis.
Transportation environments are fast-moving, visually complex, and safety-critical. Automated vision systems detect vehicles, pedestrians, road markings, and hazard conditions from fixed cameras, dashcams, and roadside sensors, providing continuous monitoring without human operators. Lightweight architectures designed for edge deployment enable real-time scene analysis on in-vehicle hardware and roadside compute units, supporting applications from traffic management to road condition assessment.


Detection of vehicles, road users, and objects of interest in real-time transportation monitoring and road scene analysis applications.

Pixel-wise labeling of road surfaces, lane markings, vehicles, and pedestrians, providing the dense scene understanding required for road condition assessment and autonomous vehicle perception pipelines.
Automated detection and scene analysis run 24/7 across fixed camera networks, providing continuous coverage of junctions, motorways, and urban corridors without the staffing cost of manual video monitoring.
Real-time detection of stopped vehicles, pedestrian incursions, and road hazards enables faster alert generation for traffic management centers, reducing the window between incident occurrence and operator response.
Dense segmentation of road surfaces and markings from vehicle-mounted cameras provides structured condition data across entire road networks, supporting evidence-based maintenance planning without dedicated inspection campaigns.
Computer vision for ecological surveillance, hazard detection, and environmental analysis across outdoor and natural settings.
Environmental monitoring at scale requires processing large volumes of aerial, satellite, and ground-level imagery to detect hazards, track ecological change, and support emergency response. Vision models trained on outdoor and natural scene data identify fire fronts, smoke plumes, flood boundaries, vegetation loss, and industrial anomalies, providing automated alerts and spatial analysis for agencies operating across geography too large for ground-based survey.


Detection and localization of environmental hazards, anomalies, and objects of interest in outdoor scenes for ecological surveillance and safety monitoring.

Pixel-level delineation of fire fronts, smoke plumes, flood extent, and vegetation coverage, enabling precise area quantification for hazard mapping, ecological assessment, and environmental change tracking.
Automated smoke and fire detection in UAV and satellite imagery identifies fire ignition and spread earlier than ground-based observation, extending the time available to deploy suppression resources before a fire becomes uncontrollable.
Segmentation models delineate fire fronts, flood boundaries, and erosion zones at the pixel level, providing accurate area estimates and spatial maps that support evacuation planning, damage assessment, and resource allocation.
Lightweight models applied to satellite and drone imagery enable ongoing surveillance of forests, wetlands, and coastlines at a geographic scale that ground-based observation cannot match, detecting gradual ecological change alongside acute hazard events.
Occulins may collect limited information through contact forms, analytics tools, newsletter subscriptions, external integrations, or services used on this website.
Collected information may include names, email addresses, browser information, device information, interaction data, and technical usage information related to website activity.
Collected information may be used to improve website functionality, respond to inquiries, maintain platform security, analyze website performance, process requested resources, support communication related to services and technical content, or provide updates regarding Occulins resources and offerings.
This website may use cookies, analytics services, or related technologies to understand website usage, improve user experience, monitor website performance, and maintain platform functionality.
Third-party services such as analytics providers, payment processors, embedded media platforms, affiliate platforms, or external integrations may use cookies, tracking technologies, or related scripts as part of their own services.
External websites linked through this website operate under their own policies, terms, and privacy practices. Occulins is not responsible for external platforms, third-party websites, or their policies and services.
Reasonable technical and administrative measures are used to protect collected information; however, no online platform, network, or digital system can guarantee absolute security.
Occulins reserves the right to update, modify, or revise this Privacy Policy at any time without prior notice.
By using this website, you acknowledge and agree to this Privacy Policy.
For legal, privacy, or policy-related inquiries, please use the contact page provided on this website.
Some links, resources, tools, books, hardware references, or recommended products presented on Occulins may be affiliate links. If purchases are made through these links, Occulins may earn a small commission at no additional cost to the user.
Resources referenced on this website are selected based on technical relevance, engineering workflow value, deployment considerations, research practicality, or applicability to computer vision, deep learning, edge AI, robotics, UAV systems, or applied AI engineering workflows.
Affiliate relationships do not influence technical opinions, research discussions, engineering evaluations, or educational content presented on this website.
Occulins does not manufacture, control, or guarantee third-party products, platforms, services, or external resources referenced through affiliate or external links. Users are responsible for evaluating products, compatibility, pricing, availability, and suitability before making purchases or technical decisions.
Third-party websites, marketplaces, payment systems, or affiliate platforms operate under their own policies, terms, and privacy practices. Occulins is not responsible for the content, availability, security, or practices of external services or websites.
Occulins reserves the right to modify, update, or revise this Affiliate Disclosure at any time without prior notice.
By using this website, you acknowledge and agree to this Affiliate Disclosure.
For policy-related inquiries, please use the contact page provided on this website.
By accessing or using Occulins, you agree to comply with these Terms of Service. If you do not agree with any part of these terms, please do not use this website.
Occulins provides research content, engineering resources, technical articles, digital materials, and computer vision-related information for educational, informational, and professional purposes.
All content, visuals, resources, models, documents, books, branding, engineering assets, and materials published on this website remain the intellectual property of Occulins unless otherwise stated.
Protected books, digital products, premium resources, technical documents, architecture designs, configuration files, code modules, and companion materials distributed through Occulins are licensed for authorized personal or organizational use only unless explicitly stated otherwise.
Occulins reserves all rights related to its published materials, engineering resources, research content, technical assets, branding, and proprietary workflows.
Unauthorized reproduction, redistribution, public sharing, reselling, mirroring, commercial redistribution, repackaging, re-uploading, or unauthorized distribution of Occulins content, books, resources, source code, or proprietary engineering materials is strictly prohibited.
Users may not reproduce, redistribute, resell, commercially exploit, or claim ownership of protected content, digital resources, or proprietary materials without permission from Occulins.
Information presented on this website, including technical discussions, code references, deployment suggestions, research insights, architecture explanations, benchmark analyses, engineering recommendations, and educational materials, is provided for informational and educational purposes only and does not constitute professional, legal, financial, medical, or other specialized advice.
While reasonable efforts are made to maintain accurate and up-to-date information, Occulins makes no representations or warranties regarding the accuracy, completeness, reliability, suitability, availability, or performance of any content, resources, downloads, tools, code, or materials provided through this website.
Any use of information, downloads, resources, external tools, code snippets, implementation examples, or referenced workflows is undertaken solely at the user's own discretion and risk. Users are responsible for independently evaluating the suitability, safety, and applicability of any information or materials before use in research, development, commercial, or operational environments.
External links, third-party platforms, payment providers, affiliate resources, embedded services, or recommended products are provided for convenience only. Occulins is not responsible for the content, availability, security, policies, or practices of external services or websites.
Under no circumstances shall Occulins be liable for any direct, indirect, incidental, consequential, technical, financial, or operational damages arising from the use of this website, its content, downloads, resources, or referenced external services.
Occulins reserves the right to modify, update, remove, or discontinue content, resources, services, features, policies, or website sections at any time without prior notice.
By continuing to use this website, you acknowledge and agree to these Terms of Service.
For legal, privacy, or policy-related inquiries, please use the contact page provided on this website.
Companion resources for the semantic segmentation book track. Book PDFs are delivered through Gumroad after purchase, while companion resources are delivered separately by Occulins via email after purchase verification.
Beginner: Companion materials for segmentation fundamentals, U-Net architecture references, BCE-Dice loss examples, segmentation metrics, mask overlay tools, training configuration examples, and prediction visualization utilities.
Intermediate: Companion materials for real-world segmentation training, dataset validation utilities, evaluation scripts, training diagnostics, debugging tools, prediction analysis utilities, and failure-case visualization examples.
Advanced: Advanced companion materials for segmentation architecture design, DFF-UNet architecture references, RFRE, PDCP, and BSFD module descriptions, ablation study templates, experiment configuration examples, evaluation utilities, and published errata.
Companion resources are intended exclusively for verified book owners. After purchasing a book through Gumroad, the PDF is delivered by Gumroad, while the related companion resources are delivered separately by Occulins via email after purchase verification. Resource delivery is typically completed within 1–2 business days.
If resources are not received within the expected timeframe, contact us at contact@occulins.com with your purchase receipt.
Companion resources for the object detection book track. Book PDFs are delivered through Gumroad after purchase, while companion resources are delivered separately by Occulins via email after purchase verification.
Beginner: Companion materials for object detection fundamentals, dataset configuration examples, label validation utilities, dataset integrity checking tools, training and validation templates, inference examples, and reproducibility utilities.
Intermediate: Companion materials for transfer learning, domain adaptation, training configuration examples, evaluation checklists, benchmarking utilities, validation scripts, and real-world deployment workflows.
Advanced: Advanced companion materials for detection architecture design, selected model configuration files, custom module references, ablation study templates, efficiency analysis utilities, deployment examples, benchmarking templates, and published errata.
Companion resources are intended exclusively for verified book owners. After purchasing a book through Gumroad, the PDF is delivered by Gumroad, while the related companion resources are delivered separately by Occulins via email after purchase verification. Resource delivery is typically completed within 1–2 business days.
If resources are not received within the expected timeframe, contact us at contact@occulins.com with your purchase receipt.
You train a segmentation model. Loss decreases. Validation accuracy looks stable. You feel good about where things are heading.
Then you check the actual predictions. Everything is black. No polyps detected. No boundaries. Just empty masks where the objects should be.
Or a more dangerous version of the same problem: the predictions look reasonable after 100 epochs of full training, but if you had stopped at epoch 20, which many practitioners do, especially when time or compute is limited, your model would have been missing 70% of the objects it was supposed to find.
Both of these are symptoms of the same root cause. Understanding it precisely is what allows you to fix it rather than just adjust training until something works.
In semantic segmentation, the model assigns a class label to every pixel in the image. When the foreground class occupies a small fraction of pixels, the model can achieve high overall accuracy simply by predicting background everywhere. From the perspective of a pixel-wise loss function, this is a perfectly rational solution.
This is called a degenerate solution. The model is not broken; it found a local minimum that satisfies the training objective without learning to detect anything useful. The problem is that the training objective did not make foreground detection valuable enough to pull the model out of that minimum.
The CVC-ClinicDB colonoscopy dataset has the following pixel distribution across its 612 images:
Mean foreground (polyp): 9.2%
Median foreground: 6.8%
Images below 5% foreground: 229 (37% of dataset)
Images below 10% foreground: 404 (66% of dataset)
At these fractions, a model that predicts background everywhere achieves pixel accuracy of 90 to 93%. That number will appear in your training logs and look completely reasonable. The foreground IoU will be near zero, but if you are only checking overall accuracy, or if your framework reports it prominently, you may not notice until you look at the actual predictions.
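To make the gap between pixel accuracy and foreground IoU concrete, here is a minimal framework-free sketch that scores a hypothetical all-background predictor at the mean and median foreground fractions quoted above. The function name `all_background_scores` is ours for illustration and does not come from any library.

```python
def all_background_scores(fg_fraction):
    """Pixel accuracy and foreground IoU for a model that predicts background everywhere."""
    tp = 0.0                  # no foreground pixel is ever predicted
    fp = 0.0                  # nothing is predicted as foreground, so no false alarms
    fn = fg_fraction          # every true foreground pixel is missed
    tn = 1.0 - fg_fraction    # every background pixel is classified correctly
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    union = tp + fp + fn
    iou = tp / union if union > 0 else 0.0
    return accuracy, iou


for name, frac in [("mean", 0.092), ("median", 0.068)]:
    acc, iou = all_background_scores(frac)
    print(f"{name} foreground {frac:.1%}: accuracy={acc:.3f} foreground_iou={iou:.3f}")
```

The accuracy column lands in the 90 to 93% range while foreground IoU stays at zero, which is why accuracy alone cannot reveal this failure.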
Cross-entropy computes the loss at every pixel independently and takes the mean across all pixels. The total gradient signal is therefore an average over pixels, so early in training each class contributes roughly in proportion to its pixel count. At a 9% foreground fraction, background pixels supply about 91% of that signal and foreground pixels about 9%.
The model receives roughly ten times more information about how to classify background than how to classify foreground. It learns background first, fast, and confidently. It learns foreground slowly, noisily, and only after background is largely saturated.
This is not a flaw in cross-entropy. It is exactly how an unweighted average behaves on an imbalanced distribution. The flaw is applying it without modification when the imbalance is this severe.
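The 91/9 split can be checked with a small sketch. It assumes a sigmoid output trained with BCE, for which the per-pixel gradient with respect to the logit is proportional to (p - y), and an untrained model that outputs p = 0.5 everywhere. Both are simplifying assumptions; the shares shift once predictions move away from 0.5. The helper name `gradient_share` is hypothetical.

```python
def gradient_share(fg_pixels, bg_pixels, p=0.5):
    """Share of total BCE gradient magnitude from each class for a constant prediction p."""
    # For sigmoid + BCE, d(loss)/d(logit) is proportional to (p - y) per pixel.
    fg_grad = fg_pixels * abs(p - 1.0)   # foreground targets are y = 1
    bg_grad = bg_pixels * abs(p - 0.0)   # background targets are y = 0
    total = fg_grad + bg_grad
    return fg_grad / total, bg_grad / total


fg_share, bg_share = gradient_share(fg_pixels=9, bg_pixels=91)
print(f"foreground share={fg_share:.2f} background share={bg_share:.2f}")
```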
To make this concrete rather than theoretical, I ran two identical experiments on CVC-ClinicDB polyp segmentation. Same model (U-Net trained from scratch with no pretrained weights), same optimizer, same hyperparameters, same 80/20 train-test split with fixed seed. Only the loss function changed.
Experiment A used binary cross-entropy only. Experiment B used BCE combined with Dice loss.
The sensitivity metric, the fraction of true polyp pixels the model correctly identifies, tells the most important part of the story:
| Epoch | BCE Sensitivity | BCE Specificity | BCE+Dice Sensitivity | BCE+Dice Specificity |
|---|---|---|---|---|
| 1 | 0.424 | 0.896 | 0.606 | 0.906 |
| 2 | 0.268 | 0.981 | 0.684 | 0.918 |
| 5 | 0.541 | 0.982 | 0.795 | 0.971 |
| 10 | 0.747 | 0.986 | 0.889 | 0.951 |
| 20 | 0.850 | 0.990 | 0.899 | 0.975 |
| 30 | 0.899 | 0.992 | 0.913 | 0.990 |
Look at epoch 2. BCE sensitivity drops to 0.268 while BCE specificity climbs to 0.981. The model is predicting background with increasing confidence, exactly the degenerate solution described above. BCE + Dice shows sensitivity of 0.684 at the same epoch. It is finding polyps from the very start because the Dice component makes foreground detection non-negotiable for the loss.
By epoch 10, BCE has recovered somewhat to 0.747 sensitivity, but BCE + Dice is already at 0.889. The gap is 14 percentage points at epoch 10, and it does not close until around epoch 30 to 40.
| Configuration | Best Dice | Final IoU | Final Sensitivity | Final Specificity |
|---|---|---|---|---|
| BCE only | 0.9245 | 0.8508 | 0.9177 | 0.9929 |
| BCE + Dice | 0.9277 | 0.8574 | 0.9220 | 0.9951 |
At full convergence after 100 epochs, the two models reach similar performance. BCE + Dice is slightly ahead on every metric in the table, including final sensitivity (0.9220 versus 0.9177), but the margins are small. BCE only reaches comparable sensitivity, though it took about 30 epochs longer to get there. Both are viable at epoch 100. The question is what happens if you stop earlier, and what happens during the 30 epochs where BCE is still catching up.
The single most effective change is replacing cross-entropy alone with a combination of BCE and Dice loss. The code is straightforward:
import torch
import torch.nn as nn

class DiceLoss(nn.Module):
    def __init__(self, eps=1.0):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        # pred is already sigmoid-activated
        num = 2 * (pred * target).sum() + self.eps
        denom = pred.sum() + target.sum() + self.eps
        return 1 - num / denom

bce_loss = nn.BCELoss()
dice_loss = DiceLoss()

def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)
BCE provides stable gradients throughout training, which is particularly important in early epochs when predictions are still near random. Dice ensures the foreground class cannot be overwhelmed by the background gradient signal. Together they are consistently more robust than either alone on imbalanced segmentation tasks.
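A small numeric sketch shows why the two terms pull in different directions. It assumes a hypothetical 100×100 mask with 9% foreground and a model that outputs the same probability at every pixel, which is a deliberate simplification. The Dice term uses the same form as `DiceLoss` above. The helper names are ours.

```python
import math


def bce(p, fg, bg):
    """Mean binary cross-entropy when every pixel is assigned the same probability p."""
    n = fg + bg
    return -(fg * math.log(p) + bg * math.log(1.0 - p)) / n


def dice_loss(p, fg, bg, eps=1.0):
    """Soft Dice loss for a constant prediction p."""
    num = 2.0 * p * fg + eps             # 2 * sum(pred * target) + eps
    denom = p * (fg + bg) + fg + eps     # sum(pred) + sum(target) + eps
    return 1.0 - num / denom


fg, bg = 900, 9100  # hypothetical mask: 9% foreground
for label, p in [("uncertain (p=0.50)", 0.50), ("near-background (p=0.01)", 0.01)]:
    print(f"{label}: bce={bce(p, fg, bg):.3f} dice={dice_loss(p, fg, bg):.3f}")
```

Moving from an uncertain prediction towards all-background lowers BCE but raises the Dice term, so the combined loss no longer rewards collapse as strongly.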
One note on implementation: if your model applies torch.sigmoid internally in its forward method, as many U-Net implementations do, use nn.BCELoss, which expects probabilities in [0, 1]. If your model outputs raw logits, use nn.BCEWithLogitsLoss, which applies sigmoid internally. Mixing these goes wrong in two different ways. Passing already-sigmoided outputs to nn.BCEWithLogitsLoss applies sigmoid twice, which produces near-uniform outputs and very small gradients throughout training. Passing raw logits to nn.BCELoss fails outright, because the inputs fall outside [0, 1].
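The effect of applying sigmoid twice can be seen without any framework. This sketch uses only the standard library and three arbitrary logit values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for logit in (-10.0, 0.0, 10.0):
    once = sigmoid(logit)    # what a sigmoid-terminated model outputs
    twice = sigmoid(once)    # what the loss sees if it applies sigmoid again
    print(f"logit={logit:+.1f} sigmoid={once:.4f} double_sigmoid={twice:.4f}")
```

Because the first sigmoid maps every logit into (0, 1), the second can only produce values between sigmoid(0) = 0.5 and sigmoid(1) ≈ 0.731, so confident predictions become impossible to express.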
Not every blank prediction is a loss function problem. Before changing your training configuration, verify these three things.
After 20 epochs, compute foreground IoU and sensitivity specifically, not overall pixel accuracy. Foreground IoU below 0.10 combined with overall accuracy above 88% is the signature of foreground underweighting. Overall accuracy above 90% on a dataset with 9% foreground is a warning sign, not a success signal.
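As a minimal sketch of that check, the function below computes overall accuracy next to foreground IoU and sensitivity from flattened binary masks. It is written in plain Python for clarity and assumes predictions are already thresholded to 0 or 1; the name `foreground_metrics` and the toy masks are illustrative only.

```python
def foreground_metrics(pred, target, eps=1e-6):
    """Accuracy, foreground IoU, and sensitivity from flat 0/1 sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "iou": tp / (tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
    }


# Toy example: 1 foreground pixel in 10, and a prediction that is all background.
target = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
pred = [0] * 10
print(foreground_metrics(pred, target))
```

The toy case reports 90% accuracy with zero IoU and zero sensitivity, which is the signature described above.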
Open five random image-mask pairs and look at them directly. Confirm the masks contain the objects you expect, that they are not inverted, and that filenames sort in the same order for images and masks. Mismatched pairs are more common than expected and they corrupt the training signal silently.
import numpy as np
from PIL import Image
import os

mask_dir = 'CVC-ClinicDB/masks'
for fname in sorted(os.listdir(mask_dir))[:5]:
    mask = np.array(
        Image.open(
            os.path.join(mask_dir, fname)
        ).convert('L')
    )
    fg = (mask > 127).sum() / mask.size
    print(f"{fname}: unique={np.unique(mask)} "
          f"fg_fraction={fg:.4f}")
Binary masks should have values 0 and 255 (before normalisation) or 0.0 and 1.0 (after). If unique values return something unexpected, all zeros, all 255, or a range of intermediate values, fix the mask loading before adjusting anything else.
Do not wait until epoch 100 to look at predictions. Save a prediction image at fixed intervals during training:
import os
import torch
import matplotlib.pyplot as plt

def save_prediction_sample(model, loader, epoch,
                           save_dir, device):
    model.eval()
    os.makedirs(save_dir, exist_ok=True)
    with torch.no_grad():
        images, masks = next(iter(loader))
        images = images.to(device)
        preds = model(images)
    # Threshold the first prediction in the batch at 0.5
    pred_mask = (preds[0].cpu().squeeze() > 0.5
                 ).float().numpy()
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    axes[0].imshow(images[0].cpu().permute(1, 2, 0).numpy())
    axes[0].set_title('Input'); axes[0].axis('off')
    axes[1].imshow(masks[0].squeeze(), cmap='gray')
    axes[1].set_title('Ground Truth'); axes[1].axis('off')
    axes[2].imshow(pred_mask, cmap='gray')
    axes[2].set_title(f'Epoch {epoch}'); axes[2].axis('off')
    plt.tight_layout()
    plt.savefig(f'{save_dir}/epoch_{epoch:03d}.png', dpi=120, bbox_inches='tight')
    plt.close()
    model.train()  # restore training mode for the next epoch
Sensitivity, the fraction of true foreground pixels correctly identified, is the metric that exposes this failure mode most clearly. Always track foreground sensitivity alongside your primary metrics. For binary medical segmentation especially, it is the number that tells you whether the model is clinically useful or not.
| Symptom | Cause | Fix |
|---|---|---|
| All-black predictions | Complete foreground collapse under BCE | Replace with BCE + Dice combined loss |
| Low sensitivity in early training | BCE gradient dominated by background pixels | BCE + Dice — Dice component protects foreground signal |
| Good accuracy, poor IoU | Model predicting background accurately | Report foreground IoU and sensitivity, not overall accuracy |
| Training looks normal, predictions wrong | Wrong activation-loss pairing | Match BCELoss to sigmoid model, BCEWithLogitsLoss to logit model |
| Cannot tell what is happening | Only tracking aggregate metrics | Visualize predictions every 10 epochs, track sensitivity separately |
The pattern in this experiment is consistent across datasets with moderate to severe foreground imbalance. BCE + Dice does not always produce dramatically higher final metrics after full convergence. What it consistently produces is faster, more reliable foreground detection, particularly in the first 30 epochs where BCE is still learning to find the foreground at all.
Selected implementations, supporting utilities, experiment configurations, and companion resources related to this article are available through the Blog 1 Companion Resources page.
Working on segmentation systems where metrics and predictions do not align?
Reach out through Occulins Contact for deployment-aware computer vision research and engineering support.
Supporting assets for the article: Why Your Segmentation Model Predicts Only Background — understanding foreground imbalance, loss behavior, debugging workflows, and segmentation failure modes.
Architecture: Standard encoder-decoder implementation used during experiments. Included as a reference implementation supporting architectural understanding.
Training Utilities: Dice loss implementation and BCE + Dice objective used to address foreground imbalance and prediction collapse.
Evaluation: Utility functions for IoU, Dice score, sensitivity, specificity, and segmentation quality monitoring.
Debugging: Utilities for foreground ratio inspection, imbalance analysis, and mask sanity checking before training.
Visualization: Prediction monitoring helpers for qualitative analysis and segmentation debugging workflows.
Config: Training configuration example including optimizer settings, experiment parameters, and model setup.
Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.
Reference implementation of the standard U-Net architecture used during experiments.
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 conv + BatchNorm + ReLU layers; spatial size is preserved."""
    def __init__(self, in_channels, out_channels):
        super(DoubleConv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels,
                      kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class UNET(nn.Module):
    def __init__(self, num_classes=1,
                 input_channels=3, **kwargs):
        super().__init__()
        nb_filter = [32, 64, 128, 256, 512]
        self.pool = nn.MaxPool2d(2, 2)
        # Encoder
        self.conv0_0 = DoubleConv(input_channels,
                                  nb_filter[0])
        self.conv1_0 = DoubleConv(nb_filter[0],
                                  nb_filter[1])
        self.conv2_0 = DoubleConv(nb_filter[1],
                                  nb_filter[2])
        self.conv3_0 = DoubleConv(nb_filter[2],
                                  nb_filter[3])
        self.conv4_0 = DoubleConv(nb_filter[3],
                                  nb_filter[4])
        # Bottleneck
        self.bottleneck = DoubleConv(nb_filter[4],
                                     nb_filter[4] * 2)
        # Decoder: each DoubleConv input is doubled by the skip concatenation
        self.upconv4 = nn.ConvTranspose2d(
            nb_filter[4] * 2, nb_filter[4],
            kernel_size=2, stride=2)
        self.conv4_1 = DoubleConv(
            nb_filter[4] * 2, nb_filter[4])
        self.upconv3 = nn.ConvTranspose2d(
            nb_filter[4], nb_filter[3],
            kernel_size=2, stride=2)
        self.conv3_2 = DoubleConv(
            nb_filter[3] * 2, nb_filter[3])
        self.upconv2 = nn.ConvTranspose2d(
            nb_filter[3], nb_filter[2],
            kernel_size=2, stride=2)
        self.conv2_3 = DoubleConv(
            nb_filter[2] * 2, nb_filter[2])
        self.upconv1 = nn.ConvTranspose2d(
            nb_filter[2], nb_filter[1],
            kernel_size=2, stride=2)
        self.conv1_4 = DoubleConv(
            nb_filter[1] * 2, nb_filter[1])
        self.upconv0 = nn.ConvTranspose2d(
            nb_filter[1], nb_filter[0],
            kernel_size=2, stride=2)
        self.conv0_5 = DoubleConv(
            nb_filter[0] * 2, nb_filter[0])
        self.final = nn.Conv2d(
            nb_filter[0], num_classes, kernel_size=1)

    def forward(self, x):
        # Encoder path: keep each stage output for the skip connections
        x0_0 = self.conv0_0(x)
        x1_0 = self.conv1_0(self.pool(x0_0))
        x2_0 = self.conv2_0(self.pool(x1_0))
        x3_0 = self.conv3_0(self.pool(x2_0))
        x4_0 = self.conv4_0(self.pool(x3_0))
        x5_0 = self.bottleneck(self.pool(x4_0))
        # Decoder path: upsample, concatenate the matching encoder output, convolve
        x4_1 = self.conv4_1(
            torch.cat([self.upconv4(x5_0), x4_0], dim=1))
        x3_2 = self.conv3_2(
            torch.cat([self.upconv3(x4_1), x3_0], dim=1))
        x2_3 = self.conv2_3(
            torch.cat([self.upconv2(x3_2), x2_0], dim=1))
        x1_4 = self.conv1_4(
            torch.cat([self.upconv1(x2_3), x1_0], dim=1))
        x0_5 = self.conv0_5(
            torch.cat([self.upconv0(x1_4), x0_0], dim=1))
        # Sigmoid is applied here, so pair this model with nn.BCELoss
        return torch.sigmoid(self.final(x0_5))
class DiceLoss(nn.Module):
    def __init__(self, eps=1.0):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        num = 2 * (pred * target).sum() + self.eps
        den = pred.sum() + target.sum() + self.eps
        return 1 - num / den

bce_loss = nn.BCELoss()
dice_loss = DiceLoss()

def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)
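tp = (pred * target).sum()  # pred, target: binary 0/1 arrays of equal shape
fp = (pred * (1 - target)).sum()
fn = ((1 - pred) * target).sum()
tn = ((1 - pred) * (1 - target)).sum()
iou = tp / (tp + fp + fn + 1e-6)
dice = 2 * tp / (2 * tp + fp + fn + 1e-6)
sensitivity = tp / (tp + fn + 1e-6)
specificity = tn / (tn + fp + 1e-6)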
import numpy as np

# mask: a grayscale mask image (for example loaded with PIL), values 0 or 255
mask = np.array(mask)
foreground_ratio = (mask > 127).sum() / mask.size
print(foreground_ratio)
# save_prediction_sample is the helper defined in the article; it pulls one
# batch from the loader and runs the model itself.
save_prediction_sample(
    model,
    loader,
    epoch,
    save_dir,
    device
)
epochs: 100
batch_size: 8
optimizer: Adam
learning_rate: 1e-4
image_size: 256
dataset: CVC-ClinicDB
Most object detection tutorials follow the same pattern.
Install the framework. Download a pre-prepared dataset. Run the training command. Look at the predictions. Done.
That pattern works perfectly for the tutorial. It almost never works for your actual dataset.
The gap between running a tutorial successfully and training a model on your own data is where most people get stuck, and it is not because they are missing a command or a library. It is because tutorials teach you the steps, not the decisions. And in object detection, the decisions are what determine whether your model learns anything useful.
This post covers those decisions. We will use YOLOv12 trained on the VisDrone dataset as the running example throughout. VisDrone is an aerial drone detection dataset with real challenges that make the decisions matter, which is exactly what we need to learn from.
VisDrone is a drone-captured dataset for detecting pedestrians, cars, vans, trucks, bicycles, and other objects in aerial imagery. It has roughly 6,500 training images and 548 validation images, with objects that are small, densely packed, and photographed from varying altitudes.
Figure 1 — Three sample VisDrone validation images with ground truth annotations. Left: sparse parking lot scene with cars and pedestrians at moderate altitude. Centre: dense night-time street scene with motors, tricycles, and pedestrians. Right: high-altitude view with hundreds of densely packed objects across all ten classes.
It is not a beginner dataset in the sense that the problem is easy. It is a beginner dataset in the sense that it is publicly available, well-structured, and reflects the kinds of real detection challenges you will face in any applied project.
YOLOv12 is one of the newest generations in the YOLO family. It introduces an attention-centric architecture aimed at improving detection accuracy while maintaining real-time inference speed. This blog uses YOLOv12, while our detection book series covers YOLOv11. The workflow and dataset preparation process remain highly similar across both versions, making the core ideas transferable.
For custom training, the nano variant, yolo12n, is the right starting point. It trains fastest, uses the least memory, and gives you a quick feedback loop on whether your configuration and data are set up correctly before committing to a larger model.
Before writing a single line of training code, your dataset needs to be in the exact format YOLO expects. Getting this wrong produces errors that look like model failures but are actually data failures.
YOLO expects images and labels in parallel directories, with training and validation splits clearly separated:
visdrone/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/
The image filename and its label filename must match exactly. frame_0001.jpg must have a corresponding frame_0001.txt in the labels directory. A missing label file does not raise an error: depending on the framework version, the image is either skipped or treated as a background image with no objects. Either way, you can appear to be training on your full dataset while a fraction of it contributes nothing useful.
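One way to catch this before training is to compare filename stems across the two directories. The sketch below works on plain lists of filenames so it can be run anywhere; the helper name `images_without_labels` and the set of image extensions are assumptions for illustration.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed image extensions


def images_without_labels(image_names, label_names):
    """Return image filenames that have no .txt label file with the same stem."""
    label_stems = {Path(n).stem for n in label_names if n.lower().endswith(".txt")}
    return sorted(
        n for n in image_names
        if Path(n).suffix.lower() in IMAGE_EXTS and Path(n).stem not in label_stems
    )


images = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]
labels = ["frame_0001.txt", "frame_0003.txt"]
print(images_without_labels(images, labels))
```

On a real dataset the two lists would come from `os.listdir('visdrone/images/train')` and `os.listdir('visdrone/labels/train')`. An empty result means every image has a label file.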
Each label file contains one line per object in the image. The format is:
class_id x_center y_center width height
All values except class_id are normalised to the range [0, 1] relative to image dimensions. A bounding box that starts at pixel (100, 50) and has width 200, height 80 in a 640×480 image becomes:
0 0.3125 0.1875 0.3125 0.1667
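The arithmetic behind that line can be written as a short helper. This is a minimal sketch assuming the source box is given as a top-left corner plus width and height in pixels; the function name `to_yolo` is ours.

```python
def to_yolo(class_id, x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert a top-left pixel box to normalised YOLO (x_center, y_center, w, h)."""
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return class_id, x_center, y_center, box_w / img_w, box_h / img_h


cls, xc, yc, w, h = to_yolo(0, 100, 50, 200, 80, img_w=640, img_h=480)
print(f"{cls} {xc:.4f} {yc:.4f} {w:.4f} {h:.4f}")  # 0 0.3125 0.1875 0.3125 0.1667
```

Note that x values are divided by the image width and y values by the image height. Swapping the two is a common source of boxes that look plausible but are shifted.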
If your annotations are in COCO JSON format, Pascal VOC XML, or any other format, you need to convert them before training. Do not skip this verification step:
import os

label_dir = 'visdrone/labels/train'
error_count = 0
for fname in os.listdir(label_dir):
    with open(os.path.join(label_dir, fname)) as f:
        for line_num, line in enumerate(f, 1):
            parts = line.strip().split()
            # Each object needs exactly 5 fields: class_id + 4 box values
            if len(parts) != 5:
                print(f"Bad line in {fname}:{line_num}"
                      f" → {line.strip()}")
                error_count += 1
            else:
                # Box values must be normalised to [0, 1]
                vals = list(map(float, parts[1:]))
                if not all(0.0 <= v <= 1.0
                           for v in vals):
                    print(f"Out-of-range in "
                          f"{fname}:{line_num}")
                    error_count += 1
print(f"Checked. Errors found: {error_count}")
YOLO reads dataset configuration from a YAML file. For VisDrone:
# visdrone.yaml
path: /path/to/visdrone
train: images/train
val: images/val
nc: 10
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
The class indices in your label files must match the order in this YAML exactly. A mismatch here, even by one class, will not produce an error during training. It will produce a model that quietly labels everything wrong.
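Because a shifted class index produces no error, it helps to count how often each class id appears and compare that against what you expect from the dataset. The sketch below assumes label lines in the format described earlier and reuses the class names from the YAML; the helper names are hypothetical. It can flag ids outside the configured range, but an in-range shift can only be spotted by reading the counts.

```python
from collections import Counter

NAMES = {0: "pedestrian", 1: "people", 2: "bicycle", 3: "car", 4: "van",
         5: "truck", 6: "tricycle", 7: "awning-tricycle", 8: "bus", 9: "motor"}


def class_counts(label_lines):
    """Count objects per class id across an iterable of YOLO label lines."""
    counts = Counter()
    for line in label_lines:
        parts = line.split()
        if parts:                       # ignore blank lines
            counts[int(parts[0])] += 1
    return counts


def unknown_class_ids(counts, names):
    """Class ids present in the labels but missing from the YAML names."""
    return sorted(cid for cid in counts if cid not in names)


lines = ["3 0.5 0.5 0.1 0.1", "3 0.2 0.3 0.1 0.1", "10 0.4 0.4 0.2 0.2"]
counts = class_counts(lines)
for cid, n in sorted(counts.items()):
    print(cid, NAMES.get(cid, "<not in YAML>"), n)
print("unknown ids:", unknown_class_ids(counts, NAMES))
```

If the most frequent class in the output is not the one you know dominates the imagery, the label indices and the YAML order probably disagree.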
These four decisions collectively matter more than any hyperparameter tuning you do after training begins. Most tutorials present them as fixed values. They are not: each depends on your specific dataset.
YOLOv12 comes in five scales: nano (n), small (s), medium (m), large (l), and extra-large (x). Start with nano. Not because nano is the best model, but because it gives you the fastest feedback loop. Train for 20 epochs with nano. If the model is learning (mAP increasing, losses decreasing), you have confirmed your dataset and configuration are correct. Then scale up if needed.
The Ultralytics YOLO framework uses 640×640 as its standard default image size, and this is what we use here. It is the right starting point for most datasets: well-tested, memory-efficient, and fast to train. On VisDrone at 640 resolution, large and medium objects are detected reliably. Very small objects at high altitude become challenging, which is a characteristic of the dataset rather than a failure of the image size setting.
The right number of epochs is not a fixed number. It is whenever the validation mAP stops improving. For VisDrone with a nano model at 640 resolution, this typically happens somewhere between 80 and 150 epochs. Use early stopping with the patience parameter. Set patience=20 and let the model stop itself. In our run, the model converged at around 115 epochs.
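The idea behind patience can be sketched in a few lines. This is an illustration of the general mechanism, not the framework's exact implementation, which may track a combined fitness score rather than a single mAP value. The function name and the example history are hypothetical.

```python
def early_stop_epoch(val_map_per_epoch, patience=20):
    """Return the 1-based epoch where training would stop, or None if it never triggers."""
    best = float("-inf")
    best_epoch = 0
    for epoch, value in enumerate(val_map_per_epoch, start=1):
        if value > best:
            best, best_epoch = value, epoch      # new best: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                         # no improvement for `patience` epochs
    return None


history = [0.10, 0.20, 0.30, 0.30, 0.29, 0.30, 0.28, 0.30]
print(early_stop_epoch(history, patience=3))  # 6
```

In the example, the best value is first reached at epoch 3, nothing beats it in the next three epochs, and training stops at epoch 6. The weights worth keeping are those from epoch 3, not the last epoch.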
Always start from COCO pretrained weights, not random initialization. The pretrained weights give the model basic visual feature detectors from the start (edges, textures, shapes) that would otherwise take tens of epochs to learn from your custom data alone. The performance difference between pretrained and random initialization is typically 5 to 15 mAP points on a custom dataset of this size.
from ultralytics import YOLO

model = YOLO('yolo12n.pt')
results = model.train(
    data='visdrone.yaml',
    epochs=150,
    imgsz=640,
    batch=8,
    patience=20,
    device=0,
    project='visdrone_runs',
    name='yolo12n_baseline'
)
Every epoch produces a line that looks like this:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/150 5.37G 1.946 2.388 1.032 895 640: 100% ━━━━━━━━━━━━ 809/809 9.9it/s 1:22
Class Images Instances Box(P R mAP50 mAP50-95): 100% ━━━━━━━━━━━━ 35/35 17.1it/s 2.0s
all 548 38759 0.393 0.177 0.123 0.0641
Three numbers tell you whether training is going correctly: box_loss, cls_loss, and dfl_loss.
If any of these three values is not decreasing after 20 epochs, something is wrong. The most common causes are a learning rate that is too high, label errors, or image-label filename mismatches.
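A simple way to apply this rule is to compare the average of the first few epochs with the average of the most recent few, which is less sensitive to epoch-to-epoch noise than comparing two single values. The sketch below is a hypothetical helper operating on a plain list of per-epoch loss values that you have collected yourself; the window size of 5 and the example numbers are arbitrary.

```python
def is_decreasing(values, window=5):
    """True if the mean of the last `window` values is below the mean of the first `window`."""
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window values to compare")
    early = sum(values[:window]) / window
    late = sum(values[-window:]) / window
    return late < early


# Illustrative values only, not taken from the training run described here.
box_loss = [1.95, 1.90, 1.86, 1.83, 1.80, 1.78, 1.75, 1.73, 1.71, 1.70]
flat_loss = [1.95, 1.96, 1.94, 1.95, 1.96, 1.95, 1.96, 1.95, 1.96, 1.95]
print(is_decreasing(box_loss), is_decreasing(flat_loss))
```

Run the same check on each of the three loss series after epoch 20; any series that fails it is the place to start investigating.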
Figure 2 — Training curves for YOLOv12n on VisDrone at 640 resolution over 115 epochs. Box loss and class loss decrease steadily with validation closely tracking training, no overfitting. Validation mAP50 reaches 0.32 and mAP50-95 reaches 0.19 at convergence.
In our training run, box_loss dropped from 1.946 at epoch 1 to 1.41 at convergence. Class loss fell from 2.388 to 0.95. The model converged at epoch 115, reaching a validation mAP50 of 0.32 and mAP50-95 of 0.19.
The best.pt checkpoint is selected based on validation mAP. Some users accidentally point their evaluation script at the training data rather than the validation data. Always verify which split your evaluation is running on before reporting any number.
Figure 3 — YOLOv12n predictions on two VisDrone validation images. Left: a successful case, cars detected with high confidence (0.81–0.88), van at 0.91, and a small pedestrian correctly identified at 0.45. Right: a class confusion failure, vehicles are correctly located but misclassified as bus (0.32, 0.76) and truck (0.35).
from ultralytics import YOLO
import glob

model = YOLO(
    'visdrone_runs/yolo12n_baseline/weights/best.pt'
)
# Take the first 10 validation images for a quick visual check
val_images = glob.glob(
    'visdrone/images/val/*.jpg')[:10]
for img_path in val_images:
    results = model.predict(
        img_path,
        conf=0.25,
        iou=0.45,
        save=True,
        project='predictions',
        name='val_sample',
        exist_ok=True  # reuse one output folder instead of creating one per image
    )
When you look at the prediction images, you are checking for three things:
The steps above will get most people to a working baseline model. For many applications, that is enough.
But a baseline model on VisDrone is not a deployable model. A nano model at 640 resolution reaching mAP50 of 0.32 is a respectable starting point. It confirms your pipeline works and your data is correctly formatted. The harder work is understanding why so many objects are still missed, poorly localized, or misclassified, and which of those failures are addressable through better training strategy versus which are fundamental limitations of the model scale.
VisDrone has characteristics that a standard training run does not fully address: severe small-object density, large variations in altitude and scale, class imbalance between common classes like cars and rare ones like tricycles, and class confusion between visually similar vehicle types. That is exactly where the baseline ends and the real work begins.
Even after improving accuracy, deployment constraints still matter. Parameter count, latency, inference throughput, memory usage, and hardware limitations ultimately determine whether a detector remains useful outside controlled experimentation.
This article focuses on building a reliable baseline detection pipeline. The next challenge is improving robustness, handling domain-specific failures, optimizing deployment constraints, and designing stronger experiments for real-world datasets.
📖 The Occulins detection book track expands these topics through structured workflows, deployment-oriented experimentation, and practical case studies.
Explore Detection Books
Selected implementations, supporting utilities, dataset templates, and companion resources related to this article are available through the Blog 2 Companion Resources page.
| Stage | Key Decision | Common Mistake |
|---|---|---|
| Dataset preparation | Verify label format and filename matching | Silently missing label files |
| Split strategy | Split at sequence level, not image level | Random split inflates validation mAP |
| Model scale | Start with nano, scale up after confirming setup | Training large model on misconfigured data |
| Image size | Use the 640 default — understand what it gives you | Changing image size before confirming data is correct |
| Pretrained weights | Always use COCO pretrained initialization | Training from scratch loses 5–15 mAP |
| Training monitoring | Watch box_loss, cls_loss, dfl_loss per epoch | Waiting until end to check predictions |
| Evaluation | Report both mAP50 and mAP50-95 | mAP50 alone overstates performance by up to 13 points |
Working on a detection project and running into challenges beyond what this post covers?
Feel free to reach out through occulins.com/contact
Supporting assets for custom object detection training using Ultralytics YOLO, dataset preparation workflows, training configuration, prediction analysis, and evaluation utilities.
Dataset: Recommended directory organization and dataset preparation workflow.
Verification: Utilities for label validation, annotation checks, and debugging dataset errors.
Training: Baseline Ultralytics training configuration and early stopping setup.
Inference: Prediction workflow and validation image analysis utilities.
Evaluation: Metrics and sanity checks for validating training success.
Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.
dataset/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/
import os

label_dir = 'labels/train'
for fname in os.listdir(label_dir):
    with open(os.path.join(label_dir, fname)) as f:
        for line in f:
            parts = line.split()
            # Print the file name for any line without exactly 5 fields
            if len(parts) != 5:
                print(fname)
path: dataset/
train: images/train
val: images/val
nc: 10
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor
from ultralytics import YOLO

model = YOLO(
    'yolo12n.pt'
)
model.train(
    data='visdrone.yaml',
    epochs=150,
    imgsz=640,
    batch=8,
    patience=20
)
results = model.predict(
    img_path,
    conf=0.25,
    iou=0.45,
    save=True
)
Check:
✓ box_loss decreasing
✓ cls_loss decreasing
✓ dfl_loss decreasing
✓ mAP50 improving
✓ mAP50-95 improving
✓ prediction quality
U-Net is the most widely used architecture in medical image segmentation. If you have worked in this space for more than a week, you have encountered it.
But most explanations of U-Net either go straight to diagrams without explaining why the architecture is shaped the way it is, or they go so deep into the mathematics that the core idea gets buried.
This post takes a different approach. Before showing you the architecture, it explains the problem that forced the architecture into existence. Once you understand the problem, the design makes complete sense, and you will remember it in a way that a diagram alone cannot achieve.
We will use CVC-ClinicDB polyp segmentation as the practical example throughout. It is the same dataset and model used in Blog 1 of this series. The training results shown here come from the U-Net trained from scratch with BCE + Dice loss in that post. If you have not read Blog 1, you do not need to, but if your model is predicting only background, that post addresses it directly.
Before U-Net existed, the standard approach to segmentation was to take a classification network, something like VGG or AlexNet, and adapt it to produce pixel-level output instead of a single class label.
This sounds reasonable. Classification networks are good at recognising what is in an image. Surely that knowledge can be extended to recognise what is in each pixel.
The problem is what happens to spatial information inside a classification network.
A classification network progressively reduces the spatial dimensions of its feature maps through pooling and strided convolutions. A 256×256 input passes through five downsampling stages and arrives at the bottleneck as an 8×8 feature map. The network then applies a global pooling operation and produces a single class prediction.
At that 8×8 stage, each position in the feature map corresponds to a 32×32 region of the original image. The model knows that something is present somewhere in that region. It does not know where exactly within the region.
For classification, that is fine. For segmentation, where you need to know the exact boundary of a polyp at pixel level, that spatial uncertainty is fatal.
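The arithmetic behind that 32×32 figure is worth making explicit. The sketch below assumes every downsampling stage halves height and width exactly (a 2×2 pool or a stride-2 convolution). The function name `downsampling_trace` is made up for this illustration and does not come from any library.

```python
def downsampling_trace(input_size: int, num_stages: int) -> list[tuple[int, int]]:
    """Return (feature_map_size, input_pixels_per_position) after each stage.

    The second value is the grid spacing (cumulative stride) only,
    not the full receptive field of a position.
    """
    trace = []
    for stage in range(num_stages + 1):
        stride = 2 ** stage  # each stage halves the resolution
        trace.append((input_size // stride, stride))
    return trace


for size, stride in downsampling_trace(256, 5):
    print(f"{size:>3} x {size:<3} map: one position per {stride} x {stride} input pixels")
```

The last line printed is the 8 × 8 map, where one position stands for a 32 × 32 patch of the input. That patch size is the localisation uncertainty described above.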
Figure 1 — The encoder pathway. A 256×256 input passes through five downsampling stages, arriving at the bottleneck as an 8×8 feature map with 1024 channels. Spatial resolution shrinks at each stage while channel depth grows, the network trades location precision for semantic understanding.
The encoder is the left half of U-Net. It applies two convolutional layers at each stage, followed by a max-pooling operation that halves the spatial dimensions before the next stage.
For a 256×256 input image, the five-stage encoder used in our CVC-ClinicDB model produces feature maps at the following resolutions:
| Encoder Stage | Feature Map Size | Channels | Theoretical Receptive Field |
|---|---|---|---|
| Input | 256 × 256 | 3 | 1 × 1 |
| Stage 1 output | 256 × 256 | 32 | 5 × 5 |
| Stage 2 output | 128 × 128 | 64 | 14 × 14 |
| Stage 3 output | 64 × 64 | 128 | 32 × 32 |
| Stage 4 output | 32 × 32 | 256 | 68 × 68 |
| Stage 5 output | 16 × 16 | 512 | 140 × 140 |
| Bottleneck | 8 × 8 | 1024 | 284 × 284 |
The receptive field column gives theoretical values for two 3×3 convolutions per stage with 2×2 max-pooling between stages. At the bottleneck it exceeds the 256×256 input.
Notice two things happening simultaneously. The spatial dimensions shrink, from 256×256 to 8×8, while the channel count grows, from 3 to 1024. The network is trading spatial resolution for representational depth. At the bottleneck, each of the 8×8 positions carries a rich 1024-dimensional description of a large region of the original image. It knows what is there. It has lost the fine-grained where.
This is by design, not by accident. The large receptive field at the bottleneck is what allows the network to understand global context, whether the overall image looks like it contains a large central polyp, or scattered small ones, or nothing unusual at all.
The decoder is the right half of U-Net. Its job is to take the bottleneck feature map, rich in semantic content but poor in spatial detail, and progressively restore spatial resolution until the output matches the original image dimensions.
It does this through transposed convolutions that mirror the encoder's pooling steps. At each stage, the feature map is spatially enlarged by a factor of two, and a pair of convolutional layers refines the upsampled features.
But here is the fundamental problem with a decoder operating alone.
Upsampling is not the inverse of downsampling. When max-pooling reduces a region to a single value, it retains the maximum and discards everything else. No upsampling operation can recover what was discarded. The decoder can produce a spatially large output, but that output will be blurry and imprecise at boundaries because the precise boundary information was lost during encoding and cannot be reconstructed from the bottleneck alone.
This is exactly the problem that forced the skip connection into existence.
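A toy example makes the loss concrete. The sketch below uses plain Python lists, with nearest-neighbour upsampling standing in for the decoder's learned upsampling. Both helper functions are written for this illustration. Two masks with the boundary in different columns pool to the same 2×2 map, so nothing downstream of the pooling, learned or not, can tell them apart.

```python
def max_pool_2x2(grid):
    """2x2 max-pooling with stride 2 on a list-of-lists grid (even sizes assumed)."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def upsample_nearest_2x(grid):
    """Nearest-neighbour upsampling: repeat every value twice in both directions."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Same object, boundary one pixel apart
edge_a = [[0, 0, 0, 1]] * 4   # foreground starts at column 3
edge_b = [[0, 0, 1, 1]] * 4   # foreground starts at column 2

pooled_a = max_pool_2x2(edge_a)
pooled_b = max_pool_2x2(edge_b)
print(pooled_a == pooled_b)                      # both boundaries pool to the same map
print(upsample_nearest_2x(pooled_a) == edge_a)   # the original boundary is not recovered
```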
Figure 2 — Skip connections in U-Net. Each encoder stage passes its feature map directly to the corresponding decoder stage at the same spatial resolution. Encoder 1 (128×128) connects to Decoder 5 (128×128), Encoder 2 (64×64) to Decoder 4 (64×64), and so on. The bottleneck feeds into Decoder 1 (8×8), the first decoder stage. Decoder stages then upsample progressively toward the final output.
The skip connection is U-Net's defining contribution. Rather than requiring the decoder to reconstruct fine spatial detail from the bottleneck alone, it gives the decoder direct access to the encoder's feature maps at each spatial scale, before those maps were downsampled.
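At each decoder stage, two sources of information are concatenated: the upsampled feature map coming up the decoder path, which carries semantic context, and the encoder feature map at the same spatial resolution, which carries fine spatial detail.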
The convolutional layers that follow the concatenation learn to integrate these two sources. The semantic information from the decoder path tells the network what the region is. The spatial information from the encoder path tells it exactly where the boundary is.
This is why U-Net produces sharp, precise segmentation boundaries when a decoder-only architecture produces blurry ones. The boundary precision does not come from clever upsampling. It comes from having direct access to the original encoder features that contained that precision before it was lost to downsampling.
Skip connections in U-Net use concatenation: the encoder and decoder feature maps are stacked along the channel dimension, doubling the channel count before the next convolution. ResNets use addition instead. The choice matters.
Addition requires the two tensors to have the same shape and, for the operation to make sense, the same meaning, because you are merging them into a single representation. Concatenation preserves both representations independently and lets the following convolution learn how to use each one. For segmentation, where the encoder and decoder features carry fundamentally different types of information (spatial precision versus semantic depth), concatenation is the right choice.
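One way to see the difference is to look at a single pixel. The sketch below is a toy illustration in plain Python, with two-channel feature vectors invented for the example. Two different encoder/decoder pairs collapse to the same vector under addition but stay distinguishable under concatenation.

```python
def fuse_add(encoder_vec, decoder_vec):
    """ResNet-style fusion: element-wise sum, channel count unchanged."""
    return [e + d for e, d in zip(encoder_vec, decoder_vec)]


def fuse_concat(encoder_vec, decoder_vec):
    """U-Net-style fusion: stack along the channel axis, channel count doubles."""
    return list(encoder_vec) + list(decoder_vec)


# Pixel A: the encoder fires on channel 0, the decoder on channel 1
pixel_a = ([1.0, 0.0], [0.0, 1.0])
# Pixel B: the roles are swapped
pixel_b = ([0.0, 1.0], [1.0, 0.0])

print(fuse_add(*pixel_a), fuse_add(*pixel_b))        # identical after addition
print(fuse_concat(*pixel_a), fuse_concat(*pixel_b))  # still different after concatenation
```

After addition, the following convolution receives the same input for both pixels. After concatenation, it has separate weights for the encoder and decoder channels and can treat the two cases differently.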
Figure 3 — Complete U-Net architecture as used in the CVC-ClinicDB polyp segmentation experiment. Five encoder stages compress a 256×256 input down to 8×8 at the bottleneck. Five decoder stages restore spatial resolution back to 256×256. Dashed amber arrows show the five skip connections transferring encoder feature maps directly to matching decoder stages. The 1×1 conv with sigmoid at the top of the decoder produces the final binary segmentation mask at full input resolution.
The complete U-Net has a symmetric structure: five encoder stages on the left, a bottleneck at the bottom, and five decoder stages on the right, with skip connections bridging each encoder-decoder pair at the same spatial resolution.
The U shape is not an accident of diagram layout. It is a direct consequence of the architecture's function: compress spatial information as you go down, expand it as you go up, and maintain direct connections between the corresponding levels on each side.
The model used throughout this post is a U-Net trained from scratch on the CVC-ClinicDB colonoscopy polyp dataset. The channel widths follow the pattern [32, 64, 128, 256, 512] across the five encoder stages, with a 1024-channel bottleneck. The loss function is BCE + Dice combined, which was shown in Blog 1 to produce reliable foreground detection from the first epochs.
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """Two 3x3 conv + BatchNorm + ReLU layers; padding=1 keeps height and width unchanged."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels,
                      kernel_size=3, padding=1,
                      bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels,
                      kernel_size=3, padding=1,
                      bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)
class UNET(nn.Module):
    def __init__(self, num_classes=1,
                 input_channels=3, **kwargs):
        super().__init__()
        nb_filter = [32, 64, 128, 256, 512]
        self.pool = nn.MaxPool2d(2, 2)

        # Encoder
        self.conv0_0 = DoubleConv(input_channels, nb_filter[0])
        self.conv1_0 = DoubleConv(nb_filter[0], nb_filter[1])
        self.conv2_0 = DoubleConv(nb_filter[1], nb_filter[2])
        self.conv3_0 = DoubleConv(nb_filter[2], nb_filter[3])
        self.conv4_0 = DoubleConv(nb_filter[3], nb_filter[4])

        # Bottleneck
        self.bottleneck = DoubleConv(nb_filter[4], nb_filter[4] * 2)

        # Decoder: each upconv doubles H and W; each DoubleConv takes the
        # concatenated (upsampled + skip) channels, hence the "* 2" inputs
        self.upconv4 = nn.ConvTranspose2d(nb_filter[4] * 2, nb_filter[4], 2, 2)
        self.conv4_1 = DoubleConv(nb_filter[4] * 2, nb_filter[4])
        self.upconv3 = nn.ConvTranspose2d(nb_filter[4], nb_filter[3], 2, 2)
        self.conv3_2 = DoubleConv(nb_filter[3] * 2, nb_filter[3])
        self.upconv2 = nn.ConvTranspose2d(nb_filter[3], nb_filter[2], 2, 2)
        self.conv2_3 = DoubleConv(nb_filter[2] * 2, nb_filter[2])
        self.upconv1 = nn.ConvTranspose2d(nb_filter[2], nb_filter[1], 2, 2)
        self.conv1_4 = DoubleConv(nb_filter[1] * 2, nb_filter[1])
        self.upconv0 = nn.ConvTranspose2d(nb_filter[1], nb_filter[0], 2, 2)
        self.conv0_5 = DoubleConv(nb_filter[0] * 2, nb_filter[0])

        # 1x1 conv maps the final features to one channel per class
        self.final = nn.Conv2d(nb_filter[0], num_classes, kernel_size=1)

    def forward(self, x):
        # Encoder path: keep every stage output for the skip connections
        x0_0 = self.conv0_0(x)
        x1_0 = self.conv1_0(self.pool(x0_0))
        x2_0 = self.conv2_0(self.pool(x1_0))
        x3_0 = self.conv3_0(self.pool(x2_0))
        x4_0 = self.conv4_0(self.pool(x3_0))
        x5_0 = self.bottleneck(self.pool(x4_0))

        # Decoder path: upsample, concatenate the matching encoder output, refine
        x4_1 = self.conv4_1(
            torch.cat([self.upconv4(x5_0), x4_0], dim=1))
        x3_2 = self.conv3_2(
            torch.cat([self.upconv3(x4_1), x3_0], dim=1))
        x2_3 = self.conv2_3(
            torch.cat([self.upconv2(x3_2), x2_0], dim=1))
        x1_4 = self.conv1_4(
            torch.cat([self.upconv1(x2_3), x1_0], dim=1))
        x0_5 = self.conv0_5(
            torch.cat([self.upconv0(x1_4), x0_0], dim=1))

        # Sigmoid turns the logits into per-pixel foreground probabilities
        return torch.sigmoid(self.final(x0_5))
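def criterion(pred, mask):
    # pred is already a probability map (the model applies sigmoid);
    # bce_loss and dice_loss are assumed to be defined elsewhere.
    return bce_loss(pred, mask) + dice_loss(pred, mask)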
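The `bce_loss` and `dice_loss` helpers are not shown in this post. As a rough illustration of what such a combined loss computes, here is a sketch in plain Python over flattened lists of probabilities and 0/1 labels. It is not the implementation used in the experiment. The `eps` and `smooth` constants are assumptions, and a real training loop would use tensor operations instead.

```python
import math


def bce_loss(pred, mask, eps=1e-7):
    """Mean binary cross-entropy over probabilities in [0, 1]."""
    total = 0.0
    for p, y in zip(pred, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, mask, smooth=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|), computed on probabilities."""
    intersection = sum(p * y for p, y in zip(pred, mask))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(mask) + smooth)


def criterion(pred, mask):
    return bce_loss(pred, mask) + dice_loss(pred, mask)


# 100 pixels, one of them foreground (a heavily imbalanced mask)
mask = [1.0] + [0.0] * 99
all_background = [0.01] * 100          # misses the foreground pixel
good = [0.99] + [0.01] * 99            # finds it

for name, pred in (("all background", all_background), ("good", good)):
    print(f"{name:>14}: BCE={bce_loss(pred, mask):.3f}  Dice loss={dice_loss(pred, mask):.3f}")
```

On this example, the prediction that misses the single foreground pixel still gets a small BCE, because 99 of 100 pixels are right, but a Dice loss close to 1. That second term is what penalises an all-background output on imbalanced masks.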
Figure 4 — U-Net predictions on CVC-ClinicDB test images at epochs 1, 20, 50, and 100. Epoch 1: rough initial detection with noisy boundaries and false positives. Epoch 20: recognisable polyp shapes with improved boundaries, Dice 0.827. Epoch 50: clean segmentation, Dice 0.920. Epoch 100: precise boundaries closely matching ground truth, Dice 0.922.
The progression of predictions across epochs reflects what the model is learning in sequence. It learns the global presence of a polyp before it learns its extent, and it learns approximate extent before it learns precise boundaries. The skip connections are what enable the final stage: precise boundaries require the spatial detail that the encoder preserved, and that detail only becomes useful to the decoder after it has learned the semantic context from the bottleneck.
Figure 5 — Validation metrics over 100 training epochs. IoU and Dice converge steadily, reaching best values of 0.8664 and 0.9277 respectively at epoch 80. Sensitivity peaks at 0.9234 at epoch 25. Specificity reaches 0.9959 at epoch 53. All metrics plateau after epoch 60, indicating full convergence.
The training curves confirm the pattern established in Blog 1. BCE + Dice produces reliable sensitivity from the first epochs: the model finds polyps early and refines boundary precision over subsequent epochs. The IoU and Dice curves show steady improvement without collapse, which is the signature of a well-functioning loss function on an imbalanced dataset.
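For reference, all four metrics in Figure 5 derive from the same pixel-level confusion counts. The sketch below computes them for flattened binary masks in plain Python. The function name is illustrative and the tiny example masks are made up. The evaluation code behind the reported numbers may differ in details such as thresholding and per-image averaging.

```python
def segmentation_metrics(pred, mask):
    """IoU, Dice, sensitivity and specificity from flattened 0/1 masks."""
    tp = sum(1 for p, y in zip(pred, mask) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, mask) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, mask) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(pred, mask) if p == 0 and y == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty classes

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # share of polyp pixels found
        "specificity": ratio(tn, tn + fp),   # share of background pixels kept clean
    }


pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
mask = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for name, value in segmentation_metrics(pred, mask).items():
    print(f"{name:>11}: {value:.3f}")
```

For a single mask, Dice and IoU are tied together by Dice = 2·IoU / (1 + IoU), so the two curves rise and fall together. Specificity stays high on imbalanced data almost regardless of quality, which is why it is the least informative of the four here.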
Understanding the architecture's limitations is as important as understanding its strengths.
Very small objects. When a polyp occupies only 1 to 2% of the image area, it covers roughly one of the 64 positions in the 8×8 bottleneck map. The global context is dominated by the surrounding healthy tissue, so the model has very little signal to work with for objects this small. Higher input resolution and specialized loss functions help, but the fundamental constraint is architectural.
Computational cost at high resolution. The five-stage U-Net with channel widths [32, 64, 128, 256, 512] and a 1024-channel bottleneck, as implemented above, has approximately 31 million parameters. At 512×512 input, a batch of eight images requires substantial GPU memory. Scaling to higher resolutions quickly hits hardware limits.
Multiscale feature capture at the bottleneck. The standard U-Net bottleneck applies the same double-convolution block used throughout the encoder. It has no dedicated mechanism for capturing contextual information at multiple scales simultaneously, a limitation that more recent architectures address explicitly through modules like ASPP or dilated convolution pyramids.
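The size of the model can be checked without instantiating it. The sketch below counts trainable parameters for the layer layout in the listing above. It assumes 3×3 convolutions without bias, BatchNorm with a weight and a bias per channel, 2×2 transposed convolutions with bias, and a 1×1 output convolution with bias. The function names are made up for this illustration.

```python
def double_conv_params(c_in, c_out):
    """Two 3x3 convs (bias=False), each followed by BatchNorm (weight + bias)."""
    convs = c_in * c_out * 9 + c_out * c_out * 9
    batch_norms = 2 * (2 * c_out)
    return convs + batch_norms


def up_conv_params(c_in, c_out):
    """2x2 transposed convolution with bias."""
    return c_in * c_out * 4 + c_out


def unet_params(widths=(32, 64, 128, 256, 512), input_channels=3, num_classes=1):
    total, prev = 0, input_channels
    for width in widths:                      # encoder stages
        total += double_conv_params(prev, width)
        prev = width
    bottleneck = widths[-1] * 2
    total += double_conv_params(prev, bottleneck)
    prev = bottleneck
    for width in reversed(widths):            # decoder stages
        total += up_conv_params(prev, width)
        total += double_conv_params(2 * width, width)  # upsampled + skip channels
        prev = width
    total += widths[0] * num_classes + num_classes     # final 1x1 conv
    return total


print(f"{unet_params():,} trainable parameters")
```

Most of the total sits in the bottleneck and the deepest encoder and decoder stages, so narrowing the widest layers is the quickest way to shrink the model.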
Selected implementations, supporting utilities, and companion resources related to this article are available through the Blog 3 Companion Resources page.
U-Net works because it gives the decoder direct access to the spatial detail that the encoder preserved before discarding it through downsampling. The skip connections are not a regularization trick or an optimization convenience. They are the solution to a specific, fundamental problem: the irrecoverable loss of spatial information during encoding.
Many segmentation architectures that outperform U-Net do so by addressing one of the limitations listed above while keeping the core encoder-decoder-skip-connection structure intact. Understanding why U-Net is designed the way it is makes those improvements immediately comprehensible, because you can see exactly which problem each one is solving.
Training a segmentation model and running into issues this post does not cover?
Feel free to reach out through occulins.com/contact
Supporting assets for the article: U-Net Explained Clearly, architecture logic, skip connections, encoder-decoder structure, and selected visualization utilities.
Reference U-Net implementation used across the related segmentation articles. (Previously Added)
Reusable two-layer convolution block used throughout the U-Net encoder, bottleneck, and decoder stages. (Previously Added)
Architecture setup and training configuration used during the CVC-ClinicDB segmentation experiments. (Previously Added)
Minimal reference showing how encoder features are concatenated with decoder features at matching spatial scales. (Architecture Logic)
Selected plotting utility for visualizing validation Dice and IoU across training epochs. (Visualization)
Short notes describing how the explanatory U-Net figures were structured for clarity and progressive understanding. (Figure Guide)
Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.
Minimal example showing how encoder features are fused with decoder features at matching spatial resolutions.
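# `upsample` stands for the decoder's upsampling step, for example a transposed convolution
decoder_feature = upsample(decoder_feature)
# Stack encoder and decoder features along the channel dimension
fusion = torch.cat([encoder_feature, decoder_feature], dim=1)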
Utility for visualizing validation metrics during training.
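import matplotlib.pyplot as plt

# `history` maps metric names to per-epoch lists; `epochs` is the matching x-axis
plt.plot(epochs, history['val_dice'], label='Dice')
plt.plot(epochs, history['val_iou'], label='IoU')
plt.xlabel('Epoch')
plt.ylabel('Metric')
plt.legend()
plt.grid(True)
plt.show()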
Design philosophy behind the explanatory architecture figures.
Architecture figures were intentionally designed to progressively explain encoder behavior, skip connections, feature hierarchy, and information flow while maintaining visual consistency across diagrams.
Visual complexity was reduced by emphasizing resolution transitions, channel growth, and feature propagation instead of implementation-level detail.
Figures prioritize conceptual understanding and visual clarity rather than framework-specific implementation details.
Object detection training failures share an uncomfortable characteristic: they are often invisible until you look in the right place.
A model can train for a hundred epochs, produce a respectable mAP number, and still be fundamentally broken, predicting the wrong classes, failing on the objects that matter most, or performing well only because the validation set leaked into training. The loss curves look fine. The metrics look acceptable. The predictions on casual inspection look plausible.
After years of working on detection problems across aerial imagery, medical imaging, and industrial inspection, I have seen the same mistakes appear repeatedly, not because the people making them are careless, but because the mistakes are genuinely subtle and the feedback signals that would reveal them are easy to overlook.
This post covers four of them. Each one has a specific symptom pattern that tells you it is present, and a specific fix that resolves it. None of them require changing your architecture or your hardware. All of them require paying closer attention to things you may currently be skipping.
This is the most common root cause of poor detection performance, and the one least often investigated, because it requires looking at data rather than adjusting training parameters.
Annotation errors come in several forms, and they have different effects on training. Understanding which type you have determines what to do about it.
Different annotators, or even the same annotator on different days, draw bounding boxes with different tightness. One person draws the box flush to the visible object edges. Another adds a few pixels of margin. A third includes part of the background context.
When a model is trained on this mixed data, it learns an inconsistent definition of where an object ends and background begins. During inference, predicted boxes are evaluated against ground truth boxes using IoU. If your ground truth boxes are inconsistently sized, your IoU measurements are measuring annotator inconsistency as much as model performance.
Figure 1 — Three annotation styles on the same object: tight (flush to visible edges), loose (2–5px margin), and inconsistent (partial background included). A training set mixing all three styles produces a model that learns no consistent boundary definition.
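To get a feel for how much annotation margin alone costs, the sketch below computes IoU between a tight box and the same box drawn with a uniform margin. Boxes are `(x1, y1, x2, y2)` in pixels. The 20×40 pixel object and the helper names are invented for the example.

```python
def box_area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def with_margin(box, margin):
    """The same box drawn `margin` pixels looser on every side."""
    return (box[0] - margin, box[1] - margin, box[2] + margin, box[3] + margin)


tight = (100, 100, 120, 140)  # a 20 x 40 pixel object
for margin in (0, 2, 5):
    iou = box_iou(tight, with_margin(tight, margin))
    print(f"margin {margin}px -> IoU with the tight box: {iou:.3f}")
```

For an object this small, a 5-pixel margin alone brings IoU down to about 0.53, barely above the 0.5 threshold, before the model has made any error of its own. The same margin on a large object costs far less, which is why tightness inconsistency hurts small-object datasets most.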
This is harder to detect and more damaging than box tightness variation. It happens when the same visual object is labelled differently depending on context, annotator, or a class definition that was never written down precisely.
Common examples: a partially occluded car labelled as "car" in one image and ignored in another. A person on a bicycle labelled as "person" in some images and "cyclist" in others when cyclist is not a defined class. A van labelled as "car" by one annotator and "truck" by another.
The symptom in training is a class loss that decreases slowly or plateaus early, and a confusion matrix where two specific classes consistently swap with each other. If you see that pattern, look at the annotations for those two classes directly.
In YOLO format, an image with no objects should have an empty label file. An image with objects but a missing label file is treated as a negative example: the model is taught that the objects in that image are background. This is one of the most damaging annotation errors because it actively teaches the model wrong information, not just inconsistent information.
Verify this before training:
import os

img_dir = 'dataset/images/train'
label_dir = 'dataset/labels/train'

missing = []
for img_file in os.listdir(img_dir):
    stem = os.path.splitext(img_file)[0]
    label_file = os.path.join(label_dir, stem + '.txt')
    if not os.path.exists(label_file):
        missing.append(img_file)

print(f"Images with no label file: {len(missing)}")
for f in missing[:10]:
    print(f"  {f}")
If this reports more than zero missing files for a dataset that should have objects in every image, the missing label files are training your model to ignore those objects.
Data augmentation is universally recommended, and for good reason: it is one of the most effective tools for improving generalization on limited datasets. But augmentation is not a dial you turn up for better performance. The wrong augmentation strategy can actively damage your model, and the damage is difficult to diagnose because the training metrics often look fine while it is happening.
The most common augmentation mistake in detection is applying aggressive random cropping or zooming-out on datasets where objects are small relative to the image. When you randomly crop 40% of a 640×640 image, a pedestrian that was 20 pixels tall may disappear entirely. If the pipeline does not update the labels to match the crop, the label file still says a pedestrian is present, and the model is trained on images where the labelled object is no longer visible.
For aerial datasets with small objects, the safe augmentation operations are flipping, rotation, and mild color jitter. Heavy cropping, large-scale mosaic augmentation with significant zoom-out, and perspective transforms that shrink objects further are all candidates for removal.
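If cropping stays in the pipeline, the labels have to follow the pixels. The sketch below shows one way to do that for pixel-coordinate boxes `(x1, y1, x2, y2)`: clip each box to the crop window, drop it when too little of it remains visible, and shift the survivors into crop coordinates. The function name and the 0.5 visibility threshold are assumptions for illustration. Augmentation libraries typically provide their own version of this filter.

```python
def crop_labels(boxes, crop, min_visible=0.5):
    """Keep boxes that remain at least `min_visible` visible inside `crop`.

    boxes: list of (x1, y1, x2, y2) in image pixels
    crop:  (x1, y1, x2, y2) crop window in image pixels
    Returns the kept boxes, clipped and expressed in crop coordinates.
    """
    cx1, cy1, cx2, cy2 = crop
    kept = []
    for x1, y1, x2, y2 in boxes:
        area = (x2 - x1) * (y2 - y1)
        # Intersection of the box with the crop window
        ix1, iy1 = max(x1, cx1), max(y1, cy1)
        ix2, iy2 = min(x2, cx2), min(y2, cy2)
        visible = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        if area > 0 and visible / area >= min_visible:
            kept.append((ix1 - cx1, iy1 - cy1, ix2 - cx1, iy2 - cy1))
    return kept


crop = (128, 128, 512, 512)          # window cut from a 640 x 640 image
pedestrian = (600, 300, 608, 320)    # lies entirely outside the window
car = (480, 200, 530, 230)           # 64% of it is inside the window

print(crop_labels([pedestrian, car], crop))
```

The pedestrian's label is dropped together with its pixels, and the car's box is clipped to the part that is still visible. Comparing label counts before and after augmentation is a quick way to see how many small objects a given crop setting removes.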
Aggressive color jitter is appropriate for natural photography where color balance varies widely between cameras and lighting conditions. It is not appropriate for datasets acquired under controlled conditions, such as certain medical imaging formats, thermal imagery, or nighttime surveillance footage, where color characteristics are fixed by the acquisition protocol.
If your deployment images will always look roughly the same in terms of color and exposure, training on aggressively color-jittered images teaches the model to be robust to variation it will never encounter, while reducing its precision on the actual color characteristics of your domain.
YOLO's mosaic augmentation combines four images into one training sample. For most datasets this improves performance by exposing the model to more objects per forward pass. For datasets with very dense small objects (aerial imagery, crowd detection, microscopy), mosaic can produce images where object density exceeds anything in the real deployment distribution. The model learns to expect densities it will never see, which affects both detection thresholds and confidence calibration.
This mistake does not affect how your model trains. It affects whether you know what your model can actually do. Reporting the wrong metric, or computing the right metric on the wrong data, produces numbers that look good while hiding real failures.
mAP50 evaluates detection at a single IoU threshold of 0.5. A predicted box whose IoU with the ground truth is at least 0.5 counts as a correct detection. This is a lenient standard: a box that covers the correct general region but is significantly larger or smaller than the actual object still passes.
mAP50-95 averages detection performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05. At a threshold of 0.75, a predicted box must reach an IoU of 0.75 with the ground truth to count. At 0.95, the box needs to be almost pixel-perfect.
For applications where precise bounding box location matters (measuring object size, feeding detections into a tracking system, or using boxes to guide downstream processing), mAP50-95 is the metric that tells you whether your model is accurate enough. mAP50 can look 15 to 25 points higher than mAP50-95 on the same model.
Figure 2 — mAP50 vs mAP50-95 across training epochs for YOLOv12n on VisDrone. mAP50 reaches 0.325 while mAP50-95 plateaus at 0.185, a gap that reflects the model's ability to find objects in the right location but not localise them precisely. The shaded region between the two curves is what mAP50 alone hides. Always report both metrics.
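The relationship between the two metrics is easier to see for a single detection. The sketch below lists the ten IoU thresholds that mAP50-95 averages over and counts how many of them a matched prediction clears. It is not an mAP implementation, since that also involves ranking detections by confidence and integrating precision over recall. The IoU values passed in are made up.

```python
# The ten thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift)
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a true positive."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


for iou in (0.53, 0.76, 0.96):
    passed = thresholds_passed(iou)
    print(f"IoU {iou:.2f}: correct at {len(passed)} of {len(IOU_THRESHOLDS)} thresholds")
```

A loosely localised detection with IoU 0.53 is a full hit for mAP50 but counts at only one of the ten thresholds behind mAP50-95. A model whose boxes mostly look like that will show exactly the kind of gap described above.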
This sounds too basic to mention. It is not. It happens more often than it should, in two forms.
The first form is accidental: the validation data path in the YAML file points to the training directory due to a typo or copy-paste error. The model evaluates on data it has already memorised. Validation mAP is inflated by 10 to 30 points depending on how long the model has trained.
The second form is subtle: the validation split was created by random sampling at the image level from a dataset where multiple images came from the same video sequence or the same scene. Frames from the same scene look nearly identical. A model that has seen other frames from the same scene during training will perform well on the validation frames, not because it generalizes, but because it has effectively memorised the scene.
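The second form cannot be caught by checking paths. It needs a split that keeps every scene entirely on one side. The sketch below groups files by scene before splitting. It assumes, purely for illustration, that filenames follow a `<scene>_<frame>.jpg` pattern; the `scene_id` helper must be adapted to however your dataset actually encodes the sequence or location.

```python
import random
from collections import defaultdict


def scene_id(filename):
    """Hypothetical naming scheme: everything before the last underscore is the scene."""
    return filename.rsplit("_", 1)[0]


def split_by_scene(filenames, val_fraction=0.2, seed=0):
    """Split so that all frames of a scene land in the same subset."""
    groups = defaultdict(list)
    for name in filenames:
        groups[scene_id(name)].append(name)

    scenes = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(scenes)
    n_val = max(1, round(len(scenes) * val_fraction))
    val_scenes = set(scenes[:n_val])

    train = [n for s in scenes if s not in val_scenes for n in groups[s]]
    val = [n for s in scenes if s in val_scenes for n in groups[s]]
    return train, val


# Five scenes with four frames each
frames = [f"scene{s:02d}_{f:04d}.jpg" for s in range(5) for f in range(4)]
train_files, val_files = split_by_scene(frames)
print(len(train_files), "train frames,", len(val_files), "val frames")
print("shared scenes:", {scene_id(n) for n in train_files} & {scene_id(n) for n in val_files})
```

Splitting this way usually lowers validation mAP compared with a frame-level random split. That drop is the leakage being removed, not the model getting worse.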
# Check for the first form: train and val pointing at the same location
import yaml

with open('dataset.yaml') as f:
    cfg = yaml.safe_load(f)

print("Train path:", cfg.get('train'))
print("Val path:", cfg.get('val'))
assert cfg['train'] != cfg['val'], \
    "Train and val paths are identical — check your YAML"
Mean Average Precision averages across all classes. A model that achieves 0.70 mAP on a ten-class dataset may be achieving 0.90 on five common classes and 0.50 on five rare ones. If the rare classes are the ones that matter in your application, the aggregate number is hiding a failure.
Always look at per-class AP alongside the mean. In Ultralytics, the per-class results are printed at the end of validation. Read them, not just the summary line.
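A small helper makes that habit cheap. The sketch below takes a mapping from class name to AP, however you obtained it, and reports the mean next to the classes that fall below a chosen floor. The AP values are invented for illustration and are not results from any experiment. The function name and the 0.6 floor are likewise assumptions.

```python
def summarize_ap(per_class_ap, floor=0.6):
    """Return (mean AP, sorted names of classes whose AP is below `floor`)."""
    mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
    weak = sorted(name for name, ap in per_class_ap.items() if ap < floor)
    return mean_ap, weak


# Made-up per-class AP values for a ten-class detector
example_ap = {
    "car": 0.91, "bus": 0.88, "van": 0.86, "truck": 0.84, "pedestrian": 0.80,
    "people": 0.55, "motor": 0.53, "bicycle": 0.48, "tricycle": 0.45,
    "awning-tricycle": 0.40,
}

mean_ap, weak_classes = summarize_ap(example_ap)
print(f"mean AP: {mean_ap:.2f}")
print("below floor:", ", ".join(weak_classes))
```

In this invented example the mean looks respectable while half the classes sit below the floor, which is the situation the summary line alone would never reveal.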
Overfitting in detection is not always the obvious case where training loss goes to zero and validation loss spikes. In practice it is often subtler: a model that performs well on the validation set but fails on genuinely new data from a slightly different source.
The textbook version of overfitting is easy to diagnose from the loss curves: training loss decreases steadily while validation loss plateaus or begins to rise. If you see this pattern, the model is memorising the training set rather than learning generalizable features.
Figure 3 — Left: healthy training, training and validation loss decrease together and converge. Right: overfitting, training loss continues decreasing while validation loss plateaus and begins rising. The gold dotted line marks the best checkpoint, the model should be saved here, not at the end of training.
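The checkpoint rule in the caption is simple to state in code. The sketch below scans a list of per-epoch validation losses, remembers the epoch with the lowest value, and stops once `patience` epochs pass without improvement. The function name and the loss values are made up. Training frameworks usually provide a comparable mechanism, such as the `patience` argument in Ultralytics training.

```python
def best_checkpoint(val_losses, patience=20):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    best_epoch: epoch with the lowest validation loss seen before stopping
    stop_epoch: epoch at which training would halt under early stopping
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # new best: save a checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # no improvement for `patience` epochs
    return best_epoch, len(val_losses)


# Validation loss falls, bottoms out at epoch 4, then rises
val_losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.68, 0.71, 0.75]
best, stop = best_checkpoint(val_losses, patience=3)
print(f"keep the checkpoint from epoch {best}; training stops at epoch {stop}")
```

The important detail is that the weights you keep come from the best epoch, not the stopping epoch. Every epoch in between has only moved the model further toward memorising the training set.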
The more dangerous form of overfitting does not show up in your loss curves at all. The model generalises well to your validation set, but your validation set is not representative of deployment conditions.
This happens when training and validation data come from the same source, same time period, same camera, or same geographic location, while deployment data comes from a different source. The model has learned the specific visual characteristics of your dataset (a particular camera's color profile, the typical lighting conditions of a specific location, the image quality of a specific acquisition protocol) rather than the underlying object appearance.
The only way to detect this is to evaluate on truly independent test data from a different source than your training data. If performance drops significantly on that data, your model has overfit to your dataset distribution.
The standard advice, more data, more dropout, stronger augmentation, is correct but incomplete. The more important question is why the model is overfitting, because the answer determines the right fix.
Each of the four mistakes above has something in common: they are all invisible if you only look at your final training metrics.
Wrong annotations produce acceptable loss curves. Bad augmentation produces acceptable training stability. Poor evaluation produces numbers that look fine. Overfitting produces good validation metrics right up until you test on new data.
The habit that prevents all four is simple, and almost nobody does it consistently: look at your actual data and your actual predictions at every stage of the pipeline, not just the numbers that summarise them.
Selected implementations, supporting utilities, and companion resources related to this article are available through the Blog 4 Companion Resources page.
| Mistake | Symptom | Fix |
|---|---|---|
| Wrong annotations | Class loss plateaus, classes swap in confusion matrix | Define classes in writing, spot-check 50 pairs, verify no missing labels |
| Bad augmentation | Small objects missed, poor confidence calibration | Start minimal, add one operation at a time, measure each addition |
| Poor evaluation | Strong metrics, weak real-world performance | Report mAP50 and mAP50-95, verify val path, split at scene level |
| Overfitting | Val loss rises, deployment performance drops | Early stopping, smallest sufficient model, independent test data |
These mistakes are not signs of inexperience; they appear in research projects and production systems alike. What separates teams that catch them quickly from those that spend weeks on the wrong problem is the habit of looking at data and predictions directly, rather than trusting that metrics alone will surface the issue.
The metrics will not surface the issue. They will hide it.
Working on a detection project and running into performance problems that this post does not fully resolve?
Feel free to reach out through occulins.com/contact
Supporting assets for debugging object detection training failures, dataset verification, evaluation sanity checks, and failure analysis workflows.
Utility for identifying training images without corresponding annotation files. (Dataset Verification)
Simple workflow for visually validating image-label pairs before training begins. (Quality Control)
Utility for checking dataset splits and preventing train-validation leakage. (Configuration Check)
Reference helper for understanding the difference between mAP50 and mAP50-95. (Evaluation)
Guidelines for identifying memorization and poor generalization behavior. (Failure Analysis)
Pre-deployment sanity checklist before trusting validation metrics. (Checklist)
Selected resources are designed to support experimentation and practical understanding through focused implementations, utilities, and companion materials accompanying each article.
Check whether each training image has a matching YOLO annotation file.
import os

image_dir = "images/train"
label_dir = "labels/train"

# File stems (names without extension) found on each side
images = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
labels = {os.path.splitext(f)[0] for f in os.listdir(label_dir)}

missing = images - labels
print(f"Missing labels: {len(missing)}")
for item in sorted(missing):
    print(item)
A practical checklist for visually validating image-label pairs before training.
Check dataset paths to avoid train-validation leakage.
import yaml

with open("dataset.yaml", "r") as f:
    cfg = yaml.safe_load(f)

print("Train path:", cfg["train"])
print("Validation path:", cfg["val"])
assert cfg["train"] != cfg["val"], \
    "Train and validation paths are identical. Check your YAML file."
Quick notes for interpreting mAP50 and mAP50-95 together.
High mAP50 with lower mAP50-95 often indicates localization weakness rather than complete detection failure.
Large gaps suggest predictions may detect objects correctly but produce poorly aligned bounding boxes.
Always inspect qualitative predictions and per-class AP instead of relying on one aggregate metric.
Practical signs that a detector is memorizing instead of generalizing.
Pre-deployment sanity checklist before trusting validation metrics.