ANONYMISED CASE STUDY

Continuous Stock Visibility through Computer-Vision Measurement

How we built a camera-driven view of what is physically on the racks, and how big it is, across a 100+ site distribution estate, reconciled against the WMS in near real time.

Software Development

April 13, 2026

COUNTRY

USA

DURATION

12+ months

INDUSTRY

Logistics

TECHNOLOGIES

Python / PyTorch / ONNX Runtime / TensorRT / NVIDIA Jetson / Azure

Myroslav Budzanivskyi

Co-Founder & CTO

SUMMARY

A multi-site distribution operator runs 100+ warehouses and depots, with a roadmap past 500 sites. They needed continuous stock visibility between periodic cycle counts. We built a camera-driven view of what is physically on the racks and on the floor, and how big it is, reconciled against the WMS in near real time.

The hard part was metrology, not detection: turning pixels into centimetres with a defensible error bar, on a moving floor, with non-ideal lighting. Three hard constraints shaped the work: raw video could not leave the site, depot-to-centre links were uneven, and the system could not slow the people working the floor.

Client identity, geography, vendor names and exact figures have been generalised at the client's request. The architecture, measurement approach and deployment model described are as delivered.

Client Profile & Strategic Context

Before the engagement, stock visibility ran on a familiar playbook: cycle counts on a schedule, barcode scans at receiving and putaway, and a WMS that treated those inputs as the source of truth. At smaller scale it held up. At the operator's scale, with the expansion roadmap pushing the estate further every quarter, the same approach started accumulating blind spots faster than the operations team could close them.

Between counts, the operator was blind:

Empty pick-faces went unnoticed until a picker arrived at a gap. Wasted travel, missed SLAs.
Mis-stows (item placed in the wrong location on inbound) surfaced days later, by which point downstream picks against that SKU had already failed.
Dimensional data for inbound goods was missing, or keyed by hand at receiving. Cubing errors fed straight into slotting and load planning.

The net effect at scale: overstated system stock, avoidable stockouts at the pick-face, wasted picker travel, and bad cube data flowing into bad slotting and bad load plans.

Objective

Add a continuous, camera-driven view that becomes the primary signal between cycle counts. Counts and barcode scans stay as inputs but are no longer the only stock-truth signal. The view covers presence (what is on each rack right now), quantity (count and fill-state per location), dimensions (length, width, height and volume of inbound goods, measured rather than keyed), and reconciliation against the WMS in near real time.

Three hard constraints

Constraint 1. Raw video cannot leave the site. Site-level privacy and contractual obligations meant footage could not be backhauled to a central cloud. Only derived measurements and events could egress.

Constraint 2. Depot-to-centre links are uneven. Bandwidth and reliability varied across the estate. The system had to operate, and keep measuring, through degraded links and full disconnections, with no central blind spots when connectivity returned.

Constraint 3. The system cannot slow the floor. Pickers, putaway and receiving could not be asked to pose for the camera, stop to confirm a measurement, or scan extra barcodes. Capture had to be passive and in-flow, with no added cycle time.

Why This Was a Measurement Problem

Standard warehouse computer-vision projects stop at "is there something on the rack?" That is a classification or detection task. The brief here required two further steps beyond localisation, and most teams underestimate both.

1. Localise (computer vision)

Find the object in the frame and segment it from background, neighbouring SKUs, packaging and shelf furniture.

2. Measure (metrology)

Recover real-world dimensions (length, width, height, volume) from imagery, with a known accuracy budget, on a moving floor with non-ideal lighting.

3. Reconcile (systems)

Fuse those measurements against WMS state, raise variances, and route them to the right human in near real time.

Step 2 is the hard part. Pixels are not centimetres. Turning them into centimetres with a defensible error bar is a metrology problem built on top of computer vision. That discipline separates a useful warehouse-vision system from a slideware demo.

Cube measurement detail: segmented carton with calibrated dimensions in millimetres and inches, per-axis tolerance and confidence envelope. A single measurement event: length, width and height come from the calibrated silhouette, each with its own ±tolerance. The confidence envelope decides whether the cube writes back to the WMS automatically or queues for one-tap human confirmation.

Scope of Work

To tackle these challenges, our scope of work included:

Scene calibration

We calibrate each camera against a known scene geometry: fiducial markers placed at receiving and pick-faces, plus reference objects of known dimension already present in the workflow (totes, pallet footprints, shelf beams). Calibration removes lens distortion, recovers intrinsics, and locks the pixel-to-millimetre relationship at the depth ranges the camera sees. Drift detection triggers re-calibration on its own. A site does not stop work to recalibrate.
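To make the pixel-to-millimetre step concrete, here is a minimal Python sketch. It assumes lens distortion has already been removed and that the measured edge lies in the same plane, and at the same depth, as the fiducial. The function names and the 200 mm marker edge are hypothetical and chosen for illustration; they are not taken from the delivered system.

```python
import math

def mm_per_pixel(corner_a_px, corner_b_px, known_edge_mm):
    """Scale factor derived from one fiducial edge of known physical length."""
    edge_px = math.dist(corner_a_px, corner_b_px)
    if edge_px == 0:
        raise ValueError("fiducial corners coincide")
    return known_edge_mm / edge_px

def to_mm(length_px, scale_mm_per_px):
    """Convert an in-plane pixel length to millimetres."""
    return length_px * scale_mm_per_px

# Hypothetical numbers: a 200 mm marker edge that spans 80 px in the image.
scale = mm_per_pixel((100.0, 100.0), (180.0, 100.0), known_edge_mm=200.0)
carton_length_mm = to_mm(240.0, scale)  # a 240 px edge in the marker's plane
```

A single scale factor like this holds only at the marker's depth. Extending it across the depth ranges a camera sees is what the fuller calibration (intrinsics plus scene geometry) is for.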

Object segmentation

A segmentation model isolates the object of interest (a carton, parcel, tote or pallet load) from its background. We use instance segmentation rather than bounding boxes because dimensioning requires silhouettes, not rectangles. Measured from an axis-aligned bounding box, a tilted carton would have its length inflated by several centimetres.
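The bounding-box inflation is easy to check with plain geometry. The sketch below computes the axis-aligned box around a rotated rectangle; the 400 mm by 300 mm carton and the 10 degree tilt are hypothetical example values.

```python
import math

def axis_aligned_extent(length_mm, width_mm, tilt_deg):
    """Extent of the axis-aligned box that encloses a rotated rectangle."""
    t = math.radians(tilt_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    return (length_mm * c + width_mm * s, length_mm * s + width_mm * c)

true_length_mm, true_width_mm = 400.0, 300.0
box_length_mm, box_width_mm = axis_aligned_extent(true_length_mm, true_width_mm, tilt_deg=10.0)

# How much longer the box is than the carton it encloses.
inflation_mm = box_length_mm - true_length_mm
```

For these values the box overstates the length by roughly 46 mm, which is the error a silhouette-based measurement avoids.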

Depth and dimension recovery

Where stereo or depth-capable cameras exist, depth comes from the sensor. Where the existing CCTV is monocular (the common case across the estate), we recover depth from three signals: the calibrated ground plane, reference objects in-frame at known scale, and multi-view fusion when an object passes more than one camera. We then derive length, width and height from the segmented silhouette under the calibrated geometry. Volume follows from those dimensions for cuboidal items, and from the recovered volumetric model for non-cuboidal loads.
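As an illustration of the reference-object signal, the following sketch applies the standard pinhole relation (pixel size = focal length × real size / depth). It assumes the reference object and the measured carton sit at roughly the same depth and face the camera, and that the focal length in pixels is known from calibration. All names and numbers are hypothetical.

```python
def depth_from_reference(focal_px, ref_size_mm, ref_size_px):
    """Pinhole model rearranged for depth: Z = f * real_size / pixel_size."""
    return focal_px * ref_size_mm / ref_size_px

def size_at_depth(extent_px, depth_mm, focal_px):
    """Real-world extent of a pixel span observed at a known depth."""
    return extent_px * depth_mm / focal_px

def cuboid_volume_litres(length_mm, width_mm, height_mm):
    """Volume of a cuboid; 1 litre = 1,000,000 cubic millimetres."""
    return length_mm * width_mm * height_mm / 1_000_000.0

focal_px = 1200.0  # hypothetical intrinsic from calibration
# A 600 mm tote edge spanning 240 px fixes the depth of the scene plane.
depth_mm = depth_from_reference(focal_px, ref_size_mm=600.0, ref_size_px=240.0)

length_mm = size_at_depth(160.0, depth_mm, focal_px)
width_mm = size_at_depth(120.0, depth_mm, focal_px)
height_mm = size_at_depth(100.0, depth_mm, focal_px)
volume_l = cuboid_volume_litres(length_mm, width_mm, height_mm)
```

In practice the three extents rarely share one depth or one view, which is why the ground plane and multi-view fusion are needed alongside the reference object.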

Confidence and fallback

Every measurement carries a confidence envelope, not a bare number. Below a per-SKU-class threshold, the system flags the measurement as advisory and asks for confirmation. It does not push low-confidence cubes into the WMS without review. In a production measurement system, an honest "I'm not sure" beats a confident wrong answer.
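The routing decision can be sketched in a few lines of Python. The SKU classes, threshold values and outcome labels below are hypothetical placeholders; the source does not publish the actual thresholds.

```python
from dataclasses import dataclass

# Hypothetical per-SKU-class thresholds; irregular loads demand more confidence.
THRESHOLDS = {"tote": 0.90, "carton": 0.92, "mixed_pallet": 0.97}
DEFAULT_THRESHOLD = 0.95  # assumed fallback for classes not listed above

@dataclass(frozen=True)
class Measurement:
    sku_class: str
    confidence: float  # 0.0 to 1.0

def route(measurement):
    """Write back automatically only at or above the class threshold."""
    threshold = THRESHOLDS.get(measurement.sku_class, DEFAULT_THRESHOLD)
    if measurement.confidence >= threshold:
        return "write_back"
    return "needs_confirmation"

decisions = [
    route(Measurement("tote", 0.96)),
    route(Measurement("mixed_pallet", 0.96)),
]
```

The same confidence value produces different outcomes for different classes, which is the point of keeping thresholds per class rather than global.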

Ground-truth validation

On a schedule, we match a sample of measurements against physical re-measurement (calipers or a certified cubing station), and against the manufacturer-declared cube where available. The deltas feed model retraining and per-site error budgets. We track accuracy rather than assume it.
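A minimal sketch of the comparison step, assuming each sample is a pair of CV value and physical re-measurement in millimetres. The metric names, the sample values and the 5 mm budget are hypothetical; the source does not state which statistics the harness reports.

```python
from statistics import mean

def error_report(pairs_mm, budget_mm):
    """Summarise CV-vs-ground-truth deltas for one site and one axis."""
    errors = [cv - truth for cv, truth in pairs_mm]
    abs_errors = [abs(e) for e in errors]
    mae = mean(abs_errors)
    return {
        "bias_mm": mean(errors),       # systematic over- or under-measurement
        "mae_mm": mae,                 # typical size of the error
        "worst_mm": max(abs_errors),   # largest single deviation in the sample
        "within_budget": mae <= budget_mm,
    }

# Hypothetical sample: (CV length, caliper length) in millimetres.
sample = [(402.0, 400.0), (298.0, 300.0), (251.0, 250.0), (599.0, 600.0)]
report = error_report(sample, budget_mm=5.0)
```

Reporting bias separately from mean absolute error matters: a non-zero bias points at calibration, while a large spread with zero bias points at segmentation or depth noise.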

Why this matters for any measurement-from-imagery project. Defensible dimensional accuracy from imagery (sub-centimetre on warehouse cartons in this engagement, with tighter error budgets achievable in regimes with simpler scene geometry) does not come from picking a better model. It comes from disciplined calibration, segmentation quality, depth recovery, and honest confidence reporting.

Architecture: Edge-First, Metadata-Only Egress

A dedicated on-site processing unit at each depot runs all CV inference locally. Only structured measurements and events leave the site, never raw video. This section describes the on-prem multi-site delivery surface required by this engagement. The CV and metrology pipeline above is the transferable part of the work.

Architecture diagram: edge-first architecture across 100+ depots. CV inference runs at each site; only structured measurements and variance events reach the central cloud.

Constraint	How the architecture addresses it
Raw video cannot leave site	All inference runs on the depot's edge node. The wire carries structured measurements and events: kilobytes, not video streams.
Uneven depot-to-centre links	Metadata transport stays viable on poor links and degrades gracefully. Edge nodes buffer locally during outages and sync on reconnection, with no central blind spots.
Cannot slow the floor	Passive capture from existing CCTV, plus dedicated cubing-angle cameras at receiving. No operator action, no extra scans, no added cycle time.
Multi-site, growing estate	Each depot is an independent edge node; central scales by region. Adding sites is additive, with no re-architecture from 100 to 500+.
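To show why metadata transport stays viable on poor links, here is a sketch of what a structured measurement event might look like once serialised. The field names and values are assumptions made for illustration; the actual event schema is not described in the source.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MeasurementEvent:
    # All field names are hypothetical.
    site_id: str
    location_id: str
    measured_at: str      # ISO 8601 timestamp assigned at the edge
    length_mm: float
    width_mm: float
    height_mm: float
    tolerance_mm: float
    confidence: float

event = MeasurementEvent(
    site_id="depot-042",
    location_id="RCV-DOCK-3",
    measured_at="2026-01-01T10:00:00Z",
    length_mm=400.0,
    width_mm=300.0,
    height_mm=250.0,
    tolerance_mm=4.0,
    confidence=0.96,
)

# The wire payload is a small JSON document; no image data is included.
payload = json.dumps(asdict(event)).encode("utf-8")
payload_bytes = len(payload)
```

An event of this shape serialises to a few hundred bytes, several orders of magnitude below a continuous video stream.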

Operations console, Sites view: per-depot tiles. Each tile carries the site's current accuracy, open variances, edge-node health and last sync time. Offline sites show buffered-event counts that sync on reconnect, so no central blind spot opens during connectivity loss.

Technology Stack

Every technology choice was driven by three criteria: defensible measurement accuracy at production scale, edge-first operation that respects on-site data constraints, and portability across cloud providers without vendor lock-in.

Layer	Technologies & Rationale
Computer vision (on-device)	Object detection, instance segmentation and multi-object tracking. Calibrated dimensional and volumetric measurement covering length, width, height and volume. Monocular depth recovery from existing CCTV; stereo or RGB-D fusion where depth sensors are present. Per-camera calibration: lens distortion correction, intrinsics, ground-plane recovery, fiducial-anchored re-calibration, automated drift detection. Multi-view fusion when more than one camera sees the object.
Edge runtime (per site)	Python services in containers on GPU-accelerated edge nodes (NVIDIA Jetson, or x86 with a discrete GPU). PyTorch in training; ONNX Runtime or TensorRT for optimised on-device inference. RTSP ingest from existing CCTV, so most positions need no new cameras. Local buffer plus store-and-forward sync for offline operation.
Reconciliation & events (central)	Structured event transport carries measurements and variance events only, never raw video. Near-real-time correlation against WMS state of record. Variance and exception detection with severity routing.
Cloud reference	Azure reference architecture: Entra ID, Key Vault, Azure Policy, regional data residency. Portable to AWS or GCP equivalents.
Integration & surfaces	WMS and ERP read+write via REST for cube write-back, variance feed and stock-state queries. Web operations console in React and TypeScript. Mobile supervisor view for putaway prompts and alert acknowledgement.
ML lifecycle	Training in cloud on the operator's labelled dataset. Ground-truth comparison harness: sampled physical re-measurement versus CV output, feeding model retraining and per-site error budgets. Per-site drift detection, automated re-calibration triggers, scheduled retraining.

Variances queue: live exceptions across the estate with type, severity, routed owner and CV-vs-WMS deltas for resolution. Each exception is typed (mis-stow, cube variance, restock-required), routed to a named operator, and surfaced with the CV measurement next to the WMS record so the operator can confirm or reject in one tap.

What Gets Reconciled, in Near Real Time

CV measurement at depot → structured event to centre → match against WMS → variance routed to the right human.

Pick-face fill state. Empty, low, or full, on a continuous basis, so a gap is known before a picker travels to it.
Putaway confirmation. Was the item placed in the slot the WMS believes? Mis-stows surface in minutes, not days.
Inbound cube. Every receiving event is measured. High-confidence cubes write back to the WMS without human input; the residual low-confidence cases queue for one-tap human confirmation rather than full re-keying.
Floor inventory. Items staged outside their nominal locations are seen, attributed and reconciled against expected flow.
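The matching step for the items above can be sketched as a comparison between one CV observation and the WMS record for the same location. The dictionary keys, the 10 mm cube tolerance and the rule ordering are hypothetical; only the three variance types (mis-stow, cube variance, restock-required) come from the case study.

```python
def reconcile(observation, wms_record, cube_tolerance_mm=10.0):
    """Return the variance types raised for one location."""
    variances = []

    # Pick-face is physically empty while the WMS still shows stock.
    if observation["fill_state"] == "empty" and wms_record["quantity"] > 0:
        variances.append("restock_required")

    # Something is present, but not the SKU the WMS expects in this slot.
    if observation["sku"] is not None and observation["sku"] != wms_record["sku"]:
        variances.append("mis_stow")

    # Measured dimensions disagree with the stored cube beyond tolerance.
    dims = observation.get("dims_mm")
    stored = wms_record.get("dims_mm")
    if dims is not None and stored is not None:
        if max(abs(a - b) for a, b in zip(dims, stored)) > cube_tolerance_mm:
            variances.append("cube_variance")

    return variances

wms = {"sku": "SKU-1", "quantity": 12, "dims_mm": (400.0, 300.0, 250.0)}
seen = {"sku": "SKU-9", "fill_state": "full", "dims_mm": None}
raised = reconcile(seen, wms)
```

Each returned variance type would then be routed by severity to the responsible operator, as described in the flow above.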

Team Shape on This Engagement

Role	Focus
System Architect	End-to-end design, edge/cloud split, constraint trade-offs
ML / CV Engineers (×2)	Segmentation, dimensioning pipeline, calibration, model lifecycle
Backend Engineers	Reconciliation engine, WMS integration, variance routing
DevOps	Edge image build, fleet provisioning, offline sync, observability
QA	Accuracy harness, ground-truth comparison, regression discipline
Project Manager	Bi-weekly delivery, rollout coordination across sites

Outcomes

Measured against the operator's own pre-deployment baseline at the pilot sites, and tracked at each rollout wave:

Cube data: measured. Hand-keying drops to the residual low-confidence cases. Cube accuracy holds against caliper ground truth.
Pick-face stockouts: down. The team sees empty pick-faces before a picker arrives. Replenishment now prioritises against live state, not the last cycle count.
Mis-stow latency: days to minutes. Putaway errors surface in the same shift, not via downstream pick failures.
Downstream effects: slotting and load planning. Better cube data in means better slotting, better load planning and fewer cube-driven re-works.

Specific percentage figures are withheld at the client's request.

Operations console, Network Overview, across a 127-depot US estate. The four KPI tiles map to the project's primary outcomes: cube auto-write-back rate, empty pick-faces, mis-stow latency, and cube-measurement accuracy. The activity feed streams variance events as the edge nodes raise them in near real time.

Why This Work Travels to Other Measurement-from-Imagery Problems

The class of problem solved here: recover a real-world measurement of a physical object from imagery, with a known accuracy budget, at production scale, on mostly existing hardware, with targeted additions only where the geometry requires them. The same class shows up across different industries:

Parcel and pallet cubing in logistics.
Dimensional QC of manufactured parts against spec.
Anatomical measurement from medical imagery (we have separate production work in this space).
Measurement of small, hand-scale objects against a known reference, including consumer-facing video-based sizing applications.

The CV and metrology disciplines that transfer are calibration, segmentation quality, depth recovery, multi-view fusion, honest confidence reporting, and ground-truth validation loops. They are independent of the delivery surface.

The delivery surface itself adapts to the constraint set. In this engagement that meant edge nodes plus central cloud, because raw video could not leave the premises. For consumer-facing sizing on a brand's e-commerce site the constraint set inverts: there are no premises and no fleet of edge nodes, so the pipeline runs server-side behind a cloud API, fronted by an embeddable widget on the brand's own surface. Same five-layer metrology pipeline, different delivery surface.

The reference object that anchors scale also shifts with the regime. In this engagement, reference objects (pallets, totes, fiducial markers) are already in the scene by virtue of the workflow. In a consumer setting they aren't, so scale is anchored on a user-presented object of known dimension (a payment card or printed marker held alongside the object), on the device's own AR-derived scene scale, or on an anatomical prior calibrated against a held reference. Different references, same calibration discipline.

FAQ Section (Technical Deep Dive)

How is dimensional accuracy validated, and how is it maintained over time?

Accuracy is tracked, not assumed. Every measurement leaves the pipeline with a confidence envelope, and a sampled subset feeds a closed-loop validation harness against physical re-measurement using calipers or a certified cubing station.

Three feedback loops run in parallel. Scheduled ground-truth comparison drives per-site error budgets and model retraining. Per-camera drift detection monitors calibration against known scene geometry and fires automated re-calibration when the pixel-to-millimetre relationship moves beyond threshold. Below a per-SKU-class confidence threshold, the system flags the measurement as advisory and asks for one-tap human confirmation rather than writing it back to the WMS.
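The drift-detection loop can be illustrated with a check against a reference object of known size. This is a sketch under assumptions: the median over recent readings, the 1% threshold and the 600 mm reference are hypothetical choices, not the delivered logic.

```python
from statistics import median

def drift_ratio(measured_ref_mm, known_ref_mm):
    """Relative deviation of a measured reference from its known size."""
    return abs(measured_ref_mm - known_ref_mm) / known_ref_mm

def needs_recalibration(recent_readings_mm, known_ref_mm, threshold=0.01):
    """True when the median recent reading drifts past the threshold.

    The median keeps one bad frame from triggering a re-calibration.
    """
    return drift_ratio(median(recent_readings_mm), known_ref_mm) > threshold

KNOWN_TOTE_EDGE_MM = 600.0  # hypothetical reference already in the scene

stable = needs_recalibration([600.5, 599.8, 600.2], KNOWN_TOTE_EDGE_MM)
drifted = needs_recalibration([608.0, 609.0, 607.5], KNOWN_TOTE_EDGE_MM)
```

Because the reference objects are already part of the workflow, a check like this can run continuously without anyone on the floor doing anything.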

The discipline matters more than the model architecture. Sub-centimetre accuracy on warehouse cartons in this engagement came from calibration quality, segmentation rigour, and honest confidence reporting.

Why edge-first inference instead of streaming video to a central cloud?

Three factors made edge-first the only viable choice for this engagement.

Privacy and contract. Raw footage from the warehouse floor could not leave the site. Backhauling video to a central cloud would have violated client policy and the underlying tenant agreements at several depots.

Bandwidth and link reliability. Depot-to-centre connectivity varied widely. Streaming hundreds of camera feeds per site over uneven links was never workable. Edge inference drops wire traffic from continuous video to kilobytes of structured events.

Latency for floor-actionable signals. Pick-face stockouts and putaway errors are only useful if surfaced in near real time. Round-tripping inference to a central cloud would have added unacceptable delay during connectivity dips.

The same five-layer metrology pipeline runs server-side behind a cloud API in our consumer-facing sizing work. The architecture follows the constraint set; the measurement discipline does not change.

What happens at a depot when connectivity to the central cloud drops?

Each depot operates as an independent edge node. A central connectivity outage does not stop measurement, and it does not produce central blind spots when the link returns.

During an outage, CV inference continues locally. Pick-face fill state, putaway confirmation, inbound cube and floor inventory keep updating against the depot's local view. Structured measurement and variance events are written to a local buffer with store-and-forward semantics. Site dashboards and floor-facing alerts fire from the edge node itself, so floor staff are not blocked.

On reconnection, buffered events sync to the central reconciliation engine in order. Variances raised during the outage land in the same exception queue as live ones, with the original event timestamp preserved. No central blind spot, no silent data loss.
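Store-and-forward with ordered replay can be sketched as a first-in, first-out buffer. The class below is an in-memory illustration with hypothetical names; a real edge node would need a durable on-disk queue so that buffered events survive a restart.

```python
from collections import deque

class StoreAndForward:
    """Buffer events while the link is down; replay them in order afterwards."""

    def __init__(self):
        self._buffer = deque()

    def record(self, event, link_up, send):
        """Always buffer first, then try to drain if the link is available."""
        self._buffer.append(event)
        if link_up:
            self.flush(send)

    def flush(self, send):
        """Send oldest first. An event is removed only after send() returns."""
        sent = 0
        while self._buffer:
            send(self._buffer[0])  # if this raises, the event stays buffered
            self._buffer.popleft()
            sent += 1
        return sent

    def pending(self):
        return len(self._buffer)

delivered = []
node = StoreAndForward()
# Two events raised during an outage keep their original timestamps.
node.record({"seq": 1, "measured_at": "2026-01-01T10:00:00Z"}, link_up=False, send=delivered.append)
node.record({"seq": 2, "measured_at": "2026-01-01T10:05:00Z"}, link_up=False, send=delivered.append)
# The link returns: the backlog drains before the new event is sent.
node.record({"seq": 3, "measured_at": "2026-01-01T10:30:00Z"}, link_up=True, send=delivered.append)
```

Removing an event only after a successful send is what prevents silent loss if the link drops again mid-sync.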

How can existing CCTV deliver measurement accuracy without depth sensors?

Most positions in the estate are monocular CCTV. Depth-capable sensors exist only at receiving and a handful of high-value pick zones. Recovering real-world dimensions from monocular imagery is solvable when scene geometry is calibrated.

Three signals combine to recover depth: a calibrated ground plane that locks the pixel-to-millimetre relationship at the depth ranges the camera sees; in-frame reference objects at known scale (standard totes and pallet footprints, plus fiducial markers placed at receiving) that anchor calibration and keep it self-correcting over time; and multi-view fusion when an object passes more than one camera, so observations are combined into a single volumetric estimate.
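One common way to combine observations from several cameras is inverse-variance weighting, shown below. This is an assumption for illustration: the case study does not state which fusion method is used, and the sketch fuses a single scalar dimension rather than a full volumetric estimate.

```python
def fuse(estimates):
    """Inverse-variance weighted mean of (value_mm, sigma_mm) estimates.

    Returns the fused value and its standard deviation, assuming the
    per-camera errors are independent.
    """
    weights = [1.0 / (sigma * sigma) for _, sigma in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, (1.0 / total) ** 0.5

# Two cameras see the same carton edge with equal uncertainty.
fused_mm, fused_sigma_mm = fuse([(402.0, 4.0), (398.0, 4.0)])

# A precise view dominates a noisy one.
skewed_mm, skewed_sigma_mm = fuse([(410.0, 10.0), (400.0, 2.0)])
```

The fused uncertainty is always smaller than the best single view, which is how a second camera tightens the confidence envelope.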

Where depth sensors are present we use them directly. Where they are not, the monocular pipeline produces measurements with a confidence envelope, and low-confidence cases queue for human review rather than write back blindly.

How does the architecture scale from 100 depots to 500 without rework?

Each depot is a self-contained edge unit. Adding sites is additive, not architectural.

Inference, buffering and local dashboards run per depot, so adding a new site does not change the failure mode of existing sites or increase central load per site. The reconciliation engine, variance routing and analytics layer scale by region, with wave-based rollout keeping central capacity ahead of edge population. Edge nodes are built from a versioned image, which makes bringing a new site online a fleet operation rather than a bespoke deployment.

What changes between rollout waves is calibration data and per-SKU-class error budgets, not the architecture. The expansion roadmap past 500 sites does not require re-platforming.

How does the system handle non-cuboidal loads and irregular pallets?

Standard cubing pipelines assume cuboids. Real distribution estates do not. Mixed pallets, shrink-wrapped loads with overhanging cartons, irregular cases and partially-built pallets all show up at receiving and have to be measured correctly.

The pipeline isolates the object with instance segmentation rather than bounding boxes. A bounding box around a tilted or irregular load inflates measured dimensions by several centimetres; the silhouette preserves the true outline. For cuboidal items, length, width and height are derived directly from the calibrated silhouette. For non-cuboidal loads, the pipeline reconstructs a volumetric model from multi-view observations and returns volume directly, plus a bounding cuboid for slotting and load planning where downstream systems require one.
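The difference between true volume and bounding cuboid can be shown with an occupancy grid. The sketch assumes the multi-view reconstruction has already produced a set of occupied voxels; the voxel representation, the 100 mm voxel edge and the stepped example load are all hypothetical.

```python
def voxel_volume_litres(voxels, edge_mm):
    """Volume of the occupied voxels; 1 litre = 1,000,000 cubic millimetres."""
    return len(voxels) * edge_mm ** 3 / 1_000_000.0

def bounding_cuboid_mm(voxels, edge_mm):
    """Axis-aligned cuboid (x, y, z extents) that encloses every voxel."""
    axes = list(zip(*voxels))
    return tuple((max(a) - min(a) + 1) * edge_mm for a in axes)

# A stepped load: a full 4x4 base two voxels high, plus a half-width upper tier.
base = {(x, y, z) for x in range(4) for y in range(4) for z in range(2)}
tier = {(x, y, z) for x in range(2) for y in range(4) for z in range(2, 4)}
load = base | tier

EDGE_MM = 100.0
volume_l = voxel_volume_litres(load, EDGE_MM)
cuboid = bounding_cuboid_mm(load, EDGE_MM)
cuboid_l = cuboid[0] * cuboid[1] * cuboid[2] / 1_000_000.0
```

Here the load occupies 48 litres but its bounding cuboid is 64 litres, so returning both lets slotting use the cuboid while load planning uses the true volume.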

Confidence is naturally lower on irregular loads. The per-SKU-class threshold machinery handles that: a confidently-measured tote writes back to the WMS automatically; an irregular pallet with elevated uncertainty queues for one-tap confirmation.

What data leaves the depot, and what stays on-site?

Raw video never leaves the depot. The wire carries structured measurements and events only.

Inside the depot, behind the edge node, sit all camera feeds over RTSP, all CV inference, the local event buffer during connectivity outages, plus site dashboards and floor-facing alerts. What egresses to the central cloud is a deliberately short list:

Structured measurement events with confidence envelope
Variance events against WMS state
Operational telemetry (edge node health, calibration drift signals, queue depth)
Sampled ground-truth measurements for the validation harness, anonymised at source

The boundary is enforced at the edge node, not in policy alone. Bandwidth caps and outbound network rules block raw frame egress at the network layer, so accidental misconfiguration cannot push video off-site.
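An application-level allowlist can complement the network-layer controls described above. The sketch below is an illustrative extra check, not the enforcement mechanism from the case study; the event type names and the 4 KB payload cap are hypothetical.

```python
# Hypothetical event types mirroring the egress list above.
ALLOWED_EVENT_TYPES = {
    "measurement",
    "variance",
    "telemetry",
    "ground_truth_sample",
}
MAX_PAYLOAD_BYTES = 4096  # assumed cap; far too small to carry a video frame

def may_egress(event_type, payload):
    """Allow only known event types whose payload fits under the size cap."""
    return event_type in ALLOWED_EVENT_TYPES and len(payload) <= MAX_PAYLOAD_BYTES

small_event = b'{"length_mm": 400.0, "confidence": 0.96}'
frame_like_blob = bytes(200_000)  # stands in for raw image data

allowed = may_egress("measurement", small_event)
blocked_by_size = may_egress("measurement", frame_like_blob)
blocked_by_type = may_egress("raw_frame", small_event)
```

A size cap alone is a coarse filter, which is why the outbound network rules remain the primary boundary.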
