Pick and Place Vision System: How It Works and What It Costs to Build One
- Apr 13
- 6 min read
A robot arm without vision is a machine that repeats a fixed motion. It works perfectly until something shifts. A part arrives at a slightly different angle. A bin empties unevenly. A product changeover happens. At that point, a blind robot either stops, crashes, or keeps placing parts in the wrong position until someone intervenes.
A pick and place vision system solves that. It gives the robot arm the ability to see where parts actually are, calculate the correct pick point in real time, and adapt to variation without reprogramming. The result is a system that handles the real world rather than a controlled simulation of it.
This post covers how a pick and place vision system works end to end, when you need 2D versus 3D, what the full system costs, and which robot arms pair cleanly with a vision-guided setup starting at $3,500.
How a Pick and Place Vision System Works
A vision-guided pick and place system runs a repeating loop: capture, process, pick, place, repeat. Each cycle involves four steps working in close sequence.
Image capture - A camera positioned above the work area or mounted on the robot arm captures an image of the part field. The trigger is typically a sensor signal, a robot request, or a timed interval synchronized with the conveyor or staging cycle. Lighting is controlled and consistent, which is one of the most important factors for reliable vision results.
Object detection and pose estimation - The vision software processes the image to identify the target object, determine its position in X and Y coordinates, and calculate its orientation (rotation angle). For 3D systems, depth is also calculated, giving the robot Z-axis placement data. This step is where the intelligence of the system lives, whether that is a rule-based pattern matcher or a deep learning model trained on images of your specific parts.
Coordinate output to the robot - The vision system passes the calculated pick coordinates to the robot controller. This happens over a standard communication protocol, most commonly TCP/IP or a vendor-specific interface. The robot receives the coordinates and moves to the calculated pick position rather than a fixed pre-programmed point.
Pick and place execution - The robot picks from the calculated position, reorients the part if needed, and places it at the target location. The cycle repeats with the next image capture.
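The four steps above can be sketched in a few lines of Python. This is a minimal, numpy-only stand-in, assuming a grayscale frame where the part is brighter than the background: image moments give the centroid (X, Y) and the principal-axis rotation, the same quantities a commercial vision package computes with contour analysis or template matching. The "PICK x y angle" wire format is invented for illustration; real controllers each define their own TCP/IP or vendor protocol.

```python
import numpy as np

def detect_pick_pose(image, threshold=127):
    """Estimate a pick pose (x, y, angle) from one grayscale frame.

    Uses image moments: the centroid of the bright pixels gives X/Y, and
    the principal axis of the second central moments gives rotation. A
    minimal stand-in for a real vision library's detection step.
    """
    mask = image > threshold
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None                       # nothing detected in the frame
    x, y = xs.mean(), ys.mean()           # centroid = pick point in pixels
    # Central second moments -> orientation of the part's major axis.
    mu20 = ((xs - x) ** 2).mean()
    mu02 = ((ys - y) ** 2).mean()
    mu11 = ((xs - x) * (ys - y)).mean()
    angle = 0.5 * np.degrees(np.arctan2(2 * mu11, mu20 - mu02))
    return x, y, angle

def format_pick_command(pose):
    """Encode the pose as one ASCII line for the robot controller.

    'PICK x y angle' is a made-up example format, not a real protocol.
    """
    x, y, angle = pose
    return f"PICK {x:.1f} {y:.1f} {angle:.1f}\n"

# Simulate one captured frame: a bright rectangular part on a dark field.
frame = np.zeros((480, 640), dtype=np.uint8)
frame[200:280, 260:380] = 255             # part centered near (320, 240)

pose = detect_pick_pose(frame)
print(format_pick_command(pose))          # prints: PICK 319.5 239.5 0.0
```

In a deployed cell, `frame` would come from the camera trigger and the formatted command would be written to a TCP socket held open to the robot controller; the loop then blocks until the robot signals that the place is complete.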
The vision portion of the loop, from image capture to coordinate output, typically completes in under one second for a well-configured 2D system, and under two seconds for most 3D systems. Robot motion takes up the rest of the cycle, so the full loop translates to 400 to 800 pick and place cycles per hour for a 6-axis cobot, which covers the majority of real manufacturing and packaging applications.
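A rough cycle-time budget makes the per-hour figure concrete. The numbers below are illustrative assumptions, not measurements: robot motion, not vision processing, dominates the cycle.

```python
# Rough cycle-time budget for a 2D vision-guided cobot cell.
# Both figures are illustrative assumptions, not measured values.
vision_s = 0.8     # capture + processing + coordinate transfer (2D)
motion_s = 4.5     # approach, grip, transit, place, release

cycle_s = vision_s + motion_s
cycles_per_hour = 3600 / cycle_s
print(round(cycles_per_hour))   # 679 -- inside the 400-800 range
```

Shaving vision time helps, but the larger wins usually come from shortening robot travel distance and gripper actuation time.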
2D vs. 3D Vision: Which One Do You Need?
The most common decision in a pick and place vision system is whether to use 2D or 3D. The answer depends on what your parts look like and how they are presented to the robot.
When 2D vision is sufficient - If your parts always arrive in a flat, single-layer presentation and the robot only needs X, Y, and rotation data to pick correctly, a 2D camera is sufficient and simpler to integrate. Tray loading, conveyor picking with consistent part orientation, label verification, and structured packaging operations all fall into this category. A 2D area scan camera with a global shutter (to prevent motion blur when the camera or part is moving) handles these applications reliably at lower cost and with faster processing than a 3D system.
When 3D vision is required - If parts are randomly oriented, stacked in layers, or presented in a bin where depth varies, the robot needs Z-axis data to calculate a valid grasp. 3D vision is the right choice for bin picking (random parts in a container), depalletizing (variable stack heights), and any application where the height or tilt of the part changes the pick strategy. 3D cameras use structured light, stereo vision, or time-of-flight technology to build a depth map of the scene. Processing time is longer than 2D, but the flexibility gain is significant.
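To see why depth data matters, here is a toy numpy sketch of the first step of bin picking: finding the closest surface in a depth map. Depth values are distance from the camera in millimeters, so the smallest value is the highest part in the bin. A production system would go on to validate the grasp (clearance, surface normal, gripper fit); this only finds the candidate point.

```python
import numpy as np

def top_grasp_point(depth_map):
    """Return (row, col, z_mm) of the closest surface in a depth map.

    Smaller depth = closer to the camera = higher in the bin. This is a
    toy heuristic; real bin-picking software also checks graspability,
    collisions, and full part pose.
    """
    idx = np.argmin(depth_map)
    row, col = np.unravel_index(idx, depth_map.shape)
    return row, col, float(depth_map[row, col])

# Simulated structured-light depth map (mm): flat bin floor at 900 mm,
# one layer of parts on it, and one part stacked on top of that layer.
depth = np.full((100, 100), 900.0)
depth[40:60, 40:60] = 860.0     # first layer of parts, 40 mm tall
depth[45:55, 45:55] = 820.0     # a part resting on that layer

row, col, z = top_grasp_point(depth)
print(row, col, z)              # 45 45 820.0 -> pick the top part first
```

A 2D system has no way to distinguish the 820 mm part from the 860 mm layer beneath it, which is exactly why layered or random presentations push you to 3D.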
For most first-time pick and place automation projects, 2D is the starting point. If your process has consistent part presentation, start there. 3D becomes the right answer when the process genuinely requires it, not as a default upgrade.
Camera Placement: Fixed Mount vs. Eye-in-Hand
Fixed mount (eye-to-hand) - The camera is mounted above the work area on a fixed bracket and looks down at the part field. The robot moves below the camera's field of view to pick. This is the simpler setup: the camera does not move, lighting is stable, and calibration is straightforward. Fixed mount works well for conveyor picking, tray loading, and most structured pick and place applications.
Eye-in-hand - The camera mounts directly to the robot's end effector and moves with the arm. This allows the camera to capture a close-up image of the part immediately before picking, which improves accuracy for small or precision parts. Eye-in-hand adds integration complexity because the cable runs through the arm and calibration must account for the camera's position relative to the gripper. It is the right choice when the field of view from a fixed camera is too wide for the required precision, or when the robot needs to inspect the part during the pick sequence.
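The calibration complexity of eye-in-hand comes down to chaining coordinate frames: a point the camera sees must be mapped camera → tool → robot base, using the tool's current pose. Here is a 2D homogeneous-transform sketch of that chain; the offsets and poses are invented for illustration, since in practice they come out of hand-eye calibration.

```python
import numpy as np

def to_homogeneous(rotation_deg, translation):
    """Build a 3x3 homogeneous transform from a 2D rotation and offset."""
    t = np.radians(rotation_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, translation[0]],
                     [s,  c, translation[1]],
                     [0,  0, 1.0]])

# Eye-in-hand: the camera rides on the tool flange, so a pick point seen
# by the camera is mapped camera -> tool -> base. All values below are
# invented for illustration (units: mm, degrees).
T_base_tool = to_homogeneous(90.0, (400.0, 100.0))  # current tool pose
T_tool_cam  = to_homogeneous(0.0, (0.0, 50.0))      # camera 50 mm ahead of TCP

p_cam = np.array([10.0, 20.0, 1.0])                 # part in camera frame
p_base = T_base_tool @ T_tool_cam @ p_cam
print(p_base[:2])                                   # approximately [330. 110.]
```

Note that `T_base_tool` changes every time the arm moves, which is why the eye-in-hand capture must be synchronized with the robot's reported pose, while a fixed-mount camera needs only one static calibration.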
For most general pick and place applications, fixed mount is the practical starting point.
Complete System Cost
A complete pick and place vision system includes the robot arm, the camera and lens, a vision processing computer or smart camera, lighting, the end-of-arm tooling (gripper), and integration work to connect the vision system to the robot controller.
For a 2D fixed-mount system built around a cobot arm, here is how the cost stacks up at each tier:
The UFactory Lite 6 ($3,500) is the entry point for light parts under 3 kg. A complete 2D vision-guided pick and place cell built on the Lite 6 with a camera, lighting, gripper, and integration typically runs $12,000 to $20,000 depending on application complexity.
The Fairino FR5 ($6,999) is the workhorse for general manufacturing and packaging pick and place up to 5 kg. Full system cost at this tier runs $18,000 to $35,000.
The Fairino FR10 ($10,199) handles heavier parts up to 10 kg where you need reach and payload without moving to a significantly more expensive industrial system. Full vision-guided cell cost at this tier runs $25,000 to $45,000.
For comparison, a vision-guided pick and place cell built around a FANUC or KUKA industrial robot with comparable payload typically starts at $80,000 to $150,000 before integration.
How Blue Sky Robotics Handles Vision-Guided Pick and Place
Blue Sky Robotics' automation software includes computer vision capabilities for object detection, pose estimation, and coordinate output to the robot arm, built to work across the full lineup without requiring custom code for standard pick and place applications. The Blue Argus computer vision platform combines camera hardware, vision processing, and robot integration into a system designed to deploy without a dedicated vision engineering team.
The Pick and Place use case page covers specific application examples, and the Cobot Selector matches robot arms to your payload and cycle time requirements. Use the Automation Analysis Tool to model ROI before committing, or book a live demo to see a vision-guided system running in real time. To learn more about pick and place vision systems and computer vision for robotics, visit Blue Argus.
FAQ
Does a pick and place robot always need a vision system?
No. If parts arrive in a fixed, known position every cycle, a robot can be programmed to pick from that fixed point without a camera. Vision becomes necessary when part positions vary, when you are picking from a conveyor or bin without precise fixturing, or when your product mix changes frequently and you need the system to adapt without reprogramming.
What is hand-eye calibration in a pick and place vision system?
Hand-eye calibration is the process of teaching the vision system the precise spatial relationship between the camera and the robot's tool center point. It allows the system to correctly translate pixel coordinates in the camera image into robot joint coordinates so the arm moves to the right physical location. Most modern vision platforms include automated calibration routines that reduce this process from hours to minutes.
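The core of a fixed-camera calibration routine can be sketched as a least-squares fit: jog the robot to a few points the camera can see, record the (pixel, robot) coordinate pairs, and solve for the affine map between them. This numpy sketch uses synthetic data (an assumed scale of 0.5 mm per pixel and a made-up origin offset); three non-collinear pairs are the minimum, and extra pairs average out measurement noise.

```python
import numpy as np

def fit_pixel_to_robot(pixel_pts, robot_pts):
    """Fit an affine map from camera pixels to robot XY by least squares.

    pixel_pts, robot_pts: matching (N, 2) arrays of corresponding points.
    Returns a (3, 2) matrix applied as [px, py, 1] @ M.
    """
    A = np.hstack([pixel_pts, np.ones((len(pixel_pts), 1))])
    M, *_ = np.linalg.lstsq(A, robot_pts, rcond=None)
    return M

# Synthetic calibration pairs: 1 px = 0.5 mm, origin offset (100, 200) mm.
pixels = np.array([[0, 0], [640, 0], [0, 480], [640, 480]], dtype=float)
robots = pixels * 0.5 + np.array([100.0, 200.0])

M = fit_pixel_to_robot(pixels, robots)
target = np.array([320.0, 240.0, 1.0]) @ M   # image center -> robot XY
print(target)                                # approximately [260. 320.]
```

An automated routine in a commercial platform does essentially this with a printed calibration target and the robot's reported poses, plus lens-distortion correction that a plain affine fit does not capture.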
Can a 2D vision system handle bin picking?
Standard bin picking from randomly oriented parts in a container requires 3D vision because the robot needs depth data to calculate a valid grasp point. A 2D system can handle structured bin picking where parts are always in a single layer and the robot only needs X, Y, and rotation data, but true random bin picking requires 3D.