Friday, March 27, 2026

How ElevenLabs Voice AI Is Replacing Screens in Warehouse and Manufacturing Operations


A picking operation is the process of collecting items from storage locations to fulfil customer orders.

It is one of the most labour-intensive activities in logistics, accounting for up to 55% of total warehouse operating costs.

Example of a warehouse layout where operators need to pick in multiple locations – (Image by Samir Saci)

For each order, an operator receives a list of items to collect from their storage locations.

They walk to each location, identify the product, pick the right quantity, and confirm the operation before moving to the next line.

In most warehouses, operators rely on RF scanners or handheld tablets to receive instructions and confirm each pick.

  • What happens when operators need both hands for handling?
  • How do you onboard operators who don’t read the local language?

Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

Illustration of an operator using voice picking – (Image by Samir Saci)

When I was designing supply chain solutions in logistics companies, vocalisation was the default choice, especially for price-sensitive projects.

Based on my experience, with vocalisation, operators’ productivity can reach 250 boxes/hour for retail and FMCG operations.

The concept is not new. Hardware providers and software editors have offered voice-picking solutions since the early 2000s.

But these systems come with significant constraints:

  • Proprietary hardware at $2,000 to $5,000 per headset
  • Vendor-locked software with limited customisation
  • Long deployment cycles of 3 to 6 months per site
  • Rigid language support that requires retraining for each new language

For a 50-FTE warehouse, the total investment reaches $150K to $300K, excluding training costs.
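As a rough sanity check on the hardware share of that budget, here is a minimal sketch that multiplies the headcount by the per-headset price range. It assumes one dedicated headset per FTE, which is an illustrative assumption and not a vendor figure; `hardware_cost_range` is a hypothetical helper.

```python
# Back-of-the-envelope estimate of headset spend.
# ASSUMPTION: one dedicated headset per FTE (shared headsets would lower this).
FTE_COUNT = 50
HEADSET_PRICE_RANGE_USD = (2_000, 5_000)  # per-headset range quoted in the list above


def hardware_cost_range(fte_count: int, price_range: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) headset spend for a given headcount."""
    low, high = price_range
    return fte_count * low, fte_count * high


low, high = hardware_cost_range(FTE_COUNT, HEADSET_PRICE_RANGE_USD)
print(f"Headsets alone: ${low:,} to ${high:,}")
```

Under that assumption, headsets alone come to $100,000 to $250,000, before software licences and integration work are added.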

It’s too expensive for my customers.

What if you could achieve similar results using a smartphone, a custom web application, and modern AI voice technology?

In this article, I’ll show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

Example of screens of this app, designed to be used on a smartphone with a vocal interface – (Image by Samir Saci)

This web application has been deployed in the distribution centre of a small supermarket chain with great results (the customer is happy!).

The objective is not to design solutions that compete with market leaders, but rather to offer an alternative to logistics and manufacturing operations that lack the capacity to invest in expensive equipment and want customised solutions.

Problem Statement

Before we get into voice picking powered by ElevenLabs, let me introduce the logistics operations this AI-powered web application will support.

Layout of the distribution centre – (Image by Samir Saci)

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

Layout of the warehouse with 10 aisles and 12 pallet positions displayed on the app – (Image by Samir Saci)

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:

  • Each location stores a specific item (called a SKU) with a known quantity in boxes.
  • Operators need to know where to go and what to expect when they arrive.
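To make the grid concrete, here is a minimal sketch of how a storage location could be represented. The `Location` class, the `STOCK` lookup, and the sample rows are all hypothetical; in the deployed application this information comes from the WMS database, whose schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    aisle: str      # aisle letter, e.g. "A"
    position: int   # position along the aisle, e.g. 3
    sku: str        # item stored at this location
    boxes: int      # quantity known to the system, in boxes

    @property
    def code(self) -> str:
        """Short location code shown to the operator, e.g. 'A3'."""
        return f"{self.aisle}{self.position}"


# Hypothetical sample rows standing in for a WMS query result.
STOCK = {
    loc.code: loc
    for loc in (
        Location("A", 1, "SKU-1042", 40),
        Location("A", 3, "SKU-2001", 18),
        Location("B", 2, "SKU-3310", 25),
    )
}


def expected_at(code: str) -> str:
    """Tell the operator what they should find at a location."""
    loc = STOCK[code]
    return f"{loc.code}: {loc.sku}, {loc.boxes} boxes in stock"


print(expected_at("A3"))
```

Keying the lookup by location code mirrors the two questions an operator has on arrival: where am I, and what should be here.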

What is the objective? Boost the operators’ productivity!

The customer was not happy with the order allocation and walking paths provided by their old system.

Solutions used to optimise picking operations for this warehouse – (Image by Samir Saci)

They first asked to reduce operators’ walking distance and increase the number of boxes picked per hour using the solutions presented in this article.

The solution was a web application connected to the Warehouse Management System (WMS) database that guides the operator through the warehouse.

Operators can check their picking list as well as detailed information per location – (Image by Samir Saci)

This visual layout provides a real-time view of what we have in the system, with a better routing solution.

Our target is to go from a productivity of 75 boxes/hour to 200 boxes/hour with:

  • A better allocation of orders, using spatial clustering and pathfinding to minimise the walking distance per box picked
  • Voice picking to guide operators seamlessly

How the Picking Flow Works

Before jumping into the vocalisation of the tool, let me introduce the process of order picking.

Three stores sent orders to the warehouse:

  • Store 1 ordered 3 boxes of Organic Green Tea 500g, located in Location A1
  • Store 2 ordered 2 boxes of Earl Grey Tea 250g, located in Location A3
  • Store 3 ordered 5 boxes of Arabica Coffee Beans 1kg, located in Location B2

A picking batch is a group of store orders consolidated into a single work assignment.

The operator will prepare the three orders in a single batch – (Image by Samir Saci)

The system generates a batch with multiple order lines, each with instructions:

  • Where to go (the storage location)
  • What to pick (the SKU reference)
  • How many boxes to collect

Picking list (left), layout (middle), details of location (right) – (Image by Samir Saci)
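The consolidation of the three store orders into one batch can be sketched as below. `StoreOrder`, `PickLine`, and `build_batch` are hypothetical names, and sorting by location code is only a placeholder for the real sequencing logic, which is not shown in this article.

```python
from dataclasses import dataclass


@dataclass
class StoreOrder:
    store: str
    sku: str
    location: str
    boxes: int


@dataclass
class PickLine:
    location: str   # where to go
    sku: str        # what to pick
    boxes: int      # how many boxes to collect
    store: str      # which store order the line belongs to


def build_batch(orders: list[StoreOrder]) -> list[PickLine]:
    """Consolidate store orders into one batch of pick lines.

    Lines are sorted by location code as a stand-in for route optimisation.
    """
    lines = [PickLine(o.location, o.sku, o.boxes, o.store) for o in orders]
    return sorted(lines, key=lambda line: line.location)


orders = [
    StoreOrder("Store 3", "Arabica Coffee Beans 1kg", "B2", 5),
    StoreOrder("Store 1", "Organic Green Tea 500g", "A1", 3),
    StoreOrder("Store 2", "Earl Grey Tea 250g", "A3", 2),
]

for line in build_batch(orders):
    print(f"{line.location}: pick {line.boxes} boxes of {line.sku} ({line.store})")
```

Each pick line carries the three pieces of information listed above, plus the store reference so that boxes can be sorted by order after picking.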

The operator just has to process each line sequentially.

Once they confirm a pick, the system advances to the next instruction.

This sequential flow is important because the order of the lines, computed by the optimisation algorithms, determines the walking path through the warehouse.
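To illustrate how the order of the lines shapes the walking path, here is a deliberately simple sketch: a greedy nearest-neighbour ordering over grid coordinates. This is not the optimisation used in the deployed application (which relies on spatial clustering and pathfinding); the Manhattan distance, the `A0` start point, and the function names are all assumptions for illustration, and the distance ignores racks that block direct movement between aisles.

```python
def parse_location(code: str) -> tuple[int, int]:
    """Map a code such as 'B2' to grid coordinates (aisle index, position)."""
    return ord(code[0].upper()) - ord("A"), int(code[1:])


def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Grid distance between two points (simplified walking distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def order_picks(codes: list[str], start: str = "A0") -> list[str]:
    """Greedy ordering: always walk to the closest remaining location."""
    remaining = list(codes)
    current = parse_location(start)
    route = []
    while remaining:
        nearest = min(remaining, key=lambda c: manhattan(current, parse_location(c)))
        route.append(nearest)
        remaining.remove(nearest)
        current = parse_location(nearest)
    return route


print(order_picks(["B4", "A3", "A1"]))
```

A greedy heuristic like this is easy to reason about but can produce poor routes on larger batches, which is one reason a dedicated pathfinding step is worth the effort.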

Example of the original pathfinding solution (bottom) and the optimised one (top)

As this is a custom application, we could implement this optimisation without relying on an external editor.

Why build a custom solution? Because it’s cheaper and easier to implement.

Initially, the customer planned to purchase a commercial solution and wanted me to integrate the pathfinding solution.

After investigation, we discovered that it would have been more expensive to integrate the app into the vendor solution than to build something from scratch.

What is the process without the AI-based voice feature?

Manual Mode: The Screen-Based Baseline

In manual mode, the operator reads each instruction on screen and confirms by tapping a button.

Two actions are available at each step:

  • Confirm Pick: the operator collected the right quantity
  • Report Issue: the location is empty, the quantity doesn’t match, or the product is damaged

Our operator has to press the button to confirm the pick or report an issue – (Image by Samir Saci)
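The two-button flow can be sketched as a small state machine. `PickingSession` and `Action` are hypothetical names, and the behaviour after a reported issue (log it and move on to the next line) is an assumption, since the article does not say how issues are resolved.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Action(Enum):
    CONFIRM = "confirm"   # operator collected the right quantity
    ISSUE = "issue"       # empty location, wrong quantity, or damaged product


@dataclass
class PickingSession:
    lines: list[str]                                   # instructions in walking order
    index: int = 0                                     # position of the current line
    issues: list[tuple[str, str]] = field(default_factory=list)

    @property
    def current(self) -> Optional[str]:
        """Current instruction, or None once the batch is complete."""
        return self.lines[self.index] if self.index < len(self.lines) else None

    def apply(self, action: Action, reason: str = "") -> Optional[str]:
        """Record the operator's action and advance to the next line."""
        if self.current is None:
            raise ValueError("batch already complete")
        if action is Action.ISSUE:
            # ASSUMPTION: issues are logged for later review, then we move on.
            self.issues.append((self.current, reason))
        self.index += 1
        return self.current


session = PickingSession(["A1", "A3", "B2"])
print(session.apply(Action.CONFIRM))              # next line after confirming A1
print(session.apply(Action.ISSUE, "empty slot"))  # next line after flagging A3
```

Keeping this logic independent of the input channel means the same session object can be driven by button taps or by voice commands.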

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.

But it keeps the operator’s eyes and one hand tied to the device at every step.

We need to add vocal commands!

Voice Mode: Hands-Free with ElevenLabs

Now that you know why we want the voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Technical architecture of this application – (Image by Samir Saci)

Text-to-Speech: ElevenLabs Reads the Instructions

When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.

Instead of reading “Location A-03-2, pick 4 boxes of SKU-1042” on a screen, the operator hears a natural voice say:

“Location Alpha Three Two. Pick 4 boxes.”
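Before any audio is generated, the location code has to be turned into a sentence that reads well aloud. A minimal sketch of that conversion follows; the spelling table, the hyphen-separated code format, and the function names are assumptions based on the single example above, not the application's actual code.

```python
# Spelling alphabet for aisles A to L (spelled as in the example above).
AISLE_WORDS = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta",
    "E": "Echo", "F": "Foxtrot", "G": "Golf", "H": "Hotel",
    "I": "India", "J": "Juliet", "K": "Kilo", "L": "Lima",
}
DIGIT_WORDS = ["Zero", "One", "Two", "Three", "Four",
               "Five", "Six", "Seven", "Eight", "Nine"]


def spoken_location(code: str) -> str:
    """Convert 'A-03-2' to 'Alpha Three Two' (leading zeros are dropped)."""
    aisle, *parts = code.split("-")
    words = [AISLE_WORDS[aisle.upper()]]
    for part in parts:
        # Read each number digit by digit so "10" becomes "One Zero".
        words.extend(DIGIT_WORDS[int(digit)] for digit in str(int(part)))
    return " ".join(words)


def spoken_instruction(code: str, boxes: int) -> str:
    """Build the sentence sent to text-to-speech."""
    unit = "box" if boxes == 1 else "boxes"
    return f"Location {spoken_location(code)}. Pick {boxes} {unit}."


print(spoken_instruction("A-03-2", 4))
```

Spelling out the aisle and reading digits one by one reduces the chance of confusing similar-sounding codes in a noisy environment.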

ElevenLabs provides several advantages over basic browser-based TTS:

  • Natural intonation that’s easy to understand in a noisy warehouse
  • 29+ languages available out of the box, with no retraining
  • Consistent voice quality across all instructions
  • Sub-second generation for short sentences like pick instructions
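For readers who want to try this, here is a minimal sketch of a text-to-speech call using only the Python standard library. The endpoint path, the `xi-api-key` header, and the `eleven_multilingual_v2` model id reflect ElevenLabs' public REST API as I understand it, so check them against the current documentation; the voice id, the function names, and the environment variable names are placeholders, and this is not the application's actual code.

```python
import json
import urllib.request

API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Prepare the HTTP request without sending it (easy to inspect and test)."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, voice_id: str, api_key: str) -> bytes:
    """Send the request and return the raw audio bytes (performs a network call)."""
    request = build_tts_request(text, voice_id, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


# Example usage (requires a valid key and voice id, hypothetical variable names):
#   import os
#   audio = synthesize("Location Alpha Three Two. Pick 4 boxes.",
#                      os.environ["ELEVENLABS_VOICE_ID"],
#                      os.environ["ELEVENLABS_API_KEY"])
#   with open("instruction.mp3", "wb") as f:
#       f.write(audio)
```

Separating request construction from the network call keeps the part that can fail (the API) isolated, which makes it simpler to fall back to manual mode when a call times out.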

But what about speech recognition?

Speech-to-Text: The Operator Confirms Verbally

After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.

Here, I made a deliberate design choice to rely on the speech recognition and reasoning capabilities of ElevenLabs.

Using a single endpoint, we capture the response and match it against expected commands:

  • “Confirm” or “Done” to validate the pick
  • “Problem” or “Issue” to flag a discrepancy
  • “Repeat” to hear the instruction again

The agentic part interprets the operator’s feedback and tries to match it to the expected interactions (CONFIRM, ISSUE, or REPEAT).
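The matching step can be illustrated with a deterministic, English-only sketch that maps a transcript to one of the three interactions. This is a simplification: the application described here relies on ElevenLabs to interpret the operator's answer, including in other languages, whereas the keyword table and `match_command` below are hypothetical and could at most serve as a local safety net.

```python
import re
from typing import Optional

# Checked in this order so that a flagged discrepancy is never
# silently treated as a confirmation.
COMMAND_KEYWORDS = {
    "ISSUE": ("problem", "issue"),
    "REPEAT": ("repeat",),
    "CONFIRM": ("confirm", "done"),
}


def match_command(transcript: str) -> Optional[str]:
    """Return CONFIRM, ISSUE, or REPEAT, or None if nothing matches."""
    words = re.findall(r"[a-z]+", transcript.lower())
    for command, keywords in COMMAND_KEYWORDS.items():
        # Prefix match so "confirmed" and "problems" are also recognised.
        if any(word.startswith(keyword) for word in words for keyword in keywords):
            return command
    return None


for phrase in ("Confirm!", "There is a problem here", "Repeat please", "Hello"):
    print(phrase, "->", match_command(phrase))
```

Returning `None` for unrecognised input lets the caller decide what to do next, for example replaying the instruction or switching to manual mode.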

The complete process from left to right: Step 1 -> Step 2 -> Step 3 – (Image by Samir Saci)

For a multilingual warehouse, this is a significant benefit:

  • A Czech operator and a Filipino operator can both receive instructions in their native language from the same system, without any hardware change.
  • I don’t have to account for every possible language in the design of the solution.

Why use ElevenLabs?

For another feature, the inventory cycle count tool presented in this video, I used n8n with AI agent nodes to perform the same task.

n8n workflow for the voice-powered inventory cycle count tool – (Image by Samir Saci)

This worked quite well, but it required a more complex setup:

  • Two AI nodes: one for the audio transcription using OpenAI models, and one AI agent to format the output of the transcription
  • The system prompts assumed that the operator was speaking English.

I have replaced that with a single ElevenLabs endpoint with multilingual capabilities.

Putting both components together, a single pick cycle looks like this:

The Complete Voice Picking Cycle – (Image by Samir Saci)

  1. The app calls ElevenLabs to generate the audio instruction
  2. The operator hears: “Location Alpha Three Two. Pick 4 boxes.”
  3. The operator walks to the location (hands free, eyes free)
  4. The operator picks the items and says, “Confirm”
  5. The speech recognition endpoint processes the confirmation and moves to the next picking location
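The five steps above can be sketched as a loop in which the speech services are injected as plain functions. `run_cycle`, `speak`, and `listen` are hypothetical names: `speak` stands for the text-to-speech call and audio playback, and `listen` for the recognition step that returns CONFIRM, ISSUE, or REPEAT. The retry limit is an assumption, not a documented setting.

```python
from typing import Callable, Iterable


def run_cycle(instructions: Iterable[str],
              speak: Callable[[str], None],
              listen: Callable[[], str],
              max_attempts: int = 3) -> list[tuple[str, str]]:
    """Play each instruction, wait for a command, and record the outcome."""
    results = []
    for instruction in instructions:
        outcome = "UNRESOLVED"  # signals that manual mode should take over
        for _ in range(max_attempts):
            speak(instruction)          # steps 1 and 2: generate and play audio
            command = listen()          # steps 4 and 5: capture the answer
            if command in ("CONFIRM", "ISSUE"):
                outcome = command
                break
            # "REPEAT" or anything unrecognised: replay the instruction
        results.append((instruction, outcome))
    return results


# Dry run with fake services instead of real API calls.
played: list[str] = []
answers = iter(["REPEAT", "CONFIRM", "ISSUE"])
log = run_cycle(
    ["Location Alpha One. Pick 3 boxes.", "Location Alpha Three. Pick 2 boxes."],
    speak=played.append,
    listen=lambda: next(answers),
)
print(log)
```

Injecting `speak` and `listen` keeps the loop testable offline and makes the manual fallback a matter of swapping in different functions.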

The entire interaction takes a few seconds of system time.

What about the costs?

This is where the comparison with traditional systems becomes striking.

Comparative study – (Image by Samir Saci)

For this mid-size warehouse with 50 FTEs, they estimated that the traditional approach costs roughly $60K to $150K in the first year.

The AI-powered approach costs a few API calls.

The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.

In case of failures, we have the manual mode as a fallback.

This AI-powered approach offers accessibility and speed for organisations that cannot justify a six-figure investment.

What Does That Mean for Operations Managers and Decision Makers?

Voice picking is no longer a technology reserved for the largest 3PLs and retailers with large budgets.

If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days.

It’s easy to test it on a real batch to measure the impact before committing any significant budget for productisation.

Three scenarios where this approach makes particular sense:

  • Multilingual facilities where operators struggle with screen-based instructions in a language that is not their own
  • Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
  • High-turnover environments where training time on complex scanning systems directly impacts productivity

What about other processes?

Good news: the same architecture extends beyond picking.

Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.

You can find a live demo of an inventory cycle counting tool here:

How to start this journey?

As you could easily guess, the front end of these applications has been vibecoded using Lovable and Claude Code.

For the backend, if you have limited coding capabilities, I would suggest starting with n8n.

Example of n8n workflows – (Image by Samir Saci)

n8n is a low-code automation platform that lets you connect APIs and AI models using visual workflows.

The initial version of this solution was built with this tool:

  1. I started with a backend connected to a Telegram Bot
  2. Users were playing with the tool using this interface
  3. After validation, we moved that to a web application

This is the easiest way to start, even with limited coding skills.

I share a step-by-step tutorial with free templates to start automating from day 1 in this video:

Let me know what you plan to build using all these great tools!

About Me

Let’s connect on LinkedIn and Twitter. I’m a Supply Chain Engineer who uses data analytics to improve logistics operations and reduce costs.

If you’re looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.


