“Run it until it breaks” can feel like a practical, budget-friendly maintenance philosophy, especially when production is already stretched thin and capital approvals are slow. But in modern manufacturing, the true cost isn’t the part that fails. It’s the exposure you create: unplanned downtime, emergency labor, quality risk, safety risk, and long lead-time vulnerability.
This article breaks down where those costs come from, why they’re getting worse in 2026 (tighter supply chains, longer lead times, more interconnected systems), and what a practical, risk-based alternative looks like.
1) Unplanned Downtime Is No Longer a Contained Event
Why this matters: Most plants still estimate downtime as “one machine is down.” In reality, modern lines are tightly coupled—controls, HMIs, drives, and networks link multiple stations. A failed component can stop a cell, blocking upstream flow, starving downstream processes, and triggering a domino effect across the shift.
In many industries, even “small” downtime adds up fast. Depending on the operation, lost production can easily range from thousands to tens of thousands of dollars per hour, once you account for missed throughput, labor inefficiency, and downstream disruption.
- Cascading stoppage: A single fault can halt a cell, blocking upstream accumulation and downstream packaging/inspection.
- Slow recovery: Troubleshooting time (diagnosis + validation) often exceeds the actual part swap.
- Restart losses: Warm-up, re-homing, purge cycles, and first-article verification eat additional time.
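To make the arithmetic concrete, here is a minimal sketch of how a single stop might be costed once recovery phases and the cascade are included. The class name, field names, and every number in it are illustrative assumptions, not measurements from a real plant.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    """One unplanned stop, split into phases (all durations in hours).

    Hypothetical model: field names and figures are assumptions for illustration.
    """
    diagnosis_h: float     # troubleshooting until the fault is located
    part_swap_h: float     # physically replacing the component
    validation_h: float    # checks before the machine is handed back
    restart_h: float       # warm-up, re-homing, purge, first-article verification
    stations_stopped: int  # stations halted by the cascade, including the faulted one

    def total_hours(self) -> float:
        """Total time the affected stations are not producing."""
        return self.diagnosis_h + self.part_swap_h + self.validation_h + self.restart_h

    def lost_production_cost(self, cost_per_station_hour: float) -> float:
        """Lost production, assuming every stopped station loses the same hourly value."""
        return self.total_hours() * self.stations_stopped * cost_per_station_hour


# Assumed example: a 30-minute part swap inside a 4-hour event that stops 4 stations.
event = DowntimeEvent(diagnosis_h=2.0, part_swap_h=0.5, validation_h=1.0,
                      restart_h=0.5, stations_stopped=4)
print(event.total_hours())                 # 4.0
print(event.lost_production_cost(1500.0))  # 24000.0
```

In this assumed case the part swap is only one eighth of the total event time, which is why estimating downtime from the swap alone tends to understate the loss.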
Single-point failure examples (real parts):
These are components where a single failure can halt the entire process, even if all other mechanical and electrical systems are healthy.
- PLC CPU: Siemens 6ES7315-2EH14-0AB0
This is the central processor that runs all machine logic, sequences, and interlocks. It coordinates I/O, communication with HMIs and drives, and the machine's overall behavior.
If it fails, the machine loses its ability to think, coordinate, and control, resulting in a complete system stop regardless of the condition of its motors, sensors, or mechanical components.
- Digital output module: Siemens 6ES7322-1BH01-0AA0
This module sends on/off control signals from the PLC to physical devices, including solenoid valves, contactors, relays, and motor starters. It’s how the control system tells the real world to move.
If it fails, the PLC program may keep executing normally, but nothing in the physical process can actuate. Production effectively stops, and the fault can be confusing to diagnose because the software appears healthy while the machine sits motionless.
2) Emergency Response Is Always the Most Expensive Response
Why this matters: The biggest financial hit from “run-to-failure” often isn’t the replacement part. It’s the forced urgency. Breakdowns turn normal work into a high-pressure event—overtime, expedited logistics, rushed decisions, and a higher likelihood of mistakes.
- Overtime labor and call-ins
- Expedited freight and premium sourcing
- Extended diagnostic and validation time
- Scrap, rework, and secondary damage
3) Degradation Happens Before Failure (And That’s Where Risk Spikes)
Why this matters: Many components don’t fail cleanly. They degrade first—intermittent faults, random resets, communication drops, inconsistent I/O behavior.
- Thermal cycling and solder fatigue
- Capacitor aging in drives and power electronics
- Fan wear, dust ingress, and contamination
- Connector fatigue and vibration damage
- Power quality disturbances
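One way to act on these early symptoms is to count intermittent faults per device over a rolling window and flag the device once the count crosses a threshold. The sketch below assumes fault timestamps are already being logged in chronological order; the class name, the 7-day window, and the threshold of 3 are hypothetical choices, not recommended values.

```python
from collections import deque
from datetime import datetime, timedelta


class FaultRateMonitor:
    """Flags a device when too many faults land inside a rolling time window.

    Hypothetical helper: assumes record() is called in chronological order.
    """

    def __init__(self, window: timedelta, threshold: int):
        self.window = window
        self.threshold = threshold
        self._events = deque()  # timestamps of faults still inside the window

    def record(self, timestamp: datetime) -> bool:
        """Log one fault; return True if the device should be flagged for review."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window
        # Drop faults that have aged out of the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Assumed fault log: two isolated resets, then three resets on consecutive days.
monitor = FaultRateMonitor(window=timedelta(days=7), threshold=3)
start = datetime(2026, 1, 5, 8, 0)
flags = [monitor.record(start + timedelta(days=d)) for d in (0, 1, 10, 11, 12)]
print(flags)  # [False, False, False, False, True]
```

The point of the window is that a single reset is noise, while a cluster of resets is a trend worth scheduling a replacement around.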
Operator-interface examples:
4) The Waiting Game: Lead Times Turn Simple Failures Into Multi-Week Shutdowns
Why this matters: Availability risk is now uptime risk. A part that’s unavailable can extend downtime from hours to weeks.
- Vendor obsolescence
- Long and unstable lead times
- Compatibility and substitution risk
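The effect of availability on outage length can be sketched with a simple comparison. This is an illustration under stated assumptions: the function name and numbers are hypothetical, and it assumes the line stays down for the full lead time when no spare is stocked (no workaround, repair, or substitute part).

```python
def expected_outage_hours(mttr_hours: float, lead_time_days: float,
                          spare_on_shelf: bool) -> float:
    """Repair time, plus the wait for the part when no spare is on the shelf.

    Simplifying assumption: without a spare, nothing can be done until the part arrives.
    """
    wait_hours = 0.0 if spare_on_shelf else lead_time_days * 24.0
    return wait_hours + mttr_hours


# Assumed figures: a 4-hour repair and a 21-day lead time.
with_spare = expected_outage_hours(4.0, 21, spare_on_shelf=True)
without_spare = expected_outage_hours(4.0, 21, spare_on_shelf=False)
print(with_spare)     # 4.0
print(without_spare)  # 508.0
```

Under these assumptions the same failure is a half-shift event with a spare and a three-week outage without one, even though the repair work itself is identical.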
5) The Hidden Opportunity Cost: Firefighting Kills Improvement
Why this matters: Constant recovery mode prevents optimization, training, and improvement.
- Deferred preventive maintenance
- Delayed controls optimization
- Lower morale and burnout
6) Most Plants Don’t Have a Spares Strategy—They Have a Spares History
Why this matters: Random spares don’t reduce risk. Strategic spares do.
- Tier 1: CPUs, HMIs, critical I/O, drives
- Tier 2: Comms cards, expansion modules
- Tier 3: Low-risk accessories
Examples:
7) A Better Alternative: “Run It Until It’s Risky”
Why this matters: The goal is not early replacement. It’s controlled risk, judged per component on the factors below.
- Downtime impact
- Actual MTTR
- Availability risk
- Substitution complexity
- Failure signals
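These five factors can be turned into a rough ranking. The sketch below is one possible scoring scheme, not an established method: the 1-to-5 rating scale, the equal weighting, the class name, and all example ratings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComponentRisk:
    """Ratings from 1 (low) to 5 (high). Scale and equal weights are assumptions."""
    name: str
    downtime_impact: int          # how much of the process stops if it fails
    mttr: int                     # how long a real repair takes, based on past events
    availability_risk: int        # lead time and obsolescence exposure
    substitution_complexity: int  # effort to fit a non-identical replacement
    failure_signals: int          # observed degradation symptoms

    def score(self) -> int:
        """Unweighted sum of the five ratings; higher means riskier."""
        return (self.downtime_impact + self.mttr + self.availability_risk
                + self.substitution_complexity + self.failure_signals)


def rank_by_risk(components: list[ComponentRisk]) -> list[ComponentRisk]:
    """Return components ordered from highest to lowest risk score."""
    return sorted(components, key=lambda c: c.score(), reverse=True)


# Hypothetical ratings for three parts on one line.
parts = [
    ComponentRisk("Panel indicator lamp", 1, 1, 1, 1, 1),
    ComponentRisk("PLC CPU", 5, 4, 4, 5, 2),
    ComponentRisk("Digital output module", 4, 3, 3, 2, 3),
]
for part in rank_by_risk(parts):
    print(part.name, part.score())
```

A plant would likely want to weight the factors differently (for example, giving downtime impact more influence than substitution complexity), but even an unweighted list makes it clear which parts deserve a spare or a planned replacement first.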
Drive example:
Conclusion: The Real Cost Isn’t the Part—It’s the Exposure
Run-to-failure doesn’t reduce costs — it concentrates them into the most painful form: unplanned downtime, urgent sourcing, overtime labor, quality risk, and supply chain uncertainty.
- Stock true line-stoppers
- Watch degradation signals
- Plan around lifecycle risk
- Protect engineering time