January 2026
Estimated reading time: 27 min
A service provider core does something that looks contradictory: it
runs one shared packet backbone while delivering different guarantees to
many customers at once. The same routers and fiber carry internet
best-effort, enterprise VPNs, wholesale handoff, mobile backhaul, and
latency-sensitive voice and video. Customers still expect clear
separation, predictable reachability, controlled failure behavior, and
performance that matches the SLA. This post shows how an MPLS backbone
delivers those guarantees in a way you can design, operate, and
troubleshoot.
“Deterministic” means different things to different teams, so this
article stays concrete. It treats guarantees as engineering properties
that you define, measure, and preserve under failure. It uses MPLS
transport, VPN service models, traffic engineering, and end-to-end QoS
to create differentiated behaviors on a shared core, while keeping the
control plane and operations model stable as the network scales.
The design goal is not to pretend the network behaves like a circuit in every situation. The goal is to make behavior predictable under defined conditions,
to make the conditions explicit (traffic profiles, failure scenarios,
maintenance patterns), and to build a system where premium traffic stays
protected when the inevitable happens.
1) Define the guarantees before you design the backbone
Backbone design drifts when teams treat every requirement as “QoS” or
“TE.” In practice, you deliver four distinct categories of guarantees.
You name them explicitly, because each guarantee maps to different
mechanisms and different failure modes.
- Separation: Customer A cannot see or reach
Customer B unless policy allows it. This includes routing separation
(VRFs/RTs), forwarding separation (labels and lookup), and operational
separation (visibility, lawful intercept boundaries, change control).
- Path control: A class of traffic follows a
preferred path, avoids a constrained region, or stays within a latency
envelope. This includes explicit LSPs or SR policies, affinity
constraints, and restoration behavior.
- Performance: Loss, latency, and jitter stay
within target ranges under defined load and defined failure scenarios.
This relies on QoS, capacity engineering, and admission control—not just
queue configuration.
- Operational predictability: Convergence and
restoration behave consistently. The network avoids long brownouts,
micro-loops, and unstable oscillations. Runbooks and telemetry let you
prove why a guarantee fails.
This framing changes the engineering conversation. Instead of asking
for “TE everywhere,” you identify which services truly require path
constraints, which require strict separation, and which tolerate best
effort. You then choose the smallest set of mechanisms that make the
guarantees enforceable.
2) The canonical split: IGP describes physics, BGP describes services
A scalable SP backbone separates transport concerns from service
concerns. The IGP (IS-IS or OSPF) describes topology and computes
shortest paths. A label plane (LDP or Segment Routing) builds transport
LSPs over that topology. BGP—specifically MP-BGP—distributes service
reachability such as VPNv4/VPNv6 routes for L3VPN, EVPN routes for
modern L2VPN, and sometimes labeled-unicast for transport or
inter-domain patterns. Traffic engineering selects which transport path a
given service uses when shortest-path forwarding is not sufficient.
This split matters because it keeps the IGP small, fast, and stable.
If you push service intent into the IGP, you inflate state, increase
churn, and make failures harder to reason about. If you keep services in
BGP and treat the IGP as the topology truth, you gain clean failure
domains and predictable troubleshooting: validate IGP and transport
first, then validate the service layer.
- IGP: topology, metrics, adjacency health, convergence timers, ECMP behavior.
- LDP or SR: transport label programming, loopback reachability, label binding consistency.
- BGP services: VRF routes and policies, route targets, EVPN MAC/IP advertisements, inter-AS option choices.
- TE: explicit path selection, constraint satisfaction, restoration policy, admission control where used.
- QoS: classification, policing, queueing/scheduling, shaping, and end-to-end measurement.
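The layer split above also implies a troubleshooting order. As a minimal sketch (the layer names and health flags are hypothetical, not tied to any vendor tooling), a bottom-up check reports the lowest failing layer, which mirrors "validate IGP and transport first, then the service layer":

```python
# Minimal sketch (hypothetical layer names and health flags): validate
# layers in dependency order and report the lowest failing one.

LAYERS = ["igp", "transport", "bgp_services", "te", "qos"]

def first_failing_layer(health):
    """health maps layer name -> bool (True = checks passed).
    Returns the lowest failing layer, or None if everything passes."""
    for layer in LAYERS:
        if not health.get(layer, False):
            return layer
    return None

# If transport labels are broken, service-layer symptoms are secondary.
status = {"igp": True, "transport": False, "bgp_services": False,
          "te": True, "qos": True}
print(first_failing_layer(status))  # -> transport
```

The point is not the code itself but the ordering: a service-layer alarm on top of a broken transport layer is a symptom, not a root cause.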
3) Transport choices: LDP, RSVP-TE, SR-TE—and what they actually guarantee
Many design debates treat LDP, RSVP-TE, and Segment Routing as
competing ideologies. In reality they solve different parts of the
problem, and you can use them together if you define clear roles. The
key is to understand what guarantee each technology can enforce, and
what it cannot enforce without additional design work.
LDP creates label-switched paths that follow IGP
shortest paths. It works well when you want simple transport and you
accept that traffic follows metrics and ECMP. LDP provides predictable
forwarding in the sense that it mirrors the IGP, but it does not provide
explicit path control. If you promise “this traffic always takes the
low-latency path,” LDP alone cannot enforce that promise.
RSVP-TE creates explicit TE LSPs, optionally with
bandwidth reservations and constraints. RSVP-TE matches well with
premium services that require deterministic restoration behavior and
bandwidth admission control. It also supports mature fast reroute
models. The trade-off is operational complexity: more LSP state, more
signaling, and more coordination during maintenance.
SR-TE moves path intent to the headend. In SR-MPLS,
the headend encodes a path as a stack of segment identifiers (SIDs), and
the core forwards based on local SID programming tied to the IGP. SR
reduces per-LSP state in the core compared to RSVP-TE and aligns well
with controller-driven policy. SR-TE does not automatically create
determinism; it provides a programmable mechanism to steer traffic and
recover quickly when combined with IGP fast convergence and TI-LFA.
A practical backbone often uses IGP + LDP for baseline transport,
SR-TE policies for premium classes or specific flows, and RSVP-TE in
legacy islands or where reservation semantics remain required. The
design succeeds when each service class maps deliberately to one
transport mechanism and that mapping is operationally visible.
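That mapping can be made explicit as data rather than implied by scattered device configuration. A hypothetical sketch, with invented class and policy names, that keeps the class-to-transport decision auditable in one place:

```python
# Illustrative only: an explicit mapping from service class to transport
# mechanism. All class and policy names are invented for this sketch.

CLASS_TO_TRANSPORT = {
    "real_time":     "sr-te-low-latency",   # premium: constrained SR policy
    "critical_data": "sr-te-avoid-risk",    # premium: constrained SR policy
    "business_data": "ldp-shortest-path",   # baseline transport
    "best_effort":   "ldp-shortest-path",
    "legacy_tdm":    "rsvp-te-reserved",    # legacy island with reservations
}

def transport_for(service_class):
    # Fail loudly on unmapped classes instead of silently defaulting.
    if service_class not in CLASS_TO_TRANSPORT:
        raise KeyError("no transport mapping for %r" % service_class)
    return CLASS_TO_TRANSPORT[service_class]

print(transport_for("real_time"))  # -> sr-te-low-latency
```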
4) VPN separation at scale: L3VPN, L2VPN, EVPN, and CSC
Separation is the first guarantee customers notice. In MPLS,
separation comes from forwarding context and policy discipline. You
implement separation differently for L3VPN and L2VPN. A modern EVPN
control plane reduces flooding and makes L2 services more predictable.
CSC raises the bar further by making your customer a provider with their
own VPN architecture.
L3VPN uses VRFs and MP-BGP VPN address families. The
VRF provides forwarding separation on the PE, and route targets (RTs)
control which VPN routes import and export between VRFs. The most common
separation failure in L3VPN is an RT policy mistake. Treat RT design
like security policy: use conventions, avoid ad-hoc RT reuse, and
implement leak patterns (shared services, extranet) as reviewed designs
rather than emergency fixes.
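A sketch of what "treat RT design like security policy" can look like in automation. The RT format and the reserved extranet range below are assumptions for illustration, not a standard:

```python
import re

# Hypothetical convention: RTs are "<asn>:<vpn-id>", and vpn-ids 9000-9099
# are reserved for shared-services/extranet imports that require approval.
# Both the format and the range are assumptions for this sketch.

RT_PATTERN = re.compile(r"^(\d+):(\d+)$")
EXTRANET_RANGE = range(9000, 9100)

def audit_imports(vrf_imports, approved_extranets):
    """Flag malformed RTs and extranet-range imports with no approval record."""
    findings = []
    for vrf, rts in vrf_imports.items():
        for rt in rts:
            m = RT_PATTERN.match(rt)
            if not m:
                findings.append((vrf, rt, "malformed RT"))
            elif int(m.group(2)) in EXTRANET_RANGE and rt not in approved_extranets:
                findings.append((vrf, rt, "unapproved extranet import"))
    return findings

imports = {"CUST-A": ["65000:1001", "65000:9001"],
           "CUST-B": ["65000:1002", "bad-rt"]}
print(audit_imports(imports, approved_extranets={"65000:9001"}))
```

Running such a check on every configuration change is how "no accidental import" becomes a verified property instead of a hope.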
L2VPN separation depends on service instances:
pseudowires (VPWS), VPLS, or EVPN-based services. L2VPN can amplify
unknown-unicast and broadcast behavior. EVPN improves determinism by
advertising MAC and IP information in the control plane and reducing
flooding, and it provides clean multihoming semantics that reduce
split-brain behavior during failures.
CSC exists when your customer is also a provider who
wants to run their own VPNs over your core. CSC forces a
separation-of-separation: your backbone transports the customer’s VPN
services without merging their control plane into yours. CSC pushes you
to formalize inter-AS options, label distribution boundaries, and QoS
trust boundaries because wholesale customers care about both
reachability and performance variance.
5) Inter-AS L3VPN: Option A/B/C and what they do to your guarantees
Once a VPN crosses an autonomous system boundary, your guarantees
depend on how you exchange VPN routes and labels. Options A, B, and C
each trade operational clarity against scalability in different ways. The
right choice depends on ownership at the seam and on how you validate
correctness end-to-end.
- Option A (VRF-to-VRF at ASBR): ASBRs behave like
PEs on each side, creating per-VRF interfaces between providers. It
isolates administrative domains strongly, but scales poorly if you have
many VPNs because the ASBR carries per-VRF configuration and state.
- Option B (MP-eBGP between ASBRs): ASBRs exchange
VPNv4/VPNv6 routes directly. This scales better than Option A and keeps
the seam explicit. It introduces more shared VPN route state at the
boundary.
- Option C (MP-eBGP between PEs, ASBRs as labeled transit):
PEs exchange VPN routes across the AS boundary (often multihop), while
the ASBRs provide label-switched transit. This scales well but raises
the importance of transport monitoring because the seam becomes less
visible in configuration terms.
Option choice also affects how you deliver TE and QoS across the
seam. Option A makes class and policy enforcement explicit per VRF at
the boundary, which helps auditing but costs scale. Option C can
preserve scale but requires a stronger transport and monitoring
discipline because the customer perceives the service as end-to-end even
when the seam is operationally distant.
6) Deterministic QoS end-to-end: the design that actually works
Deterministic QoS fails when it becomes a set of queue commands
without an end-to-end model. You achieve deterministic behavior when you
combine classification, policing, scheduling, shaping, capacity
headroom, and failure-mode planning. The backbone must enforce a
contract at the edge, protect itself from untrusted markings, and
maintain consistent per-hop behavior across every node that premium
flows traverse.
6.1 Classification, policing, and contract enforcement
Start with a clear trust boundary. If customers mark traffic, the
provider still decides what those markings mean in the backbone. The PE
enforces the contract by classifying traffic at ingress, remarking into
provider classes, and policing per class. Policing is not punitive; it
prevents one customer from violating the assumptions that keep other
customers within SLA. If you want burst allowances, you define them
explicitly and monitor them.
A backbone often defines a small set of provider classes. For
example: Network Control, Real-Time (voice), Interactive (video),
Critical Data, Business Data, Best Effort, and Scavenger. You map
customer DSCP values into these classes, then police to contracted
rates. The contract lives at the edge, not in the core.
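A minimal sketch of that edge contract, with hypothetical DSCP mappings and contracted rates. The point is that unknown markings default to best effort and policing is defined per provider class:

```python
# Illustrative edge contract: map customer DSCP to provider classes and
# police per class. DSCP choices and rates are assumptions, not guidance.

DSCP_TO_CLASS = {
    46: "real_time",      # EF
    34: "interactive",    # AF41
    26: "critical_data",  # AF31
    18: "business_data",  # AF21
    8:  "scavenger",      # CS1
    0:  "best_effort",    # default marking
}

def classify(dscp):
    # Unknown markings fall into best effort: the provider policy, not
    # the customer marking, decides core treatment.
    return DSCP_TO_CLASS.get(dscp, "best_effort")

# Contracted police rate per class, in Mbps (hypothetical figures).
POLICE_MBPS = {"real_time": 50, "interactive": 100, "critical_data": 200,
               "business_data": 300, "best_effort": 1000, "scavenger": 50}

print(classify(46), classify(99))  # -> real_time best_effort
```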
6.2 MPLS QoS models: uniform vs pipe and TC/EXP mapping
MPLS introduces traffic class bits (TC, historically called EXP) in
the label. You decide how IP DSCP maps into MPLS TC and how the core
treats the packet. Two models describe the design intent:
- Uniform model: DSCP copies into MPLS TC (and
often back again). This is simple but risks letting customer markings
influence core behavior unless policing is strict.
- Pipe model: The provider sets MPLS TC at ingress
based on provider policy. The provider class, not the customer marking,
drives core treatment. The VPN payload can still preserve customer DSCP
for customer-internal semantics.
A backbone that promises multiple guarantees typically uses a
pipe-like model. It keeps per-hop behavior consistent and reduces the
chance that mis-marked customer traffic steals priority. It also makes
troubleshooting cleaner: you can reason about provider classes without
decoding each customer’s DSCP story.
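A sketch of pipe-model imposition at the PE. The TC values are illustrative; the property worth noticing is that the provider class, not the customer's inner DSCP, selects the MPLS TC, while the inner DSCP rides along untouched:

```python
# Pipe-model sketch: MPLS TC is set from the provider class at ingress;
# the customer's inner DSCP is preserved in the VPN payload but never
# drives core scheduling. TC values here are illustrative.

CLASS_TO_TC = {"network_control": 6, "real_time": 5, "interactive": 4,
               "critical_data": 3, "business_data": 2,
               "scavenger": 1, "best_effort": 0}

def impose_label(provider_class, inner_dscp):
    """Return (mpls_tc, preserved_inner_dscp) for a packet at the PE."""
    tc = CLASS_TO_TC[provider_class]
    return tc, inner_dscp  # inner DSCP preserved, not trusted

# Customer marks EF (46) but bought Business Data: the core sees TC 2.
print(impose_label("business_data", 46))  # -> (2, 46)
```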
6.3 The practical latency budget: where delay actually comes from
If you promise low latency and jitter, you need a budget model that
includes real contributors: serialization delay, propagation delay,
queuing delay, and processing delay. Propagation and serialization are
mostly physics; queuing is the part you control. In best-effort
networks, queuing dominates variance. Deterministic QoS reduces queuing
variance for premium classes by ensuring they experience either minimal
queuing or bounded queuing.
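The budget can be written down as arithmetic. A back-of-envelope sketch assuming roughly 5 µs/km of fiber propagation delay and invented per-hop queuing and processing allowances:

```python
# Back-of-envelope latency budget. Propagation uses ~5 us per km in
# fiber; serialization is bits divided by link rate. Per-hop queuing and
# processing allowances are illustrative placeholders.

def serialization_us(frame_bytes, link_gbps):
    return frame_bytes * 8 / (link_gbps * 1000)  # microseconds

def propagation_us(fiber_km):
    return fiber_km * 5.0  # ~5 us/km in fiber

def budget_us(fiber_km, hops, frame_bytes=1500, link_gbps=10,
              per_hop_queue_us=20.0, per_hop_proc_us=10.0):
    """One-way delay estimate; the queuing term is the engineered one."""
    return (propagation_us(fiber_km)
            + hops * serialization_us(frame_bytes, link_gbps)
            + hops * (per_hop_queue_us + per_hop_proc_us))

# 800 km path, 6 hops at 10G: propagation dominates the total, but the
# queuing allowance is the term whose variance you actually control.
print(round(budget_us(800, 6), 1))  # -> 4187.2 (microseconds)
```

Notice the proportions: at 10G, serializing a 1500-byte frame costs 1.2 µs per hop, while 800 km of fiber costs 4 ms regardless of anything you configure. Determinism work targets the queuing term.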
This is where shaping and policing matter. Bursts cause queue spikes,
even when the average rate looks safe. If you shape at the edge, you
convert bursts into smoother flows, which reduces core queue
oscillation. If you police per class, you prevent a burst in one class
from displacing another. If you use a priority queue for Real-Time, you
still protect it with a policer or a strict cap to prevent it from
starving other classes during abnormal events.
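Per-class policing is usually some flavor of token bucket. A simplified single-rate sketch with invented parameters; real platforms add two-rate and color-aware variants on top of this idea:

```python
# Single-rate token-bucket policer sketch: conforming packets pass,
# excess is dropped (or could be remarked). Parameters are illustrative.

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0       # refill rate in bytes/second
        self.burst = float(burst_bytes)  # bucket depth caps the burst
        self.tokens = float(burst_bytes)
        self.last = 0.0                  # time of last check, seconds

    def conforms(self, now, packet_bytes):
        # Refill tokens for elapsed time, capped at the burst depth.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # within contract
        return False      # out of contract: drop or remark

tb = TokenBucket(rate_bps=8_000_000, burst_bytes=10_000)  # 8 Mbps, 10 kB
print(tb.conforms(0.0, 9_000), tb.conforms(0.0, 9_000))   # -> True False
```

The burst depth is the explicit burst allowance the text describes: it is a design parameter you choose and monitor, not an accident of queue depth.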
6.4 Scheduling and failure-mode capacity
Queue scheduling implements your fairness model. A typical SP
approach uses strict priority for network control and small Real-Time
volumes, then weighted scheduling for the remaining classes. The design
stays honest about failure modes: when a link fails, traffic
concentrates and the network effectively loses capacity. Your SLA either
assumes a failure-mode headroom target, or it accepts that some classes
degrade under failure. Determinism means you state which one you
deliver and you engineer to it.
If you want “Gold stays Gold under single-link failure,” you engineer
headroom so that the Gold class still fits within the reserved or
engineered capacity after reroute. If you do not engineer that headroom,
you write the SLA to reflect degradation behavior. The backbone still
behaves predictably; it simply behaves predictably within realistic
constraints.
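The "Gold stays Gold" question reduces to arithmetic you can run before signing the SLA. A deliberately simplified sketch using a parallel-link model and invented numbers:

```python
# Headroom check sketch: after a single-link failure, does the Gold load
# that reroutes onto the survivors still fit the engineered Gold share?
# The parallel-link model and all figures are deliberately simplified.

def gold_fits_after_failure(link_gbps, gold_share, gold_load_per_link_gbps,
                            links=2):
    """Assume `links` parallel links each carry gold_load_per_link; on one
    failure the survivors absorb the failed link's Gold load evenly."""
    surviving = links - 1
    gold_after = gold_load_per_link_gbps * links / surviving
    return gold_after <= link_gbps * gold_share

# Two 100G links, Gold engineered to 30% of a link, 12G Gold per link:
# after a failure one link carries 24G Gold against a 30G envelope.
print(gold_fits_after_failure(100, 0.30, 12))  # -> True
print(gold_fits_after_failure(100, 0.30, 18))  # -> False (36G > 30G)
```

Real planning tools run this per failure scenario over the full traffic matrix, but the decision they automate is exactly this inequality.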
6.5 DS-TE and class-aware TE: make bandwidth pools explicit
DiffServ-aware TE (DS-TE) exists because bandwidth is not a single
pool when you sell differentiated services. In RSVP-TE networks, DS-TE
lets you reserve bandwidth per class type and prevent best effort from
consuming capacity that premium services require. DS-TE works by
combining a bandwidth constraint model with TE signaling that marks LSPs
with a class type. The network then admits or rejects LSPs based on
per-class constraints.
Even if you do not use RSVP reservations, the DS-TE mindset is
useful: treat bandwidth per class as an engineering object. If you
deploy SR-TE, you can implement similar intent via policy constraints,
steering, and edge shaping. You keep the principle: premium classes have
an engineered capacity envelope that best effort cannot silently
consume.
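Stripped to its core, the DS-TE idea is per-class admission along a path. A sketch with hypothetical pools; real DS-TE additionally defines bandwidth-constraint models (MAM, RDM) that this ignores:

```python
# Class-aware admission sketch in the spirit of DS-TE: each class type
# has a bandwidth pool per link, and an LSP is admitted only if every
# link on its path has room in that pool. All figures are hypothetical.

def admit_lsp(path_links, pools, reserved, class_type, mbps):
    """pools[link][class_type] = engineered Mbps for that class;
    reserved[link][class_type] = Mbps already admitted."""
    for link in path_links:
        free = pools[link][class_type] - reserved[link].get(class_type, 0)
        if mbps > free:
            return False  # reject: pool exhausted on this link
    for link in path_links:  # commit only after every link passes
        reserved[link][class_type] = reserved[link].get(class_type, 0) + mbps
    return True

pools = {"A-B": {"real_time": 300}, "B-C": {"real_time": 300}}
reserved = {"A-B": {}, "B-C": {}}
print(admit_lsp(["A-B", "B-C"], pools, reserved, "real_time", 250))  # True
print(admit_lsp(["A-B", "B-C"], pools, reserved, "real_time", 100))  # False
```

Whether signaled by RSVP or computed by an SR controller, the guarantee comes from the same check: premium capacity is an explicit pool, and requests against it can be rejected.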
7) TE that you can operate: RSVP-TE vs SR-TE in real failures
Traffic engineering is only valuable if it stays predictable under
failure and maintenance. Path control that collapses into oscillation
during reconvergence is worse than no TE at all. Operational TE focuses
on three things: fast restoration, stable reoptimization, and clear
observability.
RSVP-TE provides explicit LSPs and mature FRR behaviors, and it can
reserve bandwidth with admission control. SR-TE shifts complexity toward
headends and controllers, often simplifying the core. SR also pairs
well with topology-aware fast reroute techniques like TI-LFA, which
restore traffic quickly when the topology supports it.
A stable practice separates restoration from optimization: restore
quickly to a safe path, then reoptimize on a slower timer with damping
and validation. This approach avoids repeated churn when the network
flaps or when maintenance is in progress.
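The restore-fast, reoptimize-slowly split can be sketched as a small state machine with a hold-down timer. The five-minute figure and the state names are illustrative:

```python
# Sketch of "restore fast, reoptimize slowly": switch to a precomputed
# backup immediately, but reoptimize only after a quiet period, so a
# flapping link does not cause repeated global churn. Timing illustrative.

REOPT_HOLDDOWN_S = 300.0  # require 5 minutes of quiet before reoptimizing

class PolicyState:
    def __init__(self):
        self.path = "primary"
        self.last_event = None

    def on_failure(self, now):
        self.path = "backup"   # immediate, precomputed restoration
        self.last_event = now  # every new event restarts the hold-down

    def maybe_reoptimize(self, now):
        # Optimize only once the network has been quiet long enough.
        if self.path == "backup" and now - self.last_event >= REOPT_HOLDDOWN_S:
            self.path = "reoptimized"
            return True
        return False

p = PolicyState()
p.on_failure(now=10.0)
print(p.maybe_reoptimize(now=60.0))   # -> False: inside hold-down
print(p.maybe_reoptimize(now=400.0))  # -> True: quiet period elapsed
```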
8) Design patterns that turn a shared core into multiple service products
A backbone delivers “many guarantees” when it encodes service intent
explicitly and keeps that intent visible. In practice, you do this with a
small set of repeatable patterns rather than with one-off exceptions.
- Per-tier steering: steer premium tiers into TE
policies while letting best effort follow shortest path. This keeps TE
scope bounded and improves predictability.
- Constraint-based policies: express intent as
constraints (latency, affinities, SRLG avoidance) rather than as static
hop lists. Constraints adapt better to failures.
- Class-to-policy mapping: map provider QoS classes
to transport intents. For example, Real-Time maps to low-latency SR
policies; Business Data maps to cost-optimized paths.
- VRF-aware separation: keep VPN separation strict
and implement extranet access as intentional route leaking with audit
trails. Avoid accidental RT reuse.
- Domain seam products: treat inter-AS and
wholesale seams as products with documented behaviors: route policy, TE
behavior, QoS mapping, and troubleshooting ownership.
These patterns also reduce operational risk. When every premium
service uses the same steering model and the same class mapping, you can
test it, simulate it, and automate compliance checks. When each
customer gets a custom variant, the network becomes a museum of
exceptions that fails unpredictably under stress.
9) Worked example: three service tiers on one MPLS core
Consider a backbone with three tiers: Gold (real-time and critical
data), Silver (business data), and Bronze (best effort). The network
offers L3VPN for enterprises, L2VPN for select metro services, and
internet access. The core uses IS-IS with consistent metrics, SR-MPLS
for policy-based path control, and LDP retained for baseline transport
and compatibility.
Gold traffic enters the PE, where the provider classifies and polices
it. The PE maps Gold into provider Real-Time and Critical Data classes.
Real-Time steers into an SR policy constrained by low latency and an
affinity that avoids a congested metro ring. Critical Data steers into a
policy that avoids a high-risk maintenance corridor. Silver follows
shortest path but receives a guaranteed minimum share in weighted
scheduling. Bronze uses remaining capacity and is subject to congestion
drops.
VPN separation remains strict via RT policy. A shared services VRF
provides DNS, authentication, and monitoring, and customers reach it
through an explicit extranet import policy. No accidental import occurs
because RT naming and filters are standardized and validated. L2VPN
metro services run as EVPN instances so flooding stays controlled and
multihoming converges predictably.
Now test a single link failure. ECMP shrinks, and some flows shift.
Gold SR policies activate TI-LFA and maintain low latency because the
alternate path stays within the constraint set. Queue drops remain near
zero for Real-Time because edge shaping and class policing keep bursts
bounded. Silver experiences minor latency increase but stays within
target because the queue share remains stable. Bronze absorbs most
degradation. This is “many guarantees” in practice: you engineer not to
avoid congestion entirely, but to ensure the right traffic degrades
last.
Now test a node maintenance drain. You shift IGP metrics or remove
adjacencies according to a standard procedure. Premium policies
precompute alternates and move with minimal disruption. You verify the
move with telemetry: SR policy path changes, queue depth trends, and
active probes. You also confirm BGP service reachability stays stable
because the procedure preserves loopback reachability and avoids
unnecessary BGP session resets.
10) Control plane interactions that make or break determinism
Deterministic services depend on control plane stability. The
transport layer must converge quickly without creating transient
forwarding loops, and the service layer must remain consistent during
topology changes.
At the transport layer, you tune IGP and link detection so the
network reacts quickly but not noisily. BFD can shorten failure
detection, but it can also amplify instability if the underlay flaps. A
disciplined design couples fast detection with fast reroute so traffic
restores quickly without waiting for full reconvergence.
At the label layer, LDP-IGP synchronization (or equivalent) prevents
the network from advertising IGP reachability before label bindings are
ready, which reduces transient blackholing. In SR, the equivalent
discipline is consistent SID programming and IGP advertisement. You
validate that all core nodes advertise and install the expected SIDs
before you steer premium services into SR policies.
At the service layer, MP-BGP stability relies on route policy and on
controlled churn. Features like BGP PIC (where available) can improve
service restoration by precomputing backup paths for labeled traffic.
Regardless of implementation, the intent remains: preserve VPN
reachability during failures without causing massive BGP churn.
Micro-loops and transient congestion deserve special attention
because they destroy voice and real-time behavior even when the
steady-state design looks perfect. You reduce micro-loops with
consistent IGP tuning, conservative metric strategies, and fast reroute
mechanisms that provide loop-free alternates. You reduce transient
congestion with edge shaping and by avoiding aggressive global
reoptimization that moves too much traffic at once.
11) Observability: prove the guarantees or you do not have them
A guarantee is only as strong as your ability to measure it. For
deterministic services, you need telemetry that ties transport, QoS, and
service layers together. You want per-class loss and queue drops,
per-path latency and jitter, and a clear mapping from customer service
to transport policy.
Flow telemetry helps you understand volume and class behavior. Active
probing helps you measure latency and jitter. Streaming counters help
you detect congestion before customers call. Control-plane telemetry
helps you correlate route churn with performance. The goal is to answer
quickly: Which path does this service take right now? Did it change
recently? Did any node drop premium traffic? Did policing drop bursts at
ingress? Did a maintenance event trigger reoptimization?
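Those questions become answerable when SLA targets are evaluated as data rather than argued from memory. A sketch with hypothetical targets and probe results:

```python
# Sketch: evaluate per-class probe results against SLA targets so
# "is the guarantee currently met?" is computed, not debated.
# Targets and measurements below are hypothetical.

SLA = {"real_time":     {"loss_pct": 0.1, "latency_ms": 30, "jitter_ms": 5},
       "business_data": {"loss_pct": 0.5, "latency_ms": 80, "jitter_ms": 20}}

def sla_violations(measurements):
    """measurements[cls] = {"loss_pct": .., "latency_ms": .., "jitter_ms": ..};
    a class with no measurements counts as violating everything."""
    out = []
    for cls, targets in SLA.items():
        seen = measurements.get(cls, {})
        for metric, limit in targets.items():
            if seen.get(metric, float("inf")) > limit:
                out.append((cls, metric))
    return out

m = {"real_time":     {"loss_pct": 0.0, "latency_ms": 42, "jitter_ms": 3},
     "business_data": {"loss_pct": 0.2, "latency_ms": 70, "jitter_ms": 12}}
print(sla_violations(m))  # -> [('real_time', 'latency_ms')]
```

Treating a missing measurement as a violation is deliberate: an unmeasured guarantee is, for operational purposes, not a guarantee.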
Operationally, build dashboards around the guarantee categories. A
separation dashboard highlights RT import/export anomalies and
unexpected route leaks. A path-control dashboard shows SR policy states
and deviations from constraints. A performance dashboard shows per-class
drops and probe results. An operations dashboard correlates failures,
maintenance actions, and convergence metrics. This approach keeps
troubleshooting aligned with customer experience.
12) Guaranteeing across seams: inter-AS, CSC, and multi-domain cores
Customers experience a service end-to-end even when your organization
splits it across domains or ASes. If your design includes seams, you
treat them as engineered objects with explicit rules.
For inter-AS VPNs, decide which AS owns which part of the guarantee.
In Option A, the boundary is explicit per VRF, which makes policy and
QoS mapping straightforward but increases configuration footprint. In
Option B, you enforce RT policies on ASBRs and align QoS mapping on the
interconnect. In Option C, transport correctness becomes the backbone of
the seam, so you instrument labeled reachability and policy compliance
more aggressively.
For CSC, define the contract in terms of what you carry and what you
do not carry. You clarify whether you carry customer VPN routes, whether
you carry their labels, and how you map their QoS classes to your
provider classes. You also define troubleshooting boundaries: which
telemetry you provide, which counters you expose, and which event types
trigger joint investigation. CSC succeeds when both carriers share a
model of the seam rather than a pile of device configs.
For multi-domain cores, avoid pretending a single IGP domain solves
everything. Domains exist for scale, blast radius, and ownership.
Determinism across domains comes from consistent class mapping,
consistent measurement, and controlled TE behavior across the seam. When
you cannot maintain consistent QoS or TE semantics across domains, you
document and productize the limitation so customers do not infer
guarantees you cannot sustain.
13) Practical migration guidance: LDP to SR without breaking services
Many networks want SR benefits but cannot flip a switch. A workable
migration keeps services stable and changes transport in controlled
phases. You start by enabling SR in the IGP and programming SIDs while
keeping LDP active. You validate loopback reachability, label
programming, and ECMP behavior. Then you introduce SR policies for a
limited set of premium services and keep the rest on baseline transport.
You measure outcomes and expand cautiously.
During migration, customer experience stays stable. VPN signaling
remains intact, QoS treatment stays consistent, and every change has a
rollback plan. You also avoid premature complexity: deploy SR where it
delivers clear value—deterministic path control, faster restoration,
simpler TE operations—rather than deploying it everywhere because it is
new.
14) Operational discipline: make guarantees auditable and repeatable
A guarantee becomes real when you can audit it. That means the
backbone configuration is consistent, validated, and explainable.
Template-driven configuration reduces accidental divergence. Pre-change
checks validate that RT policies, class mappings, and TE constraints
match the service catalogue. Post-change checks confirm that label
reachability and policy states remain correct.
Runbooks also encode determinism. A premium-service incident runbook
starts with the guarantee category: separation, path control,
performance, or operational stability. It then maps to the relevant
evidence: RT imports and BGP routes for separation, SR policy state for
path control, per-class drops and probes for performance, and IGP/FRR
events for operational stability. This structure prevents wasted time
and keeps customer communications consistent.
15) Glossary and quick troubleshooting cues
- PE: Provider Edge router that terminates customer services and hosts VRFs.
- P: Core router that switches labels and does not hold customer VRFs.
- VRF: Virtual Routing and Forwarding instance providing L3 separation.
- RD/RT: Route Distinguisher and Route Target used for VPN route uniqueness and import/export policy.
- LSP: Label Switched Path; the transport path in MPLS.
- SR Policy: Headend-defined transport intent using segment routing constraints.
- TI-LFA: Topology-Independent Loop-Free Alternate; fast reroute technique often used with SR.
- DS-TE: DiffServ-aware Traffic Engineering; TE model that considers class types and bandwidth constraints.
When a customer reports jitter, verify classification and policing
first. Confirm the flow lands in the intended provider class at ingress.
Then check per-hop queue drops and scheduling. If the service uses TE,
confirm the active path and whether a recent failure triggered a
reroute.
When a VPN loses reachability, separate transport from service.
Confirm loopback and LSP reachability across the core, then confirm
MP-BGP VPN routes exist and import into the correct VRF. RT mismatches
remain a common cause of partial reachability.
When congestion appears, evaluate failure-mode capacity. Identify
whether congestion results from planned maintenance drain or unplanned
failure. Verify whether premium classes remain within engineered
headroom and whether best effort absorbs the expected degradation.
16) Closing: turn a shared core into a guarantee engine
A shared MPLS backbone delivers many guarantees when you treat it as a
layered system: IGP for topology truth, labels for transport, BGP for
services, TE for path intent, and QoS for resource fairness. The core
stays simple, but behavior becomes rich and predictable through policy
and engineering discipline. That is the essence of “one backbone, many
guarantees”: one set of routers, many controlled outcomes.
17) Control-plane scaling: keep the core small so the guarantees stay stable
Deterministic services require a control plane that stays boring
under stress. “Boring” does not mean slow; it means predictable. You get
predictable behavior by keeping the transport domain lean and by
avoiding designs where service churn spills into the IGP.
IS-IS vs OSPF: both can carry a backbone, but IS-IS
often wins in large SP cores because it scales cleanly with wide
topologies, carries extensions comfortably, and keeps the IGP
operational model consistent across regions. OSPF works well too when
the area design stays disciplined. In either case, the principle stays
the same: keep the IGP focused on loopbacks, core links, and SR/TE
attributes—not customer routes.
Metric strategy: metric strategy is a determinism
lever. If metrics are inconsistent, traffic shifts unexpectedly and your
QoS capacity assumptions break. If metrics are too “clever,” you create
brittle dependency chains where a small change cascades into large
traffic movements. A strong approach uses simple, documented metric
tiers: e.g., set metrics to reflect link capacity and latency in broad
strokes, then use TE policies for premium traffic that needs tighter
control.
Label distribution scaling: LDP scales well for
basic transport, but it introduces a second signaling plane that must
remain aligned with the IGP. You protect determinism by using LDP-IGP
synchronization (or equivalent readiness checks) and by monitoring label
binding health as a first-class signal. SR reduces core signaling state
by tying labels (SIDs) to IGP advertisement, but SR requires consistent
SID allocation and careful validation that all nodes program the same
intent.
BGP scaling: MP-BGP carries the service universe:
VPNv4/VPNv6, EVPN, and potentially labeled-unicast. Scaling is not only
about route count; it is about policy correctness under churn.
Determinism improves when you standardize RT conventions, constrain what
can be imported, enforce maximum-prefix or equivalent guardrails where
appropriate, and keep route reflectors stable with clear redundancy and
graceful maintenance procedures.
One practical trick: treat “core size” as a measurable KPI. Track IGP
LSP/LSA counts, adjacency counts, and update rates. Track BGP route
counts, churn rates, and policy rejects. When these metrics drift upward
without a corresponding product decision, you know the network is
accumulating complexity that eventually undermines guarantees.
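That KPI can be automated in a few lines. A sketch with invented metric names and a 10% drift tolerance:

```python
# Sketch: treat "core size" as a KPI and flag drift. A metric that grows
# beyond tolerance without a recorded product decision is a complexity
# leak. Metric names and thresholds are illustrative.

def complexity_leaks(baseline, current, approved, tolerance=0.10):
    """Flag metrics that grew more than `tolerance` (as a fraction) above
    baseline and are not covered by an approved change record."""
    leaks = []
    for metric, base in baseline.items():
        now = current.get(metric, base)
        if base > 0 and (now - base) / base > tolerance and metric not in approved:
            leaks.append(metric)
    return leaks

baseline = {"igp_lsps": 500, "bgp_vpn_routes": 200_000, "adjacencies": 180}
current  = {"igp_lsps": 600, "bgp_vpn_routes": 260_000, "adjacencies": 182}
# The VPN route growth was a product decision; the IGP growth was not.
print(complexity_leaks(baseline, current, approved={"bgp_vpn_routes"}))
```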
18) A deterministic QoS walkthrough: from a customer SLA to per-hop behavior
QoS becomes deterministic when you can trace a customer contract to
concrete behavior at each hop. The mapping does not need to be
complicated, but it must be consistent and enforced.
Step 1 — Define the provider classes: pick a small
set of classes that reflect real products. For example: Network Control,
Real-Time, Interactive, Critical Data, Business Data, Best Effort,
Scavenger. Define each class in plain language: what it carries, what
loss/latency behavior you target, and what happens in failure scenarios.
Step 2 — Define the marking and trust model: at
ingress, classify traffic using customer-facing criteria (interfaces,
VLANs, ACLs, NBAR where appropriate). Map customer markings into
provider classes. Set MPLS TC based on provider class (pipe-like
behavior). Preserve inner DSCP inside the VPN payload if the customer
needs it for their own LAN/WAN policy, but do not let it drive core
scheduling without policing.
Step 3 — Enforce the contract with policing and shaping:
apply per-class policers to enforce contracted rates. Use shaping to
smooth bursts into predictable rate profiles. Bursts are not “bad,” but
they must be engineered. A common failure in premium QoS is allowing a
burst to fill a priority queue, which increases jitter for other premium
flows that arrive milliseconds later.
Step 4 — Implement consistent queuing per hop:
implement the same queue model on every core-facing interface that might
carry premium traffic. If platforms differ, define a
lowest-common-denominator model and document exceptions. Priority queues
remain valuable, but cap them so they cannot starve other classes
during abnormal events. Use weighted scheduling for the remaining
classes so Business and Critical Data remain stable under load.
Step 5 — Validate with active measurement: do not
rely on configuration as proof. Run active probes per class (or per
service tier) across representative paths. Correlate probe performance
with per-hop queue drop counters and utilization. Determinism improves
when you can say: “Real-Time jitter increases because the alternate path
adds one more hop and the egress shaper is too tight,” rather than “QoS
seems wrong.”
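For the jitter part of Step 5, a common estimator is the smoothed interarrival-jitter recursion from RFC 3550 (J += (|D| - J)/16). A sketch over hypothetical one-way probe delays:

```python
# Jitter estimate from one-way probe delays, using the RFC 3550 smoothed
# interarrival-jitter recursion. Input samples are hypothetical and in
# milliseconds.

def rfc3550_jitter(delays_ms):
    j = 0.0
    for prev, cur in zip(delays_ms, delays_ms[1:]):
        d = abs(cur - prev)   # delay variation between adjacent probes
        j += (d - j) / 16.0   # exponential smoothing per RFC 3550
    return j

steady = [20.0, 20.1, 19.9, 20.0, 20.1]
bursty = [20.0, 35.0, 21.0, 38.0, 19.0]
print(rfc3550_jitter(steady) < rfc3550_jitter(bursty))  # -> True
```

The same average latency can hide very different jitter profiles, which is why premium classes need a variation metric, not just a mean.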
Step 6 — Align TE and QoS: premium traffic that
receives priority scheduling still fails if it traverses congested
links. TE prevents the network from accidentally pushing premium traffic
into a hot corridor, especially during failures and maintenance drains.
DS-TE (or an equivalent class-aware capacity model) ensures that
premium capacity exists when best effort expands. The outcome is a
closed loop: classification and QoS protect traffic on a link, TE
reduces the probability that premium traffic lands on an
already-congested path, and measurement validates that the combined
system stays within SLA targets.
19) Architecture checklist: validate the backbone before you sell the guarantee
Use this checklist to sanity-check whether a design truly supports
“one backbone, many guarantees.” It focuses on questions that surface
hidden coupling between layers.
- Separation: Are RT conventions documented? Do you
have an approval workflow for route leaking? Do you audit unexpected RT
imports automatically?
- Transport correctness: Do you monitor loopback
reachability, label bindings/SID programming, and forwarding
consistency? Do you validate that the transport path exists before
advertising service reachability?
- TE intent: Do premium services map to a small set
of policies? Are constraints documented? Do you prevent uncontrolled
reoptimization that moves too much traffic at once?
- QoS determinism: Do you have a provider class
model? Are policers and shapers aligned with contracts? Do you measure
per-class loss/latency/jitter, not just interface utilization?
- Failure-mode engineering: Do you model traffic
under single-link and single-node failures? Do you know which links
become hot? Do premium tiers stay within engineered headroom in those
scenarios?
- Seams: If services cross AS or domain boundaries,
do you document ownership, policy, and QoS mapping at the seam? Can you
observe and troubleshoot the seam without guesswork?
- Operations: Do you have a standard maintenance
drain procedure? Do you validate post-change service health with
automated checks? Do you have runbooks that start from the guarantee
category, not from the protocol list?
If you can answer these questions with evidence, you can sell
differentiated guarantees with confidence. If you cannot, the network
may still work, but the guarantees will remain probabilistic—and
customers will notice when the first real failure hits.
Eduardo Wnorowski is a systems architect,
technologist, and Director. With over 30 years of experience in IT and
consulting, he helps organizations maintain stable and secure
environments through proactive auditing, optimization, and strategic
guidance.