Goodhart’s Law in Legacy Systems: Why Modernization Metrics Fail

Modernization initiatives in mainframe environments are increasingly guided by quantitative signals intended to simplify decision making across vast, multi-decade systems. Metrics related to complexity reduction, performance improvement, security posture, and delivery velocity are often elevated as proxies for progress. In isolation, these indicators appear objective and actionable. In practice, once such measures become explicit targets, they begin to reshape engineering behavior in ways that detach reported improvement from actual system health. This dynamic aligns closely with Goodhart’s Law, commonly summarized as "when a measure becomes a target, it ceases to be a good measure," and it exposes a structural weakness in how legacy modernization success is commonly assessed.

Mainframe systems amplify this effect because their behavior emerges from tightly coupled interactions between COBOL programs, JCL job streams, transaction managers, and long-lived data stores. Measurement frameworks rarely capture this full interaction space. Instead, they emphasize localized attributes that are easier to extract through static inspection or runtime sampling. As a result, modernization teams may optimize individual components while unknowingly increasing global fragility, contention, or data inconsistency. What appears as improvement at the metric level often conceals deeper structural complexity that remains invisible until operational failures surface.

The issue is not the existence of metrics, but their elevation above architectural context. When modernization programs prioritize numeric thresholds without understanding structural dependencies, metrics begin to drive engineering decisions rather than describe system reality. Refactoring efforts become shaped by what is measured instead of what reduces systemic risk. Performance tuning favors visible gains over end-to-end throughput stability. Security remediation focuses on countable findings rather than meaningful exposure reduction. These behaviors mirror challenges observed across broader application modernization initiatives, but they are significantly harder to detect and correct in mainframe environments.

Explaining why modernization metrics fail in legacy systems requires shifting attention from individual numbers to the architectural conditions that undermine them. This includes how dependencies propagate change across batch and online workloads, how data flows traverse subsystem boundaries, and how performance characteristics emerge from shared infrastructure. By examining Goodhart’s Law through the lens of mainframe systems, it becomes possible to clarify why conventional optimization strategies repeatedly underperform and why modernization efforts require deeper, system-aware insight to remain valid under operational pressure.

How Goodhart’s Law Manifests in Metric-Driven Legacy Modernization

Legacy modernization programs often begin with a well-intentioned push to introduce clarity and control into environments that have grown opaque over decades. Quantitative metrics promise comparability, progress tracking, and executive visibility across sprawling mainframe estates. Measures such as complexity reduction, defect density, test coverage, or batch duration improvements are adopted to translate deeply technical change into digestible indicators. In early phases, these metrics can reveal genuine problem areas and help prioritize intervention.

As modernization efforts mature, however, the role of metrics subtly shifts. What began as descriptive signals increasingly become performance targets tied to funding decisions, delivery milestones, or leadership reporting. At that point, the measurement framework starts exerting pressure on engineering behavior. In mainframe environments, where system behavior is highly emergent and dependencies are deeply layered, this pressure accelerates the conditions predicted by Goodhart’s Law. Metrics cease to reflect system health and instead begin shaping it in unintended ways, often masking new forms of risk.

Metric Targets as Behavioral Constraints in Mainframe Teams

When modernization metrics become explicit targets, they act as constraints that shape how engineering teams allocate effort and manage risk. In mainframe environments, where delivery cycles are conservative and production stability is paramount, teams naturally gravitate toward changes that satisfy measurement criteria with minimal perceived disruption. This often leads to localized optimizations that improve reported metrics without addressing the underlying causes of complexity or fragility.

For example, complexity reduction targets frequently encourage superficial restructuring of COBOL programs. Large programs may be split mechanically into smaller units to lower reported complexity scores, even when execution paths and data dependencies remain unchanged. While dashboards show improvement, the operational reality often becomes harder to reason about as control flow is distributed across additional modules with implicit coupling. Over time, this behavior erodes the analytical value of metrics derived from static-code-analysis techniques, because the structure they measure no longer correlates with runtime behavior.
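The mechanics of this gaming are easy to sketch. In McCabe's formulation, cyclomatic complexity is roughly one plus the number of decision points in a module, so mechanically splitting a program redistributes the score without removing a single branch. A minimal illustration in Python, with the branch counts invented for the example:

```python
# McCabe-style score: 1 + number of decision points in a module.
def cyclomatic(decision_points: int) -> int:
    return decision_points + 1

# One monolithic program with 30 branches (hypothetical numbers).
monolith = cyclomatic(30)
print(monolith)                      # 31 -- fails a "keep it under 10" gate

# The same 30 branches split mechanically across five modules,
# gaining four call sites of implicit coupling in the process.
modules = [cyclomatic(6) for _ in range(5)]
print(max(modules))                  # 7 -- every module now passes the gate

# Nothing was simplified: all 30 decision points still execute.
print(sum(score - 1 for score in modules))   # 30
```

The dashboard records a dramatic complexity reduction; the runtime control flow is unchanged and now harder to trace.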

The same pattern appears in defect and quality metrics. When thresholds are enforced, teams may prioritize suppressing or reclassifying findings rather than resolving systemic causes. In environments where change carries significant operational risk, this behavior is rational from a local optimization perspective. It minimizes immediate exposure while satisfying external reporting requirements. From a system perspective, however, it creates blind spots where genuine risk accumulates outside the measurement model.

Mainframe teams are particularly susceptible to this effect because institutional knowledge often substitutes for formal documentation. Engineers rely on experience to navigate edge cases that metrics cannot capture. When metrics override this contextual understanding, teams adapt by optimizing what is visible rather than what is structurally important. Over time, the measurement framework becomes a behavioral governor that limits meaningful modernization rather than enabling it.

Local Optimization Versus System-Level Outcomes

One of the most damaging manifestations of Goodhart’s Law in legacy environments is the tension between local optimization and system-level outcomes. Mainframe systems are composed of interdependent batch streams, online transactions, shared datasets, and scheduling constraints that interact in non-linear ways. Metrics, by necessity, abstract away much of this interaction. When targets are enforced at the component level, they incentivize decisions that improve local indicators while degrading global behavior.

A common example appears in performance-focused modernization. Teams may be tasked with reducing batch execution time or lowering CPU consumption for specific jobs. In response, they tune individual programs, adjust scheduling priorities, or introduce caching mechanisms that deliver measurable improvements for the targeted workload. These changes often succeed in isolation but can shift contention to other jobs, extend downstream processing windows, or introduce timing sensitivities that were previously absent.

Because metrics rarely account for cross-stream dependencies, these side effects remain invisible until failures occur. The system appears healthier according to reported indicators, yet its operational margin shrinks. This dynamic is exacerbated when impact-analysis techniques are applied selectively rather than across the full dependency graph. Without a system-wide view, optimization efforts unintentionally trade visible improvements for hidden instability.
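The gap between selective and full-graph impact analysis can be made concrete. The sketch below uses an invented job dependency map; the point is that a change's blast radius is the transitive closure over the dependency graph, not the immediate neighbors a job-scoped metric sees:

```python
from collections import deque

# Hypothetical job/data dependencies: each key feeds the jobs listed.
deps = {
    "PAYROLL01":  ["GL-EXTRACT", "TAX-CALC"],
    "GL-EXTRACT": ["GL-POST"],
    "TAX-CALC":   ["TAX-REPORT", "GL-POST"],
    "GL-POST":    ["RECON-NIGHTLY"],
}

def blast_radius(start: str) -> set[str]:
    """Every job transitively downstream of a changed component."""
    seen, queue = set(), deque([start])
    while queue:
        for child in deps.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A job-scoped view "sees" two consumers; the system has five.
print(deps["PAYROLL01"])
print(sorted(blast_radius("PAYROLL01")))
# ['GL-EXTRACT', 'GL-POST', 'RECON-NIGHTLY', 'TAX-CALC', 'TAX-REPORT']
```

Tuning PAYROLL01 against a metric that only observes its direct consumers leaves three downstream jobs, including the nightly reconciliation, outside the measured scope.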

Over time, organizations may respond by introducing additional metrics to capture newly observed issues. This compounds the problem. Each new target adds another constraint that teams must satisfy, further encouraging tactical optimization over structural improvement. The result is a modernization program that produces impressive metric trends while delivering diminishing returns in resilience, predictability, and operational confidence.

The Erosion of Metric Meaning Over Modernization Timelines

Metrics rarely fail immediately. Their degradation is gradual, which makes Goodhart effects difficult to detect in long-running modernization initiatives. In early phases, improvements are often genuine because obvious inefficiencies and redundancies are addressed. As these opportunities are exhausted, continued metric improvement requires increasingly contrived interventions that preserve numerical progress without corresponding system benefit.

In mainframe environments, this erosion is accelerated by the longevity of both code and measurement frameworks. Metrics selected at the outset of a multi-year program often persist long after their original rationale has expired. Teams learn how to satisfy them efficiently, and institutional memory reinforces these behaviors. Over time, the metric becomes a ritualized artifact rather than an informative signal.

This phenomenon is particularly visible in complexity and maintainability measures. As teams learn how these metrics are calculated, they adapt coding patterns to minimize scores rather than clarify intent or reduce coupling. The metric continues to change, but its semantic connection to maintainability weakens. Decision makers may interpret steady improvement as evidence of progress, unaware that the measurement has been decoupled from the property it was meant to represent.

The long lifespan of mainframe systems amplifies this effect. Changes accumulate slowly, and feedback cycles are long. By the time metric distortion becomes apparent, reversing it requires rethinking both the modernization approach and the measurement strategy. Without deeper forms of software intelligence that preserve system context, organizations risk spending years optimizing numbers that no longer describe the systems they depend on.

Why Measurement Pressure Outpaces Understanding in Legacy Systems

At the core of Goodhart’s Law in mainframe modernization is an imbalance between measurement pressure and system understanding. Metrics are easy to mandate and report, while deep comprehension of legacy systems is costly and time-consuming to acquire. In environments where expertise is scarce and documentation incomplete, organizations often default to measurement as a substitute for understanding.

This substitution creates a feedback loop. As metrics drive decisions, less emphasis is placed on building shared mental models of system behavior. Engineers focus on satisfying targets rather than exploring dependencies, edge cases, or failure modes that fall outside the measurement framework. Over time, the organization becomes increasingly dependent on metrics precisely as their reliability declines.

The problem is not that metrics are inherently flawed, but that they are applied without sufficient grounding in structural reality. In mainframe environments, where behavior emerges from the interaction of many loosely documented components, this grounding cannot be assumed. It must be actively constructed through analysis that respects control flow, data lineage, and execution context.

When modernization initiatives fail to invest in this understanding, Goodhart’s Law becomes an inevitability rather than a risk. Metrics become the map, not the territory, and decisions follow the map even as it diverges from reality. Recognizing this dynamic is the first step toward modernization strategies that resist metric distortion and remain aligned with actual system behavior under operational conditions.

Why Mainframe Architectures Magnify Metric Distortion Effects

Mainframe environments possess structural characteristics that fundamentally alter how metrics behave under pressure. Unlike modern greenfield systems, these platforms evolved incrementally, accumulating layers of logic, data contracts, and operational assumptions over decades. As a result, system behavior emerges from the interaction of many components rather than from isolated modules. When modernization programs apply metric targets to such environments, the architectural reality amplifies the divergence between what is measured and what actually matters.

This amplification occurs because mainframe systems were not designed with continuous measurement in mind. Execution paths span batch and online workloads, data is reused across unrelated functions, and performance characteristics depend on shared infrastructure and scheduling policies. Metrics extracted from individual artifacts capture only fragments of this reality. When these fragments become targets, Goodhart’s Law manifests more aggressively than in loosely coupled systems, accelerating the loss of alignment between reported improvement and operational outcomes.

Tight Coupling and Emergent Behavior in Mainframe Systems

One of the primary reasons mainframe architectures magnify metric distortion is the degree of tight coupling embedded in their design. COBOL programs frequently share copybooks, datasets, and global control structures that implicitly bind their behavior together. JCL job streams coordinate execution order and resource allocation across entire processing windows. Transaction managers such as CICS orchestrate thousands of concurrent interactions against shared state. These relationships are often implicit, undocumented, and only partially understood even by experienced teams.

When metrics are applied to individual components within this environment, they fail to account for emergent behavior that arises from these couplings. A program-level metric may indicate reduced complexity or improved performance, yet the change may alter execution timing or data access patterns in ways that ripple across dependent jobs. Because these effects occur outside the measured scope, they are invisible to the metric framework until failures or regressions appear.

This dynamic undermines the validity of many commonly used modernization indicators. Metrics derived from static inspection may suggest improvement while runtime behavior becomes less predictable. Performance indicators may improve for a single transaction while overall throughput degrades due to contention elsewhere. The tighter the coupling, the greater the gap between local measurement and global outcome.

In such systems, the absence of comprehensive dependency awareness transforms metrics into misleading signals. Without understanding how changes propagate across tightly bound components, teams are effectively optimizing in the dark. The resulting distortion is not a marginal error but a systemic consequence of applying reductionist measures to systems whose behavior cannot be reduced without loss of meaning.

Batch and Online Workload Interference Under Metric Pressure

Mainframe environments uniquely combine batch and online workloads within the same operational ecosystem. Batch jobs process large volumes of data on fixed schedules, while online transactions demand low latency and high availability throughout the day. These workloads compete for CPU, I/O, memory, and locking resources, and their interaction is governed by scheduling policies refined over years of operational tuning.

Metric-driven modernization often targets one workload class at a time. For example, batch window reduction initiatives may focus on shortening execution times for specific jobs. Teams may optimize file access patterns, introduce parallelism, or adjust job priorities to achieve measurable gains. While these changes improve reported batch metrics, they can increase contention during overlap periods or starve online transactions of resources.

Because metrics are typically scoped narrowly, such interference remains unmeasured. Online performance degradation may not be attributed to batch optimization efforts until user-facing incidents occur. Conversely, online tuning initiatives may shift load into batch windows, extending processing times and increasing operational risk. In both cases, metrics capture local success while masking system-level tradeoffs.

This interaction illustrates why conventional software performance indicators lose reliability under target pressure in mainframe environments. The shared nature of resources means that improvements cannot be evaluated in isolation. Without accounting for workload interference, metric optimization becomes a zero-sum game where gains in one area are offset by losses elsewhere.
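A toy contention model makes the zero-sum dynamic visible. All numbers below are invented; the latency curve is a simple M/M/1-style approximation in which response time grows sharply as combined demand approaches shared capacity:

```python
IO_CAPACITY = 100   # shared I/O budget during the overlap window (assumed)

def online_latency(online_io: int, batch_io: int) -> float:
    """Queueing-style latency: explodes as utilization approaches 1.0."""
    utilization = min((online_io + batch_io) / IO_CAPACITY, 0.99)
    return 10.0 / (1.0 - utilization)   # base latency of 10 ms

# Before: batch and online coexist at 70% utilization.
print(round(online_latency(online_io=40, batch_io=30), 1))   # 33.3

# After a batch "win": parallelism shortens the batch window by pushing
# more I/O into the overlap. The batch metric improves; online pays.
print(round(online_latency(online_io=40, batch_io=55), 1))   # 200.0
```

The batch duration dashboard shows an unambiguous improvement; online response time, measured by a different team against a different target, has sextupled.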

Data Reuse and Hidden Dependency Chains

Data reuse is a defining feature of long-lived mainframe systems. Files, tables, and records created for one purpose are often repurposed by downstream processes over time. These secondary uses may be undocumented or only known to a small subset of experts. As modernization initiatives progress, metrics related to data access efficiency, redundancy reduction, or schema simplification are frequently introduced to rationalize data structures.

Under metric pressure, teams may consolidate datasets, eliminate seemingly redundant fields, or optimize access paths to satisfy measurable objectives. While these changes improve local data metrics, they can disrupt hidden dependency chains that rely on legacy data semantics. Batch jobs may consume data in undocumented formats, reconciliation processes may assume specific ordering, and exception handling paths may depend on legacy field values.
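A small sketch shows how such a break stays invisible to the metric. The record layout and field names are invented; the downstream consumer parses by byte offset, as fixed-width batch consumers typically do, so deleting a "redundant" one-byte field silently shifts every later column:

```python
# Hypothetical v1 layout: id(5) name(9) legacy-flag(1) amount(4)
record_v1 = "00123JOHN DOE A1450"

def parse_amount_v1(rec: str) -> int:
    """Downstream parser, written against the v1 layout."""
    return int(rec[15:19])           # amount starts at byte offset 15

print(parse_amount_v1(record_v1))    # 1450 -- correct

# Schema "simplification": the unused-looking legacy flag is dropped,
# improving the redundancy metric. Every later field shifts left.
record_v2 = "00123JOHN DOE 1450"

print(parse_amount_v1(record_v2))    # 450 -- silent corruption, no error
```

No exception is raised and no metric regresses; the wrong amount simply flows downstream until a reconciliation job disagrees with the general ledger.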

Because these dependencies are rarely captured by measurement frameworks, their disruption does not immediately register as metric regression. Instead, failures emerge later as data inconsistencies, reconciliation errors, or subtle logic faults. The metric-driven change appears successful until its side effects propagate through the system.

This pattern underscores the limits of measurement without comprehensive impact awareness. In mainframe environments, data is not merely a passive asset but a coordination mechanism across processes. Metrics that ignore this role incentivize changes that weaken system integrity while signaling progress.

Infrastructure Sharing and Metric-Induced Contention

Mainframe platforms derive efficiency from extensive infrastructure sharing. CPU pools, I/O channels, buffer caches, and locking mechanisms are optimized to support diverse workloads concurrently. Performance characteristics emerge from how these shared resources are scheduled and consumed, not solely from application logic. Modernization metrics often abstract away this infrastructure layer, focusing instead on application-level indicators.

When metrics such as CPU usage reduction or transaction latency targets are enforced, teams may implement changes that shift resource consumption patterns. For instance, caching strategies may reduce CPU cycles for one application while increasing memory pressure globally. Parallelization may shorten individual execution times while increasing contention for shared locks or I/O bandwidth.

Because infrastructure metrics are often aggregated at a coarse level, these shifts remain invisible to application-focused measurement frameworks. The system appears more efficient according to targeted indicators, yet its stability margin narrows as contention patterns intensify. This is a classic manifestation of Goodhart’s Law, where optimizing measured variables degrades unmeasured but critical properties.
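The aggregation problem is easy to demonstrate numerically. With invented latency samples, a coarse average reports the optimization as a win while the tail, where lock and I/O contention actually bites, degrades severely:

```python
# Assumed online latency samples (ms), 100 per measurement window.
before = [12] * 98 + [16, 20]        # steady behavior
after  = [10] * 98 + [60, 80]        # faster typical case, contention spikes

def mean(xs: list) -> float:
    return sum(xs) / len(xs)

def p99(xs: list) -> int:
    """99th percentile of a 100-sample window: second-largest value."""
    return sorted(xs)[98]

print(round(mean(before), 1), p99(before))   # 12.1 16
print(round(mean(after), 1),  p99(after))    # 11.2 60

# The coarse metric improves (11.2 < 12.1) while the tail nearly
# quadruples -- exactly the shift an aggregate dashboard cannot see.
```

Any monitoring that reports only window averages will classify this change as a success right up until the spikes start tripping transaction timeouts.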

Addressing this distortion requires analysis that spans application logic and infrastructure interaction. Without such visibility, metric optimization in shared environments inevitably trades short-term gains for long-term fragility. In mainframe systems, where infrastructure sharing is foundational rather than incidental, this tradeoff is especially pronounced and costly.

Architectural Opacity and the Limits of Measurement

The final factor that magnifies metric distortion in mainframe environments is architectural opacity. Decades of incremental change have produced systems whose structure is only partially understood. Documentation is incomplete, ownership is fragmented, and execution behavior is inferred rather than observed. Metrics offer an appealing substitute for this missing understanding, but they cannot replace it.

As measurement pressure increases, organizations rely more heavily on metrics precisely because deeper analysis appears impractical. This reliance accelerates Goodhart effects. Metrics become authoritative despite their limited scope, and decisions follow them even as their explanatory power erodes. The system’s true behavior drifts further from what the metrics describe.

Without architectural transparency supported by techniques such as cross-system impact analysis, metrics inevitably overreach their explanatory capacity. In mainframe modernization, this overreach is not an edge case but a structural condition. Recognizing it is essential for understanding why metric-driven approaches repeatedly fail to deliver sustainable improvement in legacy environments.

The Failure of Code Quality Metrics in Multi-Decade Codebases

Code quality metrics are often positioned as neutral indicators that reveal structural weaknesses in aging systems. In legacy mainframe environments, these metrics are commonly used to justify refactoring investment, prioritize remediation, and demonstrate modernization progress to stakeholders. Measures such as complexity scores, duplication ratios, and maintainability indices promise to translate decades of accumulated logic into actionable signals that can be tracked over time.

In multi-decade codebases, however, the relationship between these metrics and actual system behavior is fragile. The longevity of the code, combined with evolving business rules and platform constraints, means that many quality indicators capture surface characteristics rather than functional reality. Once these indicators are elevated to targets, Goodhart’s Law takes hold. Code quality metrics begin to reflect compliance with measurement criteria instead of meaningful improvements in reliability, clarity, or change safety. This disconnect is especially pronounced in environments shaped by long-term architectural drift and incremental change.

Cyclomatic Complexity as a Misleading Modernization Signal

Cyclomatic complexity is frequently used as a proxy for code understandability and risk. In principle, high complexity indicates numerous execution paths that are difficult to reason about and test. In practice, applying this metric to multi-decade mainframe codebases introduces distortions that undermine its usefulness once it becomes a modernization target.

Legacy COBOL programs often encode business logic that evolved in response to regulatory changes, market shifts, and operational exceptions. Complexity accumulates not because of poor design choices, but because the program serves as a historical ledger of business behavior. When modernization initiatives mandate complexity reduction targets, teams are incentivized to restructure control flow to satisfy the metric without altering the underlying logic. Conditional logic may be extracted into auxiliary programs or flattened through mechanical transformations that reduce reported scores.

While these changes improve complexity indicators, they often degrade conceptual clarity. Execution paths become distributed across additional modules, increasing cognitive load for maintainers. Debugging and impact assessment become harder because logic is no longer localized. The metric suggests improvement, yet the system becomes more difficult to reason about under change.

This distortion is exacerbated by how complexity is calculated. Many tools count decision points without considering semantic intent or execution frequency. Rarely executed error paths carry the same weight as core business logic. Teams responding to metric pressure may refactor low-risk paths to achieve numeric gains while leaving high-risk interactions untouched. Over time, the metric drifts further from its original purpose.
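The weighting problem can be reproduced in a few lines against Python's `ast` module, standing in for a COBOL analyzer purely for illustration. A naive McCabe-style counter gives a hot business rule and a long-dead error branch identical weight:

```python
import ast

def cyclomatic(source: str) -> int:
    """Naive McCabe-style count: 1 + decision points. Note what is
    absent: semantic intent and execution frequency."""
    decisions = sum(
        isinstance(node, (ast.If, ast.For, ast.While, ast.BoolOp))
        for node in ast.walk(ast.parse(source))
    )
    return decisions + 1

# Hypothetical snippets: core pricing logic vs. a retired error path.
hot_path  = "if rate_class == 'A':\n    premium = base * 1.10"
dead_path = "if status == 'X9':\n    handle_branch_retired_in_2009()"

# Identical scores: the metric cannot distinguish core logic from a
# path that has not executed in a decade, so teams under target
# pressure refactor whichever is cheaper to change.
print(cyclomatic(hot_path), cyclomatic(dead_path))   # 2 2
```

Both branches contribute exactly +1, which is why refactoring low-risk paths is the rational move for hitting the number and the wrong move for reducing risk.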

The persistence of this pattern illustrates how a once-informative measure loses meaning when treated as a target. In multi-decade systems, complexity is often a symptom rather than a cause. Reducing the number without addressing why the logic exists produces cosmetic change rather than modernization.

Maintainability Indices and the Illusion of Structural Health

Maintainability indices attempt to combine multiple factors into a single score that represents long-term code health. These indices typically aggregate complexity, size, and comment density into a normalized value. In legacy environments, such scores are attractive because they promise a high-level view of structural quality across vast codebases.

The problem arises when these indices are used to guide modernization decisions without understanding their limitations. In long-lived systems, maintainability is not solely a function of code shape. It is deeply influenced by stability of interfaces, predictability of behavior, and the presence of implicit contracts that are not visible in the source. A program with a low maintainability score may be operationally stable and well understood by its maintainers, while a refactored alternative with a higher score may introduce uncertainty.

When maintainability indices become targets, teams adapt their behavior to optimize the formula. Comment density may increase without improving explanatory value. Functions may be split or merged to influence size calculations. These changes improve scores while leaving the underlying maintenance burden unchanged or even increased. The metric becomes an exercise in optimization rather than insight.
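The gaming is visible directly in the formula. The sketch below uses one widely cited maintainability index variant (the SEI formulation, which includes a comment-density bonus term); the input measurements are invented. Nothing about the program changes between the two calls except the comment ratio:

```python
import math

def maintainability_index(halstead_volume: float, cyclomatic: float,
                          loc: int, comment_ratio: float) -> float:
    """One widely cited MI variant (SEI formulation). The final term
    rewards comment density regardless of comment quality."""
    mi = (171
          - 5.2 * math.log(halstead_volume)
          - 0.23 * cyclomatic
          - 16.2 * math.log(loc))
    mi += 50 * math.sin(math.sqrt(2.4 * comment_ratio))
    return round(mi, 1)

# Identical program (hypothetical measurements), before and after
# pasting boilerplate comment blocks above every paragraph.
sparse = maintainability_index(3200, 45, 1800, comment_ratio=0.02)
padded = maintainability_index(3200, 45, 1800, comment_ratio=0.35)

print(sparse, padded)   # the score climbs sharply; the logic,
                        # coupling, and change risk are untouched
```

Because the bonus term depends only on comment volume, mechanically generated header blocks move the score as effectively as genuine explanation does.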

This phenomenon appears repeatedly in analyses that compare maintainability scores against observed failure rates and change risk. In multi-decade codebases, the gap between measured maintainability and real-world change risk widens over time as teams learn how to satisfy scoring models.

As a result, maintainability indices lose credibility among experienced engineers while remaining influential in reporting contexts. This split reinforces Goodhart’s Law. The metric continues to drive decisions even as those closest to the system recognize its declining relevance.

Code Coverage Targets and the Dilution of Test Meaning

Test coverage metrics are often introduced into legacy modernization programs to demonstrate improved verification and reduced risk. Achieving higher coverage percentages is seen as evidence that code behavior is better understood and more resilient to change. In mainframe environments, however, coverage targets frequently produce outcomes that undermine this assumption.

Legacy systems often lack comprehensive automated test suites because behavior is validated through operational stability rather than isolated tests. Introducing coverage targets in such contexts incentivizes teams to create tests that execute code paths without asserting meaningful outcomes. Simple invocation tests inflate coverage numbers while providing little assurance about correctness under realistic conditions.

As coverage targets tighten, this behavior intensifies. Teams focus on maximizing executed lines rather than validating business rules. Error handling paths may be triggered artificially, while complex data interactions remain untested. The metric improves steadily, but the system’s susceptibility to regression remains unchanged.

This dilution of test meaning is difficult to detect through coverage statistics alone. The number increases, but the semantic value of the tests decreases. Over time, coverage becomes a compliance artifact rather than a quality signal. Engineers may lose trust in the metric, yet it continues to influence modernization narratives.

In multi-decade codebases, where behavior is tightly coupled to data state and execution context, coverage metrics are particularly vulnerable to this distortion. Without complementary analysis of data flow and execution semantics, coverage targets encourage activity that looks productive while delivering limited risk reduction.

Duplication Metrics and the Risk of Over-Aggressive Consolidation

Code duplication metrics are commonly used to identify opportunities for consolidation and reuse. In legacy systems, duplication is often interpreted as technical debt that increases maintenance cost and inconsistency risk. While this interpretation holds in some cases, it becomes problematic when duplication metrics are treated as modernization targets in isolation.

In multi-decade codebases, duplicated logic may exist for valid reasons. Slight variations in business rules, regulatory requirements, or operational context can necessitate parallel implementations that appear similar syntactically but differ semantically. Duplication metrics rarely capture these nuances. They identify structural similarity without understanding intent.

Under metric pressure, teams may consolidate duplicated code to reduce reported duplication percentages. This consolidation can introduce conditional logic to handle variations, increasing complexity and coupling. Alternatively, shared modules may be created that serve multiple contexts with subtle differences. While duplication metrics improve, the resulting code becomes harder to modify safely.
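The consolidation trap can be shown in a few lines. The surcharge rules below are invented; syntactically the two routines read as duplication, but they encode different jurisdictional semantics, and the "deduplicated" version welds them together behind a mode flag:

```python
# Two "duplicates" that differ in one regulatory detail (assumed rules).
def ny_surcharge(amount: float) -> float:
    return amount * 0.04

def nj_surcharge(amount: float) -> float:
    return max(amount * 0.04, 2.00)   # NJ variant enforces a minimum

# Metric-driven consolidation: the duplication score drops, but the
# merged routine now couples both jurisdictions behind a flag.
def surcharge(amount: float, state: str) -> float:
    result = amount * 0.04
    if state == "NJ":
        result = max(result, 2.00)
    return result

# Every future NJ rule change now risks NY behavior, and vice versa,
# and the shared routine's caller set grows over time.
print(surcharge(10.0, "NY"), surcharge(10.0, "NJ"))   # 0.4 2.0
```

The duplication metric improves, while the change surface of each jurisdiction's rules has quietly doubled.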

The risk is compounded when downstream dependencies are not fully understood. Consolidated code may be invoked by a wider range of processes than anticipated, amplifying the impact of future changes. What appears as a reduction in redundancy becomes an increase in blast radius.

This pattern demonstrates how duplication metrics, when optimized as targets, can erode system resilience. In legacy environments, duplication is not always a flaw. Treating it as such without contextual analysis leads to structural changes that satisfy measurement goals while increasing modernization risk.

Why Code Quality Metrics Lose Meaning Over Time

The common thread across code quality metrics in multi-decade codebases is their gradual loss of semantic connection to the properties they were designed to measure. Early in a modernization initiative, these metrics can highlight genuine issues. As they become targets, teams adapt, tools are tuned, and behaviors shift. The metrics continue to change, but their explanatory power diminishes.

This erosion is not accidental. It is a predictable outcome of applying simplified measures to complex, historically evolved systems. In mainframe environments, where logic, data, and execution context are inseparable, code quality cannot be reduced to static attributes alone. Metrics that ignore this reality invite Goodhart effects.

Recognizing this failure does not imply abandoning measurement. It highlights the need to interpret metrics as indicators rather than objectives, and to ground them in a deeper understanding of system behavior. Without that grounding, code quality metrics in legacy systems will continue to signal progress while concealing the very risks modernization seeks to eliminate.

Performance Optimization Metrics That Degrade End-to-End Throughput

Performance metrics occupy a central role in mainframe modernization programs because they offer tangible evidence of improvement in environments where change is inherently risky. Indicators such as CPU utilization, batch duration, transaction response time, and throughput are commonly used to justify refactoring efforts and infrastructure investment. These measures appear especially relevant in cost-sensitive mainframe contexts, where performance gains are often equated with financial efficiency and operational success.

The challenge emerges when these metrics are transformed from diagnostic tools into fixed optimization targets. In tightly coupled mainframe systems, performance characteristics arise from the interaction of workloads, data access patterns, and shared infrastructure rather than from isolated code paths. When optimization efforts focus narrowly on improving individual performance indicators, they often degrade end-to-end throughput and system stability. This is a textbook manifestation of Goodhart’s Law, where the pursuit of measurable improvement undermines the property the metric was meant to represent.

CPU Reduction Targets and the Redistribution of Bottlenecks

CPU reduction initiatives are among the most common performance-driven modernization goals in mainframe environments. Organizations frequently establish targets to lower MIPS consumption in order to control licensing costs and delay hardware upgrades. At first glance, this approach appears rational. CPU usage is measurable, auditable, and directly tied to cost models. However, once CPU reduction becomes a target rather than an indicator, it reshapes optimization behavior in ways that distort overall performance.

Teams responding to CPU targets often refactor code to minimize instruction counts in frequently executed paths. Loop unrolling, caching of computed values, and aggressive reuse of in-memory structures can all reduce CPU cycles for specific programs. While these changes succeed in lowering measured CPU consumption, they frequently increase memory pressure, I/O contention, or lock duration. The result is a redistribution of bottlenecks rather than their elimination.

Because CPU metrics are typically tracked at the job or program level, secondary effects remain invisible. Increased I/O wait times or longer lock holds may slow downstream processes or online transactions without triggering CPU alarms. Throughput declines even as CPU metrics improve. Over time, the system becomes more sensitive to workload variation, with small spikes in demand causing disproportionate slowdowns.

This dynamic is particularly damaging in batch-heavy environments where job streams are carefully balanced to meet processing windows. CPU-focused optimization may shorten individual job runtimes while extending overall batch completion due to increased contention. Without holistic analysis, teams continue to pursue CPU reductions, unaware that they are eroding the very throughput they seek to improve.

Latency Metrics and the Fragmentation of Execution Paths

Transaction latency is another metric frequently targeted in modernization efforts, especially for customer-facing workloads. Reducing response times is intuitively associated with better user experience and system efficiency. In mainframe environments, however, latency metrics often capture only a narrow slice of execution behavior.

To meet latency targets, teams may refactor transaction logic to minimize synchronous processing. This can involve deferring work to asynchronous routines, splitting transactions into multiple stages, or bypassing validation steps deemed non-critical. These changes often succeed in reducing measured response times for individual transactions, yet they fragment execution paths across multiple components and processing phases.

The fragmentation introduces new coordination overhead. Deferred processing must be tracked, retried, and reconciled. Error handling becomes more complex, and failure modes multiply. While front-end latency improves, backend throughput may suffer as asynchronous workloads accumulate and contend for shared resources.
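A toy queueing model makes the backlog dynamic visible (a Python sketch with invented numbers, not a model of any real workload): when deferred work arrives faster than the shared backend can drain it, front-end latency looks excellent while the queue grows without bound.

```python
# Hypothetical sketch: deferring work improves measured front-end latency
# while the deferred queue accumulates whenever arrivals outpace drain
# capacity. All constants are illustrative.
from collections import deque

DEFERRED_COST = 4    # work units moved off the synchronous path per transaction
DRAIN_PER_TICK = 30  # backend capacity per tick, shared with other workloads
ARRIVALS_PER_TICK = 10

queue = deque()
backlog_history = []

for tick in range(100):
    for _ in range(ARRIVALS_PER_TICK):
        # The front end now does almost nothing inline (latency metric
        # improves), and the remaining work lands on the async queue.
        queue.append(DEFERRED_COST)
    capacity = DRAIN_PER_TICK
    while queue and capacity >= queue[0]:
        capacity -= queue.popleft()
    backlog_history.append(len(queue))

# Arrivals add 40 units/tick but the backend drains only 30:
# the transaction-boundary metric reports success while the backlog climbs.
print(backlog_history[-1])   # 300
```

The latency dashboard measures the loop's first half; the failure mode lives in `backlog_history`, which no transaction-level metric reports.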

Latency metrics rarely account for these downstream effects. They report success at the transaction boundary while obscuring the growing backlog behind it. Over time, systems optimized for latency become brittle under sustained load, exhibiting unpredictable performance degradation that is difficult to diagnose. This tradeoff highlights the limits of optimizing responsiveness without considering throughput, a tension explored in analyses of throughput versus responsiveness monitoring.

When latency becomes a target, it ceases to represent overall performance health. It instead drives architectural decisions that privilege immediate response over sustainable processing capacity.

Batch Window Compression and Hidden Contention

Batch window compression is a common modernization objective in mainframe environments that support continuous or near-continuous online operations. Shortening batch windows promises greater availability and flexibility, enabling systems to process data with less disruption to online workloads. Metrics related to batch duration and completion time are therefore heavily emphasized.

To achieve these targets, teams may parallelize batch jobs, adjust scheduling priorities, or optimize file access patterns. While these techniques can reduce measured batch durations, they often introduce hidden contention. Parallel jobs may compete for the same datasets or database resources, increasing lock contention and I/O wait times. Scheduling changes may starve lower-priority processes that perform critical housekeeping functions.
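The contention tradeoff can be sketched with simple arithmetic (hypothetical job counts and durations; real batch schedules are far messier): parallelizing jobs that share an exclusive dataset shortens the measured window, but the serialized lock sections become the floor, and total time spent waiting on the lock grows with the degree of parallelism.

```python
# Hypothetical model: parallelizing batch jobs that share an exclusive
# dataset. The batch-window metric improves while aggregate lock wait,
# which the metric never reports, grows sharply.

JOBS = 8
PARALLEL_WORK = 40   # minutes of independent work per job (illustrative)
LOCK_WORK = 10       # minutes of exclusive dataset access per job

# Serial schedule: jobs run one after another; nobody waits on the lock.
serial_window = JOBS * (PARALLEL_WORK + LOCK_WORK)

# Fully parallel schedule: independent work overlaps, but the exclusive
# sections still execute one at a time behind the lock.
parallel_window = PARALLEL_WORK + JOBS * LOCK_WORK

# Aggregate lock wait: the i-th job queues behind i earlier lock holders.
lock_wait = sum(i * LOCK_WORK for i in range(JOBS))

print(serial_window, parallel_window, lock_wait)   # 400 120 280
```

The window metric shows a 400-to-120 improvement; the 280 minutes of new lock wait appear nowhere, yet they define how close the system now runs to its contention threshold.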

Because batch window metrics focus on completion time rather than resource interaction, these side effects are not immediately visible. The batch window appears shorter, yet the system operates closer to its contention thresholds. Minor variations in data volume or workload timing can trigger cascading delays or failures.

This effect is amplified when batch optimization is performed without comprehensive analysis of data access patterns. For example, reducing the execution time of one job may increase contention for shared datasets used by others. Over time, the batch ecosystem becomes less tolerant of change, even as metrics suggest improvement. This pattern mirrors issues identified in studies of noisy query contention patterns, where localized optimization amplifies global instability.

Throughput Degradation from Exception Handling Optimization

Exception handling logic is often targeted during performance optimization because it is perceived as overhead. Metrics may highlight the frequency or cost of exception paths, prompting teams to streamline error handling to reduce execution time. In legacy systems, where exception logic evolved alongside business rules, this optimization can have unintended consequences.

Simplifying exception handling may reduce the cost of rare error paths, improving average performance metrics. However, it can also remove safeguards that prevent error conditions from propagating. When exceptions occur, they may now trigger broader failures or require more expensive recovery actions. The system appears faster under normal conditions but becomes significantly slower and less predictable when stressed.

Metrics focused on average performance fail to capture this degradation. They reward the elimination of perceived inefficiencies without accounting for worst-case behavior. Over time, systems optimized in this way exhibit sharp performance cliffs when encountering abnormal conditions, undermining throughput during peak demand or failure scenarios.
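The average-versus-tail blind spot is easy to demonstrate numerically (a Python sketch with invented latency samples): removing a safeguard makes the mean improve while a tail percentile, which the average-based metric never computes, degrades badly.

```python
# Hypothetical sketch: streamlined exception handling improves the average
# while worst-case recovery cost explodes. Mean-based metrics reward the
# change; a tail percentile would have flagged it. Sample values invented.

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[int(p / 100 * (len(ordered) - 1))]

# Before: every call pays a small safeguard cost; errors recover cheaply.
before = [12] * 990 + [30] * 10    # ms: guarded path, contained errors

# After: safeguards removed. The normal path is faster, but the 1% of
# error cases now escalate into expensive recovery actions.
after = [10] * 990 + [200] * 10    # ms: fast path, costly failures

print(sum(before) / len(before), sum(after) / len(after))  # 12.18 11.9 -- mean "improves"
print(percentile(before, 99.5), percentile(after, 99.5))   # 30 200 -- tail degrades
```

An optimization program tracking only the first line of output will keep approving changes of exactly this shape.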

The performance impact of such changes is often only recognized after incidents, when postmortems reveal that exception paths were altered to satisfy optimization goals. This highlights the danger of treating performance metrics as absolute targets rather than contextual indicators, especially in systems where reliability and throughput are tightly coupled.

Why Performance Metrics Lose System-Level Meaning

The recurring pattern across performance optimization efforts in mainframe environments is the gradual decoupling of metrics from system-level outcomes. Early optimizations yield genuine gains, reinforcing confidence in the measurement framework. As targets become more aggressive, teams resort to changes that satisfy metrics while shifting costs elsewhere in the system.

This erosion of meaning is not due to flawed metrics alone, but to their application without sufficient system context. Performance in mainframe systems is emergent, shaped by interactions that cannot be captured by single-dimensional indicators. When these indicators are elevated to targets, Goodhart’s Law ensures that optimization behavior will eventually undermine the property being measured.

Recognizing this dynamic is critical for modernization efforts that seek sustainable improvement. Performance metrics remain valuable as signals, but only when interpreted through an understanding of dependencies, contention, and execution flow. Without that understanding, performance optimization becomes an exercise in moving bottlenecks rather than removing them, delivering impressive metrics alongside declining throughput and resilience.

Hidden Risk Introduced by Compliance-Oriented Refactoring Metrics

Compliance requirements introduce a distinct class of pressure into legacy modernization efforts. Unlike performance or quality initiatives, compliance-driven programs are often anchored to externally defined criteria that carry regulatory or audit consequences. Metrics related to security findings, control coverage, data handling conformity, and remediation counts are introduced to demonstrate alignment with mandated standards. In mainframe environments, these metrics are frequently applied retroactively to systems that were never designed to satisfy modern compliance frameworks.

As with other metric-driven initiatives, the problem emerges when compliance indicators are treated as definitive measures of system safety rather than partial signals. Once compliance metrics become targets, engineering behavior adapts to satisfy audit expectations, sometimes at the expense of architectural integrity. In legacy systems, where logic paths, data lineage, and exception handling are deeply intertwined, this adaptation can introduce new forms of risk that remain invisible to the very metrics intended to prevent them.

Security Finding Counts and Superficial Risk Reduction

One of the most common compliance metrics in modernization programs is the number of identified and resolved security findings. Static analysis tools, scanning frameworks, and rule-based detectors generate lists of vulnerabilities that are tracked, prioritized, and closed to demonstrate progress. In principle, reducing the number of findings should correlate with improved security posture. In practice, once remediation counts become targets, the relationship weakens.

In mainframe environments, many reported findings relate to legacy patterns that are technically non-compliant but operationally constrained. For example, shared service programs may trigger repeated findings across multiple contexts, or legacy input validation logic may not align cleanly with modern threat models. Under metric pressure, teams often pursue the fastest path to closure. This may involve suppressing findings, narrowing detection rules, or applying minimal changes that silence alerts without altering execution behavior.
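A deliberately minimal sketch shows why the count is gameable (finding IDs are hypothetical): a suppression list drives the reported number toward target while the underlying exposure set is untouched.

```python
# Hypothetical sketch: suppression reduces the *reported* finding count
# (the metric) without changing the actual exposure set. IDs are invented.

findings = {"F1", "F2", "F3", "F4", "F5"}    # raw scanner output
suppressed = {"F3", "F4", "F5"}              # closed via rule narrowing, not fixes

reported = findings - suppressed
print(len(reported), len(findings))   # 2 5 -- metric improves, exposure does not
```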

While these actions reduce reported risk, they can obscure genuine exposure. More concerning is the way remediation efforts can alter code paths without full understanding of downstream impact. Security-related refactoring may introduce additional validation layers, logging, or exception handling that affect performance and control flow. If these changes are scoped narrowly to satisfy specific findings, their interaction with existing logic may not be fully analyzed.

Over time, the metric suggests steady improvement while the system accumulates subtle behavioral changes. The security posture appears stronger on paper, yet the system may become more fragile due to increased complexity in critical paths. This pattern reflects a broader challenge in managing static code security findings when metrics incentivize closure over comprehension.

Data Handling Metrics and Unintended Exposure Paths

Compliance initiatives frequently introduce metrics focused on data handling. These may include counts of sensitive fields protected, instances of encryption applied, or paths audited for proper access control. In legacy mainframe systems, where data reuse is pervasive and implicit contracts are common, applying such metrics is inherently complex.

When data protection metrics become targets, teams may implement changes that satisfy formal criteria without addressing how data actually flows through the system. Encryption may be added at specific access points while leaving intermediate transformations untouched. Masking logic may be applied at output boundaries without considering internal reuse. These changes improve metric scores but can create inconsistencies in how data is handled across execution paths.

More subtly, compliance-driven refactoring can introduce new exposure paths. For example, adding logging for audit purposes may inadvertently capture sensitive data in clear text. Introducing data validation layers may duplicate data into temporary structures with different access controls. Because compliance metrics typically track whether controls exist rather than how they interact, these side effects remain unmeasured.
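The audit-logging example can be sketched directly (the record layout and masking rule are hypothetical): a naive audit hook writes the record verbatim, capturing a card number in clear text; masking at the logging boundary closes an exposure that a control-presence metric would never have measured, since the control "audit logging exists" was satisfied either way.

```python
# Hypothetical sketch: an audit log added to satisfy a control-coverage
# metric captures a sensitive field in clear text. Masking at the logging
# boundary fixes the exposure the metric never saw. Layout is invented.
import re

def mask_pan(text: str) -> str:
    """Mask all but the last four digits of a 16-digit card number."""
    return re.sub(r"\b\d{12}(\d{4})\b", r"************\1", text)

record = "payment account=1234567890123456 amount=120.00"
unsafe_entry = record              # what a naive audit hook would write
safe_entry = mask_pan(record)      # what should reach the audit trail

print(safe_entry)                  # payment account=************3456 amount=120.00
```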

In multi-decade codebases, data semantics are often encoded implicitly in program structure rather than documentation. Refactoring data handling logic without full lineage analysis risks breaking these semantics. The system continues to meet compliance metrics while drifting further from a coherent data model. This disconnect highlights the limitations of metrics that focus on control presence rather than data behavior.

Control Coverage Metrics and the Proliferation of Conditional Logic

Control coverage metrics aim to demonstrate that required checks and safeguards are applied consistently across the system. These metrics often track whether specific validations, authorizations, or logging actions are present in relevant code paths. In modernization programs, increasing control coverage is frequently framed as evidence of reduced risk.

In legacy mainframe systems, achieving higher coverage often involves inserting additional conditional logic into existing programs. Each new control introduces branches that interact with legacy conditions, error handling, and recovery logic. While coverage metrics improve, the complexity of execution paths increases. This added complexity can obscure the original business logic and make reasoning about behavior more difficult.

As control logic accumulates, the likelihood of unintended interactions grows. Edge cases that were previously rare may become more common due to additional branching. Error paths may intersect in unexpected ways, complicating recovery scenarios. These effects are rarely captured by coverage metrics, which treat each control as an independent success.

The result is a system that appears more controlled but behaves less predictably. Engineers may struggle to trace how a transaction flows through layers of controls, especially when documentation is incomplete. The metric-driven pursuit of coverage inadvertently undermines the clarity and stability that controls were meant to provide.

This pattern is particularly problematic when controls are applied uniformly without regard to execution context. In mainframe environments, the same program may serve multiple business processes with different risk profiles. Applying identical controls everywhere satisfies metrics but ignores contextual differences, increasing the risk of over-control and unintended behavior.

Audit Readiness Metrics and Architectural Drift

Audit readiness is often measured through indicators such as remediation completeness, documentation coverage, or alignment with prescribed standards. These metrics are designed to demonstrate that systems can withstand external scrutiny. In legacy environments, achieving audit readiness frequently requires retrofitting documentation and controls onto systems that evolved organically.

When audit metrics become targets, teams may prioritize changes that are easily demonstrable over those that improve architectural coherence. Documentation may be updated to reflect desired states rather than actual behavior. Interfaces may be formalized on paper while remaining loosely enforced in code. These actions improve audit scores but widen the gap between documented and operational reality.

Architectural drift accelerates as a result. The system’s conceptual model diverges from its implementation, making future change riskier. Engineers rely on documentation that no longer accurately describes execution behavior, increasing the likelihood of errors during maintenance or further modernization.

Because audit metrics rarely capture this divergence, the drift remains hidden. The organization appears compliant while the system becomes harder to understand and evolve. This illustrates how compliance-oriented metrics can inadvertently erode the very transparency they are intended to ensure.

Why Compliance Metrics Create Invisible Risk in Legacy Systems

The hidden risk introduced by compliance-oriented refactoring metrics stems from a common source. Metrics focus on observable artifacts such as findings closed, controls added, or documents produced. Legacy systems, however, derive their behavior from complex interactions that are not easily observable. When metrics substitute for understanding, Goodhart’s Law ensures that optimization behavior will target appearances rather than substance.

In mainframe environments, this substitution is especially dangerous because small changes can have outsized effects. A control added to satisfy a metric may alter execution timing, data handling, or error propagation in ways that remain undetected until failure. By the time issues surface, they are often disconnected from the original compliance initiative.

Recognizing this dynamic does not diminish the importance of compliance. It underscores the need to treat compliance metrics as partial indicators rather than definitive proof of safety. Without system-level insight into how refactoring changes interact with legacy behavior, compliance-driven modernization risks creating new vulnerabilities while claiming success.

Dependency Blindness as the Core Enabler of Goodhart Effects

Across legacy modernization initiatives, metric distortion does not arise solely from poor metric selection. It is enabled by a more fundamental limitation: the inability to see how behavior propagates through the system. In mainframe environments, dependencies span programs, datasets, job schedules, transaction flows, and infrastructure layers. These dependencies define how change actually behaves once deployed, yet they are rarely visible in a unified way.

When dependency awareness is incomplete, metrics are interpreted in isolation. Improvements in one area are assumed to be beneficial without understanding their downstream effects. This blind spot creates ideal conditions for Goodhart’s Law. As soon as metrics become targets, optimization behavior exploits what is visible while unintentionally destabilizing what is hidden. Dependency blindness does not merely amplify metric distortion; it makes it structurally unavoidable in complex legacy systems.

Hidden Control Flow Dependencies and Metric Misinterpretation

Control flow in mainframe systems is rarely confined to a single program. Execution paths traverse COBOL modules, call external routines, branch through configuration-driven logic, and re-enter shared services. JCL orchestrates execution order across jobs, while transaction managers route requests dynamically based on runtime conditions. Much of this control flow is implicit rather than explicit, inferred through convention rather than formal structure.

Metrics that focus on individual programs or transactions assume that control flow boundaries align with code boundaries. In practice, they do not. A change that optimizes one program’s execution path may alter the timing or invocation frequency of downstream components. Because these dependencies are not visible in the metric model, the reported improvement is misinterpreted as system-wide benefit.

When such metrics become targets, teams optimize aggressively within the visible boundary. Control flow is refactored to reduce measured complexity or latency without understanding how execution paths are reused elsewhere. Over time, the control flow graph becomes increasingly fragmented, with logic distributed across modules in ways that satisfy metrics but obscure behavior.

This fragmentation undermines diagnostic capability. When incidents occur, tracing execution paths requires reconstructing control flow from partial evidence. Engineers struggle to correlate symptoms with changes because the metric-driven refactoring obscured the original structure. The metric continues to indicate success, even as operational understanding degrades.

The absence of comprehensive control flow visibility is therefore not a secondary issue. It is a primary reason metrics lose meaning. Without knowing how execution actually unfolds across the system, measurement cannot distinguish between local optimization and systemic degradation.

Data Flow Blindness and the Illusion of Safe Change

Data flow dependencies are among the most underappreciated sources of risk in legacy systems. Mainframe applications often share datasets across batch and online workloads, reuse record layouts through copybooks, and depend on implicit data invariants enforced by convention rather than schema. These flows define how information moves and transforms across the system.

Metrics rarely capture this dimension. Code quality indicators focus on structure. Performance metrics focus on resource consumption. Compliance metrics focus on control presence. None of these reveal how data flows across components or how changes alter data semantics downstream.

When modernization metrics become targets, teams refactor code that appears self-contained while unknowingly modifying data flow characteristics. A field transformation optimized for one consumer may break assumptions in another. A performance improvement that reorders processing may alter data availability timing. Because data flow dependencies are invisible, these changes appear safe according to metrics.

The resulting failures are often subtle. Data inconsistencies emerge slowly, reconciliation processes drift, and reports lose accuracy without triggering immediate alarms. By the time issues are detected, they are disconnected from the original metric-driven change.

This dynamic illustrates why data flow blindness is a powerful enabler of Goodhart effects. Metrics reward visible improvements while concealing changes to data behavior that define system correctness. Without insight into how data propagates, optimization decisions are made on incomplete information, guaranteeing distortion once metrics are enforced.

Understanding this problem requires more than static inspection. It requires analysis that traces data across execution contexts, an approach discussed in work on interprocedural data flow. Without such analysis, metrics cannot reliably guide modernization decisions.

Cross-Module Dependency Chains and Expanding Blast Radius

Legacy systems are characterized by long dependency chains that span modules, jobs, and subsystems. A single change may affect dozens of downstream components through shared services, reused utilities, or common data structures. These chains define the true blast radius of change, yet they are rarely represented in metric frameworks.

When metrics are applied at the module or job level, they implicitly assume that dependencies are shallow or well understood. In multi-decade codebases, this assumption is false. Dependency chains have grown organically, often without documentation. Engineers rely on experience and caution to manage them.

Metric-driven modernization disrupts this balance. When targets incentivize aggressive refactoring, teams make changes without full awareness of downstream impact. A refactored utility may now be invoked by more contexts than before. A consolidated function may become a single point of failure. The blast radius expands even as metrics improve.
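The blast-radius expansion can be sketched as a reverse reachability query over a dependency graph (module names and edges are invented; real mainframe graphs have thousands of nodes): consolidating two utilities reduces the module count, a metric win, while doubling the set of components affected by any change to the survivor.

```python
# Hypothetical sketch: blast radius as the set of transitive dependents in
# a caller -> callee dependency graph. Module names are illustrative.
from collections import deque

deps = {
    "BILLING": ["DATEUTIL"],
    "PAYROLL": ["DATEUTIL"],
    "REPORTS": ["FMTUTIL"],
    "ARCHIVE": ["FMTUTIL"],
}

def blast_radius(changed: str) -> set[str]:
    """Modules transitively affected by changing `changed` (reverse BFS)."""
    reverse: dict[str, list[str]] = {}
    for caller, callees in deps.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(sorted(blast_radius("DATEUTIL")))   # ['BILLING', 'PAYROLL']

# "Deduplication" merges FMTUTIL into DATEUTIL: one fewer module (metric
# improves), but every former FMTUTIL caller now sits in DATEUTIL's radius.
deps["REPORTS"] = ["DATEUTIL"]
deps["ARCHIVE"] = ["DATEUTIL"]
print(sorted(blast_radius("DATEUTIL")))
# ['ARCHIVE', 'BILLING', 'PAYROLL', 'REPORTS']
```

No per-module indicator records the second output; it is precisely the quantity dependency mapping has to make visible.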

Because dependency chains are not visible, this expansion remains unmeasured. The system appears cleaner and more efficient according to indicators, while the consequences of failure grow more severe. This is particularly dangerous in mainframe environments, where recovery from widespread failure is costly and slow.

Over time, the organization experiences a paradox. Metrics suggest reduced risk, yet incidents become harder to isolate and resolve. Each failure affects more components, and root cause analysis becomes more complex. This paradox is a direct result of optimizing without dependency awareness.

The importance of understanding dependency chains has been emphasized in discussions of dependency impact visualization. Without such visibility, metrics provide a false sense of safety that erodes resilience.

Temporal Dependencies and the Misreading of Stability

Not all dependencies are structural. Many are temporal, defined by execution order, timing assumptions, and scheduling behavior. Batch jobs rely on data produced by earlier jobs. Online transactions assume that certain updates have completed. Cleanup processes expect resources to be released at specific times. These temporal dependencies are critical to system stability.

Metrics rarely account for timing relationships. Performance indicators measure duration and latency, but they do not capture sequencing assumptions. When optimization targets encourage changes to execution timing, temporal dependencies are easily violated.

For example, reducing batch job duration may cause a downstream job to start earlier than expected, accessing data before it is fully prepared. Optimizing transaction latency may increase concurrency, triggering contention in processes designed for serialized access. These effects may not immediately manifest as failures, but they introduce race conditions and intermittent errors.
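The ordering violation is checkable in principle, even though duration metrics never perform the check. A minimal sketch (job names, start times, and durations are all invented) flags a downstream job scheduled before its producer's worst-case completion:

```python
# Hypothetical sketch: a temporal-dependency check that duration metrics
# omit. Shortening the upstream job's *typical* runtime tempts schedulers
# to release the consumer before the producer's worst case has finished.

schedule = {  # job -> (start_minute, worst_case_duration), illustrative
    "EXTRACT": (0, 60),
    "LOAD":    (60, 30),   # safely assumes EXTRACT is done by minute 60
}

def violates_ordering(producer: str, consumer: str) -> bool:
    p_start, p_dur = schedule[producer]
    c_start, _ = schedule[consumer]
    return c_start < p_start + p_dur

print(violates_ordering("EXTRACT", "LOAD"))   # False -- safe today

# "Optimization": LOAD moves earlier because EXTRACT usually finishes in
# 45 minutes now. On a heavy day EXTRACT takes its worst-case 60 again.
schedule["LOAD"] = (45, 30)
print(violates_ordering("EXTRACT", "LOAD"))   # True -- latent race condition
```

The batch-duration metric improved on every day the race did not fire, which is exactly why the instability stays invisible until it does.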

Because metrics focus on averages and totals, temporal instability remains invisible. The system appears stable until edge cases accumulate. When failures occur, they are difficult to reproduce and diagnose because they depend on timing interactions rather than deterministic logic.

This form of dependency blindness is especially pernicious because it undermines confidence in the system. Engineers lose trust in test results and struggle to predict behavior under load. Yet metrics continue to signal improvement, reinforcing the illusion of control.

Addressing temporal dependencies requires understanding execution flow over time, not just code structure. Without this understanding, performance and efficiency metrics will continue to misrepresent stability, driving optimization behavior that erodes predictability.

Why Dependency Blindness Makes Metric Failure Inevitable

Dependency blindness is not a tooling flaw but a structural condition of legacy systems. Decades of incremental change have produced environments where dependencies are numerous, implicit, and poorly documented. Metrics offer a tempting shortcut, providing numeric clarity where understanding is difficult to achieve.

Goodhart’s Law explains what happens next. Once metrics become targets, behavior adapts to satisfy what is measured. In the absence of dependency awareness, this adaptation inevitably exploits blind spots. Optimization improves indicators while destabilizing unseen relationships.

This dynamic makes metric failure predictable rather than accidental. As long as dependencies remain invisible, metrics cannot reliably represent system health under pressure. Recognizing dependency blindness as the root enabler of Goodhart effects reframes the modernization challenge. The problem is not that metrics exist, but that they are applied without sufficient understanding of the systems they attempt to describe.

Until modernization efforts address this blind spot, metric-driven initiatives in mainframe environments will continue to produce impressive numbers alongside growing operational risk.

Smart TS XL and System-Level Insight Beyond Metric Optimization

The repeated failure of modernization metrics in mainframe environments points to a gap that cannot be closed through better targets alone. Metrics fail not because they are inaccurate in isolation, but because they are detached from system behavior. Addressing Goodhart effects therefore requires shifting focus from metric optimization to structural understanding. This shift is particularly critical in legacy systems, where behavior emerges from dependencies that span languages, platforms, and execution contexts.

Smart TS XL is positioned precisely at this intersection between measurement and understanding. Rather than replacing metrics with new ones, it provides system-level insight that explains why metrics change and what those changes actually mean. By modeling control flow, data flow, and dependency propagation across legacy and cross-platform environments, Smart TS XL enables organizations to interpret metrics as signals within a broader behavioral context rather than as targets that drive distortion.

Moving from Metric Chasing to Behavioral Interpretation

Traditional modernization programs often treat metrics as objectives to be achieved. Complexity must be reduced, performance must improve, risks must be lowered, and progress must be demonstrated numerically. Smart TS XL reframes this approach by treating metrics as observations that require interpretation rather than optimization. This distinction is subtle but fundamental.

Instead of asking whether a metric has improved, Smart TS XL supports analysis of why it changed and what other parts of the system were affected as a result. For example, a reduction in reported complexity can be examined alongside changes in call graphs, execution paths, and dependency density. If complexity decreases while dependency fan-out increases, the apparent improvement is revealed as a tradeoff rather than a net gain.
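The kind of cross-check described above can be sketched in a few lines (this is an illustrative snapshot comparison, not Smart TS XL's API; field names and values are invented): compare two metric snapshots and flag modules whose complexity fell while dependency fan-out rose, marking the "improvement" as a tradeoff to investigate rather than a win to report.

```python
# Hypothetical sketch of the tradeoff check described above: flag modules
# where a complexity "improvement" coincides with rising dependency
# fan-out. Names and numbers are illustrative, not a real tool's output.

def flag_tradeoffs(before: dict, after: dict) -> list[str]:
    flagged = []
    for module in before:
        complexity_down = after[module]["complexity"] < before[module]["complexity"]
        fanout_up = after[module]["fan_out"] > before[module]["fan_out"]
        if complexity_down and fanout_up:
            flagged.append(module)
    return flagged

before = {"CUSTMGMT": {"complexity": 42, "fan_out": 3},
          "BILLING":  {"complexity": 30, "fan_out": 5}}
after  = {"CUSTMGMT": {"complexity": 25, "fan_out": 9},   # split into helpers
          "BILLING":  {"complexity": 28, "fan_out": 5}}   # genuine simplification

print(flag_tradeoffs(before, after))   # ['CUSTMGMT']
```

BILLING improved without structural cost and passes silently; CUSTMGMT's apparent gain is surfaced as a question rather than counted as progress.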

This behavioral interpretation is especially valuable in mainframe environments, where local improvements often conceal global consequences. Smart TS XL correlates metric movement with structural changes, allowing teams to identify when optimization behavior is producing Goodhart effects. Rather than discouraging measurement, it restores meaning to metrics by grounding them in system reality.

This approach aligns with broader discussions of software intelligence platforms that emphasize understanding over reporting. By contextualizing metrics within dependency-aware models, Smart TS XL helps organizations avoid the trap of optimizing indicators that no longer describe system health.

System-Wide Dependency Mapping as a Counterbalance to Goodhart’s Law

Goodhart’s Law thrives in environments where dependencies are hidden. When teams cannot see how changes propagate, they optimize what is visible and inadvertently destabilize what is not. Smart TS XL addresses this imbalance by constructing comprehensive dependency maps that span programs, data stores, batch jobs, and transaction flows.

These maps provide a shared reference point for evaluating change. Before acting on a metric-driven initiative, teams can assess which components are connected, how data moves, and where execution paths converge. This visibility makes it possible to anticipate side effects that metrics alone would obscure.

For example, performance optimization efforts can be evaluated not only in terms of local gains but also in terms of their impact on downstream jobs and shared resources. Compliance-driven refactoring can be assessed for its effect on control flow and exception propagation. Cross-platform migration steps can be analyzed for dependency expansion rather than just completion status.

By exposing these relationships, Smart TS XL reduces the incentive to game metrics. Optimization decisions become informed by potential impact rather than numeric targets. In this way, dependency mapping functions as a structural counterweight to Goodhart effects, ensuring that improvements reflect real system change.

The importance of such visibility has been highlighted in analyses of enterprise dependency mapping, where understanding relationships is shown to be critical for risk reduction. Smart TS XL operationalizes this insight in legacy modernization contexts.

Preserving Metric Meaning Through Impact-Aware Analysis

Metrics lose meaning when their movement cannot be explained. Smart TS XL restores interpretability by linking metric changes to specific structural transformations. This impact-aware analysis allows teams to distinguish between healthy optimization and metric distortion.

When a code quality metric improves, Smart TS XL can reveal whether the improvement corresponds to reduced coupling, clearer execution paths, or simplified data flow. If the improvement is instead driven by mechanical restructuring that increases fragmentation, this discrepancy becomes visible. Metrics regain their diagnostic value because they are no longer interpreted in isolation.

The same principle applies to performance and compliance metrics. Rather than accepting improvements at face value, Smart TS XL enables examination of how changes affect throughput, contention, and failure modes. Compliance-related refactoring can be assessed for its impact on execution complexity and data handling consistency, preventing the introduction of hidden risk.

This interpretive capability is essential in environments where metrics persist over long modernization timelines. As systems evolve, the meaning of a metric can drift. Impact-aware analysis anchors interpretation in current system structure, preventing outdated metrics from driving inappropriate decisions.

Such an approach complements established practices in impact analysis for testing, extending them beyond validation into strategic modernization decision making.

Supporting Decision-Making Under Metric Pressure

Modernization initiatives operate under constant pressure to demonstrate progress. Metrics are often required to justify investment, guide prioritization, and satisfy oversight expectations. Smart TS XL does not remove this pressure, but it equips decision makers to respond to it without sacrificing system integrity.

By providing evidence of how changes affect system behavior, Smart TS XL enables more nuanced narratives around progress. Instead of reporting isolated metric improvements, organizations can explain tradeoffs, risks mitigated, and dependencies stabilized. This shifts the conversation from numeric targets to informed decision-making.

In practice, this means that teams can resist counterproductive optimization without appearing resistant to measurement. They can demonstrate why certain metric movements are misleading and propose alternative actions grounded in system insight. This capability is particularly valuable in mainframe environments, where change aversion is often reinforced by opaque risk.

Smart TS XL thus serves as an enabler of responsible modernization under metric pressure. It allows organizations to engage with metrics critically rather than reactively, preserving their usefulness while avoiding Goodhart-driven distortion.

Why System Insight Outlasts Metric Targets

Metrics are inherently transient. Targets change, priorities shift, and measurement frameworks evolve. System insight, by contrast, accumulates value over time. Each analysis deepens understanding of how the system behaves and how it responds to change.

Smart TS XL is built around this enduring asset. By building and maintaining a living model of system structure and behavior, it supports modernization efforts that remain robust even as metrics evolve. Goodhart’s Law becomes less threatening because optimization behavior is guided by understanding rather than by numeric thresholds alone.


In legacy environments, where modernization is a multi-year journey, this distinction is decisive. Metrics will come and go, but the need to understand dependencies, flows, and impact remains constant. Smart TS XL aligns modernization strategy with this reality, offering a way to move beyond metric optimization toward sustainable system evolution.

Measuring What Still Matters in Legacy Modernization

The repeated failure of metric-driven modernization does not imply that measurement itself is futile. It reveals that many commonly used indicators are poorly aligned with the properties that actually determine system resilience, change safety, and long-term viability. In legacy mainframe environments, what matters most is rarely captured by surface-level metrics. Instead, it resides in structural characteristics that remain stable even under optimization pressure.

Measuring what still matters requires reframing the role of metrics from targets to lenses. Rather than asking whether a number improved, the focus shifts to whether the system’s ability to absorb change, recover from failure, and evolve predictably has increased. These qualities are harder to quantify, but they are also far more resistant to Goodhart effects. In legacy modernization, durable progress depends on indicators that reflect system behavior rather than compliance with predefined thresholds.

Change Propagation Scope as a Stability Indicator

One of the most meaningful indicators in legacy systems is the scope of change propagation. When a modification is made to a program, dataset, or job, the number of downstream components affected reveals far more about system stability than isolated quality scores. A system in which small changes have limited, predictable impact is fundamentally healthier than one where minor modifications ripple unpredictably across the landscape.

Unlike traditional metrics, change propagation scope does not incentivize superficial optimization. Reducing it requires structural improvement, such as clarifying interfaces, reducing unnecessary coupling, and isolating responsibilities. These changes are difficult to fake and tend to produce lasting benefits. As a result, this indicator remains meaningful even under measurement pressure.

In multi-decade mainframe environments, uncontrolled propagation is often the primary source of modernization risk. Engineers hesitate to change code not because it is complex in isolation, but because they cannot confidently predict what will be affected. Measuring propagation scope directly addresses this concern by making impact explicit.

This concept aligns closely with practices described in measuring code volatility impact, where volatility is evaluated in terms of downstream effect rather than frequency alone. By focusing on how widely change spreads, organizations gain insight into the true cost and risk of evolution.

Tracking propagation scope over time reveals whether modernization efforts are actually reducing systemic fragility. A shrinking blast radius indicates progress that cannot be easily gamed, making it a powerful countermeasure to Goodhart-driven distortion.

Dependency Density and Structural Concentration

Another property that continues to matter under pressure is dependency density. This refers to how many responsibilities and relationships converge on a single component. High dependency density signals structural concentration, where failure or change in one area has disproportionate consequences.

Legacy systems often evolve toward higher concentration as shared utilities, data structures, and services accumulate responsibilities over time. Traditional metrics may overlook this trend because individual components appear small or simple. Dependency density exposes the hidden risk by highlighting where the system is structurally brittle.
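Surfacing this kind of concentration can be as simple as counting fan-in per component. The edge list and names below are hypothetical, chosen only to show how a small shared utility can dominate the dependency structure.

```python
from collections import Counter

# Illustrative sketch: dependency density as fan-in per component.
# Edges are (dependent, dependency) pairs; all names are invented.

edges = [
    ("BATCH010", "DATEUTIL"), ("BATCH020", "DATEUTIL"),
    ("CICS-TXN1", "DATEUTIL"), ("RPTJOB", "DATEUTIL"),
    ("BATCH010", "CUSTCOPY"), ("CICS-TXN1", "CUSTCOPY"),
]

fan_in = Counter(dependency for _, dependency in edges)

# Components with the highest fan-in are the structural concentration
# points: a change or failure there touches the most dependents.
for component, count in fan_in.most_common():
    print(f"{component}: {count} dependents")
```

Here DATEUTIL looks trivially small by any size or complexity metric, yet four separate workloads converge on it; that is the hidden risk dependency density is meant to expose.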

Measuring dependency density discourages cosmetic refactoring. Splitting code without reducing dependencies does not improve the indicator. Genuine improvement requires redistributing responsibilities and clarifying boundaries. These actions align with long-term modernization goals and resist manipulation.

In mainframe environments, dependency density is especially relevant because shared components frequently underpin both batch and online workloads. Identifying and reducing over-concentration can significantly improve resilience and simplify future change.

This approach reflects insights from work on dependency concentration analysis, emphasizing that risk is often a function of structure rather than size or complexity alone. By tracking where dependencies cluster, organizations measure something that directly affects failure impact and recovery effort.

Mean Time to Recovery as a Behavioral Measure

Mean time to recovery is often treated as an operational metric, but in legacy modernization it serves as a powerful proxy for structural health. Recovery time reflects how understandable, observable, and controllable a system is under stress. Systems that recover quickly tend to have clearer execution paths, better isolation, and more predictable behavior.

Unlike many performance metrics, recovery time is difficult to optimize superficially. Improving it requires investments in clarity, tooling, and structural simplification. These changes typically reduce Goodhart effects because they improve real behavior rather than appearances.

In mainframe environments, recovery is often prolonged by hidden dependencies and opaque execution flow. Measuring recovery time exposes these weaknesses indirectly. If incidents take longer to resolve despite apparent metric improvement elsewhere, it signals that modernization is not addressing core issues.
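Placing recovery time next to the headline metric makes this divergence concrete. The figures below are invented for illustration, and the quality score stands in for whatever dashboard metric is being reported.

```python
from statistics import mean

# Illustrative sketch: mean time to recovery per period, tracked
# alongside a reported quality score. All numbers are invented.

incidents = {            # recovery times in minutes, by quarter
    "Q1": [45, 60, 30],
    "Q2": [50, 75, 90, 40],
}
quality_score = {"Q1": 68, "Q2": 81}   # the "improving" dashboard metric

for quarter in incidents:
    mttr = mean(incidents[quarter])
    print(f"{quarter}: quality={quality_score[quarter]}  MTTR={mttr:.2f} min")

# The quality score rose from 68 to 81 while MTTR rose from 45 to
# 63.75 minutes: the dashboard improved, but recovery behavior worsened.
```

This is the pattern described above: when the two trends move in opposite directions, the behavioral measure is the one telling the truth about modernization progress.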

The relationship between recovery and structure is explored in discussions of reduced mean time to recovery, where dependency simplification is shown to be central to operational resilience. Tracking recovery trends alongside structural change provides a grounded view of progress.

Because recovery time reflects actual operational experience, it remains meaningful even when other metrics are optimized. It captures the system’s ability to respond to the unexpected, a quality that cannot be fully anticipated or gamed.

Observability of Execution Paths Under Change

Another enduring indicator is the observability of execution paths when changes are introduced. This refers to how easily teams can trace what happens when a modification is deployed. High observability means execution paths are understandable, traceable, and explainable. Low observability indicates opacity, where behavior must be inferred through trial and error.

Metrics that focus on observability resist Goodhart effects because they depend on human experience rather than numeric thresholds. If engineers struggle to explain behavior after a change, observability is low regardless of what other metrics report.

In legacy systems, observability is often limited by fragmented logic and implicit control flow. Measuring improvements in traceability and clarity directly addresses this challenge. Tools and practices that illuminate execution paths reduce reliance on tribal knowledge and increase confidence in modernization decisions.
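One rough way to quantify traceability is to check what fraction of runtime-observed call edges the static model can explain. The edge sets below are hypothetical, and the resulting score is a sketch of the idea rather than an established measure.

```python
# Illustrative sketch: an observability score as the share of observed
# runtime call edges that appear in the static model. Names are invented.

static_model = {("PAYROLL", "TAXCALC"), ("PAYROLL", "PRINTRPT")}
observed = {("PAYROLL", "TAXCALC"), ("PAYROLL", "PRINTRPT"),
            ("PAYROLL", "LEGACYX")}    # dynamic call absent from the model

explained = observed & static_model
score = len(explained) / len(observed)
print(f"observability: {score:.0%}")   # one observed path is unexplained

for edge in sorted(observed - static_model):
    print("unexplained:", " -> ".join(edge))
```

Each unexplained edge is a place where behavior must currently be inferred through trial and error; driving the score upward requires genuinely illuminating execution paths, not restructuring code to satisfy a threshold.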

The role of observability in modernization has been discussed in the context of telemetry driven impact analysis, highlighting how visibility supports safer evolution. By treating observability as a first-class outcome, organizations focus on understanding rather than optimization.

This indicator remains robust under pressure because it cannot be satisfied through superficial change. Improved observability reflects genuine progress in making the system knowable and manageable.

Why These Measures Resist Goodhart’s Law

The common characteristic of these indicators is their resistance to manipulation. They measure properties that emerge from structure and behavior rather than from isolated artifacts. Improving them requires changes that align with the underlying goals of modernization, such as reduced fragility, increased clarity, and safer change.

Goodhart’s Law thrives where metrics are easy to optimize without altering reality. Measures like propagation scope, dependency density, recovery time, and observability are difficult to improve without real progress. As a result, they maintain their meaning even when tracked over long timelines.

In legacy mainframe environments, where modernization is incremental and risk tolerance is low, these measures provide a more reliable compass. They shift attention away from numeric targets and toward system qualities that determine whether modernization will succeed in practice.

By focusing on what still matters, organizations can measure progress without falling into the trap of metric-driven distortion. The result is a modernization strategy grounded in system behavior rather than in the illusion of control.

When Metrics Stop Measuring Reality

Legacy modernization in mainframe environments consistently exposes the same structural failure mode. Metrics that begin as helpful signals gradually lose their connection to system behavior once they are elevated to targets. Goodhart’s Law does not emerge as an abstract economic principle applied after the fact. It manifests directly in engineering decisions, refactoring strategies, performance tuning efforts, and cross-platform migration plans. The result is a widening gap between reported progress and operational reality.

What makes this failure particularly persistent in legacy systems is not poor intent or lack of discipline. It is the nature of the systems themselves. Decades of incremental change have produced architectures where behavior emerges from dependency networks rather than isolated components. Metrics that ignore this reality inevitably oversimplify. When pressure is applied, optimization behavior follows the metric rather than the system, producing improvements that are numerically convincing but structurally hollow.

Across code quality, performance, compliance, and migration initiatives, the same pattern repeats. Local optimization undermines global stability. Improvements in one dimension shift risk into another. Dependency blindness allows distortion to accumulate until incidents surface that metrics never predicted. By the time failures occur, the connection between cause and effect has often been erased by layers of metric-driven change.

The path forward is not to abandon measurement, but to demote it from its role as a decision driver. Metrics remain valuable as indicators, but only when interpreted through system-level understanding. Structural insight into control flow, data propagation, dependency concentration, and execution behavior restores meaning to numbers that would otherwise drift. In this context, progress is no longer defined by whether a metric moved, but by whether the system became more predictable, resilient, and understandable.

Legacy modernization succeeds when organizations recognize that what matters most cannot always be reduced to a dashboard. The systems that endure are those whose behavior can be explained, whose changes can be anticipated, and whose failures can be recovered from quickly. Metrics may support that goal, but they can never substitute for it.