Inherited Circuits, Realized Semantics: How Safety High-quality-Tuning Can Create Hidden Evasion Threat

June 29, 2026

14

High-quality-tuning is a course of that lets us steer a general-purpose giant language mannequin towards a particular activity by coaching it on focused examples. In cybersecurity, that is typically helpful for issues like classifying phishing emails, suspicious URLs, or PowerShell scripts. A fine-tuned mannequin can turn into far more helpful in a safety workflow as a result of it learns the language, construction, and labels that matter for that area.

In our newest analysis, we discovered that fine-tuning can enhance baseline classification habits whereas additionally introducing a brand new sort of brittleness. The fine-tuned mannequin performs higher on customary held-out examples however turns into extra susceptible to behavior-preserving variants of the identical underlying script. In different phrases, the mannequin seems to be stronger underneath customary analysis but turns into simpler to idiot underneath sensible transformations that protect what the code does.

Our work traces the habits to its mechanistic supply, offering insights and concrete suggestions for safety groups on the way to handle and monitor modifications launched through fine-tuning.

Overview

We studied malicious/benign PowerShell script classification utilizing a pure base + fine-tuned mannequin pair: Llama-3.1-8B-Instruct and Basis-Sec-8B-Instruct. Basis-Sec performs higher on the baseline classification activity (+4.7% accuracy), however it additionally develops transformation-sensitive misses that the bottom Llama mannequin doesn’t share. Basis-Sec was not explicitly fine-tuned for PowerShell classification, however for information of the cybersecurity area general.

The important thing consequence isn’t just that some obfuscation works. The fascinating discovering is mechanistic: the fine-tuned mannequin inherits the identical underlying classification circuit from the bottom mannequin, however fine-tuning modifications how later elements of the community interpret that circuit’s sign. In profitable evasion instances, the malicious proof is commonly nonetheless current internally. The failure occurs as a result of fine-tuned feed-forward parts can suppress, redirect, or invert that proof earlier than the ultimate resolution.

That offers us a sensible lesson: post-fine-tuning robustness isn’t just a matter of check accuracy. A mannequin can turn into extra correct on canonical examples whereas turning into extra brittle to transformations that safety groups ought to anticipate attackers to make use of.

Inherited Circuit, Specialised Semantics

Mechanistic interpretability is a set of instruments for asking how a mannequin computes a habits internally. As a substitute of treating the mannequin as a black field, we search for the particular parts that causally drive the output. In transformer fashions, these parts are sometimes consideration heads, MLP layers, and the residual stream, which is the working illustration handed from layer to layer.

For this undertaking, we used PowerShell classification as a concrete safety setting. PowerShell is a helpful case research as a result of many suspicious indicators should not malicious by themselves. Tokens like IEX, DownloadString, Invoke-WebRequest, and -EncodedCommand can seem in malicious scripts, however they will additionally seem in benign administrative code. A very good classifier can not merely memorize {that a} token is suspicious. It wants to make use of surrounding context.

We in contrast Basis-Sec in opposition to its Llama base mannequin with the query: Did safety fine-tuning create a brand new classification circuit, or did it reshape a circuit that was already current within the base mannequin?

Our causal interventions help the second reply. Basis-Sec’s classification route is inherited from Llama. The identical broad circuit skeleton is already current within the base mannequin (annotated as Layers [L] and a focus heads [H] within the following determine):

High-quality-tuning doesn’t seem to create a brand new PowerShell detector from scratch. As a substitute, it concentrates and specializes an inherited route. That specialization is helpful. It helps the mannequin classify canonical safety examples. But it surely additionally creates a sharper dependence on indicator-token semantics. The fine-tuned mannequin turns into extra delicate to the precise floor type of sure instructions and indicators.

Stress-Testing the Circuit

Normal analysis often asks whether or not the mannequin classifies held-out examples appropriately. That’s essential, however it isn’t sufficient for safety. Attackers don’t have to protect the precise floor type of a script. They solely have to protect the habits.

To check this hole, we constructed a three-tier evasion benchmark. Every benchmark row begins with a malicious seed script that the mannequin classifies appropriately. We then apply a behavior-preserving rewrite and ask whether or not the mannequin nonetheless classifies the variant as malicious. This retains the attribution clear: the mannequin correctly handles the unique script, so a miss on the reworked script could be tied to the transformation.

Every accepted variant should protect essential attributes akin to URLs, command targets, arguments, encoded-command equivalence the place related, and process-launch habits.

The three tiers are:

direct_v1: direct syntax-preserving rewrites
reconstructive_v2: runtime command or string reconstruction
case_mutation_v3: casing modifications that protect PowerShell semantics

Tier 1: Direct Rewrites

Direct rewrites change a visual command type with out altering habits. The only instance is alias substitution. In PowerShell, widespread instructions typically have shorter aliases. An attacker doesn’t want to alter the payload. They will change the command floor, e.g.:

# seed sample
Invoke-WebRequest -Uri -OutFile

# behavior-preserving variant
iwr -Uri -OutFile

The speculation for this class is {that a} mannequin skilled closely on canonical command types could deal with the complete command token in another way from the alias, even when PowerShell treats them as equal. In our benchmark, this class produced constant Basis-Sec misses within the Invoke-WebRequest alias household. Llama didn’t share these misses on the identical evaluated variants.

Tier 2: Command and String Reconstruction

The second-tier checks whether or not the mannequin relies on a command or technique showing actually within the textual content. Many scripts reconstruct strings or command names at runtime. A classifier that chubby’s literal indicators can miss the habits when the identical command is assembled not directly, e.g.:

# seed sample
Invoke-Expression

# behavior-preserving variant
& ((‘{0}{1}’ -f ‘Invoke-‘,‘Expression’))

This sort of rewrite preserves the command’s position whereas altering the textual proof accessible to the mannequin. It checks whether or not the classifier understands the operation or merely acknowledges the literal command string. In our outcomes, Basis-Sec produced misses on a centered Invoke-Expression reconstruction sample, whereas the bottom Llama mannequin didn’t share the identical misses.

Tier 3: Case Mutation

PowerShell command names are case-insensitive. That makes case mutation a very sharp check. Not like reconstruction, it doesn’t conceal the command from a human reader. Not like alias substitution, it doesn’t substitute the command with a special phrase. It preserves the identical command id and argument construction whereas altering the token floor that the mannequin sees, e.g.:

# seed sample
Invoke-Expression

# behavior-preserving variant
InVoKe-ExPrEsSiOn

We additionally examined alias-form case mutation:

# canonical alias type
IEX

# behavior-preserving variant
iEx

This tier is essential as a result of it factors to token-surface sensitivity. If the mannequin misses a script after a case-only change, the difficulty is unlikely to be semantic ambiguity in PowerShell. The habits, command id, and argument construction are preserved. What modified is the illustration the mannequin builds from the textual content.

Basis-Sec produced misses whereas Llama produced none on the identical evaluated set. The strongest misses concentrated round full-command Invoke-Expression case mutation (4/4 missed) and case-mutated IEX alias variants (4/4 missed):

Immediate Fixes Can Be Uneven

One tempting response is to repair the difficulty with a greater immediate. For instance, we are able to inform the mannequin to categorise based mostly on general function quite than particular person constructs.

That helps in some locations. In our checks, a prompt-level change fastened the Invoke-WebRequest alias misses. But it surely additionally opened or amplified misses in different households, together with Invoke-Expression, IEX, and DownloadString transformations.

This reveals that immediate remediation can redistribute the failure floor, quite than eradicate it. Safety groups shouldn’t assume {that a} immediate that fixes one evasion household makes the mannequin globally extra sturdy.

Why This Is Not Simply “Obfuscation Fooling a Classifier”

At a excessive stage, it’s straightforward to say: “A classifier overfit to indicators could be fooled by altering the indications”, however the true clarification is extra delicate. The fascinating half is what modified via fine-tuning.

Basis-Sec and Llama share the identical underlying structure and inherit an identical classification circuit. Basis-Sec is best on the baseline activity, however it is usually extra brittle underneath particular transformations. This implies the vulnerability shouldn’t be merely a generic weak spot of the bottom structure. It’s tied to how fine-tuning reshaped the inherited circuit.

In profitable evasion instances, the interior malicious sign doesn’t merely vanish. The late consideration route can nonetheless carry proof that the script is malicious. The failure seems in feed-forward computation close to the classification boundary: fine-tuned parts change how that proof is used. In some instances, the proof is successfully reversed, turning what ought to help a malicious classification into help for a benign one.

That is why we describe the failure as realized semantics on prime of inherited circuits. The inherited route nonetheless exists. High-quality-tuning modifications the which means and weighting of the indications that feed into the ultimate resolution.

A Pre-Deployment Monitoring Methodology

The sensible query is: can we determine the dangerous command households earlier than producing a big evasion benchmark? Our reply is sure, on the household stage.

1. Linear Probe for Illustration Drift

First, we practice a easy linear probe on a hidden activation close to the mannequin’s classification boundary. In our research, circuit evaluation instructed us the place to look: the residual stream simply earlier than Layer 13. However the broader technique shouldn’t be tied to that precise layer. The essential concept is to decide on a steady inside web site the place classification proof is readable, practice a light-weight linear readout on the bottom mannequin, and reuse that readout after fine-tuning.

The probe works nicely in our setting, with correlations round r = 0.80-0.87. This implies the mannequin’s inside classification proof could be monitored with an affordable linear projection.

A crew can then run the bottom and fine-tuned fashions on canonical inputs, apply the identical projection, and evaluate the consequence by command household. Households whose projected sign shifts probably the most turn into the primary red-team targets.

2. Indicator-Token Signal Check

The second sign is extra focused. For every command household, we take away or neutralize the canonical indicator tokens and measure whether or not malicious confidence goes up or down.

If eradicating a token reduces malicious confidence, the token was performing as a driver of the malicious resolution. If eradicating it will increase malicious confidence, the token is performing like a suppressor.

The dangerous sample is an indication flip between the bottom and fine-tuned fashions. If the bottom mannequin treats an indicator as a malicious driver, however the fine-tuned mannequin treats it as a suppressor, then that household has undergone a task reversal. That could be a sturdy sign that behavior-preserving transformations of that indicator deserve red-team consideration. The output shouldn’t be a prediction for particular person scripts. It’s a ranked checklist of command households to pink crew.

What This Means for Safety Groups

High-quality-tuning could be priceless. The lesson is to not keep away from fine-tuning safety fashions. The lesson is to guage what fine-tuning modifications.

Safety fine-tuning modifications greater than activity efficiency. It modifications how the mannequin internally represents and makes use of proof. In our research, Basis-Sec inherited a helpful detection circuit from Llama, then specialised in a manner that improved baseline habits however launched transformation-sensitive failures.

Normal held-out accuracy tells us whether or not the mannequin performs nicely on acquainted examples. It doesn’t inform us whether or not the mannequin has turn into brittle to behavior-preserving variants. For safety classification, that hole issues as a result of attackers can change floor type whereas preserving habits.

The sensible suggestion is simple: deal with fine-tuning as a possible supply of illustration drift. Earlier than deployment, evaluate the bottom and fine-tuned fashions on canonical inputs, determine which command households modified most, and red-team these households with behavior-preserving variants. The objective is to not predict each evasion. The objective is to search out the elements of the duty the place fine-tuning could have made the mannequin semantically brittle.

Llama is a trademark of Meta Platforms. PowerShell is a trademark of Microsoft. All different emblems are the property of their respective house owners.

Inherited Circuits, Realized Semantics: How Safety High-quality-Tuning Can Create Hidden Evasion Threat

Overview

Inherited Circuit, Specialised Semantics

Stress-Testing the Circuit

Tier 1: Direct Rewrites

Tier 2: Command and String Reconstruction

Tier 3: Case Mutation

Immediate Fixes Can Be Uneven

Why This Is Not Simply “Obfuscation Fooling a Classifier”

A Pre-Deployment Monitoring Methodology

1. Linear Probe for Illustration Drift

2. Indicator-Token Signal Check

What This Means for Safety Groups

Related Articles

Spaceflight Nears Its Steamship Period

Adara, Aurora pitch DOCSIS and PON upgrades to smaller operators

The Tokens You Can’t Wait For – O’Reilly

LEAVE A REPLY Cancel reply

Latest Articles

Spaceflight Nears Its Steamship Period

Adara, Aurora pitch DOCSIS and PON upgrades to smaller operators

The Tokens You Can’t Wait For – O’Reilly

One World Dialogue, A number of Competing Rulebooks |

‘The Java Story’ involves YouTube

ABOUT US