Measuring Malice: When Being ‘Almost Right’ Is Exactly Wrong

If you’ve spent any time writing detection rules for process masquerading, you know the game: an attacker uses scvhost.exe instead of svchost.exe, so you write a (hopefully) good regex to catch it. Then they use svch0st.exe (zero instead of ‘o’), so you update your regex. Then svchost1.exe appears. Then svchostt.exe with a double ‘t’.
Every variation requires manual updates. Every new typosquatting attempt means another rule update. The pattern becomes increasingly complex, harder to maintain, and still manages to miss novel variations.
There’s a better way.
Instead of trying to enumerate every possible character swap, what if we measured how close a process name is to a legitimate binary? What if we could say: “This process name is suspiciously similar to svchost.exe but not quite right: alert on it”?
This is where Levenshtein distance comes in.
What Is Levenshtein Distance?
Levenshtein distance is deceptively simple: it’s the minimum number of single-character edits needed to transform one string into another.
Those edits can be:
- Substitution: Changing one character to another (svchost → svch0st - substituting ‘o’ with ‘0’)
- Insertion: Adding a character (svchost → svchostt - adding an extra ‘t’)
- Deletion: Removing a character (svchost → svchst - deleting the ‘o’)
Let’s look at concrete examples using our favorite Windows binary:
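To make the examples concrete, here is a minimal sketch in plain Python (a standard dynamic-programming implementation; no external libraries) that computes the distance from svchost.exe to the lookalikes above:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if characters match)
            ))
        prev = curr
    return prev[-1]

legit = "svchost.exe"
for candidate in ["svchost.exe", "svch0st.exe", "svchostt.exe",
                  "svchost1.exe", "scvhost.exe", "chrome.exe"]:
    print(f"{candidate:<13} distance = {levenshtein(legit, candidate)}")
```

This prints distance 0 for the exact match, 1 for svch0st.exe, svchostt.exe, and svchost1.exe, 2 for scvhost.exe, and 5 for chrome.exe.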
Notice the pattern? The malicious masquerading attempts cluster in the 1–2 edit distance range. They’re close enough to fool a casual human glance, but mathematically distinct.
The Math
The formal recursive definition (from https://en.wikipedia.org/wiki/Levenshtein_distance): the distance lev(a, b) between strings a and b is
- lev(a, b) = |a| if b is empty, and |b| if a is empty
- lev(a, b) = lev(tail(a), tail(b)) if the first characters of a and b match
- lev(a, b) = 1 + min(lev(tail(a), b), lev(a, tail(b)), lev(tail(a), tail(b))) otherwise
Translation: If a string is empty, the distance is the length of the other string. If the first characters match, recurse on the rest. Otherwise, try all three operations (delete, insert, substitute) and take the minimum cost.
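That recursive definition maps almost line-for-line onto code. A minimal Python sketch (exponential time; fine for building intuition, not for production):

```python
def lev(a: str, b: str) -> int:
    if not a:             # first string empty: insert all of b
        return len(b)
    if not b:             # second string empty: delete all of a
        return len(a)
    if a[0] == b[0]:      # first characters match: no edit needed
        return lev(a[1:], b[1:])
    return 1 + min(
        lev(a[1:], b),      # deletion
        lev(a, b[1:]),      # insertion
        lev(a[1:], b[1:]),  # substitution
    )

print(lev("svchost", "svch0st"))  # 1
```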
You don’t need to implement this yourself — libraries exist in every language. But understanding the concept is key: we’re measuring similarity, not matching patterns.
The “Uncanny Valley” of Process Names
The “uncanny valley” describes the unsettling feeling when something looks almost human but not quite right. A cartoon is fine. A photograph is fine. But a near-perfect CGI face with slightly wrong eyes? Somewhat disturbing.
Process masquerading operates in the same psychological space.
When you see chrome.exe, your brain registers “Chrome browser.” When you see malware.exe, your brain registers “obviously suspicious.” But when you see svch0st.exe in a process list scrolling by? Your brain might just register “svchost” and move on. It’s in the uncanny valley—close enough to pass a quick glance, wrong enough to be malicious.
This gives us our detection strategy.
The Threshold of Malice
When comparing an unknown process name against known legitimate binaries, the Levenshtein distance reveals intent:
Distance = 0 (Exact Match)
- The process name is identical to a legitimate binary
- Example: svchost.exe → svchost.exe
- Interpretation: Likely benign (though you still need path validation)
Distance = 1–3 (The Danger Zone)
- The process name is almost right but slightly wrong
- Example: svch0st.exe (distance 1), scvhost.exe (distance 2)
- Interpretation: Extremely suspicious. This is deliberate masquerading
- No legitimate software accidentally names itself one character away from a Windows system binary
Distance ≥ 4 (Different)
- The process name bears little resemblance
- Example: chrome.exe vs svchost.exe (distance 5)
- Interpretation: Different program entirely. Benign in this context.
Attackers want their malware to blend in. They’re not trying to be completely different — they’re trying to be close enough to evade casual inspection. This means they’ll likely operate in that 1–3 edit distance range.
A distance of 4+ defeats the purpose of masquerading. svcXYZ.exe doesn’t look like svchost.exe anymore. The attacker might as well use a completely different name.
The mathematical sweet spot is distances 1–2. That’s where masquerading lives.
Implementation
Let’s get a few things straight first:
- There are several string-similarity metrics that do what Levenshtein does (Damerau–Levenshtein and Jaro–Winkler, among others).
- Finding the one that is most efficient and best fits your org falls on you.
- Using Levenshtein in the SIEM is nothing new. Splunk has a blog post about it from 2023, and this Reddit post from five years ago is a good discussion of the topic (and of why it’s not always practical).
- This method is not perfect and, like everything else, has its nuances.
In the Splunk blog post mentioned above, Levenshtein is implemented as part of the URL Toolbox app, which I have also used for this demonstration for simplicity’s sake.
After installing the app, a simple search is all it takes to get started.
In production, you would first have to build a “golden list” of processes to monitor, one that fits your environment. Consider adding:
- Critical enterprise applications
- EDR/security tools that attackers might mimic
- Any binaries you see frequently targeted in threat intelligence
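As a rough sketch of that production logic in Python (the golden list, threshold band, and function names here are illustrative, not a drop-in rule):

```python
# Illustrative golden list and threshold; tune both to your environment.
GOLDEN_LIST = ["svchost.exe", "lsass.exe", "csrss.exe", "explorer.exe"]
DANGER_LOW, DANGER_HIGH = 1, 3  # the "close but wrong" band

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def check(process_name: str):
    """Return an alert string if the name sits in the danger zone, else None."""
    name = process_name.lower()  # normalize case before comparing
    for legit in GOLDEN_LIST:
        d = levenshtein(name, legit)
        if d == 0:
            return None  # exact match; still validate the path separately
        if DANGER_LOW <= d <= DANGER_HIGH:
            return f"ALERT: {process_name} is {d} edit(s) from {legit}"
    return None

print(check("svch0st.exe"))
```

An exact match falls through to None (path validation is a separate check), while anything 1–3 edits from a golden-list entry raises an alert.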
LUMEN Implementation
If you haven’t had a chance to read my post about LUMEN, the tool I recently developed, take a few minutes to do so now. The tl;dr is that LUMEN is a browser-based EVTX companion.
I think Levenshtein distance is a great addition to the toolkit, so you can now expect to find it shipped with LUMEN, inside the “Process Execution Analysis” card.
You can find LUMEN at https://lumen.koifsec.me, or, if you prefer to self-host, at https://github.com/Koifman/LUMEN
Conclusion
Detection engineering is fundamentally about asymmetry. Attackers only need to get creative once; defenders need to predict and prepare for every possible variation (which is also where immutable artifacts come into play).
Instead of enumerating attack patterns, we define what “close but wrong” means mathematically. The attacker can swap characters, double letters, use numbers instead of letters — it doesn’t matter. If they’re trying to masquerade as a legitimate binary, the math will catch them (unless we’re talking about injection, which is a whole different story).
This approach gives you:
- One detection logic that catches unlimited variations
- No more updating regex patterns for every new typosquatting technique
- Future-proofing against creative character substitutions
The concept extends beyond masquerading detection:
- DLL side-loading: Detect DLLs with names close to legitimate system DLLs
- Domain typosquatting: Find C2 domains similar to legitimate services
- File paths: Identify malicious files in directories that mimic system paths
- Registry keys: Catch persistence mechanisms using near-identical key names
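The same distance function carries over unchanged to these cases. For instance, a sketch of the domain typosquatting case (the lookalike domains below are hypothetical examples, not real indicators):

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Hypothetical C2 lookalikes next to the services they imitate
print(levenshtein("paypal.com", "paypa1.com"))        # 1: '1' swapped for 'l'
print(levenshtein("microsoft.com", "rnicrosoft.com"))  # 2: 'rn' mimics 'm'
```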
Regex isn’t going away — it still has its place for precise pattern matching. But for problems where attackers benefit from being “close but not exact,” mathematical similarity scoring is simply better.
The detection rules we write reveal our assumptions about how attackers think. This one assumes they want to blend in. And that assumption, unlike most, is backed by mathematics.
If you enjoyed the article, feel free to connect with me!
https://www.linkedin.com/in/daniel-koifman-61072218b/
https://x.com/KoifSec
https://koifsec.me
https://bsky.app/profile/koifsec.bsky.social
Measuring Malice: When Being ‘Almost Right’ Is Exactly Wrong was originally published in Detect FYI on Medium, where people are continuing the conversation by highlighting and responding to this story.