Normalize first, or you are matching the wrong string

Your XSS filter blocks <script>. Does it block the same tag written with fullwidth angle brackets? Most pre-2023 detectors do not. NFKC fixes a bypass class your regex was never going to see, in a few lines of code.

The takeaways.

Unicode normalization (NFKC) collapses characters that look the same to characters that are the same. After it runs, fullwidth forms, ligatures, and other compatibility code points match their plain equivalents.
It belongs at the top of every sanitize function. Once. Not per-detector.
Two trends made this urgent: LLM-generated mixed-script output, and rich-text input pasted by a global user base from heterogeneous tools.
The fix is two to four lines per language. The cost of skipping it is a bypass class that grows every year.

Look at these two strings.

<script>
＜script＞

To a human, they look the same. To a regex, they are not. The first is eight ASCII characters. The second swaps the two angle brackets for their fullwidth forms, U+FF1C and U+FF1E. Every detector written against the literal less-than sign will catch the first and miss the second.
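A minimal Python sketch of the mismatch described above. The pattern and variable names are illustrative assumptions, not taken from any particular filter.

```python
import re

ascii_tag = "<script>"
fullwidth_tag = "\uff1cscript\uff1e"  # fullwidth < and >, ASCII letters

# A typical literal pattern, written against the ASCII less-than sign.
pattern = re.compile(r"<script", re.IGNORECASE)


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


ascii_hit = pattern.search(ascii_tag) is not None
fullwidth_hit = pattern.search(fullwidth_tag) is not None

print(code_points(ascii_tag)[0], ascii_hit)          # U+003C True
print(code_points(fullwidth_tag)[0], fullwidth_hit)  # U+FF1C False
```

The two strings render almost identically, but the first code point differs, and that is all the regex looks at.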

This is the entry point to a class of bypasses that has existed for as long as Unicode has existed and that most input-sanitization libraries quietly ignore. The fix is small. The cost of not having it is large. The fix should be at the top of every sanitization function in your stack.

The class is not theoretical. CVE-2008-2938 against Apache Tomcat was a directory-traversal vulnerability where UTF-8-encoded variants of ../ slipped past the path filter. The same shape, a validator that sees one string and a parser that acts on another, recurs wherever the two decode or normalize differently. Trail of Bits documented Unicode-bypass XSS in production CDN configurations in 2022, and the pattern recurred against several enterprise WAFs in 2023 and 2024 wherever non-ASCII traffic was common. The fix in each case was upstream: normalize before you match.

What NFKC actually does

Unicode defines four normalization forms. NFC and NFD deal with canonical equivalence: deciding whether é is one code point or two. NFKC and NFKD add the K, which stands for compatibility. NFKC applies compatibility decomposition and then canonical composition, which collapses characters that look the same to characters that are the same.

The fullwidth ASCII variants, code points U+FF01 through U+FF5E, exist for historical CJK typography reasons. Each character in that range has an NFKC mapping to its ASCII equivalent. ＜ normalizes to <. ｓｃｒｉｐｔ normalizes to script. After NFKC, the fullwidth string is byte-identical to the ASCII string.

Other compatibility decompositions fold in here too. Ligatures: ﬁ becomes fi. Superscripts and subscripts: ² becomes 2. Roman numerals that have their own code points become Latin letters. NFKC is the broadest form of "make this look like its canonical version" that Unicode ships.

For security purposes, this is exactly the property you want. Two strings that Unicode treats as compatibility variants of the same text should match the same pattern. NFKC is the operation that makes that true.
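The mappings above can be checked directly with Python's standard library. This is a small sketch; the helper to_fullwidth is a hypothetical name introduced here for illustration.

```python
import unicodedata


def to_fullwidth(ascii_text: str) -> str:
    """Shift printable ASCII (U+0021..U+007E) to its fullwidth variant.

    The fullwidth forms sit exactly 0xFEE0 above their ASCII originals.
    """
    return "".join(chr(ord(ch) + 0xFEE0) for ch in ascii_text)


# Each key is a compatibility form; each value is what NFKC folds it to.
samples = {
    to_fullwidth("<script>"): "<script>",  # fullwidth forms
    "\ufb01": "fi",                        # LATIN SMALL LIGATURE FI
    "\u00b2": "2",                         # SUPERSCRIPT TWO
    "\u2163": "IV",                        # ROMAN NUMERAL FOUR
}

for raw, expected in samples.items():
    assert unicodedata.normalize("NFKC", raw) == expected

# NFC handles canonical equivalence only: it composes e + combining acute
# into a single code point, but leaves fullwidth characters untouched.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
assert unicodedata.normalize("NFC", to_fullwidth("<")) == to_fullwidth("<")
```

The last assertion is the reason NFC is not enough for matching: it never touches compatibility characters.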

Where to put it

The placement is the only interesting question. NFKC has to run before any pattern matching, or the pattern matching sees the un-normalized form and the bypass works.

This is harder than it sounds in practice. Most sanitization libraries grew organically. They have a regex for XSS, a regex for SQL injection, a regex for command injection, and so on. Each one developed its own preprocessing. Maybe one of them runs toLowerCase first. Maybe another decodes URL-encoded input first. By the time you read the code, the order of operations is implicit and easy to break.

The right model is a pipeline. Normalize first. Decode next, and fold again after any decode step. Then match. If you only normalize for one detector and not the others, the others have a bypass. If you normalize only before decoding, a percent-encoded fullwidth character such as %EF%BC%9C decodes into a fullwidth code point after the fold has already run, and the matcher sees it unnormalized. The order matters.
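The ordering problem is easier to see in code. The sketch below assumes percent-encoding is the only decode step and uses hypothetical function names. It keeps NFKC as the first operation and adds a second pass after decoding, since the decoder can emit code points the first pass never saw.

```python
import unicodedata
from urllib.parse import unquote


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def normalize_only_before_decode(raw: str) -> str:
    # Folds first, then decodes. Anything the decoder produces is never folded.
    return unquote(nfkc(raw))


def prepare(raw: str) -> str:
    # Normalize first, decode, then fold again so decoded code points
    # are covered before any pattern runs.
    return nfkc(unquote(nfkc(raw)))


encoded = "%EF%BC%9Cscript%EF%BC%9E"  # percent-encoded fullwidth < and >

leaky = normalize_only_before_decode(encoded)
safe = prepare(encoded)

assert leaky == "\uff1cscript\uff1e"  # still fullwidth: a "<script" regex misses it
assert safe == "<script>"             # plain ASCII: the regex matches
```

This sketch decodes a single layer only. Input that is encoded twice would need the decode-and-fold step repeated until the string stops changing, or rejected outright.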

This is why Arcis added NFKC to the top of sanitize_string in every SDK in v1.6.0. Not as a per-vector option. As an unconditional first step. Every detector that runs after it sees normalized input. No detector can opt out and create an inconsistency.

The bypass classes NFKC closes

Adding NFKC at the boundary closes several categories of bypass at once. Not just fullwidth XSS.

Fullwidth attack payloads. The base case. Any pattern that matched ASCII now matches fullwidth equivalents too.

Ligature smuggling. A SQL injection pattern matching load_file would miss load_ﬁle, written with the ﬁ ligature (U+FB01) in place of the two letters fi. After NFKC, both are the same string.

Mathematical letterlikes. Unicode has hundreds of code points for styled copies of Latin letters: bold, italic, script, fraktur, double-struck (blackboard bold). NFKC folds the mathematical alphanumerics, and letterlike symbols such as ℝ, to plain letters.

Compatibility CJK punctuation. Some payloads for email-header injection, CRLF header injection, or shell command injection rely on fullwidth punctuation such as ； and ｜. After NFKC, those punctuation characters become the regular ASCII ones.

What NFKC does not catch. Visually similar but not equivalent characters. Cyrillic а looks like Latin a but they are different code points and NFKC does not fold them. That is the domain of homograph attacks, which is a separate problem with separate defenses (IDN restrictions, mixed-script detection). NFKC is necessary, not sufficient.
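A short sketch of that boundary. The mixed-script check here is a crude heuristic that reads the first word of each character's Unicode name; it is an assumption made for illustration, not the Unicode Script property and not a complete homograph defense.

```python
import unicodedata

latin_a = "a"
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

# NFKC leaves the Cyrillic letter alone: it is not a compatibility variant.
assert unicodedata.normalize("NFKC", cyrillic_a) == cyrillic_a
assert unicodedata.normalize("NFKC", cyrillic_a) != latin_a


def scripts_used(text: str) -> set:
    """Approximate the set of scripts in a string from character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER P" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


spoofed = "p\u0430ypal"  # second letter is Cyrillic
assert scripts_used("paypal") == {"LATIN"}
assert scripts_used(spoofed) == {"LATIN", "CYRILLIC"}
```

Whether a mixed-script string should be rejected depends on the field. A display name can legitimately mix scripts; a hostname or username usually should not.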

Why this is more important in 2026

Two trends pushed Unicode-based input from edge case to mainstream.

The first is LLM-generated content. Models trained on global text often produce mixed-script output. A user asks a model to translate, or to format a name, or to extract data from a PDF, and the output mixes scripts in subtle ways. If your app passes that output through a sanitizer that does not normalize, the model's output can carry attack payloads through your defense by accident or design. The HiddenLayer team published research in 2025 showing that GPT-class models, when asked to "format this user input as HTML", would readily emit fullwidth angle brackets that downstream sanitizers passed through unchanged. The attack was not in the prompt. It was in the model's reformatting step.

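The second is rich-text input from non-ASCII sources. Users copy text from Word, from Pages, from email clients that auto-substitute typographic quotes. They paste it into your form. The string that arrives at your server is not the string the user thought they were typing. Some of those substitutions, such as non-breaking spaces and the single-character ellipsis, fold under NFKC. Curly quotes do not, so a detector that keys on quote characters still has to list them explicitly.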

Both of these are getting more common, not less. The 2015 codebase that ignored normalization could mostly get away with it because the 2015 user base mostly typed ASCII. The 2026 codebase does not have that luxury.

How to add it

If you maintain a sanitization library or a custom input validator, the change is small. In each SDK:

JavaScript: input.normalize('NFKC'). Built into the language since ES6. No dependencies. Runs in microseconds for the input sizes a web request will produce.

Python: unicodedata.normalize('NFKC', input). Standard library. Same performance properties.

Go: golang.org/x/text/unicode/norm.NFKC.String(input). A small dependency, well-maintained, from the Go team.

Each one belongs as the first line of your sanitize function. After it, every detector that runs sees normalized text, and the fullwidth-payload class is closed for all of them at once.
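In Python, that placement might look like the sketch below. The detector patterns and the name detect_threats are hypothetical, chosen for illustration; this is not the Arcis implementation.

```python
import re
import unicodedata

# Hypothetical detector set, for illustration only.
DETECTORS = {
    "xss": re.compile(r"<\s*script", re.IGNORECASE),
    "sqli": re.compile(r"\bload_file\s*\(", re.IGNORECASE),
    "traversal": re.compile(r"\.\./"),
}


def detect_threats(value: str) -> list[str]:
    # First line: fold once, so every detector below sees the same text.
    text = unicodedata.normalize("NFKC", value)
    return sorted(name for name, pattern in DETECTORS.items() if pattern.search(text))


print(detect_threats("\uff1cscript\uff1e"))            # ['xss']
print(detect_threats("load_\ufb01le('/etc/passwd')"))  # ['sqli']
print(detect_threats("\uff0e\uff0e\uff0fsecret"))      # ['traversal']
print(detect_threats("hello"))                          # []
```

Because the fold happens before the loop, adding a fourth detector later does not require remembering to normalize inside it.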

The reason this is rare

If the fix is this small, why is it not universal? Two reasons.

The first is that NFKC was historically slow on some platforms, particularly in JavaScript engines before V8 optimized the standard-library implementation. That stopped being true a long time ago, but the folk wisdom persists.

The second is that NFKC is destructive in a way some applications care about. NFKC('ｃａｆé') returns 'café'. If you were planning to preserve the user's fullwidth input verbatim, NFKC will surprise you. The right answer is to normalize for security checks and to keep the original input separately for display.
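One way to keep both forms is sketched below, under the assumption that the application stores the raw value for display and runs checks only on the folded one. CheckedInput and check_input are hypothetical names.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedInput:
    original: str    # what the user sent, kept verbatim for display
    normalized: str  # NFKC-folded copy, used for every security check


def check_input(raw: str) -> CheckedInput:
    return CheckedInput(original=raw, normalized=unicodedata.normalize("NFKC", raw))


name = check_input("\uff43\uff41\uff46\u00e9")  # fullwidth c, a, f + é

assert name.normalized == "caf\u00e9"
assert name.original == "\uff43\uff41\uff46\u00e9"
```

The original still needs the usual output escaping wherever it is rendered. Keeping it verbatim preserves the user's text; it does not make that text safe to emit.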

One more place to apply it

Sanitization is the obvious place. Detection is the obvious place. There is one more: filename and path handling. Files arrive with fullwidth dots, with right-to-left override marks, with combining characters that change how the filename renders. Every layer that touches a user-supplied filename should normalize it first, and the path-traversal detector that runs against it should match after normalization. NFKC folds the fullwidth dots and slashes; override marks survive it and have to be rejected or stripped explicitly. The fullwidth version of ../ was a real bypass in real CDNs for real years.
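A sketch of a filename check along those lines. The function name and the returned fields are assumptions for illustration. Note that NFKC does not remove bidirectional controls such as U+202E, so the sketch flags format characters (general category Cf) separately.

```python
import re
import unicodedata

# ".." as a whole path segment, with either slash style.
TRAVERSAL = re.compile(r"(?:^|[\\/])\.\.(?:[\\/]|$)")


def inspect_filename(name: str) -> dict:
    folded = unicodedata.normalize("NFKC", name)
    return {
        "folded": folded,
        # Match after folding, so fullwidth dots and slashes are caught.
        "traversal": TRAVERSAL.search(folded) is not None,
        # Cf covers invisible format controls, including right-to-left override.
        "format_controls": any(unicodedata.category(ch) == "Cf" for ch in folded),
    }


fullwidth_path = "\uff0e\uff0e\uff0fetc\uff0fpasswd"  # fullwidth ../etc/passwd
rtl_name = "invoice\u202egpj.exe"                     # contains U+202E

assert inspect_filename(fullwidth_path)["folded"] == "../etc/passwd"
assert inspect_filename(fullwidth_path)["traversal"] is True
assert inspect_filename(rtl_name)["format_controls"] is True
assert inspect_filename("report.pdf")["traversal"] is False
```

A real upload handler would also resolve the final path and confirm it stays inside the intended directory. The pattern check is a first filter, not the whole defense.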

None of this is novel. It is in the Unicode standard. It has been in the Unicode standard for decades. The reason to write about it is not that the technique is new. It is that the implementation is still rare, and the cost of skipping it is now too high to keep skipping.

Arcis normalizes input to a canonical form before it matches, so these fullwidth and ligature encodings do not slip past. It is open source and MIT, in Node, Python, and Go. If this was useful, a star helps other people find it.
