Input validation of free-form Unicode text in Python

Input validation is one of the most important application security controls, and yet there is still a huge gap when it comes to implementing it for one of the most popular types of user input: free-form text with Unicode characters. This article demonstrates a simple way of dealing with Unicode text using Python.

There is plenty of useful guidance on implementing input validation for various data types, for example the OWASP Input Validation Cheat Sheet, which discusses strategies for validating common data types in detail and with code samples. When it comes to free-form text, however, the guidance becomes rather vague and rather discouraging, if you think about it:

The most difficult fields to validate are so called 'free text' fields, like blog entries. However, even those types of fields can be validated to some degree. For example, you can at least exclude all non-printable characters (except acceptable white space, e.g., CR, LF, tab, space), and define a maximum length for the input field.

The same pattern can be found in other sources on the subject (see References), including a published infosec book, CWE and other relatively high-profile publications. The contrast between the ability to perform very strict validation of structured data (e.g. a date or an SSN) and the apparently less strict validation of unstructured data (like free-form text) seems to frustrate most application security authors. And it gets even more pessimistic when Unicode is involved.

But it shouldn’t. The concern with highly structured data is that, because it’s directly used in business logic, even small deviations from the assumed format may have high-impact consequences (e.g. sending money elsewhere or giving away goods too cheaply). In contrast, in many applications the free-form text inputs aren’t really critical for operations: no business logic depends on the text, and the apps are only expected to reliably display it back to users (e.g. forum comments, order notes etc.).

Why worry?

In such an input-display data flow, the primary threat is browser-level code injection, such as Cross-Site Scripting. But is it really so difficult to prevent XSS in a modern web application? I would argue it’s actually quite difficult to write an application vulnerable to XSS these days, with all the automated escaping in both server-side (e.g. Django) and client-side (e.g. AngularJS) frameworks.

If you are using parametrized SQL queries and context-aware output encoding, you could theoretically have no input validation at all and still have an application that is not vulnerable to XSS (“could” doesn’t mean you should; that’s just for the sake of the argument). Websites such as StackOverflow do this all the time: they are full of user-submitted evil XSS vectors and yet somehow manage to display them rather than render them.

To reiterate from a slightly different angle: you should not be using input validation to prevent XSS or SQLi, as your database API should already be completely transparent to the data stored, and your output should apply context-aware encoding.
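As a minimal sketch of that argument, the snippet below stores a hostile string through a parametrized query and encodes it only when rendering. The in-memory SQLite table and the sample payload are assumptions made up for illustration; a real application would rely on its framework's template engine rather than calling html.escape by hand.

```python
import html
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

evil = "<script>alert('xss')</script>'; DROP TABLE comments; --"

# Parametrized query: the value travels separately from the SQL text,
# so it is stored verbatim and never interpreted as SQL.
conn.execute("INSERT INTO comments (body) VALUES (?)", (evil,))

stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Encode only at output time, for the context being rendered
# (HTML body text in this case).
rendered = "<p>{}</p>".format(html.escape(stored))
print(rendered)
```

The stored value is byte-for-byte what the user submitted, and the markup only becomes harmless at the point where the output context is known.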

However, if you are persisting data for a long time, you may want to validate free-text input to only allow data that makes sense from a business point of view. Some usage scenarios to consider:

- Limit the occurrence of confusing Unicode homoglyphs (example: ՍnᎥⅽоԁе, generated with Homoglyph attack generator)
- Your client base is expected to write in Chinese, Japanese or Korean, rather than Arabic, Polish or Russian, for example
- Text (letters or ideographs) is expected in the input rather than emoji or mathematical symbols
- No input in Linear B, Ugaritic, Klingon or other exotic scripts is expected

Not that you have to apply any of these validation patterns. Your application may be perfectly OK without any of them, just as this blog is, where I may fancy switching to Klingon one day. Rule of thumb: just ask your business folks what they want to read!

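The script-related scenarios can be roughly approximated with the standard library alone. The sketch below is a heuristic, not a definitive implementation: Python's unicodedata does not expose the Unicode Script property, so it assumes that the first word of a character's Unicode name ("LATIN", "CYRILLIC", "CJK" and so on) is a good enough stand-in. The function names and the ALLOWED_CJK set are hypothetical.

```python
import unicodedata

def rough_scripts(text):
    """Return the set of name prefixes (e.g. 'LATIN') of all letters in text.

    Assumption: the first word of the Unicode name approximates the script.
    This is a heuristic, not the real Unicode Script property.
    """
    scripts = set()
    for c in text:
        if unicodedata.category(c).startswith('L'):
            name = unicodedata.name(c, '')
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(text):
    """True if the letters in text come from more than one (rough) script."""
    return len(rough_scripts(text)) > 1

# Hypothetical whitelist for an application expecting CJK input only.
ALLOWED_CJK = {'CJK', 'HIRAGANA', 'KATAKANA', 'HANGUL'}

def only_allowed_scripts(text, allowed=ALLOWED_CJK):
    """True if every letter in text belongs to one of the allowed prefixes."""
    return rough_scripts(text) <= allowed

print(rough_scripts('Unicode'))
print(is_mixed_script('ՍnᎥⅽоԁе'))
print(only_allowed_scripts('陳'))
```

A mixed-script check like this flags the homoglyph example because its letters come from Armenian, Latin, Cherokee and Cyrillic. It would also flag legitimate mixed text, such as a Japanese sentence containing a Latin brand name, so whether to reject or merely warn is a business decision.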
Validating Unicode free-form text
In Python, you can easily validate Unicode free-form text by whitelisting specific Unicode character categories such as lowercase letters, uppercase letters, ideographs etc. These categories are defined in the Unicode standard and implemented in a number of libraries such as xregexp (JavaScript) and unicodedata (Python). An example using the latter:

    >>> import unicodedata
    >>> unicodedata.category('A')
    'Lu'
    >>> unicodedata.category('1')
    'Nd'
    >>> unicodedata.category('a')
    'Ll'
    >>> unicodedata.category('陳')
    'Lo'
    >>> unicodedata.category('>')
    'Sm'
    >>> unicodedata.category('\'')
    'Po'
    >>> unicodedata.category('\u202e')
    'Cf'

The characters A 1 a 陳 > ' and U+202E are classified as Uppercase_Letter (Lu), Decimal_Number (Nd), Lowercase_Letter (Ll), Other_Letter (Lo), Math_Symbol (Sm), Other_Punctuation (Po) and Format (Cf) respectively. The last one, RIGHT-TO-LEFT OVERRIDE, is especially tricky as it's actively used in attacks, where an attachment displayed as file.exe.txt is really file.\u202etxt.exe, i.e. an executable.

    >>> print('file.\u202etxt.exe')  # if your console displays file.txt.exe, copy and paste it into a web browser
    file.‮txt.exe

So if your business assumptions require that only letters and ideographs are accepted in a human name field, you might come up with a validation function like this:

    def only_letters(s):
        """
        Returns True if the input text consists of letters and ideographs only, False otherwise.
        """
        for c in s:
            cat = unicodedata.category(c)
            # Ll=lowercase, Lu=uppercase, Lo=other letters, including ideographs
            if cat not in ('Ll', 'Lu', 'Lo'):
                return False
        return True

    >>> only_letters('Bzdrężyło')
    True
    >>> only_letters('He7lo')  # we don't allow digits here
    False

You may also whitelist individual characters rather than whole categories. For example, to accept Irish names you may just want to allow the apostrophe (') rather than whitelisting the whole Po class.
For readability I recommend using the Unicode names of these characters:

    >>> unicodedata.name("'")
    'APOSTROPHE'

So the validation function becomes:

    allowed_cats = ('Ll', 'Lu', 'Lo')
    allowed_chars = ('APOSTROPHE',)

    def only_letters(s):
        for c in s:
            # is it whitelisted by category?
            cat = unicodedata.category(c)
            if cat in allowed_cats:
                continue
            # is it whitelisted by character? (the '' default avoids a
            # ValueError for unnamed characters such as control codes)
            name = unicodedata.name(c, '')
            if name in allowed_chars:
                continue
            # found a non-whitelisted character
            return False
        # all characters were whitelisted
        return True

    >>> only_letters("Mc'Namara")
    True
    >>> only_letters("Mc'Namara 3")  # we allow neither spaces nor digits here
    False
Drop or sanitize?
In general, input validation should be an all-or-nothing gate: either your input is correct and is accepted, or it's invalid and is rejected before entering the data processing workflow. Preserving the original, unmodified user input is good for data integrity throughout its whole life cycle. After a message has been stored in a database, sent through a REST API to an analytics system, and then displayed to an analyst, it's still the original message entered by the user, and the analyst does not need to wonder whether the user wrote & or &amp;, or whether it was changed by the application.

If you don't allow some characters in the input, just tell the user straight away. There may be a lot of good reasons for that. As explained above, you may not wish to see user names in Klingon script because you won't be able to process them anyway. One exception may be Unicode normalization, but that's a slightly different case: it changes the encoding but not the visual and semantic meaning of the user input, at least in theory.
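One way to "tell the user straight away" is to reject the whole input and name the offending characters instead of silently dropping them. The following is a sketch under assumptions: the function names and the error message format are hypothetical, and the category whitelist simply repeats the one used in the examples above.

```python
import unicodedata

ALLOWED_CATS = ('Ll', 'Lu', 'Lo')  # same category whitelist as in the earlier examples

def rejected_characters(text):
    """Return (position, character, Unicode name) for each non-whitelisted character."""
    problems = []
    for pos, c in enumerate(text):
        if unicodedata.category(c) not in ALLOWED_CATS:
            # Control characters have no name, hence the 'UNNAMED' default.
            problems.append((pos, c, unicodedata.name(c, 'UNNAMED')))
    return problems

def validate_or_reject(text):
    """All-or-nothing gate: return the text unchanged or raise with an explanation."""
    problems = rejected_characters(text)
    if problems:
        details = ', '.join(
            '{} at position {}'.format(name, pos) for pos, _, name in problems
        )
        raise ValueError('Characters not allowed: ' + details)
    return text

print(rejected_characters('He7lo'))
```

The accepted value is returned unmodified, so whatever is stored is exactly what the user typed.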
Normalization
A recommended step in Unicode text processing and input validation is normalization. Per a Unicode Consortium technical note, the preferred normalization form when security is considered is NFKC, but you need to be careful here. Normalization changes the original text, and while it promises that non-canonical characters will be replaced with their canonical equivalents, this really means one character may be replaced with another character from a different category, as shown below:

    >>> s = 'ՍnᎥⅽоԁе'  # a tricky homograph composed of rare characters from different scripts
    >>> for c in s: print('{} {:>35} {}'.format(c, unicodedata.name(c), unicodedata.category(c)))
    ...
    Ս         ARMENIAN CAPITAL LETTER SEH Lu
    n                LATIN SMALL LETTER N Ll
    Ꭵ                   CHEROKEE LETTER V Lu
    ⅽ     SMALL ROMAN NUMERAL ONE HUNDRED Nl
    о             CYRILLIC SMALL LETTER O Ll
    ԁ       CYRILLIC SMALL LETTER KOMI DE Ll
    е            CYRILLIC SMALL LETTER IE Ll
    >>> for c in unicodedata.normalize('NFKC', s): print('{} {:>35} {}'.format(c, unicodedata.name(c), unicodedata.category(c)))
    ...
    Ս         ARMENIAN CAPITAL LETTER SEH Lu
    n                LATIN SMALL LETTER N Ll
    Ꭵ                   CHEROKEE LETTER V Lu
    c                LATIN SMALL LETTER C Ll
    о             CYRILLIC SMALL LETTER O Ll
    ԁ       CYRILLIC SMALL LETTER KOMI DE Ll
    е            CYRILLIC SMALL LETTER IE Ll

Only one character changed here: ⅽ (SMALL ROMAN NUMERAL ONE HUNDRED, category Nl) became c (LATIN SMALL LETTER C, category Ll), while the other homoglyphs survived normalization untouched.
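The practical consequence is that the order of normalization and validation matters. A small sketch, reusing a compact variant of the only_letters check from earlier (the sample string is made up for illustration):

```python
import unicodedata

def only_letters(s):
    """True if every character is a lowercase, uppercase or other letter."""
    return all(unicodedata.category(c) in ('Ll', 'Lu', 'Lo') for c in s)

# SMALL ROMAN NUMERAL ONE HUNDRED followed by "at": looks like "cat".
raw = '\u217dat'
normalized = unicodedata.normalize('NFKC', raw)

print(only_letters(raw))         # False: the Roman numeral is category Nl
print(only_letters(normalized))  # True: NFKC maps it to LATIN SMALL LETTER C
print(normalized == 'cat')       # True
```

Validating before normalization rejects this input, while validating after normalization accepts it. Either choice can be defended, but it seems wise to pick one deliberately and to store the same form that was validated.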
D̡҉̛͉̥̠C̦L̯͍͓̗̰̰̥X̧̨͔̠V͙̦̘̝̺̩͢͠I͖͕̰̟̮̣͖̮̮
What about this? I made this string a section header above only because it's a beautiful demo of Unicode capabilities :)

    >>> s = 'D̡҉̛͉̥̠C̦L̯͍͓̗̰̰̥X̧̨͔̠V͙̦̘̝̺̩͢͠I͖͕̰̟̮̣͖̮̮ '
    >>> for c in s: print(c, unicodedata.name(c))
    ...
    D LATIN CAPITAL LETTER D
      COMBINING PALATALIZED HOOK BELOW
      COMBINING CYRILLIC MILLIONS SIGN
      COMBINING HORN
      COMBINING LEFT ANGLE BELOW
      COMBINING RING BELOW
      COMBINING MINUS SIGN BELOW
    C LATIN CAPITAL LETTER C
      COMBINING COMMA BELOW
    L LATIN CAPITAL LETTER L
      COMBINING INVERTED BREVE BELOW
      COMBINING LEFT RIGHT ARROW BELOW
      COMBINING X BELOW
      COMBINING ACUTE ACCENT BELOW
      COMBINING TILDE BELOW
      COMBINING TILDE BELOW
      COMBINING RING BELOW
    X LATIN CAPITAL LETTER X
      COMBINING CEDILLA
      COMBINING OGONEK
      COMBINING LEFT ARROWHEAD BELOW
      COMBINING MINUS SIGN BELOW
    V LATIN CAPITAL LETTER V
      COMBINING DOUBLE RIGHTWARDS ARROW BELOW
      COMBINING DOUBLE TILDE
      COMBINING ASTERISK BELOW
      COMBINING COMMA BELOW
      COMBINING LEFT TACK BELOW
      COMBINING UP TACK BELOW
      COMBINING INVERTED BRIDGE BELOW
      COMBINING VERTICAL LINE BELOW
    I LATIN CAPITAL LETTER I
      COMBINING RIGHT ARROWHEAD AND UP ARROWHEAD BELOW
      COMBINING RIGHT ARROWHEAD BELOW
      COMBINING TILDE BELOW
      COMBINING PLUS SIGN BELOW
      COMBINING BREVE BELOW
      COMBINING DOT BELOW
      COMBINING RIGHT ARROWHEAD AND UP ARROWHEAD BELOW
      COMBINING BREVE BELOW
      COMBINING BREVE BELOW
      SPACE
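If stacks of combining marks like these are unwelcome in a given field, one possible check is to cap the number of consecutive combining characters. This is only a sketch: the limit of two marks is an arbitrary assumption, the function names are hypothetical, and some languages legitimately stack several marks on one base character, so the threshold needs to be agreed with the business.

```python
import unicodedata

# Unicode general categories for combining marks:
# Mn = nonspacing, Mc = spacing combining, Me = enclosing.
MARK_CATS = ('Mn', 'Mc', 'Me')

def max_combining_run(text):
    """Return the length of the longest run of consecutive combining marks."""
    longest = current = 0
    for c in text:
        if unicodedata.category(c) in MARK_CATS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

MAX_MARKS = 2  # hypothetical limit, adjust to the languages you expect

def reasonable_marks(text, limit=MAX_MARKS):
    """True if no base character carries more than `limit` combining marks."""
    return max_combining_run(text) <= limit

# First cluster of the header above: D followed by six combining marks.
zalgo_d = 'D\u0321\u0489\u031b\u0349\u0325\u0320'
print(max_combining_run(zalgo_d))
print(reasonable_marks(zalgo_d))
print(reasonable_marks('e\u0301'))  # e + COMBINING ACUTE ACCENT
```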
Input escaping
Context-aware output encoding is a requirement for all applications. Once again: that's output escaping. Doing it on output makes sense, because this is where you know what context you're going to encode/escape for (HTML, HTML attributes, JavaScript, CSS etc.), and each of these contexts has different encoding rules. If you know your data is going to be rendered in a CSS context, you apply CSS encoding, and you can be certain you will encode only the characters that can mess up your CSS syntax.

There's also the concept of input escaping, which is a rather desperate workaround, typically applied as a "quick fix" to prevent SQLi and XSS reported in a pentest just as you're approaching the go-live deadline. If you see http://php.net/htmlspecialchars (PHP) or django.utils.html.escape (Python) used on input data, it should ring alarm bells. Try to avoid it as much as possible. The primary problem is that this modified (escaped) input ends up persisted in the database, usually encoded for HTML, and then one of two things happens when you render it:
- It's rendered in an HTML context with encoding, but your data in the database is already escaped, so each & in &amp; will be encoded again. This results in your content being rendered a bit like this: So you need to click File &gt; Import &amp; Export, or even better File &amp;gt; if the code encodes both input and output.
- It's rendered in any other context (e.g. an HTML attribute), but since the data was already escaped for HTML, the characters dangerous for the new context are not encoded properly.

This workaround requires a further workaround (such as django.utils.html.conditional_escape) and creates a sequence of problems for data integrity. Just go and read There's more to HTML escaping than &, <, >, and ".
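The double-encoding problem is easy to reproduce with the standard library. In this sketch html.escape stands in for whatever escaping function an application might apply, first (wrongly) on input and then (correctly) on output:

```python
import html

original = 'File > Import & Export'

# Anti-pattern: escaping on input, so the escaped form is what gets stored.
stored = html.escape(original)

# The output layer then escapes again, encoding the text a second time.
rendered = html.escape(stored)

print(stored)    # File &gt; Import &amp; Export
print(rendered)  # File &amp;gt; Import &amp;amp; Export

# A browser decodes entities once, so the user sees the stored (escaped)
# form on screen rather than the text that was originally entered.
print(html.unescape(rendered) == stored)    # True
print(html.unescape(rendered) == original)  # False
```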
What characters make XSS vectors?
Out of curiosity, I have analyzed the characters in Mario Heiderich's collection of XSS vectors to see what Unicode categories they fall into, and a few of them stick out in the diagram below. Lowercase_Letter is by far the most prevalent category and, like Uppercase_Letter, obviously legitimate, but Other_Punctuation and Math_Symbol are the next most frequent categories. As I have noted above, input validation should not be used to prevent XSS, but with legacy applications just remember to be careful about whitelisting whole categories.

    Characters in XSS vectors by Unicode category
    ###############################################################################
    No      4
    Sc     16
    Pc     23
    Sk     38
    Cc    158
    Pd    212 █
    Ps    313 █
    Pe    315 █
    Zs    854 ████
    Lu    865 ████
    Nd    946 █████
    Sm   1818 ██████████
    Po   2738 ███████████████
    Ll  12274 ████████████████████████████████████████████████████████████████████

And these are the specific characters used in the vectors:

    {'Cc': ['\n', '\t'],
     'Ll': ['f', 'o', 'r', 'm', 'i', 'd', 't', 'e', 's', 'b', 'u', 'n', 'a', 'c', 'j', 'v', 'p', 'l', 'h', 'x', 'z', 'g', 'y', 'k', 'q', 'w'],
     'Lu': ['X', 'A', 'D', 'G', 'E', 'C', 'H', 'M', 'I', 'O', 'U', 'B', 'K', 'Q', 'P', 'S', 'T', 'R', 'W', 'N', 'Z', 'J', 'F', 'Â', 'L', 'V', 'Y'],
     'Nd': ['1', '4', '7', '0', '8', '9', '6', '3', '2', '5'],
     'No': ['¼', '¾'],
     'Pc': ['_'],
     'Pd': ['-'],
     'Pe': [')', '}', ']'],
     'Po': ['"', '/', ':', '&', ';', '?', '#', '.', "'", ',', '*', '!', '%', '@', '\\'],
     'Ps': ['(', '{', '['],
     'Sc': ['$'],
     'Sk': ['^', '`'],
     'Sm': ['<', '=', '>', '+'],
     'Zs': [' ']}
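A breakdown like this can be produced with a few lines of Python. The sketch below is not the script used for the figures above: the function name is hypothetical, and the two sample vectors are made-up stand-ins rather than entries from the actual collection.

```python
import unicodedata
from collections import Counter, defaultdict

def category_stats(vectors):
    """Count characters per Unicode category and collect the distinct characters."""
    counts = Counter()
    chars = defaultdict(set)
    for vector in vectors:
        for c in vector:
            cat = unicodedata.category(c)
            counts[cat] += 1
            chars[cat].add(c)
    return counts, chars

# Hypothetical sample input, for illustration only.
sample = [
    '<script>alert(1)</script>',
    '"><img src=x onerror=alert(1)>',
]

counts, chars = category_stats(sample)
for cat, n in counts.most_common():
    print('{:>6} {} {!r}'.format(n, cat, ''.join(sorted(chars[cat]))))
```

Running the same function over your own legitimate input data shows which categories real users actually need, which is a useful sanity check before whitelisting a whole category.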
References
This selection of quotes from various application security sources demonstrates somewhat vague and/or pessimistic guidance on validating free-form text fields.

However, input validation is not always sufficient, especially when less stringent data types must be supported, such as free-form text. Consider a SQL injection scenario in which a last name is inserted into a query. The name "O'Reilly" would likely pass the validation step since it is a common last name in the English language. However, it cannot be directly inserted into the database because it contains the "'" apostrophe character, which would need to be escaped or otherwise handled. In this case, stripping the apostrophe might reduce the risk of SQL injection, but it would produce incorrect behavior because the wrong name would be recorded. (MITRE, CWE-20)

The outcome of this is that input validation is inherently unreliable. Input validation works best with extremely restricted values, e.g. when something must be an integer, or an alphanumeric string, or a HTTP URL. Such limited formats and values are least likely to pose a threat if properly validated. Other values such as unrestricted text, GET/POST arrays, and HTML are both harder to validate and far more likely to contain malicious data. (PHP Security, Input Validation)

The most difficult fields to validate are so called 'free text' fields, like blog entries. However, even those types of fields can be validated to some degree. For example, you can at least exclude all non-printable characters (except acceptable white space, e.g., CR, LF, tab, space), and define a maximum length for the input field. (OWASP, Input Validation Cheatsheet)

Whitelist validation is less prevalent and often complex to configure because defining an exact match (i.e. whitelist) for every request parameter is a daunting task. This is especially true for inputs that accept free-form text, such as textboxes. (SQL Injection Attacks and Defense, Justin Clarke)
