Input validation of free-form Unicode text in Python

2017-06-13 00:00:00 +0100


Input validation is one of the most important application security controls, and yet there’s a huge gap when it comes to implementing it for one of the most popular types of user input: free-form text with Unicode characters. This article demonstrates a simple way of dealing with Unicode text using Python. There’s quite a lot of useful guidance on implementing input validation for various data types, for example the OWASP Input Validation Cheat Sheet, which discusses strategies for validating common data types in detail and with code samples. When it comes to free-form text, however, the guidance becomes rather vague and rather discouraging, if you think about it:

The most difficult fields to validate are so called 'free text' fields, like blog entries. However, even those types of fields can be validated to some degree. For example, you can at least exclude all non-printable characters (except acceptable white space, e.g., CR, LF, tab, space), and define a maximum length for the input field.
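As a rough illustration, the quoted advice could be sketched in Python along these lines. The function name, the 1000-character limit and the choice to treat every Unicode "Other" category (plus line and paragraph separators) as non-printable are my assumptions, not something the cheat sheet prescribes. Note that rejecting format characters (category Cf) also rejects the zero-width joiners used in some scripts and emoji sequences, so a real application may need to loosen this.

```python
import unicodedata

# Assumption: the "acceptable white space" named in the quoted guidance.
ALLOWED_WHITESPACE = {"\r", "\n", "\t", " "}
# Assumption: an arbitrary limit, pick one that fits the field.
MAX_LENGTH = 1000


def is_acceptable_free_text(text, max_length=MAX_LENGTH):
    """Return True if text is short enough and has no non-printable characters."""
    if len(text) > max_length:
        return False
    for ch in text:
        if ch in ALLOWED_WHITESPACE:
            continue
        category = unicodedata.category(ch)
        # Major category "C" covers control, format, surrogate,
        # private-use and unassigned code points.
        if category.startswith("C"):
            return False
        # Zl and Zp are the Unicode line and paragraph separators.
        if category in ("Zl", "Zp"):
            return False
    return True
```

With this sketch, ordinary text in any script passes, while a NUL byte or a right-to-left override character (U+202E) causes the input to be rejected.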

The same pattern can be found in other sources on the subject (see References), including published infosec books, CWE and other relatively high-profile publications. The contrast between the ability to perform very strict validation of structured data (e.g. a date or an SSN) and the apparently less strict validation of unstructured data (like free-form text) seems to frustrate most application security authors. And it gets even more pessimistic when Unicode is involved.

But it shouldn’t. The concern with highly structured data is that, because it’s directly used in business logic, even small deviations from the assumed format may have high-impact consequences (e.g. sending money elsewhere or giving away goods too cheaply). In contrast, in many applications the free-form text inputs aren’t really critical to operations: no business logic depends on the text, and the apps are only expected to reliably display it back to users (e.g. forum comments, order notes, etc.).

Why worry?

In such an input-display data flow the primary threats are browser-level code injections, such as Cross-Site Scripting. But is it really so difficult to prevent XSS in a modern web application? I would argue it’s actually quite difficult to write an application vulnerable to XSS these days, with all the automated escaping in both server-side frameworks (e.g. Django) and client-side frameworks (e.g. AngularJS).

If you are using parametrized SQL queries and context-aware output escaping and encoding, you could theoretically have no input validation at all and still have an application that is not vulnerable to SQL injection or XSS (“could” doesn’t mean you should; that’s just for the sake of the argument). Websites such as StackOverflow do this all the time: they are full of user-submitted evil XSS vectors and yet somehow manage to display them as text rather than rendering them.

Just reiterating from a slightly different angle: you should not be using input validation to prevent XSS or SQLi, as your database API should already be completely transparent to the data stored, and your output should apply context-aware encoding.
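To make that point concrete, here is a minimal sketch using only the standard library: `sqlite3` placeholders for storage and `html.escape` for output into an HTML body. The `comments` table and the `save_comment` and `render_comments` helpers are hypothetical names invented for this illustration; in a real application the template engine of your framework would normally do the escaping for you, and other output contexts (JavaScript, URLs, CSS) need their own encoders.

```python
import html
import sqlite3


def save_comment(conn, text):
    # The "?" placeholder passes the text as data, never as SQL syntax.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))
    conn.commit()


def render_comments(conn):
    rows = conn.execute("SELECT body FROM comments ORDER BY id").fetchall()
    # html.escape encodes <, >, & and quotes for the HTML body context.
    return "\n".join("<p>{}</p>".format(html.escape(body)) for (body,) in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")

# A deliberately hostile comment, stored without any input validation.
save_comment(conn, "<script>alert('xss')</script>'); DROP TABLE comments;--")
page = render_comments(conn)
print(page)
```

The hostile string is stored verbatim as a single row and comes back as inert text, with the angle brackets and quotes encoded as HTML entities.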

However, if you are persisting data for a long time, you may want to validate free-text input to only allow data that makes sense from a business point of view. Some usage scenarios to consider: