Size of MAX != Max of Size

How’s that for a title?

What I’m actually referring to here is the VAR* data types in SQL Server (i.e. VARBINARY, VARCHAR and NVARCHAR). For these data types you need to specify a maximum size for the column, such as VARBINARY(20) (a binary value that is, at most, 20 bytes long). The largest maximum size permitted is 8000 for VARCHAR and VARBINARY, and 4000 for NVARCHAR (whose characters take two bytes each). You can also specify a size of ‘MAX’ (e.g. VARCHAR(MAX)); however, this does not set the maximum size to 8000 or 4000. Instead, it sets the maximum for the column to 2^31-1 bytes.

Hence, the Size of MAX (2^31-1) != Max of Size (4000 or 8000)
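
To make the difference concrete, here’s a minimal sketch of a table declaring each flavour (the table and column names are just illustrative):

    CREATE TABLE dbo.SizeDemo
    (
        SmallBlob VARBINARY(20),   -- up to 20 bytes
        BigText   VARCHAR(8000),   -- largest explicit size for VARCHAR
        BigNText  NVARCHAR(4000),  -- largest explicit size for NVARCHAR
        HugeText  VARCHAR(MAX)     -- up to 2^31-1 bytes, stored differently
    );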

Pick a size, any size

So, the question then becomes “Why not just use MAX for everything?” A few reasons. First is performance: from a connectivity point of view (since that’s where I work), MAX data types need to be sent in chunks, meaning that we need to read additional metadata concerning the size of each chunk (although this is likely to be quite small compared to the total amount of data being sent). From a storage point of view, if the data is larger than 8000 bytes then it is stored “out of row”, meaning that a pointer to the data is stored in the row storage and must be dereferenced in order to read the data. This also means that the query engine* cannot simply assume that all of the data it requires is in row storage, nor can it assume that it can load all of the data from the column into memory (since there may be up to 2GB of data per MAX column per row).
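
If you want to see where a table’s data has ended up, one rough check (a sketch, using a hypothetical dbo.MyTable) is to ask SQL Server how many pages of each storage type the table is using:

    SELECT OBJECT_NAME(object_id) AS table_name,
           in_row_data_page_count,        -- ordinary row storage
           row_overflow_used_page_count,  -- variable-length data pushed off-row
           lob_used_page_count            -- MAX / LOB data stored out of row
    FROM sys.dm_db_partition_stats
    WHERE object_id = OBJECT_ID('dbo.MyTable');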

In terms of maintenance, you cannot do online index operations on MAX columns. Additionally, if you have a lot of data that grows over time past the 8000-byte limit and is pushed “out of row”, or shrinks back below 8000 bytes and is pulled into the row, then this movement will greatly increase the amount of fragmentation in your database.
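
As a sketch of the first point (the table and index names are hypothetical): a rebuild like the following has to run offline when the index involves a MAX column, so the ONLINE option fails.

    -- Fails when the index includes a MAX column,
    -- forcing an offline rebuild instead:
    ALTER INDEX IX_Users_Username ON dbo.Users
    REBUILD WITH (ONLINE = ON);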

However, the most important reason to limit the size of VAR* columns is security. For instance, imagine that you are running a website and permit users to create accounts, and also allow them to change their username once they are registered. You decide to make the ‘username’ column in your database NVARCHAR(MAX), and to limit the size of the username in your business logic. All of this would be fine, so long as your code is bug free. If, however, you have a bug that allows a user to bypass your business logic and set a username of any size, then it becomes quite easy for a malicious user to stage a denial of service attack on your website: they can simply create a few users with very long usernames and fill up your database (remember that SQL Azure only allows a 50GB database size, which is just 25 completely filled MAX columns). If you also have a page that displays usernames (e.g. high scores, lists of users online, search results) then your other users won’t be able to use those pages, as they will be attempting to download the attacker’s massive username (and the bandwidth used in the process may be costly as well). So, while the correct response to this scenario is to fix the bugs in your website, you should also be following the “Defense in Depth” principle and have protections all the way from client-side scripting through to the business logic and underlying database schema.
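
At the schema level, that last layer of defense can be as simple as an explicit size (a sketch; the table, column and limit are hypothetical):

    CREATE TABLE dbo.Users
    (
        UserId   INT IDENTITY(1,1) PRIMARY KEY,
        Username NVARCHAR(64) NOT NULL,  -- hard cap, regardless of bugs above
        CONSTRAINT CK_Users_Username_NotEmpty CHECK (LEN(Username) > 0)
    );

Even if every layer above it fails, an oversized username is now rejected by the database itself.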

There are, however, some places where even the largest explicit size (4000 characters for NVARCHAR, 8000 bytes otherwise) may be unreasonable, such as blog posts, forum content or documents stored in a CMS, and those are the cases the MAX types are for. But, where possible, try to choose an actual size for your VAR* columns.

Note on legacy types

You may also notice that SQL Server has IMAGE, TEXT and NTEXT types. These are legacy types, and you should be using the VARBINARY(MAX), VARCHAR(MAX) and NVARCHAR(MAX) types instead.
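
If you have existing columns of these legacy types, the conversion can be as simple as (a sketch with a hypothetical table and column):

    -- Convert a legacy TEXT column to its modern equivalent:
    ALTER TABLE dbo.Articles ALTER COLUMN Body VARCHAR(MAX);

though you may also want to rewrite the values afterwards (e.g. UPDATE dbo.Articles SET Body = Body) so that data small enough to fit in-row is actually moved there.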

* I have not seen the engine’s code so I can’t confirm that it does make these assumptions, although there is some evidence to suggest that specifying a size does help performance.
