Wednesday, July 27, 2011

Proper URL Encoding and Decoding

This is a reference post on issues around encoding and decoding URLs. I've tried to summarize everything I know into a clear and concise set of guidelines. I hope you find them useful!

The character set of all URLs is, as specified in RFC 3986, a subset of US-ASCII. US-ASCII is a seven-bit code, with characters in the range 0x00 to 0x7F. Here is a summary of the characters that are allowed and not allowed in URLs:

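Excluded
  • control characters: 0x00 - 0x1F and 0x7F
  • the space character: 0x20
  • these punctuation characters: " < > \ ^ ` { | }
  • all non-ASCII characters
General delimiters
  • these characters: # / : ? @ [ ]
  • the percent sign %, which RFC 3986 does not list as a delimiter, but which needs the same care because it introduces percent-encoded values
Sub delimiters
  • these characters: ! $ & ' ( ) * + , ; =
Unreserved
  • digits, letters, and these characters: - . _ ~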
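
The character classes above can be written down as a small lookup. This is only a sketch: the function name classify and the label strings are made up for illustration, and the percent sign gets its own label because RFC 3986 treats it as the escape character rather than as a general delimiter.

```python
import string

# Character classes from the summary above (RFC 3986 names in comments).
GEN_DELIMS = set(":/?#[]@")          # gen-delims
SUB_DELIMS = set("!$&'()*+,;=")      # sub-delims
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the class of a single character (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in SUB_DELIMS:
        return "sub-delimiter"
    if ch in GEN_DELIMS:
        return "general delimiter"
    if ch == "%":
        return "percent"
    # Everything else: controls, space, the listed punctuation, non-ASCII.
    return "excluded"


for ch in ["a", "~", "&", "/", "%", " ", "Ä"]:
    print(repr(ch), classify(ch))
```
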
When dealing with URLs, Postel’s Law should apply: “Be conservative in what you send; be liberal in what you accept.”

Generating URLs (what you send)

When generating URLs for others’ consumption:
  • Percent-encoded values should use uppercase A-F for hex digits.
  • Excluded characters within the US-ASCII range should be percent-encoded as a single byte. For example, a space character should be encoded as “%20”. Within a query string only, a space may alternatively be encoded as the plus sign “+”.
  • Excluded characters outside the US-ASCII range must first be encoded in UTF-8, and then percent-encoded. For example, “Ä”, Unicode U+00C4, “Latin capital letter A with diaeresis”, should be encoded as “%C3%84” and not as “%C4”, because it is first encoded as a two-byte UTF-8 sequence.
  • General delimiters, when used literally (i.e. not as delimiters), must be percent-encoded, with certain exceptions described below.
  • Sub delimiters should be percent-encoded when their literal use would conflict with their use as delimiters. For example, in a query string, the ampersand “&” must be percent-encoded, but in a path segment it is okay to use it literally. When in doubt, percent-encode, because it can’t do any harm.
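
As a rough illustration of the rules above, Python's standard urllib.parse module already encodes text as UTF-8 before percent-encoding and emits uppercase hex digits. The helper names encode_literal and encode_query_literal are made up for this sketch; passing safe="" makes it the "when in doubt, percent-encode" variant, where everything except unreserved characters is encoded.

```python
from urllib.parse import quote, quote_plus


def encode_literal(text: str) -> str:
    """Percent-encode everything except unreserved characters."""
    # quote() encodes to UTF-8 first, then writes each byte as %XX
    # with uppercase hex digits.
    return quote(text, safe="")


def encode_query_literal(text: str) -> str:
    """Same, but use the query-string convention of '+' for a space."""
    return quote_plus(text, safe="")


print(encode_literal("Ä"))             # %C3%84
print(encode_literal("a b"))           # a%20b
print(encode_literal("a&b"))           # a%26b
print(encode_query_literal("a b+c"))   # a+b%2Bc
```
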
The two main portions of a URL that are of concern to developers are the path segments and the query string.
  • Path segments can include literal characters in the sets unreserved, sub delimiters, and the two general delimiters “:” and “@”. All other characters must be percent-encoded. Note that path segments should never include a slash character “/”, even if it is percent-encoded, since user agents and servers both have problems with these.
  • Query strings can include all of the same characters as path segments, and additionally the two general delimiters “/” and “?”.
So, for example, the following URL is legal and unambiguous:
http://example.com/doc@1:5/?back_uri=http://user:password@example.com/?foo%3Dbar
Note how, within the query string, the equals sign within the back_uri is percent-encoded as “%3D”.
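
The example URL above can be rebuilt from its parts with a sketch like the following. The helper names and the two "safe" sets are assumptions made for illustration: the path set follows the path-segment rule above, and the query-value set drops “&”, “=” and “+” because those act as delimiters (or as a space) inside a query string.

```python
from urllib.parse import quote

# Literals allowed in a path segment: sub delimiters plus ":" and "@".
# (quote() never encodes unreserved characters.)
PATH_SEGMENT_SAFE = "!$&'()*+,;=:@"
# Literals allowed in a query value: also "/" and "?", but not the
# characters that have a delimiter role in a query string: & = +
QUERY_VALUE_SAFE = "!$'()*,;:@/?"


def encode_path_segment(segment: str) -> str:
    if "/" in segment:
        # The guideline above: never put a slash in a path segment,
        # not even a percent-encoded one.
        raise ValueError("path segments must not contain '/'")
    return quote(segment, safe=PATH_SEGMENT_SAFE)


def encode_query_value(value: str) -> str:
    return quote(value, safe=QUERY_VALUE_SAFE)


back_uri = "http://user:password@example.com/?foo=bar"
url = (
    "http://example.com/"
    + encode_path_segment("doc@1:5")
    + "/?back_uri="
    + encode_query_value(back_uri)
)
print(url)
```

With these sets, only the equals sign inside back_uri ends up percent-encoded, which matches the URL shown above.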

Receiving URLs (what you accept)

When parsing a URL, applications should follow these guidelines:
  • Within a query string, convert “+”, wherever it occurs, into a space character. This must be done before percent-decoding; otherwise a literal plus sign that was encoded as “%2B” would also be turned into a space.
  • Accept either uppercase or lowercase hex digits within percent-encoded values.
  • Convert percent-encoded substrings into sequences of bytes, and then interpret those as UTF-8.

    If the byte sequence is not valid UTF-8, then the application should either drop it completely, or throw an exception. For example, for this URL
    http://www.ncbi.nlm.nih.gov/pubmed?term=%C4rzteblatt
    the user incorrectly encodes Ä as %C4, but the single-byte sequence 0xC4 is not valid UTF-8. So the application should either drop this invalid value completely (interpret the term as “rzteblatt”), or throw an exception. The preferred behavior is to throw an exception.
  • If the resultant character string contains characters outside the accepted range for the application, they should cause an exception. For example,
    http://www.ncbi.nlm.nih.gov/pubmed/?term=%E8%B5%B7%E5%8F%B8%E5%A0%A1
    which is valid UTF-8, but where the term decodes as “起司堡”, should either cause “no items found” or should result in an exception page.
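
The receiving guidelines above could be sketched as follows, taking the "throw an exception" option in both cases. The function name decode_query_value and the is_accepted callback are hypothetical, and the Latin-only policy in the usage lines is just one example of an application's accepted range.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str, is_accepted=lambda ch: True) -> str:
    """Decode one query-string value, raising on bad input."""
    # 1. '+' means space in a query string. Do this first, so that an
    #    encoded plus sign (%2B) survives as a literal '+'.
    raw = raw.replace("+", " ")
    # 2. Percent-decode to bytes; upper- and lowercase hex both work.
    data = unquote_to_bytes(raw)
    # 3. Strict UTF-8: invalid sequences raise UnicodeDecodeError.
    text = data.decode("utf-8")
    # 4. Application policy: reject characters outside the accepted range.
    for ch in text:
        if not is_accepted(ch):
            raise ValueError(f"character U+{ord(ch):04X} not accepted")
    return text


print(decode_query_value("a+b%2Bc"))        # a b+c
print(decode_query_value("%c3%84rzte"))     # Ärzte

try:
    decode_query_value("%C4rzteblatt")      # 0xC4 alone is not valid UTF-8
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)

try:
    # Example policy (an assumption): accept only code points below U+0250.
    decode_query_value("%E8%B5%B7%E5%8F%B8%E5%A0%A1",
                       is_accepted=lambda ch: ord(ch) < 0x250)
except ValueError as exc:
    print("rejected:", exc)
```

In Python, UnicodeDecodeError is a subclass of ValueError, so a caller who wants to tell the two failures apart should catch UnicodeDecodeError first.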
