Wednesday, July 27, 2011

Proper URL Encoding and Decoding

This is a reference post on the issues around encoding and decoding URLs. I've tried to summarize everything I know into a clear and concise set of guidelines. I hope you find them useful!

As specified in RFC 3986, URLs are written using a subset of US-ASCII. US-ASCII is a seven-bit code, with characters in the range 0x00 to 0x7F. Here is a summary of the characters that are allowed and not allowed in URLs:

Excluded
0x00 - 0x1F, 0x7F (control characters),
0x20 (space),
these punctuation characters: " < > \ ^ ` { | }
All non-ASCII characters
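General delimiters
these characters: # / : ? @ [ ]
Percent sign
the character % (it introduces a percent-encoded byte, so a literal percent sign must itself be written as “%25”)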
Sub delimiters
these characters: ! $ & ' ( ) * + , ; =
Unreserved
digits, letters, and these characters: - . _ ~
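
As a rough illustration, the summary above can be written down as a small Python lookup. The set names and the classify helper are made up for this sketch. The percent sign is reported on its own here, on the assumption that it is best treated as the marker that introduces a percent-encoded byte rather than as a delimiter between URL components.

```python
import string

# Hypothetical names for this sketch; they mirror the groups listed above.
GEN_DELIMS = set("#/:?@[]")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the group that a single character falls into."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in SUB_DELIMS:
        return "sub delimiter"
    if ch in GEN_DELIMS:
        return "general delimiter"
    if ch == "%":
        return "percent sign"
    # Control characters, space, the listed punctuation, and all non-ASCII.
    return "excluded"


if __name__ == "__main__":
    for ch in ["a", "~", "&", "/", "%", " ", "{", "Ä"]:
        print(repr(ch), classify(ch))
```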
When dealing with URLs, Postel’s Law should apply: “Be conservative in what you send; be liberal in what you accept.”

Generating URLs (what you send)

When generating URLs for others’ consumption:
  • Percent-encoded values should use uppercase A-F for hex digits.
  • Excluded characters within the US-ASCII range should be percent-encoded as a single byte. So, for example, a space character should be encoded as “%20”. The one exception is the query string, where a space can be encoded either as “%20” or as the plus sign “+”.
  • Excluded characters outside the US-ASCII range must be first encoded in UTF-8, and then percent-encoded. For example, “Ä”, Unicode U+00C4, “Latin capital letter A with diaeresis”, should be encoded as “%C3%84” and not as “%C4”, because it is first encoded as a two-byte UTF-8 sequence.
  • General delimiters, when used literally (i.e. not as delimiters), must be percent-encoded, with certain exceptions described below.
  • Sub delimiters should be percent-encoded when their literal use would conflict with their use as delimiters. For example, in a query string, the ampersand “&” must be percent-encoded, but when in a path segment, it is okay to use it literally. When in doubt, percent-encode, because it can’t do any harm.
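
Here is a minimal sketch of these rules in Python, using the standard library's urllib.parse.quote and quote_plus. The helper names encode_component and encode_form_value are my own, and passing safe="" is the "when in doubt, percent-encode" choice: it leaves only the unreserved characters literal.

```python
from urllib.parse import quote, quote_plus


def encode_component(text: str) -> str:
    """Percent-encode everything except unreserved characters.

    quote() encodes the text as UTF-8 first and emits uppercase hex digits.
    """
    return quote(text, safe="")


def encode_form_value(text: str) -> str:
    """Same idea, but for query strings where a space may be written as '+'."""
    return quote_plus(text)


if __name__ == "__main__":
    print(encode_component("Ä"))            # two UTF-8 bytes, then percent-encoded
    print(encode_component("a b"))          # space as %20
    print(encode_form_value("a b"))         # space as +
    print(encode_component("Tom & Jerry"))  # ampersand encoded as %26
```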
The two main portions of a URL that are of concern to developers are the path segments and the query string.
  • Path segments can include literal characters from the unreserved and sub delimiter sets, plus the two general delimiters “:” and “@”. All other characters must be percent-encoded. Note that path segments should never include a slash character “/”, even if it is percent-encoded, since user agents and servers both have problems with these.
  • Query strings can include all of the same characters as path segments, and additionally the two general delimiters “/” and “?”.
So, for example, the following URL is legal and unambiguous:
http://example.com/doc@1:5/?back_uri=http://user:password@example.com/?foo%3Dbar
Note how, within the query string, the equals sign within the back_uri is percent-encoded as “%3D”.
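
The example URL above can be rebuilt with a short Python sketch. The constants and helper names are hypothetical, and the choice of which sub delimiters to leave literal in a query value (everything except “&”, “=” and “+”) is an assumption that fits the usual key=value&key=value layout, not something every application must follow.

```python
from urllib.parse import quote

# Hypothetical constants for this sketch.
SUB_DELIMS = "!$&'()*+,;="
# Path segments: sub delimiters plus ":" and "@" may stay literal.
PATH_SEGMENT_SAFE = SUB_DELIMS + ":@"
# Query values: additionally "/" and "?", but "&", "=" and "+" are encoded
# because they would be read as delimiters (or as a space) in a query string.
QUERY_VALUE_SAFE = "!$'()*,;" + ":@" + "/?"


def encode_path_segment(segment: str) -> str:
    """Encode one path segment; a literal slash is rejected outright."""
    if "/" in segment:
        raise ValueError("path segments should not contain '/'")
    return quote(segment, safe=PATH_SEGMENT_SAFE)


def encode_query_value(value: str) -> str:
    """Encode a single value on the right-hand side of key=value."""
    return quote(value, safe=QUERY_VALUE_SAFE)


if __name__ == "__main__":
    back_uri = "http://user:password@example.com/?foo=bar"
    url = (
        "http://example.com/"
        + encode_path_segment("doc@1:5")
        + "/?back_uri="
        + encode_query_value(back_uri)
    )
    print(url)
```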

Receiving URLs (what you accept)

When parsing a URL, applications should follow these guidelines:
  • Within a query string, convert “+”, wherever it occurs, into a space character. (Note that this must be done before percent-decoding.)
  • Accept either upper or lowercase for hex digits within percent-encoded values.
  • Convert percent-encoded substrings into sequences of bytes, and then interpret those as UTF-8.

    If the byte sequence is not valid UTF-8, then the application should either drop it completely, or throw an exception. For example, for this URL
    http://www.ncbi.nlm.nih.gov/pubmed?term=%C4rzteblatt
    the user incorrectly encodes Ä as %C4, but the single-byte sequence 0xC4 is not valid UTF-8. So the application should either drop this invalid value completely (interpret the term as “rzteblatt”), or throw an exception. The preferred behavior is to throw an exception.
  • If the resulting character string contains characters outside the range the application accepts, the application should reject it. For example,
    http://www.ncbi.nlm.nih.gov/pubmed/?term=%E8%B5%B7%E5%8F%B8%E5%A0%A1
    is valid UTF-8, but the term decodes as “起司堡”. A request like this should either return “no items found” or result in an exception page.
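
The steps above can be sketched in Python roughly as follows. This assumes the "throw an exception" option for invalid UTF-8, and it assumes, purely for illustration, an application that only accepts ASCII terms. The function names decode_query_value and require_ascii are made up; unquote_to_bytes is from the standard library and accepts hex digits in either case.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str) -> str:
    """Decode one query-string value, raising on invalid UTF-8."""
    # Step 1: '+' means space, and must be handled before percent-decoding
    # so that an encoded plus sign (%2B) survives as a literal '+'.
    with_spaces = raw.replace("+", " ")
    # Step 2: percent-decode to raw bytes.
    data = unquote_to_bytes(with_spaces)
    # Step 3: strict UTF-8 decoding; raises UnicodeDecodeError if invalid.
    return data.decode("utf-8")


def require_ascii(text: str) -> str:
    """Example application-level check: reject anything outside ASCII."""
    if not text.isascii():
        raise ValueError("term contains characters this application does not accept")
    return text


if __name__ == "__main__":
    print(decode_query_value("a+b%2Bc"))
    print(decode_query_value("%c3%84rzteblatt"))
    try:
        decode_query_value("%C4rzteblatt")
    except UnicodeDecodeError as err:
        print("rejected:", err)
```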

