The character encoding of all URLs is, as specified in RFC 3986, a subset of US-ASCII. US-ASCII is a seven-bit code, with characters in the range 0x00 to 0x7F. Here is a summary of the characters that are allowed and not allowed in URLs:
Excluded:
- 0x00 - 0x1F, 0x7F (control characters)
- 0x20 (space)
- these punctuation characters: " < > \ ^ ` { | }
- all non-ASCII characters
General delimiters:
- these characters: % # / : ? @ [ ]
Sub delimiters:
- these characters: ! $ & ' ( ) * + , ; =
Unreserved:
- digits, letters, and these characters: - . _ ~

When dealing with URLs, Postel’s Law should apply: “Be conservative in what you send; be liberal in what you accept.”
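The character classes in the summary above can be written down as a small lookup. This is only a sketch in Python; the names `GENERAL_DELIMITERS`, `SUB_DELIMITERS`, `UNRESERVED`, and `classify` are made up for illustration, and the sets simply mirror the summary as given here.

```python
import string

# Character classes as listed in the summary above (names are illustrative).
# Note: RFC 3986 itself treats "%" as the escape indicator rather than as
# one of its gen-delims; it is grouped here only to mirror the summary.
GENERAL_DELIMITERS = set("%#/:?@[]")
SUB_DELIMITERS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the class of a single character, per the summary above."""
    if len(ch) != 1:
        raise ValueError("classify() expects exactly one character")
    if ch in UNRESERVED:
        return "unreserved"
    if ch in SUB_DELIMITERS:
        return "sub delimiter"
    if ch in GENERAL_DELIMITERS:
        return "general delimiter"
    # Control characters, space, the listed punctuation, and all
    # non-ASCII characters fall through to here.
    return "excluded"
```

For instance, `classify("~")` gives "unreserved", while `classify(" ")` and `classify("Ä")` both give "excluded".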
Generating URLs (what you send)
When generating URLs for others’ consumption:
- Percent-encoded values should use uppercase A-F for hex digits.
- Excluded characters within the US-ASCII range should be percent-encoded as a single byte. So, for example, a space character should be encoded as “%20”. The one exception is the query string, where a space can be encoded either as “%20” or as the plus sign “+”.
- Excluded characters outside the US-ASCII range must be first encoded in UTF-8, and then percent-encoded. For example, “Ä”, Unicode U+00C4, “Latin capital letter A with diaeresis”, should be encoded as “%C3%84” and not as “%C4”, because it is first encoded as a two-byte UTF-8 sequence.
- General delimiters, when used literally (i.e. not as delimiters), must be percent-encoded, with certain exceptions described below.
- Sub delimiters should be percent encoded when their literal use would conflict with their use as delimiters. For example, in a query string, the ampersand “&” must be percent-encoded, but when in a path segment, it is okay to use it literally. When in doubt, percent-encode, because it can’t do any harm.
- Path segments can include literal characters in the sets unreserved, sub delimiters, and the two general delimiters “:” and “@”. All other characters must be percent-encoded. Note that path segments should never include a slash character “/”, even if it is percent-encoded, since user agents and servers both have problems with these.
- Query strings can include all of the same characters as path segments, and additionally the two general delimiters “/” and “?”.
http://example.com/doc@1:5/?back_uri=http://user:password@example.com/?foo%3Dbar
Note how, within the query string, the equals sign within the back_uri is percent-encoded as “%3D”.
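As a rough illustration of the generating rules, here is a minimal Python sketch built on the standard library’s `urllib.parse.quote`, which emits uppercase hex digits and UTF-8 encodes non-ASCII characters by default. The helper names `encode_path_segment` and `encode_query_value` and the two `*_SAFE` constants are assumptions made for this example, not part of any library. For query values, the sketch chooses to percent-encode “&”, “=”, and “+”, since those three conflict with their delimiter use inside a query string.

```python
from urllib.parse import quote

# Characters allowed literally in a path segment, besides the unreserved
# set (which quote() always leaves alone): sub delimiters plus ":" and "@".
PATH_SAFE = "!$&'()*+,;=:@"

# Characters allowed literally in a query value: the path set plus "/" and
# "?", minus "&", "=", and "+", which would conflict with query delimiters.
QUERY_VALUE_SAFE = "!$'()*,;:@/?"


def encode_path_segment(segment: str) -> str:
    """Percent-encode one path segment (hypothetical helper)."""
    if "/" in segment:
        # Per the guideline above, a segment should never carry a slash,
        # not even a percent-encoded one.
        raise ValueError("path segments must not contain '/'")
    return quote(segment, safe=PATH_SAFE)


def encode_query_value(value: str) -> str:
    """Percent-encode one query-string value (hypothetical helper)."""
    return quote(value, safe=QUERY_VALUE_SAFE)


url = (
    "http://example.com/"
    + encode_path_segment("doc@1:5")
    + "/?back_uri="
    + encode_query_value("http://user:password@example.com/?foo=bar")
)
```

With these choices, `url` comes out identical to the example URL above, and `encode_path_segment("Ä")` yields “%C3%84” rather than “%C4”.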
Receiving URLs (what you accept)
When parsing a URL, applications should follow these guidelines:
- Within a query string, convert “+”, wherever it occurs, into a space character. (Note that this must be done before percent-decoding.)
- Accept either upper or lowercase for hex digits within percent-encoded values.
- Convert percent encoded substrings into sequences of bytes, and then interpret those as UTF-8.
If the byte sequence is not valid UTF-8, then the application should either drop it completely, or throw an exception. For example, for this URL
http://www.ncbi.nlm.nih.gov/pubmed?term=%C4rzteblatt
the user incorrectly encodes Ä as %C4, but the single-byte sequence 0xC4 is not valid UTF-8. So the application should either drop this invalid value completely (interpret the term as “rzteblatt”), or throw an exception. The preferred behavior is to throw an exception.
- If the resultant character string contains characters outside the accepted range for the application, they should cause an exception. For example,
http://www.ncbi.nlm.nih.gov/pubmed/?term=%E8%B5%B7%E5%8F%B8%E5%A0%A1
which is valid UTF-8, but where the term decodes as “起司堡”, should either return “no items found” or result in an exception page.
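The receiving guidelines above could be sketched in Python roughly as follows. The function name `decode_query_value` and its optional `accepted` predicate are hypothetical, introduced only for this example; `urllib.parse.unquote_to_bytes` is the standard library call that turns percent-encoded substrings into bytes, and it accepts both upper and lowercase hex digits. This sketch takes the preferred behavior of raising an exception on invalid UTF-8 rather than silently dropping the bad bytes.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw, accepted=None):
    """Decode one query-string value (hypothetical helper).

    `accepted` is an optional predicate on single characters that stands in
    for whatever range the application is willing to handle.
    """
    # Step 1: "+" means space, and must be converted before percent-decoding
    # so that an encoded plus sign ("%2B") survives as a literal "+".
    raw = raw.replace("+", " ")
    # Step 2: percent-decode into bytes.
    data = unquote_to_bytes(raw)
    # Step 3: interpret the bytes strictly as UTF-8; invalid sequences
    # raise UnicodeDecodeError instead of being dropped.
    text = data.decode("utf-8")
    # Step 4: reject characters outside the application's accepted range.
    if accepted is not None:
        for ch in text:
            if not accepted(ch):
                raise ValueError(f"character outside accepted range: {ch!r}")
    return text
```

Under these assumptions, `decode_query_value("%C4rzteblatt")` raises `UnicodeDecodeError`, while `decode_query_value("%c3%84rzteblatt")` returns “Ärzteblatt”. An application that only handles, say, Latin scripts could pass a predicate such as `lambda ch: ord(ch) < 0x0250` (an arbitrary cutoff chosen for illustration), in which case the “起司堡” term would raise `ValueError`.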
