Today I needed to convert some URL segments that are stored in a MySQL database from raw UTF-8 strings (many of which contain multi-byte characters in several hundred languages) into URL-encoded strings. This is for some work I’m doing using the latest version of SilverStripe (see Making SiteTree Fields Longer in SilverStripe 3.X for more info). Since I could not find any good examples of MySQL functions already built to do this, I ended up writing my own. Code and explanations posted below the fold…
Note first of all that I did find this example on DZone. However, it was lacking in a few ways:
- The main thing: it does not handle multi-byte characters
- It encodes characters that it doesn’t need to (i.e. hyphen, period, underscore, and tilde).
- It doesn’t encode spaces correctly (according to RFC they should be %20 and not the plus symbol like the old-school days)
Because of this, I needed to write my own. I tried starting with the one from DZone as a base, but ended up basically rewriting the entire thing. This should (hopefully) be compliant with RFC 3986 that defines percent-encoding of URL segments. You can also see the Wikipedia Percent Encoding article for a little bit easier-to-read definition. Here is the entire piece of code.
Here are some interesting things to take note of regarding this implementation:
I tested it against 127,465 URL segments I have in my database in several hundred languages to see if it returned the exact same results as PHP’s
rawurlencode. It did.
- It’s probably worth mentioning that this obviously should not be used on entire URLs since it will encode reserved characters like slashes, etc. It can only be used on, for instance, individual segments (between slash separators) or individual query string values (not the entire query string or key/value pairs).
- There’s one obvious flaw that’s worth mentioning so I don’t get flamed for it: If you really do take in a string that is 4096 characters long and it contains even a single character that has to be encoded, your return value will end up botched. I suppose it will end up with a MySQL truncation warning, but I haven’t tested this. Since I’m encoding individual segments of a URL, none of them should ever be anywhere near that long. If your URLs are that long you are going to definitely have other issues unrelated to this, so I don’t feel this is a major flaw.
And now to the code itself:
DELIMITER ; DROP FUNCTION IF EXISTS urlencode; DELIMITER | CREATE FUNCTION URLENCODE(str VARCHAR(4096) CHARSET utf8) RETURNS VARCHAR(4096) CHARSET utf8 DETERMINISTIC CONTAINS SQL BEGIN -- the individual character we are converting in our loop -- NOTE: must be VARCHAR even though it won't vary in length -- CHAR(1), when used with SUBSTRING, made spaces '' instead of ' ' DECLARE sub VARCHAR(1) CHARSET utf8; -- the ordinal value of the character (i.e. ñ becomes 50097) DECLARE val BIGINT DEFAULT 0; -- the substring index we use in our loop (one-based) DECLARE ind INT DEFAULT 1; -- the integer value of the individual octet of a character being encoded -- (which is potentially multi-byte and must be encoded one byte at a time) DECLARE oct INT DEFAULT 0; -- the encoded return string that we build up during execution DECLARE ret VARCHAR(4096) DEFAULT ''; -- our loop index for looping through each octet while encoding DECLARE octind INT DEFAULT 0; IF ISNULL(str) THEN RETURN NULL; ELSE SET ret = ''; -- loop through the input string one character at a time - regardless -- of how many bytes a character consists of WHILE ind <= CHAR_LENGTH(str) DO SET sub = MID(str, ind, 1); SET val = ORD(sub); -- these values are ones that should not be converted -- see http://tools.ietf.org/html/rfc3986 IF NOT (val BETWEEN 48 AND 57 OR -- 48-57 = 0-9 val BETWEEN 65 AND 90 OR -- 65-90 = A-Z val BETWEEN 97 AND 122 OR -- 97-122 = a-z -- 45 = hyphen, 46 = period, 95 = underscore, 126 = tilde val IN (45, 46, 95, 126)) THEN -- This is not an "unreserved" char and must be encoded: -- loop through each octet of the potentially multi-octet character -- and convert each into its hexadecimal value -- we start with the high octect because that is the order that ORD -- returns them in - they need to be encoded with the most significant -- byte first SET octind = OCTET_LENGTH(sub); WHILE octind > 0 DO -- get the actual value of this octet by shifting it to the right -- so that it is at the lowest byte position - in other words, make -- the octet/byte we are working on the entire number (or in even -- other words, oct will no be between zero and 255 inclusive) SET oct = (val >> (8 * (octind - 1))); -- we append this to our return string with a percent sign, and then -- a left-zero-padded (to two characters) string of the hexadecimal -- value of this octet) SET ret = CONCAT(ret, '%', LPAD(HEX(oct), 2, 0)); -- now we need to reset val to essentially zero out the octet that we -- just encoded so that our number decreases and we are only left with -- the lower octets as part of our integer SET val = (val & (POWER(256, (octind - 1)) - 1)); SET octind = (octind - 1); END WHILE; ELSE -- this character was not one that needed to be encoded and can simply be -- added to our return string as-is SET ret = CONCAT(ret, sub); END IF; SET ind = (ind + 1); END WHILE; END IF; RETURN ret; END; | DELIMITER ;
The heart of this is the MySQL ORD function which converts the multi-byte character into an integer representation. Then I just do some bitwise operations to get each respective byte as an integer, convert that to hexadecimal, and then concatenate the string. Note that while UTF-8 can have four-byte characters (or, for the nit-picky: pre-2003, it could have six-byte characters), in this case we will only ever encounter three-byte characters since we are not using MySQL’s utf8mb4 encoding.
For you math or computer science majors out there – I’m sure you’ll know more efficient or elegant ways of doing the actual encoding – especially the bitwise operators for handling each byte of a multi-byte character. Anyway, if anyone is able to use this and it helps you, or if you notice an improvement, please leave a comment or email me. I’d be happy to get your feedback.