Tuesday, September 3, 2013

Strings? Those are easy!

Let me start with The Problem: Create random Strings, including all UTF-8 characters.

These are not exactly problems but rather testing requirements. I need to be able to test my assumptions using some sort of oracle to determine the number of bytes in a particular element of a String. The tools I have are IntelliJ, Java and the Internet.

First things first, let's research UTF-8: what is it and how does it work? The wiki article is really useful, and having read Joel's blog entry on Unicode, I think I can define it in a simple, if slightly inaccurate, way. Unicode is a superset of the older character sets, and UTF-8 is an addressing system for it that uses a variable number of bytes per character (originally up to 6, now limited to 4). Those other blogs can hit the deeper details.
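To make the "addressing system" a bit more concrete, here is a minimal sketch of how the byte count follows from the numeric value of a character. The class and method names (Utf8Length, utf8Length) are made up for illustration, and the sample values are just ones I picked.

```java
// Sketch: how many bytes UTF-8 uses for a code point, based on its numeric range.
// Utf8Length and utf8Length are hypothetical names used only for this example.
class Utf8Length {

    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (codePoint < 0x80) {
            return 1; // plain ASCII
        }
        if (codePoint < 0x800) {
            return 2; // e.g. accented Latin letters
        }
        if (codePoint < 0x10000) {
            // Note: 0xD800..0xDFFF sit in this range but are reserved for
            // surrogates and are not encoded on their own in well-formed UTF-8.
            return 3; // e.g. most Chinese characters
        }
        return 4; // everything above 65535
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F600};
        for (int codePoint : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", codePoint, utf8Length(codePoint));
        }
    }
}
```

The thing to notice is that only values of 65536 (0x10000) and up need four bytes.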

Now a bigger question is how does it work in Java? Well, that is also a complex set of technology which I thought I didn't need to know about. So I tried a simple little program (that I have simplified even more):

//generateString
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++) {
  // Pick a value in [characterRangeMin, characterRangeMax).
  int value = random.nextInt(characterRangeMax - characterRangeMin) + characterRangeMin;
  sb.append((char) value);
}
String testValue = sb.toString();

It seemed to work and I thought I was all good. Now granted I also was using a separate API for most of my random strings, but this was for cases where the API I had didn't work. However, like all software development, requirements change. I was asked to solve a new problem...

The Problem v2: Create four-byte Unicode strings randomly.

The thing is, problems are rarely singular. Let me see what else might be a problem:
  1. How do I verify a string is in fact four bytes?
  2. How do I create mixed byte characters?
  3. How do I get the hex version of the character? For that matter, can I test with the Unicode (UTF-8 in this case) hex value?
Since I knew I could do a range and I had some idea of what the starting integer of the range should be, I thought this should be easy as pie. All I had to do was verify the byte length. I happened to know that URL encoding converts each byte into a value from %00 to %FF, so all I had to do was convert a single character and look at how many "bytes" it had, by dividing the length of the URL-encoded string by 3 (each byte becomes three characters, %XX) and looking for a result of 4. So here I go again, but this time I'm going to pseudo code it:

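int length = 1;
// Start at 33 ('!'), the first printable, non-space ASCII character.
for (int rangeMin = 33; rangeMin < Integer.MAX_VALUE; rangeMin++) {
  String s = generateString(length, rangeMin, rangeMin + 1);
  // Four bytes URL-encode to 12 characters (%XX%XX%XX%XX).
  if (URLEncoder.encode(s, "UTF-8").length() > 11) {
    System.out.println(s);
    return;
  }
}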

Guess what... 15 mins and I got NOTHING. Gulp. Ummm....? I know I got Unicode characters...? I got all sorts of Chinese characters, so... why you no work? Ok, well I guess I best study Java's underlying Strings...

Having now studied Strings in Java for some time, I have come to appreciate some of the leakiness of the abstraction. It seems clear to me now that Strings are a compromise between keeping up with modern times and keeping compatibility. In particular, I would say Java seems to have sided with compatibility. So when you create a string, say "Hello World", what Java basically does behind the scenes is create an array of 16-bit values known as "char"s. Each char identifies a specific character using an integer value. The problem is that a char goes from 0 to 65535, but Unicode has 110,000 or so characters. So if the number is too big, Java internally uses 2 chars (a surrogate pair), something my code did not handle. What I was doing was casting a number to a char, which means 65536 would wrap around to the value 0 (I believe; this is untested). Instead I wanted to do this:

String character = new String(Character.toChars(i));

This turns the number into an array of one or two chars and wraps them in a String that displays as a single character. The good news is that this accomplishes what I desire; however, for my testing it does create a few interesting side effects. If you see the rough guide, you will start to see that the length of a String doesn't match what you would expect, because it doesn't count what is displayed but rather the number of 'char' values in the array that makes up the string. Also, the open source library we use doesn't really take this into account. Or perhaps I should say, it does, but not in the way you would likely think of it. They generate a set of chars and convert them into a string. From reading their code, it appears they don't take into account that the word 'length' has 3 different meanings (chars, displayed characters, and bytes), so when you ask for a String with a length of 10, they will provide you 10 chars, but it might only have 5 displayed characters, depending on the values it generates.
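Here is a small sketch of those different "lengths" for one and the same String, along with what the plain cast does to a number above 65535. The class name, the helper names, and the sample value 0x1F600 are all made up for this example. Counting code points is only an approximation of "displayed characters", since combining marks can make several code points render as one glyph.

```java
import java.nio.charset.StandardCharsets;

// Sketch: three things "length" can mean for the same String.
// StringLengths and its helper names are hypothetical, chosen for this example.
class StringLengths {

    // Builds a String holding exactly one code point (one or two chars).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Number of 16-bit char values; this is what String.length() reports.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, which is closer to what is displayed.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the String is encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // sample value above 65535, picked for illustration

        // A plain cast keeps only the low 16 bits, so it yields a different character.
        System.out.printf("cast gives U+%04X%n", (int) (char) codePoint);

        String s = fromCodePoint(codePoint);
        System.out.println("chars:       " + charCount(s));
        System.out.println("code points: " + codePointCount(s));
        System.out.println("UTF-8 bytes: " + utf8ByteCount(s));
    }
}
```

For a single character above 65535, the three counts come out as 2, 1 and 4, which is exactly the mismatch described above.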

This means I have to create my own random generator, but that is a blog post for a future time. However, I still have a few things left in my initial list. Generating mixed character sets is relatively easy now. I can create 4 byte characters as long as my random number is high enough (65536 or above). I can test the number of bytes a character has via my URL encoding method. I can write characters in code via the "\uXXXX" convention, which creates a Unicode character. This leaves only one question: how do I convert a character back into the hex-style string? To be honest, this was a less important thing, and it turns out from the research I did (admittedly limited) that this is difficult. Since it is difficult and less important, I followed the 80-20 rule and skipped that step.
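For reference, a rough sketch of how the list items could look in Java. This is not the generator promised for the future post; the class and method names are made up, and the hex helper is just one possible approach to the skipped step, assuming that printing each char as a backslash-u escape is what is wanted.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Random;

// Sketch only: FourByteStrings and its method names are hypothetical.
class FourByteStrings {

    // Builds a String of `count` random characters from U+10000..U+10FFFF.
    // Every code point in that range takes four bytes in UTF-8. Many of them
    // are unassigned, so the result may not display as anything meaningful.
    static String randomFourByteString(Random random, int count) {
        int min = Character.MIN_SUPPLEMENTARY_CODE_POINT; // 0x10000
        int max = Character.MAX_CODE_POINT;               // 0x10FFFF
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(min + random.nextInt(max - min + 1));
        }
        return sb.toString();
    }

    // Counts bytes via URL encoding: each encoded byte becomes %XX (3 characters).
    // This only holds when every byte is percent-encoded, which is true for
    // non-ASCII text but not for letters, digits, or spaces.
    static int urlEncodedByteCount(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").length() / 3;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Renders each char as a backslash-u escape with four hex digits.
    // Characters above 65535 come out as two escapes (the surrogate pair).
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = randomFourByteString(new Random(), 3);
        System.out.println("chars:   " + s.length());
        System.out.println("bytes:   " + urlEncodedByteCount(s));
        System.out.println("escapes: " + toUnicodeEscapes(s));
    }
}
```

Asking for 3 characters gives a String whose length() is 6 and whose byte count is 12, which is the 'length' problem from earlier showing up again.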
