Working with Unicode in Ruby

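Note: This guide is UTF-8 oriented and does not cover any other Unicode encodings. It targets Ruby 1.8: from Ruby 1.9 onward, strings carry their own encoding, $KCODE no longer has any effect, and jcode is no longer shipped.

I’m presenting this very brief guide to help Ruby users deal with Unicode, and more specifically with UTF-8.

First of all, you need to set up Ruby to work with UTF-8 strings: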

$KCODE = 'u'
require 'jcode'

The first line sets the encoding the Ruby interpreter will assume for strings; in our case we use ‘u’, which stands for UTF-8.

The second line requires the Ruby library that deals with multibyte strings such as UTF-8 ones.

The jcode library extends String with some new methods such as:

  • jcount
  • jlength
  • jsize (alias for jlength)
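
To see why a character-aware length matters, here is a minimal sketch that does not depend on jcode and uses only String#unpack: the "C*" directive yields the raw bytes and "U*" yields the UTF-8 code points. It assumes the source file is saved as UTF-8, and the variable names are only illustrative.

```ruby
# Assumes this file is saved as UTF-8.
word = "Ouvertüre"

# "C*" unpacks every byte; "ü" contributes two of them (0xC3 0xBC).
byte_length = word.unpack("C*").length  # => 10

# "U*" unpacks UTF-8 code points; "ü" contributes exactly one.
char_length = word.unpack("U*").length  # => 9
```

The byte count is one higher than the character count because “ü” takes two bytes in UTF-8.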

Let’s try counting the occurrences of “ü” in “Ouvertüre” using count and jcount:

irb(main):005:0> "Ouvertüre".count("ü")
=> 2
irb(main):006:0> "Ouvertüre".jcount("ü")
=> 1

As you can see, jcount gives the right result, whereas count returns 2 because it works on bytes and “ü” takes two bytes in UTF-8.
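
The difference between the two results can be reproduced by hand. The following is only a sketch of the idea, not the way count or jcount are implemented: it compares a byte-wise tally with a code-point-wise tally using String#unpack, and assumes the source file is saved as UTF-8.

```ruby
# Assumes this file is saved as UTF-8.
word   = "Ouvertüre"
needle = "ü"

needle_bytes = needle.unpack("C*")   # raw bytes of "ü": [195, 188]
word_bytes   = word.unpack("C*")

# Byte-wise tally: every byte of the word that belongs to the needle's bytes.
byte_hits = word_bytes.select { |b| needle_bytes.include?(b) }.length  # => 2

# Character-wise tally: compare whole code points instead of single bytes.
needle_points = needle.unpack("U*")  # [252]
char_hits = word.unpack("U*").select { |cp| needle_points.include?(cp) }.length  # => 1
```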

This library also overrides some String methods:

  • each_char
  • chop, chop!
  • delete, delete!
  • squeeze, squeeze!
  • succ, succ!
  • tr, tr!
  • tr_s, tr_s!

And adds a new one:

  • mbchar?

Wondering what this one does?

Let’s find out:

irb(main):007:0> "Ouvertüre".mbchar?
=> 6
irb(main):008:0> "Schön".mbchar?
=> 3

Uh huh! It seems to indicate where the first multibyte character is: the 7th position for the ü in Ouvertüre and the 4th for the ö in Schön, which matches the returned 6 and 3 once you consider that the index is zero-based.
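
The same numbers can be obtained with a small helper. Note that first_multibyte_offset is a hypothetical name made up for this sketch and is not part of jcode; it simply looks for the first byte outside the ASCII range, and assumes the source file is saved as UTF-8.

```ruby
# Hypothetical helper, not part of jcode: returns the zero-based offset of the
# first non-ASCII byte (>= 0x80), or nil when the string is pure ASCII.
def first_multibyte_offset(str)
  str.unpack("C*").index { |byte| byte >= 0x80 }
end

first_multibyte_offset("Ouvertüre")  # => 6
first_multibyte_offset("Schön")      # => 3
first_multibyte_offset("Ruby")       # => nil
```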

What about using upcase and downcase with UTF-8 strings?

Let’s try:

irb(main):009:0> "Ouvertüre".upcase
=> "OUVERTüRE"

Oops, that doesn’t look right!
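
The output suggests that only the ASCII letters a to z were converted, while the bytes of “ü” were left alone. As an illustration (not a description of how upcase is implemented), the same result can be reproduced with an ASCII-only translation via tr, assuming the source file is saved as UTF-8:

```ruby
# Assumes this file is saved as UTF-8.
word = "Ouvertüre"

# Translate only the ASCII range a-z; everything else is left untouched.
ascii_upcased = word.tr("a-z", "A-Z")  # => "OUVERTüRE"
```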

Time to present a new friend:

It’s a Ruby gem and it’s called… tzashaaam: unicode. Easy, eh?

OK, so let’s require it and see what happens.

(Of course, you will need to install the gem first with `gem install unicode`.)

irb(main):010:0> require 'rubygems'
=> true
irb(main):011:0> require 'unicode'
=> true

(Make sure jcode has been loaded beforehand; otherwise the unicode gem will refuse to load.)

Now we have a few more methods to use:

  • Unicode::downcase
  • Unicode::upcase
  • Unicode::normalize

Let’s see them working:

irb(main):012:0> Unicode.upcase "Ouvertüre"
=> "OUVERTÜRE"

Mmm, that’s better!

irb(main):013:0> Unicode.downcase "OUVERTÜRE"
=> "ouvertüre"

Great!

But… what’s the normalize method for?

Let’s just say Unicode normalization is beyond the scope of this brief guide.

You can read all about it on the UAX #15: Unicode Normalization Forms page (you are asking for a severe headache, though!).
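
For a taste of the problem normalization addresses, here is a minimal sketch that does not call Unicode.normalize at all. It only shows that the same visible “ü” can be stored as two different code point sequences, which therefore compare as different strings. The variable names are illustrative.

```ruby
# "ü" as a single precomposed code point (U+00FC).
composed = [0x00FC].pack("U*")

# "u" (U+0075) followed by a combining diaeresis (U+0308).
decomposed = [0x0075, 0x0308].pack("U*")

composed.unpack("C*")    # => [195, 188]
decomposed.unpack("C*")  # => [117, 204, 136]
composed == decomposed   # => false, although both display as "ü"
```

Normalization converts such variants to one agreed form so that comparisons behave as expected.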

Ok, that’s all for now!
