2018-05-13T09:47:06-07:00

Go CBOR encoder: Episode 5, strings: bytes & unicode characters

CBOR strings are more complex than the types we have implemented so far. They come in two flavors: byte strings and Unicode strings. Byte strings are meant to encode binary content like images, while Unicode strings are for human-readable text.

We’ll start with byte strings. Here’s what the spec says:

Major type 2: a byte string. The string’s length in bytes is represented following the rules for positive integers (major type 0). For example, a byte string whose length is 5 would have an initial byte of 0b010_00101 (major type 2, additional information 5 for the length), followed by 5 bytes of binary content. A byte string whose length is 500 would have 3 initial bytes of 0b010_11001 (major type 2, additional information 25 to indicate a two-byte length) followed by the two bytes 0x01f4 for a length of 500, followed by 500 bytes of binary content.

If we encoded the 5 bytes “hello” as a CBOR byte string, we’d get something like this:

0x45    // header for byte string of size 5: (2 << 5) | 5 → 0x45
0x68 0x65 0x6C 0x6C 0x6f   // The 5 byte string hello
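
The layout above can be reproduced with a few lines of Go. This is a minimal sketch, not the encoder's actual code: shortByteString is a hypothetical helper that only handles lengths up to 23, where the length fits directly in the initial byte.

```go
package main

import "fmt"

// majorByteString is CBOR major type 2 (byte string).
const majorByteString = 2

// shortByteString is a hypothetical helper for illustration only.
// It encodes b as a CBOR byte string, but only for lengths up to 23,
// where the length is stored in the low 5 bits of the initial byte.
func shortByteString(b []byte) ([]byte, error) {
	if len(b) > 23 {
		return nil, fmt.Errorf("length %d does not fit in the initial byte", len(b))
	}
	// High 3 bits: major type. Low 5 bits: length.
	header := byte(majorByteString<<5) | byte(len(b))
	return append([]byte{header}, b...), nil
}

func main() {
	out, err := shortByteString([]byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", out) // 45 68 65 6c 6c 6f
}
```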

To encode a byte string, we write a header followed by the data. The header holds the major type (2) and the length of the string, encoded like a positive integer (we implemented this in episode 3). The data is the raw bytes, exactly as many as the length in the header says. Before we can write this header, we change the writeInteger function from episode 3: it gets a new parameter for the major type, so the caller can choose it. We also update the call to writeInteger() in Encode() to match the new signature:

func (e *Encoder) writeInteger(major byte, i uint64) error {
    switch {
    case i <= 23:
        return e.writeHeader(major, byte(i))
    case i <= 0xff:
        return e.writeHeaderInteger(major, minorPositiveInt8, uint8(i))
    case i <= 0xffff:
        return e.writeHeaderInteger(major, minorPositiveInt16, uint16(i))
    case i <= 0xffffffff:
        return e.writeHeaderInteger(major, minorPositiveInt32, uint32(i))
    default:
        return e.writeHeaderInteger(major, minorPositiveInt64, uint64(i))
    }
}

...
case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
    // we pass major type to writeInteger
    return e.writeInteger(majorPositiveInteger, x.Uint())

This change will be useful later when we implement more complex types: variable-sized CBOR types usually start with a header that carries an integer, such as a length.
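
To see what the header looks like for different major types and lengths, here is a standalone sketch. The function lengthHeader is a hypothetical stand-in for writeInteger: it returns the header bytes instead of writing them to the encoder's output, and it uses encoding/binary directly rather than the writeHeader and writeHeaderInteger helpers from the earlier episodes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// lengthHeader is a hypothetical helper for illustration. It returns
// the CBOR header for the given major type and unsigned value n,
// picking the smallest representation like writeInteger does.
func lengthHeader(major byte, n uint64) []byte {
	first := major << 5 // major type goes in the high 3 bits
	switch {
	case n <= 23:
		return []byte{first | byte(n)}
	case n <= 0xff:
		return []byte{first | 24, byte(n)}
	case n <= 0xffff:
		buf := []byte{first | 25, 0, 0}
		binary.BigEndian.PutUint16(buf[1:], uint16(n))
		return buf
	case n <= 0xffffffff:
		buf := make([]byte, 5)
		buf[0] = first | 26
		binary.BigEndian.PutUint32(buf[1:], uint32(n))
		return buf
	default:
		buf := make([]byte, 9)
		buf[0] = first | 27
		binary.BigEndian.PutUint64(buf[1:], n)
		return buf
	}
}

func main() {
	fmt.Printf("% x\n", lengthHeader(2, 5))   // 45
	fmt.Printf("% x\n", lengthHeader(2, 500)) // 59 01 f4
	fmt.Printf("% x\n", lengthHeader(0, 500)) // 19 01 f4
}
```

The last two lines differ only in the top 3 bits of the first byte, which is why a single function with a major type parameter is enough.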

Now we have to figure out how to match byte string types with reflect. Go has two distinct types that match CBOR byte strings: byte slices and byte arrays. If you don’t know the difference between a slice and an array, I recommend the splendid article from the Go blog: Go Slices: usage and internals. We’ll focus on slices first, then arrays.
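
As a quick illustration of why the two need separate handling, reflect reports a different kind for each. The helper kindOf below is hypothetical and only exists for this example.

```go
package main

import (
	"fmt"
	"reflect"
)

// kindOf is a hypothetical helper that returns the reflect kind of v.
func kindOf(v interface{}) reflect.Kind {
	return reflect.ValueOf(v).Kind()
}

func main() {
	fmt.Println(kindOf([]byte{1, 2}))  // slice
	fmt.Println(kindOf([2]byte{1, 2})) // array
}
```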

We start by adding tests based on the examples from the CBOR spec:

func TestByteString(t *testing.T) {
    var cases = []struct {
        Value    []byte
        Expected []byte
    }{
        {Value: []byte{}, Expected: []byte{0x40}},
        {Value: []byte{1, 2, 3, 4}, Expected: []byte{0x44, 0x01, 0x02, 0x03, 0x04}},
        {
            Value:    []byte("hello"),
            Expected: []byte{0x45, 0x68, 0x65, 0x6c, 0x6c, 0x6f},
        },
    }

    for _, c := range cases {
        t.Run(fmt.Sprintf("%v", c.Value), func(t *testing.T) {
            testEncoder(t, c.Value, nil, c.Expected)
        })
    }
}

Slices have their own reflect kind: reflect.Slice. We only handle slices of bytes, so we’ll have to check the slice elements’ type like this:

var exampleSlice = reflect.ValueOf([]byte{1, 2, 3})

if exampleSlice.Type().Elem().Kind() == reflect.Uint8 {
    fmt.Println("Slice of bytes")
}

We use reflect.Uint8 in the if clause because the byte type is an alias for uint8 in Go.
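
Here is a small sketch showing the check in action on a few types. The function isByteSlice is a hypothetical helper, not part of the encoder. Because the check is based on kinds, a named type whose underlying type is []byte also matches.

```go
package main

import (
	"fmt"
	"reflect"
)

// blob is a hypothetical named type used only for this example.
type blob []byte

// isByteSlice reports whether v is a slice whose elements have kind
// uint8. byte and uint8 are the same type, so both match.
func isByteSlice(v interface{}) bool {
	t := reflect.TypeOf(v)
	return t != nil && t.Kind() == reflect.Slice && t.Elem().Kind() == reflect.Uint8
}

func main() {
	fmt.Println(isByteSlice([]byte{1}))  // true
	fmt.Println(isByteSlice([]uint8{1})) // true
	fmt.Println(isByteSlice(blob{1}))    // true
	fmt.Println(isByteSlice([]int{1}))   // false
	fmt.Println(isByteSlice("abc"))      // false
}
```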

We add another case clause in Encode’s switch statement for slices and we check the slice’s elements’ type like this:

case reflect.Slice:
    if x.Type().Elem().Kind() == reflect.Uint8 {
        // byte string
    }

Now all we have left to do is write the header and the byte string into the output. We’ll add the writeByteString method to tuck all the boilerplate code away from our main switch statement:

// we add the major type for byte string
majorByteString      = 2

...

func (e *Encoder) writeByteString(s []byte) error {
    if err := e.writeInteger(majorByteString, uint64(len(s))); err != nil {
        return err
    }
    _, err := e.w.Write(s)
    return err
}

... In Encode() ...

case reflect.Slice:
    if x.Type().Elem().Kind() == reflect.Uint8 {
        return e.writeByteString(x.Bytes())
    }

A quick run of go test confirms byte slices work, but we’re not done with byte strings yet: we still have to handle arrays. It’s easier to work with slices in general, so we’ll convert arrays to slices to avoid writing array-specific code and reuse what we just wrote. We add the following code to our existing test TestByteString:

// for arrays
t.Run("array", func(t *testing.T) {
    a := [...]byte{1, 2}
    testEncoder(t, &a, nil, []byte{0x42, 1, 2})
})

Let’s add another case clause right before the case clause matching reflect.Slice:

case reflect.Array:
    // turn x into a slice
    x = x.Slice(0, x.Len())
    fallthrough
case reflect.Slice:
    ...

We create a slice from our backing array with Value.Slice(), then we run the tests and we get a surprise:

$ go test -v .
...
=== RUN   TestByteString/array
panic: reflect.Value.Slice: slice of unaddressable array [recovered]
    panic: reflect.Value.Slice: slice of unaddressable array
...

It turns out our array is “unaddressable”, and the documentation for Value.Slice() says it panics on an unaddressable array. How are we going to get out of this? reflect won’t slice the array value directly, so we need an addressable copy of it. We allocate a new array of the same type with reflect.New, which returns a pointer to it, copy our array into it, then dereference the pointer with reflect.Indirect to create our slice:

case reflect.Array:
    // Create slice from array
    var n = reflect.New(x.Type())
    n.Elem().Set(x)
    x = reflect.Indirect(n).Slice(0, x.Len())
    fallthrough
case reflect.Slice:
    ...
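
The addressability problem and the workaround can be reproduced outside the encoder. This is a sketch under the assumption that the value is a byte array; arrayToSlice is a hypothetical helper that mirrors the case clause above.

```go
package main

import (
	"fmt"
	"reflect"
)

// arrayToSlice is a hypothetical helper for illustration. It copies
// the byte array v into freshly allocated, addressable storage and
// returns its contents as a slice. It panics if v is not a byte array.
func arrayToSlice(v interface{}) []byte {
	x := reflect.ValueOf(v)
	n := reflect.New(x.Type()) // pointer to a new zero-valued array
	n.Elem().Set(x)            // copy the original array into it
	return n.Elem().Slice(0, x.Len()).Bytes()
}

func main() {
	a := [2]byte{1, 2}
	// ValueOf holds a copy of a, which is not addressable.
	fmt.Println(reflect.ValueOf(a).CanAddr()) // false
	fmt.Println(arrayToSlice(a))              // [1 2]
}
```

The price of this approach is an extra copy of the array, which seems acceptable here since it keeps a single code path for slices and arrays.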

A quick run of go test confirms this solved our issue with the unaddressable array. All TestByteString tests now pass! We’re done with byte strings; Unicode strings are next.

Text strings are like byte strings with a different major type: a header with the length of the string in bytes, followed by the data. Text data is encoded in UTF-8, which is also Go’s native string encoding, so there’s no need to re-encode it: we can write the string to the output as it is. As we did for byte strings, we add examples from the CBOR spec in a new test called TestUnicodeString:

func TestUnicodeString(t *testing.T) {
    var cases = []struct {
        Value    string
        Expected []byte
    }{
        {Value: "", Expected: []byte{0x60}},
        {Value: "IETF", Expected: []byte{0x64, 0x49, 0x45, 0x54, 0x46}},
        {Value: "\"\\", Expected: []byte{0x62, 0x22, 0x5c}},
        {Value: "\u00fc", Expected: []byte{0x62, 0xc3, 0xbc}},
        {Value: "\u6c34", Expected: []byte{0x63, 0xe6, 0xb0, 0xb4}},
    }

    for _, c := range cases {
        t.Run(fmt.Sprintf("%s", c.Value), func(t *testing.T) {
            testEncoder(t, c.Value, nil, c.Expected)
        })
    }
}
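
Note that the expected headers in these cases reflect the length in bytes, not the number of characters. The sketch below makes that visible; textHeader is a hypothetical helper that is only valid for strings shorter than 24 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// textHeader is a hypothetical helper for illustration. It returns
// the initial byte for a CBOR text string (major type 3) and is only
// valid when len(s) is at most 23.
func textHeader(s string) byte {
	// len(s) counts UTF-8 bytes, which is what CBOR expects.
	return byte(3<<5) | byte(len(s))
}

func main() {
	for _, s := range []string{"IETF", "\u00fc", "\u6c34"} {
		fmt.Printf("%q: %d bytes, %d runes, header 0x%02x\n",
			s, len(s), utf8.RuneCountInString(s), textHeader(s))
	}
}
```

The last two strings are a single character each but take 2 and 3 bytes in UTF-8, hence the headers 0x62 and 0x63.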

We add a case clause for the kind reflect.String, then we write the header with the size of our string, and finally we write the string to the output:

majorUnicodeString   = 3
...
func (e *Encoder) writeUnicodeString(s string) error {
    if err := e.writeInteger(majorUnicodeString, uint64(len(s))); err != nil {
        return err
    }
    _, err := io.WriteString(e.w, s)
    return err
}
...
case reflect.String:
    return e.writeUnicodeString(x.String())

And we are done with CBOR strings. Check out the code for this episode.

In the next episode we’ll implement signed integers, and our first composite type: array.