Rune and string types

In order to start our discussion about the rune and string types, some background context is in order. Go can treat character and string literal constants in its source code as Unicode. It is a global standard whose goal is to catalog symbols for known writing systems by assigning a numerical value (known as code point) to each character.

By default, Go inherently supports UTF-8 which is an efficient way of encoding and storing Unicode numerical values. That is all the background needed to continue with this subject. No further detail will be discussed as it is beyond the scope of this book.

The rune

So, what exactly does the rune type have to do with Unicode? The rune is an alias for the int32 type. It is specifically intended to store Unicode integer values encoded as UTF-8. Let us take a look at some rune literals in the following program:

The rune

golang.fyi/ch04/rune.go

Each variable in the previous program stores a Unicode character as a rune value. In Go, the rune may be specified as a string literal constant surrounded by single quotes. The literal may be one of the following:

  • A printable character (as shown with variables char1, char2, and char3)
  • A single character escaped with backslash for non-printable control values as tab, linefeed, newline, and so on
  • u followed by Unicode values directly (u0369)
  • x followed by two hex digits
  • A backslash followed by three octal digits (45)

Regardless of the rune literal value within the single quotes, the compiler compiles and assigns an integer value as shown by the printout of the previous variables:

$>go run runes.go
8
9
10
632
2438
35486
873
250
37              

The string

In Go, a string is implemented as a slice of immutable byte values. Once a string value is assigned to a variable, the value of that string is never changed. Typically, string values are represented as constant literals enclosed within double quotes as shown in the following example:

The string

golang.fyi/ch04/string.go

The previous snippet shows variable txt being assigned a string literal containing seven characters including two embedded Chinese characters. As referenced earlier, the Go compiler will automatically interpret string literal values as Unicode characters and encode them using UTF-8. This means that under the cover, each literal character is stored as a rune and may end up taking more than one byte for storage per visible character. In fact, when the program is executed, it prints the length of txt as 11, instead of the expected seven characters for the string, accounting for the additional bytes used for the Chinese symbols.

Interpreted and raw string literals

The following snippet (from the previous example) includes two string literals assigned to variable txt2 and txt3 respectively. As you can see, these two literals have the exact same content, however, the compiler will treat them differently:

var ( 
   txt2 = "u6C34x20bringsx20x6cx69x66x65." 
   txt3 = ` 
   u6C34x20 
   bringsx20 
   x6cx69x66x65. 
   ` 
) 

golang.fyi/ch04/string.go

The literal value assigned to variable txt2 is enclosed in double quotes. This is known as an interpreted string. An interpreted string may contain normal printable characters as well as backslash-escaped values which are parsed and interpreted as rune literals. So, when txt2 is printed, the escape values are translated as the following string:

Interpreted and raw string literals

Each symbol, in the interpreted string, corresponds to an escape value or a printable symbol as summarized in the following table:

Interpreted and raw string literals

<space>

brings

<space>

life

.

u6C34

x20

brings

x20

x6cx69x66x65

.

On the other hand, the literal value assigned to variable txt3 is surrounded by the grave accent characters ``. This creates what is known as a raw string in Go. Raw string values are uninterpreted where escape sequences are ignored and all valid characters are encoded as they appear in the literal.

When variable txt3 is printed, it produces the following output:

u6C34x20bringsx20x6cx69x66x65.

Notice that the printed string contains all the backslash-escaped values as they appear in the original string literal. Uninterpreted string literals are a great way to embed large multi-line textual content within the body of a source code without breaking its syntax.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.29.119