3.5 Strings

A string is an immutable sequence of bytes. Strings may contain arbitrary data, including bytes with value 0, but usually they contain human-readable text. Text strings are conventionally interpreted as UTF-8-encoded sequences of Unicode code points (runes), which we’ll explore in detail very soon.

The built-in len function returns the number of bytes (not runes) in a string, and the index operation s[i] retrieves the i-th byte of string s, where 0 ≤ i < len(s).

s := "hello, world"
fmt.Println(len(s))     // "12"
fmt.Println(s[0], s[7]) // "104 119"  ('h' and 'w')

Attempting to access a byte outside this range results in a panic:

c := s[len(s)] // panic: index out of range

The i-th byte of a string is not necessarily the i-th character of a string, because the UTF-8 encoding of a non-ASCII code point requires two or more bytes. Working with characters is discussed shortly.

The substring operation s[i:j] yields a new string consisting of the bytes of the original string starting at index i and continuing up to, but not including, the byte at index j. The result contains j-i bytes.

fmt.Println(s[0:5]) // "hello"

Again, a panic results if either index is out of bounds or if j is less than i.

Either or both of the i and j operands may be omitted, in which case the default values of 0 (the start of the string) and len(s) (its end) are assumed, respectively.

fmt.Println(s[:5]) // "hello"
fmt.Println(s[7:]) // "world"
fmt.Println(s[:])  // "hello, world"

The + operator makes a new string by concatenating two strings:

fmt.Println("goodbye" + s[5:]) // "goodbye, world"

Strings may be compared with comparison operators like == and <; the comparison is done byte by byte, so the result is the natural lexicographic ordering.

String values are immutable: the byte sequence contained in a string value can never be changed, though of course we can assign a new value to a string variable. To append one string to another, for instance, we can write

s := "left foot"
t := s
s += ", right foot"

This does not modify the string that s originally held but causes s to hold the new string formed by the += statement; meanwhile, t still contains the old string.

fmt.Println(s) // "left foot, right foot"
fmt.Println(t) // "left foot"

Since strings are immutable, constructions that try to modify a string’s data in place are not allowed:

s[0] = 'L' // compile error: cannot assign to s[0]

Immutability means that it is safe for two copies of a string to share the same underlying memory, making it cheap to copy strings of any length. Similarly, a string s and a substring like s[7:] may safely share the same data, so the substring operation is also cheap. No new memory is allocated in either case. Figure 3.4 illustrates the arrangement of a string and two of its substrings sharing the same underlying byte array.

The string "hello, world" and two substrings.

Figure 3.4. The string "hello, world" and two substrings.

3.5.1 String Literals

A string value can be written as a string literal, a sequence of bytes enclosed in double quotes:

"Hello, Image"

Because Go source files are always encoded in UTF-8 and Go text strings are conventionally interpreted as UTF-8, we can include Unicode code points in string literals.

Within a double-quoted string literal, escape sequences that begin with a backslash can be used to insert arbitrary byte values into the string. One set of escapes handles ASCII control codes like newline, carriage return, and tab:

a “alert” or bell
 backspace
f form feed
newline
carriage return
tab
v vertical tab
' single quote (only in the rune literal ''')
" double quote (only within "..." literals)
\ backslash

Arbitrary bytes can also be included in literal strings using hexadecimal or octal escapes. A hexadecimal escape is written xhh, with exactly two hexadecimal digits h (in upper or lower case). An octal escape is written ooo with exactly three octal digits o (0 through 7) not exceeding 377. Both denote a single byte with the specified value. Later, we’ll see how to encode Unicode code points numerically in string literals.

A raw string literal is written `...`, using backquotes instead of double quotes. Within a raw string literal, no escape sequences are processed; the contents are taken literally, including backslashes and newlines, so a raw string literal may spread over several lines in the program source. The only processing is that carriage returns are deleted so that the value of the string is the same on all platforms, including those that conventionally put carriage returns in text files.

Raw string literals are a convenient way to write regular expressions, which tend to have lots of backslashes. They are also useful for HTML templates, JSON literals, command usage messages, and the like, which often extend over multiple lines.

const GoUsage = `Go is a tool for managing Go source code.

Usage:
    go command [arguments]
...`

3.5.2 Unicode

Long ago, life was simple and there was, at least in a parochial view, only one character set to deal with: ASCII, the American Standard Code for Information Interchange. ASCII, or more precisely US-ASCII, uses 7 bits to represent 128 “characters”: the upper- and lower-case letters of English, digits, and a variety of punctuation and device-control characters. For much of the early days of computing, this was adequate, but it left a very large fraction of the world’s population unable to use their own writing systems in computers. With the growth of the Internet, data in myriad languages has become much more common. How can this rich variety be dealt with at all and, if possible, efficiently?

The answer is Unicode (unicode.org), which collects all of the characters in all of the world’s writing systems, plus accents and other diacritical marks, control codes like tab and carriage return, and plenty of esoterica, and assigns each one a standard number called a Unicode code point or, in Go terminology, a rune.

Unicode version 8 defines code points for over 120,000 characters in well over 100 languages and scripts. How are these represented in computer programs and data? The natural data type to hold a single rune is int32, and that’s what Go uses; it has the synonym rune for precisely this purpose.

We could represent a sequence of runes as a sequence of int32 values. In this representation, which is called UTF-32 or UCS-4, the encoding of each Unicode code point has the same size, 32 bits. This is simple and uniform, but it uses much more space than necessary since most computer-readable text is in ASCII, which requires only 8 bits or 1 byte per character. All the characters in widespread use still number fewer than 65,536, which would fit in 16 bits. Can we do better?

3.5.3 UTF-8

UTF-8 is a variable-length encoding of Unicode code points as bytes. UTF-8 was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. It uses between 1 and 4 bytes to represent each rune, but only 1 byte for ASCII characters, and only 2 or 3 bytes for most runes in common use. The high-order bits of the first byte of the encoding for a rune indicate how many bytes follow. A high-order 0 indicates 7-bit ASCII, where each rune takes only 1 byte, so it is identical to conventional ASCII. A high-order 110 indicates that the rune takes 2 bytes; the second byte begins with 10. Larger runes have analogous encodings.

0xxxxxx runes 0–127 (ASCII)
110xxxxx 10xxxxxx 128–2047 (values <128 unused)
1110xxxx 10xxxxxx 10xxxxxx 2048–65535 (values <2048 unused)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 65536–0x10ffff (other values unused)

A variable-length encoding precludes direct indexing to access the n-th character of a string, but UTF-8 has many desirable properties to compensate. The encoding is compact, compatible with ASCII, and self-synchronizing: it’s possible to find the beginning of a character by backing up no more than three bytes. It’s also a prefix code, so it can be decoded from left to right without any ambiguity or lookahead. No rune’s encoding is a substring of any other, or even of a sequence of others, so you can search for a rune by just searching for its bytes, without worrying about the preceding context. The lexicographic byte order equals the Unicode code point order, so sorting UTF-8 works naturally. There are no embedded NUL (zero) bytes, which is convenient for programming languages that use NUL to terminate strings.

Go source files are always encoded in UTF-8, and UTF-8 is the preferred encoding for text strings manipulated by Go programs. The unicode package provides functions for working with individual runes (such as distinguishing letters from numbers, or converting an upper-case letter to a lower-case one), and the unicode/utf8 package provides functions for encoding and decoding runes as bytes using UTF-8.

Many Unicode characters are hard to type on a keyboard or to distinguish visually from similar-looking ones; some are even invisible. Unicode escapes in Go string literals allow us to specify them by their numeric code point value. There are two forms, uhhhh for a 16-bit value and Uhhhhhhhh for a 32-bit value, where each h is a hexadecimal digit; the need for the 32-bit form arises very infrequently. Each denotes the UTF-8 encoding of the specified code point. Thus, for example, the following string literals all represent the same six-byte string:

"Image"
"xe4xb8x96xe7x95x8c"
"u4e16u754c"
"U00004e16U0000754c"

The three escape sequences above provide alternative notations for the first string, but the values they denote are identical.

Unicode escapes may also be used in rune literals. These three literals are equivalent:

'Image'  'u4e16'  'U00004e16'

A rune whose value is less than 256 may be written with a single hexadecimal escape, such as 'x41' for 'A', but for higher values, a u or U escape must be used. Consequently, 'xe4xb8x96' is not a legal rune literal, even though those three bytes are a valid UTF-8 encoding of a single code point.

Thanks to the nice properties of UTF-8, many string operations don’t require decoding. We can test whether one string contains another as a prefix:

func HasPrefix(s, prefix string) bool {
    return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

or as a suffix:

func HasSuffix(s, suffix string) bool {
    return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix
}

or as a substring:

func Contains(s, substr string) bool {
    for i := 0; i < len(s); i++ {
        if HasPrefix(s[i:], substr) {
            return true
        }
    }
    return false
}

using the same logic for UTF-8-encoded text as for raw bytes. This is not true for other encodings. (The functions above are drawn from the strings package, though its implementation of Contains uses a hashing technique to search more efficiently.)

On the other hand, if we really care about the individual Unicode characters, we have to use other mechanisms. Consider the string from our very first example, which includes two East Asian characters. Figure 3.5 illustrates its representation in memory. The string contains 13 bytes, but interpreted as UTF-8, it encodes only nine code points or runes:

A range loop decodes a UTF-8-encoded string.

Figure 3.5. A range loop decodes a UTF-8-encoded string.

import "unicode/utf8"

s := "Hello, Image"
fmt.Println(len(s))                    // "13"
fmt.Println(utf8.RuneCountInString(s)) // "9"

To process those characters, we need a UTF-8 decoder. The unicode/utf8 package provides one that we can use like this:

for i := 0; i < len(s); {
    r, size := utf8.DecodeRuneInString(s[i:])
    fmt.Printf("%d	%c
", i, r)
    i += size
}

Each call to DecodeRuneInString returns r, the rune itself, and size, the number of bytes occupied by the UTF-8 encoding of r. The size is used to update the byte index i of the next rune in the string. But this is clumsy, and we need loops of this kind all the time. Fortunately, Go’s range loop, when applied to a string, performs UTF-8 decoding implicitly. The output of the loop below is also shown in Figure 3.5; notice how the index jumps by more than 1 for each non-ASCII rune.

for i, r := range "Hello, Image" {
    fmt.Printf("%d	%q	%d
", i, r, r)
}

We could use a simple range loop to count the number of runes in a string, like this:

n := 0
for _, _ = range s {
    n++
}

As with the other forms of range loop, we can omit the variables we don’t need:

n := 0
for range s {
    n++
}

Or we can just call utf8.RuneCountInString(s).

We mentioned earlier that it is mostly a matter of convention in Go that text strings are interpreted as UTF-8-encoded sequences of Unicode code points, but for correct use of range loops on strings, it’s more than a convention, it’s a necessity. What happens if we range over a string containing arbitrary binary data or, for that matter, UTF-8 data containing errors?

Each time a UTF-8 decoder, whether explicit in a call to utf8.DecodeRuneInString or implicit in a range loop, consumes an unexpected input byte, it generates a special Unicode replacement character, 'uFFFD', which is usually printed as a white question mark inside a black hexagonal or diamond-like shape Image. When a program encounters this rune value, it’s often a sign that some upstream part of the system that generated the string data has been careless in its treatment of text encodings.

UTF-8 is exceptionally convenient as an interchange format but within a program runes may be more convenient because they are of uniform size and are thus easily indexed in arrays and slices.

A []rune conversion applied to a UTF-8-encoded string returns the sequence of Unicode code points that the string encodes:

// "program" in Japanese katakana
s := "Image"
fmt.Printf("% x
", s) // "e3 83 97 e3 83 ad e3 82 b0 e3 83 a9 e3 83 a0"
r := []rune(s)
fmt.Printf("%x
", r)  // "[30d7 30ed 30b0 30e9 30e0]"

(The verb % x in the first Printf inserts a space between each pair of hex digits.)

If a slice of runes is converted to a string, it produces the concatenation of the UTF-8 encodings of each rune:

fmt.Println(string(r)) // "Image"

Converting an integer value to a string interprets the integer as a rune value, and yields the UTF-8 representation of that rune:

fmt.Println(string(65))     // "A", not "65"
fmt.Println(string(0x4eac)) // "Image"

If the rune is invalid, the replacement character is substituted:

fmt.Println(string(1234567)) // "Image"

3.5.4 Strings and Byte Slices

Four standard packages are particularly important for manipulating strings: bytes, strings, strconv, and unicode. The strings package provides many functions for searching, replacing, comparing, trimming, splitting, and joining strings.

The bytes package has similar functions for manipulating slices of bytes, of type []byte, which share some properties with strings. Because strings are immutable, building up strings incrementally can involve a lot of allocation and copying. In such cases, it’s more efficient to use the bytes.Buffer type, which we’ll show in a moment.

The strconv package provides functions for converting boolean, integer, and floating-point values to and from their string representations, and functions for quoting and unquoting strings.

The unicode package provides functions like IsDigit, IsLetter, IsUpper, and IsLower for classifying runes. Each function takes a single rune argument and returns a boolean. Conversion functions like ToUpper and ToLower convert a rune into the given case if it is a letter. All these functions use the Unicode standard categories for letters, digits, and so on. The strings package has similar functions, also called ToUpper and ToLower, that return a new string with the specified transformation applied to each character of the original string.

The basename function below was inspired by the Unix shell utility of the same name. In our version, basename(s) removes any prefix of s that looks like a file system path with components separated by slashes, and it removes any suffix that looks like a file type:

fmt.Println(basename("a/b/c.go")) // "c"
fmt.Println(basename("c.d.go"))   // "c.d"
fmt.Println(basename("abc"))      // "abc"

The first version of basename does all the work without the help of libraries:

gopl.io/ch3/basename1
// basename removes directory components and a .suffix.
// e.g., a => a, a.go => a, a/b/c.go => c, a/b.c.go => b.c
func basename(s string) string {
    // Discard last '/' and everything before.
    for i := len(s) - 1; i >= 0; i-- {
        if s[i] == '/' {
            s = s[i+1:]
            break
        }
    }
    // Preserve everything before last '.'.
    for i := len(s) - 1; i >= 0; i-- {
        if s[i] == '.' {
            s = s[:i]
            break
        }
    }
    return s
}

A simpler version uses the strings.LastIndex library function:

gopl.io/ch3/basename2
func basename(s string) string {
    slash := strings.LastIndex(s, "/") // -1 if "/" not found
    s = s[slash+1:]
    if dot := strings.LastIndex(s, "."); dot >= 0 {
        s = s[:dot]
    }
    return s
}

The path and path/filepath packages provide a more general set of functions for manipulating hierarchical names. The path package works with slash-delimited paths on any platform. It shouldn’t be used for file names, but it is appropriate for other domains, like the path component of a URL. By contrast, path/filepath manipulates file names using the rules for the host platform, such as /foo/bar for POSIX or c:fooar on Microsoft Windows.

Let’s continue with another substring example. The task is to take a string representation of an integer, such as "12345", and insert commas every three places, as in "12,345". This version only works for integers; handling floating-point numbers is left as an exercise.

gopl.io/ch3/comma
// comma inserts commas in a non-negative decimal integer string.
func comma(s string) string {
    n := len(s)
    if n <= 3 {
        return s
    }
    return comma(s[:n-3]) + "," + s[n-3:]
}

The argument to comma is a string. If its length is less than or equal to 3, no comma is necessary. Otherwise, comma calls itself recursively with a substring consisting of all but the last three characters, and appends a comma and the last three characters to the result of the recursive call.

A string contains an array of bytes that, once created, is immutable. By contrast, the elements of a byte slice can be freely modified.

Strings can be converted to byte slices and back again:

s := "abc"
b := []byte(s)
s2 := string(b)

Conceptually, the []byte(s) conversion allocates a new byte array holding a copy of the bytes of s, and yields a slice that references the entirety of that array. An optimizing compiler may be able to avoid the allocation and copying in some cases, but in general copying is required to ensure that the bytes of s remain unchanged even if those of b are subsequently modified. The conversion from byte slice back to string with string(b) also makes a copy, to ensure immutability of the resulting string s2.

To avoid conversions and unnecessary memory allocation, many of the utility functions in the bytes package directly parallel their counterparts in the strings package. For example, here are half a dozen functions from strings:

func Contains(s, substr string) bool
func Count(s, sep string) int
func Fields(s string) []string
func HasPrefix(s, prefix string) bool
func Index(s, sep string) int
func Join(a []string, sep string) string

and the corresponding ones from bytes:

func Contains(b, subslice []byte) bool
func Count(s, sep []byte) int
func Fields(s []byte) [][]byte
func HasPrefix(s, prefix []byte) bool
func Index(s, sep []byte) int
func Join(s [][]byte, sep []byte) []byte

The only difference is that strings have been replaced by byte slices.

The bytes package provides the Buffer type for efficient manipulation of byte slices. A Buffer starts out empty but grows as data of types like string, byte, and []byte are written to it. As the example below shows, a bytes.Buffer variable requires no initialization because its zero value is usable:

gopl.io/ch3/printints
// intsToString is like fmt.Sprint(values) but adds commas.
func intsToString(values []int) string {
    var buf bytes.Buffer
    buf.WriteByte('[')
    for i, v := range values {
        if i > 0 {
            buf.WriteString(", ")
        }
        fmt.Fprintf(&buf, "%d", v)
    }
    buf.WriteByte(']')
    return buf.String()
}

func main() {
    fmt.Println(intsToString([]int{1, 2, 3})) // "[1, 2, 3]"
}

When appending the UTF-8 encoding of an arbitrary rune to a bytes.Buffer, it’s best to use bytes.Buffer’s WriteRune method, but WriteByte is fine for ASCII characters such as '[' and ']'.

The bytes.Buffer type is extremely versatile, and when we discuss interfaces in Chapter 7, we’ll see how it may be used as a replacement for a file whenever an I/O function requires a sink for bytes (io.Writer) as Fprintf does above, or a source of bytes (io.Reader).

Exercise 3.10: Write a non-recursive version of comma, using bytes.Buffer instead of string concatenation.

Exercise 3.11: Enhance comma so that it deals correctly with floating-point numbers and an optional sign.

Exercise 3.12: Write a function that reports whether two strings are anagrams of each other, that is, they contain the same letters in a different order.

3.5.5 Conversions between Strings and Numbers

In addition to conversions between strings, runes, and bytes, it’s often necessary to convert between numeric values and their string representations. This is done with functions from the strconv package.

To convert an integer to a string, one option is to use fmt.Sprintf; another is to use the function strconv.Itoa (“integer to ASCII”):

x := 123
y := fmt.Sprintf("%d", x)
fmt.Println(y, strconv.Itoa(x)) // "123 123"

FormatInt and FormatUint can be used to format numbers in a different base:

fmt.Println(strconv.FormatInt(int64(x), 2)) // "1111011"

The fmt.Printf verbs %b, %d, %u, and %x are often more convenient than Format functions, especially if we want to include additional information besides the number:

s := fmt.Sprintf("x=%b", x) // "x=1111011"

To parse a string representing an integer, use the strconv functions Atoi or ParseInt, or ParseUint for unsigned integers:

x, err := strconv.Atoi("123")             // x is an int
y, err := strconv.ParseInt("123", 10, 64) // base 10, up to 64 bits

The third argument of ParseInt gives the size of the integer type that the result must fit into; for example, 16 implies int16, and the special value of 0 implies int. In any case, the type of the result y is always int64, which you can then convert to a smaller type.

Sometimes fmt.Scanf is useful for parsing input that consists of orderly mixtures of strings and numbers all on a single line, but it can be inflexible, especially when handling incomplete or irregular input.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.118.244