© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Milanesi, Beginning Rust, https://doi.org/10.1007/978-1-4842-7208-4_14

14. Using Changeable Strings

Carlo Milanesi
Bergamo, Italy

In this chapter, you will learn:
  • How static strings are implemented

  • How dynamic strings are implemented

  • How you can add characters to or remove characters from a dynamic string

  • How to convert a static string to a dynamic string, and conversely

  • How to concatenate strings

Static Strings

Are the strings that we have used so far changeable?

They may be mutable and so, in a sense, we can change them, like in this code:
let mut a = "Hel";
print!("{}", a);
a = "lo";
print!("{}", a);

This will print: Hello. But in what sense did we change it? We abruptly changed all the content of the string, not just some characters. In fact, so far we changed strings only by assigning to a string variable a string literal or another string variable.

But if we wanted to create a string either algorithmically, or by reading it from a file, or by letting the user type it in, how could we do it? Simply stated, using the kind of strings we have used so far, we cannot do that. Indeed, although these string objects can be changed to refer to other string content, they have an immutable content, that is, it is not possible to overwrite some characters or to add or remove characters in a string. Because of this, they are called static strings. The following example helps to clarify:
use std::mem::*;
let a: &str = "";
let b: &str = "0123456789";
let c: &str = "abcdè";
print!("{} {} {}",
    size_of_val(a),
    size_of_val(b),
    size_of_val(c));

This program will print: 0 10 6.

First, notice that we specified the type of the three variables. That type is &str, which means reference to str.

The str word is defined in the standard library as the type of an unmodifiable array of bytes representing a UTF-8 string. Each time the compiler parses a literal string, it stores in a static program area the characters of that string, and that area is of str type. Then the compiler uses a reference to that area as the value of the literal string expression, so any string literal is of type &str.

In the example, the size_of_val generic function is invoked on the three string variables. Remember that such a function returns the size of the object referenced by its argument. If the argument is a, which is of type &str, this function returns the size of the string buffer referenced by a, which is of type str.

So the sizes of the three buffers referred to by the variables a, b, and c are printed. Such sizes are, respectively, 0, 10, and 6 bytes. Indeed, the first string was empty, so its buffer occupies no bytes; the second string contained exactly ten digits, and each digit is a character that occupies one byte, so that string buffer occupies ten bytes; the third string contained only five characters, but the number six was printed as its length. This is because of the UTF-8 notation. In that notation, every character is represented by one or more bytes, depending on the character. The ASCII characters are represented by a single byte, while the “grave e” character, that is, è, is represented by two bytes. So, the whole string of five characters is represented by six bytes.
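The difference between the number of bytes and the number of characters can be checked directly. The following is a minimal sketch, not part of the original example: the helper name byte_and_char_counts is hypothetical, while len, chars, and len_utf8 are functions of the standard library.

```rust
/// Returns the pair (number of bytes, number of characters) of a string.
/// The function name is hypothetical, chosen only for this illustration.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    for s in ["", "0123456789", "abcdè"] {
        let (bytes, chars) = byte_and_char_counts(s);
        println!("{:?}: {} bytes, {} characters", s, bytes, chars);
    }
    // Number of bytes needed to encode a single character in UTF-8.
    println!("{} {} {}", 'e'.len_utf8(), 'è'.len_utf8(), '€'.len_utf8());
}
```

The last statement should print: 1 2 3. Notice that counting the characters requires scanning the whole buffer, while the number of bytes is already known.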

Notice that the buffers referred to by the a, b, and c variables are of the same type, which is str, but they have different lengths: 0, 10, and 6. So here, for the first time, we see a type whose size is not fixed, and so is not known at compile time.

Such types are not very common, and they have some limitations. One is that you cannot declare a variable or a function argument of such a type. Another obvious limitation is that you cannot ask for the size of the type.

Here is an example of what you cannot do with the str type:
let a: str;
fn f(a: str) {}
print!("{}", std::mem::size_of::<str>());

All three of the previous statements are illegal, so for each of them the compiler emits the error message: the size for values of type `str` cannot be known at compilation time.

But then, how can the previous program get the sizes of the buffers? In the C language, string terminators are used to mark the end of strings, but Rust has no string terminators.

Actually, the &str type is not a normal Rust reference containing just a pointer; rather, it is a pair of a pointer and a length. The pointer value is the address of the beginning of the string buffer, and the length value is the number of bytes of the string buffer.

Let’s explore this strange type in more depth, with this code:
use std::mem::*;
let a: &str = "";
let b: &str = "0123456789";
let c: &str = "abcdè";
print!("{} {} {}; ",
    size_of_val(&a),
    size_of_val(&b),
    size_of_val(&c));
print!("{} {} {}",
    size_of_val(&&a),
    size_of_val(&&b),
    size_of_val(&&c));

This program in a 64-bit system will print: 16 16 16; 8 8 8, while in a 32-bit system it will print: 8 8 8; 4 4 4.

The first print statement prints the sizes of the variables themselves, which are of type &str. Such variables result in sizes that are twice as large as that of a normal reference, as they contain a pointer object and a usize object.

The second print statement prints the sizes of references to the variables themselves, which are of type &&str. They are normal references.

When we invoke the len function on a static string, we just read the second field of that pair, without even accessing the string buffer, so this function is quite efficient.
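The two components of a static string reference can also be read separately. The following sketch assumes a hypothetical helper named fat_pointer_parts; it uses only the standard as_ptr and len functions, and it compares the size of a &str with that of a reference to a single byte.

```rust
use std::mem::size_of;

/// Returns the two components of a static string reference:
/// the address of its buffer and the number of bytes of that buffer.
/// The function name is hypothetical.
fn fat_pointer_parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let c: &str = "abcdè";
    let (address, length) = fat_pointer_parts(c);
    // The address depends on the run; the length is 6 bytes.
    println!("{:p} {}", address, length);
    // A reference to str is twice as large as a reference to a u8.
    println!("{} {}", size_of::<&str>(), size_of::<&u8>());
}
```

In a 64-bit system, the second statement should print: 16 8.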

Dynamic Strings

So, if we want to create or change the contents of a string at runtime, the &str type, which we have always used so far, is unfit.

But Rust also provides another kind of string, the dynamic string, whose content can be changed. Here is some code that uses a dynamic string:
let mut a: String = "He".to_string();
a.push('l');
a.push('l');
a.push('o');
print!("{}", a);

This will print: Hello.

The a variable is of type String, which is the type Rust uses for dynamic strings.

In Rust there are no literal dynamic strings; literal strings are always static. But a dynamic string may be constructed from a static string in several ways. One is to invoke the to_string function on a static string. The name of this function should be thought of as if it were to_dynamic_string or to_String. But the first alternative would be too long, and the second one would violate the convention of never using uppercase letters in the name of functions.

A dynamic string can be printed like any static string, as shown by the last statement of the example. But it is capable of something a static string cannot do: it can grow.

Each of the second, third, and fourth statements adds a character at the end of the string.

It is also possible to add characters in other positions inside a dynamic string, or to remove any character, as is shown by this code:
let mut a: String = "Xy".to_string(); // "Xy"
a.remove(0); // "y"
a.insert(0, 'H'); // "Hy"
a.pop(); // "H"
a.push('i'); // "Hi"
print!("{}", a);

This prints: Hi.

The a variable is initialized to contain “Xy”. Then the character at position 0 is removed, leaving “y”. Then an “H” is inserted at position 0, obtaining a “Hy”. Then the last character is popped from the end, leaving “H”. Then an “i” is pushed at the end, obtaining the final “Hi”.
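The functions remove and pop also return what they take out of the string. The following sketch replays the same edits and keeps those return values; the helper name replay_edits is hypothetical.

```rust
/// Replays the edits of the previous example and returns the removed
/// character, the popped character, and the final string.
/// The function name is hypothetical.
fn replay_edits() -> (char, Option<char>, String) {
    let mut a = "Xy".to_string();
    let removed = a.remove(0); // returns the removed character
    a.insert(0, 'H');
    let popped = a.pop(); // returns an Option<char>
    a.push('i');
    (removed, popped, a)
}

fn main() {
    let (removed, popped, a) = replay_edits();
    println!("{} {:?} {}", removed, popped, a);

    // Popping from an empty string returns None instead of failing.
    let mut empty = String::new();
    println!("{:?}", empty.pop());
}
```

This should print X Some('y') Hi, followed by None. Notice that the positions passed to remove and insert are byte positions, and that these functions panic if the position does not lie on a character boundary.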

Implementation of String

While a Rust static string is somewhat similar to a C language string, without a string terminator but with an additional counter, a Rust dynamic string is quite similar to a C++ std::string object. Both Rust and C++ dynamic string types contain a dynamically allocated array of bytes, which contains the characters of the string.

The main difference between Rust and C++ dynamic string types is that while a C++ string prescribes no encoding, so that each byte of its buffer is usually treated as exactly one character, the buffer of any Rust dynamic string, like that of any Rust static string, is guaranteed to use the UTF-8 encoding; so a byte of it does not necessarily correspond to a character.

Remaining in the Rust language, there are similarities of strings with arrays, and with vectors. While static string buffers are similar to arrays, that is, the str type is similar to the generic [u8; N] type, dynamic strings are similar to vectors of bytes, that is, the String type is similar to the Vec<u8> type.

Indeed, the functions we saw previously (push, pop, insert, and remove, and also the len function) have the same names as the corresponding functions of the Vec generic type.

In addition, both dynamic strings and vectors have the same implementation. Both are structures consisting of three fields:
  • The address of the beginning of the heap-allocated buffer containing the data items

  • The number of items that may be contained in the allocated buffer

  • The number of items presently used in the allocated buffer

However, notice that for strings, such “items” are bytes, not characters, as this code shows:
let mut s1 = "".to_string();
s1.push('e');
let mut s2 = "".to_string();
s2.push('è');
let mut s3 = "".to_string();
s3.push('€');
print!("{} {}; ", s1.capacity(), s1.len());
print!("{} {}; ", s2.capacity(), s2.len());
print!("{} {}", s3.capacity(), s3.len());

This may print: 8 1; 8 2; 8 3. That means that in all these cases the allocated buffer is eight bytes long. And the ASCII character e occupies just one byte in that buffer; the accented character è occupies two bytes; and the currency symbol € occupies three bytes. The number of bytes occupied by the characters is because of the UTF-8 standard, while the size of the buffers is dependent on the implementation of the Rust standard library. It has changed in previous versions of Rust, and it may change in future versions.
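The three-field structure can be observed from the size of a String object, and the distinction between allocated and used bytes can be made explicit by asking for a buffer in advance. The following is a sketch under the assumption that the current standard library layout is used; the helper name used_and_allocated is hypothetical, while with_capacity is a standard function.

```rust
use std::mem::size_of;

/// Returns the pair (bytes presently used, bytes allocated) of a string.
/// The function name is hypothetical.
fn used_and_allocated(s: &String) -> (usize, usize) {
    (s.len(), s.capacity())
}

fn main() {
    // The String object itself: three pointer-sized fields, like Vec<u8>.
    println!("{} {}", size_of::<String>(), size_of::<Vec<u8>>());

    // with_capacity allocates at least the requested number of bytes,
    // but the string is still empty.
    let mut s = String::with_capacity(10);
    let (used, allocated) = used_and_allocated(&s);
    println!("{} {}", used, allocated >= 10);

    // The currency symbol uses three bytes of the allocated buffer.
    s.push('€');
    let (used, allocated) = used_and_allocated(&s);
    println!("{} {}", used, allocated >= 10);
}
```

In a 64-bit system, with current versions of the standard library, this should print 24 24, then 0 true, then 3 true.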

Let’s see what happens when several characters are added to a dynamic string, one at a time:
let mut s1 = "".to_string();
for _ in 0..16 {
    println!("{:p} {} {}",
        s1.as_ptr(), s1.capacity(), s1.len());
    s1.push('a');
}
let s2 = "x".to_string();
s1.push('-');
println!("{:p}", s2.as_ptr());
println!("{:p} {} {}: {}",
    s1.as_ptr(), s1.capacity(), s1.len(), s1);
This, in a 64-bit system, may print:
0x1 0 0
0x55f7d528f9d0 8 1
0x55f7d528f9d0 8 2
0x55f7d528f9d0 8 3
0x55f7d528f9d0 8 4
0x55f7d528f9d0 8 5
0x55f7d528f9d0 8 6
0x55f7d528f9d0 8 7
0x55f7d528f9d0 8 8
0x55f7d528f9d0 16 9
0x55f7d528f9d0 16 10
0x55f7d528f9d0 16 11
0x55f7d528f9d0 16 12
0x55f7d528f9d0 16 13
0x55f7d528f9d0 16 14
0x55f7d528f9d0 16 15
0x55f7d528f9f0
0x55f7d528fa10 32 17: aaaaaaaaaaaaaaaa-

The as_ptr function (to be read “as pointer”) returns the address of the heap-allocated buffer containing the characters of the string.

Notice that when the s1 string is empty, the capacity is zero, meaning that there are no allocated bytes. As no buffer is allocated, the address of such buffer is simply 1, which is an invalid memory address.

When one ASCII character is added to s1, an 8-byte buffer is allocated at an address represented by the hexadecimal number 55f7d528f9d0.

When the next seven characters are added, no reallocation is required, because the allocated buffer is large enough.

When the ninth character is added, a reallocation is required, but, as the memory immediately following the buffer of s1 is still free, the buffer may simply be extended to 16 bytes. This avoids the overhead of allocating a new buffer, copying the eight used bytes, and deallocating the previous buffer.

This in-place extension could go on for a long time, because this program makes no other allocations, so the whole address space after the buffer is available. To force a move, just before adding the seventeenth character to the s1 string, the s2 variable is declared and initialized. It allocates a buffer containing only the letter x, and in this run that buffer was placed just after the current buffer for the s1 string. So, when the buffer for the s1 string must be extended to 32 bytes, it cannot grow in place and has to be reallocated elsewhere.

This can be seen by looking at the addresses printed by the program. Remember that 20 in hexadecimal is 32 in decimal. The buffer for s2 begins 32 bytes after the position where the buffer for s1 originally began. After the last push on s1, its buffer has been moved 32 bytes after the position where the buffer for s2 begins.
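To see only the moments in which the capacity grows, without printing every step, the capacity can be compared before and after each push. This is a minimal sketch; the function name capacity_changes is hypothetical, and the capacities it reports depend on the version of the standard library.

```rust
/// Pushes `n` ASCII characters one at a time, and records the pair
/// (length, capacity) at the start and each time the capacity changes.
/// The function name is hypothetical.
fn capacity_changes(n: usize) -> Vec<(usize, usize)> {
    let mut s = String::new();
    let mut changes = vec![(s.len(), s.capacity())];
    for _ in 0..n {
        let old_capacity = s.capacity();
        s.push('a');
        if s.capacity() != old_capacity {
            changes.push((s.len(), s.capacity()));
        }
    }
    changes
}

fn main() {
    // The exact capacities are an implementation detail.
    for (len, capacity) in capacity_changes(100) {
        println!("len {:3} -> capacity {}", len, capacity);
    }
}
```

Whatever the growth policy is, the capacity never decreases while characters are pushed, and it is never less than the length.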

Creating Dynamic Strings

There are several ways to create an empty dynamic string.
let s1 = String::new();
let s2 = String::from("");
let s3 = "".to_string();
let s4 = "".to_owned();
let s5 = format!("");
print!("({}{}{}{}{})", s1, s2, s3, s4, s5);

This will print: ().

The new function of the String type is the basic constructor, similar to a default constructor in C++.

The from function of the String type is the converter constructor, similar to a non-default constructor in C++.

The functions to_string and to_owned are now interchangeable for static strings, but both exist because historically they were somewhat different.

The format macro is identical to the print macro, with the only difference that while the latter sends its result to the console, the former returns a String object containing the result.

Except for the new function, all the previous ways to create a dynamic string also can be used to convert a nonempty static string to a dynamic string. Here are some examples of such conversions:
let s = "a,";
let s1 = String::from(s);
let s2 = s.to_string();
let s3 = s.to_owned();
//let s4 = format! (s);
//let s5 = format!("a,{}");
let s6 = format!("{}", s);
print!("({}{}{}{})", s1, s2, s3, s6);

This will print: (a,a,a,a,).

Instead, the statements in the fifth and sixth lines would generate compilation errors.

For the fifth line, the error message format argument must be a string literal is printed. Indeed, the format macro, like the print and println macros, requires that the first argument is a literal.

For the sixth line, the error message 1 positional argument in format string, but no arguments were given is printed. This is because the first argument of the format macro must contain as many placeholders as the successive arguments to the macro.

Converting Between Static Strings and Dynamic Strings

We have already seen how to convert a static string to a dynamic string, that is, to get an object of type String whose content is equal to that of a given object of type &str. Is the reverse conversion possible, from a dynamic string to a static string?

Yes, with this code:
let s1: String = "abc".to_string();
let s2: &String = &s1;
let s3: &str = &s1;
println!("{} {} {}", s1, s2, s3);

It will print: abc abc abc.

In the first line, a dynamic string is declared and initialized.

In the second line, a reference to it is assigned to a variable whose type is that of reference to String.

In the third line, a reference to the dynamic string is assigned to a variable whose type is that of reference to str, that is, to a static string.

It is allowed to initialize a variable of type &str with an expression of type &String, because the standard library defines such types in a way to allow an implicit conversion from the latter to the former.

Notice that the two kinds of conversion are quite different.

A conversion from a static string to a dynamic string creates a distinct object, which allocates a distinct buffer and copies into it all the characters of the static string.

A conversion from a dynamic string to a static string, instead, creates only a reference to the buffer of the dynamic string. No characters are copied; they are shared between the two strings.
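The difference between copying and sharing can be checked by comparing the addresses of the buffers. The following sketch uses the standard as_ptr function; the helper byte_count is hypothetical, and it is there only to show that a function expecting a &str also accepts a reference to a String.

```rust
/// Counts the bytes of any string, static or dynamic.
/// Hypothetical helper: it takes a &str, so a &String is converted implicitly.
fn byte_count(s: &str) -> usize {
    s.len()
}

fn main() {
    let literal: &str = "abc";
    let owned: String = literal.to_string(); // copies the bytes to the heap
    let view: &str = &owned; // borrows the heap buffer

    // Distinct buffers: the copy lives at another address.
    println!("{}", literal.as_ptr() == owned.as_ptr());
    // Shared buffer: the view refers to the buffer of the dynamic string.
    println!("{}", view.as_ptr() == owned.as_ptr());
    println!("{} {}", byte_count(literal), byte_count(&owned));
}
```

This should print false, then true, then 3 3.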

Concatenating Strings

A dynamic string can be obtained also by concatenating two static strings, two dynamic strings, or a dynamic string and a static string. Here are some examples:
let ss1 = "He";
let ss2 = "llo ";
let ds1 = ss1.to_string();
let ds2 = ss2.to_string();
let ds3 = format!("{}{}", ss1, ss2);
print!("{}", ds3);
let ds3 = format!("{}{}", ss1, ds2);
print!("{}", ds3);
let ds3 = format!("{}{}", ds1, ss2);
print!("{}", ds3);
let ds3 = format!("{}{}", ds1, ds2);
print!("{}", ds3);

This will print: Hello Hello Hello Hello.

First, the two static strings ss1 and ss2 are declared. Then, they are converted to the dynamic strings ds1 and ds2. Then, the format macro is used four times to concatenate any combination of such strings: static-static, static-dynamic, dynamic-static, and dynamic-dynamic. The result is the same in all four cases.

Often, it is desired to have a loop that appends many strings to another string, which of course must be mutable. This is possible using the format macro, in the following way:
let vs = ["Hello", ", ", "world", "!"];
let mut result = String::new();
for s in vs {
    result = format!("{}{}", result, s);
}
print!("{}", result);

It will print: Hello, world!.

In each iteration, the previous result is concatenated with the new string piece, and the result becomes the new partial result.

This is inefficient, though, because every iteration allocates a new string, copies into it the whole partial result, and destroys the previous string.

This is a better way:
let vs = ["Hello", ", ", "world", "!"];
let mut result = String::new();
for s in vs {
    result.push_str(s);
}
print!("{}", result);

The function push_str takes a static string and pushes all its characters to the end of the receiving string. It may be more efficient because, if the appended characters can be contained in the already allocated buffer, no reallocation is required.
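When the pieces are known in advance, even the few reallocations of push_str can be avoided by allocating the whole buffer before the loop. The following is a sketch of that idea, not a technique taken from the examples above; the function name concat_all is hypothetical, while with_capacity and push_str are standard functions.

```rust
/// Concatenates all the pieces, asking for the whole buffer in advance.
/// The function name is hypothetical.
fn concat_all(pieces: &[&str]) -> String {
    // Total number of bytes needed for the result.
    let total: usize = pieces.iter().map(|s| s.len()).sum();
    let mut result = String::with_capacity(total);
    for s in pieces {
        // The buffer is already large enough, so no reallocation is needed.
        result.push_str(s);
    }
    result
}

fn main() {
    let vs = ["Hello", ", ", "world", "!"];
    let result = concat_all(&vs);
    println!("{} {}", result, result.len());
}
```

This should print: Hello, world! 13. For a few short strings the gain is negligible; it may matter when many pieces are appended.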

To make the code even shorter, the function push_str can be replaced by the equivalent += operator, in this way:
let vs = ["Hello", ", ", "world", "!"];
let mut result = String::new();
for s in vs {
    result += s;
}
print!("{}", result);

Here the += operator is a kind of syntactic sugar for a call to push_str.

It’s possible also to append dynamic strings or single characters:
let comma = ", ".to_string();
let world = "world".to_string();
let excl_point = '!';
let mut result = "Hello".to_string();
result += &comma;
result.push_str(&world);
result.push(excl_point);
print!("{}", result);

This program is equivalent to the previous ones.

Notice, at the fifth and sixth lines, that a dynamic string is passed as an argument of += and push_str, respectively. However, both expect a static string, so our dynamic strings have to be converted to static strings beforehand. This is obtained by using the & operator.

Finally, a single character is appended to the result string using push.
