© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
C. Milanesi, Beginning Rust, https://doi.org/10.1007/978-1-4842-7208-4_12

12. Data Implementation

Carlo Milanesi1  
(1)
Bergamo, Italy
 
In this chapter, you will learn:
  • How to know how many bytes of stack are taken by objects of various types

  • How to shorten the path to access functions declared in external modules

  • How bits are stored in primitive type objects

  • How to know where an object is stored in memory

  • Why padding can increase the size taken by some objects

  • How vectors are implemented

Discovering the Size of Objects

Given a source file, the Rust compiler is free to generate any machine code, as long as it behaves in the way specified by the Rust language for that source code.

Therefore, for example, given a variable, how many memory bits it uses, and where it is located in memory are not defined. The compiler could even remove that variable from memory because it is never used, or because it is kept in a processor register.

However, it is instructive to see a possible typical implementation of the arrangement of the data used by a Rust program.

For such a purpose, some Rust features are available:
print!("{} ", std::mem::size_of::<i32>());
print!("{} ", std::mem::size_of_val(&12));

This will print: 4 4.

In the first statement, the compiler enters the standard library module std (shorthand for standard), then it enters its submodule mem (shorthand for memory), and then it takes its generic function size_of.

The compiler concretizes such a generic function using the type i32, and then it generates the invocation of the resulting concrete function, without passing any arguments. Such a function returns the number of bytes (octets of bits) occupied by any object of the specified type. For more than four decades, such octets have been unambiguously named bytes. In the early years of computer technology, there were computers for which the name byte referred to shorter sequences of bits; anyway, such very old computer architectures are not supported by Rust.

Typically, such a function invocation will be inlined, so the generated code is just a constant number. Indeed, a 32-bit number occupies 4 bytes.

Notice that you may invoke this function even if the program contains no object of the specified type.

In the second statement, the compiler enters the same library module, but it accesses the generic function size_of_val (meaning size of value). In this case, the type parameter needed to concretize the generic function is inferred from the argument, so it was not required to specify it explicitly. Instead, in the first statement there were no arguments, so the type parameter was required.

When the concretized function size_of_val is invoked, an immutable reference to an object is passed to it. The function returns the size in bytes of that object.

The use Directive

If the path to reach a library function must be specified several times, it is convenient to import all or part of such path into the current scope, using the use directive.

The previous example can be rewritten in this way:
use std::mem;
print!("{} ", mem::size_of::<i32>());
print!("{} ", mem::size_of_val(&12));
or in this way:
use std::mem::size_of;
use std::mem::size_of_val;
print!("{} ", size_of::<i32>());
print!("{} ", size_of_val(&12));

The Rust use keyword is similar to the C++ using keyword.

There is an even more compact form of it:
use std::mem::*;
print!("{} ", size_of::<i32>());
print!("{} ", size_of_val(&12));

The asterisk is a wildcard that causes all the names at that level to be imported.

The Sizes of the Primitive Types

Now let's examine the sizes of objects having a primitive type:
use std::mem::*;
print!("{} {} {} {} {} {} {} {} {} {} {} {} {} {}",
    size_of::<i8>(),
    size_of::<u8>(),
    size_of::<i16>(),
    size_of::<u16>(),
    size_of::<i32>(),
    size_of::<u32>(),
    size_of::<i64>(),
    size_of::<u64>(),
    size_of::<i128>(),
    size_of::<u128>(),
    size_of::<f32>(),
    size_of::<f64>(),
    size_of::<bool>(),
    size_of::<char>());

On any supported computer, this will print: 1 1 2 2 4 4 8 8 16 16 4 8 1 4.

Some other data types have sizes that depend on the target platform of the compiler:
use std::mem::*;
print!("{} {} {} {}",
    size_of::<isize>(),
    size_of::<usize>(),
    size_of::<&i8>(),
    size_of::<&u32>());

In a 64-bit system, this will print: 8 8 8 8, while in a 32-bit system, it will print: 4 4 4 4.

The last two printed values are sizes of references. Independently of the referenced object, a reference (a.k.a. pointer) has the size of a memory address.

The Representation of Primitive Types

Rust discourages accessing the internal representation of objects, so doing it is not easy; but there is a trick:
fn as_bytes<T>(o: &T) -> &[u8] {
    unsafe {
        std::slice::from_raw_parts(
            o as *const _ as *const u8,
            std::mem::size_of::<T>())
    }
}
println!("{:?}", as_bytes(&1i8));
println!("{:?}", as_bytes(&2i16));
println!("{:?}", as_bytes(&3i32));
println!("{:?}", as_bytes(&(4i64 + 5 * 256 + 6 * 256 * 256)));
println!("{:?}", as_bytes(&'A'));
println!("{:?}", as_bytes(&true));
println!("{:?}", as_bytes(&&1i8));
In an x86_64 system, this could print:
[1]
[2, 0]
[3, 0, 0, 0]
[4, 5, 6, 0, 0, 0, 0, 0]
[65, 0, 0, 0]
[1]
[129, 165, 54, 102, 23, 86, 0, 0]

The generic function as_bytes uses some Rust constructs we haven't seen yet; they will not be explained here, because understanding them is not required to understand what the function does. It simply takes a reference to an argument of any type, and returns an object representing the sequence of bytes contained in that argument. By printing such an object, you can see the representation of any object as the sequence of bytes it stores in memory.

First, an i8 having value 1 is stored in a single byte. And that is just the same in any supported hardware architecture.

Then, an i16 having value 2 is stored as a pair of bytes, of which the first one is 2 and the second one is 0. This happens both on 32-bit and 64-bit processors, but only in the so-called little-endian hardware architectures, which are those that store multibyte numbers placing the least significant byte at the lowest address. Instead, a big-endian hardware architecture would have printed [0, 2].

Similar behavior appears in the following printed lines.

Notice that a char is stored as a 32-bit number containing the Unicode value of such character, and a bool is stored as a single byte, which is 1 for true and 0 for false.

Finally, the last statement prints the address of an i8 number. Such an address occupies eight bytes for a 64-bit processor, and differs from run to run.

Location of Bytes in Memory

You can also discover the (virtual) memory location of any object, which is its address:
let b1 = true;
let b2 = true;
let b3 = false;
print!("{} {} {}",
    &b1 as *const bool as usize,
    &b2 as *const bool as usize,
    &b3 as *const bool as usize);

In a 64-bit system, this will print three huge numbers, resembling 140729279188109 140729279188110 140729279188111. But in a 32-bit system, it will print three numbers smaller than 2^32, that is, smaller than about four billion.

Also, the constructs to get such numbers will not be explained here.

However, there is a simpler way to print the address of any object, in hexadecimal notation:
let b1 = true;
let b2 = true;
let b3 = false;
print!("{:p} {:p} {:p}", &b1, &b2, &b3);

The p format specifier is shorthand for pointer.

It will print something like:
0x7ffe16b20c8d 0x7ffe16b20c8e 0x7ffe16b20c8f
Here is a representation of the location of the three objects in the preceding code.

Absolute Address    Binary Value    Variable Name    Type
140729279188109     0000_0001       b1               bool
140729279188110     0000_0001       b2               bool
140729279188111     0000_0000       b3               bool

Each one of the three objects occupies just one byte. The first printed number is the address of the b1 variable; the second one is the address of the b2 variable; and the third one is the address of the b3 variable. As it appears, the three numbers are consecutive, and that means that the three objects are allocated in contiguous virtual memory locations.

The first object contains the true Boolean value, which is represented by the byte 1, consisting of seven bits with value zero and one bit with value one. The second object also contains the true value. Instead, the third object contains the false value, represented by eight bits with value zero.

When allocating single bytes, the Rust compiler usually lays them out sequentially and contiguously, but when allocating larger objects, their position in memory is not easily predictable.

Almost all modern processors require that elementary data be located at particular memory addresses, so Rust places its objects where they are easily accessible by the processor.

The typical alignment rule is this: “Every object of a primitive type must have an address that is a multiple of its own size.”

So, while the objects occupying just one byte can be placed anywhere, the objects occupying two bytes can be placed only at even addresses; the objects occupying four bytes can be placed only at addresses that are divisible by four; and the objects occupying eight bytes can be placed only at addresses that are a multiple of eight.

In addition, larger objects often have addresses that are a multiple of sixteen.

Therefore, such alignment requirements can create unused spaces, the so-called padding.

Sizes of Composite Data Types

The effect of padding appears when there is a sequence of composite objects:
enum E1 { E1a, E1b }
enum E2 { E2a, E2b(f64) }
use std::mem::*;
print!("{} {} {} {} {} {}",
    size_of_val(&[0i16; 80]),
    size_of_val(&(0i16, 0i64)),
    size_of_val(&[(0i16, 0i64); 100]),
    size_of_val(&E1::E1a),
    size_of_val(&E2::E2a),
    size_of_val(&vec![(0i16, 0i64); 100]));

This will print: 160 16 1600 1 16 24.

This means that:
  • An array of 80 16-bit numbers occupies 160 bytes, which is 80 * 2, so there is no waste.

  • A tuple of a 16-bit number and a 64-bit number occupies 16 bytes, as if both numbers occupied 8 bytes, so a padding of 6 bytes has been added.

  • An array of 100 16-byte tuples occupies 1600 bytes, so there is no further padding between array items; the size of each item is simply multiplied by the length of the array.

  • An enum having all its variants without data fields occupies just one byte; as long as there are no more than 256 variants, a single byte can contain a value that identifies a variant; such hidden value is usually named tag.

  • An enum whose largest variant contains an 8-byte number occupies 16 bytes, even if the current value has no data; this is because any value of a variant must occupy the same size, that is, the size of the largest possible value of that enum. In this case, the second variant has a 1-byte tag and an 8-byte numeric field; the address of this field must be a multiple of 8, so a 7-byte padding is added between the tag and the field.

  • A vector of 100 16-byte tuples looks like it occupies just 24 bytes, but of course something is missing from this measure.

Let’s see just the case of the vector.

The data placed in the stack must have a size known at compile time, so arrays can be fully allocated in the stack; while for vectors only a fixed-size header can be placed in the stack, and the remaining data must be allocated in the heap.

Vector Allocation

We saw that vectors must be implemented as a two-object structure: a stack-allocated fixed-size header, and a heap-allocated variable-length buffer.

There are theoretically several possible ways to implement a vector data structure.

One way would be to keep in the header only a pointer to the buffer.

That has the disadvantage that every time the length of the vector is needed, an extra indirection level would be required. The length of a vector is needed quite often, implicitly or explicitly, so it is better to keep that information in the header.

A naive way to implement the buffer would be to size it just large enough to keep the desired data. For example, if a vector of 9 i32 items is requested, a buffer of 9 * 4 bytes would be allocated in the heap.

As long as the vector does not grow, this is good. But if another item is pushed into the vector, the buffer must be reallocated, and heap allocations and deallocations are costly. In addition, the old buffer contents must be copied to the new buffer.

If a 1,000-item vector is constructed, an item at a time, by creating an empty vector and calling the push function 1,000 times, there will be 1,000 heap allocations, 999 heap deallocations, and 1000 * 999 / 2 == 499_500 copies of items.

To improve such awful performance, a larger buffer may be allocated, so that a reallocation is performed only when the buffer is no longer large enough.

Therefore, there is the need to track both the number of places in the allocated buffer and the number of used places in that buffer.

The number of places in the allocated buffer is usually called capacity, and that is also the name of the function used to access it. Here is some code that displays the size of the buffer of a vector:
let mut v = vec![0; 0];
println!("{} {}", v.len(), v.capacity());
v.push(11);
println!("{} {}", v.len(), v.capacity());
v.push(22);
println!("{} {}", v.len(), v.capacity());
v.push(33);
println!("{} {}", v.len(), v.capacity());
v.push(44);
println!("{} {}", v.len(), v.capacity());
v.push(55);
println!("{} {}", v.len(), v.capacity());
This will possibly print:
0 0
1 4
2 4
3 4
4 4
5 8

When an empty vector is created, it contains zero items, and it hasn’t even allocated a heap buffer, so its capacity is also zero.

When the first item is added, the vector object allocates in the heap a buffer capable of containing four 32-bit numbers (i.e., a 16-byte buffer), so its capacity is 4, but it effectively contains just one item, so its length is 1.

When three other items are added to the vector, there is no need to allocate memory, as the preallocated buffer is large enough to contain them.

But when the fifth item is added to the vector, a larger buffer must be allocated. The new buffer has a capacity of 8 items.

Notice that this growing strategy of the buffer is an implementation detail, and it may change in future implementations of the standard library.

The v object stores three subobjects in the stack: a pointer to the heap-allocated buffer, which is a memory address; the capacity of such buffer as a number of items, which is a usize number; and the length of the vector as a number of items, which is a usize number that is less than or equal to the capacity.

For this reason, the header of any vector occupies 3 * 8 == 24 bytes in any 64-bit system, and 3 * 4 == 12 bytes in any 32-bit system.

Let’s see what happens if we add one thousand items to a vector of 32-bit numbers:
let mut v = vec![0; 0];
let mut prev_capacity = usize::MAX;
for i in 0..1_000 {
    let cap = v.capacity();
    if cap != prev_capacity {
        println!("{} {} {}", i, v.len(), cap);
        prev_capacity = cap;
    }
    v.push(1);
}
This could print:
0 0 0
1 1 4
5 5 8
9 9 16
17 17 32
33 33 64
65 65 128
129 129 256
257 257 512
513 513 1024

It will also print the same for vectors whose items are of most other types.

The variable cap stores the current capacity of the vector; the variable prev_capacity stores the previous capacity of the vector, and it is initialized to a huge value.

At each iteration, before adding an item to the vector, it is checked to see if the capacity has changed. Each time the capacity changes, both the number of inserted items and the current capacity are printed.

It appears that the capacity is always a power of two, and all the powers of two are passed, just skipping the value 2. In this way, there are just 9 allocations, 8 deallocations, and 4 + 8 + 16 + 32 + 64 + 128 + 256 + 512 == 1020 copies. So, the actual standard library algorithm is much more efficient than the naive version described at the beginning of this section.

In general, filling a vector by pushing one item at a time causes a number of allocations and deallocations that is logarithmic in the total number of items; for a rather large number of items, that is much smaller than the number of items.

So, adding one item at a time is quite efficient. However, if you can estimate the number of items that the vector will contain, you can improve efficiency further by preallocating the required buffer, as this code does:
let mut v = Vec::with_capacity(800);
let mut prev_capacity = usize::MAX;
for i in 0..1_000 {
    let cap = v.capacity();
    if cap != prev_capacity {
        println!("{} {} {}", i, v.len(), cap);
        prev_capacity = cap;
    }
    v.push(1);
}
It will print:
0 0 800
801 801 1600

This program is identical to the previous one, except for the first line. The function Vec::with_capacity creates a vector with a preallocated buffer, large enough to contain the specified number of items. Here a capacity of 800 items is specified, so no allocation happens after the initialization until that capacity is exceeded. When the 801st item is pushed, the capacity is bumped to 1600.
