Parsing byte streams

Sometimes, you might want to parse byte streams or byte slices to extract data from them, for example, parsing a TCP byte stream to get HTTP data. Thanks to Rust and the Nom crate, we have an extremely efficient parser combinator library that works on borrowed input, so it does not add extra overhead by copying data within your crate.

With the Nom crate, you create functions that will read the input data byte by byte and return the parsed data. The aim in this section is not to master the Nom crate, but to understand its power and point you to the appropriate documentation. So, let's see the adapted example from Zbigniew Siciarz's 24 days of Rust (https://siciarz.net/24-days-rust-nom-part-1/), where he showed a short example of how to parse the first line of the HTTP protocol. You can read more complex tutorials on his blog.

Let's first define what the first line of the protocol looks like:

let first_line = b"GET /home/ HTTP/1.1\r\n";

As you can see, the first_line variable is a byte string (denoted by the b before the string literal). The first word is the method, in this case GET, but it could be POST, PUT, DELETE, or any of the other HTTP methods. We will stick to these four for simplicity. Next comes the URL the client is requesting and, finally, the HTTP protocol version, which is 1.1 in this case. The line ends with a carriage return and a newline (\r\n).

Nom uses a macro called named!(), where you define a parser function. The name of the macro comes from the fact that you are giving a name to the function and then its implementation.

If we want to start checking the first HTTP line, we will need to parse the request method. To do that, we have to tell the parser that the line can start with any of the possible request methods. We can do this by using the alt!() macro with multiple tag!() macros, one per method. Let's add Nom to our Cargo.toml file and start coding the method parsing:

#[macro_use]
extern crate nom;

// Match exactly one of the four supported request methods.
named!(parse_method,
    alt!(
        tag!("GET") |
        tag!("POST") |
        tag!("PUT") |
        tag!("DELETE")
    )
);

fn main() {
    let first_line = b"GET /home/ HTTP/1.1\r\n";
    println!("{:?}", parse_method(&first_line[..]));
}

This will output the following:

Ok(([32, 47, 104, 111, 109, 101, 47, 32, 72, 84, 84, 80, 47, 49, 46, 49, 13, 10], [71, 69, 84]))

What is happening here? This seems like just a bunch of numbers, one after the other. Well, as we mentioned earlier, Nom works byte by byte, and does not care (unless we tell it) about the string representation of things. In this case, it has correctly found a GET, bytes 71, 69, and 84 in ASCII, and the rest is still not parsed. It returns a tuple with the unparsed data first and the parsed data second.
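To make the shape of that result concrete, here is a minimal hand-written sketch of the same idea using only the standard library. It is not how Nom implements tag!() or alt!(): the tag() and method() functions are hypothetical stand-ins, and they return an Option instead of Nom's result type. What it does show is that both halves of the tuple are slices borrowed from the input, so nothing is copied.

```rust
/// Hypothetical stand-in for a `tag!` parser: if `input` starts with `tag`,
/// return (remaining input, matched bytes); otherwise return None.
fn tag<'a>(input: &'a [u8], tag: &[u8]) -> Option<(&'a [u8], &'a [u8])> {
    if input.starts_with(tag) {
        // split_at only creates two sub-slices of the same buffer.
        let (matched, rest) = input.split_at(tag.len());
        Some((rest, matched))
    } else {
        None
    }
}

/// Hypothetical stand-in for `alt!`: try each alternative in order and
/// keep the first one that matches.
fn method<'a>(input: &'a [u8]) -> Option<(&'a [u8], &'a [u8])> {
    const METHODS: [&[u8]; 4] = [b"GET", b"POST", b"PUT", b"DELETE"];
    METHODS.iter().find_map(|m| tag(input, *m))
}

fn main() {
    let first_line = b"GET /home/ HTTP/1.1\r\n";
    // Unparsed data first, parsed data second, as in the Nom output.
    println!("{:?}", method(&first_line[..]));
}
```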

We can tell Nom that we want to read the actual GET string by mapping the result to the str::from_utf8 function. Let's change the parser accordingly:

use std::str; // brings str::from_utf8 into scope

named!(parse_method<&[u8], &str>,
    alt!(
        map_res!(tag!("GET"), str::from_utf8) |
        map_res!(tag!("POST"), str::from_utf8) |
        map_res!(tag!("PUT"), str::from_utf8) |
        map_res!(tag!("DELETE"), str::from_utf8)
    )
);

As you can see, apart from adding the map_res!() macro, I had to specify that the parse_method returns &str after parsing the input, since Nom assumes that your parsers will return byte slices by default. This will output the following:

Ok(([32, 47, 104, 111, 109, 101, 47, 32, 72, 84, 84, 80, 47, 49, 46, 49, 13, 10], "GET"))

We can even create an enumeration and map it directly, as you can see here:

#[derive(Debug)]
enum Method {
    Get,
    Post,
    Put,
    Delete,
}

impl Method {
    fn from_bytes(b: &[u8]) -> Result<Self, String> {
        match b {
            b"GET" => Ok(Method::Get),
            b"POST" => Ok(Method::Post),
            b"PUT" => Ok(Method::Put),
            b"DELETE" => Ok(Method::Delete),
            _ => {
                let error = format!("invalid method: {}",
                                    str::from_utf8(b)
                                        .unwrap_or("not UTF-8"));
                Err(error)
            }
        }
    }
}

named!(parse_method<&[u8], Method>,
    alt!(
        map_res!(tag!("GET"), Method::from_bytes) |
        map_res!(tag!("POST"), Method::from_bytes) |
        map_res!(tag!("PUT"), Method::from_bytes) |
        map_res!(tag!("DELETE"), Method::from_bytes)
    )
);

We can combine multiple parsers and create variables in one parser that will be reused in the next one. This is useful, for example, when some parts of the data contain information for parsing the rest. This is the case with the HTTP Content-Length header, which lets you know how much you should parse later. Let's combine parsers in this way to parse the complete request line:

use std::str;

#[derive(Debug)]
struct Request {
    method: Method,
    url: String,
    version: String,
}

named!(parse_request<&[u8], Request>, ws!(do_parse!(
    method: parse_method >>
    // The URL runs up to the next space.
    url: map_res!(take_until!(" "), str::from_utf8) >>
    tag!("HTTP/") >>
    // The version runs up to the carriage return that ends the line.
    version: map_res!(take_until!("\r"), str::from_utf8) >>
    (Request {
        method,
        url: url.to_owned(),
        version: version.to_owned()
    })
)));

fn main() {
    let first_line = b"GET /home/ HTTP/1.1\r\n";
    println!("{:?}", parse_request(&first_line[..]));
}

Let's see what's happening here. We created the structure to store the line data and then created a parser by using the ws!() macro (which automatically consumes whitespace between tokens). The do_parse!() macro lets us chain a sequence of parsers and bind their results to names.

We call the parse_method() parser we just created for the request method and then we just store the other two strings as variables. We then just need to create the structure with the variables. Note that I also changed the call in the main() function. Let's see the result:

Ok(([], Request { method: Get, url: "/home/", version: "1.1" }))

As we can see, there are no more bytes to parse, and the Request structure has been properly generated. You can generate parsers for extremely complex structures and you could, for example, parse the URL to get the segments, or the version number to get the major and minor version numbers, and so on. The only limitations are your needs.
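As an illustration of that kind of refinement, the version string could be split into its numeric parts. The following is a standard-library sketch, not Nom code: parse_version() is a hypothetical helper, and it assumes a version of the form major.minor where each part fits in a u8.

```rust
/// Hypothetical helper: split a version such as "1.1" into (major, minor).
/// Returns None if the dot is missing or either part is not a valid u8.
fn parse_version(version: &str) -> Option<(u8, u8)> {
    let (major, minor) = version.split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

fn main() {
    println!("{:?}", parse_version("1.1"));
}
```

The same logic could be wrapped in a function returning a Result and plugged into the request parser through map_res!(), in the same way Method::from_bytes was.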

In this case, we did some copying when calling to_owned() for the two strings, but we needed it if we wanted to generate an owned field. You can use explicit lifetimes to avoid a lot of copying if you require faster processing.
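A borrowed variant of the structure might look like the sketch below. To keep it self-contained, it uses plain standard-library string splitting instead of the Nom macros, and it keeps the method as a string slice rather than the Method enumeration. RequestRef and parse_request_line() are hypothetical names, and the function assumes the three fields are separated by single spaces. The point is the 'a lifetime: every field is a slice of the input buffer, so no String is allocated.

```rust
use std::str;

/// Hypothetical zero-copy counterpart of `Request`: all fields borrow
/// from the original input for the lifetime 'a.
#[derive(Debug, PartialEq)]
struct RequestRef<'a> {
    method: &'a str,
    url: &'a str,
    version: &'a str,
}

/// Hypothetical hand-written parser for a line such as
/// "GET /home/ HTTP/1.1\r\n". Returns None on malformed input.
fn parse_request_line<'a>(line: &'a [u8]) -> Option<RequestRef<'a>> {
    // Validating UTF-8 does not copy; it only reinterprets the bytes.
    let text = str::from_utf8(line).ok()?;
    // Drop the trailing "\r\n" and split into at most three fields.
    let mut parts = text.trim_end().splitn(3, ' ');
    let method = parts.next()?;
    let url = parts.next()?;
    let version = parts.next()?.strip_prefix("HTTP/")?;
    Some(RequestRef { method, url, version })
}

fn main() {
    let first_line = b"GET /home/ HTTP/1.1\r\n";
    println!("{:?}", parse_request_line(&first_line[..]));
}
```

The trade-off is that a RequestRef cannot outlive the buffer it was parsed from, whereas the owned Request can be stored or sent elsewhere freely.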
