CHAPTER 8


Video-to-Video Using WebRTC

The natural progression to our chatroom is video-to-video, such as the kind Skype and Google Hangouts use, but until very recently this was a lot harder than it sounds. Why? Yep, you guessed it . . . plugins were required (namely, Flash). It was not until 2012, nearly 20 years after the start of the Web, that the beginnings of native video streaming from a webcam started trickling into the browsers. Now that WebRTC is starting to become well supported, it is worth learning how to use it. I am not going to pretend it is simple—it isn’t—but, once you’ve got your head around how it works, it does make sense.

Introduction to WebRTC

Okay, so we know that WebRTC is used to allow communication between browsers, but what is it? WebRTC is actually a set of specifications that are often used together to create a real-time communication link. The group of specs includes one for data too, so it is not just for video and audio (though at the time of writing, the data spec is not well supported, but it is being actively developed). Working with WebRTC feels more like writing a TCP server than calling the common browser APIs you are used to, in that it requires a handshake carried out using the correct protocols. It should also be noted that WebRTC is peer to peer rather than client/server, which is why it requires extra networking on the client side. Since the majority of web developers have probably never needed to write a TCP server, I think we should start from the beginning.

GetUserMedia()

One of the hardest problems with creating native video communication before WebRTC was not the fact that communication itself was hard to do (though without any peer to peer capabilities, it was very hard to do). No, the most difficult problem was accessing the webcam!

Luckily, we now have navigator.getUserMedia(), which is a method that lets us get both the video and audio from an input device and do something with it using the callback.

// vid is an existing <video> element in the page
navigator.getUserMedia({audio: true, video: true},
    function(s) {
        // Success: play the webcam stream through the video element
        vid.src = URL.createObjectURL(s);
        vid.play();
    },
    function(error) { throw error; }
);

The first parameter is a configuration object that lets you choose the type of input that you want to use. Then we have two callbacks: the first for what to do if the input successfully gives you a stream, and the second for if it fails.

And that’s that. Simple, right? One of the most annoying problems in web development, now possible using just one method.

Now, while we are talking purely about getUserMedia, I would like to point out that you can do anything with it that you can do with a normal video (since it goes through the video element). In the next few chapters we will be looking at computer vision, a massive topic that is not usually thought of as related to the Web. We will be using getUserMedia as the base for the entire project, and it will show you that video is not just for conference calls.

Specifications

As I mentioned at the beginning of the chapter, WebRTC is a set of specifications. It consists of the following three APIs:

  • Media Stream API
  • PeerConnection API
  • DataChannel API

The Media Stream API primarily covers the getUserMedia() method that I explained previously. It requires its own specification because, despite being very easy to use, it is complicated for browser implementers. Peer connections are also complicated, because there has never before been any kind of peer-to-peer networking within the HTML APIs. The PeerConnection API gives us a simple interface to use rather than getting us bogged down in the messy workings of networking.

Using the API, we create a peer connection object to handle all peer-to-peer communication. This object consists of an ICE (Interactive Connectivity Establishment) agent, the state of the connection, and an ICE state. Luckily for us, this is all handled by the API too! So instead of tracking the correct states ourselves, we simply create the object using new RTCPeerConnection and it manages the states itself.
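As a rough, browser-only sketch (the STUN server URL is just an example here, and at the time of writing the constructor may need a vendor prefix such as webkitRTCPeerConnection), creating the object looks like this:

```javascript
// Create the peer connection; the configuration lists ICE (STUN/TURN)
// servers that the ICE agent will use for NAT traversal.
var pc = new RTCPeerConnection({
  iceServers: [{ url: 'stun:stun.l.google.com:19302' }]
});

// The connection and gathering states are read-only properties on the object.
console.log(pc.iceGatheringState);
console.log(pc.iceConnectionState);
```

We will look at what the ICE agent and its states actually do later in the chapter.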

The Data Channel API is, at the time of writing, the least implemented, because it is much easier to transfer simple data via servers than it is to transfer video and audio in real time. Data channels may eventually prove even more important than the media streams, because they allow arbitrary data to be sent peer to peer; this opens up countless possibilities, such as multiplayer games that do not need a server (or at least where the server does not need to take the full load). As we will see in the chapters that follow, the media streams can be extremely powerful, but data has many more common usages, so it is exciting to be able to transfer data without a server!
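Although support was limited at the time of writing, the Data Channel API itself is small. A hedged, browser-only sketch (assuming pc is an existing peer connection, and 'chat' is just an example label):

```javascript
// Create a data channel on an existing peer connection; once open, it is
// used much like a WebSocket, but the data travels peer to peer.
var channel = pc.createDataChannel('chat');

channel.onopen = function() {
  channel.send('Hello, peer!');
};

channel.onmessage = function(event) {
  console.log('Received:', event.data);
};
```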

Servers

I’m sorry. Yes, I said WebRTC is peer to peer and does not need servers, but it does still need one. If you think about it in very simple terms, the peers always need a way to find each other. Without a server, unless you manually enter IP addresses, the peer connections simply cannot know where to connect. So we use a server, in our case written in Node.js, to make sure messages get to the right clients. This is known as signaling. The complexity of the server depends on your needs. We will be writing a fairly basic server that handles offers and candidates but is limited to two clients. It would be fairly simple to extend the server to enable more connections. I encourage you to attempt this, as it is a great way to understand the intricacies of WebRTC and its networking structure.

Where to Start?

As you can see, we have quite a big task ahead of us. I think we should start with the server because it is the smallest part of the code and acts as a kind of testing suite for us. We are extending the code from Chapter 7, which I have included here to refresh your memory and for comparison.

var app = require('express')(),
server = require('http').createServer(app),
io = require('socket.io').listen(server);
io.set('log level', 1); // Removes debug logs
server.listen(8080);
 
 
app.get('/:id', function (req, res) {
    res.sendfile(__dirname + '/index.html');
});
 
io.sockets.on('connection', function (socket) {
    var room = socket.handshake.headers.referer;
    socket.join(room);
    socket.on('message', function (data) {
        io.sockets.in(room).emit('message', data);
    });
    socket.on('leave', function (room) {
        socket.leave(room);
    });
});

We need to keep all this, though some parts will be edited. In applications such as this, where there is potential for it to be quite complicated, I tend to give each socket a unique identifier. To do this I have used node-uuid, so we need to import that at the top. Then within the code that runs when a new socket has connected, we assign the output of uuid.v1() to socket.id. Of course we need to let the client know that it has been given an id, so we emit an 'assigned_id' message that sends the id to the client.

var express = require('express'),
    http = require('http'),
    uuid = require('node-uuid'),
    app = express(),
    server = http.createServer(app),
    io = require('socket.io').listen(server);
 
 
io.sockets.on('connection', function (socket) {
 
    var room = socket.handshake.headers.referer;
    console.log('JOINED', room);
    socket.join(room);
    socket.id = uuid.v1();
    
    socket.emit('assigned_id', socket.id);
 
    // Everything else goes here
});

The client side needs to listen for the emitted message and store the id on its own socket object.

socket.on('assigned_id', function(data) {
  console.log('assigned id: ' + data);
  socket.id = data;
});

Now we need to create the structure for the signaling. It starts when a client’s webcam is successfully running: the client adds its stream to its own peer connection, then creates an offer. Within the callback, the peer connection is given the local description, which we receive as a parameter of the callback. This description holds all the data (as SDP) about the client, such as its IP address. I will explain the exact format of the description after we have written the server; it is mostly complicated data for the browser, but some of it is useful to know about. The client now needs to let the server know that it is time to send the offer to the other client. To avoid confusion, let’s call the original client c1 and the one it connects to c2. So, the server needs to send any 'received_offer' messages on to the correct socket, c2. It eventually does the same for both 'received_candidate' and 'received_answer' as well; these three are all that are needed for basic signaling.

socket.on('received_offer', function(data) {
    console.log('received_offer %j', data);
    io.sockets.in(room).emit('received_offer', data);
});
 
socket.on('received_candidate', function(data) {
    console.log('received_candidate %j', data);
    io.sockets.in(room).emit('received_candidate', data);
});
 
socket.on('received_answer', function(data) {
    console.log('received_answer %j', data);
    io.sockets.in(room).emit('received_answer', data);
});

You must remember that the entire code runs on both clients, so c1 will send received_offer to c2 but c2 will also send it to c1. This is how we know when both are ready. When c2 receives the offer, it sets its peer connection’s remote description to the one that c1 sent (c1’s local description becomes c2’s remote description and vice versa). It then creates an answer that sends back its own local description (again, from the callback); this step may not seem necessary, since they both already have each other’s description, but it can be considered a confirmation. It also sets a Boolean variable on the client side, called connected, to true so that this cannot repeat itself and cause an infinite loop. The peer connection on each client registers that an ICE candidate is ready once the local descriptions are set; this triggers an event that the client handles with pc.onicecandidate, which is then used to send more data (in the form of SDP, which I shall explain soon) to the server by emitting the 'received_candidate' message mentioned previously. Of course, c2 now receives the SDP and uses it to add c1 as an ICE candidate to the peer connection.

If all went according to plan, c1 and c2 will now see each other. Well, they would if we had implemented the client side. My explanation may have seemed quite confusing, but that is because I wanted to explain exactly how it works, not just that the server needs to relay a few messages. It is basically the same concept as we had in Chapter 7 and a pattern that you will often use when writing Node.js code. Listing 8-1 is the code for the server, including a message for closing the connection smoothly and another for how many clients are in the room:

Listing 8-1. Server.js

var express = require('express'),
    http = require('http'),
    uuid = require ("node-uuid"),
    app = express(),
    server = http.createServer(app),
    io = require('socket.io').listen(server);
io.set('log level', 1); // Removes debug logs
app.use(express.static('public'));
server.listen(8080);
 
app.get('/:id?', function (req, res) {
    res.sendfile(__dirname + '/index.html');
});
 
io.sockets.on('connection', function(socket) {
    var room = socket.handshake.headers.referer;
    console.log('JOINED', room);
    socket.join(room);
    socket.id = uuid.v1();
 
    socket.emit('assigned_id', socket.id);
 
    socket.on('debug_clients', function(data) {
        socket.emit('room_count', io.sockets.manager.rooms['/' + room].length);
    });
 
    io.sockets.in(room).emit('room_count', io.sockets.manager.rooms['/' + room].length);
 
    socket.on('received_offer', function(data) {
        console.log('received_offer %j', data);
        io.sockets.in(room).emit('received_offer', data);
    });
 
    socket.on('received_candidate', function(data) {
        console.log('received_candidate %j', data);
        io.sockets.in(room).emit('received_candidate', data);
    });
 
    socket.on('received_answer', function(data) {
        console.log('received_answer %j', data);
        io.sockets.in(room).emit('received_answer', data);
    });
 
    socket.on('close', function() {
        console.log("closed %j", room);
        io.sockets.in(room).emit('closed', room);
    });
 
});

Technologies Behind WebRTC

Okay, let’s take a brief break from the code. I’ve described the flow of how WebRTC clients connect to each other, but it involved a lot of terms that you probably haven’t seen before. So before we write the client-side code, I would like to go through the technologies behind WebRTC and how they work. They are a bit out of scope for this book, so I encourage you to read more about them. If you’re not particularly interested, then feel free to skip ahead to the next section, where we implement the client side.

ICE

I’ve mentioned ICE quite a bit. It stands for Interactive Connectivity Establishment, which according to RFC 5245 (http://www.ietf.org/rfc/rfc5245.txt) is “A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols”. The RFC goes into a lot of detail, which some of you may find very interesting, but I think the abstract sums up the purpose quite well. To some of you this will sound like absolute gobbledygook, so hopefully, once you have read my brief summaries of ICE and the other technologies, you will understand it.

This document describes a protocol for Network Address Translator (NAT) traversal for UDP-based multimedia sessions established with the offer/answer model. This protocol is called Interactive Connectivity Establishment (ICE). ICE makes use of the Session Traversal Utilities for NAT (STUN) protocol and its extension, Traversal Using Relay NAT (TURN). ICE can be used by any protocol utilizing the offer/answer model, such as the Session Initiation Protocol (SIP).

RFC 5245 Abstract

As I explained in the “Specifications” section of this chapter, the peer connection that every WebRTC client must have is made up of components, including iceConnectionState, iceGatheringState and the ICE Agent. The ICE Agent is the important (yet simple) part. It is the endpoint for WebRTC on each client and it is responsible for communicating with the other endpoints, using a signaling protocol. The reason we use ICE at all is that it is (as the abstract states) a standardized protocol for NAT traversal for UDP communication. This basically means that it is a way of getting through security, such as firewalls, without being blocked. ICE works by checking all possible paths of communication in parallel, which quickly finds the best possible path. The states are used (quite obviously) to find the state of the ICE Agent and they are read-only.

The possible states of iceGatheringState are

  • new
  • gathering
  • complete

The possible states of iceConnectionState are

  • starting
  • checking
  • connected
  • completed
  • failed
  • disconnected
  • closed

NAT Traversal

NAT is the technology behind hardware routers (and some commercial switches) that allows many machines behind one router to work using one IP address. It takes incoming packets of data and sends them to the correct device, translating the internal network addresses (for example, 192.168.1.123) to the outward-facing IP and back again. Inbound communication is restricted for many reasons, including security, and usually involves a port whitelist table that the NAT device uses to associate internal addresses with inbound ports (commonly referred to as a firewall).

NAT traversal is a name for a group of methods of getting around this in a user-friendly way. Without NAT traversal techniques, users would have to manually open ports on their firewalls to allow inbound traffic and give this port information to the other connecting parties. Some common NAT traversal techniques are: reverse proxies, Socket Secure (SOCKS), Universal Plug and Play (UPnP), and UDP hole punching.

STUN/TURN

STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) are both protocols that are often used as tools by ICE for NAT traversal. There are open source implementations of each if you wish to run your own server. The two that I used for this chapter (you only need one, but you can list an array of them so that there are backups) are run by Google and Mozilla at stun:stun.l.google.com:19302 and stun:stun.services.mozilla.com. You will notice that both of these are STUN servers; this is mostly because they were the first free servers that I found to be reliable. The main difference between the two protocols is that TURN uses a relay technique, so the data is relayed through the server to get to the client. The upside, however, is that TURN builds on STUN: it always tries to connect directly first and only falls back to relaying if that fails. Another advantage of TURN is that it works with symmetric NATs, which STUN does not.
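As a sketch of how these addresses are used, the server list is passed to the peer connection as a plain configuration object. Note that the legacy url key shown here matches the browsers of this era; newer implementations expect urls:

```javascript
// ICE server configuration: an array, so that there are backups.
// Both entries are the public STUN servers mentioned above.
var servers = {
  iceServers: [
    { url: 'stun:stun.l.google.com:19302' },
    { url: 'stun:stun.services.mozilla.com' }
  ]
};
```

This is the servers variable that the client-side code passes to new RTCPeerConnection later in the chapter.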

SDP

If I called the abstract for the ICE protocol gobbledygook, then my vocabulary cannot describe SDP. At first glance (and probably at second and third too), it looks impenetrable. But do not worry; the important bits are quite simple to understand and the rest is documented (primarily in RFC 4566). SDP stands for Session Description Protocol, and it is the form of data that is sent between clients. Below is an example of SDP, in this case produced by running the code for this chapter under localhost with a connection between two Google Chrome tabs. There is a wide range of possible SDP contents, largely dependent on the stage of communication, so don’t expect it to look identical if you inspect your own SDP data. I have only included the first 11 lines, because the majority of the rest is not often useful to us for debugging and is mostly concerned with encryption.

 v=0
 o=- 6888311506875211333 2 IN IP4 127.0.0.1
 s=-
 t=0 0
 a=group:BUNDLE audio video
 a=msid-semantic: WMS YuWoij3TGBXIt0Ud2iqCWMbt7FnkBqSNpP96
 m=audio 1 RTP/SAVPF 111 103 104 0 8 107 106 105 13 126
 c=IN IP4 0.0.0.0
 a=rtcp:1 IN IP4 0.0.0.0
 a=ice-ufrag:qbj6LYoXCSHnUUuq
 a=ice-pwd:C/eziXfWhziBTVpMylTLU2M3

Daunting, I know. Let’s start with the letter before the equals sign. This is called the type; it is always just one character and is case-sensitive. The order of types is predefined, mostly to reduce errors and so that parsers can be implemented more easily. Of course, it is also handy for debugging, as we always know where the data we are looking for should be! I will only explain the first four lines, as they are the ones that are most likely to go wrong; the rest are optional and therefore less likely to break (and if one does, it is more likely to be a bug in the browser than in your code). Do note that attribute lines take the form <type>=<attribute>:<value>. Most attributes are easily searchable online if you are interested in learning about them, and I will probably write an article to accompany this chapter on my site.

Session description
   v=  (protocol version)
   o=  (originator and session identifier)
   s=  (session name)
   i=* (session information)
   u=* (URI of description)
   e=* (email address)
   p=* (phone number)
   c=* (connection information -- not required if included in
        all media)
   b=* (zero or more bandwidth information lines)
   One or more time descriptions ("t=" and "r=" lines; see below)
   z=* (time zone adjustments)
   k=* (encryption key)
   a=* (zero or more session attribute lines)
   Zero or more media descriptions
 
Time description
   t=  (time the session is active)
   r=* (zero or more repeat times)
 
Media description, if present
   m=  (media name and transport address)
   i=* (media title)
   c=* (connection information -- optional if included at
        session level)
   b=* (zero or more bandwidth information lines)
   k=* (encryption key)
   a=* (zero or more media attribute lines)
* means optional

This guide to types was taken from RFC 4566, so credit goes to The Internet Society. You can see quite easily how the SDP object breaks up into three sections. At the time of writing, all SDP objects start with v=0, because there are no other versions available. In my example, the second line (originator and session identifier) was o=- 6888311506875211333 2 IN IP4 127.0.0.1. The hyphen indicates that there is no username, because I was not using a private TURN server. The long number is the unique identifier for the session. Next we find the important part of this line, at least for debugging: the session version. This is the number of times the session data has changed (each time we send a new message with data, this number increments), so you can use it to be absolutely sure that the session description was updated. IN stands for Internet; it is there to make the protocol future proof. We then have the type of IP address, in this case IP version 4. Last on the line is the IP address itself; I was running locally, so it is reported as 127.0.0.1 (localhost).

You may notice that my example jumps straight to the timing section; this is an implementation decision (in this case by Chrome), as the lines in between are all optional. The first 0 represents the start time and the second 0 represents the stop time. Since both are 0, the session is permanent.
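Because every SDP line has the same <type>=<value> shape, it is easy to split apart when debugging. This little helper is my own illustration, not part of the chapter’s code:

```javascript
// Split raw SDP text into {type, value} pairs. SDP lines use CRLF endings,
// and the type is always the single character before the first '='.
function parseSdp(sdp) {
  return sdp.split(/\r?\n/)
    .filter(function(line) { return line.indexOf('=') > 0; })
    .map(function(line) {
      var eq = line.indexOf('=');
      return { type: line.slice(0, eq), value: line.slice(eq + 1) };
    });
}

var lines = parseSdp('v=0\r\no=- 6888311506875211333 2 IN IP4 127.0.0.1\r\nt=0 0');
// lines[0] is { type: 'v', value: '0' }
```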

Client-side Code

Okay, that was a nice break from the code but I think we are all ready to dive back in with a firmer understanding of how WebRTC actually works and what the data we are sending really is. Remember the walkthrough of how the signaling works? Here it is again in the form of a list to make it easier to implement:

  1. Start webcam.
  2. Add stream to peer connection.
  3. Send received_offer message with local description.
    1. Add local description to peer connection.
  4. Add received description to peer connection as remote description.
  5. Create answer.
  6. Set connected to true to prevent an infinite loop.
  7. Add ICE Candidate.
  8. Load webcam.

You may remember from the beginning of the chapter that the correct way to start the webcam is by using navigator.getUserMedia(). Unfortunately, due to the different implementations in browsers at the time of writing, we are using Adapter.js, a shim that handles the vendor prefixes in order to get WebRTC working cross browser. To make the video load quickly, without worrying about user interfaces and so on, we put getUserMedia into a function called broadcast() and call it within window.onload. Within the callback of getUserMedia, we need to add the stream to the peer connection, then use URL.createObjectURL() (which we used in Chapter 2) to add the stream to the video element and play the video. Once the server sends a message that the room count is more than 1, it is time to send an offer to the server.

var socket = io.connect('http://localhost'),
    pc = new RTCPeerConnection(servers, mediaConstraints),
    room_count = 0,
    initConnection = false;
 
socket.on('room_count', function(data)  {
  room_count = data;
  console.log('ROOM COUNT', data);
  if (room_count > 1)
    initConnection = true;
});
 
function broadcast() {
  getUserMedia({audio: true, video: true}, function(s) {
    pc.addStream(s);
    console.log("GOT MEDIA");
    vid1.src = URL.createObjectURL(s);
    vid1.play();
    if (initConnection) start();
  }, function(error) {throw error;});
};
 
window.onload = function() {
  broadcast();
};

Within start() we need to create the offer; this happens through the peer connection’s pc.createOffer(). The method provides the SDP, accessible as a parameter of its callback; we then use it to set the local description and emit a 'received_offer' message to the server containing the SDP.

function start() {
  console.log('STARTING');
  // this initializes the peer connection
  pc.createOffer(function(description) {
    console.log(description);
    pc.setLocalDescription(description);
    socket.emit('received_offer', JSON.stringify(description));
  }, null, mediaConstraints);
};

You will notice two extra parameters: the second is a callback used for error handling (for simplicity, I have passed null), and the third is a variable that I’ve called mediaConstraints. This variable contains mandatory data that says whether video and/or audio should be received, as well as optional settings; in this case DtlsSrtpKeyAgreement, which is used to enable Datagram Transport Layer Security for browsers that do not enable it by default (such as older versions of Chrome).

var mediaConstraints = {
  'mandatory': {
    'OfferToReceiveAudio':true,
    'OfferToReceiveVideo':true
  },
  'optional': [{'DtlsSrtpKeyAgreement': 'true'}]
};

This brings us to step 4, adding the remote description to the peer connection. We get this description from the 'received_offer' message, the one that we previously sent (remember that both client 1 and client 2 send the message to each other). We start by parsing the stringified JSON of the SDP and storing it back in the data variable. We then check that the data is definitely an offer by accessing the type property, just as a way to prevent possible edge cases. Then we set the peer connection’s remote description to an RTCSessionDescription built from the data. With that done, we use the peer connection’s createAnswer() in much the same way as we used createOffer(): first set the local description to the data that is provided, then emit a 'received_answer' message to the server.

socket.on('received_offer', function(data) {
  data = JSON.parse(data);
  console.log('received offer');
  if (data.type == "offer")  {
    pc.setRemoteDescription(new RTCSessionDescription(data));
    pc.createAnswer(function(data) {
      console.log('sending answer', data);
      pc.setLocalDescription(data);
      socket.emit('received_answer', data );
    }, null, mediaConstraints);
  }
});

Of course, we now check for 'received_answer' (this all gets much less complicated as you break it down into pieces, don’t you think?). We need a variable called connected so that we only attempt to connect once. If the variable is false and the data is definitely an answer, then we once again create a remote description from the SDP and add it to the peer connection. Then we simply set connected to true so that it doesn’t keep reconnecting (again, an edge case that shouldn’t often happen).

socket.on('received_answer', function(data) {
  console.log('received answer', data);
  if(!connected && data.type == "answer") {
    var description = new RTCSessionDescription(data);
    console.log("Setting remote description", description);
    pc.setRemoteDescription(description);
    connected = true;
  }
});

The peer connection now knows that it has an ICE candidate, as candidates are gathered once the local descriptions have been set. When it has a candidate, the client sends its own candidate details (including an id and label) as a 'received_candidate' message. So we also need to listen for 'received_candidate' and add the received candidate to the peer connection. This process is mostly just confirmation of each candidate; it becomes more of a concern when there are more than two clients. Listing 8-2 shows the current state of the client.js file.

Listing 8-2. Client.js

socket.on('received_candidate', function(data) {
  console.log('received candidate', data);
  data = JSON.parse(data);
 
  var candidate = new RTCIceCandidate({
    sdpMLineIndex: data.label,
    candidate: data.candidate
  });
  pc.addIceCandidate(candidate);
});
 
pc.onicecandidate = function(e) {
  console.log("oniceCandidate", e);
  if(e.candidate) {
    socket.emit('received_candidate', JSON.stringify({
        label: e.candidate.sdpMLineIndex,
        id: e.candidate.sdpMid,
        candidate: e.candidate.candidate
    }));
  }
};

Now that both peer connections have all the candidates needed, we are practically finished. Once everything is ready, the addstream event will be triggered, letting us put the remote video stream into a video element (in this case, vid2).

pc.onaddstream = function(e) {
  console.log('start remote video stream');
  console.log(e);
  vid2.src = URL.createObjectURL(e.stream);
  vid2.play();
};

It is worth noting that the user who has not closed the connection will see the remote video frozen on the last frame that was delivered. You could instead display a message that the connection has dropped, by adding an event handler for the 'closed' message that the server broadcasts. The final version of the code can be found in the download available on the Apress website at www.apress.com/9781430259442 or on my own website at shanehudson.net/javascript-creativity.
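One possible sketch of that improvement (browser-side; the status element id is an assumption for illustration):

```javascript
// When the server broadcasts the 'closed' message (see Listing 8-1),
// pause the frozen remote video and show a notice instead.
socket.on('closed', function() {
  vid2.pause();
  document.getElementById('status').textContent = 'The connection has dropped.';
});
```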

Summary

I hope you enjoyed this chapter. WebRTC is one of those technologies that is quite confusing at first and very complex at the browser-implementation level (such as the SDP), but once you get your head around it, it is actually fairly simple. It is also incredibly powerful. We covered the most obvious use case here, because it is the most useful as an introduction, but WebRTC is also useful as a general-purpose peer-to-peer technology (especially if you drop the signaling and use a different method of finding other clients); for example, you could have a game or a prototype app in which you want users connected without the overhead of a server. Access to the user’s webcam is also extremely powerful, as you will see in the next chapter, where we start to explore the realm of computer vision.
