Chapter 1. The Elements of Conversation

With one burst of energy, I can issue a pretty sophisticated directive, such as, “Get me one large turkey hoagie with everything on it, and a small Coke.”

Think about how many steps it would take me to communicate that same command in a graphical user interface, say on an iPhone. I’d have to select the sandwich (Turkey Hoagie) from a drop-down list of sandwiches, then select the size of the sandwich (perhaps another drop-down), then click the “all” toppings check box (assuming one was offered), and finally select the drink and its size. That would be at least five distinct steps (and this doesn’t even include the tap(s) for unlocking the app). That’s clearly far more effort than speaking one sentence. In the case of the one spoken sentence, the effort equation is shifted away from the user and onto the interface: the user can speak naturally (they don’t need to use the artificial devices of drop-down menus, check boxes, radio buttons, etc.), so the burden is on the voicebot to figure out what the user wants by interpreting the words that they spoke.

The ability of a voicebot to parse a rich, complex user statement, however, does not obviate the need for the voicebot to engage the user in a stepwise, back and forth interaction. Users don’t pack their commands to the hilt just to save steps. They will try to be efficient, but only to a point. When I call to order a pizza, I don’t say in one breath, “I want two large pizzas, both thin crust; I want the first to have pepperoni, beef, and black olives, and the second, I want chicken, green peppers, and extra tomatoes; for drinks, I also want two large bottles: one Coke and one Sprite.” I have somehow come to learn that people can’t take in more than two or three pieces of information at a time, and so, instead, a more natural flow might go something like this:

  1. Agent: Hello. Alphonso’s Pizza. Can I take your order?

  2. Me: Yes. I want to order two large pizzas, please. Both thin crust.

  3. Agent: Ok. What would you like on your first pizza?

  4. Me: Pepperoni, beef, and black olives.

  5. Agent: Ok. That’s pepperoni, beef, and black olives.

  6. Me: Yup. And the second pizza…

  7. Agent: Yes.

  8. Me: I want chicken, green peppers, and extra tomatoes.

  9. Agent: That’s chicken, green peppers, and extra tomatoes, correct?

  10. Me: That’s right.

  11. Agent: Any drinks?

  12. Me: Yes. Two large bottles, one Coke and one Sprite.

  13. Agent: Two bottles. Coke and Sprite. Anything else?

  14. Me: No, that’s it.

  15. Agent: Etc…..
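The pacing of this exchange -- never loading more than two or three pieces of information into a single turn -- can be sketched as a simple prompt planner. This is a minimal illustration; the function name and grouping strategy are our own, not a standard API.

```python
def plan_prompts(slots_needed, max_per_turn=3):
    """Group the slots still to be filled into per-turn prompts,
    never asking for more than `max_per_turn` items at once --
    mirroring the observation that people handle only two or
    three pieces of information at a time."""
    return [slots_needed[i:i + max_per_turn]
            for i in range(0, len(slots_needed), max_per_turn)]
```

A voicebot built this way would walk through the groups one turn at a time, confirming each group before moving on, rather than demanding the entire order up front.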

We are calling the mechanism that enables the methodical exchange of structured information a conversation, and we, humans, are masters in the art of conversing. Even the most involved of conversational exchanges come to us very naturally. And yet, upon closer observation, the simplest of conversations turn out to be impressively complex endeavors. Conversations are not simply interactional dyads. They are highly structured and closely regulated. Pulling off a successful conversation that accomplishes its goals requires close and careful management by all participants.

In this chapter, we describe the basic mechanism that participants use to successfully start and complete conversations.

The four key concepts used to describe the conversational mechanism are: Action, State, Context, and Signaling.

The Ontology of Conversations

In order to introduce the concepts of Action, State, Context, and Signaling, the first step is to settle on an ontology -- a way of slicing the world that gives shape to objects. Unless we have objects with properties, we can’t talk about actions, states, contexts, and signals.

Let’s start with the ontology of a conversation: that is, the main objects that make up the world of conversations.

  • Participant: This is a person or a voicebot that is engaged in a conversation. Participants are the primary actors: they start a conversation, they stop it, pause it, end it; they provide its content and its tone; and they are the ones who manage it along the way.

  • Statement: This refers to what participants spoke (or did not speak), including how the content is spoken: the words that are said (or not said), the intonations used, the pauses (or the non-pauses when a pause was expected), the omissions (or non-omissions), the violations of maxims (or their observance), etc. For instance, not saying anything when the voicebot asks you a question is a statement and it has meaning (for instance, ‘I didn’t understand your question’ or ‘I’m distracted’ or ‘I didn’t hear you because we have a spotty connection,’ etc.). This statement just happens to be a statement without words (just as zero is a number that denotes the absence of quantity).

  • Turn: The conversational ‘real estate’ that is borrowed by a participant and within which the participant communicates their statements. Technically, a turn is an interval of time during which one of the participants has the floor, so to speak. It is important to note that a turn is not simply the period of time during which a person is speaking: a person may own a turn and yet no one is speaking (when, for instance, the conversation is paused because the owner of the turn needed to take another call); someone may be speaking out of turn under the protest of the turn owner, or may be speaking with the permission of the turn owner, but only for the duration of the permission, with the owner having the right to reclaim their turn at any time.

  • Conversation: The conversation is itself an object in the ontology. Conversations have properties: they are short, long, focused or all over the place; they have a starting point (once they come into existence) and an endpoint (once they have been ended); conversations move from one state to another state: non-existent to started, then progressing, then possibly paused, then ended.

Moreover, each of these four objects can itself be the topic of a conversational statement during a conversation. We can pause a conversation and later on come back to it by saying, “So, where did we leave things off last time?” Or, during a conversation, one could refer to a statement that was spoken a couple of minutes ago. Or, one may react to an interruption by saying, “Hang on, it’s my turn to speak.” Or one can even refer to the very conversation itself, as in: “I’m glad we are having this talk.”
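The four objects of the ontology can be sketched as plain data structures. The following Python sketch is illustrative only: the class names mirror the ontology, but the fields and types are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Participant:
    """A human or a voicebot engaged in the conversation."""
    name: str
    is_bot: bool = False

@dataclass
class Statement:
    """What a participant spoke -- possibly nothing at all.
    A `words` value of None models meaningful silence."""
    speaker: Participant
    words: Optional[str]  # None = silence, which still carries meaning

@dataclass
class Turn:
    """An interval of conversational 'real estate' owned by one
    participant.  The owner may hold the turn even while no one speaks."""
    owner: Participant
    statements: List[Statement] = field(default_factory=list)

@dataclass
class Conversation:
    """A conversation has a lifecycle of its own: not started,
    started/progressing, possibly paused, then ended."""
    participants: List[Participant]
    turns: List[Turn] = field(default_factory=list)
    state: str = "not_started"
```

Note that Statement deliberately allows `words=None`: as described above, silence in response to a question is itself a statement, just as zero is a number that denotes the absence of quantity.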

The Conversational Actions

As noted above, Conversational Actions are taken, through Statements, by Participants, to take the Conversation from one State to another.

Here are the main actions.

Start Conversation

These are the actions taken by the initiator of the conversation. In the context of human-to-human conversations, they usually consist of brief, formalized greetings, such as “Hello” or “Hi”; in the case of a voicebot, a wakeword that the user speaks; or, if the voicebot is the one initiating the conversation, a sound or a chime that the user has come to know means that the voicebot wishes to engage in a conversation.

Articulate Content

This is the action of the human or the voicebot providing content. The content could be information, a question, some sound, or even silence.

Offer Turn

A turn is usually offered implicitly by the cessation of talking. If a participant stops talking, that in itself is usually (but not always) a signal that the counterpart should now assume the conversational turn. At times, a participant may explicitly offer the turn if they feel that their counterpart is not interpreting the silence as a turn offer (“Go ahead!”) or is reluctant to take the turn (“Your turn: what do you think?”)

Request Turn

In human-to-human conversations, a participant may request a turn by either silently signaling that they wish to speak (raising a finger, moving their head back and opening their mouth) or by barging in: they could do it gently, by clearing their throat, speaking a hesitant word, or politely requesting the turn (“May I interject?”), or outright attempt to take the turn over by speaking over the turn owner. In the case of human-to-voicebot conversations -- at least in the current state of the art -- humans simply seize the turn rather than request it.

Cede Turn

This is the action of a turn owner ceding ownership of a turn upon request of the turn by the other participant. Note that ceding is not the same as offering: a participant cedes only if they believe that their counterpart in the conversation is requesting the turn. Turn ceding may be accomplished silently (the owner ceases speaking) or explicitly, (“Please, go ahead”).

Retain Turn

A turn owner, by virtue of being the owner (and being recognized as such by virtue of the turn being previously relinquished to them) has the privilege of retaining a turn that is being requested. The turn owner may decide to retain the turn by either explicitly rejecting the request to relinquish the turn (“Hang on: I’m not done”) or implicitly rejecting that request (they continue talking and ignore the request).

Seize Turn

A turn is seized when a participant who is not the current owner of the turn assumes ownership of the turn in spite of the current owner’s attempt to retain it.
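The five turn-related actions -- offer, request, cede, retain, and seize -- can be modeled as operations on a single piece of shared state: who owns the turn, and who (if anyone) is requesting it. This toy model is our own sketch, not a prescription for implementation.

```python
class TurnManager:
    """Toy model of the turn-taking actions described above.
    Class and method names are our own; they simply mirror the
    Offer / Request / Cede / Retain / Seize vocabulary."""

    def __init__(self, owner: str):
        self.owner = owner
        self.requested_by = None

    def offer(self, to: str):
        """The owner hands the turn over, e.g., by falling silent."""
        self.owner = to
        self.requested_by = None

    def request(self, by: str):
        """A non-owner signals that they would like the turn."""
        if by != self.owner:
            self.requested_by = by

    def cede(self):
        """The owner grants a pending request."""
        if self.requested_by is not None:
            self.owner = self.requested_by
            self.requested_by = None

    def retain(self):
        """The owner rejects a pending request and keeps the turn."""
        self.requested_by = None

    def seize(self, by: str):
        """A non-owner takes the turn despite the owner's wishes."""
        self.owner = by
        self.requested_by = None
```

The asymmetry in the prose shows up directly in the code: cede and retain only make sense when a request is pending, while seize changes ownership unconditionally.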


Interrupt

Participants often interrupt each other -- sometimes to request the turn, at other times simply to add something on the side without requesting the turn or fully taking it over.


Pause Conversation

Conversations are started and ended, but they are also paused. The pauses may be explicit: a participant may say, “Hang on one sec – I need to take this” -- or, perhaps when interacting with a voicebot, the user pushes a specific button to pause the exchange; or the pause may be implicit: some external event interrupts the conversation – for instance, the boss pops her head in while you are chatting with your colleague, and you and your colleague stop talking to each other and direct your attention to the boss.

Pausing in the context of a voicebot introduces some unique challenges. In the context of the traditional telephony-based voicebot, pausing is a somewhat unnatural action. A phone call is expected to be briskly moving along, inexorably, towards completion. Phone calls are time-boxed activities: they have a very well defined start time (when the voicebot picks up) and an equally well defined end time (the hang-up event). In the interim, the user is expected to fully dedicate their attention to the task at hand -- their exchange with the voicebot -- until the phone call is completed or the customer is routed to a human being. Compare this interaction with, say, an iPhone app. At any time, the user of an iPhone app can press the home button and minimize the app. What that action usually means is: pause -- I need to do something else and I want to pause my interaction with you. I may, or may not, come back, but that’s something that I will decide later on. Meanwhile, remember where we were.

Resume after Pause

Pause resumption introduces a host of interesting questions in the context of a voice conversation: Should the conversation pick up where it left off? Who owned the turn? How long ago was the conversation paused? Was it so long ago that the participants would need to be reminded where the conversation left off, or was it so recent (a few seconds ago) that it should pick up right where it left off? And was there information that was offered that is now no longer valid? Think of an exchange between you and a voicebot that was helping you book a hotel room. Say the conversation was paused several hours ago, and say it was paused at the point where you were providing your payment information. How should the conversation resume? Obviously, at a minimum, a recap is in order, reminding the user of the information provided so far and where the conversation was paused. But a smart voicebot would first check to make sure that any information that was provided (the availability of the hotel room, for instance, or the quoted rates) is still valid before recapping it.
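One way to make these resumption questions concrete is to sketch the decision logic: re-validate any previously offered information first, and recap only if the pause was long. The function name, the threshold, and the data shapes below are all assumptions for illustration, not an established recipe.

```python
import time

RECAP_THRESHOLD_SECONDS = 60  # an assumed cutoff; tune per application

def plan_resume(paused_at: float, collected: dict, now: float = None):
    """Decide how to resume a paused conversation.

    `collected` maps slot names to (value, still_valid) pairs; in a real
    system `still_valid` would come from re-querying the backend (room
    availability, current rates, and so on).
    Returns a list of steps the voicebot should perform, in order.
    """
    now = time.time() if now is None else now
    steps = []
    # Re-validate any previously offered information first.
    stale = [slot for slot, (_, valid) in collected.items() if not valid]
    if stale:
        steps.append(("revalidate", stale))
    # A recent pause picks up in place; an old one warrants a recap.
    if now - paused_at > RECAP_THRESHOLD_SECONDS:
        steps.append(("recap", list(collected)))
    steps.append(("continue", None))
    return steps
```

A pause of a few seconds yields just a `continue` step, while an hours-old pause with a stale rate quote yields revalidation, then a recap, then continuation -- matching the ordering argued for above.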


Repeat

Conversational voice interactions are ephemeral. Unlike, say, texting or Instant Messaging (e.g., Slack channel chat), a pure voice conversation leaves no visual traces to be consulted during or after the conversation. As a result, participants will need to ask their counterpart to repeat themselves. Repeats may be requested explicitly (“Can you repeat that, please?”), may be offered explicitly (“Do you want me to repeat that?”), or may be volunteered outright (“That’s 7 7 8 1 2. Repeat: 7 7 8 1 2”).

Start Over

The drastic measure of restarting a conversation is rarely resorted to when both participants are human, but it is a useful method when a human is interacting with a voicebot and either the human participant or the voicebot decides that it is best to reset the conversation rather than either repair it or pick it up from where it left off.


End Conversation

Conversations end in one of two ways: cooperatively (both participants agree to end the conversation) or unilaterally (with one participant ending the conversation without bothering to cooperate with their counterpart). Human conversations almost always end cooperatively, with the intentionally unilateral ending of a conversation carrying a strong connotation of conflict. Conversations between a human and a voicebot, on the other hand, are often ended unilaterally, usually by the human participant (e.g., having received the date of their last payment, the customer simply stops the exchange).

The Conversational States

In the context of a human interacting with a voicebot, six states are identified:

  1. Not started: No conversation is taking place and, crucially, no conversation is in a paused state.

  2. Speaking: Either the human or the voicebot is speaking.

  3. Listening: Either the human or the voicebot is listening.

  4. Paused: The conversation is paused.

  5. Processing/Thinking: Either the human or the voicebot is processing some input.

  6. Ended: The conversation that took place between a human and a voicebot has ended. It is important to note that a paused conversation is not the same as one that has ended.
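These six states can be captured as an enumeration, together with a table of which transitions we take to be legal (for instance, Paused can resume, but Ended is terminal). The transition table below is our own reading of the description above, not a claim about any particular platform.

```python
from enum import Enum, auto

class ConversationState(Enum):
    NOT_STARTED = auto()
    SPEAKING = auto()
    LISTENING = auto()
    PAUSED = auto()
    PROCESSING = auto()
    ENDED = auto()

# Which transitions we consider legal -- an illustrative assumption.
# Note that NOT_STARTED cannot go directly to PAUSED: as stated above,
# "not started" crucially means no conversation is in a paused state.
LEGAL_TRANSITIONS = {
    ConversationState.NOT_STARTED: {ConversationState.SPEAKING,
                                    ConversationState.LISTENING},
    ConversationState.SPEAKING: {ConversationState.LISTENING,
                                 ConversationState.PROCESSING,
                                 ConversationState.PAUSED,
                                 ConversationState.ENDED},
    ConversationState.LISTENING: {ConversationState.SPEAKING,
                                  ConversationState.PROCESSING,
                                  ConversationState.PAUSED,
                                  ConversationState.ENDED},
    ConversationState.PROCESSING: {ConversationState.SPEAKING,
                                   ConversationState.PAUSED,
                                   ConversationState.ENDED},
    ConversationState.PAUSED: {ConversationState.SPEAKING,
                               ConversationState.LISTENING,
                               ConversationState.ENDED},
    ConversationState.ENDED: set(),  # ended is terminal; paused is not
}

def can_transition(src: ConversationState, dst: ConversationState) -> bool:
    return dst in LEGAL_TRANSITIONS[src]
```

The empty set for ENDED encodes the distinction the prose insists on: a paused conversation can come back to life, but an ended one cannot.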

The Internal Conversational Context

The internal conversational context here is described by identifying the Turn Owner, the Action being taken, the State of the Participants, and the Information collected so far. As we mentioned in the Introduction, we distinguish the Internal Conversational Context from the External Conversational Context, which refers to the conditions within which the conversation is taking place: for instance, the physical context (a noisy environment), the emotional context (the human is in distress), and other such considerations.

For instance, at Step 12 in the Pizza Ordering example earlier in this chapter, the Conversational Context, just as I was stating “Yes,” is as follows:

  • Turn Owner: Human.

  • Action: Providing content (saying “Yes”)

  • State: Speaking.

  • Information: Pizza 1: {size = large; crust = thin; topping 1 = pepperoni; topping 2 = beef; topping 3 = black olives}; Pizza 2: {size = large; crust = thin; topping 1 = chicken; topping 2 = green peppers; topping 3 = extra tomatoes};
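Rendered as a data structure, the snapshot might look like the following; the key names are our own shorthand. Note that the drinks are absent: at Step 12 they have not yet been captured.

```python
# A snapshot of the internal conversational context at Step 12 of the
# pizza-ordering dialogue.  The key names are our own shorthand.
context_at_step_12 = {
    "turn_owner": "human",
    "action": "articulate_content",   # saying "Yes"
    "state": "speaking",
    "information": {
        "pizza_1": {"size": "large", "crust": "thin",
                    "toppings": ["pepperoni", "beef", "black olives"]},
        "pizza_2": {"size": "large", "crust": "thin",
                    "toppings": ["chicken", "green peppers", "extra tomatoes"]},
    },
}
```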

Conversational Signaling

Crucial to managing conversations is the continual signaling by the participants to each other. In the context of interactions between humans and voicebots, two types of signaling are identified: (1) Signaling states and (2) Signaling transitions between states. In what follows, we describe the types of signaling that a voicebot needs to issue to the human to ensure that the human is aware of the voicebot’s state.

Signaling States

Initial/Rest: this signals that the voicebot is ready to engage.

Listening: this is the crucial feedback provided to the user that the voicebot is actively listening. This often takes the form of visual fluttering or pulsing, indicating that the audio stream issued by the human is being actively received by the voicebot.

Processing/Thinking: this signaling occurs when the human has stopped speaking, but before the voicebot has responded verbally. This signaling is usually delivered with a sound, something along the lines of light percolation.

Speaking: this signaling occurs by virtue of the sound being made by the voicebot as it speaks its response.

Paused: this is signaling that indicates that the voicebot was interrupted and is in a state of suspension. The difference between the Initial State and the Paused State is the absence of context in the former and its presence in the latter.

Signaling Transitions

Equally crucial to signaling states is signaling transitions between states. Here are the three signals that a voicebot sends to a human user.

  • Started Listening: This is a signal by the voicebot to the user that the voicebot has started listening and that they, the human user, now own the turn.

  • Finished Listening: This is a signal by the voicebot to the user that the voicebot has stopped listening and that they, the automated voicebot, now own the turn.

  • Finished Interacting: This is a signal by the voicebot to the user that the voicebot has stopped interacting with the user. This usually takes place when the human user says something like “stop” or “I’m done” or does not respond to repeated requests for a response from the automated voicebot.
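The three transition signals can be sketched as a minimal voicebot that emits them as it changes turn ownership. The class and the `emit` callback are hypothetical; in a real voicebot, `emit` would play a chime or drive a visual cue.

```python
class SignalingVoicebot:
    """Minimal sketch: a voicebot that emits the three transition
    signals described above as it moves between states.  The `emit`
    callback stands in for whatever audio/visual cue a real voicebot
    would use."""

    def __init__(self, emit):
        self.emit = emit          # e.g., play a chime, flash an LED
        self.turn_owner = None

    def start_listening(self):
        self.turn_owner = "user"
        self.emit("started_listening")    # the user now owns the turn

    def finish_listening(self):
        self.turn_owner = "voicebot"
        self.emit("finished_listening")   # the voicebot now owns the turn

    def finish_interacting(self):
        self.turn_owner = None
        self.emit("finished_interacting")  # e.g., the user said "I'm done"

# Collect the signals in order, as a stand-in for actual cues.
signals = []
bot = SignalingVoicebot(emit=signals.append)
bot.start_listening()
bot.finish_listening()
bot.finish_interacting()
```

Each signal here does double duty, just as in the prose: it announces a transition and, implicitly, announces who now owns the turn (or that no one does).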

Equipped with an ontology (a slicing up of the world) and a collection of concepts, let’s now take a look at the sequence of starting a voice-only phone conversation between two humans to illustrate how they can help us break apart, so to speak, the elements of the conversation.

In other words, let’s see if the ontology and the concepts that emerged from it can do some work for us.

  • I pick up the phone, dial up my friend Jodi, listen to the phone ring a few times, and wait for her to pick up the phone.

  • After three rings, Jodi picks up the phone and says “Hello!” To which I respond with: “Hey Jodi!”

  • In this sequence, we went from a state where no conversation existed to a state where a conversation has now been started.

  • My action of picking up my phone, dialing Jodi’s phone number, and letting her phone ring three times will mean to Jodi that I probably want to talk to her, unless it was an unintentional dial -- for instance, I accidentally clicked on her phone number in an email that contained that number.

  • My friend’s action of picking up the phone and saying “Hello” means to me that Jodi is available and that she is ready to engage, and now would like me to respond to her “Hello”.

  • If she does not pick up the phone, it may mean, among other things, that she is busy and can’t take my call, that the phone is not near her, or that she is just not in the mood to talk to me.

  • If she does pick up and I respond with “Hi Jodi,” this means that I did mean to call her and that I would like to engage with her in a conversation.

  • At which point, having heard me and wanting to engage, Jodi may respond with “Hey!” an action signaling that indeed a conversation has now been established and has been started, taking us now fully into the Conversation Started State.

Note how Jodi’s initial action of saying “Hello” is followed by another action: the action of not speaking -- silence. The purpose of this second action is to hand over to me the conversational turn, which she owns immediately after she accepts the call. It is worth noting that if Jodi was not available to engage with me but wanted to be polite, she could have started the conversation with “Hello” and immediately followed it with, “Can I call you back in like 10 minutes? I’m in the middle of another call.”

My response of “Hey Jodi” to Jodi’s second action -- silence -- means that I have accepted her offer for me to take the turn and that I am willing to engage with her in a back and forth conversation.

My friend answers with “Hey there!”, which in fact are two actions packed into one utterance: by speaking, my friend acknowledges my greeting, but also agrees to take the turn back and to engage. “Hey there” means (at least in part): ‘I am available and I am ready to speak with you.’

Depending on the tone, or some other indicators, such as the slightest pause or hesitation with which Jodi articulated her “Hey there,” she could come across as saying, ‘I am angry at you, but I am available and I am ready to speak with you,’ or ‘What a pleasant surprise! I am available and I am ready to speak with you,’ or ‘I’m feeling a bit down, but I am ready to speak with you,’ and many other variations, depending on our respective moods, our relationship, our respective contexts, and various other considerations.1

In this chapter, our aim has been to describe the mechanism of conversation that is used by participants to engage each other in voice and audio interactions, and to delineate an ontology of the objects that populate the world of conversations. The four key concepts that we used to describe the mechanism are: Action, State, Context, and Signaling. The ontology that we delineated consists of: Participants, Statements, Turns, and Conversations. During a Conversation, a Participant takes an Action within a Turn and a Context to bring about a State that is signaled by the Participant through a Statement.

With this conceptual framework at hand, we are now ready to begin laying down the foundation for the next level of concepts: the norms and rules that participants observe, ignore, or violate, to generate meaning as they engage each other in conversational interactions.

1 Elizabeth Stokoe, Talk: The Science of Conversation (Hachette Digital, Inc., 2018).
