Humans are very social creatures, and our ability to find connections – with each other and with other things in our lives – is one of our strongest impulses. We naturally love to connect with others, we distinguish our connections with different names or levels (friend, spouse, acquaintance, lover, enemy, BFF, frenemy, boss, employee, stranger, co-worker, neighbor), and we sometimes keep these connections for years or decades. We are fascinated by seeing people from our network appearing in other, seemingly unconnected networks. We love the notion that there might be only six (or four, or three) degrees of separation between any two people on Earth. The small world phenomenon reminds us that we are more closely connected to each other than it may appear.
So far in this book, we have experimented a lot with finding connections between things, first by finding items that are commonly associated, and then by finding entities that appear different but are really the same. In this chapter, we will continue to explore how things are connected, but we will focus on data that can be represented as a network. We will learn:
Networks are all around us. We use the word network to refer to many different types of connected things: multiple computers hooked together, a system of cities connected by small roads and large highways, a group of people who all work in the same industry, a series of television stations that broadcast common programming. In common usage, the word network can refer to almost any set of interconnected entities.
From a data mining perspective, however, our use of the word network is more precise. We use the word network to refer to a system that can be represented by a graph made up of nodes and links. In our specialized vocabulary, nodes are the things being connected, and the links are the relationships between the nodes. The collection of all the nodes and links is called a graph. Note that we are using the word graph in its mathematical sense, not in the sense of a visualization, like a bar graph or a line graph. In graph theory vocabulary, nodes are also called vertices, and links are also called edges. Figure 1 shows some of these terms we use for a graph or network:
Many times we describe the network in terms of the direction of the edges. A network can be either directed or undirected. An undirected network is one in which all the links are symmetrical, and the relationship between nodes goes in both directions. A common example of an undirected social network is Facebook. Both nodes in a Facebook friend relationship must confirm the link. Directed networks are when the links are not symmetrical. An example of a directed social network is Twitter. You can follow someone on Twitter but not have that person follow you back. Figure 2 shows the difference between an undirected and directed graph:
In network mining, we may sometimes be interested in the attributes of the nodes. For instance, in the cities-and-roads network, we might keep track of how many people live in each city, or who the current mayor of the city is. But unlike in traditional entity-oriented data mining, in network mining we are also very interested in the attributes of the edges. How many people drove from Smallville to Anytown using Highway 1? How many people went the other way, from Anytown to Smallville? What is the minimum number of roads we would need to take to get from Anytown to Big City?
A weighted network is where one of the attributes of the links is expressed as a number. For example, in a network of airports, where the links are direct flights between airports, one weight could be how many flights there are per day. Another weight for this network could be number of passengers that fly between those airports per day. An unweighted network simply shows the relationship or link between the nodes, with no additional information about how heavy that link is. Figure 3 shows different combinations of weighted, unweighted, directed, and undirected graphs:
Here are some ways of expressing real-world networks as graphs:
You might think that the last example of shopping baskets sounds awfully familiar to what we did with frequent itemsets and association rules in Chapter 2, Association Rule Mining. Should we go back and revise that chapter now that we have another way to find frequent itemsets? Maybe, but we still have a lot to learn first. First, we would need to express the entire set of items and transactions as a weighted graph. Then we will need to learn how to find the frequent subgraphs, which would be the equivalent of our frequent itemset mining.. Doing this requires that we learn how to measure the network and traverse, or walk, the graph. So let's learn those things next.
3.144.38.253