Misinformation has never been deadlier than in the age of COVID-19. On social networks like Twitter and Facebook, conspiracy theories about the origin of the virus, its treatment, and vaccination run rampant. They often spread within communities and circles and echo other conspiracy theories, such as those about the US presidential elections. Such misinformation and conspiracy theories reinforce each other and form an airtight echo chamber.
Misinformation and disinformation are quite similar. The difference is that disinformation is false information spread deliberately to deceive. In other words, disinformation is wrong on purpose. Misinformation is a superset of disinformation: the spreader of misinformation may be unaware that the information is incorrect or harmful. In this chapter, we will study how misinformation spreads in a network/graph setting.
Model the Social Network
We have already studied some network-like systems, such as the state transitions of the Markov model and the forest fire spreading model. In the Markov model for poem generation, we treated each word as a node, and each node had a probability of being followed by another, thus forming a network of words with directed edges.
In the forest fire spreading model, each tree is neighbored by its four nearest trees on a square grid. Fire can spread from one tree to another with a predefined probability. The network is somewhat homogeneous: every tree has four neighbors, except those on the boundary, and no tree is more special than any other.
In this chapter, we will study general networks that can represent arbitrary connectivity within a community. Let’s first introduce some terminology.
So far, I have been using the terms network and graph interchangeably in this chapter. The differences in their definitions usually depend on the underlying scientific domain. For example, people sometimes say links instead of edges when talking about networks, but the two terms represent essentially the same thing. For consistency with the rest of the literature, I will use graph in the following content.
A graph consists of nodes and edges. For example, the graph in Figure 9-1 has five nodes and six edges.
The graph in Figure 9-1 is also a connected graph because you can reach any node from any other node by traversing the edges. This is not always true: the graph in Figure 9-2 is a disconnected graph.
This graph has five disconnected clusters (also called communities). They are not connected through edges.
A common point of confusion is the definition of a subgraph. Any subset of nodes and edges of a graph forms a subgraph; those nodes and edges need not visually form a cluster. For example, in the five-cluster graph earlier, node 1 by itself forms a single-node subgraph, while nodes 1, 2, 3, and 6 together with the edge between 2 and 6 form a four-node subgraph, albeit a strange one spanning three visually disconnected clusters.
A graph’s edges can also be directional, which makes the graph directed. For example, the Twitter following relationship is directional: you may follow a superstar, but the superstar is not likely to follow you back. If two Twitter accounts follow each other, there must be two edges between them to represent the relationship, one in each direction. We call a graph that allows multiple edges between the same pair of nodes a multigraph. Oftentimes, multigraphs represent directional relationships, but not always.
Figure 9-3 is a directed multigraph: the edges are directional, and multiple edges between the same pair of nodes are allowed. For example, there are two edges between nodes 2 and 3. Another interesting feature is the loop edge that points from a node back to itself on nodes 0, 1, and 2. With proper modeling, a loop can represent a tweet’s self-retweet or a similar action.
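In NetworkX, such a graph is modeled with the MultiDiGraph class. The snippet below is a small illustration of mine, not the exact graph of Figure 9-3: it adds one edge in each direction between nodes 2 and 3, plus a loop edge on node 0.

```python
import networkx as nx

# Two directed edges between nodes 2 and 3 (one each way) and a loop
# edge on node 0, as allowed by a directed multigraph.
mg = nx.MultiDiGraph()
mg.add_edges_from([(2, 3), (3, 2), (0, 0)])

print(mg.number_of_edges())  # 3
print((0, 0) in mg.edges())  # True: the loop edge
```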
Nodes and edges can also have attributes. For example, nodes can have weights that indicate how influential they are in the graph, while edges’ weights can represent how strong the bondings are between two nodes. To model a social network, we need to assign certain attributes to graph elements, either intrinsic or calculated.
Let’s define our social network, specifically the Twitter ecosystem, using a graph. Each node represents an account, and each edge represents a following relationship. Information can therefore flow from the followee to the follower through posting tweets.
Think about the properties of a Twitter account. The most important indicator of an account’s influence is its number of followers; in our model, that is exactly the number of outgoing edges. In graph theory, this quantity has a name: the out-degree, the number of outgoing edges. Similarly, the in-degree counts the number of incoming edges. The larger the out-degree, the more accounts on the social network can potentially be influenced.
We also want to borrow some concepts from the previous chapter. Besides the sources of misinformation, we will partition the nodes into three categories, as in the susceptible, infected, and recovered (SIR) model. Susceptible accounts haven’t been exposed to misinformation yet; they have a probability of becoming infected if the accounts they follow are sources of misinformation or infected. Infected nodes are like the infected patients in the disease spreading simulation: they actively influence their followers by retweeting and sharing misinformation. Recovered nodes are accounts that either intrinsically resist misinformation or recover from misinformation pollution (R stands for both resistance and recovered).
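As a toy illustration of my own (not from the book’s code): since edges point from followee to follower, an account’s out-degree is its follower count and its in-degree is the number of accounts it follows.

```python
import networkx as nx

# Edges point from followee to follower: account 0 is followed by 1 and 2,
# and account 0 itself follows account 3.
g = nx.DiGraph([(0, 1), (0, 2), (3, 0)])

print(g.out_degree(0))  # 2: account 0 has two followers
print(g.in_degree(0))   # 1: account 0 follows one account
```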
In summary, we use the properties in Table 9-1 to characterize a node on the graph.
Table 9-1. Properties of nodes/accounts in a social network

Name        Description                                              Possible Values                                       Changeable
State       The state of the node                                    “Source,” “Susceptible,” “Infected,” or “Recovered”   Yes
Influence   How influential the node is, measured by the out-degree  A nonnegative integer                                 No*
Resistance  The resistance of an account against misinformation     A value between 0 and 1                               Yes
*The influence is not changeable because our social network is not dynamic. If we allowed the follower-followee relationships to change during the simulation, which would be much more realistic, then the influence of a node would drop as fewer and fewer accounts follow it.
Now, we have all the static settings laid out. The next step is to define the mechanism that governs the simulation.
Simulate Misinformation Spreading
A simulation starts with a few nodes, likely malicious, beginning to spread misinformation. Their susceptible followers will respond to the misinformation and react: they may become infected or stay susceptible. If a follower becomes infected, the follower’s own followers will be exposed to the misinformation in turn, and so on and so forth.
At any time, an infected node has a chance to recover.
The probabilities in the simulation are also dynamic. For example, if a node is exposed to multiple sources of misinformation, the chance that it becomes infected can be very high, while the chance of recovering can be strongly suppressed.
Simple Cases
Let’s start with a simple case with only five nodes to test the accuracy of our code. The graph looks like Figure 9-4. Note that nodes 0 and 1 follow each other.
NetworkX is a powerful Python library for manipulating graphs. Its syntax is quite straightforward. The only non-intuitive method in the preceding code is fast_gnp_random_graph. It is a built-in graph generator that, in this example, generates 5 nodes and connects every pair with a probability of 20%. You can also choose to generate a directed graph and set the random seed for reproducibility. I chose the name sg because it is indeed a small graph.
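The listing that builds this graph is not shown in the excerpt; a minimal sketch might look like the following, where the seed value is a placeholder of mine for whatever seed reproduces Figure 9-4.

```python
import networkx as nx

# 5 nodes; each possible directed edge is included with probability 0.2.
# seed=1 is a placeholder, not necessarily the seed behind Figure 9-4.
sg = nx.fast_gnp_random_graph(n=5, p=0.2, seed=1, directed=True)
```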
We can check the nodes and edges of the graph by running the following:
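A minimal version of this inspection (recreating sg with a placeholder seed so the snippet stands alone):

```python
import networkx as nx

sg = nx.fast_gnp_random_graph(n=5, p=0.2, seed=1, directed=True)

print(list(sg.nodes))  # the five account IDs
print(list(sg.edges))  # the directed (followee, follower) pairs
```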
We can also check the neighbors of a specific node:
list(sg.neighbors(1))
For node 1, the code returns nodes 0 and 3. Node 4 is not returned because the neighbors method only returns the successors. In the social network context, it means we are only viewing the followers of node 1.
To see all neighbors including the followee, we use
list(nx.all_neighbors(sg,1))
This returns [0, 4, 0, 3]; we have duplicates because the relationship between 0 and 1 is bidirectional. We can easily use a set operation to deduplicate it.
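For example, with a toy graph of mine that reconstructs node 1’s neighborhood as described above (this edge list is my guess, not the book’s graph):

```python
import networkx as nx

# 0 and 1 follow each other, 1 follows 4, and 3 follows 1
# (edges point from followee to follower).
sg = nx.DiGraph([(0, 1), (1, 0), (4, 1), (1, 3)])

print(list(nx.all_neighbors(sg, 1)))         # [0, 4, 0, 3] — 0 appears twice
print(sorted(set(nx.all_neighbors(sg, 1))))  # [0, 3, 4] — deduplicated
```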
One last piece of networkx knowledge is the usage of attributes. Each node and edge in a graph can have attributes. For example, the following code assigns the value val_1 to node 1 under the attribute name attr_1. This can be very handy for updating the attributes of the social network accounts.
sg.nodes[1]["attr_1"] = "val_1"
First, let’s define some helpful data structures:
from enum import Enum

class State(Enum):
    SOURCE = 0
    SUSCEPTIBLE = 1
    INFECTED = 2
    RECOVERED = 3

STATE2COLOR = {
    State.SOURCE: "red",
    State.SUSCEPTIBLE: "grey",
    State.INFECTED: "orange",
    State.RECOVERED: "green",
}
We can now initialize the attributes of our simple five-node graph as follows:
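The initialization listing is not shown in the excerpt. The sketch below is my reconstruction: treating node 4 as the source is taken from Figure 9-5, and seeding NumPy with 1 reproduces the resistance values printed later in the chapter, but the book’s exact code may differ.

```python
import numpy as np
import networkx as nx
from enum import Enum

class State(Enum):  # as defined above
    SOURCE = 0
    SUSCEPTIBLE = 1
    INFECTED = 2
    RECOVERED = 3

sg = nx.fast_gnp_random_graph(n=5, p=0.2, seed=1, directed=True)

np.random.seed(1)  # assumed seed; matches the resistance values shown later
for node in sg.nodes:
    if node == 4:  # node 4 is the source of misinformation
        sg.nodes[node]["state"] = State.SOURCE
        sg.nodes[node]["resistance"] = None
    else:
        sg.nodes[node]["state"] = State.SUSCEPTIBLE
        sg.nodes[node]["resistance"] = np.random.rand()
```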
We can now plot the graph with color indicating the initial state of the accounts. As the dictionary STATE2COLOR denotes, red means the source of the misinformation. Figure 9-5 indicates the source of misinformation.
To simulate the simultaneous state updates in one step, we need to
1.
Update the states of each node and save the state in a copy of the graph
Our logic is based on observations of account behavior on social media. A source of misinformation is always trying to influence its followers. A susceptible person becomes infected if their resistance is smaller than a random number.
An infected person has a chance to recover, determined by comparing against a random number; if that fails, their resistance is either halved or reduced by a random amount, whichever result is larger.
For a recovered person, the resistance increases, up to 100% immunity, if it is greater than a random number in that round; otherwise, the person falls back into the susceptible group.
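These three rules can be sketched as a single update step. This is a minimal sketch of mine, assuming the State enum from earlier; the recovery probability (0.5) and the resistance increment (0.1) are placeholders, not the book’s exact values.

```python
import random
from enum import Enum

import networkx as nx

class State(Enum):  # as defined earlier
    SOURCE = 0
    SUSCEPTIBLE = 1
    INFECTED = 2
    RECOVERED = 3

def step(g):
    # Update all nodes "simultaneously": read states from g,
    # write new states into a copy.
    new_g = g.copy()
    for node in g.nodes:
        state = g.nodes[node]["state"]
        resistance = g.nodes[node]["resistance"]
        # The accounts this node follows are its predecessors
        # (edges point from followee to follower).
        exposed = any(
            g.nodes[p]["state"] in (State.SOURCE, State.INFECTED)
            for p in g.predecessors(node)
        )
        if state == State.SUSCEPTIBLE and exposed:
            if resistance < random.random():
                new_g.nodes[node]["state"] = State.INFECTED
        elif state == State.INFECTED:
            if random.random() < 0.5:  # assumed recovery probability
                new_g.nodes[node]["state"] = State.RECOVERED
            else:
                # halved or reduced by a random amount, whichever is larger
                new_g.nodes[node]["resistance"] = max(
                    resistance / 2, resistance - random.random())
        elif state == State.RECOVERED:
            if resistance > random.random():
                # assumed increment; capped at full immunity
                new_g.nodes[node]["resistance"] = min(1.0, resistance + 0.1)
            else:
                new_g.nodes[node]["state"] = State.SUSCEPTIBLE
    return new_g
```

Because the new states are written into a copy, every node’s update in a round is based on the same snapshot of the graph.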
We already see the initial state, but let’s also take note of the resistance numbers:
{node: sg.nodes[node]["resistance"] for node in sg.nodes}
{0: 0.417022004702574,
1: 0.7203244934421581,
2: 0.00011437481734488664,
3: 0.30233257263183977,
4: None}
Next, let’s explore the system step by step. Depending on the random seed, your results may look different from mine. After four iterations, my network looks like the one in Figure 9-6: the resistance values remain unchanged, but account 0 has bought the story pushed by account 4.
After just one more iteration, my graph changes again: account 1, with a relatively high resistance, buys the story, as shown in Figure 9-7.
After two more iterations, things change once more. Figure 9-8 shows that none of account 4’s followers believes the story anymore.
The resistance data shows that account 1 has not only recovered but is also highly unlikely to buy the misinformation spreader’s story again.
{0: 0.417022004702574,
1: 0.9184259825270369,
2: 0.00011437481734488664,
3: 0.30233257263183977,
4: None}
Such a state is likely to last forever, as account 4 has no way to reach account 3 (unless ads become available). We say the system is stable.
Let’s try a bigger system with ten accounts and a different topology. This topology has many more bidirectional connections, which can represent a small community, like a family group or a local neighborhood. It is shown in Figure 9-9.
I randomly pick accounts 4 and 6 as the sources. After one iteration, two other accounts are infected as shown in Figure 9-10.
After four iterations, account 9 is recovered, while account 3 is infected. Figure 9-11 shows the evolution.
Continuing the simulation, the stable state of my system looks as in Figure 9-12.
Note that simply changing the random seed can significantly change the final stable state of the graph. Another stable state is presented in Figure 9-13.
Alrighty, let’s move on to much bigger graphs.
Misinformation Spreading on Different Networks
The preceding graphs don’t really capture the essence of online social networks. First, they are too small. Second, they don’t exhibit the most significant property of social media networks: a small portion of accounts holds the majority of the followers.
To understand this, let’s plot the distribution of follower counts in the network. We first need to create a bigger graph using the fast_gnp_random_graph() function:
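A sketch of this step, where n=1000 and p=0.15 are assumptions of mine chosen so that the average number of followers comes out around 150, as the text describes:

```python
import matplotlib.pyplot as plt
import networkx as nx

# n and p are assumed values: with p=0.15, the expected out-degree is
# about 0.15 * 999 ≈ 150 followers per account.
g = nx.fast_gnp_random_graph(n=1000, p=0.15, seed=1, directed=True)

followers = [d for _, d in g.out_degree()]
plt.hist(followers, bins=30)
plt.xlabel("Number of followers (out-degree)")
plt.ylabel("Number of accounts")
plt.show()
```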
As you can see, the network doesn’t have a single strong influencer, or even a handful of them. The average number of followers is about 150, and the distribution is almost symmetric. This is not true in real social networks: people don’t follow each other randomly with a probability of 15%.
We need another random graph generator, the scale_free_graph() method. It has many parameters, but we will take the default arguments for simplicity.
Let’s take a quick comparison of the ten-node graphs generated by fast_gnp_random_graph() and scale_free_graph():
import matplotlib.pyplot as plt
import networkx as nx

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
G = nx.fast_gnp_random_graph(n=10, p=0.2, seed=1)
nx.draw(G, ax=axes[0], with_labels=True)
nx.draw(nx.scale_free_graph(n=10, seed=1), ax=axes[1], with_labels=True)
The result is shown in Figure 9-15. The graph on the left is visually much more centric; account 0 follows many other accounts, though not the other way around.
Let’s create the same out-degree distribution visualization for a 1000-node scale-free graph. The result is presented in Figure 9-16.
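A sketch of that visualization, again with a placeholder seed of mine:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Default scale-free parameters; seed=1 is a placeholder.
sfg = nx.scale_free_graph(n=1000, seed=1)

followers = [d for _, d in sfg.out_degree()]
plt.hist(followers, bins=50)
plt.xlabel("Number of followers (out-degree)")
plt.ylabel("Number of accounts")
plt.show()
```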
As you can see, there are accounts with more than 80 followers, while the majority of accounts have only 1 or 2. By controlling the parameters of the scale_free_graph function, you can control the mechanism of graph generation: the algorithm continuously adds new nodes to an existing graph according to a set of attachment preferences. I changed the default parameters to alter the out-degree distribution. For example:
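One possibility (these particular values are my assumption, not the book’s exact call): raising delta_out gives low-out-degree nodes a better chance of being chosen as edge sources, softening the rich-get-richer effect on follower counts.

```python
import networkx as nx

# delta_out=1 adds a uniform offset to the out-degree preference,
# flattening the follower distribution; the value is an assumption.
sfg_equal = nx.scale_free_graph(n=1000, delta_out=1, seed=1)
```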
The preceding code gives me a more equal world in which the biggest influencer in the community is not that influential: the largest out-degree in Figure 9-17 is around 30, rather than 80 as in Figure 9-16.
The name scale-free comes from the fact that if the network is large enough, you can zoom in on a small local subgraph and find similar properties and metrics.
We care about these two different cases because misinformation in the second case can spread much faster and more dangerously. To quantify that, we first need a function to aggregate the number of accounts in each state.
from collections import Counter

def count_states(g):
    # Tally the number of nodes in each state
    states = [g.nodes[node]["state"] for node in g.nodes]
    return Counter(states)
Alrighty, let’s track the spreading of misinformation for both types of graphs. We will randomly generate graphs with 1000 nodes for each case and check how long it takes to reach the stable states. We will also examine the ratios of infected accounts when the stable states are reached.
We will start by setting the two most influential accounts as the sources of misinformation. Note that we can also use the p parameter of the fast_gnp_random_graph function to control the edge density. It is a crucial parameter: if p is very large, then every node is essentially connected to every other node, and the misinformation reaches everyone in step one. We will see how it affects the simulation in detail.
First, let me bundle our earlier code into several functions:
def initialize(g, top_k=5):
    tops = sorted(((n, d) for n, d in g.out_degree()), reverse=True,
                  key=lambda pair: pair[1])[:top_k]
The result shown in Figure 9-18 indicates that the system tends to reach a stable state fairly quickly, even though only 0.2% of all possible edges exist. Go ahead and do the calculation: with 1000 nodes there are 1000 × 999 possible directed edges, so 0.2% of them is still roughly 2,000 edges, actually quite a large number.
How about the scale-free graph? Our expectation is that the misinformation spreading should be faster. However, Figure 9-19 doesn’t say so.
res = simulate(g_scale_free, steps = 20, top_k = 5)
visualize(res)
Clearly, something is off: the system gets stuck after fewer than five steps. Why? Let’s take a look at the most influential account’s neighbors:
tops = sorted(((n, d) for n, d in g_scale_free.out_degree()), reverse=True,
              key=lambda pair: pair[1])[:5]
In my case, account 2 has the most followers:
[(2, 15), (18, 14), (6, 12), (11, 12), (32, 10)]
Let’s examine the neighbors of account 2. We are mostly interested in the number of their followers:
[g_scale_free.out_degree()[k] for k in list(nx.neighbors(g_scale_free,2))]
It looks like the majority of them don’t have any followers!
[2, 0, 0, 0, 1, 3, 1, 0, 9, 0, 0, 0, 0, 0, 0]
This is why, with a naively created directed scale-free graph, the spreading of misinformation is limited to a small neighborhood around the sources.
An easy way to fix this is to use an undirected graph, or to significantly increase the top players’ influence in a much larger graph. Unfortunately, the second option crashes my laptop, so please try the first approach on your own.
Lastly, let’s check how the p parameter of the fast_gnp_random_graph method influences the speed of misinformation spreading.
To do that, we need to slightly modify the simulate function and make one approximation: when the counts of the different states have not changed over the last two iterations, we assume the system has reached a stable state. Of course, there is a chance the situation would still change, but that is hard to predict due to the random number generation.
The result is quite interesting. I am just going to paste the numbers here; each row lists the p value, the number of steps to reach the stable state, followed by the final state ratios:
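The stopping criterion can be sketched as a generic helper. This is my sketch, not the book’s code: step_fn and count_fn stand for the step function and count_states defined earlier, and the function name is hypothetical.

```python
def run_until_stable(g, step_fn, count_fn, max_steps=100):
    # Stop once the per-state counts have not changed over the last
    # two iterations, which we take to mean a stable state.
    history = [count_fn(g)]
    for _ in range(max_steps):
        g = step_fn(g)
        history.append(count_fn(g))
        if len(history) >= 3 and history[-1] == history[-2] == history[-3]:
            break
    return g, history
```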
0.001 3.0 0.02 0.02
0.002 4.0 0.03 0.05
0.004 5.0 0.04 0.04
0.01 16.0 0.21 0.12
0.02 10.0 0.2 0.29
0.04 19.0 0.31 0.54
0.08 11.0 0.23 0.64
0.1 17.0 0.4 0.51
0.2 11.0 0.38 0.47
0.4 13.0 0.28 0.6
At the beginning, because of the poor connectivity, the simulations stop quite early, and only a small portion of accounts sees the misinformation and recovers from it. As more and more edges are added to the graph, the stable state takes longer to reach, and both the infected and the recovered communities generally grow.
Exercise
1.
Use undirected scale-free graphs to redo the simulation. Check the built-in networkx graph generators to choose the right one.
2.
Each account has an influence property, which should influence how likely its followers are to accept the misinformation. Introduce a mechanism to address such behavior.
Summary
In this chapter, we discussed another important type of simulation: the simulation on a graph data structure. Specifically, we studied the simulation of misinformation spreading on social media. We introduced basic concepts of graphs and quantified properties to describe graph elements. On top of that, we ran the simulations on different types of graphs and interpreted the behaviors.