Imagine that it’s 2 a.m. during finals week, and you’re scrambling to finish a research paper on topica obscura. Your adrenaline jumps when you find a relevant Wikipedia article, with links to more Wikipedia articles! You start clicking away, jumping from page to page in search of facts. An hour later, you realize you’re still clicking, but these are pages you’ve already visited. No matter which link you click, you can’t seem to discover any new pages!
If this has happened to you, you’ve unknowingly (and unfortunately) stumbled upon a knot in Wikipedia. Knots are a unique property of directed graphs. To understand them, it is necessary to have a basic understanding of directed graphs.
The graphs in the rest of this book are undirected. In an undirected graph, an edge represents a symmetric relationship between vertices. This abstraction is appropriate for some applications (such as acquaintance networks and transportation networks), but other applications involve asymmetric relationships (for example, links between pages in the World Wide Web).
Wikipedia links are also asymmetric. Page A might link to page B, but page B doesn’t have to include any links to page A. To fully describe the relationship between pages A and B, we need two bits of information: whether A links to B, and whether B links to A. This is the essence of a directed graph. In a directed graph, a directed edge, , is an edge from A to B.
Knots are a unique (and sometimes cruel) property of directed graphs. A knot is a collection of vertices and edges with the property that every vertex in the knot has outgoing edges, and all outgoing edges from vertices in the knot terminate at other vertices in the knot. Thus, it is impossible to leave the knot while following any path along its directed edges.
In the scenario above, we imagined finding a knot in Wikipedia; that is, a set of articles with links to each other, but no links to other articles.
We wondered whether such a thing was even possible. Given that there are 3,811,000+ articles on Wikipedia and they all link to different pages, it seemed unlikely that there would be a knot, but we decided to investigate.
We started by extending Graph.py
, from Representing Graphs, to
support directed graphs. The result is DirectedGraph.py
, which you can download from
http://thinkcomplex.com/DirectedGraph.py.
When we were designing DirectedGraph
, we wanted to preserve the order of
growth of Vertex
and
Arc
lookups as they were
in Graph
. However, we
needed to keep track of edges into a vertex and edges from a vertex. It
made sense to keep track of these in separate dictionaries. Having two
dictionaries takes twice as much memory, but this is a reasonable price
to pay for maintaining the speed of lookups.
To detect knots in a directed graph, we developed an algorithm
based on a breadth-first-search. _bfsknots
searches for all the vertices that can
be reached from a given starting vertex, and returns these vertices as a
set:
def _bfsknots(self, s): # initialize the queue with the start vertex queue = [s] visited = set() on_first_vertex = True while queue: # get the next vertex v = queue.pop(0) # skip it if it's already marked if v in visited: continue # if we're on the first vertex, we're not actually visiting if v != s or not on_first_vertex: visited.add(v) on_first_vertex = False for x in self.out_vertices(v): #if its out vertices have been cached, update visited if x in self._knot_cache.keys(): visited.update(self._knot_cache[x]) visited.add(x) #otherwise add it to the queue elif x not in self._knot_cache.keys(): queue.append(x) return visited
We run _bfsknots
for every vertex in the graph and build a dictionary that maps from a
vertex to the set of vertices it can reach; that is, for each vertex,
V, we know the set of reachable vertices,
SV.
If there is a knot in the graph, then for every vertex, V, in the knot, the reachable set SV is exactly the set of vertices in the knot.
The function has_knot
iterates through each vertex in a graph
and returns true
if this condition
holds. If it checks the whole graph and does not find a knot, it returns
false
:
def has_knot(self): """ Returns true if directed graph has a knot. """ self._knot_cache = {} #build the cache of which vertices are accessible from which for v in self: self._knot_cache[v] = self._bfsknots(v) #searches for knot for v in self: if self._knot_at_v(v): return True return False
Finally, _knot_at_v
checks whether all vertices reachable
from V have the reachable set
SV.
Download http://thinkcomplex.com/DirectedGraph.py. Read through the file to make yourself familiar with the terminology. Note that edges in DirectedGraph are represented by Arcs.
Determine the order of growth of has_knots
experimentally. You can follow the
example in Summing Lists. Below are some hints
to help you out:
Use DirectedGraph.add_random_edges
to generate
graphs of different sizes.
For each graph, time how long it takes to check whether it has a knot.
On a log-log scale, plot runtime versus the number of vertices and runtime versus the number of edges.
To find knots in Wikipedia, we selected 558 disjoint subsets of Wikipedia articles, organized by index. For example, the index of articles about neurobiology was used to define one subset, and the index of articles about Zimbabwe was used to define another subset. The 558 subsets contain about 321,000 articles, or 10% of Wikipedia.
Of these subsets we examined, 38% contained at least one knot. So if you are restricted to articles listed on a single index page, there is a substantial chance you will eventually find a knot.
For the complete Wikipedia, the probability is lower because articles can link to articles in other indices. Based on these results, we cannot say yet whether the complete Wikipedia has a knot.
When you started your hypothetical research on topica obscura, you might have used journal articles instead of Wikipedia articles. Instead of links, you would have used citations to find other papers.
In that case, you would never find a knot. Why not? Because science articles are published sequentially, and a paper cannot cite a paper that will be created in the future. So the directed graph that represents papers and their citations cannot have a knot.
3.144.123.155