Anyone who sets out to create a network graph quickly discovers that there is no shortage of topics to visualize, nor any lack of data describing potential network relationships. The harder task is deciding what to visualize and how to move from a simple idea to a finished graph. This chapter presents the framework I use to take a graph from the initial idea to final publication. It then walks through an actual example, where we can begin creating a network graph together.
In the following sections, I will discuss my personal approach to creating a finished graph, which uses the following:
After completing the process, we'll create and export our own graph. By the end of the chapter, you should be comfortable with a general process to prepare and create network graphs using either the steps presented in this chapter, or through using a flow of your own creation.
This process might seem like a lot of steps, but it is meant merely to provide a framework to move an idea from your imagination to a final published graph. In fact, you might find a better approach or might already be using a different workflow that suits your particular style or specific needs. By all means, if it works for you, keep using it. On the other hand, if you are new to this discipline and need some direction, then follow this process to get started. I have found that especially in cases where there are multiple graphs to be created around a common theme or dataset, this process can make graph creation more efficient to move from start to finish. So let's get started, and we can ultimately get to the best part—actually creating and publishing some graphs.
The world around us is literally filled with examples of networks, ranging from our own social media connections through very complex webs of information, such as the connections between millions of websites. What story do you want to tell the world, and how would you propose going about it? Think of your graph as you would if you were writing a paper or preparing a speech. Do you want to inform, persuade, educate, or entertain? While it is possible to create a graph that serves multiple functions, it will still be useful to narrow our focus to one of these possibilities, as it will help us reduce the level of complexity to a slightly more manageable scope.
Now, we need to find a more specific idea or topic that we feel comfortable working with, as that will make the process of creating graphs easier. While it is certainly possible to take a previously unfamiliar topic and create an exceptional graph, it is typically far simpler to start with a familiar subject. Think of your hobbies, professional interests, personal networks, or educational background. Are there potential network topics in one of these areas where you already have a high degree of knowledge? Allow me to digress for a moment and describe how I would proceed using the upcoming topics in which I have either professional or personal experience.
My personal list of potential topics is as follows:
There are perhaps others, but if I start with these three topics that I feel very comfortable with, they are likely to make the process of ideation, data gathering, graph creation, and so on much easier, as opposed to attempting to work with a less familiar topic. In addition, while I cannot be considered one of the true experts in any of these areas, I do have enough background to lend credibility to my work and be able to address potential questions that might be encountered during the creation process.
Here are a couple of examples of graphs that I created using this approach:
| Topic | Graph | Location |
|---|---|---|
| Miles Davis studio album network | Bi-partite graph with 351 nodes and 581 edges | |
| Detroit Tigers player network | Complete network with 1566 nodes and 47905 edges | |
Enough about me and my interests! What is it that you would feel comfortable pursuing? Do you avidly follow political issues, a particular sport, or a specific aspect of history? One of the beauties of networks is that they can be found in almost every endeavor if one looks for them. So, start considering your own interests, make a list if needed, and then begin to narrow down the possibilities to the one idea that sparks your interest at the moment. It is important to keep the list manageable initially. There will always be time to return to your backup choices; the goal at this point is to get started with an idea. Don't worry about viability just yet—you might find out that the data you need is not available, although this is becoming less and less of an issue due to the remarkable array of data sources made available through the Web.
After developing your topic, the next logical question concerns the final format: who will view the graph and how? This will help you make decisions along the way about how to use layouts, color, sizes, and so on. For example, if this is simply a project intended for personal use, then design considerations will most likely take a different direction versus a project to be displayed on the Web or exported to a PDF format for high-resolution printing.
Consider some of these questions and how they will affect your graph:
There are a few other options besides those listed in the preceding bullet list. The main point is to begin thinking about your end goal to display and share the network. In many cases, your network will translate well to several of these methods, giving you a bit more freedom and the ability to produce multiple versions for different audiences. In other cases, such as a network with tens of thousands of nodes, you might find that a static image yields dismal results, so you might need to orient your project toward an interactive version early on. There are no hard and fast rules that dictate the final decision; instead, the best solution will come through trial and error coupled with visual assessment.
Courtesy of the Web, we live in a magnificent era of data availability and transportability, with ever faster processing and connection speeds. This has had an enormous impact on the ability of both theorists and practitioners to create complex graphs that were unimaginable just a generation ago. There are tens of thousands of sites that provide rich datasets, and many of them are free of charge. All that's required is a local device, a web connection, a bit of tenacity, and some innate curiosity.
To help you get started, I have listed a variety of available (and free) data sources in Appendix, Data Sources and Other Web Resources, but the list is far from exhaustive. Take some time to scan the Web for data sources in your interest areas—you will almost certainly find some sources to download and begin preparing for Gephi.
While there is an incredible number of datasets available, not all will be suitable for network analysis and graphing. You will need to find resources that provide some sort of relationship data, or at least data that can be converted into relationships. Think of datasets that can be structured this way or that can be adapted to show the connections within a network. Relatively few data sources will be fully prepared for this purpose, but with a little tweaking and an understanding of the objective, many can quickly be converted into powerful resources for network graphing. If you would like to begin with some prepared data, the Gephi wiki is a good place to start, or you could visit the Stanford Large Network Dataset Collection at http://snap.stanford.edu/data/.
To work with data in Gephi, it must be in the form of nodes and edges. Otherwise, there will be no possibility to create a network graph. In theory, you could have nodes only, but this defeats the point of creating and analyzing a network. Gephi provides the ability to convert an edge-only source into nodes, saving you a potential step. However, this approach has some limitations from a node perspective, particularly if you are working with supplemental fields that hold incidental node information to be used for partitioning, ranking, filtering, or any other possible use.
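To see why an edge-only source limits what you know about the nodes, consider the following minimal Python sketch. It is not Gephi's own code, and the names in the edge list are hypothetical. It simply shows that a node list derived from edges contains IDs and nothing else.

```python
# Hypothetical edge list: each tuple is (source, target).
edges = [
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Alice", "Carol"),
    ("Dave", "Alice"),
]


def nodes_from_edges(edge_list):
    """Return the unique node IDs found in an edge list, in first-seen order."""
    seen = {}
    for source, target in edge_list:
        seen.setdefault(source, None)
        seen.setdefault(target, None)
    return list(seen)


derived_nodes = nodes_from_edges(edges)
print(derived_nodes)
```

Every node that appears in an edge is recovered, but any supplemental attribute (a group, a date, a category) has to come from a separate node file.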
Fortunately, it isn't difficult to prepare data for use with Gephi as long as the basic Gephi naming conventions are followed. At a minimum, the following fields are required by the nodes and edges sheets in Gephi:
| Attribute | Required | Optional |
|---|---|---|
| Nodes | Nodes, ID, and label | Other fields that provide information about individual nodes |
| Edges | Source, target, and type, ID | Label, weight, and other descriptive information |
Note that in cases where data is entered directly into Gephi, some fields will be automatically populated based on the initial entry. However, this will not be the typical data entry process, so our focus will be on structuring the data for import into Gephi. The simplest way to do this is by employing these naming conventions in your data source file, making for a seamless process on the Gephi side.
Gephi provides multiple options to import data from other sources, including spreadsheets, databases (MySQL), GraphML files, Pajek NET files, GEXF data, RDF files, and several additional formats. The Gephi website provides further details on how to create many of these data files as well as some examples showing the required structure for each type. Start with https://gephi.github.io/users/supported-graph-formats/
Probably the easiest way to get data ready for Gephi is to create a pair of simple .csv files, using one file for nodes and another for edges. As I mentioned in Chapter 1, Fundamentals of Complex Networks and Gephi, Gephi will create a nodes table if you first import an edge file. In cases where you have no node data beyond the basics (ID or label), this might be fine. However, this approach will limit your options, so it's a good idea to create both the node and edge files using a spreadsheet tool. If your dataset has additional node information, start your import with the node file to preserve all of your data, including any ad hoc fields that are relevant to your nodes.
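If you prefer a script to a spreadsheet, the pair of files can be produced with a few lines of Python. The following is a minimal sketch under stated assumptions: the sample rows and the `Group` column are hypothetical, and the headers `Id`, `Label`, `Source`, `Target`, `Type`, and `Weight` reflect the naming conventions described above as I understand them, so verify them against your Gephi version before relying on them.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical node rows: Id and Label plus one extra attribute column.
nodes = [
    {"Id": "n1", "Label": "Alice", "Group": "Sales"},
    {"Id": "n2", "Label": "Bob", "Group": "Sales"},
    {"Id": "n3", "Label": "Carol", "Group": "Support"},
]

# Hypothetical edge rows that refer to the node IDs above.
edges = [
    {"Source": "n1", "Target": "n2", "Type": "Undirected", "Weight": 1},
    {"Source": "n2", "Target": "n3", "Type": "Undirected", "Weight": 2},
]


def write_table(path, rows):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# A temporary directory keeps the sketch self-contained; use any folder you like.
out_dir = Path(tempfile.mkdtemp())
write_table(out_dir / "nodes.csv", nodes)
write_table(out_dir / "edges.csv", edges)
print("Wrote", out_dir / "nodes.csv", "and", out_dir / "edges.csv")
```

Importing the node file first, followed by the edge file, should then carry the extra `Group` column into Gephi along with the basic fields.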
Likewise, MySQL can be used to create both node and edge tables that can be pulled into Gephi by providing database connection parameters. This approach has the advantage of porting data directly from an existing source if you happen to be a MySQL user. Other options exist, although they require some extra effort using an appropriate database wrapper.
If you choose to work with existing datasets, there are many examples on the Web that are already in one of the available graph formats, such as CSV, GEXF, GML, GraphML, and others. Gephi will be indifferent to your data format once the import is complete and will allow you to export your network data to many of these same formats. Just remember to create the required fields—the source, target, and type for edges and the node and label fields for the node file. For a CSV file, you can do your work in any spreadsheet platform, such as Excel or OpenOffice Calc.
Once the data has been successfully imported by Gephi, an initial graph will appear in the graph window. This will be a barebones random graph to be certain, but it does provide us with a starting point to assess some basic features inherent in the data. Some of the questions we can address at this point include:
Some of these points will become easier to detect after employing some sort of layout algorithm, but we still might get a glimpse prior to that stage. Gephi allows you to zoom in using a mouse wheel or tracking pad, which can help us answer some basic questions about the network. If the graph is too large or complex, it might be difficult to answer some of these questions without resorting to some more advanced techniques, which will be discussed in subsequent chapters. For now, let's address each of these points from a theoretical perspective. Later in this chapter, we'll use actual network data that will further illustrate these ideas.
Now on to the question of nodes, more specifically, what we mean when we say that a network has a lot of nodes. For instance, a network might have a few dozen nodes, or it might have tens of thousands (or even more). So, when we pose the question about assessing the number of nodes, it is somewhat relative as well as subjective. Certainly, a network with 20 nodes will always be thought of as small, and a network of 10,000 nodes will be thought of as large, but what about those points in between? Is 200 nodes a lot? What about 500? From a practical perspective, if your screen display feels crowded, with very little spacing between any nodes, then you might consider that network to have a lot of nodes, and thus, a high degree of complexity. Again, the size of your display, the intent of your graph, and the final format (paper/screen or static/interactive) all play roles in determining the visual density of your graph. If it feels too crowded to you, the creator, then users will almost invariably find the graph difficult to navigate.
What of the connectedness of the network, then? When we speak of connected nodes, we refer to the edges between two nodes, either undirected or directed. In some cases, the number of connections relative to nodes is rather low, which is an indication of a sparse or loosely connected network. In other instances, the graph will have many nodes with high degrees, leading to a considerable number of edges populating the graph. The former instance is related to the concept of random graphs, while the latter is more aligned with real-world graphs exhibiting the small-world phenomenon.
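One way to put a number on "sparse" versus "heavily connected" is to compare the edge count with the number of edges that could possibly exist. The sketch below is my own illustration rather than a Gephi feature; it applies the standard density and average degree formulas to the node and edge counts of the two example graphs mentioned earlier in this chapter. Treating both as simple undirected graphs is an assumption, and for the bipartite Miles Davis graph it understates how full the network really is, since many node pairs could never be linked.

```python
def density(node_count, edge_count, directed=False):
    """Share of possible edges that are present (no self-loops, no parallel edges)."""
    possible = node_count * (node_count - 1)
    if not directed:
        possible //= 2
    return edge_count / possible if possible else 0.0


def average_degree(node_count, edge_count):
    """Mean number of edge endpoints per node in an undirected graph."""
    return 2 * edge_count / node_count if node_count else 0.0


# Node and edge counts taken from the two example graphs listed earlier.
examples = {
    "Miles Davis studio album network": (351, 581),
    "Detroit Tigers player network": (1566, 47905),
}

for name, (n, e) in examples.items():
    print(f"{name}: density={density(n, e):.4f}, average degree={average_degree(n, e):.2f}")
```

The first network averages only a few connections per node, while the second averages dozens, which matches the visual impression of a loosely connected graph versus a densely populated one.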
What about disconnected versus fully connected networks? Some networks will have multiple small clusters that are distinct from a single large component connecting many of the nodes. In some cases, there might not even be a large component but rather a series of small clusters. In either case, these are termed disconnected networks, as discussed briefly in Chapter 1, Fundamentals of Complex Networks and Gephi. There are many examples in literature that show this type of network, with one of the more notable recent examples showing the romantic relationships at a single high school, titled Chains of Affection (http://www.soc.duke.edu/~jmoody77/chains.pdf). In other cases, including many examples from the social network analysis field, networks are fully connected, with all nodes having the ability to traverse the graph and link directly or indirectly to every node in the network.
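Whether a network is disconnected can also be checked directly from the data. The following sketch is a plain breadth-first search written for illustration, not Gephi's implementation; the node and edge lists are hypothetical. It treats every edge as undirected and reports the components from largest to smallest.

```python
from collections import defaultdict, deque


def connected_components(node_ids, edge_list):
    """Group nodes into components, treating every edge as undirected."""
    adjacency = defaultdict(set)
    for source, target in edge_list:
        adjacency[source].add(target)
        adjacency[target].add(source)

    visited = set()
    components = []
    for start in node_ids:
        if start in visited:
            continue
        # Breadth-first search collects everything reachable from this start node.
        component = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbor in adjacency[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical data: one cluster of three, one pair, and one isolated node.
node_ids = ["A", "B", "C", "D", "E", "F"]
edge_list = [("A", "B"), ("B", "C"), ("D", "E")]

components = connected_components(node_ids, edge_list)
sizes = [len(component) for component in components]
print("Component sizes:", sizes)
print("Fully connected:", len(components) == 1)
```

A single component means every node can reach every other node directly or indirectly; more than one means the network is disconnected in the sense described above.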
Finally, and in a slightly more subjective vein, we'll talk about the subject of network structure. In some cases, it is quite simple to view a network and see patterns defined by association, homophily, or some other network behavior. Many of these graphs will have multiple clusters that connect to one another through a single node that acts as a conduit between otherwise unconnected groups. However, in many cases, determining whether a graph is random or has a more defined structure is not so easily done; therefore, we rely on tools such as Gephi to aid in discovering the underlying structure. In certain cases, we will see visual evidence of networks where the power law distribution is at work, resulting in a small number of high degree hubs surrounded by a large number of less influential members. These structures can be confirmed by examining the degree distribution of a network. One very simple approach is to size nodes according to degree; another is to simply browse the node table using the data laboratory.
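The degree distribution mentioned above is easy to compute outside Gephi as well. This is a minimal sketch with a hypothetical edge list, counting each edge once for both of its endpoints (that is, treating the graph as undirected).

```python
from collections import Counter


def degree_by_node(edge_list):
    """Count how many edge endpoints touch each node."""
    degree = Counter()
    for source, target in edge_list:
        degree[source] += 1
        degree[target] += 1
    return degree


def degree_distribution(degree):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degree.values())


# Hypothetical edge list with one obvious hub.
edge_list = [
    ("hub", "a"),
    ("hub", "b"),
    ("hub", "c"),
    ("hub", "d"),
    ("a", "b"),
]

degree = degree_by_node(edge_list)
distribution = degree_distribution(degree)
for value in sorted(distribution):
    print(f"degree {value}: {distribution[value]} node(s)")
```

A distribution with many low-degree nodes and a handful of very high-degree ones is the pattern to look for when you suspect a few hubs dominate the network.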
Now that we have walked through a brief primer on what to look for when viewing a network, the next step is to find the best way to display the graph to take advantage of the underlying network structure. This is also a somewhat subjective decision, although we can apply a degree of rigor to the process by testing many of the varied layout options provided in Gephi.
One of the most critical steps to create a network graph is to make sure that we select a layout that helps us tell the story most effectively. Technically speaking, any layout will perform the basic function of showing you the network; at the same time, some will be far more effective than others, and it is not an exact science to determine which layout will yield the best results. For one dataset, a Force Atlas algorithm might be ideal, while for another network, a different approach will create far better results.
As it is unlikely that you will be totally satisfied with your initial attempt at creating a perfect graph, I recommend an iterative approach, which is otherwise known as trial and error. Gephi makes this process quite painless, although certain algorithms will take a bit of time to run depending on the complexity of the network. Unless you are working with a familiar data structure you have previously graphed to your satisfaction, it is a good practice to try a minimum of three or four algorithms before selecting a favored approach.
Network complexity and structure are other factors that will help determine your final layout selection. If your dataset is small, and the goal is to show the known relationships between entities (perhaps members of specific groups), then your choices will be quite different than for a network where the goal is to explore and discover the interactions between nodes. For the former, some of the circular layouts might prove ideal, as they will allow ordering using a specific criterion. However, this would not be suitable in the second case; here is where algorithms based on spring mechanisms such as repulsion and attraction are probably far more useful in drawing the network.
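To make the idea of ordering a circular layout by a specific criterion concrete, here is a small sketch. It is an illustration of the general technique, not the code behind any Gephi layout; the degree values are hypothetical, and sorting by degree is just one possible criterion.

```python
import math


def circular_layout(node_ids, radius=1.0):
    """Place nodes evenly around a circle, in the order given."""
    count = len(node_ids)
    positions = {}
    for index, node_id in enumerate(node_ids):
        angle = 2 * math.pi * index / count
        positions[node_id] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Hypothetical degree values used as the ordering criterion.
degree = {"hub": 4, "a": 2, "b": 2, "c": 1, "d": 1}

# Highest degree first; ties are broken alphabetically to keep the order stable.
ordered = sorted(degree, key=lambda node_id: (-degree[node_id], node_id))
positions = circular_layout(ordered)

for node_id in ordered:
    x, y = positions[node_id]
    print(f"{node_id}: ({x:.2f}, {y:.2f})")
```

Because position depends only on the chosen ordering, this kind of layout suits small networks with known groupings. It reveals nothing about structure you have not already encoded in the sort, which is why force-based algorithms are the better choice for exploration.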
In the end, it will be your visual inspection of the graph that rules the day. So, given that the final layout selection will be highly dependent on this visual inspection, what is it that should be inspected? The next section will walk you through some of the more critical criteria to be examined when judging the effectiveness of a graph.
Regardless of which layout is selected, recognize that the graph might not be in a finished state and will most likely require multiple modifications. In fact, it would be surprising if this weren't the case, as even the most appropriate layout algorithm cannot possibly define everything we wish to see in the finished graph. With that in mind, let's discuss some of the nuances we are looking for when we analyze the graph, starting with this list:
There are additional considerations, but paying attention to the ones just shared will go a long way toward making your graphs more attractive and powerful. So, now that we've discussed a few of the important factors in making a graph more effective, we'll look at what can be done within Gephi to achieve these outcomes.
Graph modification is the final step prior to exporting or publishing your network, and it can be done both manually and programmatically. On the manual side, there are an endless number of small tweaks that can be made within Gephi using a variety of toolbar and plugin components. Here are a few options that can be performed manually in Gephi:
There are other tools within Gephi and its plugins that will also facilitate the manual manipulation of your graph—take time to explore each of these features to see how to best leverage them for your network. All changes made using these tools persist between the Overview and Preview tabs and into the final output regardless of format.
There is one step remaining in our process, assuming you wish to share your work with others through the Web or some other outlet. Now that all the graph modifications are complete, it is time to export your work from Gephi to a more universal output format such as PNG, SVG, or PDF, or publish it to the Web using one of several available tools.
So, you've arrived at the point where your graph is ready to be shared. The next question, if you haven't already considered it, is what do you intend to do with your work. If the goal is to share it through social media or on a blog, then you might well be content to export your work as an image using the .png format made available by Gephi. However, if you intend to make it interactive or plan to do some additional modification using Illustrator or Inkscape, then other options need to be considered.
Let's walk through a number of available export options, and the use cases associated with each one, using the following table. Note that this list isn't exhaustive and isn't intended to provide great detail for each approach. The Gephi website and discussion forums provide additional insight into these and other export methods.