Chapter 2. A Network Graph Framework

As one embarks on the task of creating a network graph, it quickly becomes apparent that neither is there a shortage of topics to visualize, nor is there a lack of data detailing many potential sets of network relationships. The more difficult task is to determine what we choose to visualize and how to move from a simple idea to a finished graph. In this chapter, you will be exposed to a proposed framework that details how this author goes through the entire process from the initial idea to a final published graph. The chapter will then take you through an actual example, where we can begin creating a network graph together.

In the following sections, I will discuss my personal approach to creating a finished graph, which consists of the following steps:

  • Identifying an idea or topic to pursue
  • Determining the final output
  • Identifying the data source(s) needed to populate the graph
  • Formatting the data for Gephi according to the required naming conventions
  • Importing data into Gephi to begin working on the graph
  • Viewing the initial network created by Gephi to help understand the network structure
  • Selecting a layout that will be appropriate for the network
  • Analyzing the graph using a variety of Gephi filters and statistics
  • Modifying the graph with color, size, and other features provided within Gephi
  • Exporting the graph to external formats for additional customization or deployment (optional)

After completing the process, we'll create and export our own graph. By the end of the chapter, you should be comfortable with a general process to prepare and create network graphs using either the steps presented in this chapter, or through using a flow of your own creation.

A proposed process flow

This process might seem like a lot of steps, but it is meant merely to provide a framework to move an idea from your imagination to a final published graph. In fact, you might find a better approach or might already be using a different workflow that suits your particular style or specific needs. By all means, if it works for you, keep using it. On the other hand, if you are new to this discipline and need some direction, then follow this process to get started. I have found that this process makes graph creation more efficient, especially when multiple graphs are to be created around a common theme or dataset. So let's get started, and we can ultimately get to the best part: actually creating and publishing some graphs.

Identifying an idea or topic

The world around us is literally filled with examples of networks, ranging from our own social media connections through very complex webs of information, such as the connections between millions of websites. What story do you want to tell the world, and how would you propose going about it? Think of your graph as you would if you were writing a paper or preparing a speech. Do you want to inform, persuade, educate, or entertain? While it is possible to create a graph that serves multiple functions, it will still be useful to narrow our focus to one of these possibilities, as it will help us reduce the level of complexity to a slightly more manageable scope.

Now, we need to find a more specific idea or topic that we feel comfortable working with, as that will make the process of creating graphs easier. While it is certainly possible to take a previously unfamiliar topic and create an exceptional graph, it is typically far simpler to start with a familiar subject. Think of your hobbies, professional interests, personal networks, or educational background. Are there potential network topics in one of these areas where you already have a high degree of knowledge? Allow me to digress for a moment and describe how I would proceed using the upcoming topics in which I have either professional or personal experience.

My personal list of potential topics is as follows:

  • Wine: I have been in the business for several years—mostly as an interested consumer—and I also have a respectable collection of books on many aspects of the wine business
  • Baseball: I have spent many years performing statistical and visual analysis, attending games, and collecting a considerable library on the subject
  • Jazz: I have been listening to recorded music, attending concerts, and reading jazz histories for 25 years

There are perhaps others, but if I start with these three topics that I feel very comfortable with, they are likely to make the process of ideation, data gathering, graph creation, and so on much easier, as opposed to attempting to work with a less familiar topic. In addition, while I cannot be considered one of the true experts in any of these areas, I do have enough background to lend credibility to my work and be able to address potential questions that might be encountered during the creation process.

Here are a couple of examples of graphs that I created using this approach:

| Topic | Graph | Location |
|---|---|---|
| Miles Davis studio album network | Bipartite graph with 351 nodes and 581 edges | http://visual-baseball.com/gephi/jazz/miles_davis/ |
| Detroit Tigers player network | Complete network with 1566 nodes and 47905 edges | http://visual-baseball.com/gephi/teams/tigers_network/ |

Enough about me and my interests! What is it that you would feel comfortable pursuing? Do you avidly follow political issues, a particular sport, or a specific aspect of history? One of the beauties of networks is that they can be found in almost every endeavor if one looks for them. So, start considering your own interests, make a list if needed, and then begin to narrow down the possibilities to the one idea that sparks your interest at the moment. It is important to keep the list manageable initially. There will always be time to return to your backup choices; the goal at this point is to get started with an idea. Don't worry about viability just yet—you might find out that the data you need is not available, although this is becoming less and less of an issue due to the remarkable array of data sources made available through the Web.

Determining the final output

After developing your topic, the next logical question concerns the final format: who will view the graph and how? This will help you make decisions along the way about how to use layouts, color, sizes, and so on. For example, if this is simply a project intended for personal use, then design considerations will most likely take a different direction versus a project to be displayed on the Web or exported to a PDF format for high-resolution printing.

Consider some of these questions and how the answers will affect your graph:

  • Will my project be interactive or static? If the answer is interactive, then you have the luxury of allowing users to navigate and discover the network, so the network can be quite dense while still telling a good story. If, however, the output is static, then special formatting such as size, color, and text might be needed to help guide users through the story.
  • Where will the final network output reside? If you wish to post to a blog, Facebook, or Twitter, then a simple PNG output will suffice, although you might need to give users a larger version to click through to, depending on the complexity of the graph.
  • Will the graph need further enhancement beyond what can be done in Gephi? Is there a need for textboxes, callouts, legends, or other adornments using an editing tool such as Illustrator or Inkscape? If this is likely to be the case, then exporting to an SVG or PDF format is a logical choice. My personal choice is to use a PDF format that can be fully disassembled in Inkscape for detailed editing and then easily reassembled for the final output.
  • If the graph is intended to be navigated via the Web, then Gephi offers multiple options, including Seadragon, Sigma.js, and the Loxa Web Site exporter. If you have geographic data, then the Google Earth export is yet another option.

There are a few other options besides those listed in the preceding bullet list. The main point is to begin thinking about your end goal to display and share the network. In many cases, your network will translate well to several of these methods, giving you a bit more freedom and the ability to produce multiple versions for different audiences. In other cases, such as a network with tens of thousands of nodes, you might find that a static image yields dismal results, so you might need to orient your project toward an interactive version early on. There are no hard and fast rules that dictate the final decision; instead, the best solution will come through trial and error coupled with visual assessment.

Identifying the data sources

Courtesy of the Web, we live in a magnificent era of data availability and transportability, with ever faster processing and connection speeds. This has had an enormous impact on the ability of both theorists and practitioners to create complex graphs that were unimaginable just a generation ago. There are tens of thousands of sites that provide rich datasets, and many of them are free of charge. All that's required is a local device, a web connection, a bit of tenacity, and some innate curiosity.

To help you get started, I have listed a variety of available (and free) data sources in Appendix, Data Sources and Other Web Resources, but the list is far from exhaustive. Take some time to scan the Web for data sources in your interest areas—you will almost certainly find some sources to download and begin preparing for Gephi.

While there is an incredible number of datasets available, not all will be suitable for network analysis and graphing. You will need to find resources that provide some sort of relationship data or at least data that can be converted into relationships. Think of datasets that can be structured this way or that can be adapted to show the connections within a network. Relatively few data sources will be fully prepared for this purpose, but with a little tweaking and an understanding of the objective, many can quickly be converted into powerful resources for network graphing. If you would like to begin with some prepared data, the Gephi wiki is a good place to start, or you could visit the Stanford Large Network Dataset Collection, which is found at http://snap.stanford.edu/data/.
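To make the idea of converting data into relationships more concrete, here is a minimal Python sketch. It assumes a hypothetical list of (player, season) roster records, invented purely for illustration, and links any two players who share a season. The function name and the data are my own assumptions rather than part of any particular dataset.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical roster records: (player, season) pairs invented for illustration.
roster = [
    ("Player A", 1984),
    ("Player B", 1984),
    ("Player C", 1984),
    ("Player B", 1985),
    ("Player C", 1985),
]

def teammate_edges(records):
    """Turn (member, group) records into weighted undirected edges.

    Two members are linked once for every group they share, so the
    weight counts shared groups (here, seasons played together).
    """
    members_by_group = defaultdict(set)
    for member, group in records:
        members_by_group[group].add(member)

    weights = defaultdict(int)
    for members in members_by_group.values():
        # sorted() gives each pair a stable (source, target) order
        for source, target in combinations(sorted(members), 2):
            weights[(source, target)] += 1
    return dict(weights)

edges = teammate_edges(roster)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

The same pattern applies to any membership-style table, such as musicians and albums or authors and papers, although the right definition of a relationship will depend on the story you want to tell.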

Formatting the data for Gephi

To work with data in Gephi, it must be in the form of nodes and edges. Otherwise, there will be no possibility to create a network graph. In theory, you could have nodes only, but this defeats the point of creating and analyzing a network. Gephi provides the ability to convert an edge-only source into nodes, saving you a potential step. However, this approach has some limitations from a node perspective, particularly if you are working with supplemental fields that hold incidental node information to be used for partitioning, ranking, filtering, or any other possible use.

Fortunately, it isn't difficult to prepare data for use with Gephi as long as the basic Gephi naming conventions are followed. At a minimum, the following fields are required by the nodes and edges sheets in Gephi:

| Attribute | Required | Optional |
|---|---|---|
| Nodes | Nodes, ID, and label | Other fields that provide information about individual nodes |
| Edges | Source, target, type, and ID | Label, weight, and other descriptive information |

Note

Note that in cases where data is entered directly into Gephi, some fields will be automatically populated based on the initial entry. However, this will not be the typical data entry process, so our focus will be on structuring the data for import into Gephi. The simplest way to do this is by employing these naming conventions in your data source file, making for a seamless process on the Gephi side.
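As a rough illustration of these naming conventions, the following Python sketch builds the text of a node file and an edge file. The column headers Id, Label, Source, Target, Type, and Weight are my assumption of the spelling expected by the spreadsheet import, and the Kind column and all of the rows are hypothetical sample data.

```python
import csv
import io

# Hypothetical sample rows; "Kind" stands in for any optional node attribute.
nodes = [
    {"Id": "1", "Label": "Musician A", "Kind": "musician"},
    {"Id": "2", "Label": "Album X", "Kind": "album"},
]
edges = [
    {"Source": "1", "Target": "2", "Type": "Undirected", "Id": "0", "Weight": 1},
]

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

nodes_csv = rows_to_csv(nodes, ["Id", "Label", "Kind"])
edges_csv = rows_to_csv(edges, ["Source", "Target", "Type", "Id", "Weight"])
print(nodes_csv)
print(edges_csv)
```

Saving each string to its own .csv file gives you one file for nodes and one for edges. A spreadsheet tool will produce the same result, so use whichever is more comfortable.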

Importing data into Gephi

Gephi provides multiple options to import data from other sources, including spreadsheets, databases (MySQL), GraphML files, Pajek NET files, GEXF data, RDF files, and several additional formats. The Gephi website provides further details on how to create many of these data files as well as some examples showing the required structure for each type. Start with https://gephi.github.io/users/supported-graph-formats/

Probably the easiest way to get data ready for Gephi is to create a pair of simple .csv files, using one file for nodes and another for edges. As I mentioned in Chapter 1, Fundamentals of Complex Networks and Gephi, Gephi will create a nodes table if you first import an edge file. In cases where you have no node data beyond the basics (ID or label), this might be fine. However, the generated table will contain nothing beyond those basics, so if your dataset has additional node information, it's a good idea to create both the node and edge files using a spreadsheet tool and to start your import with the node file to preserve all of your data. This will enable the creation of ad hoc fields that are relevant to your nodes.

Likewise, MySQL can be used to create both node and edge tables that can be pulled into Gephi by providing database connection parameters. This approach has the advantage of porting data directly from an existing source if you happen to be a MySQL user. Other options exist, although they require some extra effort using an appropriate database wrapper.

If you choose to work with existing datasets, there are many examples on the Web that are already in one of the available graph formats, such as CSV, GEXF, GML, GraphML, and others. Gephi will be indifferent to your data format once the import is complete and will allow you to export your network data to many of these same formats. Just remember to create the required fields—the source, target, and type for edges and the node and label fields for the node file. For a CSV file, you can do your work in any spreadsheet platform, such as Excel or OpenOffice Calc.
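Before importing a pair of CSV files, it can be worth checking that every edge endpoint also appears in the node file, since any endpoint that is missing will carry none of your supplemental node fields. The following is a minimal sketch of such a check. The helper name is hypothetical, and the Id, Source, and Target headers are assumed to match the ones used in your own files.

```python
import csv
import io

def check_edge_endpoints(nodes_text, edges_text):
    """Return edge endpoints that have no matching row in the node table."""
    node_ids = {row["Id"] for row in csv.DictReader(io.StringIO(nodes_text))}
    missing = set()
    for row in csv.DictReader(io.StringIO(edges_text)):
        for column in ("Source", "Target"):
            if row[column] not in node_ids:
                missing.add(row[column])
    return sorted(missing)

# Hypothetical file contents; in practice, read these from your .csv files.
nodes_text = "Id,Label\n1,Node one\n2,Node two\n"
edges_text = "Source,Target,Type\n1,2,Undirected\n2,3,Undirected\n"

missing = check_edge_endpoints(nodes_text, edges_text)
print(missing)
```

Here the second edge points at an ID of 3, which has no row in the node text, so it is reported as missing.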

Viewing the initial graph layout

Once the data has been successfully imported by Gephi, an initial graph will appear in the graph window. This will be a bare-bones graph with randomly positioned nodes, to be sure, but it does provide us with a starting point to assess some basic features inherent in the data. Some of the questions we can address at this point include:

  • Does the graph have enough nodes to make a simple visual analysis difficult or impossible?
  • Are the nodes loosely or densely connected?
  • Is the network fully connected via a single giant component, or are there a number of disconnected nodes?
  • Is there some sort of observable network structure, or do things appear to be random? Do we see a small world effect and/or considerable clustering?

Some of these points will become easier to detect after employing some sort of layout algorithm, but we still might get a glimpse prior to that stage. Gephi allows you to zoom in using a mouse wheel or tracking pad, which can help us answer some basic questions about the network. If the graph is too large or complex, it might be difficult to answer some of these questions without resorting to some more advanced techniques, which will be discussed in subsequent chapters. For now, let's address each of these points from a theoretical perspective. Later in this chapter, we'll use actual network data that will further illustrate these ideas.

Now on to the question of nodes, more specifically, what we mean when we say that a network has a lot of nodes. For instance, a network might have a few dozen nodes, or it might have tens of thousands (or even more). So, when we pose the question about assessing the number of nodes, it is somewhat relative as well as subjective. Certainly, a network with 20 nodes will always be thought of as small, and a network of 10,000 nodes will be thought of as large, but what about those points in between? Is 200 nodes a lot? What about 500? From a practical perspective, if your screen display feels crowded, with very little spacing between any nodes, then you might consider that network to have a lot of nodes, and thus, a high degree of complexity. Again, the size of your display, the intent of your graph, and the final format (paper/screen or static/interactive) all play roles in determining the visual density of your graph. If it feels too crowded to you, the creator, then users will almost invariably find the graph difficult to navigate.

What of the connectedness of the network, then? When we speak of connected nodes, we refer to the edges between two nodes, either undirected or directed. In some cases, the number of connections relative to nodes is rather low, which is an indication of a sparse or loosely connected network. In other instances, the graph will have many nodes with high degrees, leading to a considerable number of edges populating the graph. The former instance is related to the concept of random graphs, while the latter is more aligned with real-world graphs exhibiting the small-world phenomenon.
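One rough way to put a number on loosely versus densely connected is to compare the edge count with the number of possible edges. The sketch below does this for the two graphs listed earlier in the chapter, under the simplifying assumption that both can be treated as undirected graphs with no self-loops or parallel edges. The function name is my own.

```python
def density_and_mean_degree(node_count, edge_count):
    """Density and mean degree of an undirected graph without self-loops."""
    if node_count < 2:
        return 0.0, 0.0
    possible_edges = node_count * (node_count - 1) // 2
    density = edge_count / possible_edges
    mean_degree = 2 * edge_count / node_count
    return density, mean_degree

# Node and edge counts taken from the examples table earlier in this chapter.
miles_density, miles_degree = density_and_mean_degree(351, 581)
tigers_density, tigers_degree = density_and_mean_degree(1566, 47905)

print(f"Miles Davis:    density {miles_density:.4f}, mean degree {miles_degree:.1f}")
print(f"Detroit Tigers: density {tigers_density:.4f}, mean degree {tigers_degree:.1f}")
```

For a bipartite graph, such as the Miles Davis example, dividing by all possible node pairs understates how connected the graph really is, because edges can only run between the two node types. Treat the result as a quick comparison rather than a precise measure.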

What about disconnected versus fully connected networks? Some networks will have multiple small clusters that are distinct from a single large component connecting many of the nodes. In some cases, there might not even be a large component but rather a series of small clusters. In either case, these are termed disconnected networks, as discussed briefly in Chapter 1, Fundamentals of Complex Networks and Gephi. There are many examples in the literature that show this type of network, with one of the more notable recent examples showing the romantic relationships at a single high school, titled Chains of Affection (http://www.soc.duke.edu/~jmoody77/chains.pdf). In other cases, including many examples from the social network analysis field, networks are fully connected, with every node able to reach every other node in the network, either directly or indirectly.
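If you would like to check for disconnected pieces outside of Gephi, a breadth-first search over the edge list is enough. The following is a minimal sketch using a small hypothetical graph. It ignores edge direction and returns the components from largest to smallest.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    # Largest component first, so the giant component (if any) is easy to spot
    return sorted(components, key=len, reverse=True)

# Hypothetical graph: one chain of three nodes, one pair, and one isolated node.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]

components = connected_components(nodes, edges)
sizes = [len(component) for component in components]
print(sizes)
```

A single component containing every node means the network is fully connected. Several components, as in this example, mean it is disconnected.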

Finally, and in a slightly more subjective vein, we'll talk about the subject of network structure. In some cases, it is quite simple to view a network and see patterns defined by association, homophily, or some other network behavior. Many of these graphs will have multiple clusters that connect to one another through a single node that acts as a conduit between otherwise unconnected groups. However, in many cases, determining whether a graph is random or has a more defined structure is not so easily done; therefore, we rely on tools such as Gephi to aid in discovering the underlying structure. In certain cases, we will see visual evidence of networks where the power law distribution is at work, resulting in a small number of high degree hubs surrounded by a large number of less influential members. These structures can be confirmed by examining the degree distribution of a network. One very simple approach is to size nodes according to degree; another is to simply browse the node table using the data laboratory.
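The degree distribution mentioned above is simple to tabulate from an edge list. Here is a small sketch, again with hypothetical data, that counts the degree of each node and then counts how many nodes share each degree.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (degree per node, number of nodes having each degree)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree, Counter(degree.values())

# Hypothetical graph with one obvious hub.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

degree, distribution = degree_distribution(edges)
print(degree.most_common(1))
print(sorted(distribution.items()))
```

In a network shaped by a power law, the distribution will show many nodes at the lowest degrees and only a handful at the highest. A toy graph like this one is far too small to demonstrate that convincingly.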

Now that we have walked through a brief primer on what to look for when viewing a network, the next step is to find the best way to display the graph to take advantage of the underlying network structure. This is also a somewhat subjective decision, although we can apply a degree of rigor to the process by testing many of the varied layout options provided in Gephi.

Selecting a layout

One of the most critical steps in creating a network graph is to make sure that we select a layout that helps us tell the story most effectively. Technically speaking, any layout will perform the basic function of showing you the network; at the same time, some will be far more effective than others, and it is not an exact science to determine which layout will yield the best results. For one dataset, a Force Atlas algorithm might be ideal, while for another network, a different approach will create far better results.

Note

The technical results (centrality measures, network diameter, and so on) will be the same regardless of the selected layout. It is only the visual result that will differ, so we must rely on our visual assessment of the graph to determine which layout is most powerful.

As it is unlikely that you will be totally satisfied with your initial attempt at creating a perfect graph, I recommend an iterative approach, which is otherwise known as trial and error. Gephi makes this process quite painless, although certain algorithms will take a bit of time to run depending on the complexity of the network. Unless you are working with a familiar data structure you have previously graphed to your satisfaction, it is a good practice to try a minimum of three or four algorithms before selecting a favored approach.

Network complexity and structure are other factors that will help determine your final layout selection. If your dataset is small, and the goal is to show the known relationships between entities (perhaps members of specific groups), then your choices will be quite different than for a network where the goal is to explore and discover the interactions between nodes. For the former, some of the circular layouts might prove ideal, as they will allow ordering using a specific criterion. However, this would not be suitable in the second case; here is where algorithms based on spring mechanisms such as repulsion and attraction are probably far more useful in drawing the network.
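To show what ordering by a specific criterion means for a circular layout, here is a minimal sketch that places nodes evenly around a circle after sorting them. It illustrates the idea only and is not how any Gephi layout is implemented. The function name and the degree values are hypothetical.

```python
import math

def circular_layout(nodes, sort_key, radius=1.0):
    """Place nodes evenly on a circle, ordered by the given sort key."""
    ordered = sorted(nodes, key=sort_key)
    if not ordered:
        return {}
    step = 2 * math.pi / len(ordered)
    return {
        node: (radius * math.cos(index * step), radius * math.sin(index * step))
        for index, node in enumerate(ordered)
    }

# Hypothetical degrees; sort from highest to lowest, breaking ties by name.
degree = {"a": 3, "b": 1, "c": 2, "d": 1}
positions = circular_layout(degree, sort_key=lambda node: (-degree[node], node))

for node, (x, y) in positions.items():
    print(f"{node}: ({x:.2f}, {y:.2f})")
```

Swapping the sort key for a group name or any other attribute changes the order around the circle while the connections stay the same.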

In the end, it will be your visual inspection of the graph that rules the day. So, given that the final layout selection will be highly dependent on this visual inspection, what is it that should be inspected? The next section will walk you through some of the more critical criteria to be examined when judging the effectiveness of a graph.

Analyzing the graph

Regardless of which layout is selected, recognize that the graph might not be in a finished state and will most likely require multiple modifications. In fact, it would be surprising if this weren't the case, as even the most appropriate layout algorithm cannot possibly define everything we wish to see in the finished graph. With that in mind, let's discuss some of the nuances we are looking for when we analyze the graph, starting with this list:

  • Is the graph cluttered? Many graphs, even when they have a rich underlying dataset, are hampered by the so-called hairball effect, which renders them visually unintelligible to most viewers. This can be seen as a virtually impenetrable concentration of nodes and edges that are typically concentrated near the center of the graph. One of the critical steps to produce a finished graph is to prevent this effect using a wise algorithm choice coupled with some custom settings. This will often involve adjusting the default settings for attraction, repulsion, and gravity depending on the choices provided by the individual algorithm. Ironically, many well-known network graphs suffer from an excess of clutter, although this can be offset to a degree through user interaction, such as panning and zooming.
  • Do distinct features of the dataset stand out? For instance, if the network has a number of large hubs, are we able to see that in the graph? Gephi provides opportunities to make these hubs stand out from the clutter using size and color options in addition to the previously mentioned settings that help space out the network properly.
  • Are important connections in the network visible? If the relationships between particular nodes are critical to the story, viewers should be able to easily determine that from the graph. Gephi enables edges to be sized to reflect the strength of a connection, making it more easily seen by the end user. Edges are the guilty party in many of the aforementioned hairballs, so it is essential to minimize those that are not critical to the story. This can be done through effective weighting, the use of opacity, and subtle edge coloring.
  • Are there key groups, segments, or partitions that should stand out in the network? If so, there are several approaches to make these stand out, including colors, labeling, and special formatting. Gephi provides both native and plugin-based features to address this using partitions or clusters to identify the groups within the dataset.
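The attraction, repulsion, and gravity settings mentioned in the first item above are easier to reason about with a toy example. The sketch below is a deliberately simplified force-directed layout of my own and is not the Force Atlas algorithm or any other Gephi layout. The parameter names and default values are assumptions chosen only to show how the three forces interact.

```python
import math
import random

def spring_layout(nodes, edges, iterations=200, repulsion=0.05,
                  attraction=0.1, gravity=0.01, step=0.1, seed=42):
    """Toy force-directed layout; returns {node: (x, y)}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart, more strongly when close
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                push = repulsion / (dx * dx + dy * dy + 1e-9)
                force[a][0] += dx * push
                force[a][1] += dy * push
                force[b][0] -= dx * push
                force[b][1] -= dy * push
        # Attraction: each edge pulls its two endpoints together
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += attraction * dx
            force[a][1] += attraction * dy
            force[b][0] -= attraction * dx
            force[b][1] -= attraction * dy
        # Gravity: every node is pulled gently toward the center
        for n in nodes:
            force[n][0] -= gravity * pos[n][0]
            force[n][1] -= gravity * pos[n][1]
        # Move each node a small step along its net force
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
    return {n: (x, y) for n, (x, y) in pos.items()}

nodes = ["a", "b"]
edges = [("a", "b")]
layout = spring_layout(nodes, edges)
gap = math.hypot(layout["a"][0] - layout["b"][0], layout["a"][1] - layout["b"][1])
print(f"distance between a and b: {gap:.2f}")
```

Raising the repulsion value spreads nodes out and can loosen a hairball, while raising attraction or gravity pulls the graph back together. Real layout algorithms add refinements for speed and stability, so expect their settings to behave similarly in spirit but not identically.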

There are additional considerations, but paying attention to the ones just shared will go a long way toward making your graphs more attractive and powerful. So, now that we've discussed a few of the important factors in making a graph more effective, we'll look at what can be done within Gephi to achieve these outcomes.

Modifying the graph

Graph modification is the final step prior to exporting or publishing your network, and it can be done both manually and programmatically. On the manual side, there are an endless number of small tweaks that can be made within Gephi using a variety of toolbar and plugin components. Here are a few options that can be performed manually in Gephi:

  • The Painter function: This function on the toolbar can be used to color specific nodes, making them stand out or recede from the remainder of the network. This is a quick method that you can use when there are a small number of nodes you wish to edit; if you wish to color a large number of nodes, there are other options (we'll touch on them shortly).
  • The Sizer function: This function will enable the resizing of individual nodes in much the same fashion as how the Painter icon enables recoloring. This is particularly effective if nodes in the network are not already sized based on the degree and you simply want to call out important members within the graph.
  • The Brush function: This function makes it easy to see diffusion patterns relative to a selected node, allowing you to highlight neighbors (first degree), neighbors of neighbors (second degree), predecessors, and successors. This is an effective way to understand behavior within the network while highlighting network behaviors for viewers through the use of specific colors.
  • The Node Pencil and Edge Pencil tools: These tools enable users to create new nodes or edges, respectively, without the need to manually add these features in the data laboratory.
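The neighbor, predecessor, and successor sets that the Brush function highlights can be computed from a directed edge list. The following is a small sketch of the underlying idea and not a description of how Gephi implements the tool. The function names and the example edges are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Map each node to its successors and to its predecessors."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    return successors, predecessors

def neighbors_within(node, successors, predecessors, depth):
    """Nodes reachable within `depth` hops, ignoring edge direction."""
    seen = {node}
    frontier = {node}
    for _ in range(depth):
        reached = set()
        for current in frontier:
            reached |= successors[current] | predecessors[current]
        frontier = reached - seen
        seen |= frontier
    return seen - {node}

# Hypothetical directed edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "a")]
successors, predecessors = build_adjacency(edges)

first_degree = neighbors_within("a", successors, predecessors, depth=1)
second_degree = neighbors_within("a", successors, predecessors, depth=2)
print(sorted(first_degree), sorted(second_degree))
```

Here, the successors of node a are the nodes it points to, the predecessors are the nodes pointing to it, and the second-degree set adds the neighbors of those neighbors.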

There are other tools within Gephi and its plugins that will also facilitate the manual manipulation of your graph—take time to explore each of these features to see how to best leverage them for your network. All changes made using these tools persist between the Overview and Preview tabs and into the final output regardless of format.

There is one step remaining in our process, assuming you wish to share your work with others through the Web or some other outlet. Now that all the graph modifications are complete, it is time to export your work from Gephi to a more universal output format such as PNG, SVG, or PDF, or publish it to the Web using one of several available tools.

Exporting the graph

So, you've arrived at the point where your graph is ready to be shared. The next question, if you haven't already considered it, is what you intend to do with your work. If the goal is to share it through social media or on a blog, then you might well be content to export your work as an image using the .png format made available by Gephi. However, if you intend to make it interactive or plan to do some additional modification using Illustrator or Inkscape, then other options need to be considered.

Let's walk through a number of available export options, and the use cases associated with each one, using the following table. Note that this list isn't exhaustive and isn't intended to provide great detail for each approach. The Gephi website and discussion forums provide additional insight into these and other export methods.

| Format/tool | Potential uses | Strengths | Weaknesses |
|---|---|---|---|
| .png | Sharing via e-mail, blog post, Facebook, Twitter, and Flickr | Quick, compact, web friendly | No interaction, not editable, and thus, limited value for complex networks |
| .svg | Post-Gephi editing, embedding in a web page | Scalability, small file size for large networks, editable, panning, zooming, and higher quality image | Not as familiar for many viewers |
| .pdf | Sharable in PDF format, additional edits in Illustrator or Inkscape | A widely available format for users and possibility for further editing | Limited interactivity |
| Seadragon | Interactive network for users to navigate | Zooming, panning capabilities, and easy creation | Limited functionality and no additional customization |
| Sigma.js | Interactive network for users to navigate | Searching, filtering, zooming, panning, and customization using template approach | Web browser only. Won't work locally with Chrome or IE generally. Can use the rgexf package in R to work around this limitation |
| Loxa Web exporter | Interactive network for users to navigate | Searching, filtering, zooming, panning, and exporting .gexf settings | Web browser only. Will not work locally with Chrome or IE |
| Graph file | Suitable for use with a variety of other network analysis tools, including Pajek, Tulip, GraphML, and others. Can also be exported to a .kmz file when geocoding is part of the dataset for further use in Google Maps and Google Earth | Allows use in other tools for portability and further exploration | Not a visual network export in the sense of the other options listed here |
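To give a feel for what a graph file contains, as opposed to an image, here is a minimal sketch that writes a bare-bones GraphML document using Python's standard library. It shows only the basic node and edge structure of the format. An actual Gephi export will carry additional attributes, and the function name and sample graph are hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges, directed=False):
    """Serialize node IDs and (source, target) pairs as minimal GraphML text."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    graph = ET.SubElement(
        root, "graph", id="G",
        edgedefault="directed" if directed else "undirected",
    )
    for node_id in nodes:
        ET.SubElement(graph, "node", id=str(node_id))
    for source, target in edges:
        ET.SubElement(graph, "edge", source=str(source), target=str(target))
    return ET.tostring(root, encoding="unicode")

# Hypothetical three-node graph.
graphml_text = to_graphml(["n0", "n1", "n2"], [("n0", "n1"), ("n1", "n2")])
print(graphml_text)
```

Because the file describes structure rather than appearance, another tool reading it will apply its own layout and styling.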
