Semantic networks help extract meanings that consumers give to a brand or company based on the text they use in social media posts. Instead of connecting people to people, semantic networks connect words to words, based on their co-occurrence. Words that co-occur with a brand name can indicate points of concern that may require public relations interventions. Semantic networks can also help you understand the discussions around controversial topics, as illustrated with an example of a Twitter network related to the HPV Gardasil vaccine, which some anti-vaxxers discourage using. Additionally, networks of controlled vocabularies (i.e., pre-determined keywords or tags) can produce particularly insightful networks, or co-word graphs, as illustrated by the computing dissertation co-word analysis. The word and word pair metric features of NodeXL allow you to skip certain words (e.g., stopwords), perform sentiment analysis, and count word occurrences and co-occurrences. Network clustering, filtering, and visualization can then occur.
Semantic networks; Social media; Network clusters; Co-word analysis; Word pairs; Sentiment analysis; HPV Gardasil; Vaccines; Twitter; Computing dissertations
So far in this book, you have mostly explored social networks where social actors are the vertices and edges represent connections between those people. In semantic networks, in contrast, vertices are words or concepts that are connected by co-occurrence within a pre-defined text (such as a tweet, a Facebook post, or a news article). Semantic networks, then, map relationships between concepts and allow us to extract meaning from text based on the co-occurrence of concepts (i.e., which concepts show up together). Later in this book, you will use NodeXL to extract user-generated content from platforms such as Twitter (Chapter 11) and YouTube (Chapter 13). These chapters will apply social network analysis to find patterns of relationships between users, or between videos. The retrieved content, however, can also be used for semantic network analysis. This chapter will walk you through the process of using NodeXL’s semantic analysis features with a Twitter sample dataset. That said, the process detailed in this chapter can be applied to any text-based dataset, such as emails or news stories.
In this chapter, you will create and then examine a semantic network that connects words that show up in the same tweet together. The dataset is from a Twitter Search Network (see Chapter 11) that was downloaded Aug. 17, 2018. It included words related to the HPV vaccine. Specifically, the search term was: (#gardasil OR Gardasil) OR ((#HPV OR HPV) AND (vaccination OR vaccine OR immunization)). There has been significant controversy on social media about the potential dangers of the Gardasil vaccine, despite the fact that the Centers for Disease Control and Prevention and numerous published papers have shown it to be safe and effective.1 The original dataset connects Twitter users to Twitter users. However, you will use it to create the semantic network that connects words to words as described in this section.
Download the TwitterGardasilSearch.xlsx file from the book website2 and open the file. Open the Graph Metrics dialog and check the Words and word pairs metrics briefly introduced in Chapter 6. Before calculating the metrics choose Options… and select Tweet from the drop-down menu as shown in Figure 8.1. This tells NodeXL to use words found in the Tweet column on the Edges worksheet in the analysis. Make sure you have selected the Word and word pairs metric (as shown in Figure 8.1) when you click the Options… button to open the correct options dialog window. Check the Skip words and word pairs that occur only once box as shown in Figure 8.1, then click OK. It may take a while to run, as this is a large network.
Next, you will examine the results to identify words that should be omitted from the analysis in order to add more clarity. These are often called “stopwords” in content analysis. A stopwords list is used to remove words that are considered to be “noise” or low-value grammatical terms that occur frequently, but impart limited meaning outside of their syntactic position. A list of common American English language terms is included in the default settings for NodeXL in the Skip these words section. This list can be edited, deleted, or replaced as needed to address special topics or other languages. All Unicode languages (that use the space as a word delimiter) can be analyzed using this feature. Keep in mind that this list is generic, as it is set to be relevant across social media spaces and topics. Your goal now is to refine the list by using data from the Words and the Word Pairs worksheets.
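The effect of a skip list can be sketched outside NodeXL in a few lines of Python. This is a minimal illustration and not NodeXL's implementation: the `tokenize` and `count_words` helpers, the short `SKIP_WORDS` set, and the sample tweets are all hypothetical, and the tokenizer simply assumes that a word is a run of letters, digits, and a few symbols.

```python
import re
from collections import Counter

# Hypothetical skip list for illustration; NodeXL's default list is much longer.
SKIP_WORDS = {"the", "a", "is", "to", "of", "and", "rt", "amp"}

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[\w#@']+", text.lower())

def count_words(texts, skip_words=SKIP_WORDS):
    """Count how often each word appears, ignoring skipped words."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in tokenize(text) if w not in skip_words)
    return counts

# Made-up sample tweets, not taken from the Gardasil dataset.
tweets = [
    "RT The HPV vaccine is safe",
    "Is the vaccine safe? Ask a doctor",
]
word_counts = count_words(tweets)
```

Inspecting `word_counts.most_common()` plays the same role as sorting the Words worksheet by Count: frequent words that carry little meaning are candidates for the skip list.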
Navigate to the Words worksheet to examine its contents (see the left-hand side of Figure 8.1). While the Vertices worksheet is focused on each entity and its attributes, the Words worksheet is focused on each word used in the text associated with the network (e.g., the text in the Tweet column). It reports the count of each word, its salience, and whether the word appears on any of the three Sentiment word lists. Examine the list and note which words should be removed. Focus on the most frequently used words, since the less frequently used words will likely be filtered out in later analyses. To generate a list of words to be omitted, you may ask yourself:
Add the words you have identified to the Skip these words section of the Word and Word Pair Metrics dialog as shown in Figure 8.1. Recalculate the Word and word pairs Graph Metrics and continue to the next step by examining the Word Pairs worksheet (Figure 8.2).
The Word Pairs worksheet is focused on words that are used together frequently in the text associated with the network. It reports the count of each word pair, its salience, and whether the first or second word appears on any of the three Sentiment word lists (see Chapter 6 for a basic introduction). You can repeat the steps taken above to identify more words that may frequently appear in word pairs, yet do not add value to the analysis. Add them to your list of skipped words and recalculate the graph metrics. You may want to go through this process several times. For example, for a dataset about HPV, the following set of keywords and standalone characters was omitted after a few rounds of filtering: HPV URL RT vaccine vaccines vaccination VAX gtpps dr r m e g t via 2 amp. Note that these values are already included in the Options… of the TwitterGardasilSearch file you downloaded, so you do not need to type them in manually.
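To make the idea of a word pair count concrete, here is a rough Python sketch. It assumes that two words form a pair whenever they appear anywhere in the same tweet, and that each pair is counted at most once per tweet; NodeXL's own Word Pairs calculation may use a narrower rule (for example, only adjacent words), so its numbers will not necessarily match. The function name `count_word_pairs`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_word_pairs(token_lists, skip_words=frozenset(), min_count=2):
    """Count unordered word pairs that co-occur in the same text.

    Pairs seen fewer than min_count times are dropped, which mimics
    skipping word pairs that occur only once.
    """
    pair_counts = Counter()
    for tokens in token_lists:
        # Sorting the unique words makes (a, b) and (b, a) the same pair.
        kept = sorted({t for t in tokens if t not in skip_words})
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Made-up tokenized tweets for illustration.
tokenized_tweets = [
    ["hpv", "gardasil", "side", "effects"],
    ["gardasil", "side", "effects", "rt"],
    ["hpv", "cancer"],
]
pairs = count_word_pairs(tokenized_tweets, skip_words={"hpv", "rt"})
```

Rerunning the function with a longer `skip_words` set corresponds to the repeated rounds of filtering described above.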
While this step may feel cumbersome, it is crucial for a successful semantic network analysis. Failing to remove the right keywords, especially if they are frequently used, leads to an inflation of meaningful edges in the network, making the network more connected than it is, obscuring the true structure of the network.
You can now use content from the Word Pairs worksheet to create a new NodeXL file. Navigate to the Word Pairs worksheet, copy the first two columns (without the headers), and paste them into the Edges worksheet of a new NodeXL file. Make sure the new worksheet is an Undirected network, since NodeXL treats word pairs as undirected and weighted (based on the Count) networks. You will also want to capture the additional data, such as the Count and Salience columns, by copying them to the Other Columns section of the Edges worksheet as shown in Figure 8.3. When pasting the data, you should use the Paste Special, Values (V) option that is available by right-clicking on the cell where you want to paste. This will paste the numerical values, not the formulas. Pasting the formulas would not work in this context and would link the different files, which is not desirable. You may want to reformat the Salience column data by using the Decrease Decimal feature in the Home ribbon. At this point, there is no data in the Vertices worksheet. However, when you click on Show Graph, each unique Vertex will show up on the Vertices worksheet. You can also add Count and Salience data to the Vertices worksheet, though making sure you line up the data correctly can be a bit tricky (see Advanced topic: Using vlookup). Once you are finished copying over the necessary data, make sure you save the new file (e.g., TwitterGardasilWords.xlsx).
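The step of lining up word counts with the unique vertices is essentially a lookup by word. The following Python sketch illustrates that logic under simple assumptions: edges are (word, word, count) tuples, word counts live in a dictionary, and the `build_vertex_table` helper and all sample values are hypothetical rather than part of NodeXL.

```python
def build_vertex_table(edges, word_counts):
    """List each unique word from the edge list once, with its word count.

    Words missing from word_counts get None, much like a failed lookup.
    """
    vertices = []
    seen = set()
    for v1, v2, _pair_count in edges:
        for word in (v1, v2):
            if word not in seen:
                seen.add(word)
                vertices.append({"Vertex": word, "Count": word_counts.get(word)})
    return vertices

# Made-up values for illustration.
edges = [("gardasil", "cancer", 12), ("gardasil", "girls", 7), ("cancer", "girls", 3)]
word_counts = {"gardasil": 40, "cancer": 25, "girls": 18, "boys": 9}
vertex_table = build_vertex_table(edges, word_counts)
```

Because the lookup is keyed on the word itself, the order of rows in the two tables does not matter, which is the same reason a lookup formula is safer than copying columns side by side.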
Word networks allow you to explore large datasets based on words, themes, or concepts that tend to appear together. Looking for network groups (i.e., clusters) and calculating metrics is a useful way to begin the analysis, as well as generate data that will be useful for visualizing the dataset.
Begin by running the Group by Cluster feature using the Clauset-Newman-Moore algorithm as described in Chapter 7. Then calculate the following Graph Metrics as described in Chapter 6: Overall graph metrics, Vertex degree, Vertex betweenness and closeness centrality, Vertex eigenvector centrality, and Group metrics. This captures the most important undirected network metrics for analysis. A look at the Overall Metrics worksheet shows that the data is a bit messy, with some duplicate edges (which theoretically shouldn’t exist, but may occur due to various textual anomalies). Change the Layout Options… so that each group is in its own box as shown in Figure 8.5 (see Chapter 7).
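If you want to see what cleaning up duplicate edges involves, the following Python sketch merges them. It assumes that two rows are duplicates when they join the same two words in either order, and that their counts should be added together; both choices are assumptions made for illustration, and `merge_duplicate_edges` is a hypothetical helper, not a NodeXL function.

```python
from collections import defaultdict

def merge_duplicate_edges(edges):
    """Combine undirected edges that join the same two words, summing counts."""
    merged = defaultdict(int)
    for v1, v2, count in edges:
        key = tuple(sorted((v1, v2)))  # (a, b) and (b, a) share one key
        merged[key] += count
    return [(a, b, n) for (a, b), n in merged.items()]

# Made-up edge rows for illustration.
edges = [("side", "effects", 3), ("effects", "side", 2), ("gardasil", "cancer", 5)]
clean_edges = merge_duplicate_edges(edges)
```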
In order to understand the metrics (see Chapter 6), recall that each vertex is a word. An edge represents co-occurrence of two words within social media content (e.g., a tweet). Consider the meaning of Degree in this type of network. Since a vertex is a word, the vertex’s degree centrality measures the number of words that it appears alongside (i.e., in the same tweet) within the network. Words with high degree centrality appear with many other words, indicating the dominance of a word or concept in the overall conversation. Navigate to the Vertices worksheet and Sort Largest to Smallest on the Degree column. What are the most common words? Some are ones you would expect (e.g., gardasil, cancer, girls), some are words you may not have guessed (e.g., itsmepanda1, which is a username of a prominent individual in the network), and others are artifacts of working with language that you may want to filter out by setting their Visibility to Skip (e.g., amp, which is part of a text string that indicates the & symbol). Try sorting on Betweenness Centrality. In this case, words with high betweenness centrality connect words that are otherwise much less connected in the network. In practice, words with high betweenness centrality appear across themes of conversations.
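To see what these two metrics measure in a word network, consider the small Python sketch below. It treats the network as unweighted and undirected, counts degree as the number of distinct neighboring words, and computes unnormalized shortest-path betweenness with Brandes' algorithm. The helper names and the toy edges are hypothetical, and NodeXL's reported values may be scaled or computed differently.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Map each word to the set of distinct words it co-occurs with."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def degree(adj):
    """Degree = number of distinct neighboring words."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair of words was counted once in each direction.
    return {v: b / 2 for v, b in score.items()}

# Toy word network for illustration.
edges = [("side", "effects"), ("effects", "gardasil"),
         ("gardasil", "cancer"), ("gardasil", "girls")]
adj = build_adjacency(edges)
deg = degree(adj)
btw = betweenness(adj)
```

In this toy network, gardasil has the highest degree and also lies on the most shortest paths, while effects has a lower degree but still bridges side to the rest of the words.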
Once in a while, a word may not make sense to you. The best approach is to search for that word in the original dataset (e.g., TwitterGardasilSearch.xlsx) and find the tweets it appeared in, which will give you context as to how the word is used.
A similar sorting approach can be used to identify the most important edges in the network, which represent the most important word pairs. Navigate to the Edges worksheet and Sort from Largest to Smallest on the Count column. Pairs of words that nearly always appear together, such as college pediatrician or side effects, show up. While some of these may seem obvious, sometimes there are pairs that can lead you to issues that you may want to examine more closely, such as forcibly injected or danish girl. Again, reviewing the tweets in the original dataset will provide context for these word pairs.
In word networks, clusters capture sub-groups of words that appear together in social media messages (e.g., posts, tweets) more than they appear with other words. Network clusters therefore capture themes in the conversations. Navigate to the Groups worksheet to examine the various sub-conversations. The Vertices column on the Groups worksheet shows you the number of words within the group, a measure of the total vocabulary used by the group. Meanwhile, the Total Edges column shows you the number of co-occurrences of the pairs of words that show up within the group.
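The two group-level numbers described above can be approximated with a short Python sketch. It assumes that each word already has a group assignment and that Total Edges counts one per edge row whose two words fall in the same group; whether NodeXL counts rows or sums their Count values is not verified here. The `group_summary` helper and the sample data are hypothetical.

```python
from collections import Counter

def group_summary(vertex_group, edges):
    """Return the number of words and within-group edges for each group."""
    vertices = Counter(vertex_group.values())
    total_edges = Counter()
    for v1, v2, _count in edges:
        g1, g2 = vertex_group.get(v1), vertex_group.get(v2)
        if g1 is not None and g1 == g2:
            total_edges[g1] += 1  # only edges inside a single group
    return {g: {"Vertices": n, "Total Edges": total_edges[g]}
            for g, n in vertices.items()}

# Made-up group assignments and edges for illustration.
vertex_group = {"girls": 3, "boys": 3, "teen": 3, "fda": 5, "approval": 5}
edges = [("girls", "boys", 9), ("girls", "teen", 4),
         ("fda", "approval", 6), ("teen", "fda", 2)]
summary = group_summary(vertex_group, edges)
```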
Getting a sense of the size is important, but digging into the actual themes is most useful. To do so, focus in on the most commonly used words in each group. Navigate to the Vertices worksheet and Sort Largest to Smallest on the Count column. This will help identify the most common words. Notice that a new column now exists called Vertex Group (see Figure 8.6). The filter option can be used on this column to only display items from a particular group (e.g., Vertex Group 3 in Figure 8.6). While you don’t want to leave this type of filter on permanently, it can be a nice way of quickly focusing in on data for exploratory analysis. For example, browsing through the most common words in this cluster helps identify a theme that is focused on teen health (e.g., it includes words like girls, boys, teen, age, health). If the groups are too large, you may consider trying out other clustering algorithms.
A companion technique to explore the dataset is to use a similar process to examine the most popular word combinations on the Edges worksheet. There you will find new Vertex1 Group and Vertex2 Group columns. Sort from largest to smallest on the Count column as in the prior example. Then filter both of those columns on the same group (e.g., 3) to show the within-group pairs that are most common. Or, just filter on one of them (e.g., Vertex1 Group) and see what other groups the words in it connect to. In this example, you may notice there are discussions in group 3 about fda approval, 6th grade, free teens (presumably free vaccines for teens), etc. You can use a similar technique, but sort based on graph metrics. For example, the highest degree words in a group indicate the words that show up with a high variety of other words and thus have widespread importance in the network. Those with high betweenness centrality are words that may span across groups or words within a group that serve as connecting words.
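The same filter-and-sort logic can be written as a brief Python sketch. It assumes edges are (word, word, count) tuples and that a dictionary maps each word to its group; the `within_group_pairs` and `groups_reached` helpers and the sample data are hypothetical stand-ins for the worksheet filters.

```python
def within_group_pairs(edges, vertex_group, group):
    """Edges with both words in the given group, most frequent first."""
    rows = [(v1, v2, c) for v1, v2, c in edges
            if vertex_group.get(v1) == group and vertex_group.get(v2) == group]
    return sorted(rows, key=lambda row: row[2], reverse=True)

def groups_reached(edges, vertex_group, group):
    """Other groups that words in the given group connect to."""
    reached = set()
    for v1, v2, _count in edges:
        g1, g2 = vertex_group.get(v1), vertex_group.get(v2)
        if g1 == group and g2 != group:
            reached.add(g2)
        elif g2 == group and g1 != group:
            reached.add(g1)
    reached.discard(None)  # ignore words with no group assignment
    return reached

# Made-up group assignments and edges for illustration.
vertex_group = {"girls": 3, "boys": 3, "teen": 3, "fda": 5, "approval": 5}
edges = [("girls", "teen", 4), ("girls", "boys", 9),
         ("fda", "approval", 6), ("teen", "fda", 2)]
top_pairs = within_group_pairs(edges, vertex_group, 3)
linked = groups_reached(edges, vertex_group, 3)
```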
Once you have identified themes for the groups, you may want to come up with a name for each group and enter it into the Labels column on the Groups worksheet. That will allow you to visualize labels on the graph, as well as remember the work you have done.
Clear visualization is important for communicating your findings. There is no single visualization that will capture all of the insights into a dataset. Instead, you should strive to create visualizations that tell an accurate and clear picture about the words in a network. Using techniques explained earlier in Part II of the book, you can create visualizations such as Figure 8.7 to illustrate some of the most commonly used words as they appear in different sub-group conversations.
When visualizing word networks, you will typically want to show the actual words as labels. As described in Chapter 5, you can set the Vertex Shape to Label. If you have calculated groups, then you will need to change the Group Options so that the Shape data is pulled from the Vertices worksheet, not the Groups worksheet (Chapter 7). You likely will want to use Autofill Columns to change the size of the vertices and weight of the edges so they are based on metrics or Count numbers. You will also certainly need to filter out less important edges and/or vertices to focus in on the most important words and word connections (Chapter 7). When dealing with large networks like this one, you may want to use Dynamic Filters to explore the cutoff points, and then use Autofill Columns to set the Visibility to Skip for those that you want to filter out (as described in Chapter 7). For example, in Figure 8.7 only edges with a Count value greater than 4 are shown. Additionally, only the most common 10 vertices from each of the 10 largest groups are shown (if they are in an edge).
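The filtering rule used for Figure 8.7 can be expressed as a small Python sketch. It assumes that group size means the number of words in a group and that "most common" refers to the word Count; the `select_visible` helper, its parameter names, and the sample data are hypothetical and only illustrate the logic, not how NodeXL applies Visibility.

```python
from collections import defaultdict

def select_visible(edges, vertex_count, vertex_group,
                   min_edge_count=4, top_n=10, top_groups=10):
    """Keep edges above a count cutoff whose words are both among the
    top_n most common words of one of the top_groups largest groups."""
    by_group = defaultdict(list)
    for word, group in vertex_group.items():
        by_group[group].append(word)
    # Largest groups first, measured by number of words.
    largest = sorted(by_group, key=lambda g: len(by_group[g]), reverse=True)
    top_words = set()
    for group in largest[:top_groups]:
        ranked = sorted(by_group[group],
                        key=lambda w: vertex_count.get(w, 0), reverse=True)
        top_words.update(ranked[:top_n])
    return [(v1, v2, c) for v1, v2, c in edges
            if c > min_edge_count and v1 in top_words and v2 in top_words]

# Made-up data for illustration, with small cutoffs so the effect is visible.
vertex_group = {"girls": 1, "boys": 1, "teen": 1, "fda": 2, "approval": 2, "danish": 3}
vertex_count = {"girls": 30, "boys": 20, "teen": 5, "fda": 15, "approval": 12, "danish": 8}
edges = [("girls", "boys", 9), ("girls", "teen", 6), ("fda", "approval", 7),
         ("girls", "fda", 3), ("danish", "girls", 10)]
visible = select_visible(edges, vertex_count, vertex_group,
                         min_edge_count=4, top_n=2, top_groups=2)
```

Trying several values of `min_edge_count` mirrors the way Dynamic Filters let you explore cutoff points before committing to one.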
Analysis of Figure 8.7 reveals several popular incidents that occurred during the time period. For example, the light orange group in the upper-left corner discusses an incident with a Danish woman. The green group in the upper-middle focuses on news from a clinical trial of the National Cancer Institute (i.e., thenci). Other clusters focus on side effects, other languages, the benefits of vaccines, etc.
Another technique is to focus in on a single word and examine its ego-network (i.e., the vertices immediately connected to it). To do so, right-click on a word of interest (e.g., the username itsmepanda1) in the graph pane and choose Select Adjacent Vertices. Next choose Toggle Selection, which will reverse which vertices were selected or not. Then right-click on one of the selected ones, choose Edit Selected Vertex Properties, and pick Skip from the Visibility drop-down menu. You should then only be able to see itsmepanda1 and all of the words that directly connect with it. Finally, set the Visibility for itsmepanda1 to Skip, since that word is connected to all others and will obscure some of the information on the graph. You should end up with something like Figure 8.8.
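The selection steps above amount to keeping the focal word's direct neighbors and, optionally, hiding the focal word itself. The Python sketch below illustrates that logic for an unweighted edge list; the `ego_network` helper and the toy edges are hypothetical.

```python
def ego_network(edges, ego, include_ego=False):
    """Return the edges among the ego's direct neighbors.

    With include_ego=True the ego and its own edges are kept as well;
    with the default, the ego is hidden so it does not clutter the graph.
    """
    neighbors = {b if a == ego else a
                 for a, b in edges if ego in (a, b) and a != b}
    keep = set(neighbors)
    if include_ego:
        keep.add(ego)
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Toy edges for illustration.
edges = [("gardasil", "cancer"), ("gardasil", "girls"), ("cancer", "girls"),
         ("boys", "teen"), ("cancer", "boys")]
hidden_ego = ego_network(edges, "gardasil")
with_ego = ego_network(edges, "gardasil", include_ego=True)
```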
Analysis of the graph shows that this username is mentioned alongside several other usernames (e.g., joegooding, and most of the green colored words). It also shows that this username is mentioned alongside other words related to skepticism about vaccines, such as takethatdoctors and kills. Creating additional ego-networks for other important words can reveal additional insights when they are compared.
Word networks can be particularly valuable when there is a controlled vocabulary describing something. For example, the ProQuest Dissertations and Theses Global™ database3 includes data on tens of thousands of theses and dissertations. Students are asked to tag their theses and dissertations with a primary keyword and secondary keyword(s) chosen from 405 pre-determined keywords. This dataset was analyzed with data from 2004 through 2014 with a focus on dissertations and theses that related to computing disciplines [1]. A co-word analysis of the keywords helps you understand the nature of computing on campuses in the United States. For example, Figure 8.9 shows the strong connections between Computer Science and Information Science, as well as between Computer Science and Computer Engineering, despite the fact that there is no connection between Computer Engineering and Information Science. The important bridging role of terms such as Educational Technology, Management, and Computer Science is clearly visible and highlighted in the visualization by making Size based on Betweenness Centrality.
For social media managers, semantic networks help extract the meaning that consumers give to your brand or company. Semantic network clusters capture a variety of meanings and opinions that different conversations associate with a brand. Semantic analysis can help social media managers identify points of concern, or “red flags,” that may require intervention. Campaigns often aim to set or change opinions about a brand. Semantic networks can be used to evaluate the success of such attempts by looking at the change of shared meaning before and after a campaign. Additionally, networks of controlled vocabularies (i.e., pre-determined keywords such as those used to tag a specific object like a dissertation) can produce particularly insightful networks, or co-word graphs. NodeXL allows you to create, analyze, and visualize such co-word networks using the word and word pair metrics, as well as features described earlier in Part II of the book.
The origins of the idea of semantic networks can be traced back to the 1960s, when researchers explored mental and cognitive processes [2], and later meaning construction, cognitive models, representation of knowledge, and semantic memory [3, 4]. Researchers argue that words are hierarchically stored in our memory and the meanings of the words depend on the relationships among them. When individuals talk, they do not only express themselves but also build connections with the audience through language. The use of words in conjunction with other words creates meaning [5]. Semantic analysis is also seen as an extension of traditional content analysis [6]. Semantic networks capture the structure of co-occurring words or concepts, providing an understanding of the meanings that people create as they discuss a topic or an issue. Semantic network analysis has become even more popular as large amounts of user-generated content have become available via the Internet.
Semantic networks have been studied in a wide range of areas, from politics and health to public relations and marketing (see Suggested reading). In an era where information is distributed and consumed from a wide range of sources on social media, semantic networks are increasingly able to provide deep insights into how vocabulary is used and meanings are ascribed. Semantic networks allow scholars to trace the emergent and self-formed meanings that groups of users give to an issue, event, political candidate, health-related topic, etc. Co-word analysis can also be applied to other datasets, such as research publications [1]. Related methods, such as co-citation analysis, examine the relationship between publications to gain insights into research community formation and changes over time. The intersection of social networks and co-word networks is a particularly active research area that is bound to bear new fruit in the coming years.