Chapter 8

Semantic networks

Abstract

Semantic networks help extract meanings that consumers give to a brand or company based on the text they use in social media posts. Instead of connecting people to people, semantic networks connect words to words, based on their co-occurrence. Words that co-occur with a brand name can indicate points of concern that may require public relations interventions. Semantic networks can also help understand the discussions around controversial topics, as illustrated with an example of a Twitter network related to the HPV Gardasil vaccine, which some anti-vaxxers discourage using. Additionally, networks of controlled vocabularies (i.e., pre-determined keywords or tags) can produce particularly insightful networks, or co-word graphs, as illustrated by the computing dissertation co-word analysis. The word and word pair metric features of NodeXL allow you to skip certain words (e.g., stopwords), perform sentiment analysis, and count word occurrences and co-occurrences. Network clustering, filtering, and visualization can then occur.

Keywords

Semantic networks; Social media; Network clusters; Co-word analysis; Word pairs; Sentiment analysis; HPV Gardasil; Vaccines; Twitter; Computing dissertations

8.1 Introduction

So far in this book, you have mostly explored social networks where social actors are the vertices and edges represent connections between those people. In semantic networks, in contrast, vertices are words or concepts that are connected by co-occurrence within a pre-defined text (such as a tweet, a Facebook post, or a news article). Semantic networks, then, map relationships between concepts and allow us to extract meaning from text based on the co-occurrence of concepts (i.e., which concepts show up together). Later in this book, you will use NodeXL to extract user-generated content from platforms such as Twitter (Chapter 11) and YouTube (Chapter 13). These chapters will apply social network analysis to find patterns of relationships between users, or between videos. The retrieved content, however, can also be used for semantic network analysis. This chapter will walk you through NodeXL’s semantic analysis features using a Twitter sample dataset. That said, the process detailed in this chapter can be applied to any text-based dataset, such as emails or news stories.

8.2 Creating the Twitter Gardasil HPV word pair network

In this chapter, you will create and then examine a semantic network that connects words that show up in the same tweet together. The dataset is from a Twitter Search Network (see Chapter 11) that was downloaded on Aug. 17, 2018. It included words related to the HPV vaccine. Specifically, the search term was: (#gardasil OR Gardasil) OR ((#HPV OR HPV) AND (vaccination OR vaccine OR immunization)). There has been significant controversy on social media about the potential dangers of the Gardasil vaccine, despite the fact that the Centers for Disease Control and Prevention and numerous published papers have shown it to be safe and effective.1 The original dataset connects Twitter users to Twitter users. However, you will use it to create the semantic network that connects words to words as described in this section.

8.2.1 Calculate word and word pair metrics

Download the TwitterGardasilSearch.xlsx file from the book website2 and open the file. Open the Graph Metrics dialog and check the Words and word pairs metrics briefly introduced in Chapter 6. Before calculating the metrics choose Options… and select Tweet from the drop-down menu as shown in Figure 8.1. This tells NodeXL to use words found in the Tweet column on the Edges worksheet in the analysis. Make sure you have selected the Word and word pairs metric (as shown in Figure 8.1) when you click the Options… button to open the correct options dialog window. Check the Skip words and word pairs that occur only once box as shown in Figure 8.1, then click OK. It may take a while to run, as this is a large network.

Figure 8.1 Graph Metrics dialog choosing Word and word pairs and setting the Options… to be based on the words found in the Tweet column. The highlighted words in the Skip these words section have been added.
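
To make the idea behind the Words and Word Pairs worksheets concrete, the counting can be sketched outside NodeXL in a few lines of Python. This is a minimal sketch under stated assumptions, not NodeXL's implementation: the sample tweets, the SKIP_WORDS set, the tokenizer, and the decision to pair every two distinct words that share a tweet are all illustrative choices, and NodeXL's own tokenizing and pairing rules may differ.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical sample data and skip list, for illustration only.
TWEETS = [
    "RT Gardasil side effects reported",
    "Gardasil side effects questioned by doctors",
    "Doctors recommend Gardasil",
]
SKIP_WORDS = {"rt", "by"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping skipped words."""
    words = re.findall(r"[a-z0-9#@_']+", text.lower())
    return [w for w in words if w not in SKIP_WORDS]


def count_words_and_pairs(tweets, min_count=2):
    """Count words and same-tweet word pairs, dropping those below min_count."""
    word_counts, pair_counts = Counter(), Counter()
    for tweet in tweets:
        words = tokenize(tweet)
        word_counts.update(words)
        # Sorting the unique words gives each undirected pair one canonical order.
        pair_counts.update(combinations(sorted(set(words)), 2))
    words_kept = {w: c for w, c in word_counts.items() if c >= min_count}
    pairs_kept = {p: c for p, c in pair_counts.items() if c >= min_count}
    return words_kept, pairs_kept


words, pairs = count_words_and_pairs(TWEETS)
print(words)
print(pairs)
```

Setting min_count to 2 mirrors the Skip words and word pairs that occur only once option: items seen a single time are dropped before any network is built.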

8.2.2 Iteratively refine the list of skipped words

Next, you will examine the results to identify words that should be omitted from the analysis in order to add more clarity. These are often called “stopwords” in content analysis. A stopwords list is used to remove words that are considered to be “noise” or low-value grammatical terms that occur frequently, but impart limited meaning outside of their syntactic position. A list of common American English terms is included in the default settings for NodeXL in the Skip these words section. This list can be edited, deleted, or replaced as needed to address special topics or other languages. All Unicode languages that use the space as a word delimiter can be analyzed using this feature. Keep in mind that this list is generic, as it is set to be relevant across social media spaces and topics. Your goal now is to refine the list by using data from the Words and the Word Pairs worksheets.

Navigate to the Words worksheet to examine its contents (see the left-hand side of Figure 8.1). While the Vertices worksheet is focused on each entity and its attributes, the Words worksheet is focused on each word used in the text associated with the network (e.g., the text in the Tweet column). It reports the count of each word, its salience, and whether the word appears on any of the three Sentiment word lists. Examine the list and note which words should be removed. Focus on the most frequently used words, since the less frequently used words will likely be filtered out in later analyses. To generate a list of words to be omitted, you may ask yourself:

  •  Are there words related to my search term that are in so many tweets they do not add any new information? (e.g., HPV, vaccine)
  •  Are there other words related to the source of data (e.g., Twitter) that are worth omitting? (e.g., RT)
  •  Are there other common words or numbers that likely won't add value to the analysis? (e.g., via, 2)

Add the words you have identified to the Skip these words section of the Word and Word Pair Metrics dialog as shown in Figure 8.1. Recalculate the Word and word pairs Graph Metrics and continue to the next step by examining the Word Pairs worksheet (Figure 8.2).

Figure 8.2 Word Pairs worksheet showing the most common pairs of words that show up together. This can be used as the basis for a word to word semantic network.

The Word Pairs worksheet is focused on words that are used together frequently in the text associated with the network. It reports the count of each word pair, its salience, and whether the first or second word appears on any of the three Sentiment word lists (see Chapter 6 for a basic introduction). You can repeat the steps taken above to identify more words that may frequently appear in word pairs, yet do not add value to the analysis. Add them to your list of skipped words and recalculate the graph metrics. You may want to go through this process several times. For example, for this dataset about HPV, the following set of keywords and standalone characters was omitted after a few rounds of filtering: HPV URL RT vaccine vaccines vaccination VAX gtpps dr r m e g t via 2 amp. Note that these values are already included in the Options… of the TwitterGardasilSearch file you downloaded, so you do not need to manually type them in.

While this step may feel cumbersome, it is crucial for a successful semantic network analysis. Failing to remove the right keywords, especially if they are frequently used, inflates the number of edges in the network, making the network appear more connected than it is and obscuring its true structure.
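
The refinement loop described in this section can be illustrated with a short Python sketch. It is only an illustration of the reasoning, assuming a hypothetical handful of tweets and a hypothetical helper named frequent_words; NodeXL performs the equivalent recalculation when you edit the Skip these words list and rerun the metrics.

```python
import re
from collections import Counter

# Hypothetical tweets; the real dataset is far larger.
TWEETS = [
    "RT HPV vaccine side effects via news",
    "RT HPV vaccine protects girls and boys",
    "RT HPV vaccine side effects debated",
]


def frequent_words(tweets, skip_words, min_count=2):
    """Return (word, count) rows for non-skipped words, most frequent first."""
    counts = Counter(
        word
        for tweet in tweets
        for word in re.findall(r"[a-z0-9]+", tweet.lower())
        if word not in skip_words
    )
    rows = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


skip = {"and", "via"}                      # start from a generic list
first_pass = frequent_words(TWEETS, skip)  # dominated by search terms and "rt"
skip |= {"hpv", "vaccine", "rt"}           # add the noise words you spotted
second_pass = frequent_words(TWEETS, skip) # more informative words surface
print(first_pass)
print(second_pass)
```

In the first pass the search terms and the retweet marker crowd the top of the list; once they are skipped, the remaining frequent words describe what people are actually saying.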

8.2.3 Creating a new word to word network file

You can now use content from the Word Pairs worksheet to create a new NodeXL file. Navigate to the Word Pairs worksheet, copy the first two columns (without the headers), and paste them into the Edges worksheet of a new NodeXL file. Make sure the new network is set to Undirected, since NodeXL treats word pairs as undirected and weighted (based on the Count) networks. You will also want to capture the additional data, such as the Count and Salience columns, by copying them to the Other Columns section of the Edges worksheet as shown in Figure 8.3. When pasting the data, you should use the Paste Special, Values (V) option that is available by right-clicking on the cell where you want to paste. This will paste the numerical values, not the formulas. Pasting the formulas would not work in this context and would link the different files, which is not desirable. You may want to reformat the Salience column data by using the Decrease Decimal feature in the Home ribbon. At this point, there is no data in the Vertices worksheet. However, when you click on Show Graph, each unique Vertex will show up on the Vertices worksheet. You can also add Count and Salience data to the Vertices worksheet, though making sure you line up the data correctly can be a bit tricky (see Advanced topic: Using vlookup). Once you are finished copying over the necessary data, make sure you save the new file (e.g., TwitterGardasilWords.xlsx).

Figure 8.3 Creating a new Semantic network file by copying data from the Word Pairs worksheet of the original social media network file to the Edges worksheet of a new semantic network file.
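
If the word pairs are exported to a script instead of being copied by hand, the same undirected, weighted edge list can be assembled as follows. This is a hedged sketch: the rows and counts are invented, the column headers are assumed, and merging a pair with its reversed duplicate is a judgment call of this sketch and not something the NodeXL workflow above does for you.

```python
import csv
import io
from collections import Counter

# Hypothetical rows copied from a Word Pairs table: (word1, word2, count).
WORD_PAIRS = [
    ("side", "effects", 12),
    ("effects", "side", 1),       # same pair in reverse order
    ("college", "pediatrician", 9),
]


def to_undirected_edges(rows):
    """Merge (a, b) and (b, a) into one undirected edge and sum the counts."""
    merged = Counter()
    for word1, word2, count in rows:
        merged[tuple(sorted((word1, word2)))] += count
    return sorted((a, b, c) for (a, b), c in merged.items())


def edges_to_csv(edges):
    """Serialize edges as CSV text; the header names are assumptions."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Vertex 1", "Vertex 2", "Count"])
    writer.writerows(edges)
    return buffer.getvalue()


edges = to_undirected_edges(WORD_PAIRS)
print(edges_to_csv(edges))
```

Writing plain values to a separate file has the same effect as Paste Special, Values: the new edge list carries numbers only and no links back to the original workbook.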

Advanced topic

Using vlookup

When using NodeXL, there is often a need to add additional data about vertices in the Other Columns. However, in many cases, the data you are copying over may not be sorted in the same order as the Vertex column data. Or, you may be pulling data from a source that includes additional vertices not included in the spreadsheet you want to insert it into. In such cases, the vlookup formula can be used to automatically populate a column of data based on the data in another column (e.g., the Vertex column). An example can help illustrate the value.

Navigate to the Vertices worksheet of your new file (e.g., TwitterGardasilWords.xlsx). Add two new columns in the Other Columns section and name them Count and Salience. The goal is to populate these columns with data from the Words worksheet in the TwitterGardasilSearch.xlsx network. The problem is that the vertices in the Vertex column are not in the same order as the words on the Words worksheet. You could sort each of them and they may line up correctly, but a safer way to make sure you have the right data associated with the right vertex is to use the vlookup function. Enter the following formula into the Count column: =VLOOKUP([@Vertex],[TwitterGardasilSearch.xlsx]Words!$A$7:$C$4251,2,FALSE). This formula includes several parameters. The first one, [@Vertex], says to use the data in the Vertex column as the value to look up. For example, in the first row, it will use the word college as the lookup value. The second parameter, [TwitterGardasilSearch.xlsx]Words!$A$7:$C$4251, provides the location of the lookup table. In this case it is in another file called TwitterGardasilSearch.xlsx, and includes the cells $A$7:$C$4251 on the Words worksheet. Notice the $ symbols, which make the reference stay the same, even if you copy the formula to other cells. The third parameter, 2, indicates which column in the lookup table you want to return data from. This is 2 for the Count column, but 3 for the Salience column (see formula bar in Figure 8.4), since that is the order of the columns in the original table. The final parameter is set to FALSE, which means that you need an exact match between the text in the Vertex column and the first column of the lookup table.

Figure 8.4 Populating the Count and Salience columns using the vlookup formula as shown in the formula bar. Right-clicking on cell AD3 and choosing Format Cells opened the dialog, where the format was changed from Text to General. This takes effect after clicking in the formula bar and pressing the Enter key.

Once the data is populated, copy the data in the two columns and use the Paste Special and Values (V) option (available when you right-click) to paste the values right over the original formulas. This will remove the formulas, which included pointers to a separate file. Then, to make sure it worked properly, sort by the Count column and scroll to the bottom to see if there are any #N/A errors, which occur when no match was made. In this case, there are seven errors for words that all begin with a symbol, which throws off the lookup. You can ignore the errors (assuming you don’t try to calculate anything based on them), decide to Skip these vertices if they are not important, or find the correct values in your original dataset (using the built-in Find feature of Excel) and manually copy and paste them over.
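
For readers more comfortable with code than spreadsheet formulas, the logic of an exact-match vlookup corresponds to a dictionary lookup. The sketch below is an analogy only; the table contents, the vertex list, and the helper name lookup are hypothetical, and None stands in for Excel's #N/A.

```python
# Hypothetical excerpt of a Words table: word -> (count, salience).
WORDS_TABLE = {
    "gardasil": (40, 0.03),
    "cancer": (20, 0.015),
}
# Vertices in a different order, including one with no match in the table.
VERTICES = ["cancer", "gardasil", "#cdc"]


def lookup(vertex, table, column):
    """Exact-match lookup, like VLOOKUP with FALSE; None plays the role of #N/A."""
    row = table.get(vertex)
    return None if row is None else row[column]


counts = {v: lookup(v, WORDS_TABLE, 0) for v in VERTICES}
salience = {v: lookup(v, WORDS_TABLE, 1) for v in VERTICES}
# Listing the misses is the equivalent of scrolling to the #N/A rows.
unmatched = [v for v in VERTICES if counts[v] is None]
print(counts)
print(salience)
print(unmatched)
```

Because the match is made on the word itself, the order of the two tables no longer matters, which is exactly why the lookup is safer than sorting both tables and hoping they line up.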

8.3 Analyzing word networks

Word networks allow you to explore large datasets based on words, themes, or concepts that tend to appear together. Looking for network groups (i.e., clusters) and calculating metrics is a useful way to begin the analysis, as well as generate data that will be useful for visualizing the dataset.

Begin by running the Group by Cluster feature using the Clauset-Newman-Moore algorithm as described in Chapter 7. Then calculate the following Graph Metrics as described in Chapter 6: Overall graph metrics, Vertex degree, Vertex betweenness and closeness centrality, Vertex eigenvector centrality, and Group metrics. This captures the most important undirected network metrics for analysis. A look at the Overall Metrics worksheet shows that the data is a bit messy, with some duplicate edges (which theoretically should not exist, but may occur because of textual anomalies). Change the Layout Options… so that each group is in its own box as shown in Figure 8.5 (see Chapter 7).

Figure 8.5 Twitter Gardasil Word network with groups calculated and each group displayed in its own box. This image is not yet refined.

8.3.1 Examining vertex and edge metrics

In order to understand the metrics (see Chapter 6), recall that each vertex is a word. An edge represents co-occurrence of two words within social media content (e.g., a tweet). Consider the meaning of Degree in this type of network. Since a vertex is a word, the vertex’s degree centrality measures the number of words that it appears alongside (i.e., in the same tweet) within the network. Words with high degree centrality appear with many other words, indicating the dominance of a word or concept in the overall conversation. Navigate to the Vertices worksheet and Sort Largest to Smallest on the Degree column. What are the most common words? Some are ones you would expect (e.g., gardasil, cancer, girls), some are words you may not have guessed (e.g., itsmepanda1, which is a username of a prominent individual in the network), and others are artifacts of working with language that you may want to filter out by setting their Visibility to Skip (e.g., amp, which is part of a text string that indicates the & symbol). Try sorting on Betweenness Centrality. In this case, words with high betweenness centrality connect words that are otherwise much less connected in the network. In practice, words with high betweenness centrality appear across themes of conversations.
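
The meaning of Degree in a word network can be checked on a toy example. The following sketch assumes a small invented edge list and a hypothetical helper named degree_table; it counts distinct neighboring words for each word and shows how skipping an artifact such as amp changes the ranking.

```python
from collections import defaultdict

# Hypothetical undirected word-pair edges: (word1, word2, count).
EDGES = [
    ("gardasil", "cancer", 9),
    ("gardasil", "girls", 7),
    ("gardasil", "amp", 5),
    ("cancer", "girls", 3),
]
SKIPPED = {"amp"}  # artifacts you would set to Skip


def degree_table(edges, skipped=frozenset()):
    """Return (word, degree) rows sorted from largest to smallest degree."""
    neighbors = defaultdict(set)
    for word1, word2, _count in edges:
        if word1 in skipped or word2 in skipped:
            continue
        neighbors[word1].add(word2)
        neighbors[word2].add(word1)
    return sorted(
        ((word, len(adjacent)) for word, adjacent in neighbors.items()),
        key=lambda row: (-row[1], row[0]),
    )


all_words = degree_table(EDGES)
without_artifacts = degree_table(EDGES, SKIPPED)
print(all_words)
print(without_artifacts)
```

Note that degree here ignores the Count weight: a word paired once with many different words scores higher than a word paired many times with a single partner.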

Once in a while, a word may not make sense to you. The best approach is to search for that word in the original dataset (e.g., TwitterGardasilSearch.xlsx) and find the tweets it appeared in, which will give you context as to how the word is used.

A similar sorting approach can be used to identify the most important edges in the network, which represent the most important word pairs. Navigate to the Edges worksheet and Sort from Largest to Smallest on the Count column. Pairs of words that nearly always appear together, such as college pediatrician or side effects, show up. While some of these may seem obvious, sometimes there are pairs that can lead you to issues that you may want to examine more closely, such as forcibly injected or danish girl. Again, reviewing the tweets in the original dataset will provide context for these word pairs.

8.3.2 Examining data by groups

In word networks, clusters capture sub-groups of words that appear together in social media messages (e.g., posts, tweets) more than they appear with other words. Network clusters therefore capture themes in the conversations. Navigate to the Groups worksheet to examine the various sub-conversations. The Vertices column on the Groups worksheet shows you the number of words within the group; a measure of the total vocabulary used by the group. Meanwhile, the Total Edges column shows you the number of co-occurrences of the pairs of words that show up within the group.

Getting a sense of the size is important, but digging into the actual themes is most useful. To do so, focus in on the most commonly used words in each group. Navigate to the Vertices worksheet and Sort Largest to Smallest on the Count column. This will help identify the most common words. Notice that a new column now exists called Vertex Group (see Figure 8.6). The filter option can be used on this column to only display items from a particular group (e.g., Vertex Group 3 in Figure 8.6). While you don’t want to leave this type of filter on permanently, it can be a nice way of quickly focusing in on data for exploratory analysis. For example, browsing through the most common words in this cluster helps identify a theme that is focused on teen health (e.g., it includes words like girls, boys, teen, age, health). If the groups are too large, you may consider trying out other clustering algorithms.

Figure 8.6 Filtering out all vertices that are not part of Vertex Group 3.

A companion technique to explore the dataset is to use a similar process to examine the most popular word combinations on the Edges worksheet. There you will find new Vertex1 Group and Vertex2 Group columns. Sort from largest to smallest on the Count column as in the prior example. Then filter both of those columns on the same group (e.g., 3) to show the within-group pairs that are most common. Or, just filter on one of them (e.g., Vertex1 Group) and see what other groups the words in it connect to. In this example, you may notice there are discussions in group 3 about fda approval, 6th grade, free teens (presumably free vaccines for teens), etc. You can use a similar technique, but sort based on graph metrics. For example, the highest degree words in a group indicate the words that show up with a high variety of other words and thus have widespread importance in the network. Those with high betweenness centrality are words that may span across groups or words within a group that serve as connecting words.
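
The group filters on the Edges worksheet amount to a simple test on the two endpoint groups of each edge. The sketch below assumes that cluster assignments already exist (here an invented GROUP_OF mapping) and uses a hypothetical helper named edges_for_group; the looser variant keeps an edge when either endpoint is in the group, which is slightly broader than filtering on a single column.

```python
# Hypothetical cluster assignments and weighted edges.
GROUP_OF = {"girls": 3, "boys": 3, "teen": 3, "fda": 3, "approval": 3, "danish": 5}
EDGES = [
    ("fda", "approval", 8),
    ("girls", "boys", 6),
    ("teen", "danish", 4),
    ("girls", "teen", 2),
]


def edges_for_group(edges, group_of, group, within_only=True):
    """Filter edges by group and sort them from largest to smallest count."""
    selected = []
    for word1, word2, count in edges:
        g1, g2 = group_of[word1], group_of[word2]
        if within_only:
            keep = g1 == group and g2 == group   # both endpoints in the group
        else:
            keep = g1 == group or g2 == group    # at least one endpoint in it
        if keep:
            selected.append((word1, word2, count))
    return sorted(selected, key=lambda edge: -edge[2])


within = edges_for_group(EDGES, GROUP_OF, 3)
touching = edges_for_group(EDGES, GROUP_OF, 3, within_only=False)
print(within)
print(touching)
```

The within-group list points to the theme of the cluster, while the edges that appear only in the looser list show where that theme connects to other conversations.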

Once you have identified themes for the groups, you may want to come up with a name for each group and enter it into the Labels column on the Groups worksheet. That will allow you to visualize labels on the graph, as well as remember the work you have done.

8.4 Visualizing word networks

Clear visualization is important for communicating your findings. There is no single visualization that will capture all of the insights into a dataset. Instead, you should strive to create visualizations that present an accurate and clear picture of the words in a network. Using techniques explained earlier in Part II of the book, you can create visualizations such as Figure 8.7 to illustrate some of the most commonly used words as they appear in different sub-group conversations.

Figure 8.7 Filtered view of the most common word pairs (Edge Count Greater than 4) and words (top 10 words if they are in an edge) from the 10 largest groups. Edge Width and Opacity are based on edge Count. Color is based on automatically calculated clusters. Size is based on Degree.

When visualizing word networks, you will typically want to show the actual words as labels. As described in Chapter 5, you can set the Vertex Shape to Label. If you have calculated groups, then you will need to change the Group Options so that the Shape data is pulled from the Vertices worksheet, not the Groups worksheet (Chapter 7). You likely will want to use Autofill Columns to change the size of the vertices and weight of the edges so they are based on metrics or Count numbers. And you will certainly need to filter out less important edges and/or vertices to focus in on the most important words and word connections (Chapter 7). When dealing with large networks like this one, you may want to use Dynamic Filters to explore the cutoff points, and then use Autofill Columns to set the Visibility to Skip for those that you want to filter out (as described in Chapter 7). For example, in Figure 8.7 only edges with a Count value greater than 4 are shown. Additionally, only the 10 most common vertices from each of the 10 largest groups are shown (if they are in an edge).
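
The filtering rule used for Figure 8.7 can be expressed as a short procedure. The sketch below is one possible reading of that rule, with invented data and a hypothetical helper named filter_network: keep edges above a count cutoff, then keep only each group's most common words among the words that still appear in an edge. The restriction to the largest groups is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical data: word counts, group assignments, weighted edges.
WORD_COUNT = {"side": 30, "effects": 28, "safe": 12, "girls": 20, "boys": 9}
GROUP_OF = {"side": 1, "effects": 1, "safe": 1, "girls": 2, "boys": 2}
EDGES = [
    ("side", "effects", 15),
    ("side", "safe", 6),
    ("girls", "boys", 5),
    ("effects", "girls", 2),
]


def filter_network(edges, word_count, group_of, min_count=4, top_k=10):
    """Keep edges with count > min_count among each group's top_k words."""
    strong = [e for e in edges if e[2] > min_count]
    in_edge = {w for w1, w2, _ in strong for w in (w1, w2)}
    by_group = defaultdict(list)
    for word in in_edge:
        by_group[group_of[word]].append(word)
    visible = set()
    for members in by_group.values():
        # Most common words first; alphabetical order breaks ties.
        members.sort(key=lambda w: (-word_count[w], w))
        visible.update(members[:top_k])
    return [e for e in strong if e[0] in visible and e[1] in visible]


filtered = filter_network(EDGES, WORD_COUNT, GROUP_OF, min_count=4, top_k=2)
print(filtered)
```

Trying several values of min_count and top_k in a script plays the same role as exploring cutoff points with Dynamic Filters before committing to a Visibility setting.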

Analysis of Figure 8.7 reveals several popular incidents that occurred during the time period. For example, the light orange group in the upper-left corner discusses an incident with a Danish woman. The green group in the upper-middle focuses on news from a clinical trial of the National Cancer Institute (i.e., thenci). Other clusters focus on side effects, other languages, the benefits of vaccines, etc.

Another technique is to focus in on a single word and examine the ego-network (i.e., the vertices immediately connected to it). To do so, right-click on a word of interest (e.g., the username itsmepanda1) in the graph pane and choose Select Adjacent Vertices. Next choose Toggle Selection, which will reverse which vertices are selected. Finally, right-click on one of the selected ones and choose Edit Selected Vertex Properties and pick Skip from the Visibility drop-down menu. You should then only be able to see itsmepanda1 and all of the words that directly connect with it. Finally, set the Visibility for itsmepanda1 to Skip, since that word is connected to all others and will obscure some of the information on the graph. You should end up with something like Figure 8.8.

Figure 8.8 A 1.5 degree ego-network graph for the word itsmepanda1. The word is removed from the graph. Colors are based on original groups, Size is based on Degree, and edge Width and Opacity are based on Count.
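
The selection steps above can also be summarized as a small function. This sketch assumes an invented edge list and a hypothetical helper named ego_network: it finds the words directly connected to the chosen word and keeps only the edges among them, optionally leaving the chosen word itself out, as in Figure 8.8.

```python
# Hypothetical undirected edges: (word1, word2, count).
EDGES = [
    ("itsmepanda1", "joegooding", 5),
    ("itsmepanda1", "kills", 4),
    ("joegooding", "kills", 2),
    ("kills", "cancer", 3),
    ("cancer", "girls", 6),
]


def ego_network(edges, ego, keep_ego=False):
    """Return the edges among the ego's neighbors (a 1.5-degree ego network)."""
    neighbors = set()
    for word1, word2, _count in edges:
        if word1 == ego:
            neighbors.add(word2)
        elif word2 == ego:
            neighbors.add(word1)
    visible = neighbors | ({ego} if keep_ego else set())
    return [e for e in edges if e[0] in visible and e[1] in visible]


ego_edges = ego_network(EDGES, "itsmepanda1")
with_ego = ego_network(EDGES, "itsmepanda1", keep_ego=True)
print(ego_edges)
print(with_ego)
```

Dropping the ego removes the star of edges that every neighbor shares, so the remaining edges show how the neighboring words relate to one another.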

Analysis of the graph shows that this username is mentioned alongside several other usernames (e.g., joegooding, and most of the green colored words). It also shows that this username is mentioned alongside other words related to skepticism about vaccines, such as takethatdoctors, and kills. Creating additional ego-networks for other important words can reveal additional insights when they are compared.

8.5 Visualizing computing dissertation and thesis connections

Word networks can be particularly valuable when there is a controlled vocabulary describing something. For example, the ProQuest Dissertations and Theses Global™ database3 includes data on tens of thousands of theses and dissertations. Students are asked to tag their theses and dissertations with a primary keyword and secondary keyword(s) chosen from 405 pre-determined keywords. This dataset was analyzed with data from 2004 through 2014 with a focus on dissertations and theses that relate to computing disciplines [1]. A co-word analysis of the keywords helps understand the nature of computing on campuses in the United States. For example, Figure 8.9 shows the strong connections between Computer Science and Information Science, as well as Computer Science and Computer Engineering, despite the fact that there is no connection between Computer Engineering and Information Science. The important bridging roles of terms such as Educational Technology, Management, and Computer Science are clearly visible and highlighted in the visualization by making Size based on Betweenness Centrality.

Figure 8.9 Co-word analysis of keywords used to describe dissertations in the ProQuest Dissertations and Theses database from 2009 through 2014. Size is based on Betweenness Centrality. Edge Width and Opacity are based on the number of co-occurrences of the keywords. Only edges with a co-occurrence count greater than 7 are shown. Color is based on clusters identified with the larger network.
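
A co-word analysis of a controlled vocabulary follows the same counting logic as the tweet example, except that each item contributes a short list of tags in place of free text. The sketch below uses three invented records and a hypothetical helper named coword_edges; the keyword names are borrowed from the discussion above, but the records and counts are made up and do not reproduce the study's data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword tags; each inner list describes one dissertation.
DISSERTATIONS = [
    ["Computer Science", "Information Science"],
    ["Computer Science", "Computer Engineering"],
    ["Computer Science", "Information Science", "Educational Technology"],
]


def coword_edges(tagged_items, threshold=0):
    """Count how often two keywords tag the same item; keep counts > threshold."""
    pairs = Counter()
    for tags in tagged_items:
        # Sorting the unique tags gives each undirected pair one canonical order.
        pairs.update(combinations(sorted(set(tags)), 2))
    return {pair: n for pair, n in pairs.items() if n > threshold}


all_pairs = coword_edges(DISSERTATIONS)
strong_pairs = coword_edges(DISSERTATIONS, threshold=1)
print(all_pairs)
print(strong_pairs)
```

Because the tags come from a fixed list, there is no stopword cleanup to do, which is one reason controlled vocabularies tend to yield cleaner networks than free text.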

8.6 Practitioner’s summary

For social media managers, semantic networks help extract the meaning that consumers give to your brand or company. Semantic network clusters capture a variety of meanings and opinions that different conversations associate with a brand. Semantic analysis can help social media managers identify points of concern, or “red flags,” that may require intervention. Campaigns often aim to set or change opinions about a brand. Semantic networks can be used to evaluate the success of such attempts by looking at the change of shared meaning before and after a campaign. Additionally, networks of controlled vocabularies (i.e., pre-determined keywords such as those used to tag a specific object like a dissertation) can produce particularly insightful networks, or co-word graphs. NodeXL allows you to create, analyze, and visualize such co-word networks using the word and word pair metrics, as well as features described earlier in Part II of the book.

8.7 Researcher’s agenda

The origins of the idea of semantic networks can be traced back to the 1960s, when researchers explored mental and cognitive processes [2], and later meaning construction, cognitive models, representation of knowledge, and semantic memory [3, 4]. Researchers argue that words are hierarchically stored in our memory and the meanings of the words depend on the relationships among them. When individuals talk, they do not only express themselves but also build connections with the audience through language. The use of words in conjunction with other words creates meaning [5]. Semantic analysis is also seen as an extension of traditional content analysis [6]. Semantic networks capture the structure of co-occurring words or concepts, providing an understanding of the meanings that people create as they discuss a topic or an issue. Semantic network analysis has become even more popular as large amounts of user-generated content have become available via the Internet.

Semantic networks have been studied in a wide range of areas, from politics and health to public relations and marketing (see Suggested reading). In an era where information is distributed and consumed from a wide range of sources on social media, semantic networks are increasingly able to provide deep insights into how vocabulary is used and meanings are ascribed. Semantic networks allow scholars to trace the emergent and self-formed meanings that groups of users give to an issue, event, political candidate, health-related topic, etc. Co-word analysis can also be applied to other datasets, such as research publications [1]. Related methods, such as co-citation analysis, examine the relationship between publications to gain insights into research community formation and changes over time. The intersection of social networks and co-word networks is a particularly active research area that is bound to bear new fruit in the coming years.

References

[1] Kim S., Hansen D., Helps R. Computing research in the academy: insights from theses and dissertations. Scientometrics. 2018;114(1):135–158.

[2] Collins A.M., Quillian M.R. Retrieval time from semantic memory. J. Verbal Learn. Verbal Behav. 1969;8(2):240–247.

[3] Carley K.M., Kaufer D.S. Semantic connectivity: an approach for analyzing symbols in semantic networks. Commun. Theory. 1993;3(3):183–213.

[4] Rice R.E., Danowski J.A. Is it really just like a fancy answering machine? Comparing semantic networks of different types of voice mail users. J. Business Commun. 1993;30(4):369–397.

[5] McGee M.C. The “ideograph”: a link between rhetoric and ideology. Q. J. Speech. 1980;66(1):1–16.

[6] Kim D., Kim S.Y., Choi M.I. The pivotal role of AJC in the growth of communication research in Asia: a semantic network analysis. Asian J. Commun. 2016;26(6):626–645.

Suggested reading

[Sevin, 2014] Sevin H.E. Understanding cities through city brands: city branding as a social and semantic network. Cities. 2014;38:47–56.

[Hong et al., 2016] Hong Y.J., Shin D., Kim J.H. High/low reputation companies’ dialogic communication activities and semantic networks on Facebook: a comparative study. Technol. Forecast. Soc. Change. 2016;110:78–92.

[Doerfel and Connaughton, 2009] Doerfel M.L., Connaughton S.L. Semantic networks and competition: election year winners and losers in US televised presidential debates, 1960–2004. J. Am. Soc. Inf. Sci. Technol. 2009;60(1):201–218.

[Lycarião and dos Santos, 2017] Lycarião D., dos Santos M.A. Bridging semantic and social network analyses: the case of the hashtag #precisamosfalarsobreaborto (we need to talk about abortion) on Twitter. Inf. Commun. Soc. 2017;20(3):368–385.

[Yang and Veil, 2017] Yang A., Veil S.R. Nationalism versus animal rights: a semantic network analysis of value advocacy in corporate crisis. Int. J. Business Commun. 2017;54(4):408–430.

[Kwon et al., 2016] Kwon K.H., Bang C.C., Egnoto M., Raghav Rao H. Social media rumors as improvised public opinion: semantic network analyses of twitter discourses during Korean saber rattling 2013. Asian J. Commun. 2016;26(3):201–222.

