21

The Methodology of Online News Analysis

A Quantitative Approach to Ephemeral Media

Helle Sjøvaag and Eirik Stavelin

ABSTRACT

This chapter describes a new method for analyzing online news that employs both human and computer coding and is based on a triangulation of qualitative and quantitative measures. The chapter illustrates how the method was used to perform a content analysis of the full online news output of the Norwegian Broadcasting Corporation (NRK) for the year 2009 (n = 74,430). It explains how the approach was developed, describes the problems we encountered along the way, and provides a rationale for further use of the method. We conclude that our work demonstrates that methodologies designed for measuring broadcast news do not suffice in the ephemeral online news environment. Thus, online research methods need to be redesigned to account for medium-specific news features on the Internet.

In this chapter we describe how we arrived at a methodology for analyzing online news quantitatively. The approach combines human and computer coding and is based on a triangulation of qualitative and quantitative measures. Used to perform a content analysis of a year's online news output of the Norwegian Broadcasting Corporation (NRK), the study reveals how methodologies designed for measuring broadcast news fail to fully address the ephemeral nature of the online news environment. Online research methods need to be redesigned to account for medium-specific features on the Internet. In this chapter we explain how the approach was developed through a mixed inductive–deductive design and describe the problems we encountered along the way – particularly relating to obtaining the data, defining the unit of analysis, designing the coding scheme, and testing reliability. We believe the appropriateness of the method lies primarily in the advantages offered by the computational ability to process large sets of data quickly and without errors, and thus aid in the analysis of online news media.

The Object of Research

The object of study is the online news services of the license-fee-financed public service broadcaster in Norway. NRK was established in 1933. Based on the BBC's Reithian model to inform, educate, and entertain, it provides public service content primarily on three platforms – television, radio, and online. Overall, the Norwegian television sector is characterized by moderate competition within a fully digitalized broadcasting spectrum (Lund & Berg, 2009, p. 21). The market is dominated by four national broadcasters, with growing differentiation as an industry-wide strategy continues to expand the variety of digital offshoot channels, alongside a plethora of international channels transmitted on the satellite and cable platforms. Nevertheless, NRK remains a steady leader in most content segments of the broadcasting markets, with an 86% audience reach across all platforms overall (Medie Norge, 2011).

In the increasingly platform-independent news market, however, NRK also faces competition from the national daily newspapers. As an online medium, NRK is second in the market, with a 23.2% daily reach and approximately 830,000 unique daily users, behind the national daily tabloid Verdens Gang (VG), with 29.4% (TNS-Gallup, 2011). Our survey of NRK's online news at www.nrk.no was conducted using data from 2009 – data collected just prior to a significant restructuring of nrk.no as a publication platform. At the time, NRK was the fifth most popular online site in the country (TNS-Gallup, 2009). The data we present in this chapter are therefore not necessarily a reflection of the site's content profile today. However, the focus of the following discussion is the methodological development that led to the results, not necessarily the results themselves, although we find it likely that the need for methodological development in this case also led to some of our more interesting findings.

The Data

The following data sets were produced for analysis.

  1. Text-based news articles published by NRK online during 2009 (n = 74,430). Initially we received from NRK an index file containing 502,180 URLs published between 2002 and 2010. A search produced 73,497 texts for 2009, but among these we found only one economy article. This discovery revealed the lack of consistency in the publication systems used across newsrooms within the organization. We obtained the economy content from NRK separately, making 74,430 texts in total. We have no way of knowing exactly how many texts are missing from the sample, but we estimate that 74,430 is close to the complete output for 2009. The data set was collected and treated by way of computer software custom-written for this project, on the basis of the index file of published articles.
  2. News articles published on 10 preselected dates in 2009, abstracted from data set 1 (n = 2,162). This data set consists of a representative selection of text-based news items published across a constructed 10-day period. The dates were preselected from the second week of each month. In three cases sampling had to be moved to the next week due to the lack of a complete set of front-page snapshots for the date in question (see data set 3). As time constraints limited sampling to 10 days, May and July were excluded as the least representative months (May because of the national holiday, and July because the general summer holiday left fewer journalists on staff). Every weekday was represented1 and the selection was predetermined – a Monday from January, a Tuesday from February, and so on. Because of the lower publication frequency at weekends, the sample contains only one Saturday and one Sunday. Event-sensitive days such as Easter and Christmas were avoided. Data set 2 underwent manual coding and quantitative content analysis.
  3. Top 10 news stories from the front page or main domain (www.nrk.no) (n = 1,192). The sample was compiled from the same dates as data set 2. Top 10 stories were collected once every hour between 09:00 and 19:00, in the form of image snapshots of the entire front page. Data set 3 underwent manual coding and quantitative content analysis.

Items analyzed were limited to the text-based news articles published on nrk.no, and did not include redistribution of audiovisual content from broadcasting to the Internet. Audiovisual news content on nrk.no amounted primarily to radio and television productions ported to the website for streaming. As we were looking to analyze the news dissemination of NRK's online news unit, we chose to overlook audiovisual content not primarily intended for the online platform and to focus on NRK's textually mediated online news. The aims of the project relevant here were to assess the degree to which nrk.no presented a continuous and updated news agenda online during 2009; the thematic distribution of its news content; its front page priorities; the depth and perspectives of the news content; and the degree to which nrk.no used interactive or other Internet-specific tools in its news dissemination.

The quantitative and computer-assisted content analyses performed on the collected data confirm that nrk.no presented a continuous and updated news service during 2009.2 The analysis also shows that the news content on nrk.no overall was characterized by a high level of local news. However, its front page bore all the hallmarks of a national online news site and was dominated by international news, politics, crime stories, and popular culture. As an online news medium, nrk.no used interactive elements only to a limited degree (see Choi, 2004; Nguyen, 2008). The website had a low external linking practice, and limited use of video and audio, quizzes, tests, and commentary (see also Barnhurst, 2010; Carpenter, 2010; Ureta, 2011).

All in all the study confirmed assumptions about the online news genre (Chung, 2007; Quandt, 2008), with short news stories averaging 260 words, “breaking news,” and a high publication frequency with over 200 published texts per day. Findings were the result of method triangulation. We used an inductive design on latent content (content requiring judgment to assign coding values to news items, such as sports, politics, business, and crime) and a deductive design on manifest content (content that is observable and countable, such as hyperlinks, pictures, publication date, and author) (Krippendorff, 2004, p. 20; Neuendorf, 2002, p. 23).

The Problems

Overall the research design illuminated problems with transferring content analysis methods from audiovisual media analyses to the online environment. The analysis involves two different aspects of NRK's online publication. Both the front page (the primary domain www.nrk.no) and the website's "inside" – its deeper structure of subsites – were under investigation. Because our aim was to ascertain the extent to which NRK was able to utilize the unique potential of the Web in its news publication, and to establish a profile of its content, we started from a large corpus of over 74,000 unique items of data. From the outset we knew we would encounter several methodological challenges in this regard. In the following we expand on these problems under four headings: problems with obtaining the full population of data; problems with defining the unit of analysis; problems with designing an operable coding scheme; and problems with establishing intercoder reliability. In the end, these methodological challenges helped provide some of the more interesting findings of the analysis of the online logic of NRK's news production, a subject to which we will return in the conclusion.

Problem 1: Obtaining the Full Population of Data

Our first question on assuming the task of designing a workable scheme for analyzing online news was “What exactly is the object we are studying?” The issue is thus one of inclusion and exclusion. What counts as part of the news service we are investigating, and what falls outside the scope of our research interest?3

Most of the portals used by NRK to disseminate content online are produced and maintained using the same content management system (CMS), but not all. This raises problems when it comes to gathering the data for news analysis, where consolidation of overlapping content can quickly become a central task. When there is a multitude of subsites and portals, as on www.nrk.no, the choice of what to include in the study will affect the results, especially if important subsections are overlooked. Our solution to this problem was primarily a pragmatic one. We first moved to ascertain an overview of the total output of material. This can be achieved manually or using a spider – "software that automatically enables Internet searches" (Deuze, 2003). However, we opted to conduct a manual examination. We then chose to focus only on the large portals and subdomains that regularly add new data, making sure they were produced by as few CMSs and design templates as possible. This simply means that there are fewer "templates" from which to automatically extract the variables in the codebook, which simplifies the data collection procedure. As our aim was to analyze news output, we chose to overlook content associated with radio and television programs, blogs, and other aspects of content production outside the news streams organized by geography, news topics, or the front page.

A further issue that soon emerged was what constitutes an article from 2009. One of the challenges of conducting this project was to collect online publication data after the fact – from the previous year. Hence, one of our first concerns was whether we could obtain representative data. Contact with NRK enabled us to start with an initial data set consisting of 502,180 news items – given to us as links in an index file in XML format. From this we selected the subset of links marked as published in 2009.4 As we aimed to analyze a year in retrospect, we needed to draw clear boundaries as to what material to include in the sample of cases. A central question here is whether an article published in 2009 and updated in 2010 is to be considered part of the 2009 or the 2010 sample. However you choose to define the appropriate date of publication – the original publication date or the date of the update – there are bound to be some, if minor, effects on the results. We chose to define the sample by the latest date of update. Hence, stories that were published in 2008 and updated in 2009 were included, while stories that were published in 2009 and updated in 2010 were excluded. Excluded stories comprised 4% of the material. Despite the margin of error introduced by this issue, the large sample size of 74,430 items published in 2009 counteracts the problem.
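As a minimal sketch of this inclusion rule – assuming, purely for illustration, that the index file follows the standard XML sitemap format with a <lastmod> date for each URL (NRK's actual index layout may differ) – the filtering step could look as follows in Python, the language we used for data collection:

    import xml.etree.ElementTree as ET

    # Namespace of the standard sitemap protocol (an assumption about the index file)
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def links_for_year(index_path, year="2009"):
        """Return URLs whose latest date of update falls within the given year."""
        tree = ET.parse(index_path)
        urls = []
        for entry in tree.getroot().findall("sm:url", NS):
            loc = entry.findtext("sm:loc", namespaces=NS)
            lastmod = entry.findtext("sm:lastmod", namespaces=NS)
            # Membership is defined by the latest update date: a story published
            # in 2008 and updated in 2009 is included; one updated in 2010 is not.
            if loc and lastmod and lastmod.startswith(year):
                urls.append(loc)
        return urls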

Our initial glance at the material revealed that some of the data that could be expected in the full 2009 collection was missing. For instance, only one of the 74,430 items was classified within the portal www.nrk.no/economy. When we requested the missing data from NRK, we soon discovered that their practice of using several publication systems had led to a lack of integration. Hence the total amount of published material remains unknown to us as well as to NRK. Based on the material we did receive, however, we estimate that the total of 74,430 unique links to articles (with portal pages and nonsingle news items removed) is likely to be close to the actual total number of online news texts produced in 2009.

Dealing with Mutual Exclusion

An overarching problem with defining the sample based on URL links to news stories is the absence of mutual exclusivity as a publication or storage principle. In the online publication system, a news article can sit in several sections at once. A story can be catalogued under both economy and domestic affairs, and it can be published on both subsites. NRK's website, like many others, identifies individual stories or texts by a unique identification consisting of a string of numbers easily found in the URL, for example, http://www.nrk.no/nyheter/okonomi/1.6923373 and http://www.nrk.no/nyheter/norge/1.6923373. This unique identification number enables registration independent of the rest of the URL. It makes the database recreation more complex, but keeps tabs on double posting. An alternative is to designate only one of the units as unique and to remove redundant URLs. This, however, also removes the potential value of the categorization given in the URL. The phenomenon reveals more about how NRK's employees use the CMS and Web as a publication platform than about the content itself. And while interesting organizational questions can be answered through studying these types of actions, it falls outside the scope of this study. Hence, we removed duplicate URLs by a "distinct query" in our database, and kept only one URL for each story. The double postings amounted to 1,557 stories, approximately 2% of the material. This adds a systematic bias to the sample, as we kept the URLs ranked first in alphabetical order. We decided on this course of action for the sake of simplicity and manageability, based on the fact that the number of double-posted URLs was low. By doing so we also kept the database structure on our servers in a single table format, and avoided recreating it with relations and keys. The idea behind the computer-supported analysis was to recreate as much data as possible based on the original database on NRK's servers. This entails collecting news articles from the Web and determining what constitutes appropriate content for subsequent analysis. This process of delimitation is based on a combination of what we did and did not have access to, and the data we needed to answer the research questions. This raises the question of how the transformation can best be operationalized.
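To illustrate the deduplication step, the effect of our "distinct query" – keeping, for each unique identification number, the URL ranked first in alphabetical order – can be sketched as follows (a reconstruction of the logic, not the query we actually ran; the regular expression is an assumption about the URL format):

    import re

    # The unique story identification is the trailing number string, e.g.
    # "1.6923373" in http://www.nrk.no/nyheter/okonomi/1.6923373
    ID_PATTERN = re.compile(r"(\d+\.\d+)$")

    def deduplicate(urls):
        """Keep one URL per story ID, preferring the first in alphabetical order."""
        kept = {}
        for url in urls:
            match = ID_PATTERN.search(url)
            if not match:
                continue  # portal pages and nonsingle news items are dropped
            story_id = match.group(1)
            if story_id not in kept or url < kept[story_id]:
                kept[story_id] = url
        return sorted(kept.values())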

Reconstructing the Database

The database itself is always the ideal data set for a content analysis of such a medium (unless the aesthetic or functional/interactive features are the focus of study), given that a website is only a representation of a database. A database dump containing both the structure and the content provides the best condition for an analysis that is precise and reliable. It also guarantees that all columns in the database are transferred. The disadvantages connected with such a database dump have to do with the complexities that occur in large databases of information systems used in news organizations. Resolving such complexities probably requires expert knowledge of the specific routines and practices of the information system and the news organization itself. Allowing a third party full access is itself rare, moreover, as such material comprises one of the more valuable resources of a news organization.

A recently developed and interesting approach in this regard is the (Web) API (application programming interface), an example of which is the Guardian Open Platform.5 NRK has announced that it is working on developing such an API, without revealing too much of what it will contain, or whether it will indeed be realized (NRK Beta, 2009). The main advantage of using an API is precision. A story is exposed through the API as structured information with metadata, as opposed to the results of a scraping procedure, where this structure needs to be created – a process that allows for a certain margin of error. The API approach would certainly provide a faster and easier way for researchers to access data, provided it contains the data required to answer the research questions. Uncertainties are likely to emerge, however, as to whether the API content matches the continuing flow of news on the front page or other central portals; but for an analysis of a website's "inside," an API remains a good option.

A third, less desirable, option for accessing the necessary data is to obtain an index file comprising a list of URLs intended for search engines, typically written in XML. An index will require a scraping operation using custom selectors to fetch the desired data. The advantage of the index file is that it often contains some metadata (e.g., dates) that can be useful. As a last resort, an index can be compiled using a crawler, spider, or robot – a computer program that browses the Web, typically in order to gather data to index for search engines. This method can potentially ensure that links missing from an index, or not exposed through an API, are obtained, but it adds another methodological step and can be time-consuming and error-prone. This approach also requires a specific set of technical skills and could require a third-party technician. As metadata are lacking prior to scraping, making a subselection in the material (for instance, from a specific year or month) first requires scraping all links to establish the scope for the study. When it comes to obtaining the data, the closer to the original database you can get, the better the results you can expect to achieve. As we were unable to obtain a database dump, and NRK does not have a public API, we consider ourselves lucky that it had an index file. The solution was time-consuming, but worked well in terms of operationalizing our research design.
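For completeness, a minimal sketch of the last-resort crawler approach (not the route we took; a production crawler would additionally need politeness delays and robots.txt handling, and all names here are our own):

    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    def crawl(seed, max_pages=100):
        """Breadth-first crawl confined to the seed's domain, collecting URLs."""
        domain = urlparse(seed).netloc
        seen, queue = {seed}, deque([seed])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read()
            except OSError:
                continue  # unreachable pages are skipped – one source of error
            for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen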

Analyzing Front Pages

The front page priorities of an online news site are essential for determining the content profile and publication frequency of a national medium aiming for high visibility among online users. To be able to look back and see how a website looked day by day in the previous year is no small request. While services such as the Internet Archive's Wayback Machine (http://archive.org/) continuously catalogue snapshots of web pages, they do not necessarily do so at frequent intervals, and thus do not provide material of sufficient scope for a content analysis. A register of website updates must be set up prior to data collection and kept running throughout the desired period of time. As our project launched only in late 2009, we had no possibility of collecting this data ourselves. Through regular contact with NRK we discovered that the broadcaster had itself recorded screen dumps of its front page every hour of every day. These records consist of large image files where all the information is stored as pixels, rather than as text and images in markup. We used these images to construct a sample for an analysis of the content priorities for the top 10 stories on the site – roughly equivalent to the "above the fold" priorities of the printed newspaper.

The Web is still a relatively immature and changeable arena. The limited possibilities for, and practices of, archiving online content put strains on the collection, handling, and use of data. Not only does the technology provide significantly larger sets of data, but the data also require a different type of classification and analysis (see Brügger, 2008, for discussion). The ephemeral nature of this material adds to the challenges when the analysis can be performed only the year after publication, as in our project. In order to penetrate the ephemeral nature of the online news flow, moreover, we first need to know what we are looking to analyze, that is, to identify the proper unit of analysis.

Problem 2: Defining the Unit of Analysis

We knew from the outset that we would have to account for the specificities of the online environment when designing our methodology. In the analysis of content in a newspaper or a broadcast bulletin, the data obtained on tape or on paper are fixed in time and unchangeable from the moment of publication. As it is not possible to edit analogue news ex post, the units of analysis remain open to inspection. Not least, it means the study can be replicated, a factor that increases its validity. The technological nature of the Internet means that online news differs from analogue news in at least two significant ways that are relevant to our study. First, the editorial moment is not frozen in time, which means multiple versions of the same story can exist, of which only the latest publication is considered the "right" one. This raises questions about the editorial institution of journalism. Internet technology changes the nature of journalism from a daily summary of the most significant events to a continuous deadline covering a potentially limitless number of events. A related second issue is the ephemeral nature of digital information itself. As noted by Michael Karlsson and Jesper Strömbäck in their study of website update frequencies, the question therefore becomes how to "freeze the flow" of online news (Karlsson & Strömbäck, 2010).

Karlsson and Strömbäck observe that there is a lack of viable research methods for investigating empirically the impact of the two primary characteristics of online news – interactivity and immediacy – on news content at the level of the individual report. They call for a strategy that can "freeze the flow" of online news so as to allow for the systematic content analysis of online news and its special characteristics (Karlsson & Strömbäck, 2010, p. 6). The Internet is first of all characterized by a lack of the volume and space restrictions found in traditional media. Second, web technology does not bind news organizations to producing one original copy ready for replication at a specific moment in time. The Internet instead carries the possibility of fast, continuous publication with frequent updates – characteristics that engender traits of immediacy and interactivity in news dissemination (Deuze, 2003, p. 206; Karlsson & Strömbäck, 2010, p. 2; Nguyen, 2010, p. 224). "Freezing the flow" therefore represents the challenge of taking the ephemeral nature of online news into consideration in the content analysis of web media. As such it relates to a common methodological issue in all content analyses regardless of publication platform, namely the question of the unit of analysis.

Weare and Lin noted back in 2000 how the chaotic structure of the Internet complicates sample selection, and how the integration of technologies – what they refer to as text, video, graphic, or audio – makes the design of valid descriptive categories complicated (Weare & Lin, 2000, p. 273). They observe that

In traditional, linear forms of media, such as newspapers and television broadcasts, the boundaries of messages, such as a news article, are clearly defined. Similarly, the structure of newspapers and broadcasts are sufficiently standardized and their syntax is sufficiently well understood that defining the nature of a message's context is straightforward. In contrast, the nonlinear nature of the WWW obscures the boundaries and environment of messages and involves more complex semantics. (Weare & Lin, 2000, p. 280)

One therefore has to take great care when defining the unit of analysis in the study of online content. For instance, Weare and Lin describe a web page as "a highly aggregated recording unit," equivalent to a full newspaper article (p. 282). In order to design a valid coding scheme for an online news article, they say, we need to define smaller recording units based on an understanding of the site's underlying structures (p. 283).

One of the more specific problems we encountered in this regard emerged during the manual coding of the front page of nrk.no. It concerns the boundary between update and publication. The issue was essentially the question of what constitutes an update to a story. At what point is an update significant enough that it becomes a "new story"? This is one of the more problematic discussions regarding the unit of analysis in front page analysis of online news, and as publication frequencies escalate, it will continue to present problems. A heavily covered story, such as a political press conference, from which every update is logged as a new story, will increase the number of political stories in the sample. Compared to logging such items as updates to one story, this increases the visibility of "politics" in the resulting analysis – possibly making it seem as if the website is primarily dominated by journalistic content of high legitimacy, provided you measure content in "instances" rather than in "volume." Hence, it must be considered which coding practices most fairly reflect the content of the website. An instructive dilemma to consider here is the continuous coverage of a sports event. As an illustration, we decided to draw the line between update and "new story" at the point where the narrative changes. For instance, in the event of a football game there is likely to be an original narrative in terms of "preparing for the game." Once the game is underway, a new story is logged, where reports on goals scored constitute updates rather than new stories. Once the game is over and the narrative changes from "playing the game" to "won the game," another unit of analysis is collected for coding. The boundaries between narratives are, however, highly permeable, and coders need to take great care in the coding process, noting precedents that they subsequently follow diligently.

Ultimately, the unit of analysis on the front page was defined as headline and image. For the "inside" analysis we defined the unit of analysis as the full article after boilerplate removal. Weare and Lin's points regarding the syntax and structure of news messages are key entry points to the computer-assisted analysis of online news in this regard. Here, the standardization and syntax of online news as offered by rigid CMSs and standards like HTML provide the predictability necessary to conduct this kind of analysis efficiently. Large websites such as news portals produce a predictable syntax for most of the content that is published through the system. Journalists usually do not write HTML, but rather enter text into forms in the CMS, which stores the text in a database. When the CMS outputs the content from the database to the website, it loops through the same tables each time. These are printed in HTML in a design template that provides a custom look for the news brand. The template's structure or syntax can be recognized by software, allowing us to recreate the database – or the underlying structure of the news publication – for research purposes.

Problem 3: Constructing the Coding Design

How a coding scheme is designed depends at least in part on a preconception of the characteristics of the messages you are studying (Weare & Lin, 2000, p. 284). We therefore need some basic assumption of what constitutes news and news-related content. News production is one of the more legitimating production activities a media organization can engage in, and is particularly important for the legitimacy of a public service broadcaster. NRK's bylaws state that the broadcaster "shall support and strengthen democracy" by facilitating public debate, providing sufficient information, covering elections thoroughly, uncovering critical issues, and protecting the public from the abuse of power, and that it should do so with editorial independence, balance, integrity, and credibility. The bylaws also state that NRK should provide an updated service of regional, national, and international news on the Internet and on mobile devices. Provisions specify that these services should provide factual and background information related to news, debates, and current affairs. Moreover, these services should encourage interactive participation and "stimulate knowledge, understanding and use of other media platforms by users of all ages" (NRK, 2012). This is the vantage point for the research design, and the background against which we conducted our analysis.

As a similar study of online news on this scale had, to our knowledge, not been previously undertaken, few model research designs existed that could be used for the purpose of this analysis. A design outline was therefore formed on the basis of a previous quantitative content analysis of Norwegian television news (including NRK news) (Waldahl, Andersen, & Rønning, 2002, 2009). A general orientation within the medium in question, along with a short pilot study, enabled us to ascertain the medium-specific variables that were viable for automatic coding by computer, and the variables that needed human coding. Coding schemes for online specific news markers, such as interactive elements and update frequencies, had to be designed from scratch. This process was facilitated by frequent contact with NRK staff who willingly answered technical questions.

Relevant here were, for instance, questions regarding the subdomain practices within the organization – for example, to what extent were nrk.no/economy and nrk.no/foreign used consistently by publication gatekeepers? NRK has two main publication channels online – a central editorial function for nrk.no that was only semi-operational at the time of our analysis, and 13 district offices each with its own editorial gate. Local offices (for instance, nrk.no/hordaland) have a "front page" of their own, where stories produced by local journalists are published. From there, stories can be moved to the front page if they are deemed fit by the central editorial staff in the capital, Oslo. Other questions that were resolved through contact with the organization ranged from technical issues, such as how the organization had stored the data, to more journalistic issues, such as sourcing routines in cases of wire material. In the latter case, practices turned out to be so inconsistent that data measuring the number of wire stories in the sample had to be discarded. Contact with the organization can therefore be a great facilitator as well as a time-saver.

Based on a survey of the environment and the blueprint provided by the Waldahl et al. (2002, 2009) design, we ended up with a coding scheme consisting of approximately 50 mainly categorical variables, each with up to eight subcategories. Manually coded data were logged in the statistical analysis tool SPSS. The variables for the three separate analyses are set out in Table 21.1.

Table 21.1 A selection of coding variables for front page and content analysis, and computer-assisted coding


Some of the variables provide simple manifest data that log characteristics of the texts – such as publication date, unique identification, and the number of links in the text. Other variables allow for coding of latent content, and contain up to eight subcategories further defining the themes.6 For instance, included in the social issues category are topics related to the health sector and the education sector, stories about working life, and minority, consumer, and environmental issues. The thematic spread of this category caused some problems in the testing of intercoder reliability, as we explain further under Problem 4 below. Doubts would arise in the frequent cases where welfare state issues (for instance, budget deficiencies in the hospital sector or labor strikes) involved politicians and/or government ministers. The codebook specifies that such stories be logged according to sector (health sector, work sector, education sector, etc.) rather than the level of political controversy. The complexities associated with these journalistic areas nevertheless posed challenges for the coding of latent content.

Making Coding Scheme Adjustments

The research design had to be adjusted somewhat as the coding process progressed. Some variables were too time-consuming to log and were removed. A relevant example here was the variable registering human sources. Other aspects of NRK's news dissemination were so specific that we had to establish new categories to capture these profile-building features of NRK's online news coverage. Items that were added include sidebar functions and specific contextual linking practices tying articles to previous stories of a similar topic. We also added a few subcategories under the content variable – most notably the subcategory “media” under the “culture and entertainment” category, the subcategory “traffic” under the “social issues” category, and the subcategory “misdemeanors” under “crime.” These were added to the coding scheme ex post to facilitate differentiation of inflated “other” categories under culture, social issues, and crime. Some of these “other” categories approached 30% of the coverage, which presented a problem for the validity of the coding design. We therefore proceeded to comb through each of these categories in search of topical clusters that could be extracted from the material. Such ex post coding is, of course, problematic. However, it is also necessary in order to establish a viable content profile. Provided all precedents are noted and followed consistently, the result should present few problems in terms of the overall validity of the study.

What we can conclude from such methodological findings is, first, that the online news medium allows for a greater diversity of content than broadcast news. Second, this necessary adjustment of the coding scheme reflects the extent to which NRK's online services communicate the organization's local news profile. NRK is a national institution that is made up of 13 district offices each of which produces local television and radio news bulletins daily, and therefore also publishes local news online on a daily basis. As the necessary coding adjustments reveal, local news carries different characteristics from what the national agenda-based coding scheme allowed for. Whereas traffic violations and road work are reported frequently enough in local news to warrant additional “misdemeanors” and “traffic” subcategories in the coding scheme, such instances appear so seldom on a national news agenda that here they fall appropriately within the category “other.”

When we decided to base the coding scheme on a previous codebook designed for the analysis of television news, the assumption was that latent content categories and other news-related variables appropriate for NRK's traditional broadcast news would also suit the purposes of the study of its online news output. As demonstrated, this proved to be true only to a certain degree. The basic latent content categories used – such as economy, politics, and crime – are comprehensive enough to apply in most studies of news content. However, in this study, assumptions regarding content profile and the latent and manifest variables created problems in the data collection process, the coding process, the analysis, and the reliability testing. The fact that such design transfer did not create more problems than it actually did – complications that could indeed be solved – was an interesting analytical finding in itself. It speaks to the endurance of news as a particular narrative form. Hence, news media not only carry different characteristics across publication platforms, but also share a great many features, particularly in terms of genre traits, communicative forms, journalistic standards, and overarching institutional practices.

Establishing the Computational Coding Design

In the computer-assisted coding process, the selection of manifest variables to map the use of interactive features and linking practices on nrk.no was done more or less in the same way as a web scraping operation. Here, we parsed HTML to obtain the data we were after by writing selectors that recognize the data sought in a markup-based document. The data sought amount to categories in the codebook, or the properties of the articles that can help us answer the research questions. As both research questions and markup will tend to vary from project to project, some custom tailoring is likely to be needed, including in cases where the object of research uses multiple CMSs or radically different design templates. In this project we wrote the software selectors in free, open-source software: the Python7 programming language with the screen-scraping library Beautiful Soup.8 The program looped through the entire folder of downloaded web documents and fetched the items identified by the categories in the codebook. An example of a selector – sketched here in Python with Beautiful Soup, with illustrative class and variable names – looks like this:

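    from bs4 import BeautifulSoup

    def register_polls(html, counters):
        """Selector sketch (reconstruction): detect the poll widget in one article.

        The class name "article" and the counter names are assumptions.
        """
        soup = BeautifulSoup(html, "html.parser")
        # The div-tag with an article class contains a single article in its entirety
        article = soup.find("div", class_="article")
        if article is None:
            return
        # Loop through the HTML tags inside this div-tag
        for tag in article.find_all(True):
            # The design template ensures this header appears every time a poll is used
            if tag.name == "h3" and tag.get_text(strip=True) == "Gi din karakter:":
                counters["polls"] += 1
                counters["interactive_features"] += 1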

In this example we are looking, inside a BeautifulSoup object named "soup," for a div-tag with an article class that contains a single article in its entirety. We loop through the HTML tags inside this div-tag. If one of these tags is an h3 tag (a header) displaying the text "Gi din karakter:" (meaning "Rate now" – the design template ensures this header appears every time a poll is used), we add the value +1 to the variables registering "polls" and "interactive features." In this instance, we considered a poll with a voting option an interactive element. At the end of the loop the values from all the selectors – one for each item in the codebook – are saved, and the next article is subjected to the same treatment. The key to this process is to find ways to identify "smaller recording units" (Weare & Lin, 2000, p. 283) in the web documents and to write this identification process as selectors, as in the example above. The result of the process is written to a CSV file – a format that ensures the results can be imported into a variety of computer programs, such as the statistical package SPSS, Excel, and databases (e.g., MySQL or Access). This gives each member of the research team the flexibility to work with the data according to personal preference and methodological orientation.
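The write-out step can be as simple as the following sketch (the column names are illustrative):

    import csv

    def write_results(rows, path="results.csv"):
        """Save one row of codebook values per article to a CSV file."""
        fieldnames = ["url", "publication_date", "links", "polls", "interactive_features"]
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)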

Problem 4: Measuring Reliability

Quantitative content analyses should always report reliability as measured by intercoder agreement. Reliability reporting increases the validity of the analysis and contributes to the further improvement of the method within media and communication studies. A premise for reliability measures is testing the extent to which two or more coders can be said to agree on the meaning, and hence the coding, of latent content. For the analysis to be valid, we need to have the same understanding of what constitutes political news and what constitutes an updated story versus a new story, as well as to agree on the differences between an article and a short bulletin. We operated with a wide definition of news and have therefore included items beyond the traditional hard news genre. Exceptions to this inclusive practice were topics such as gardening, cars, horoscopes, the TV guide, and so on. This division can be seen as problematic to the extent that many newspapers today carry separate sections with just this type of content (see Harcup, 2004). However, as our object of study was connected with the broadcasting medium, these content types were associated more with specific NRK programs (and their Internet sites) than with the news dissemination of NRK online.

Due to limited resources, we used only one primary coder in the manual coding process. For the content analysis (n = 2,162), intercoder reliability was tested on 100 units. We also tested 100 units in the front page analysis (n = 1,192). The reliability testing displayed some of the problems that might arise from using only one primary coder. Acceptable kappa9 was achieved for none of the 14 variables tested in the first reliability test, with 9 variables measuring below .50 (kappa ranged from .20 to .67, and raw agreement from 60% to 94%). As raw agreement here was acceptable in most cases (with the primary content variable and social issues the only variables scoring below 70%), we assumed low variance to be a contributing factor in addition to inadequate coder training. Cohen's kappa has been criticized for being somewhat conservative, and the low variance typical of studies such as ours tends to depress its values. We therefore set reliability thresholds at above .70 for kappa and above 70% for raw agreement. To attain reliability above .70, a second reliability test was, due to restricted resources, performed by one of the authors, and resulted in acceptable kappa (between .43 and .86) and raw agreement (between 67% and 97%) for all variables except social issues, which still measured low at .61/67%. A third test measuring only the primary content variable was performed by two of the senior researchers to secure intersubjectivity within the project, and reliability was attained at .75/79%. The struggle to achieve reliability for the content category "social issues" not only revealed the fragility of using only one primary coder; it also illuminated the problems associated with using such a thematically wide category for content coding.
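To illustrate the measure with invented figures (not taken from our tests): kappa is defined as k = (Po − Pe) / (1 − Pe), where Po is the observed agreement and Pe is the agreement expected by chance. If two coders agree on 79 of 100 items (Po = .79) and the category distribution implies a chance agreement of Pe = .30, then k = (.79 − .30) / (1 − .30) = .70 – only just at our threshold, despite seemingly comfortable raw agreement. When one category dominates the material, Pe rises and kappa falls, which is the low-variance effect noted above.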

The reliability of the computer-assisted coding was not tested in this study, primarily due to restricted resources. It should, however, ideally be done. We would suggest testing precision and recall values (Baeza-Yates & Ribeiro-Neto, 1999). The main advantage of computer-assisted analysis is scale, as a larger number of items can be processed more quickly and with greater precision by computers than by human effort. In traditional content analyses the size of the sample correlates with the certainty of the numbers. This favors automation as we can perform analyses on more units in less time. It also makes it harder to find and pinpoint errors post run time,10 as the data set very quickly becomes very large. A test of the accuracy of the selections and quantification of attributes could be given as a percentage. This can be done by testing the algorithm on a data set of known manually categorized material, and comparing the manual with the automated result. This is similar to a recall value in information retrieval (Baeza-Yates & Ribeiro-Neto, 1999). Such a statistical classification is a meaningful way of measuring the accuracy of the algorithm and the quality of the compiled data. Particularly if selectors aim for latent content, where the results are expected to be poor, a measure for the algorithm is needed. In such cases, a variation of a precision value would be appropriate.
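A sketch of such a test for one manifest attribute (say, whether an article contains a poll), with names of our own invention:

    def precision_recall(automated, manual):
        """Compare automated detections with manually coded gold-standard labels.

        Both arguments map article IDs to True/False for one manifest attribute.
        """
        true_pos = sum(1 for k in manual if manual[k] and automated.get(k, False))
        detected = sum(1 for k in automated if automated[k])
        relevant = sum(1 for k in manual if manual[k])
        precision = true_pos / detected if detected else 0.0  # share of detections that are correct
        recall = true_pos / relevant if relevant else 0.0     # share of actual instances found
        return precision, recall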

In formulating the selectors, tests and adjustments are necessary to achieve sufficient flexibility and precision. All selectors in use should be tested, which requires a manual examination of the results produced by the algorithm. Errors, if allowed to pass, will be quantified along with everything else, so iterative testing – from single units to smaller samples and random samples of the entire material – will ensure consistency throughout the project. Multiple testers should participate by looking for errors in a processed sample. The aim is to ensure that the algorithm is precise to a degree that exceeds the precision of human coders.11 For manifest content the aim should be 100% agreement – a feasible goal if the CMS prints content in a consistent way. Testing results will vary from news site to news site, and between different software approaches. A test that formally validates the precision of the algorithm and accompanies the results will strengthen the credibility of the research findings.12

Results and Findings

In this chapter we have outlined the major problems we encountered when performing a content analysis of a year's output of online news produced by the public service broadcaster NRK. The problems were primarily connected with obtaining the data, defining the unit of analysis, designing a viable coding scheme, and achieving intercoder reliability for the coding design. The problems we encountered were specifically related to our attempt to transfer a content analysis designed for television to the online medium. As Karlsson and Strömbäck emphasize, the dilemma in researching journalism online is how to "freeze" the flow of news so that it becomes viable for coding and analysis. Here, we have employed a computer-assisted approach in combination with a traditional content analysis methodology to attain such a freeze – one instance being 2009 in retrospect through an index file, and the other being the front page hour by hour in pixelated format. This computer-assisted intervention into the material has been the method by which we have endeavored to "freeze" the news flow and thereby handle the ephemeral nature of online news.

Because of this approach, the problems created by the ephemeral nature of online news also produced results in the analysis. As we adjusted the research design from television to online, we also adjusted our methodology to the grammar and logic of online news. Large numbers of stories that defied categorization not only illuminated problems with the national agenda-based coding design; they also helped uncover how the public service broadcaster was adjusting to an online news agenda. When measuring the update frequency among the top 10 stories on the front page of nrk.no, we found that 36% of all stories were new in the sense that they had been published within the last hour. Hence, if you had checked nrk.no every hour during 2009, you would on average have found, each time you entered the site, that 36% of the top 10 stories were new. The results also showed the pace of publication: 79% of top stories were replaced with a new top story every hour. Thematically, above-the-fold stories largely followed the typical online news agenda of event-centered and easily digestible news. Sports was the largest content category among the top 10 stories on the front page with 23% (followed by crime with 19%), but failed to reach the top five as often as crime stories or politics. If we focus on only the very top of the page, Figure 21.1 shows the distribution of content among the top five stories on the front page.

We also found a clear difference in editorial or journalistic priorities between news on the front page and news on the website's "inside." For one thing, while the front page contained 23% foreign news, only 2% of stories published on the inside were international in scope. This can perhaps be explained by the local office production structure of the website, where approximately 80% of all content published during 2009 was produced by one of NRK's district offices. This production reality was also reflected in the thematic distribution of content, with a predictably heavy emphasis on social issues and local politics on the "inside," and an overemphasis on murders and suspicious deaths on the front page. Figure 21.2 shows the difference in thematic priorities between the front page and the "inside."


Figure 21.1 Front page priorities, top five stories (n = 119)


Figure 21.2 Share of stories on the front page (n = 1192) and the content (n = 2162)

As for the results of the computer-assisted coding, the analysis revealed some predictable patterns in the website's use of features that allow for user involvement, and in its external linking practices. Research from the last decade has shown that traditional news media's utilization of unique web-specific features such as interactive options for users, multimedia, and various forms of participation remains scarce (Choi, 2004; Karlsson, 2010; Nguyen, 2008; Rosenberry, 2005). In particular, studies consistently find that the use of external hyperlinks remains low (see Barnhurst, 2010; Carpenter, 2010; Dimitrova, Kaid, Williams, & Trammell, 2005; Engebretsen, 2006; Oblak, 2005; Ureta, 2011). In the case of nrk.no, the analysis of news output for 2009 showed that 52% of stories contained no hyperlinks. Many of the published news items can be described as short notices where linking would not be considered necessary. However, the fact that only 12% of stories contained hyperlinks to external websites reflects the scarcity of external linking in the organization's online journalism. As for interactive features, nrk.no displayed a relatively low use of features that allow for user engagement in its 2009 news coverage. "Interactivity" is, of course, a highly contested and notoriously amorphous term. In practice we chose to include in the measure the (widget) elements that appeared in the sample. Hence we dealt with interactivity as a fixed set of story-centric elements that require the user's action (e.g., a mouse click) to trigger extra content related to a specific story. This includes the use of video, audio, navigable photo galleries, polls, games, and so on. Figure 21.3 shows the frequency with which different interactive elements were utilized in NRK's online news coverage during 2009.


Figure 21.3 Interactive features on nrk.no during 2009 (n = 74,430)

Conclusions

An analysis of digital media is unavoidably also an analysis of the information system through which the media content is distributed and consumed. The efficiency of web publishing, where immediacy emerges as a key news value for online content, relies on the predictability and rigidity of a CMS. The same predictability and rigidity allow researchers to apply computation to the analysis of the same material. Understanding the information system underlying news production and dissemination contributes to our understanding of how media are accessed, of the limitations posed on news by technology, and ultimately of the media content itself. Emerging online news production practices can therefore be seen as inherently tactical adjustments conducted by media institutions in adapting to a competitive market. While endless space for elaboration, analysis, discussion, and background information is one of the possibilities offered by the online news endeavor, speed is regularly chosen over depth. While findings from our project may suggest that being first takes precedence over having the broadest coverage, the local–national duality in the content profile also suggests the presence of conflicting ideals in the editorial decision-making routines. For a license-fee-funded broadcaster, the online remit presents a dual challenge: to fulfill public service obligations while competing for audiences in the national arena.

What this project – and the convergence between conventional content analysis and computational methods – demonstrates is that, whereas the ephemeral nature of online news may create a new grammar of news dissemination that requires new methods and theoretical frameworks, the form of news as an institutional product is also an enduring one that warrants attention in such analyses.

ACKNOWLEDGMENTS

The research presented in this article was funded by the Norwegian Media Authority. The authors wish to thank Dag Elgesem, Hallvard Moe, Maren Agdestein, Joachim Laberg, Linn Lorgen, and Gyri S. Losnegaard.

NOTES

1 The dates were: Monday, January 19; Wednesday, February 11; Thursday, March 12; Friday, April 24 (selection moved one week because of Easter); Sunday, June 21; Tuesday, August 18; Wednesday, September 16; Thursday, October 15; Friday, November 13; Saturday, December 5 – one Monday; one Tuesday; two Wednesdays, Thursdays, and Fridays; one Saturday; and one Sunday.

2 This chapter is based on findings from a project conducted by a team of researchers at the Department of Information Science and Media Studies at the University of Bergen. Previous publications from the project present findings in greater detail. See Elgesem, Moe, Sjøvaag, Stavelin, Agdestein, Laberg, Lorgen, & Losnegaard, 2010; Sjøvaag, Moe, & Stavelin, 2012; and Sjøvaag & Stavelin, 2012.

3 As is the case with many online sites, www.nrk.no has multiple portals (examples include www.nrk.no/sport; www.nrk.no/economy; www.nrk.no/election09) which reorganize content that can also feature in short form on the front page. The same type of organization is found in the site's characteristic geographical subdivisions (such as www.nrk.no/hordaland; www.nrk.no/nordland). This geographical division of published online content is based on the broadcaster's organization into 13 local offices spread around the country. Hence, nrk.no is primarily organized into two subdomain structures – one based on thematic categories, and one based on production location, both of which restructure content easily identifiable as news.

4 These were downloaded to our server by GNU Wget, a process we intentionally slowed down to spare the NRK servers from unnecessary strain. The collected data took approximately 4.2 GB of space, and were collected over a period of five days. Unwanted bits of the downloaded files were removed in a “boilerplate removal process” (Fairon & Naets, 2007) where everything but the news article text was removed from the sample – for instance, headers, footers, and menus. The boilerplate removal was undertaken to save space and time and to remove unnecessary complexity in the further treatment of the collected material.
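A minimal sketch of such a boilerplate removal step (the tag and class names are assumptions; actual selectors must be tuned to the site's templates):

    from bs4 import BeautifulSoup

    def strip_boilerplate(html):
        """Keep only the news article text, removing headers, footers, and menus."""
        soup = BeautifulSoup(html, "html.parser")
        for selector in ("header", "footer", "nav"):  # illustrative template furniture
            for tag in soup.find_all(selector):
                tag.decompose()
        article = soup.find("div", class_="article")  # assumed article container
        return str(article) if article else ""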

5 See http://www.guardian.co.uk/open-platform (accessed August 12, 2013).

6 For example, stories logged as culture and entertainment include reports from the arts sector and popular culture, news about the royal family, portraits of celebrities, and curiosities. Topical issues in the content analysis coded for long-term agenda setting and included the financial crisis, the national election, climate change, Barack Obama, the Iranian election, and the Muhammad cartoons crisis.

7 See www.python.org for downloadable installer and information.

8 http://www.crummy.com/software/BeautifulSoup/ (accessed August 12, 2013).

9 We used Cohen's kappa (k) to measure the agreement between two coders who independently assign values to items applying codebook variables that are mutually exclusive. Kappa corrects the observed intercoder agreement for the agreement expected by chance (Krippendorff, 2004, pp. 245–247).

10 This is the time during which a program is executing.

11 In all aspects that do not require human judgment. Make sure all selectors aim for manifest content; latent content selectors will need more solid documentation of precision.

12 To better reproduce the results, the selector code could be added to the report, or even published as an open-source project. This will further strengthen the credibility of the findings.

REFERENCES

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York, NY: ACM.

Barnhurst, K. (2010). The form of reports on US newspaper Internet sites: An update. Journalism Studies, 11(4), 555–566.

Brügger, N. (2008). The archived website and website philology: A new type of historical document? Nordicom Review, 29(2), 155–175.

Carpenter, S. (2010). A study of content diversity in online citizen journalism and online newspaper articles. New Media & Society, 12(7), 1064–1084.

Choi, Y. (2004). Study examines daily public journalism at six newspapers. Newspaper Research Journal, 25(2), 12–27.

Chung, D. S. (2007). Profits and perils: Online news producers' perceptions of interactivity and uses of interactive features. Convergence, 13(1), 43–61.

Deuze, M. (2003). The Web and its journalists: Considering the consequences of different types of newsmedia online. New Media & Society, 5(3), 203–230.

Dimitrova, D. V., Kaid, L. L., Williams, A. P., & Trammell, D. D. (2005). War on the Web: The immediate news framing of Gulf War II. Harvard International Journal of Press/Politics, 10(1), 22–44.

Elgesem, D., Moe, H., Sjøvaag, H., Stavelin, E., Agdestein, M., Laberg, J., Lorgen, L., & Losnegaard, G. S. (2010). NRKs nyhetstilbud på Internett i 2009 [NRK's news online 2009]. Bergen, Norway: Department of Information Science and Media Studies.

Engebretsen, M. (2006). Shallow and static or deep and dynamic? Studying the state of online journalism in Scandinavia. Nordicom Review, 27(1), 3–16.

Fairon, C., & Naets, H. (2007). Building and exploring web corpora. Leuven, Belgium: Presses Universitaires de Louvain.

Harcup, T. (2004). Journalism: Principles and practices. London, UK: Sage.

Karlsson, M. (2010). Rituals of transparency: Evaluating online news outlets' use of transparency rituals in the United States, United Kingdom and Sweden. Journalism Studies, 11(4), 535–545.

Karlsson, M., & Strömbäck, J. (2010). Freezing the flow of online news: Exploring approaches to the study of the liquidity of online news. Journalism Studies, 11(1), 2–19.

Krippendorff, K. (2004). Content analysis: An introduction to its methodology. London, UK: Sage.

Lund, A. B., & Berg, C. E. (2009). Denmark, Sweden and Norway: Television diversity by duopolistic competition and co-regulation. International Communication Gazette, 71(1–2), 19–37.

Medie Norge (2011). Daglig dekning for nasjonale mediehus [Daily audience reach for national media houses]. Retrieved October 4, 2012, from http://www.tns-gallup.no/?did=9079742

Neuendorf, K. A. (2002). The content analysis guidebook. London, UK: Sage.

Nguyen, A. (2008). Facing “the fabulous monster”: The traditional media's fear-driven innovation culture in the development of online news. Journalism Studies, 9(1), 91–104.

Nguyen, A. (2010). Harnessing the potential of online news: Suggestions from a study on the relationship between online news advantages and its post-adoption consequences. Journalism, 11(2), 223–241.

NRK Beta (2009). Hva skjer med NRKs API? [What's the status of NRK's API?] Retrieved July 31, 2013, from http://nrkbeta.no/2009/03/19/hva-skjer-saa-med-nrks-api/

NRK (2012). Bylaws for NRK AS. Retrieved July 31, 2013, from http://www.nrk.no/informasjon/about_the_nrk/1.4029867

Oblak, I. (2005). The lack of interactivity and hypertextuality in online media. Gazette, 67(1), 87–106.

Quandt, T. (2008). (No) news on the World Wide Web? A comparative content analysis of online news in Europe and the United States. Journalism Studies, 9(5), 717–738.

Rosenberry, J. (2005). Few papers use online techniques to improve public communication. Newspaper Research Journal, 26(4), 61–73.

Sjøvaag, H., & Stavelin, E. (2012). Web media and the quantitative content analysis: Methodological challenges in measuring online news content. Convergence, 18(2), 215–229.

TNS-Gallup (2009). Topplisten [The top list]. Retrieved July 31, 2013, from http://www.tns-gallup.no/?aid=9076580

TNS-Gallup (2011). Årsrapport Internett 2011 [Annual report Internet 2011]. Retrieved July 31, 2013, from http://www.tns-gallup.no/arch/_img/9100899.pdf

Ureta, A. L. (2011). The potential of Web-only feature stories: A case study of Spanish media sites. Journalism Studies, 12(2), 188–204.

Waldahl, R., Andersen, M. B., & Rønning, H. (2002). Nyheter først og fremst: Norske tv-nyheter: myter og realiteter [News first and foremost: Norwegian TV news: Myths and realities]. Oslo, Norway: Universitetsforlaget.

Waldahl, R., Andersen, M. B., & Rønning, H. (2009). Tv-nyhetenes verden [The world of TV news]. Oslo, Norway: Universitetsforlaget.

Weare, C., & Lin, W. (2000). Content analysis of the World Wide Web: Opportunities and challenges. Social Science Computer Review, 18(3), 272–292.
