Appendix on Data, Methodology, and Models

APPENDIX TO CHAPTER 4: A FORMAL MODEL OF WEB TRAFFIC AND ONLINE AD REVENUE

Chapter 4 offers a simple formal model of online revenue for websites. In this Appendix, we will walk through the mathematics behind the model. Those familiar with increasing returns models from many other areas of economics will find the approach here broadly similar.

Suppose there are M sites, each producing one type of content, but each site produces its own variety $p_j$, $0 \le p_j \le 1$. There are N consumers, each with a daily consumption budget C, and a preference for the variety $p^i$. For purposes of notation, i will index consumers as a superscript; j will index sites as a subscript, and we will suppress the indices if the context is clear.

Each site produces content at rate $\lambda_j$ with quality $\omega_j$. Let $c^i_j$ be the consumption of site j by individual i. Consumption is determined by the quality of content, the refresh rate ($\lambda_j$), and an individual's preference for variety. Doing the simplest thing:

$$c^i_j = k^i_j\,\lambda_j\,\omega_j\,\bigl(1 - |p^i - p_j|\bigr),$$

where $k^i_j$ is a constant of proportionality.

Each individual's consumption is subject to the budget constraint C, which is the amount of time the viewer has to enjoy the web. To take the easiest case first, we initially assume C is constant for all consumers. If searching on the web were free, we could say $\sum_{j=1}^{M} c^i_j = C$ for each i = 1, …, N. However, we assume there is a fixed cost to the viewer's attention each time he navigates to a new site. Let $s^i$ be the number of sites in i's viewing portfolio, and let $t_0$ be the search cost. Then

$$\sum_{j=1}^{M} c^i_j + s^i t_0 = C.$$

In general, many of the $c^i_j$ may be zero.

With complete information, then, the consumer knows each site's content quality, quantity, and variety. By analogy, one can imagine a consumer deciding which markets to shop at. He knows the location of the markets and their wares, and he may choose to travel to many markets to purchase exactly the variety and quality of goods he wants. In this case, he incurs high transportation costs. Or the consumer may lower travel costs by going to a single supermarket while perhaps conceding some preference for quality, quantity, or variety.

We will write $\pi_j$ for the profit of site j. Profit is just revenue minus cost. Set the revenue of site j to be a function of the total consumption, $R\bigl(\sum_{i=1}^{N} c^i_j\bigr)$. Initially the precise mechanism for turning consumers' attention into dollars is taken to be exogenous to the model. For now, we assume conservatively that R is increasing. There are two components of the cost to site j: a fixed cost α and a cost of production. Traditionally, production costs are the number of workers in the labor force times the wage rate. Supposing content quantity is proportional to the labor force and content quality is proportional to the wage rate, we write the profit function as

$$\pi_j = R\!\left(\sum_{i=1}^{N} c^i_j\right) - \alpha - \beta\lambda_j\,\delta\omega_j,$$

where β and δ are constants of proportionality.

Utility for consumers is derived from consumption of site content. Every possible triple of site quantity, quality, and variety determines the amount of content an individual is willing to consume. That is, consumption measures utility indirectly. Each individual has a preference ordering

$$c^i_{j_1} \ge c^i_{j_2} \ge \cdots \ge c^i_{j_k} \ge \cdots \ge c^i_{j_M}$$

for k = 1, …, M. It is rational for an individual to consume sites in preferential order $j_1, j_2, \ldots, j_M$, paying the search cost at each transition until his consumption budget is exhausted.
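To make the decision rule concrete, the following is a minimal Python sketch of this greedy consumption rule, assuming the consumption function given above; the function name and the example inputs are illustrative, not part of the formal model.

```python
def allocate_attention(desired, t0, C):
    """Greedy sketch of the consumer's rule: visit sites in order of
    desired consumption c_j, pay the search cost t0 at each transition,
    and consume until the budget C is exhausted."""
    realized = {}
    budget = C
    for site, c in sorted(desired.items(), key=lambda kv: -kv[1]):
        if budget <= t0:           # another visit would not pay off
            break
        budget -= t0               # fixed attention cost of navigating
        spend = min(c, budget)     # consume up to the desired amount
        realized[site] = spend
        budget -= spend
    return realized

# Example: desired consumption k * lambda_j * omega_j * (1 - |p_i - p_j|)
# precomputed for three sites, with budget C = 1 and search cost 0.05.
print(allocate_attention({"s1": 0.6, "s2": 0.5, "s3": 0.2}, t0=0.05, C=1.0))
```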

Example: 2 Sites and 1 Consumer

With two sites, j = 1, 2, we can assume the first site is closer to our lonely viewer's preferences, $|p - p_1| \le |p - p_2|$. If the quality/quantity factor of site 2 is sufficiently high, the consumer will consume more of site 2. Precisely, if

$$\frac{\lambda_2\omega_2}{\lambda_1\omega_1} > \frac{1 - |p - p_1|}{1 - |p - p_2|},$$

we can cross-multiply to get

$$\lambda_2\omega_2\bigl(1 - |p - p_2|\bigr) > \lambda_1\omega_1\bigl(1 - |p - p_1|\bigr),$$

so that $c_2 > c_1$.
This means that sites can make a sufficient investment in quantity and quality to draw consumers past sites closer to the consumers’ preferences.
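A quick numeric illustration, with made-up values and the consumption function assumed above, shows how investment can overcome a variety disadvantage:

```python
k = 1.0
p, p1, p2 = 0.40, 0.45, 0.70   # consumer's taste and the two sites' varieties
qq1, qq2 = 1.0, 2.5            # quantity/quality factors (lambda * omega)

c1 = k * qq1 * (1 - abs(p - p1))   # 1.0 * 0.95 = 0.95
c2 = k * qq2 * (1 - abs(p - p2))   # 2.5 * 0.70 = 1.75
assert c2 > c1   # site 2's investment outweighs its distance from p
```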

Example: 2 Sites and Many Individuals

Suppose the consumers are distributed uniformly in the variety preference space, and the proportionality constant is fixed, $k^i_j = k$ for all i and j. If revenue grows faster than production cost as a function of the quantity/quality factor, profit will grow with investment in production. Formally, calling the production cost W and neglecting the fixed cost α, if R > W, then π = R − W > 0. Thus, even mild economies of scale in production or revenue will produce profit growth from production investment.

The maximum available revenue for a site occurs when all the viewers spend their entire consumption budgets on that one site, so the maximum possible revenue is R(NC). If this level of production is profitable, then there is a monopolistic equilibrium. That is, all individuals will spend all their budgets on one site. That one site is profitable, and no other site is consumed. This is optimal for consumers, because they all deem the quantity/quality to be sufficiently high that they don’t have to waste precious consumption time changing sites.
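Stated in the model's notation (a sketch under the consumption function assumed above), the monopolistic equilibrium combines a participation requirement with a profitability requirement:

$$k\,\lambda\omega\,\bigl(1 - |p^i - p|\bigr) \ge C \quad\text{for all } i, \qquad\text{and}\qquad R(NC) - \alpha - \beta\lambda\,\delta\omega > 0.$$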

Strong Preferences for Variety and Other Extensions

The mathematical core of the model is highly extensible, as the rest of the chapter shows. One extension is to give users limited preference windows, as follows:

$$c^i_j = \begin{cases} k^i_j\,\lambda_j\,\omega_j\,\bigl(1 - |p^i - p_j|\bigr) & \text{if } |p^i - p_j| \le w, \\ 0 & \text{otherwise,} \end{cases}$$

where w is the width of the preference window. Now it is impossible to capture everyone's complete attention, so long as at least some consumers' preference windows exclude the midpoint of the variety space.

Alternatively, we might allow sites to produce multiple categories of content, offering an analog to aggregators or portal sites. Here additional assumptions are required: for example, dichotomizing the utility that consumers receive from a given category of content (“all or nothing”), assuming that users’ preferences across categories are independent, and positing that there is at least some diminishing marginal utility for additional content in a given category. The higher the number of categories, the more the central limit theorem takes over, providing an increasing advantage to portal sites or aggregators.
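The central-limit intuition can be illustrated with a short simulation, using illustrative parameters only and assuming the independent, all-or-nothing category preferences described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def match_rates(n_categories, n_consumers=10_000):
    """Fraction of a portal's categories that 'hit' each consumer's
    tastes, with independent 50/50 all-or-nothing preferences."""
    hits = rng.random((n_consumers, n_categories)) < 0.5
    return hits.mean(axis=1)

for m in (1, 4, 16, 64):
    rates = match_rates(m)
    # The spread of per-consumer match rates shrinks like 1/sqrt(m), so a
    # many-category portal appeals predictably to almost every consumer.
    print(f"{m:3d} categories: std of match rate = {rates.std():.3f}")
```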

The main chapter goes into additional examples, all of which reframe the core results (or offer different utility payoffs) instead of offering genuinely new math. All of the model extensions reinforce the main point, though: once the core assumptions about increasing returns are accepted, it becomes difficult to find any auxiliary assumptions that will prevent strong market concentration.

APPENDIX TO CHAPTER 5: THE DYNAMICS OF DIGITAL AUDIENCES

The discussion in chapter 5, on change in digital audiences over time, delves into a series of highly technical subjects. This section will briefly discuss power laws and the mathematics of dynamical systems that underlie our simulations.

The first order of business is to address a sometimes unhelpful debate on what counts as a power law in empirical data. In formal terms, a power law distribution is characterized by a density function proportional to $1/x^{\alpha}$, where the exponent α corresponds to the (negative) slope of the line when the distribution is plotted on a log-log scale. A profusion of papers on power laws from the late 1990s onward, and a subsequent popular-press discussion of power laws and “long tails,” have sparked a corrective backlash among some researchers (e.g., Clauset et al., 2009). Some of these papers have chastised researchers for using the term when other distributions may fit better (sometimes much better, though usually only slightly so).

For the data sources used here, there is little substantive difference whether the distribution of audience is a power law, an extreme log-normal, a power law with exponential cutoff, etc. Most real-world datasets show deviations from a pure power law in the “head” with the largest observations. This volume often uses the term “log-linear distribution” to denote this broad family of related distributions.

In general, though, there are good reasons to prefer the power law label, even when other distributions may fit the data slightly better. Of course other related distributions often fit better: they have two or more parameters, while pure power laws have only one. Parsimony is a cardinal virtue in model building, and each additional parameter provides latitude for mischief. As John von Neumann reportedly said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”1

In any case, our data show a good fit to a pure power law, the discussion in the previous paragraphs notwithstanding. A simple way to check the regularity of the traffic distribution is to estimate the exponent of the power law for each day in our data. While an OLS regression line for the log-log web traffic distribution produces a near-perfect R², this technique overweights the smallest observations, as there are an order of magnitude more of them. We use the maximum likelihood techniques described in Newman (2005), which give us an estimated daily exponent (α) that varies between 1.98 and 2.13 for the 1,096 days in our sample.
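For readers who want to replicate this step, here is a minimal sketch of the continuous maximum likelihood estimator given in Newman (2005); the function and variable names are ours:

```python
import numpy as np

def mle_alpha(x, x_min):
    """Power-law exponent via maximum likelihood (Newman 2005):
    alpha_hat = 1 + n / sum(ln(x_i / x_min)), for observations >= x_min."""
    x = np.asarray(x, dtype=float)
    x = x[x >= x_min]
    return 1.0 + len(x) / np.log(x / x_min).sum()
```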

Day-to-day changes in stochastic dynamical systems, such as stock markets, are often modeled with log-normal distributions. In our modeling we thus consider changes in the logarithms of the web-traffic market shares. We are particularly interested in the day-to-day relative growth rate

$$g(t) = \log x(t+1) - \log x(t) = \log\frac{x(t+1)}{x(t)}.$$

Note that “growth,” in this definition, can be negative as well as positive. Additionally, we cannot observe the growth rates for sites falling out of the top three hundred with our data. This skews the distribution of the observed growth rates for the smallest sites, but does not change any of the findings that follow.

The notation $x_j(t)$ will represent the market share of the site occupying rank j at time t. We can thus write the market share on the next day as

$$x_{(j)}(t+1) = x_j(t)\,e^{g_j}.$$

On the left-hand side, the j is in parentheses because the website that occupied rank j at time t may occupy a different rank on the next day. The term $g_j$ is a random variable describing the daily growth rate of the sites at rank j. At each time t, a new $g_j$ is sampled. If $g_j$ is negative, the market share decreases such that $x_{(j)}(t+1) < x_j(t)$. If $g_j$ is zero, there is no change in market share.
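A toy simulation of these dynamics (all parameters illustrative) makes the update rule concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

n_sites, n_days, sigma = 300, 1_000, 0.05
x = np.full(n_sites, 1.0 / n_sites)      # start from equal market shares

for _ in range(n_days):
    g = rng.normal(0.0, sigma, n_sites)  # fresh daily growth rates
    x = x * np.exp(g)                    # x(t+1) = x(t) * e^g
    x = x / x.sum()                      # shares must sum to one

print(sorted(x, reverse=True)[:5])       # the largest shares pull away
```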

Importantly, $g_j$ for each rank in our data is approximately log-normally distributed, though with heavier tails than a pure log-normal distribution would produce.

APPENDIX TO CHAPTER 7: LOCAL DIGITAL NEWS OUTLETS

A critical component of the analysis in chapter 7 involves identifying local news sites.

For our purposes, local websites are operationalized as sites that have experienced higher levels of usage within a given media market than they do in the rest of the nationwide sample. How much higher? The simplest rule is to look for any usage difference big enough that it is unlikely to have been produced by chance. The study uses a standard difference of means test, comparing a site's mean audience reach within a market to its reach in the rest of the national panel.

Sites where the observed local vs. national gap in usage is at least three times larger than the estimated standard error are examined as possible local sites. Formally, this is equivalent to a t-score > 3. The samples are large enough that z-scores and t-scores are equivalent. Qualitative assessments (detailed later) suggest that this decision rule for discerning local from national content works extremely well for the types of sites we hope to examine. A lower decision threshold, such as 2.5, produces few additional genuinely local sites, with the vast majority of the additional sites being false positives. As we would expect, much lower decision thresholds, such as t > 1.5, swamp the analysis with false positives. Given that the data are roughly normally distributed, sampling error alone should produce a t-statistic > 2.0 about 2.5 percent of the time. In a dataset of more than a million observations, such a low threshold produces an unmanageably high false positive rate.
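One way to compute this test, assuming reach is coded as a per-panelist binary visited/not-visited indicator (a sketch, not comScore's own code):

```python
import numpy as np

def reach_t_stat(local, national):
    """Two-sample t statistic (unequal variances) comparing a site's
    audience reach inside a market to its reach in the rest of the panel.
    Inputs are 0/1 arrays: 1 if the panelist visited the site."""
    m_l, m_n = local.mean(), national.mean()
    se = np.sqrt(local.var(ddof=1) / len(local)
                 + national.var(ddof=1) / len(national))
    return (m_l - m_n) / se
```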

If a difference-of-means test provides a powerful heuristic to distinguish local from national content, the other important task is to distinguish between sites that provide news and sites that do not. The initial research design called for this study to use comScore's proprietary “web dictionary,” which places tracked sites into one of many categories and subcategories, including a category for “News/Information.”

However, comScore’s coding scheme was discovered to have significant limitations. Substantively identical news sites are often placed in different categories and subcategories. Even within the same market, it is common to find television station sites or newspaper sites spread across several different categories and groupings.

This initially puzzling result likely stems from comScore's subscriber model. Media organizations that subscribe to comScore get to choose the category or subcategory in which their site is placed. This can result in a “Lake Wobegon effect,” in which subscribing media organizations each choose the category or subcategory that looks most favorable. Most subscribing news organizations thus get to say that they are at the top of their chosen category.

Since it is essential that we know the affiliation (if any) between online news sources and traditional media outlets, the comScore data need to be supplemented. Because of these limits with the comScore categorization, the author himself coded sites for news content, locality, and traditional media affiliation.

While the comScore data categories are imprecise and inconsistently applied, they do provide some guidance. A newspaper site might end up in “Regional/Local” or “News/Information: Newspapers” or “News/Information: General News,” but it is unlikely to end up in “Retail.” First, ten broadcast markets were chosen using random numbers generated by random.org. For February, March, and April, all sites with a t-score > 3 in these ten markets were examined. The comScore category was recorded for all sites that provided local news. News sites were found in the three comScore “News/Information” subcategories, in the “Regional/Local” category, and in the “Entertainment” category (particularly the “Entertainment: TV” and “Entertainment: Radio” subcategories). There was no discernible difference between local TV stations that ended up in the “Entertainment” category and those that ended up in “News/Information” or “Regional/Local.” Even with radio, a number of hard news stations were placed in the “Entertainment” category.

The study also requires setting a consistent, cross-market audience share standard for inclusion in the analysis. Without such a standard, far more local news sites will be found in bigger markets than in smaller ones, since sites that receive five or fewer visitors are not included in the comScore data. For example, a site that got five panelist visits in Burlington, Vermont, would be omitted from the analysis, while a site that got eight visits in New York City would be included, even though the market reach of the Burlington site is eighteen times higher.

Since the study aims to provide the broadest possible survey of online local news, this base threshold is set as low as the data allow. First, minimum standards for inclusion in the analysis are based on monthly audience reach rather than other traffic metrics. Less-trafficked sites usually score far better on audience reach than they do on page views or time spent on the site. Second, these audience reach metrics should be as small as consistent cross-market comparison permits. To repeat, sites must have at least six local market visitors in order to have shown up in the comScore data at all. The smallest markets (such as Burlington or Madison, WI) have between six hundred and seven hundred panelists.

Since six visitors equal 1 percent of six hundred panelists, 1 percent audience reach is the smallest threshold we can use and still include cities like Burlington and Madison in the analysis, at least without worrying about data censoring. Even slightly lower thresholds would affect many or most of our one hundred markets. Requiring sites to receive 0.5 percent audience reach requires a panel size of at least 1,200 respondents to avoid censoring, excluding thirty-three of our markets. Requiring 0.3 percent audience reach requires 1,800 panelists and would impact fifty-four of the one hundred broadcast markets. Moreover, any additional sites found by lowering the threshold would add up to only a tiny fraction of all news consumed in the local market. (Local news sites just above the 1 percent audience-reach cutoff average less than 1/100th of 1 percent of all local page views.)
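The underlying arithmetic is simple; a few lines confirm the panel sizes quoted above:

```python
# Sites with five or fewer panelist visits are dropped, so six visits
# is the observability floor. Smallest observable reach by panel size:
MIN_VISITS = 6
for panelists in (600, 1_200, 1_800):
    print(f"{panelists} panelists -> {MIN_VISITS / panelists:.2%} minimum reach")
# 600 -> 1.00%, 1200 -> 0.50%, 1800 -> 0.33%
```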

While the 1 percent threshold here is chosen because of the limits of the data, at least one prominent scholar has proposed a similar threshold on normative as well as pragmatic grounds. Eli Noam (2004, 2009) argues that outlets should have 1 percent market share to count as media voices in his proposed Noam Index of media diversity.2 As Noam (2004) explains, “To keep the index practical there should be a cut-off of a minimal size for a voice. One per cent seems a reasonable floor: small but not trivial.”3

Putting these requirements together means that local news site candidates are all sites in the sample with the following characteristics:

•  Sites in the “News/Information” category, the “Regional/Local” category, or the “Entertainment” category.

•  Sites where the difference in audience reach within a market vs. nationally produces a t-statistic > 3.

•  Sites that achieve 1 percent audience reach in at least one of the three months examined.

More than 1,800 sites in the data possess all three of the above characteristics. The coding guidelines specify an inclusive definition for news sites. Websites were counted as news and information outlets if they provided regularly updated, independent information about local news, community affairs, local officials, or issues of regional concern. This definition is not format-dependent, and in principle it would include sites such as local blogs. Static content by itself was not enough to count as a news source; sites needed to have had front-page content updated within the preceding two weeks. This coding was labor-intensive, but did provide for an extremely detailed, firsthand look at what local news sites consist of.

The data presented almost no marginal cases or difficult coding decisions. The overwhelming majority of sites identified by the aforementioned three-pronged test were traditional media outlets. Ultimately, 1,074 of the candidate sites were classified as local news sites.

Because of the mandate to examine internet-only local news sources in particular, special care was taken to accurately record a site’s affiliation with broadcast or print sources. Every television station was confirmed to have broadcast or cable distribution, and every print outlet was confirmed to have a paper version.

Ninety-five of the 1,074 news sites recorded higher usage levels (t > 3) in more than one media market. These cases overwhelmingly involved a large newspaper or (less often) a regional television station with a statewide or regional audience. Since the focus here is on local news, rather than state or regional news, these secondary regional markets are excluded from the definition of local content. The Seattle Times may have above-average readership in Spokane, WA, but it does not consistently cover Spokane's local politics.

There are two exceptions to this rule, however: AL.com and Michigan Live. Both are statewide sites that feature content from newspapers in several different broadcast markets. Participating newspapers forgo their own home pages to host content on these statewide platforms. These outlets are thus counted as local in every market with a participating news organization.

Variables for Regression Analysis

Previous work has examined both the demographics of national digital news consumption and structural factors that shape local news consumption in traditional media.4 Of particular interest, previous studies have found that larger markets produce more broadcast news.5 TV Market Population is each media market’s population, according to the Census Bureau’s American Community Survey.

Some research has shown that broadband users use the web far differently than dial-up users, and that they consume more and richer types of web-based content.6 To investigate the potential impact of high-speed access, Broadband Pct. is FCC data on the percentage of the market that subscribes to broadband at 768 kbps or faster.

Since local newspapers and TV stations provide most online local news, it is particularly important to examine the structure of offline local media. Regarding newspapers, the models test several variables. Newspaper Circ/Capita reports the total newspaper circulation per person in a given market. Daily Newspapers gives the number of daily newspapers that reach at least 5 percent of the market population. Newspaper Parent Companies is the number of parent daily newspaper firms in the television market. These variables provide some leverage over which (if any) of these factors matters most: the number of newspaper firms, the size of the newspaper audience, or just the number of outlets that reach a minimum threshold of readership.

Television broadcasters are approached similarly. Since we find few local news outlets associated with noncommercial broadcasters, we focus on the number and type of commercial TV stations. Commercial TV Stations records the number of full-power commercial television broadcasters. Additionally, we are interested in whether ownership patterns predict the number of online outlets found and the amount of local news consumed. Locally Owned TV Stations measures the number of commercial stations with owners in the market. Minority-Owned Stations captures the number of local television stations under minority ownership according to FCC records. Newspaper-TV Cross-Ownership reports the number of parent companies in the market that own both a daily newspaper and a commercial television station.

While newspaper-TV cross-ownership is of particular concern, the local radio market is also potentially important. News-Format Radio Stations records the number of broadcast radio stations with a news-based format. Radio-TV Cross-Ownership measures the number of parent entities that own both a broadcast TV station and a radio station.

We also investigate possible ties between the racial and ethnic makeup of a market and its online news production. Black Pct. and Hispanic Pct. capture the portion of residents who are African American and Hispanic, respectively. One might hypothesize that markets with large immigrant Hispanic populations may consume less English-language local news. Models thus include interaction effects between racial and ethnic makeup and market size: Hispanic Pct.×Market Pop. and Black Pct.×Market Pop. These terms test whether racial and ethnic diversity has different effects in larger vs. smaller markets, as some of the cases discussed earlier might suggest.

The analysis also controls for effects of income and age. Income uses per-capita earnings data provided by BIA. Age 65+ is the fraction of the population in the market that is sixty-five years or older.

Lastly, because there may be month-specific factors that impact the amount of news consumed, two dummy variables for the months of February and March are included. April is the omitted category. These coefficients should be interpreted as the difference in consumption between the listed month and April.
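Putting the variables together, here is a sketch of the kind of OLS specification described above; the column names are hypothetical stand-ins for the measures just listed, and the dependent variable is a placeholder rather than the chapter's exact outcome measure.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_market_model(markets: pd.DataFrame):
    """OLS sketch; assumes one row per market-month with the columns below."""
    formula = (
        "local_news_outcome ~ market_pop + broadband_pct"
        " + circ_per_capita + daily_papers + paper_parents"
        " + commercial_tv + locally_owned_tv + minority_owned_tv"
        " + paper_tv_cross + news_radio + radio_tv_cross"
        " + black_pct * market_pop + hispanic_pct * market_pop"
        " + income + age_65_plus + C(month)"  # 'April' sorts first and is omitted
    )
    return smf.ols(formula, data=markets).fit()
```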
