Chapter 12. Multilevel Analyses

In Chapter 10, Classification with k-Nearest Neighbors and Naïve Bayes, we discussed classification with k-Nearest Neighbors and Naïve Bayes. In the previous chapter, we examined classification trees, notably C4.5, C5.0, CART, random forests, and conditional inference trees.

In this chapter, we will discuss:

  • Nested data and the importance of dealing with them appropriately
  • Multilevel regression including random intercepts and random slopes
  • The comparison of multilevel models
  • Prediction using multilevel modeling

Nested data

If you have nested data, this chapter is essential for you! Nested data means that observations share a common context. Examples include:

  • Consumers nested within shops
  • Employees nested within managers
  • Teachers and/or students nested within schools
  • Nurses, patients, and/or physicians nested within hospitals
  • Inhabitants nested in neighborhoods

Many more cases of data nesting could be imagined. What they all have in common is a data structure similar to the one depicted in the following figure:

A depiction of nested data

We will only discuss two levels of data with unique membership in this chapter, but of course, more complex situations can arise. For instance, in all the preceding examples, shops, managers, schools, hospitals, and neighborhoods can themselves be nested within higher-level units (for example, companies or cities), which could constitute a third level in the analyses. Crossed memberships could also be imagined, for example, patients sharing a hospital but not a neighborhood. This type of data is more complex to analyze and, as always, space is scarce in this chapter. Note also that data is usually collected at both levels, for instance, the job satisfaction of employees (level 1) and the type of leadership of the managers (level 2).

If your data has a hierarchical structure, traditional regression analysis will most likely produce unreliable results. This is because the observations are not independent, but the analysis assumes they are. One of the consequences is that standard errors can be underestimated, which could lead to spuriously significant results.
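To make this concrete, the following sketch (a hypothetical simulation, here in Python) compares the naive standard error of a mean, computed as if all observations were independent, with the true sampling variability when observations share a group effect. The group counts, sizes, and variances are illustrative assumptions, not taken from the chapter's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 groups of 25 observations each. Every observation
# is the sum of a shared group effect (level-2 variance) and individual
# noise (level-1 variance), so observations within a group are correlated.
def one_dataset_mean():
    group_effects = rng.normal(0, 1, 20)                     # one effect per group
    obs = group_effects.repeat(25) + rng.normal(0, 1, 500)   # 25 obs per group
    return obs, obs.mean()

obs, _ = one_dataset_mean()

# Naive standard error: treats all 500 observations as independent
naive_se = obs.std(ddof=1) / np.sqrt(len(obs))

# Empirical standard error: the actual spread of the mean over replications
true_se = np.std([one_dataset_mean()[1] for _ in range(2000)])

print(naive_se < true_se)  # the naive SE understates the real uncertainty
```

With these variance settings, the naive standard error is several times too small, which is exactly how spuriously significant results arise from ignoring the nesting.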

Another problem is known as the Robinson effect, which refers to the strengthening of statistical relationships between attributes when data is aggregated, compared to when it is not. The phenomenon is named not after the castaway but after the researcher who discovered it; nevertheless, blindly aggregating data at the higher level and examining relationships between attributes at that level might lead to the shipwreck of the analysis.

Spurious results can also be due to characteristics of the context that are shared by observations within the groups. Drawing conclusions at one level with data collected at another level is likely to be erroneous because relationships might be different at different levels. The atomistic fallacy is drawing conclusions at the lower level from data at a higher level. The ecological fallacy is just the opposite—drawing conclusions at a higher level from data at a lower level.

Let's examine an example of ecological fallacy visually in the following figure:

A depiction of opposing findings at different levels

This plot represents fictional data from seven groups on two hypothetical attributes: Attribute x and Attribute y. Imagine we compute the average values for each of the groups (the mean of the dots in the dashed ovals); these aggregated values would show a strong positive relationship (thick dashed line) between Attribute x and Attribute y. However, if we examine the real relationship within each group, we can see that Attribute x is actually slightly negatively related to Attribute y within each group (thin solid lines).

Another related problem is Simpson's paradox (named after the statistician, not the cartoon character). In analyzing this dataset, we would also find a positive relationship if we simply considered all the groups together in a regression analysis at level 1; the positive relationship is due to the fact that the groups differ in the values of Attribute y because of an unmeasured attribute, and concurrently, they also differ in the values of Attribute x. In other words, the level of the unmeasured attribute (here related to an increase in both x and y) is shared by the observations within each group, but not between groups. Not making the distinction between groups in the analysis would also lead to inaccurate results.
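The pattern in the figure is easy to reproduce with simulated data. The following sketch (in Python, with made-up group centers and slopes chosen for illustration) builds seven groups whose centers rise together in x and y while x and y are slightly negatively related within each group, then contrasts the pooled slope with the within-group slopes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Seven hypothetical groups: centers increase together in x and y,
# but within each group the x-y slope is slightly negative (-0.3).
groups = []
for g in range(7):
    x_within = rng.normal(0, 0.5, 30)
    y_within = -0.3 * x_within + rng.normal(0, 0.1, 30)
    groups.append((g * 2.0 + x_within, g * 2.0 + y_within))

# Pooled level-1 regression that ignores grouping: slope comes out positive,
# driven by the between-group trend
x_all = np.concatenate([x for x, _ in groups])
y_all = np.concatenate([y for _, y in groups])
pooled_slope = np.polyfit(x_all, y_all, 1)[0]

# Separate regressions within each group: every slope is negative
within_slopes = [np.polyfit(x, y, 1)[0] for x, y in groups]

print(pooled_slope > 0)                    # True: aggregate trend is positive
print(all(s < 0 for s in within_slopes))   # True: within-group trends are negative
```

The same dataset thus supports opposite conclusions depending on the level of analysis, which is why the grouping structure must enter the model explicitly.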
