Advantages of survival modeling

In our illustration, the business problem can be defined as the need to understand the probability of a machine being operational across its lifespan. Did all the machines fail at some point during the evaluation period? No. The data spans 20 years, and in 55% of the instances the machines had not stopped functioning within that time span. So how can we estimate the survival rate or hazard rate of machines that haven't stopped functioning? We could use the data on the machines that did stop functioning. We could look at the common characteristics (such as the manufacturer, the functional area the machine is deployed in, the location of the plant where it is installed, the servicing history, and so on) of the failed and the functioning machines, and extrapolate the knowledge gained from analyzing the failed machines to the functioning ones.

This opens up the possibility of using multiple regression (which was covered in Chapter 2, Forecasting Stock Prices and Portfolio Decisions using Time Series Data, while looking at portfolio investment decisions) to find the drivers that led to 45% of the machines not lasting until their 20th anniversary. One of the fundamental assumptions of such a regression model is that its residuals are normally distributed. An implicit assumption in this scenario would be that all the machines in the data were operationalized at the same time, so that each had the chance to perform for the full 20-year tenure. Even setting that assumption aside, there could be issues in achieving a well-behaved distribution for the outcome variable. And if the machines were not operationalized at the same time, there would be problems in interpreting the model's output. What if all the failed machines were installed in year four? Did something go particularly wrong at that point? What if the machines were installed in no particular order across the 20-year period, and some of the functioning machines were installed only in the last few years of the observation window? Survival modeling is adept at dealing with data where the start or end times of the individual data points are not aligned or not fully observed.

Another advantage of survival analysis is its ability to incorporate the effects of covariates, or what we termed independent variables when introducing multivariate regression in Chapter 2, Forecasting Stock Prices and Portfolio Decisions using Time Series Data. Not all survival modeling techniques can incorporate covariates, but for those that can, their inclusion adds an important explanatory component to the analysis.
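As a minimal sketch of how this looks in practice (assuming Python with the lifelines library; the column names and values here are hypothetical, not taken from our machine dataset), a Cox proportional hazards model takes the observation time, the event flag, and the covariates together:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical machine data: observed time in years, event flag (1 = failed,
# 0 = still running), and two dummy-coded covariates.
df = pd.DataFrame({
    "years_observed": [4.0, 12.5, 20.0, 7.3, 20.0, 15.1, 9.8, 20.0, 3.2, 18.0],
    "failed":         [1,   1,    0,    1,   0,    1,    1,   0,    1,   1],
    "manufacturer_b": [0,   1,    0,    1,   1,    0,    1,   0,    0,   1],
    "plant_north":    [1,   0,    1,    0,   1,    1,    0,   0,    1,   1],
})

# The Cox model relates the covariates to the hazard of failure over time;
# each coefficient reports how a covariate raises or lowers that hazard.
cph = CoxPHFitter()
cph.fit(df, duration_col="years_observed", event_col="failed")
cph.print_summary()
```

The fitted coefficients translate into hazard ratios, which is exactly the explanatory component referred to above.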

It is also possible to compare survival curves between various strata or groups. For instance, if we had five different brands of machines, we could plot a separate survival curve for each brand. This gives us the flexibility to compare across groups, as shown in the sketch below. With other predictive models, we would have to build a separate model for each group to be able to perform such inter-group comparisons.
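As a rough sketch of such a comparison (again assuming Python with lifelines and matplotlib; the brands and numbers are made up for illustration), one Kaplan-Meier curve can be fitted per brand and drawn on shared axes:

```python
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Hypothetical data: observed years per machine, failure flag (1 = failed),
# and the brand of each machine.
df = pd.DataFrame({
    "years_observed": [4.0, 12.5, 20.0, 7.3, 20.0, 15.1, 9.8, 20.0, 3.2, 18.0],
    "failed":         [1,   1,    0,    1,   0,    1,    1,   0,    1,   1],
    "brand":          ["A", "A",  "A",  "B", "B",  "B",  "C", "C",  "C", "C"],
})

# Fit one Kaplan-Meier estimator per brand and overlay the curves.
fig, ax = plt.subplots()
for brand, grp in df.groupby("brand"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["years_observed"], event_observed=grp["failed"], label=f"Brand {brand}")
    kmf.plot_survival_function(ax=ax)

ax.set_xlabel("Years in operation")
ax.set_ylabel("Estimated survival probability")
plt.show()
```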

Multivariate regression cannot tell us how likely a machine is to be functional, or what the risk of it being non-functional is, at a given point in time. With multivariate regression we can only model the drivers and their importance in determining failure; it will not give us the time element. Even if we built a logistic regression model with a binary outcome of 1 (functioning) and 0 (not functioning), we would not be able to deal with instances where the non-functioning happened because of external decisions. The logistic model would treat every instance of 0 as a failure of the machine itself. In survival modeling, we do not begin with the assumption that every non-functioning machine stopped solely because of a mechanical failure. There could be other reasons, such as over-production, where the business decided to pull the plug and reduce the number of operational machines. Survival analysis deals with this issue through censoring.

But what is censoring a variable? Let's go back to the definition of probability. Probability is the chance of an event occurring. What could be an event? The toss of a coin has two possible outcomes: heads or tails. If we are interested in quantifying the chance of getting heads, we need to perform the toss some number of times, n. Let's say we toss the coin 100 times and observe heads 55 times. We can then say that the probability of observing heads is 55/100, or 0.55, and the probability of not observing heads is 45/100, or 0.45. The 0.45 probability of not observing heads can also be expressed as 1 - 0.55. In determining these probabilities, we used both the number of times the event occurred and the number of times it did not.

Similarly, with the probability of machine failure, we either have machines that are functioning or machines that are not. The final outcome is assumed to be the non-functioning of every machine at some point in time; what is unknown is when the currently functioning machines will stop. In the coin toss illustration, the outcome was either heads or not heads (tails), and there was never a possibility of a tails outcome turning into heads at a later stage: every toss produced a final, fully observed outcome. In the machine data, by contrast, we have to assume that the active machines are going to stop at some point, and this assumption has to be built in before we can try to understand when they might stop. When an event hasn't occurred for a part of the population and we want to call out such data points, we censor them. Against each machine ID, we can simply add a new variable called censored, with values 1 and 0: 1 for each machine that is still active and 0 for each non-functional machine, as sketched below.
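A small illustration of this coding (hypothetical machine IDs and column names, assuming pandas); note that many survival libraries expect the opposite convention, an event flag that is 1 when the failure was actually observed:

```python
import pandas as pd

# Hypothetical machine table: a status column records whether the machine
# was still active at the end of the observation window.
machines = pd.DataFrame({
    "machine_id": [101, 102, 103, 104],
    "years_observed": [20.0, 7.3, 20.0, 12.5],
    "status": ["active", "failed", "active", "failed"],
})

# Censored = 1 for machines still functioning (event not yet observed),
# 0 for machines that have failed, following the coding described above.
machines["censored"] = (machines["status"] == "active").astype(int)

# Libraries such as lifelines instead expect an event flag that is 1 when
# the failure was observed, i.e. the complement of 'censored'.
machines["failed"] = 1 - machines["censored"]
print(machines)
```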

It would be a flawed survival analysis model if none of the machines in the database had failed, because we would be trying to model the good (active) behavior of machines without any negative event information. In this instance, we have applied right-censoring: the actual lifespan of an active machine is at least as long as the observed one, because the failure had not yet occurred when observation ended. An instance of left-censored data would be when we expect a particular machine to have had a shorter lifespan than the one observed. This can happen when we don't know exactly when the machine stopped working, but we can assume that the recorded functioning time is greater than the actual time to failure.

When we censor machine A, it is called right censoring, as the machine has survived the five-year study period and we don't have an actual time to failure. In the case of machine B, we are doing left censoring: we don't know the exact time when it stopped functioning, only that at the two-year mark the machine was recorded as non-functional. There is also a type of censoring called interval censoring. Machine C was functioning when observed at the two-year mark but had stopped working somewhere before the three-year mark recorded on the data system. In this case, we don't know at which point between the two observations the machine stopped functioning:

Figure 6.2: Types of censoring
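To make the three cases concrete, here is one possible way of recording what is actually known about machines A, B, and C; the fields and bounds below are a sketch, not a prescribed schema:

```python
import numpy as np
import pandas as pd

# For each machine we record the interval within which the failure is known
# to lie; np.inf and 0 mark the open ends of right- and left-censored records.
censoring_examples = pd.DataFrame({
    "machine": ["A", "B", "C"],
    "lower":   [5.0,    0.0, 2.0],   # lower bound on the unknown failure time (years)
    "upper":   [np.inf, 2.0, 3.0],   # upper bound on the unknown failure time (years)
    "censoring": ["right", "left", "interval"],
})
# Machine A: still running after the 5-year study, so failure time > 5 (right-censored).
# Machine B: found non-functional at year 2, so failure happened before then (left-censored).
# Machine C: working at year 2 but stopped before year 3 (interval-censored).
print(censoring_examples)
```

Parametric fitters in libraries such as lifelines can consume lower and upper bounds of this form for interval-censored data; the common thread is that each censoring type records partial rather than exact failure times.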

Censoring only works if we assume that the subjects in the study have equal survival prospects. Irrespective of the time at which they entered the study, each subject is assumed to still have a chance of experiencing the event. If we are running a clinical trial, we assume that subject x, who enters the trial in the last two years of the drug study, has the same survival probability as the subjects who entered the trial at its inception. It is also assumed that subject x has a chance to experience the event even though they entered the trial at a later stage. Subject x still requires censoring, as they entered the trial late and have not yet encountered the event.
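As a brief sketch of how delayed entry like subject x's can be expressed (hypothetical numbers; lifelines' Kaplan-Meier fitter accepts an entry argument for subjects who join the study after it starts):

```python
from lifelines import KaplanMeierFitter

# Hypothetical trial-style data, all measured on the same study clock (years).
exit_times = [5.0, 4.2, 3.8, 5.0, 5.0]   # time of event or censoring
events     = [1,   1,   0,   0,   0]     # 1 = event observed, 0 = censored
entry      = [0.0, 0.0, 0.0, 3.0, 3.5]   # last two subjects (like subject x) joined late

kmf = KaplanMeierFitter()
# 'entry' tells the estimator that each subject was only at risk after joining,
# while the late, event-free subjects are still treated as right-censored.
kmf.fit(exit_times, event_observed=events, entry=entry)
print(kmf.survival_function_)
```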
