The main concepts of probability
The concept of probability distribution
Types of probability distribution
Other probability distributions
As we saw in Chapter 9, pricing derivatives, particularly securities with embedded optionality, requires assumptions about their future behaviour. Whether it is just the range of final values the security is likely to have, the paths it might take to get there, or the future behaviour of several variables that impact the value of the derivative, they all require the use of probability and statistics. Hence this chapter is devoted to the main concepts of probability, together with the probability distributions most relevant to financial modelling.
Very few things in life are certain. That is why we need some probability measure in order to estimate the likelihood of uncertain events. Probability theory aims to facilitate decision making where there are several possible outcomes. It was first developed for use in gambling, hence most examples use cards, coins, dice etc.
The financial world is very uncertain and decisions on investments have to be based on ‘educated guesses’. Hence probability theory plays a big part in pricing and trading financial instruments.
The probability of an event E occurring in a trial where there are several equally likely outcomes is:
P(E) = Number of ways E can occur / Total number of possible outcomes
When rolling a die, what is the probability of getting a six?
As there are six equally likely numbers, the probability of throwing any one of them is 1/6.
This works well with cards, coins etc., i.e. whenever all the possible outcomes are known with certainty. But what happens when this is not the case?
The probability definition has to be extended to cater for such cases. The only approach that can be taken is to observe a sufficiently large number of trials and assume that the observed frequency of an event will continue into the future. In other words, probability can be defined as:
P(E) = Number of observed occurrences of E / Total number of observations
If 100 people took a driving test and 54 passed, we can conclude that the probability of passing the test is:
P(pass) = 54/100 = 0.54 or 54%
Normally a much larger sample would be used, but the same logic would apply.
It is worth pointing out that the above discussion applies to ‘countable’ events (events where the number of instances can be observed exactly). In order to allow for events that are not discrete, the concept of distribution functions has to be introduced (discussed later).
From the above examples it can be seen that probability ranges from 0 (0 per cent) to 1 (100 per cent). This means that an event E that will certainly not occur has probability P(E) = 0 and an event that will certainly occur has probability P(E) = 1. It follows from these definitions that the probabilities of all possible outcomes must add up to 1, as it is certain that one of them will occur. For example, when tossing a coin, the probability of a head and the probability of a tail are each 0.5, hence their sum is 1, as it is certain that one of the two will occur. The above definitions can be further extended to include the concept of NOT (the probability of something not happening).
When rolling a die, the probability of a six is 1/6. The probability of NOT throwing a six must then be the probability of all other outcomes (1, 2, …, 5), i.e. 5/6.
However, as we know that the probability of all outcomes together is 1, the probability of all outcomes but six must be
P(NOT 6) = 1 − P(6) = 1 − 1/6 = 5/6
In the above example either approach was equally easy, but the concept comes in really useful when not all the possible outcomes, or the probabilities of some of the outcomes, are known.
The probability of occurrence of two or more events often needs to be calculated, which requires knowledge of the relationship between those events. The tests in the experiments can be:
Independent, i.e. the outcome of one test has no impact on another
Dependent, i.e. the outcome of one test is affected by the preceding tests.
Furthermore, the outcomes from any one of the tests undertaken can be:
Mutually exclusive, i.e. they cannot occur simultaneously
Not mutually exclusive, i.e. they may occur together.
A card randomly selected from a pack will affect all the future draws from the same pack unless it is replaced. Two random draws from the same pack are only independent if the card is returned to the pack.
Drawing a five and drawing a spade are two non-exclusive events, as there is a card five of spades. However, drawing the five of spades twice in a row (without returning the card to the pack) are two mutually exclusive events: there is only one such card, so it can be drawn on the first draw, on the second, or not at all, but never twice.
These statements can be expressed by mathematical formulae as below.
The probability of either of two mutually exclusive events occurring is equal to the sum of their individual probabilities:
P(A or B) = P(A) + P(B)
We are rolling a single die. What is the probability of getting a three or a four?
P(3) = P(4) = 1/6
Since the two outcomes cannot occur together, P(3 or 4) = P(3) + P(4) = 1/3. Mathematically, the summation rule corresponds to the operation OR.
The probability of two independent events occurring simultaneously is equal to the product of their individual probabilities:
P(A and B) = P(A) × P(B)
Now suppose we throw two dice. What is the probability of getting a three on both?
P(3) = 1/6
Since two throws are independent, the probability P(3 and 3) = P(3) × P(3) = 1/36
Mathematically, the multiplication rule corresponds to the operation AND.
The probability of either of two events occurring when they are not mutually exclusive is equal to the sum of their individual probabilities minus their joint probability. This is because if both events can happen simultaneously, we would otherwise count their joint probability twice. Mathematically we can write this as:
P(A or B) = P(A) + P(B) − P(A and B)
What is the probability of drawing a five or a diamond from a pack of cards?
There are 13 diamonds and 4 fives. Hence it could be concluded that there are 17 ways of drawing either a five or a diamond. In fact, since one of these cards is the five of diamonds, it would be counted twice. Hence:
Probability of drawing a diamond is: P(D) = 13/52
Probability of drawing a five is: P(5) = 4/52
Probability of drawing a five and a diamond is: P(5 of diamonds) = 1/52
P(5 or diamond) = P(D) + P(5) − P(5 of diamonds) = 13/52 + 4/52 − 1/52 = 16/52 = 4/13
Rules describing dependent events deal with conditional probability, as occurrence of one event influences the outcome of the other.
When two events are dependent, the probability of both occurring is:
P(A and B) = P(A) × P(B|A)
where P(B|A) is the conditional probability of B given that A has occurred. Rearranging gives P(B|A) = P(A and B)/P(A).
Students took two maths tests: 25% of students passed both tests and 42% passed the first test. What percentage of students who passed the first test also passed the second test?
A − first test B − second test
P(A and B) = 0.25
P(A) = 0.42
P(B|A) = P(A and B)/P(A) = 0.25/0.42 ≈ 0.60 = 60%
Another example of a dependent event is drawing cards at random without replacing them:
In any pack of cards the probability of drawing a spade is 13/52 = 1/4. However, the probability of drawing a spade after a spade has been drawn is:
1st draw: P(spade) = 13/52
2nd draw: P(spade) = 12/51 (not 13/52 as it would be if we replaced the first card)
hence the probability of drawing two spades without replacement is:
P(2 spades) = 13/52 × 12/51 = 3/51 = 1/17
which is lower than the probability with replacement (1/4 × 1/4 = 1/16).
When tossing a coin, based on its probability, a tail would be expected 50 per cent of the time. However, if 100 tosses were made, it is possible that only 46 tails would occur. If this experiment were repeated many times, the results would most likely range from roughly 35 to 65. If the results were plotted on a graph, we would in fact be drawing a distribution, with the majority of the results concentrating around 50 but spreading out on either side.
Probability distributions are typically defined in terms of the probability density function. There are a number of probability functions used in real-life applications.
For a continuous variable, the probability density function (pdf) describes the relative likelihood of the variable taking the value x. Since for continuous distributions probability is equivalent to the area under the curve, the probability at any single point is zero. For this reason probability is expressed as an integral between two points.
For a discrete distribution, the pdf is the probability that the variate takes the value x.
The cumulative distribution function (cdf) is the probability that the variable takes a value less than or equal to x. That is:
F(x) = P(X ≤ x)
For a continuous distribution, this can be expressed mathematically as the integral of the pdf f(t):
F(x) = ∫ f(t) dt, integrated from −∞ to x
There are several types of probability distributions: binomial, normal, log-normal, Poisson, Student's t, chi-square and others.
Each distribution is characterised by several parameters, typically its location (the centre of the distribution), its scale (the spread of the distribution) and its shape.
The above list of distributions is certainly not exhaustive. All these distributions have their application as they describe particular types of behaviour. What is common to them all is that they attempt to quantify random and uncertain events. Given that in finance events are best described using binomial, normal or log-normal distributions, the following sections will concentrate on those. The Poisson, Student's t and chi-square distributions will be described only briefly, whilst others will not be covered.
The binomial distribution is used when there are exactly two mutually exclusive outcomes of a trial. These outcomes are appropriately labelled ‘success’ and ‘failure’. The binomial distribution is used to obtain the probability of observing x successes in N trials, with the probability of success in a single trial denoted by p. Following from the earlier observation that probabilities of all outcomes must add to 1, the probability of a failure is 1 – p. The binomial distribution assumes that p is fixed for all trials.
If three cards are selected at random from a pack (each card being returned to the pack before the next one is drawn), what is the probability of drawing two spades?
Since there are four suits in each pack (with 13 cards of each), the probability of any suit (and therefore of spades) is P(S) = 0.25. The probability of drawing any other suit is P(O) = 1 − P(S) = 0.75. Since we are returning the cards to the pack, the draws are independent and the probabilities remain the same at each draw.
If we consider all possible outcomes:
SSS, SSO, SOS, OSS, SOO, OOS, OSO, OOO
We can see that each sequence containing exactly two spades (SSO, SOS, OSS) has probability 0.25 × 0.25 × 0.75 = 0.046875, so the probability of exactly two spades is 3 × 0.046875 = 0.140625.
However, instead of listing all possible outcomes (which can be very time-consuming), we can use the binomial expression. The formula for the binomial probability function is:
P(x) = C(n, x) p^x (1 − p)^(n−x)
where the number of possible combinations of x successes in n trials is given by:
C(n, x) = n!/(x!(n − x)!)
and the probability of x successes and n − x failures in a trial is given by:
p^x (1 − p)^(n−x)
Applying this expression to the above example, we have:
P(2) = C(3, 2) × 0.25^2 × 0.75 = 3 × 0.0625 × 0.75 = 0.140625
which is much simpler than listing all the outcomes!
The graphical representation of the binomial distribution – a histogram – is a plot of all the outcomes along the x-axis with their corresponding frequency/number of trials along the y-axis. The probability of observing a value between two points a and b is the sum of the probabilities of all the points between a and b. If p = 0.5, the distribution is symmetric (the histogram is centred), with p < 0.5 it is positively skewed (more items lie above the most frequent item), whilst for p > 0.5 it is negatively skewed.
Below are some statistics for the binomial distribution:
mean = np
Range = 0 to n
Standard deviation = √(np(1 − p))
Binomial distribution is widely used in building binomial trees (for pricing options). Here it is assumed that the price can only go up or down and each move has a probability assigned to it. The advantage of a binomial model is that it is tractable and easy to understand, but building trees can be very time-consuming and inefficient. An example of building a binomial tree to price options was given in Chapter 9 and is further shown below.
The binomial distribution deals with discrete data with only two possible outcomes. The normal distribution enables us to calculate probabilities of outcomes when the variable is continuous (the data can take any value). There are many similarities between the normal and the symmetrical binomial distribution, so that the binomial distribution can be approximated by the normal distribution when there is a large data sample. However, whilst the binomial distribution deals with discrete values and its graphical representation is a histogram, the normal distribution is represented by a continuous curve. The probability of observing a value between any two points is equal to the area under the curve between those points.
The main characteristics of the normal distribution are:
It is bell-shaped and symmetrical about the mean
Its mean, median and mode are equal
It is completely described by two parameters: the mean μ and the standard deviation σ
Its tails extend to infinity in both directions.
The general formula for the probability density function of the normal distribution is:
f(x) = e^(−(x − μ)²/(2σ²)) / (σ√(2π))
where μ is the location parameter (centre of the distribution) and σ is the scale parameter (determining the spread of the distribution). The case where μ = 0 and σ = 1 is called the standard normal distribution.
The equation for the standard normal distribution is:
f(x) = e^(−x²/2) / √(2π)
Some of the distribution statistics are:
Statistic | Value |
---|---|
Mean | location parameter μ |
Median | location parameter μ |
Mode | location parameter μ |
Standard dev. | scale parameter σ |
The location and scale parameters of the normal distribution can be estimated with the sample mean and sample standard deviation respectively.
The normal distribution is widely used. This is partly due to the fact that it is well behaved and mathematically tractable. However, the central limit theorem provides a theoretical basis for why it has wide applicability; it states that as the sample size (N) becomes large, the following occur:
The distribution of the sample mean approaches the normal distribution, regardless of the shape of the underlying distribution
The mean of the sample means approaches the population mean μ, and their standard deviation approaches σ/√N.
It may be interesting to note that the probability of observing a value in various intervals is as follows:
Interval | Probability |
---|---|
μ ± 0.67σ | 0.5 |
μ ± σ | 0.683 |
μ ± 2σ | 0.955 |
μ ± 2.58σ | 0.99 |
μ ± 3σ | 0.997 |
Rather than using the distribution formula above, when the population is normally distributed the probability of observing a value in a particular range can be calculated using the z value:
z = (x − μ)/σ
i.e. the distance of the observed value x from the mean μ expressed in standard deviations σ. This way all the values are standardised and the probability can be looked up in normal distribution tables.
There are several types of distribution tables, but the most common ones give the probability of observing a value beyond a point when moving away from the mean (in either a positive or a negative direction).
If a stock price is 200 and the standard deviation is 40, what is the probability of observing prices above or below various levels?
To answer these questions, the Z value for each case has to be calculated and the probability value found in the distribution table.
The most famous application of the normal distribution is in the Black–Scholes formula, used for pricing stock options, interest rate derivatives etc. In principle, for products that have some built-in optionality an assumption has to be made about the type of distribution that the underlying variable follows. The stochastic behaviour of the variable can then be modelled and the product priced depending on the probabilities of all possible states.
Whether normal or log-normal (or shifted log-normal) distribution is used is mostly a matter of the practitioner’s choice and the empirical evidence of variable behaviour.
In the standard normal distribution the horizontal scale is linear, which means that for a mean of 100 the probability of 80 is the same as the probability of 120. In the log-normal distribution the x-axis would be linear if the logarithms of the values were plotted. This implies that for a median of 100, the probability of 200 is the same as the probability of 50 (both are a factor of two away from 100).
More formally, a variable X is log-normally distributed if Y = ln(X) is normally distributed, with ln denoting the natural logarithm. The general formula for the probability density function of the log-normal distribution is:
f(x) = e^(−(ln((x − θ)/m))²/(2σ²)) / ((x − θ)σ√(2π)), for x > θ
where σ is the shape parameter (allowing for different shapes of distribution), θ is the location parameter (shifting the distribution along the x-axis) and m is the scale parameter (stretching or shrinking the distribution). The case where θ = 0 and m = 1 is called the standard log-normal distribution. The case where only θ = 0 is called the two-parameter log-normal distribution.
The equation for the standard log-normal distribution is:
f(x) = e^(−(ln x)²/(2σ²)) / (xσ√(2π)), for x > 0
The most important distribution statistics for θ = 0 and m = 1 are:
mean = e^(σ²/2)
median = 1 (the scale parameter m)
mode = e^(−σ²)
Range = 0 to +∞
Standard deviation = √(e^(σ²)(e^(σ²) − 1))
Many models (interest rate derivative models in particular) rely on the log-normal distribution. In most cases interest rates are modelled as if they behave log-normally, hence this distribution is used directly in equations as well as in model calibration. A shifted log-normal distribution is often used in modelling (the larger the shift, the more it resembles the normal distribution). There is ample empirical evidence to suggest that stock prices also behave log-normally.
The Poisson distribution is a discrete distribution that takes on the values x = 0, 1, 2, 3 etc. It is often used as a model for the number of events (such as the number of telephone calls at a business centre or the number of accidents at an intersection) in a specific time period. It is also useful in ecological studies, particle physics etc. The useful characteristic of the distribution is that it is not invalidated if one event replaces another, e.g. when counting photons hitting an x-ray target in a given time interval any missed photon can be replaced by a new incoming one.
The Poisson distribution is determined by one parameter, lambda (λ), a shape parameter specifying the average number of events in a given time interval. The probability density function for the Poisson distribution is given by the formula:
P(x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, …
Some of the Poisson distribution statistics are:
mean = λ
Range = 0 to +∞
Standard deviation = √λ
The t-distributions were discovered by statistician William Gosset who – being employed by the Guinness brewing company – could not publish under his own name, so he wrote under the pen name ‘Student’.
T-distributions are normally used for hypothesis testing and parameter estimation, and very rarely for modelling. They arise in simple random samples of size n drawn from a normal population with mean μ and standard deviation σ. Let x̄ denote the sample mean and s the sample standard deviation. Then the quantity
t = (x̄ − μ)/(s/√n)
has a t-distribution with n − 1 degrees of freedom.
Note that there is a different t-distribution for each sample size, in other words, it is a class of distributions. For any specific t-distribution, the degrees of freedom have to be specified. They come from the sample standard deviation s in the denominator of the equation.
The t-density curves are symmetric and bell-shaped like the normal distribution and have their peak at 0. However, the spread is wider than in the standard normal distribution. This is due to the fact that, in the above formula, the denominator is s rather than σ. Since s is a random quantity varying with various samples, there is greater variability in t, resulting in a larger spread.
The larger the number of degrees of freedom, the closer the t-density is to the normal density. This reflects the fact that the standard deviation s approaches σ for large sample size n.
The chi-square distribution results when ν independent variables with standard normal distributions are squared and summed. For simplicity the formula for the probability density function is not included in this text. The chi-square distribution is typically used to develop hypothesis tests and confidence intervals, and rarely for modelling applications.
Some of the χ2 distribution statistics are:
mean = ν
Range = 0 to +∞
Standard deviation = √(2ν)
Binomial trees are used to model the behaviour of a variable that can take only discrete values, e.g. to estimate the probability of the fuel price reaching a certain level in the future. The price can ultimately either go up or down, i.e. change in only two possible ways, hence the name binomial. We could crudely break the timeline down to just the present and the future date and take a guess at the possible up and down price movements, but a more practical approach is to divide the timeline until the future date into a number of steps. A probability is assigned to the up and down moves, and at each step a certain price change is allowed. Those values can be constant for the entire tree, or change at every step. The size of the price movements reduces as the number of time steps increases.
If the current price is 100, two months later it might be as high as 120, or as low as 90. If we divide this period into monthly intervals, we will have only two steps in our tree.
We assign:
p is the probability of an upward move
1 – p is the probability of a downward move
u is the amount by which the price moves up
d is the amount by which the price moves down.
The above variables can be estimated from the market conditions, but for illustration purposes we will assume that:
p = 0.7
1 − p = 0.3
u = 10
d = 5
Hence the probability tree will look like:
start → B (up, 0.7) or C (down, 0.3) → D (up-up), E (up-down or down-up), F (down-down)
The probability of reaching D is the probability of moving up and then up again, i.e. 0.7 × 0.7 = 0.49. For E we can go up then down or down then up, hence 0.7 × 0.3 + 0.3 × 0.7 = 0.42, and for F (down twice), 0.3 × 0.3 = 0.09.
The tree showing possible prices is as follows:
100 → 110 (up) or 95 (down) → 120 (D), 105 (E) or 90 (F)
As we are only interested in the price at the end of the two months, we can see that it can take (in our estimate) three possible values ranging from 90 to 120. The least likely outcome (only 9 per cent probability) is 90, whilst it is 91 per cent certain that the price would increase from its original value.
The above was a very trivial example, but this principle can be extended to any real-life problem to assess the probability of different outcomes.