
Section 8.1 Probability

Subsection 8.1.1 Probability Density Functions

I’ve done two weeks so far with applications of integration: parametric curves and volumes. The last week of these applications is using integration for probability.
The study of probability can be roughly broken up into two kinds of data: discrete and continuous. In discrete data, there are a finite number of separate possible measurements, each of which has a finite probability. The study of discrete probability can be entirely accomplished with finite sums (though even there, calculus can give surprising insights). Exam marks are a typical discrete measurement: there are only a finite number of possibilities.
Continuous probability involves measurements which can vary anywhere within a range. Heights in a population are a typical continuous measurement: assuming sufficient precision, a height can be any real number in a particular range. At least mathematically, there are infinitely many possible measurements. As opposed to discrete probability, I can’t assign a specific likelihood to any particular measurement in a continuous situation. Instead, I can assign a probability to a range of measurements. For example: there is a 20% chance that an adult female caribou stands more than 125 cm tall at the shoulder.
The main application of integration to probability is to understand continuous data. To that end, let me make the key definition.

Definition 8.1.1.

A probability density function or probability distribution is an integrable function \(f(x) : [a,b] \rightarrow [0, \infty)\) such that
\begin{equation*} \int_a^b f(x) dx = 1 \end{equation*}
A probability distribution is a very new kind of thing, so let me make some important notes about its properties and interpretation.
  • First and most counter-intuitive: the quantity that is being measured is the input of the probability. When talking about, say, height, it is familiar for height to be the dependent variable. Height depends on time, or position, or something else. For probabilities of heights, height is the independent variable. The function \(f(x)\) is asking about the probability of the measurement \(x\text{.}\)
  • The output \(f(x)\) is a measure of how likely the measurement \(x\) is, but only very vaguely. For continuous probability, since infinitely many measurements are possible in a range, each individual precise measurement actually has zero probability. Instead, a probability is only given for a range of measurements. For continuous probability, I can never ask: how likely is a height of exactly 6 meters? Instead, I can only ask: how likely is a height between 5 and 6 meters?
  • The probability distribution is positive. A negative probability has no meaning.
  • The interval \([a,b]\) is the range of all possible measurements. The integral condition in the definition says that all measurements fall in this range (with probability 1). Then, if \(x_0\) and \(x_1\) (with \(x_0 \lt x_1\)) are in the interval \([a,b]\text{,}\) the probability of a measurement between \(x_0\) and \(x_1\) is given by the integral of the probability distribution.
    \begin{equation*} P([x_0,x_1]) = \int_{x_0}^{x_1} f(x) dx \end{equation*}
    This notation, where \(P([a,b])\) is the probability of a measurement in the range \([a,b]\text{,}\) is conventional. I’ll use this notation throughout this section. (A short numerical sketch of this calculation follows this list.)
  • This probability \(P\) of a measurement in some range will have \(0 \leq P \leq 1\text{,}\) depending on the endpoints and the particular nature of the distribution. Continuous probabilities are not given in percentages; instead, they are given in fractions. If the integral produces the value \(P = \frac{1}{2}\text{,}\) that’s equivalent to the colloquial \(50\%\) probability. Percentage language is quite common in discrete probability, but much less common in continuous probability. I won’t use percentage language at all in this course. Instead, a probability of \(1\) is an event that is sure to happen, and all other probabilities will be decimals or fractions between \(0\) and \(1\text{.}\)
  • Even though the integral is bounded above by \(1\text{,}\) the probability distribution itself is not bounded by \(1\text{.}\) It can take arbitrarily large values, as long as any spike is narrow enough that the total area under the curve over the whole interval remains \(1\text{.}\)
  • Both the terms, `probability density function’ and `probability distribution’, are conventional. I’ll mostly use the term `distribution’. Technically, a distribution is an extension of the notion of a function; there are distributions which are not actually functions. However, that subtlety isn’t important for the examples in this course, so I’ll just keep using distribution to refer to the functions that measure probability.
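For a concrete feel for these properties, here is a small numerical sketch (assuming Python with SciPy is available; the density used is just an illustrative example). It checks the normalization condition and computes \(P([x_0,x_1])\) by numerical integration.

    import numpy as np
    from scipy.integrate import quad

    # An illustrative density on [0, 1]: f(x) = 3x^2, whose integral over [0, 1] is 1.
    f = lambda x: 3 * x**2

    # Normalization check: the integral over the whole interval should be 1.
    total, _ = quad(f, 0, 1)
    print(total)        # 1.0 (up to numerical error)

    # Probability of a measurement between x0 = 0.5 and x1 = 1.
    P, _ = quad(f, 0.5, 1)
    print(P)            # 0.875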

Subsection 8.1.2 Examples

Example 8.1.2.

Figure 8.1.3. The Exponential Distribution
Let \(\alpha \gt 0\) and \(f(x) = \alpha e^{-\alpha x}\) on the domain \([0, \infty)\text{.}\) I will calculate the integral of this function over this domain.
\begin{align*} \int_0^\infty \alpha e^{-\alpha x} dx \amp = \alpha \lim_{a \rightarrow \infty} \int_0^a e^{-\alpha x} dx\\ \amp = \alpha \lim_{a \rightarrow \infty} \frac{e^{-\alpha x}}{-\alpha} \Bigg|_0^a\\ \amp = \alpha \left( \lim_{a \rightarrow \infty} \frac{-e^{-\alpha a}}{\alpha} + \frac{1}{\alpha} \right) = \frac{\alpha}{\alpha} = 1 \end{align*}
I’ve verified that this is, indeed, a probability distribution on \([0,\infty)\text{.}\) It is called the exponential distribution. The fact that \(f\) is a decay function means that measurements get less and less likely as values get large. It is common to get measurements close to zero and uncommon to get large measurements. The decay coefficient \(\alpha\) controls this effect: for large \(\alpha\text{,}\) the decay is faster and measurements near zero are even more common.
The probability of a measurement between \(0\) and \(1\) (taking \(\alpha = 1\)) is calculated by the integral of the density on that range.
\begin{equation*} \int_0^1 e^{-x} dx = -e^{-x} \Bigg|_0^1 = -e^{-1} + 1 = 1 - \frac{1}{e} \doteq 0.632\text{.} \end{equation*}
Likewise, the probability of a measurement between \(1\) and \(2\) is calculated by the integral of the density on that range.
\begin{equation*} \int_1^2 e^{-x} dx = -e^{-x}\Bigg|_1^2 = -e^{-2} + \frac{1}{e} = \frac{1}{e} - \frac{1}{e^2} = \frac{e-1}{e^2} \doteq 0.233\text{.} \end{equation*}
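As a check on these two values, here is a brief symbolic computation (a sketch, assuming SymPy is available).

    import sympy as sp

    x = sp.Symbol('x')
    f = sp.exp(-x)                      # the exponential distribution with alpha = 1

    P01 = sp.integrate(f, (x, 0, 1))    # probability of a measurement in [0, 1]
    P12 = sp.integrate(f, (x, 1, 2))    # probability of a measurement in [1, 2]
    print(sp.N(P01), sp.N(P12))         # approximately 0.632 and 0.233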

Example 8.1.4.

Figure 8.1.5. The Gaussian Distribution
The most well-known probability distribution is the bell curve, which is also called the gaussian distribution or normal distribution. In full generality it depends on two parameters \(\mu\) and \(\sigma\) and has the following form.
\begin{equation*} f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \end{equation*}
The constant \(\frac{1}{\sigma \sqrt{2\pi}}\) ensures that the integral over all of \(\RR\) is \(1\text{,}\) as required for the definition. (Again, this is called the normalization constant). \(\mu\) is the centre point of the bell curve, and \(\sigma\) measures the width of the bell curve (in a way that will be formalized below). For a specific example, I can take \(\mu = 0\) and \(\sigma = \frac{1}{\sqrt{2}}\) to get a bell curve centred at \(0\text{.}\) This specific bell curve is shown in Figure 8.1.5.
\begin{equation*} f(x) = \frac{1}{\sqrt{\pi}} e^{-x^2} \end{equation*}
The fact that this function has the correct integral value for the definition is actually very difficult to establish. Specifically, look at this integral.
\begin{equation*} \int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi} \end{equation*}
The function \(e^{-x^2}\) has no elementary antiderivative. Given the importance of the gaussian distribution, this may be the most notable function without an elementary antiderivative. Actually proving this integral requires techniques well beyond this course.
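Even without an elementary antiderivative, the definite integral can still be checked numerically. A minimal sketch (assuming SciPy):

    import numpy as np
    from scipy.integrate import quad

    # Numerically integrate e^(-x^2) over the whole real line.
    value, _ = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
    print(value, np.sqrt(np.pi))    # both approximately 1.7724539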

Example 8.1.6.

The following distribution describes a situation where there is an equal probability of any measurement in a fixed range. This is called the (continuous) uniform distribution.
\begin{equation*} f(x) = \left\{ \begin{matrix} \frac{1}{b-a} \amp x \in [a,b] \\ 0 \amp x \notin [a,b] \end{matrix} \right. \end{equation*}
In probability theory, it is often important to have these base cases to compare against. Understanding the behaviour and properties of the uniform distribution, where all measurements (in a range) are equally likely, leads to a better understanding of more complicated distributions by comparison to this base case.
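Because the uniform density is constant, the probability of a range inside \([a,b]\) is just the length of that range divided by \(b-a\text{.}\) Here is a minimal sketch of that computation (the values of \(a\text{,}\) \(b\text{,}\) \(x_0\) and \(x_1\) are only illustrative).

    def uniform_probability(x0, x1, a, b):
        """Probability that a uniform measurement on [a, b] falls in [x0, x1]."""
        # Outside [a, b] the density is 0, so clip the requested range first.
        lo, hi = max(x0, a), min(x1, b)
        return max(hi - lo, 0) / (b - a)

    print(uniform_probability(2, 5, 0, 10))    # 0.3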

Subsection 8.1.3 Means

For discrete probability, a mean or average of the expected measurements is relatively intuitive: I just add up the measurements multiplied by their probabilities. What a mean should be for a continuous probability isn’t as immediately obvious, since I can’t add up the infinitely-many measurements. However, I can still take inspiration from the discrete case.
Let me consider finite probability for a moment. Say that there are \(n\) events with probabilities \(p_i\) and measurements \(m_i\text{.}\) In order to ensure this is a valid probability situation, the sum of all the probabilities must be \(1\text{.}\)
\begin{equation*} \sum_{i=1}^n p_i = 1\text{.} \end{equation*}
The mean is the sum of the measurements multiplied by their probabilities.
\begin{equation*} \sum_{i=1}^n m_i p_i \end{equation*}
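As a concrete discrete example (with made-up probabilities and measurements), the mean is a plain finite sum.

    probabilities = [0.2, 0.5, 0.3]      # these must sum to 1
    measurements = [10, 20, 30]

    mean = sum(p * m for p, m in zip(probabilities, measurements))
    print(mean)                           # 21.0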
These discrete calculations give inspiration for the continuous case. The major difference, for continuous probability, is that the sums are now integrals. Changing the sums to integrals, the steps are essentially the same: the sum of the probabilities becomes the integral of the probability density function. Multiplying by the actual measurement in the sum is equivalent to multiplying by the independent variable, since the independent variable is the measurement. This leads to a definition.

Definition 8.1.7.

Let \(f(x)\) be a probability distribution on the domain \([a,b]\text{.}\) The average or mean of \(f(x)\) is defined to be the result of the following integral.
\begin{equation*} \mu = \int_a^b x f(x) dx \end{equation*}
Let me calculate the means of the three examples that I used in the previous section.

Example 8.1.8.

I’ll start with the exponential distribution with \(\alpha = 1\text{,}\) that is, \(f(x) = e^{-x}\) on \([0, \infty)\text{.}\) I use integration by parts to calculate the mean.
\begin{align*} \mu \amp = \int_0^\infty x e^{-x} dx\\ \amp = - x e^{-x} \Bigg|_0^\infty + \int_0^\infty e^{-x} dx\\ \amp = 0 + \left( -e^{-x} \right) \Bigg|_0^\infty = 1 \end{align*}
This mean makes some sense for a decay function. Even though very large measurements are possible, they become very unlikely. The most likely measurements are near \(0\text{,}\) so the mean works out to \(1\text{.}\)
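The same mean can be confirmed symbolically (a quick sketch, assuming SymPy).

    import sympy as sp

    x = sp.Symbol('x')
    mean = sp.integrate(x * sp.exp(-x), (x, 0, sp.oo))
    print(mean)    # 1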

Example 8.1.9.

Next I’ll calculate the mean of the uniform distribution: \(\frac{1}{b-a}\) on the interval \([a,b]\text{.}\)
\begin{align*} \mu \amp = \int_a^b \frac{1}{b-a} x dx\\ \amp = \frac{1}{b-a} \frac{x^2}{2} \Bigg|_a^b\\ \amp = \frac{b^2-a^2}{2(b-a)} = \frac{b+a}{2} \end{align*}
The mean is exactly halfway between the endpoints. Since the probability is constant, this makes perfect sense.

Example 8.1.10.

Now I’ll calculate the mean for the gaussian distribution in full detail, leaving the parameters \(\mu\) and \(\sigma\) as unknowns. First, I do a small substitution to shift the distribution so that it is centred at zero.
\begin{align*} f(x) \amp = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\\ \mu \amp = \int_{-\infty}^\infty \frac{xe^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma \sqrt{2\pi}} dx\\ v \amp = x-\mu\\ \mu \amp = \int_{-\infty}^\infty \frac{(v+\mu)e^{-\frac{v^2}{2\sigma^2}}}{\sigma \sqrt{2\pi}} dv \end{align*}
After this substitution, I split up the sum in the numerator to give two integrals.
\begin{align*} \mu \amp = \int_{-\infty}^\infty \frac{(v)e^{-\frac{v^2}{2\sigma^2}}}{\sigma \sqrt{2\pi}} dv + \int_{-\infty}^\infty \frac{(\mu)e^{-\frac{v^2}{2\sigma^2}}}{\sigma \sqrt{2\pi}} dv \end{align*}
There are two integrals to do here. Let me deal with the first integral first. Even though \(e^{-x^2}\) has no elementary antiderivative, now I have something similar to \(xe^{-x^2}\text{,}\) which does have an elementary antiderivative found by substitution. I could do that substitution and solve this integral that way, but there is an easier method. The function \(ve^{-\frac{v^2}{2\sigma^2}}\) is an odd function: its part for negative \(v\) is a mirror image of its part for positive \(v\text{,}\) but flipped to the other side of the axis. Therefore, there is an equal area under the curve for the positive and negative pieces, but with different signs. In the limit for the improper integrals, these two areas will perfectly cancel each other out. This first integral must be zero.
\begin{align*} \amp = 0 + \frac{\mu}{\sigma \sqrt{2\pi}} \int_{-\infty}^\infty e^{-\frac{v^2}{2\sigma^2}} dv \end{align*}
Now I have the second integral left. I’ve already pulled out all the constants. Now let me make another substitution.
\begin{align*} w \amp = \frac{v}{\sigma \sqrt{2}}\\ \mu \amp = \frac{\mu}{\sigma \sqrt{2\pi}} \sigma \sqrt{2} \int_{-\infty}^\infty e^{-w^2} dw = \frac{\mu}{\sqrt{\pi}} \int_{-\infty}^\infty e^{-w^2} dw \end{align*}
The resulting integral is exactly the integral I discussed originally in Example 8.1.4. The integral evaluates to \(\sqrt{\pi}\text{.}\) This lets me finish the calculation.
\begin{equation*} \mu = \frac{\mu}{\sqrt{\pi}} \sqrt{\pi} = \mu \end{equation*}
Unsurprisingly, given the choice of notation, the parameter \(\mu\) (which is the centre point of the bell curve) is the mean of the distribution. The gaussian distribution is the first and most important symmetric distribution: the mean sits at the centre of the distribution and the probability decays away from the mean equally on both sides. This means that measurements in symmetric ranges above and below the mean are equally likely.
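For those who would rather not wrestle with the substitutions by hand, a computer algebra system reproduces the same result (a sketch assuming SymPy; the assumptions on the symbols matter for the improper integrals to evaluate cleanly).

    import sympy as sp

    x = sp.Symbol('x', real=True)
    mu = sp.Symbol('mu', real=True)
    sigma = sp.Symbol('sigma', positive=True)

    f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
    print(sp.integrate(f, (x, -sp.oo, sp.oo)))        # 1 (normalization)
    print(sp.integrate(x * f, (x, -sp.oo, sp.oo)))    # mu (the mean)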

Subsection 8.1.4 Central Tendencies

The mean or average is only one of several possible measures of what is the most likely outcome of a measurement. In general, a central tendency is any mathematical calculation of a ‘typical’ value; for most distributions, there are several different central tendencies which I can consider. It isn’t always obvious which is the most appropriate. The three most common and most well-known central tendencies are mean (average), median and mode.
An important example of the use of these central tendencies is in statistical data like income and wealth. The distributions for income and wealth are similar to the exponential distribution: most measurements are near the minimum, but there are a small number of measurements that are very, very high. In these distributions, different central tendencies give significantly different results.
Specifically, look at the income data for Canada. (Rough figures here are taken from the Stats Canada website in March, 2021). The mean/average income in 2018 for Canadians 16 and over was \(\$48,000\text{.}\) However, the median income was \(\$36,400\text{.}\) That’s a pretty substantial difference. Which of these measures the ‘typical’ Canadian?
It is difficult to say which central tendency is more appropriate. In particular, the judgement of which central tendency to use is external to mathematics. Mathematics doesn’t give moral guidance for which type of central tendency is the best; it just tells you how to calculate each type. For data like income, which has a very long tail on the positive side of its distribution, typical practice is to report the median instead of the average. The community has judged it to be a more reasonable measure. But, again, the point here is that such a judgement happens outside of the mathematics.
Let me actually now define median for continuous probability.

Definition 8.1.11.

For continuous probability and probability density \(f(x)\) on \([a,b]\text{,}\) the median is defined to be the unique number \(c\) such that the following equation holds.
\begin{equation*} \int_a^c f(x) dx = \int_c^b f(x) dx = \frac{1}{2} \end{equation*}
Since integrals are areas under the curve, and the total area on \([a,b]\) is \(1\text{,}\) the median is the place which exactly divides the area under the curve into halves.

Example 8.1.12.

I’ll calculate the median for the exponential distribution \(f(x) = e^{-x}\text{.}\)
\begin{align*} \int_c^\infty e^{-x} dx \amp = -e^{-x}\Bigg|_c^\infty\\ \frac{1}{2} \amp = e^{-c}\\ -c \amp = \ln \frac{1}{2} = - \ln 2\\ c \amp = \ln 2 \lt 1 \end{align*}
The median of this distribution is \(\ln 2\text{,}\) which is smaller than the mean of \(1\text{.}\) As mentioned above for income data, this difference between mean and median is typical for distributions with a long tail on one side. The very high values pull up the mean, but not the median.
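A numerical cross-check of this median (a sketch assuming SciPy): find the point \(c\) where the accumulated probability reaches one half.

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    # Accumulated probability of the exponential distribution from 0 up to c.
    cdf = lambda c: quad(lambda x: np.exp(-x), 0, c)[0]

    median = brentq(lambda c: cdf(c) - 0.5, 1e-9, 10)
    print(median, np.log(2))    # both approximately 0.6931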
One of the reasons that the bell curve is very commonly used to understand probability is that it is very well behaved for central tendencies. Basically any central tendency you can calculate for a bell-curve will give \(\mu\text{,}\) the mean.

Subsection 8.1.5 Expectation Values

The mean measures, in some sense, the most likely measurement. (This is never precisely true for continuous probability, since there are infinitely many possible measurements). It answers the question: what do I expect the measurement to be? I expect the measurement to be the mean, or at least something close to it.
This leads to new terminology. The mean is often called the expectation value of the measurement. This also has some new notation. If \(f(x)\) on \([a,b]\) is a probability distribution, the mean or expectation value is written \(\left\langle x\right\rangle\text{.}\)
It is also possible to have some other quantity which depends on the measurement. I can think of some function \(g(x)\) which has the same independent variable as the probability distribution (because, again, the independent variable is the measurement). These related functions of the measurements also have expectation values, and these expectation values are also calculated by integrals of the probability distribution.

Definition 8.1.13.

Let \(f(x)\) be a probability distribution on the interval \([a,b]\text{.}\) Let \(g(x)\) be some other integrable function of the same independent variable. The likely outcome of \(g(x)\) is called the expectation value of \(g(x)\) and is calculated by the following integral.
\begin{equation*} \left\langle g(x) \right \rangle = \int_a^b g(x) f(x) dx \end{equation*}
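For instance, taking the exponential distribution \(f(x) = e^{-x}\) on \([0,\infty)\) and the quantity \(g(x) = x^2\text{,}\) the expectation value \(\left\langle x^2 \right\rangle\) is computed by the same kind of integral (a sketch assuming SymPy).

    import sympy as sp

    x = sp.Symbol('x')
    f = sp.exp(-x)      # probability distribution on [0, infinity)
    g = x**2            # another quantity depending on the measurement x

    expectation = sp.integrate(g * f, (x, 0, sp.oo))
    print(expectation)  # 2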
Modern quantum mechanics is all based on probability. Measurables such as position, velocity, momentum and energy are all functions on the probability space. The actual values referenced are not strict values, but expectation values. The previous definition of expectation value calculates all these measurables. Once this interpretation is in place, the physics of the situation is understood by knowing the time development of the probability distribution (strictly speaking, of a closely related function called the wave function in quantum mechanics). Schrödinger’s equation, the heart of quantum mechanics, is precisely the differential equation that describes the time development of the probability distribution. The only sense of measurement is expectation value.

Subsection 8.1.6 Standard Deviation

Once I’ve chosen a central tendency, such as the mean, I can ask about the variance of the distribution. The variance is the spread of the distribution: how far away measurements are from the mean. Am I likely to get measurements very near the mean, or very far away? The standard deviation of a probability distribution is the first tool to measure variance. A low standard deviation means that most measurements are close to the mean, and a high standard deviation means that measurements can be very spread out.
Following the bell curve, I will write \(\mu\) for the mean of a probability distribution \(f(x)\) on \([a,b]\text{.}\) The distance of a measurement from the mean is given by \(|x-\mu|\text{.}\) I could ask for the expectation value of this distance, which would be a measure of variance. However, the convention is to instead ask for the expectation value of the square of this distance.

Definition 8.1.14.

Let \(f(x)\) be a probability distribution on \([a,b]\) with mean \(\mu\text{.}\) The standard deviation of the distribution is written \(\sigma\) and calculated by this integral.
\begin{equation*} \sigma^2 = \left\langle (x-\mu)^2 \right\rangle = \int_a^b (x-\mu)^2 f(x) dx \end{equation*}
This integral calculates \(\sigma^2\text{.}\) The square of \((x-\mu)\) inside the integral can be thought of as a Pythagorean sum. In discrete probability, this would exactly be the case: the square of the standard deviation is the probability-weighted sum of the squares of the deviations of the individual measurements. In continuous probability, this sum becomes an integral. This Pythagorean combination is a reasonable way of adding up the individual variances to produce the standard deviation if I think of each measurement as being something independent, like independent directions. Let me proceed to calculate some examples.

Example 8.1.15.

The standard deviation of the exponential distribution \(f(x) = e^{-x}\) on \([0 , \infty)\) is calculated as follows. (Some of the integrals in this calculation have already been calculated when I did the mean of the exponential distribution. I won’t repeat all the integration steps.)
\begin{align*} \sigma^2 \amp = \int_0^\infty \left( x - 1 \right)^2 e^{-x} dx\\ \amp = \int_0^\infty \left( x^2 - 2x + 1 \right) e^{-x} dx = \int_0^\infty x^2 e^{-x} dx - 2 \int_0^\infty xe^{-x} dx + \int_0^\infty e^{-x} dx\\ \amp = -x^2 e^{-x} \Bigg|_0^\infty + \int_0^\infty 2 x e^{-x} dx - \int_0^\infty 2 x e^{-x} dx + 1\\ \amp = 0 + 1\\ \sigma \amp = 1 \end{align*}
Even with the long tail of high measurements, the typical distance to the mean is still \(1\text{.}\)
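The same variance integral can be confirmed symbolically (a quick sketch assuming SymPy).

    import sympy as sp

    x = sp.Symbol('x')
    variance = sp.integrate((x - 1)**2 * sp.exp(-x), (x, 0, sp.oo))
    print(variance, sp.sqrt(variance))    # 1 1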

Example 8.1.16.

The standard deviation of the uniform distribution \(\frac{1}{b-a}\) is a surprisingly difficult calculation. Recall the mean is \(\frac{a+b}{2}\text{.}\)
\begin{align*} \sigma^2 \amp = \int_a^b \left( x - \frac{a+b}{2} \right)^2 \frac{1}{b-a} dx\\ \amp = \int_a^b \frac{x^2}{b-a} - \frac{(a+b)x}{b-a} + \frac{(a+b)^2}{4(b-a)} dx\\ \amp = \frac{x^3}{3(b-a)} \Bigg|_a^b - \frac{(a+b)x^2}{2(b-a)} \Bigg|_a^b + \frac{(a+b)^2x}{4(b-a)} \Bigg|_a^b\\ \amp = \frac{b^3-a^3}{3(b-a)} - \frac{(a+b)(b^2-a^2)}{2(b-a)} + \frac{(a+b)^2(b-a)}{4(b-a)}\\ \amp = \frac{b^2 + ab + a^2}{3} - \frac{a^2 + 2ab+ b^2}{2} + \frac{a^2 + 2ab + b^2}{4}\\ \amp = b^2 \left( \frac{1}{3} - \frac{1}{2} + \frac{1}{4} \right) + ab \left( \frac{1}{3} - 1 + \frac{1}{2} \right) + a^2 \left( \frac{1}{3} - \frac{1}{2} + \frac{1}{4} \right)\\ \amp = \frac{b^2}{12} - \frac{ab}{6} + \frac{a^2}{12} = \frac{b^2-2ab+a^2}{12} = \frac{(b-a)^2}{12}\\ \sigma \amp = \sqrt{ \frac{(b-a)^2}{12}} = \frac{b-a}{2\sqrt{3}} \end{align*}
This is a believable result, since it shows some distance from the mean but is still within the interval.
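A computer algebra system gives the same answer without the long algebra (a sketch assuming SymPy; the result may appear in an equivalent expanded form).

    import sympy as sp

    x, a, b = sp.symbols('x a b', real=True)
    f = 1 / (b - a)

    mean = sp.simplify(sp.integrate(x * f, (x, a, b)))
    variance = sp.simplify(sp.integrate((x - mean)**2 * f, (x, a, b)))
    print(mean)        # (a + b)/2
    print(variance)    # equivalent to (b - a)**2/12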

Example 8.1.17.

Lastly, I will calculate the standard deviation of the gaussian distribution, with \(\mu\) and \(\sigma\) undetermined. (As with the mean, the notation is giving away the end of the calculation: this \(\sigma\) parameter in the definition will turn out to be, exactly, the standard deviation.) This is a long integral; I do two substitutions to try to simplify, then I do a tricky integration by parts.
\begin{align*} \sigma^2 \amp = \int_{-\infty}^\infty (x-\mu)^2 \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx\\ v \amp = x - \mu\\ \amp = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^\infty v^2 e^{\frac{-v^2}{2\sigma^2}} dv\\ w \amp = \frac{v}{\sigma \sqrt{2}}\\ \amp = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^\infty 2 \sigma^2 w^2 e^{-w^2} \, \sigma \sqrt{2} \, dw = \frac{2\sigma^2}{\sqrt{\pi}} \int_{-\infty}^\infty w^2 e^{-w^2} dw\\ g(w) \amp = w \\ \frac{df}{dw} \amp = we^{-w^2} \\ \frac{dg}{dw} \amp = 1\\ f \amp = \frac{-e^{-w^2}}{2}\\ \amp = \frac{2\sigma^2}{\sqrt{\pi}} \left[ -\frac{w e^{-w^2}}{2} \Bigg|_{-\infty}^\infty + \int_{-\infty}^\infty \frac{e^{-w^2}}{2} dw \right]\\ \amp = \frac{\sigma^2}{\sqrt{\pi}} \int_{-\infty}^\infty e^{-w^2}dw = \frac{\sigma^2}{\sqrt{\pi}} \sqrt{\pi} = \sigma^2\\ \sigma \amp = \sigma \end{align*}
The gaussian distribution, with its two parameters, is very nicely defined. If I need a distribution with a given average and given standard deviation, I can very directly write down the matching gaussian distribution.
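Finally, the standard deviation calculation for the gaussian distribution can also be verified symbolically (a sketch assuming SymPy, with the same symbol assumptions as in the earlier mean check).

    import sympy as sp

    x = sp.Symbol('x', real=True)
    mu = sp.Symbol('mu', real=True)
    sigma = sp.Symbol('sigma', positive=True)

    f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
    print(sp.integrate((x - mu)**2 * f, (x, -sp.oo, sp.oo)))    # sigma**2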