Probability and probability distributions are the basis for most statistical inference techniques, and therefore for machine learning, artificial intelligence, and data analytics. I wrote an overview of the discrete probability distributions and their R implementation before.
I promised at the end of that article that I would write about the continuous probability distributions in another article. This is the one. In this article, I will try to provide a clear idea about some very common continuous probability distributions.
I will explain them in simple, everyday language, without too many mathematical or calculus terms, and focus on examples and their R implementation.
Let’s just dive into it!
Continuous Distributions — Uniform
This is the simplest continuous distribution and is very similar to the discrete uniform distribution.
With the continuous uniform distribution on a range [a, b], every value of the random variable X in the range is equally likely. Let x be a random number between 1 and 100. Here ‘a’ is 1 and ‘b’ is 100.
Because x is continuous, it can take any value in the range, not just whole numbers. So the probability of x being exactly one particular value, such as 10, is zero. Instead, we describe how likely each value is with a probability density function (PDF), and a probability is the area under the PDF over an interval. For the uniform distribution the PDF is flat: it is 1/(b - a) = 1/99, or about 0.0101, at every point of the range. The density at 10 is 0.0101, and the density at 90 is also 0.0101.
So, we can say that the PDF of any number from 1 to 100 is about 0.0101.
What is the probability that the number is 10 or less?
Well, that is the area under the PDF from 1 to 10. As this is the uniform distribution, the area is a rectangle with a width of 10 - 1 = 9 and a height of 1/99. So, the probability of a number falling between 1 and 10 is 9 * (1/99) = 0.0909.
This is called the cumulative distribution function (CDF). Makes sense, right? It accumulates the density up to a given point.
These two concepts, PDF and CDF, will be used over and over again in each of the distributions.
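To make the link between the PDF and the CDF concrete, here is a small sketch for the random number between 1 and 100. The variable names are my own. Note that the width of a continuous range from 1 to 100 is 99, so R reports a density of 1/99 at every point, and the CDF at 10 is the area under that flat density from 1 to 10.

```r
a <- 1
b <- 100

# The density (PDF) is the same at every point of the range: 1 / (b - a)
density_at_10 <- dunif(10, min = a, max = b)
density_at_90 <- dunif(90, min = a, max = b)

# The CDF at 10 is the probability of a value of 10 or less
cdf_at_10 <- punif(10, min = a, max = b)

# The same probability, obtained as the area under the density from a to 10
area_to_10 <- integrate(dunif, lower = a, upper = 10, min = a, max = b)$value

print(c(density_at_10, density_at_90, cdf_at_10, area_to_10))
```

The last two numbers should agree (up to a tiny numerical error), which is exactly what "the CDF is the accumulated PDF" means.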
The formula for the probability density function (PDF) of a random variable X with a continuous uniform distribution on [a, b] is:
f(x) = 1 / (b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
I already walked through an example while explaining the concepts of PDF and CDF above. Let’s have a look at one more example and the implementation with R.
Suppose that the stock price of a certain company follows a uniform distribution between 50 and 90 dollars. What is the probability density at a stock price of 63 dollars?
You can calculate it very easily for the uniform distribution, because the density is the same throughout the range, as I explained before: 1/(90 - 50) = 0.025. But here I want to introduce the R functions for the uniform distribution.
We can easily do it by using the ‘dunif’ function, which takes the value of the random variable, the minimum, and the maximum. We can calculate the PDF with this one line of code.
dunif(63, min = 50, max = 90)
What is the probability that the stock price will be 60 dollars or less?
Here we need to calculate the cumulative distribution function (CDF). The probability that the stock price will be 60 dollars or less is the area under the PDF from 50 to 60 dollars, which is (60 - 50)/(90 - 50) = 0.25.
R has the ‘punif’ function to calculate the CDF like this.
punif(60, min = 50, max=90)
What is the probability that the stock price is at least 70 dollars?
‘At least 70 dollars’ means the stock price is 70 dollars or more. So we need the area under the PDF from 70 to 90 dollars.
Look, the total probability is 1. If we subtract the CDF at 70 (that is, the cumulative probability from 50 to 70), we will get the probability of 70 or more, which is 0.5.
1 - punif(70, min = 50, max = 90)
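Here is a short recap of the three stock price calculations in one place, checked against the hand formulas 1/(b - a) and (x - a)/(b - a). The variable names are my own, and ‘lower.tail = FALSE’ is simply another way to ask R for the "or more" side instead of writing 1 minus the CDF.

```r
lo <- 50
hi <- 90

# Density at 63 dollars: 1 / (hi - lo)
density_63 <- dunif(63, min = lo, max = hi)    # 0.025

# Probability of 60 dollars or less: (60 - lo) / (hi - lo)
p_at_most_60 <- punif(60, min = lo, max = hi)  # 0.25

# Probability of 70 dollars or more, using the upper tail directly
p_at_least_70 <- punif(70, min = lo, max = hi, lower.tail = FALSE)  # 0.5

print(c(density_63, p_at_most_60, p_at_least_70))
```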
That’s all for the uniform continuous distribution.
Normal Distribution
The normal distribution is the most common continuous distribution. It is the famous bell-shaped curve, also known as the Gaussian distribution.
The normal distribution is determined completely by the mean and the standard deviation. If two normal distributions have the same mean and standard deviation, they are identical.
Like the continuous uniform distribution, the normal distribution gives probabilities over ranges of values. The probability is the area under the curve over that range.
The area under the complete bell-shaped curve is 1.
The normal distribution is so important and widely used because a lot of natural phenomena approximately follow it. For example, the age, height, and BMI of a representative sample of a population are often modeled with a normal distribution.
Let’s work on some example problems in R.
Suppose the age of a population is normally distributed, with almost everyone between 12 and 72. The mean of the population age is 40 and the standard deviation is 8. What is the probability of a random person being 35 years old?
R has the ‘dnorm’ function, which takes the value of the random variable (in this example 35), the mean, and the standard deviation as inputs and gives you the PDF.
dnorm(35, mean = 40, sd = 8)
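The PDF at 35 is 0.041 in the above-mentioned normal distribution. Because age is continuous, this is a density rather than an exact probability, but it is a good approximation of the probability that a random person’s age falls in the one-year interval around 35.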
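One way to see what the ‘dnorm’ value means is to compare it with the probability of the age falling in a one-year window around 35. This is only a sketch: the window from 34.5 to 35.5 is my own choice for illustration.

```r
mu <- 40
sigma <- 8

# Height of the bell curve (the density) at age 35
density_35 <- dnorm(35, mean = mu, sd = sigma)

# Probability that the age falls between 34.5 and 35.5,
# computed as a difference of two CDF values
p_window_35 <- pnorm(35.5, mean = mu, sd = sigma) - pnorm(34.5, mean = mu, sd = sigma)

print(c(density_35, p_window_35))
```

The two numbers should be almost the same, because the area of a narrow strip of width 1 is roughly its height times 1.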
What is the probability that a random person is at most 30 years old?
That means the probability of a random person being 30 years old or younger. This is the cumulative probability of all ages up to 30, that is, the area under the curve to the left of 30. We will use the ‘pnorm’ function, which calculates the CDF of a normal distribution.
pnorm(30, mean = 40, sd = 8)
What is the probability that a random person is at least 30 years old in the same distribution?
Here we need to calculate the probability of a random person being 30 years old or older, that is, the area under the curve to the right of 30.
As we know, the total probability is always 1. If we subtract the CDF at 30 (the probability of being 30 or younger) from the total probability, we will get the probability of being 30 or older. Because age is treated as continuous here, the probability of being exactly 30 is zero, so we subtract the CDF at 30 itself.
1 - pnorm(30, mean = 40, sd = 8)
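As a quick sanity check, for a continuous variable the "at most 30" and "at least 30" probabilities at the same cutoff should add up to 1. The sketch below uses the ‘lower.tail = FALSE’ argument of ‘pnorm’ to get the right-hand side directly; the variable names are my own.

```r
mu <- 40
sigma <- 8

# Area to the left of 30: probability of being 30 or younger
p_at_most_30 <- pnorm(30, mean = mu, sd = sigma)

# Area to the right of 30: probability of being 30 or older
p_at_least_30 <- pnorm(30, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(c(p_at_most_30, p_at_least_30), 4))  # 0.1056 0.8944
print(p_at_most_30 + p_at_least_30)              # 1
```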
This type of analysis gives a much clearer idea about a population than just the mean and standard deviation, right?
Have you heard of the 68–95–99.7 rule?
This rule of the normal distribution makes probabilistic inference a lot easier.
68.27% of the values lie within one standard deviation of the mean.
95.45% of the values lie within two standard deviations of the mean.
99.73% of the values lie within three standard deviations of the mean.
Let’s test it!
Calculate the probability of a random person being within two standard deviations of the mean.
Here we need to calculate the probability of a random person being between mean - 2*sd and mean + 2*sd. So, this is also a CDF calculation. If we first find the CDF at mean + 2*sd and then subtract the CDF at mean - 2*sd, that should give us the probability in between.
mu = 40
sigma = 8
pnorm(mu + 2*sigma, mean = mu, sd = sigma) - pnorm(mu - 2*sigma, mean = mu, sd = sigma)
The result is 0.9545, or 95.45%. Do you see it? For a perfectly normal distribution, 95.45% of the population or values lie within two standard deviations of the mean. You can verify the other two in the same way. Please try it yourself.
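If you want to check all three parts of the rule at once, one possible sketch is to let k run over 1, 2, and 3 standard deviations. The names ‘k’ and ‘within_k’ are my own.

```r
mu <- 40
sigma <- 8
k <- 1:3  # number of standard deviations on each side of the mean

# Probability of falling between mu - k*sigma and mu + k*sigma
within_k <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

print(round(100 * within_k, 2))  # 68.27 95.45 99.73
```

The same percentages come out for any mean and standard deviation, which is why the rule is so handy.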
All this time we calculated the probabilities for given values of a random variable.
Now we will do the reverse.
That means the probability will be given and we will find the value of the random variable that matches it. Here is an example.
Suppose you have a cumulative probability of 75% and you need to know the age whose CDF is 75%. It will be clearer after we do the calculation.
qnorm(0.75, mean = 40, sd = 8)
The result is about 45.4. That means the CDF at age 45.4 is 75%. In more detail, the probability of a random person being 45.4 years old or younger is 75%.
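Since ‘qnorm’ is the inverse of ‘pnorm’, feeding its result back into ‘pnorm’ should return the probability we started with. A minimal sketch, with a variable name of my own:

```r
# Age below which 75% of the population falls (the 75th percentile)
age_75 <- qnorm(0.75, mean = 40, sd = 8)
print(age_75)  # about 45.4

# Going back the other way recovers the original probability
print(pnorm(age_75, mean = 40, sd = 8))  # 0.75
```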
Lastly, I want to demonstrate how to generate a set of normally distributed data and plot them.
Suppose we want to generate 1000 random numbers from a normal distribution that has a mean of 75 and a standard deviation of 11.
y = rnorm(1000, mean = 75, sd = 11)
This line of code will generate a set of a thousand numbers drawn from a normal distribution with a mean of 75 and an sd of 11, and store them in y. I am not showing the output here because it will take too much space. Instead, let’s plot it and I will show the plot here.
I will round the numbers to whole numbers before plotting, so that repeated values can be counted and the plot looks like a smoother curve.
y = round(y)
plot(table(y), type = "h")
Here, the table(y) gives you the frequency of each number. Please check the output of the table(y), if this is new to you.
This curve will look more normal and smoother if you generate more data.
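To see the effect of generating more data without looking at plots, here is a small sketch that compares the sample mean and standard deviation for a few sample sizes. The seed and the sample sizes are arbitrary choices of mine, so your exact numbers will differ if you change them.

```r
set.seed(123)  # arbitrary seed, only to make the run repeatable

sizes <- c(100, 1000, 100000)

# For each sample size, draw the sample and record its mean and sd
stats <- sapply(sizes, function(n) {
  y <- rnorm(n, mean = 75, sd = 11)
  c(n = n, mean = mean(y), sd = sd(y))
})

print(round(stats, 2))
```

The larger the sample, the closer the sample mean and sd tend to get to 75 and 11.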
Exponential Continuous Distribution
The range of the exponential distribution is from zero to positive infinity. It is defined by a single parameter, the rate: the mean number of occurrences per unit of time, denoted by lambda.
This distribution is commonly used in queuing theory for the distribution of waiting times. It is mainly about the amount of time until a certain event occurs: the length of time between patients arriving at a hospital, the time between sales calls, or the amount of time until an earthquake happens.
The PDF can be calculated manually using the following formula:
f(x) = lambda * e^(-lambda * x) for x ≥ 0, and f(x) = 0 for x < 0.
In this article, I will use R to calculate the PDFs and CDFs.
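For reference, the standard formulas for the exponential distribution are lambda * exp(-lambda * x) for the PDF and 1 - exp(-lambda * x) for the CDF. The sketch below checks them against R’s ‘dexp’ and ‘pexp’; the values of ‘lambda’ and ‘x’ are arbitrary ones I picked for illustration.

```r
lambda <- 25  # rate: mean number of occurrences per unit of time
x <- 0.05     # a waiting time, in the same unit of time

# Manual formulas
pdf_manual <- lambda * exp(-lambda * x)
cdf_manual <- 1 - exp(-lambda * x)

# Built-in R functions
pdf_r <- dexp(x, rate = lambda)
cdf_r <- pexp(x, rate = lambda)

print(all.equal(pdf_manual, pdf_r))  # TRUE
print(all.equal(cdf_manual, cdf_r))  # TRUE
```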
Let’s work on an example to learn it better.
Suppose 25 customers arrive per hour at a retail store on average. If a customer arrived just now, what is the probability that the next customer will arrive within the next 3 minutes?
Here, the rate is 25 per hour. So, we need to be careful about the unit: 3 minutes is 3/60 of an hour.
Notice the question asks about any time within the next 3 minutes, so we need to calculate the CDF. R has the ‘pexp’ function to do that for an exponential continuous distribution.
pexp(3/60, rate = 25)
What is the probability of the next customer arriving in the next 3 to 7 minutes?
For this, we need to calculate the CDF at 7 minutes and subtract the CDF at 3 minutes from it. That gives us the probability in between.
pexp(7/60, rate = 25) - pexp(3/60, rate = 25)
The probability of the next customer arriving between 3 and 7 minutes from now is about 23.24%.
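The unit only has to be consistent between the time and the rate. As a sketch, the same answers come out if we keep the time in minutes and convert the rate to customers per minute instead. The variable names are my own.

```r
rate_per_hour <- 25
rate_per_minute <- rate_per_hour / 60

# Next customer within 3 minutes, computed in hours and in minutes
p_3_hours <- pexp(3 / 60, rate = rate_per_hour)
p_3_minutes <- pexp(3, rate = rate_per_minute)
print(round(c(p_3_hours, p_3_minutes), 4))  # 0.7135 0.7135

# Next customer between 3 and 7 minutes from now, in minutes
p_3_to_7 <- pexp(7, rate = rate_per_minute) - pexp(3, rate = rate_per_minute)
print(round(p_3_to_7, 4))  # 0.2324

# Average waiting time between customers, in minutes: 1 / rate
print(1 / rate_per_minute)  # 2.4
```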
I tried to explain three very common continuous probability distributions using some simple examples in R. There are many other types of distributions available. It’s pretty hard to learn and remember all the probability distributions. My approach is to learn the most common ones and look up the rest in books or on Google whenever necessary. Please check the discrete probability distributions in the article I mentioned in the beginning if you haven’t already. They are also very commonly used probability distributions.