## Basic Continuous Distributions

26 Jan 2015


### Continuous Distributions

As a Data Scientist/Analyst, a huge part of the job is to characterize the data you’re dealing with. This abstraction step can help you apply functions to your data, since the overall structure has been smoothed out and simplified. Similarly, it can make communicating the shape of your data much easier, since by this characterization you can remove unhelpful noise in the data.

The main method of data characterization is to use a well-known probability distribution to describe it. Doing so can add extra insight into your data, help you communicate the structure of your data, and give you a tailored toolset for working with the data. You’ll want to be sure your data set is a true fit, but if it is, then there is great benefit to abstracting your data in this way.

In this post, we’ll look at 4 common continuous distributions.

### The Uniform Distribution

The Uniform Distribution is the easiest distribution to understand: every value in its range is equally likely, so the probability density is constant. If the minimum value is $$a$$ and the maximum value is $$b$$, then the CDF will look like this:

$CDF(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x - a}{b - a} & \text{for } a \le x < b \\ 1 & \text{for } x \ge b \end{cases}$

The discrete version of this distribution is all too common in probability lectures; just think back to all of those coin flips, die rolls, and card draws. As the number of outcomes grows beyond 2, 6, and 52, however, the distribution can eventually be treated as continuous. You might object that even 1,000,000 discrete outcomes would not a continuous distribution make, but consider that in data science we are usually measuring real-world values, and since our measurements are not infinitely accurate, there will necessarily be some level of discretization in nearly everything we do anyway.
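The piecewise CDF above translates directly into code. A minimal sketch (the function name is my own, not from any library):

```python
def uniform_cdf(x, a, b):
    """CDF of a continuous uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# The midpoint of [a, b] always has cumulative probability 0.5.
print(uniform_cdf(5, 0, 10))  # 0.5
```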

### The Exponential Distribution

The Exponential Distribution is a good model for independent events that occur at random intervals at a constant average rate over some span of time. In reality, a truly constant rate can be hard to find, but narrowing the time span or accounting for confounding effects in your analysis can help correct these issues. For example, incoming phone calls to a call center won’t arrive at a constant rate over a whole day, but the incoming calls from 2pm to 4pm on a weekday probably come close.

The CDF of an exponential distribution is

$CDF(x) = 1 - e^{-\lambda x}$

and the $$\lambda$$ parameter (the event rate) defines the shape of the distribution. The mean of an exponential distribution is $$\frac{1}{\lambda}$$. Using the call center example from earlier, if phone calls arrive at an average of 1 call per 2 minutes, then the distribution of wait times between calls (with minutes on the x axis) would match the exponential distribution where $$\lambda = 0.5$$.
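We can sanity-check the call center numbers with Python's standard library, which samples the exponential distribution directly via `random.expovariate`:

```python
import random

random.seed(0)
LAM = 0.5  # 1 call per 2 minutes -> lambda = 0.5 calls per minute

# Simulate 100,000 wait times (minutes between incoming calls).
waits = [random.expovariate(LAM) for _ in range(100000)]

# The sample mean should land near 1/lambda = 2 minutes.
mean_wait = sum(waits) / len(waits)
print(mean_wait)
```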

### The Pareto Distribution

The Pareto Distribution looks similar to the exponential distribution, but it has a heavier tail, meaning that as you move along the x-axis of the PDF, the probabilities don’t shrink as rapidly (meaning that the CDF doesn’t trend to 1 as rapidly). The reason for that is clear when looking at the CDF for this distribution. Unlike the exponential CDF, which approaches 1 at (gasp!) an exponential rate, the Pareto CDF approaches 1 at only a polynomial, power-law rate.

$CDF(x) = 1 - \left(\frac{x_m}{x}\right)^\alpha \text{ for } x \ge x_m, \text{ where } x_m \text{ is the minimum value of the distribution}$

Just as $$\lambda$$ defined the shape of the exponential distribution, here $$\alpha$$ determines the shape. Since the Pareto distribution is so heavy-tailed, there is a range of real-world examples that typically follow it. The most common are resources that follow the 80-20 rule, where 80% of the resources are owned by 20% of the people. This rule extends to many other common real-world examples. The fact that the 80-20 rule follows a power-law distribution earned it its apropos alternate name, the Pareto principle.
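Python's standard library can sample a Pareto distribution too: `random.paretovariate(alpha)` draws from a Pareto with $$x_m = 1$$, so its CDF simplifies to $$1 - x^{-\alpha}$$. A quick check of the empirical CDF against that closed form (the choice $$\alpha \approx 1.16$$ is the shape commonly associated with the 80-20 rule):

```python
import random

random.seed(1)
ALPHA = 1.16  # shape roughly matching the 80-20 rule

# random.paretovariate samples a Pareto distribution with x_m = 1.
samples = [random.paretovariate(ALPHA) for _ in range(100000)]

# Empirical CDF at x = 5 vs. the closed form 1 - (x_m / x)^alpha.
empirical = sum(s <= 5 for s in samples) / len(samples)
theoretical = 1 - (1 / 5) ** ALPHA
print(empirical, theoretical)
```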

### The Normal Distribution

If you were familiar with any continuous distribution prior to reading this, then it was probably the normal distribution. Its PDF is the ubiquitous bell curve that we all know and love. So far we’ve been looking at CDFs, so we’ll stick with that here, too. The normal CDF has a sigmoid shape, which should be just as recognizable to a data scientist as the bell curve. Its shape is defined by its mean and standard deviation, $$\mu$$ and $$\sigma$$, respectively.
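The normal CDF has no elementary closed form, but it can be written in terms of the error function, which the standard library exposes as `math.erf`. A minimal sketch (the function name is my own):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# The sigmoid passes through 0.5 at the mean, and reaches
# ~0.84 one standard deviation above it.
print(normal_cdf(0))  # 0.5
print(normal_cdf(1))  # ~0.8413
```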

#### The Central Limit Theorem

The reason that the normal distribution is so ubiquitous is due to something called the Central Limit Theorem. It states that, under certain conditions, if you take a bunch of values independently from any distribution and sum those values up, then repeat this many times, the collection of resulting sums will be normally distributed. This theorem applies to a wide range of distributions, including all of the ones we’ve looked at up to this point.

For example, if you sum 100 values drawn from a uniform distribution, and then repeat the process 10,000 times, the resulting plot will resemble a normal distribution.

```python
import random

# Generate 10,000 rows of 100 uniformly distributed values each,
# then sum each row.
data = [[random.random() for _ in range(100)] for _ in range(10000)]
sums = [sum(row) for row in data]
```

Running that Python snippet and plotting sums in a histogram produces a bell-shaped curve. This fact is actually quite remarkable, and it’s the reason that the normal distribution shows up in so many different areas. Since we rarely have full knowledge of a process, we are often performing density estimation on unobservable, underlying PDFs.
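Beyond the shape, the CLT also pins down the parameters of the resulting normal: each uniform(0, 1) draw has mean $$0.5$$ and variance $$\frac{1}{12}$$, so the sum of 100 of them should have mean $$50$$ and standard deviation $$\sqrt{100/12} \approx 2.89$$. A quick sketch to check:

```python
import math
import random

random.seed(2)

# Sum 100 uniform draws, repeated 10,000 times.
sums = [sum(random.random() for _ in range(100)) for _ in range(10000)]

mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)

# CLT prediction: mean = 100 * 0.5 = 50, std = sqrt(100 / 12) ~ 2.89
print(mean, math.sqrt(var))
```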

From Wikipedia: “In cases like electronic noise, examination grades, and so on, we can often regard a single measured value as the weighted average of a large number of small effects. Using generalisations of the central limit theorem, we can then see that this would often (though not always) produce a final distribution that is approximately normal.” So even though we may not be able to see the underlying distribution, if we can measure some value that accumulates many underlying factors, then that measurement will probably come out approximately normal, a distribution with which we can work.