Section 1.5 Variation of Data (B5)
Knowing the average or center of a data set tells only part of the story. We should also know how the data varies, or spreads. As with averages, there are different ways to approach this.
In this section, we will show how to identify and compute different notions of variation.
Run the following code to download the ncbirths.csv data set:
In 2004, the state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from this data set.
Subsection 1.5.1 Variance and Standard Deviation
Exploration 1.5.1.
Consider the data in the lists L1=[13, 13, 12, 10, 10, 14, 12, 11, 10, 14, 15, 10, 11, 13, 14, 14, 14, 15, 14, 11] and L2=[19, 12, 13, 9, 15, 5, 7, 12, 9, 14, 8, 20, 19, 15, 13, 8, 14, 14, 7, 17].
(a)
Show that the mean, median, and mode for the data in the first list are the same as those for the second list.
(b)
Draw a dot plot or histogram visualizing the data in each list.
Hint. Desmos
(c)
What does the visualization tell you about the data that the measures of center do not?
Activity 1.5.2.
Often we are interested in the spread of values in a dataset. The simplest measure of spread is the range of data: the difference between its largest value (its maximum) and its smallest value (its minimum). For example, the range of the values in \(3,3,6,7,12,12,16,19\) is \(\text{max}-\text{min}=19-3=16\text{.}\)
(a)
Find the maximum, minimum, and range for the values \(9, 2, 3, 5, 3, 6, 2, 8\text{.}\)
(b)
Find the maximum, minimum, and range for the values \(5, 0, 4, 500, 2, 5, 4, 2\text{.}\)
(c)
What do you notice about the previous tasks? How does the range relate (or not relate) to the majority of the data?
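The range computation from the worked example at the start of this activity can be sketched in a few lines of Python. This is only an illustration of max minus min; the variable names are assumptions made for this sketch.

```python
# Range = largest value - smallest value,
# using the worked example values from the start of Activity 1.5.2.
values = [3, 3, 6, 7, 12, 12, 16, 19]

largest = max(values)
smallest = min(values)
data_range = largest - smallest

print(f"max = {largest}, min = {smallest}, range = {data_range}")
# prints: max = 19, min = 3, range = 16
```

Because the range uses only the two most extreme values, changing a single data point can change it drastically.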
Activity 1.5.3. Measuring the Spread of Data.
Intuitively, we likely think of the “spread” as a sort of summed distance of data points from the center of the data. The more points are far away from the center, or the farther away they are, the more “spread” out the data is. We will walk through a formulation of this notion.
(a)
Consider the data set S=[1,2,3,4,5]. Compute the mean of \(S\text{,}\) \(\bar{x}\text{.}\)
(b)
Compute a list of differences of each data point from the mean: \([1-\bar{x}, 2-\bar{x}, 3-\bar{x}, 4-\bar{x}, 5-\bar{x} ]\text{.}\)
(c)
As a proposed measure of spread, compute a sum of all these values: \((1-\bar{x})+ (2-\bar{x})+ (3-\bar{x})+ (4-\bar{x})+ (5-\bar{x}) \text{.}\)
(d)
What do you think of the proposed measure we computed? What, if anything, may be wrong with it?
(e)
To modify this proposed measure of spread, compute a sum of all the differences squared: \((1-\bar{x})^2+ (2-\bar{x})^2+ (3-\bar{x})^2+ (4-\bar{x})^2+ (5-\bar{x})^2 \text{.}\) What is this change meant to fix?
(f)
What do you think of the modified measure we computed compared to what we did in (c)?
(g)
Consider the data set T=[1,1,2,2,3,3,4,4,5,5]. Compute the mean of \(T\text{,}\) \(\bar{x}'\text{.}\) Do you think \(T\) is more spread out, less spread out, or equally spread out compared to \(S\text{?}\)
(h)
Repeat the process done in (e) but for \(T\text{,}\) that is, compute \((1-\bar{x}')^2+ (1-\bar{x}')^2+ (2-\bar{x}')^2+(2-\bar{x}')^2+ (3-\bar{x}')^2+(3-\bar{x}')^2+ (4-\bar{x}')^2+(4-\bar{x}')^2+ (5-\bar{x}')^2+(5-\bar{x}')^2 \text{.}\) How does this value compare to the value found in (e)? What is accounting for this difference?
(i)
Divide the value found in (e) by the number of data points in \(S\) and the value found in (h) by the number of data points in \(T\text{.}\) How do these values compare?
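The steps of this activity can also be traced in code. The following is a minimal Python sketch on a small made-up data set (the values are hypothetical and chosen only for illustration, so that it does not replace the work with \(S\) and \(T\)). It shows the raw deviations cancelling, the squared deviations not cancelling, and the effect of dividing by the number of data points.

```python
from statistics import mean

# Hypothetical data, chosen only for illustration (not from the text).
data = [2, 4, 4, 6, 9]
x_bar = mean(data)

# Differences of each data point from the mean.
deviations = [x - x_bar for x in data]

# Proposed measure 1: sum of the deviations (positives and negatives cancel).
sum_dev = sum(deviations)

# Proposed measure 2: sum of the squared deviations (no cancellation).
sum_sq = sum(d ** 2 for d in deviations)

# Dividing by the number of points removes the dependence on data set size.
mean_sq = sum_sq / len(data)

print("deviations:", deviations)
print("sum of deviations:", sum_dev)
print("sum of squared deviations:", sum_sq)
print("average squared deviation:", mean_sq)
```

The sum of the raw deviations is zero for any data set, which is why it cannot serve as a measure of spread.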
Definition 1.5.1.
The population variance of a data set is a measure of “spread” denoted by
\begin{equation*} V_P:=\frac{\sum (x_i-\mu)^2}{n} \end{equation*}
where \(\mu\) is the population mean and \(n\) is the number of data points. A theoretically useful measure of spread with the same units as the original variable is the population standard deviation, denoted by
\begin{equation*} \sigma:=\sqrt{V_P}=\sqrt{\frac{\sum (x_i-\mu)^2}{n}} \end{equation*}
Naturally, if we have notions of population variance and standard deviation, then we should have corresponding notions of sample variance and standard deviation.
Definition 1.5.2.
The sample variance is denoted by
\begin{equation*} V_S:=\frac{\sum (x_i-\bar{x})^2}{n-1} \end{equation*}
where \(\bar{x}\) is the sample mean and \(n\) is the number of data points in the sample. The sample standard deviation is denoted by
\begin{equation*} s:=\sqrt{V_S}=\sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}} \end{equation*}
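To make the difference between the two definitions concrete, here is a minimal Python sketch that computes both variances and both standard deviations directly from the formulas, then cross-checks them against Python's standard `statistics` module. The data values are hypothetical and the variable names are assumptions made for this sketch.

```python
import math
import statistics

# Hypothetical data, chosen only for illustration (not from the text).
data = [2, 4, 4, 6, 9]
n = len(data)
center = sum(data) / n

# Both definitions share the same numerator: the sum of squared deviations.
sum_sq = sum((x - center) ** 2 for x in data)

pop_var = sum_sq / n          # population variance: divide by n
samp_var = sum_sq / (n - 1)   # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)   # population standard deviation
samp_sd = math.sqrt(samp_var) # sample standard deviation

# Cross-check against the standard library's implementations.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))
assert math.isclose(pop_sd, statistics.pstdev(data))
assert math.isclose(samp_sd, statistics.stdev(data))

print("population variance:", pop_var, " sample variance:", samp_var)
print("population sd:", pop_sd, " sample sd:", samp_sd)
```

The only difference is the denominator, so the sample versions are always slightly larger than the population versions computed on the same values.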
Activity 1.5.4. Simulated Sample Standard Deviation.
One may wonder at this point: why compute population and sample variances or standard deviations differently? Here, we illustrate why.
(a)
Compute the population and sample standard deviations for L=[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5].
(b)
Run the following code, which:
Generates a sample of size 10 from L above and computes the sample standard deviation.
Repeats the above 1000 times.
Plots the distribution of the above 1000 deviations, and computes the average deviation.
(c)
These values are all sample standard deviations. Which of the deviations computed in (a) do they most closely resemble?
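The interactive cell for this activity is not reproduced in this text. As a rough stand-in, the following Python sketch carries out the same simulation steps under two stated assumptions: samples are drawn with replacement, and a fixed random seed is used so the run is repeatable. Neither choice is specified in the activity, and the plotting step is omitted.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed, only so the sketch is repeatable

# The list L from part (a): each of 1..5 repeated five times.
L = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5 + [5] * 5

sample_sds = []
for _ in range(1000):
    # Assumption: draw 10 values with replacement.
    sample = random.choices(L, k=10)
    # statistics.stdev divides by n - 1 (sample standard deviation).
    sample_sds.append(statistics.stdev(sample))

average_sd = statistics.mean(sample_sds)
print("average of the 1000 sample standard deviations:", round(average_sd, 3))
```

Comparing `average_sd` with the two values from part (a) is left to the reader, as in the activity.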
Activity 1.5.5. Birth Weight Standard Deviation.
(a)
Run the following code to display the standard deviation of the birth weight of babies from North Carolina in 2004 in pounds:
(b)
Run the following code to display the histogram of the birth weight of babies from North Carolina in 2004 in pounds, along with the mean, plus and minus 1 standard deviation:
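The interactive cells for this activity are not reproduced in this text. As a hedged illustration of the computation behind the "mean plus and minus 1 standard deviation" markers, the Python sketch below uses a short list of made-up weights (hypothetical values, not taken from ncbirths.csv) and reports which share of them falls inside that band.

```python
import statistics

# Made-up weights in pounds, for illustration only (NOT from ncbirths.csv).
weights = [6.2, 7.1, 7.4, 7.9, 8.3, 6.8, 5.5, 7.6, 8.8, 7.0]

center = statistics.mean(weights)
sd = statistics.stdev(weights)  # sample standard deviation (divides by n - 1)

# The band that would be marked on the histogram.
low, high = center - sd, center + sd

# Share of the data that lies within one standard deviation of the mean.
within = [w for w in weights if low <= w <= high]
share = len(within) / len(weights)

print(f"mean = {center:.2f}, sd = {sd:.2f}")
print(f"band = [{low:.2f}, {high:.2f}], share inside = {share:.0%}")
```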
Activity 1.5.6. Standard Deviation of Variable of Choice.
(a)
Follow this link and identify a numerical variable whose standard deviation you wish to find: https://www.openintro.org/data/index.php?data=ncbirths
(b)
Fix the following code to display the standard deviation of a variable of your choosing:
(c)
Fix the following code to display the histogram of your chosen variable, along with the mean, plus and minus 1 standard deviation:
Subsection 1.5.2 Quartiles and IQR
Activity 1.5.7. Introducing Quartiles.
Consider the set of data [40, 19, 33, 42, 48, 23, 13, 16, 47, 6, 32, 9, 12, 31, 4, 43, 49, 25, 37, 26]
(a)
Write out this list in order and identify the median. (Not just the value but specifically which term, or between which terms). Label this \(Q_2\text{.}\)
(b)
If the median is a part of the data set, then remove it, and separate the terms before and after \(Q_2\) into two separate, equally sized sets.
(c)
Find the median of the set of data before \(Q_2\text{.}\) Call this value \(Q_1\text{.}\)
(d)
Find the median of the set of data after \(Q_2\text{.}\) Call this value \(Q_3\text{.}\)
(e)
What proportion of the data is less than \(Q_1\text{?}\) Between \(Q_1\) and \(Q_2\text{?}\) Between \(Q_2\) and \(Q_3\text{?}\) Greater than \(Q_3\text{?}\)
(f)
Compute the \(IQR\text{,}\) where \(IQR = Q_3 - Q_1\text{.}\)
Definition 1.5.3.
The first quartile, usually denoted \(Q_1\text{,}\) is a value below which 25% of the data lies. The third quartile, usually denoted \(Q_3\text{,}\) is a value above which 25% of the data lies. The interquartile range, \(IQR = Q_3 - Q_1\text{,}\) is the difference between these values and is also a measure of spread.
Remark 1.5.4.
Note that a quarter of the data (hence the name) falls below \(Q_1\text{,}\) a quarter between \(Q_1\) and the median \(Q_2\text{,}\) a quarter between \(Q_2\) and \(Q_3\) and the final quarter above \(Q_3\text{.}\)
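The median-of-halves procedure from Activity 1.5.7 can be sketched in Python as follows. The helper name `quartiles` and the data values are hypothetical, introduced only for this illustration. Software packages often use other conventions (for example, interpolation rules), so their quartiles may differ slightly from the ones this method gives.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values before the median
    upper = ordered[half + n % 2:]  # values after it (skips the middle value when n is odd)
    return median(lower), median(ordered), median(upper)

# Hypothetical data, chosen only for illustration (not from the text).
data = [12, 1, 7, 15, 4, 8, 3, 10, 6]

q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
# prints: Q1 = 3.5, Q2 = 7, Q3 = 11.0, IQR = 7.5
```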
Definition 1.5.5.
A boxplot is a plot summarizing the min, max and 3 quartiles of a data set.
Activity 1.5.8. Identifying Outliers.
Quartiles and the \(IQR\) provide a useful heuristic for identifying outliers in our data. Outliers are data points that are outlandishly distinct from the rest of the data set and can potentially skew the results.
Consider the set of data [2,66,8,14,9,2,3,11,62,8,45,19,12,41,5,5,5,85,52,8,96,5,2,2,15,12,0,1]
(a)
Identify the min, \(Q_1\text{,}\) median, \(Q_3\) and max for this data set. This is called the 5 number summary.
(b)
We define right outliers to be any data points in the set greater than \(Q_3+1.5\cdot IQR\text{.}\)
We define left outliers to be any data points in the set less than \(Q_1-1.5\cdot IQR\text{.}\)
Does this set have any outliers? What (if any) are they?
(c)
Remove the outliers from the set and recompute the min, \(Q_1\text{,}\) median, \(Q_3\) and max for the modified data set.
(d)
Plot a boxplot for the original data set.
(e)
Click Exclude Outliers. How does this boxplot compare to what we found in (c)?
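The \(1.5\cdot IQR\) rule from part (b) translates directly into code. The following Python sketch applies it to a small made-up data set (hypothetical values, so that it does not replace the work on the activity's data). The helper `quartile_bounds` is an assumption of this sketch and uses the median-of-halves quartiles from Activity 1.5.7.

```python
from statistics import median

def quartile_bounds(data):
    """Return (Q1, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    return median(ordered[:half]), median(ordered[half + n % 2:])

# Hypothetical data, chosen only for illustration (not from the text).
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]

q1, q3 = quartile_bounds(data)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # anything below this is a left outlier
upper_fence = q3 + 1.5 * iqr  # anything above this is a right outlier

outliers = [x for x in data if x < lower_fence or x > upper_fence]
trimmed = [x for x in data if lower_fence <= x <= upper_fence]

print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers)
print("data without outliers:", trimmed)
```

Here the fences are \(-7.75\) and \(22.25\), so the value 40 is the only point flagged as an outlier.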
Activity 1.5.9. Birth Weight Box Plot.
(a)
Run the following code to display a summary of the birth weight of babies from North Carolina in 2004 in pounds:
(b)
Run the following code to display the boxplot for birth weight of babies from North Carolina in 2004 in pounds. Note that this command automatically removed the outliers.
(c)
Look at the 5 numbers listed below the $stats output. This is the 5 number summary of the data set with the outliers removed. What are the min, \(Q_1\text{,}\) median, \(Q_3\text{,}\) max of this data with outliers removed?
Activity 1.5.10. Box Plot of Variable of Choice.
(a)
Follow this link and identify a numerical variable whose box plot you wish to find: https://www.openintro.org/data/index.php?data=ncbirths
(b)
Fix the following code to display a summary of the numerical variable of your choice:
(c)
Fix the following code to display the boxplot of the numerical variable of your choice. Note that this command automatically removes the outliers.
(d)
Look at the 5 numbers listed below the $stats output. This is the 5 number summary of the data set with the outliers removed. What are the min, \(Q_1\text{,}\) median, \(Q_3\text{,}\) max of this data with outliers removed?
Activity 1.5.11. Comparing Subgroups.
Box plots can be used to quickly compare the distribution of a random variable across different groups.
(a)
Run the following code to display boxplots of the birth weight of babies from North Carolina in 2004 in pounds, separated into smoker and nonsmoker categories:
(b)
What do these boxplots tell you about birth weights between smoking and nonsmoking mothers?
Activity 1.5.12. Comparing Subgroups of Chosen Variable.
(a)
Follow this link and identify a numerical variable whose box plots you wish to find, and a categorical variable you wish to separate the data by: https://www.openintro.org/data/index.php?data=ncbirths
(b)
Fix the following code to display boxplots of the numerical variable of your choice, separated by your chosen categorical variable:
(c)
What do these boxplots tell you about your chosen variables?
Activity 1.5.13. Robustness of Measurements.
The mean and standard deviation are the measures typically used for centrality and spread, and they have many theoretical properties that make them well suited for this purpose. One may wonder what purpose the median and \(IQR\) serve. We will explore that here.
Consider the set of data [7,5,1,2,10,9,4,1,8,3,9,1,4,3,4]
(a)
Compute the mean and sample standard deviation, then the median and \(IQR\text{.}\)
(b)
Add the data point 100 to this list and recompute the mean and sample standard deviation, then the median and \(IQR\text{.}\)
(c)
Which values shifted more by adding the extremal value? Which stayed closer to their original values?
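The same before-and-after comparison can be sketched in Python. The data below are made up (hypothetical values, so that the sketch does not replace the work on the activity's data), and the helper `iqr` is an assumption of this sketch that uses the median-of-halves quartiles from Activity 1.5.7.

```python
from statistics import mean, median, stdev

def iqr(data):
    """Interquartile range, Q3 - Q1, using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    return median(ordered[half + n % 2:]) - median(ordered[:half])

def summarize(data):
    """Return (mean, sample sd, median, IQR) for a list of numbers."""
    return mean(data), stdev(data), median(data), iqr(data)

# Hypothetical data, chosen only for illustration (not from the text).
before = [2, 3, 4, 5, 6, 7, 8]
after = before + [100]  # the same data with one extreme value added

mean_b, sd_b, med_b, iqr_b = summarize(before)
mean_a, sd_a, med_a, iqr_a = summarize(after)

print(f"mean:   {mean_b:.3f} -> {mean_a:.3f}")
print(f"sd:     {sd_b:.3f} -> {sd_a:.3f}")
print(f"median: {med_b} -> {med_a}")
print(f"IQR:    {iqr_b} -> {iqr_a}")
```

In this made-up example, the mean and standard deviation move a great deal while the median shifts by only 0.5 and the \(IQR\) does not change at all.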