Descriptive Statistics and the Normal Distribution

Learning Objectives

The principal goal of this chapter is to help you understand and use descriptive statistics with a normal distribution. This chapter will prepare you to:

Key Terms

Introduction

We have seen how nurses in practice, leadership roles, and research can present data and statistical analyses in graphs, charts, and tables. Because the data are presented concisely, these visual displays are very useful, but we lose detail in them, especially about the distribution of data measured at the interval and ratio level (continuous variables). Additionally, visual displays may not be helpful with questions such as: What is the average length of stay in a nursing home? What is the range of comorbid diagnoses among patients admitted to the medical intensive care unit? What is the most frequently reported reason for patients coming to urgent care? In combination with the visual representation of data, we also need an approach that allows exploration and understanding of the details. Descriptive statistics are used to communicate important information about the characteristics of the participants and phenomena in a study.

When the data are measured at the interval or ratio level, it is important to present the distribution of data in terms of central tendency (i.e., the average case) and variability (i.e., the range and spread of the data from the center). For example, Figure 6-1 shows a histogram of incomes of recent graduates in family nurse practitioner (FNP) programs. Questions we might ask about graduates are: What would be the typical or average measurement value of a hypothetical person selected from this group? and How far are data values spread from the average? These are difficult questions to answer with visual displays such as graphs, charts, and tables. We need numerical measures of central tendency and variability so that we can understand the distribution of the data on an objective basis. Descriptive statistics are numeric measurements of central tendency and variability, and they help us explain the data more accurately and in greater detail than visual displays alone.

Figure 6-1. Histogram of incomes of recent graduates in FNP programs.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation. IBM SPSS Statistics software (SPSS). IBM®, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation.

Data may be distributed in many different ways, depending upon where the average is located and how far the data values are spread from that average (variability). The center of the distribution can be located in the middle, or it may be shifted to the left or right. The data can present with a high peak, where most of the data values are close to each other, or the peak may be low and the data points spread far apart. In this chapter, you will learn how to compute and interpret measures of central tendency and variability. Three common measures of central tendency (mode, median, and mean) will be explained first. Then, we discuss four common measures of variability: range, interquartile range, variance, and standard deviation. We will also explain the use of descriptive statistics for understanding normal distributions. In a normal distribution, data values are dispersed in known percentages symmetrically around the center of the distribution. Many statistical procedures that we will discuss in later chapters assume that the data follow a normal distribution.

CASE STUDY
Data from Johnson, K., Razo, S., Smith, J., Cain, A., & Soper, K. (2019). Optimize patient outcomes among females undergoing gynecological surgery: A randomized controlled trial. Applied Nursing Research, 45, 39-44.

In 2019, Kari Johnson and colleagues published a study evaluating whether length of stay, readmission rates, and patient satisfaction changed after implementing an evidence-based education bundle for female patients having gynecological surgery, as compared with standard care (Johnson et al., 2019). A team of nurses, physical therapists, and a surgeon developed the education bundle based on the best available evidence. They used a prospective, comparative, randomized design to test the patient education intervention in a 28-bed medical-surgical unit at a community hospital.

The authors used two tables to describe key demographics and variation in the groups. The first table shows the side-by-side numeric data for the education bundle and the standard education participants. The reported characteristics are age groupings (≤50, 51-64, 65-74, 75-84), race (Caucasian, non-Caucasian), marital status (married, divorced, single, widowed), and tobacco use (yes, no). Both groups have a sample size of 25 participants (n = 25). In the instance of tobacco use, the demographic data for both the intervention group and the control group were yes = 2 and no = 23.

In the second table, the groups are compared for each category to demonstrate if they are similar by using sample size (n), mean (M), and standard deviation (SD). For example, the statistical data in this table for the variable of age is

  • Education bundle: n = 25, M = 2.36, SD = 0.907
  • Standard education: n = 25, M = 2.24, SD = 0.969

Although these tables do not tell us whether the education bundle was superior to standard education for patients who had gynecological surgery, they do provide helpful information for determining whether it is reasonable to compare the group that received the education bundle with the group that received standard care.

Measures of Central Tendency

We often encounter descriptions of central tendency, or averages, in newspapers and journal articles. Here are a few typical examples.

Mean age at index date was 83.2 years (SD = 7.2) among cases and 83.7 years (SD = 7.2) among controls (Morin et al., 2019).

In the total sample of patients with actigraphs, the mean age was 85.5 years (SD = 7.3), 76% were female, the mean Cornell Scale for Depression in Dementia (CSDD) score was 11.2 (SD = 3.7), the mean Mini-Mental State Examination (MMSE) score was 7.6 (SD = 6.0), the mean Mobilization-Observation-Behaviour-Intensity-Dementia-2 (MOBID-2) score was 2.8 (SD = 2.1), and 54.7% had a MOBID-2 score ≥3 (Blytt et al., 2018).

These authors all used a single number to describe an important aspect of the data: an average. There are multiple ways of computing and presenting averages, but we will describe the three most common measures of central tendency: mode, median, and mean.

The Mode

The mode is simply the most frequently occurring number in a given data set. Let us examine the following data set of seven systolic blood pressure (SBP) measurements:

120 114 116 117 114 121 124

Notice that 114 appears twice and all other measurements appear only once. Therefore, the SBP measurement of 114 is the mode in this data set because it is the most frequently occurring value. This particular distribution of SBPs may be described as a unimodal distribution because there is only one mode. However, it is possible to have more than one mode in a data set. To explain, let us examine the following data set:

117 120 114 116 117 114 121 124

This data set has two modes, 114 and 117, because they each appear twice and the others appear only once. A data set with two modes is a bimodal distribution, and a multimodal distribution is a distribution with more than two modes.

As you probably noticed, the mode is useful primarily for variables measured at the nominal level because it is merely the most frequently occurring number in the data set. For example, if we have assigned the following numbers to the sex of participants, 1 for male and 2 for female, and out of a sample of 100 there are 75 females, the mode is 2. The mode will not be useful with continuous levels of measurement, or as the data set gets larger.
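
If you want to check results like these programmatically, here is a minimal Python sketch (standard library only; statistics.multimode requires Python 3.8+) that finds the mode(s) of the two SBP data sets above:

```python
# Minimal sketch: finding the mode(s) of the SBP data sets above.
from statistics import multimode

sbp = [120, 114, 116, 117, 114, 121, 124]
print(multimode(sbp))    # [114] -> unimodal

sbp2 = [117, 120, 114, 116, 117, 114, 121, 124]
print(multimode(sbp2))   # [117, 114] -> bimodal (listed in order of first appearance)
```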

The Median

The median is the exact middle value in a distribution and divides the data set into two exact halves. Let us consider the following data set that consists of five income levels for registered nurses:

35,000 39,500 42,000 47,500 52,000

In this data set, the value of 42,000 is the median, because it divides the data set into exactly two halves, with an equal number of values below and above it. It was easy to find the median in this data set, but notice that this data set is ordered from the smallest to the largest data value. Finding the median may be difficult and misleading if the data values are not ordered consecutively. Consider the following data set:

47,500 39,500 32,000 52,500 42,000

It will not make sense to choose 32,000 and report it as the median, because it is the smallest data value in this data set. Therefore, ordering the data from the smallest to the largest (or vice versa) is the first and the most important step in finding the median of any given data set. After ordering the values, it is easy to see that the median for this data set should be 42,000.

Notice also that the previous two data sets had odd numbers of data values. Finding the median in a data set with an odd number of values is easy because you will end up with an equal number of data values above and below the median. However, it is more challenging to find the median when there is an even number of data values in the data set. Consider the following data set:

24 29 32 35 39 40

The data values represent age in years of six individuals, and there is no number that divides this data set into two exact halves. Theoretically, such a number should be between 32 and 35, leaving three data values above and below it. However, such a number does not actually exist in the data set. In this case, you must calculate the median by summing the two middle numbers, 32 and 35, and dividing the sum by 2. You are basically computing the average of those two middle values as the median, as follows:

Median = (32 + 35) / 2 = 67 / 2 = 33.5

Remember that we still must order the data from the smallest to the largest value before finding a median.
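
A short Python sketch can confirm both cases (odd and even numbers of values); statistics.median sorts the data internally, mirroring the ordering step described above:

```python
from statistics import median

# Odd number of values: the middle value is the median.
incomes = [47500, 39500, 32000, 52500, 42000]
print(median(incomes))   # 42000 (the data are sorted internally)

# Even number of values: the average of the two middle values is the median.
ages = [24, 29, 32, 35, 39, 40]
print(median(ages))      # 33.5 = (32 + 35) / 2
```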

The Mean

The arithmetic mean (often called the average) is the sum of all the data values in a data set divided by the number of data values, as shown in the following equation:

Mean = ΣX / N

where ΣX is the sum of all the data values and N is the number of data values in the data set.

The mean involves the mathematical operations of addition and division, and so is an appropriate measure of central tendency for interval and ratio levels of measurement. For example, we might use it to calculate the average of a patient's SBP over time to determine the effect of a medication. However, it is not possible to find a meaningful interpretation of an arithmetic mean for a categorical, or nominal, variable, such as political affiliation, with categories of Republican, Independent, and Democratic. Let us consider the following data set of sodium content levels, measured in milligrams per liter:

20 18 16 22 27 11

For this data set, the mean will be:

Mean = (20 + 18 + 16 + 22 + 27 + 11) / 6 = 114 / 6 = 19

We have computed a mean of 19 for a group of six sodium content levels. How should we interpret this finding? Remember that the mean is the average score in the data set. Therefore, the mean tells us that, on average, the sodium content level in this data set is 19 mg per liter.

Choosing a Measure of Central Tendency

We have discussed three types of central tendency (the mode, the median, and the mean) and examined how they differ in terms of finding the center of a data distribution. The next legitimate question is: When do we use which measure?

The mode is simply the most frequently occurring data value(s) in the data set. Therefore, it is mainly useful for variables at the nominal level of measurement. Both median and mean are useful when the variable being measured can be quantified at the interval or ratio level. However, one important thing to note here is that the mean is extremely sensitive to unusual cases. To explain this further, let us consider the following data sets:

Data set #1: 108 112 116 120 124

Data set #2: 108 112 116 120 205

In both data sets, the median is 116, as it is the number that divides the data set into two exact halves. However, you will notice that the mean is not identical in both data sets.

For the first data set, the mean is equal to:

Mean = (108 + 112 + 116 + 120 + 124) / 5 = 580 / 5 = 116

whereas the mean of the second data set is equal to:

Mean = (108 + 112 + 116 + 120 + 205) / 5 = 661 / 5 = 132.2

Notice how the mean of the second data set has been influenced by the presence of an unusual case. If we were to say that the mean of 132.2 for the second data set represents a typical case, this would not make much sense, because the majority of data values are less than 120. Therefore, the mean should not be used when unusual, or outlying, data values are present in the data set, as the mean is extremely sensitive to them. Rather, the median should be reported in this case. This is why average housing prices are typically reported as medians: even a single million-dollar house can distort the average housing price when most of the houses are in the range of $200,000 to $350,000.
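
The following minimal Python sketch demonstrates this sensitivity using the two data sets above; the unusual value shifts the mean but leaves the median untouched:

```python
from statistics import mean, median

data1 = [108, 112, 116, 120, 124]
data2 = [108, 112, 116, 120, 205]   # same values except one unusual case

print(mean(data1), median(data1))   # -> mean 116, median 116
print(mean(data2), median(data2))   # -> mean 132.2, median 116: the mean shifts, the median does not
```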

Measures of Variability

Measures of central tendency allow us to know the typical value in the data set. However, we know that when we measure a variable, there will be differences between and among the values in the data set. For example, if we were measuring SBP among a group of research participants, we would expect that there would be a range of values among individuals. Furthermore, we would also expect similar variations in SBP measurements over different times in any individual participant. In other words, some level of variation among data values in any data set is expected. Our understanding of the characteristics of the data is enhanced when we also understand the nature of variability in the data set. For example, if a sample of patients has a mean SBP of 130 mm Hg and most patients are within 10 mm Hg of that value, then that data set will look very different than if most patients are within 50 mm Hg of the mean. Whenever we measure central tendency, we also need to calculate variability to achieve a thorough grasp of the data.

We also use measures of variability to provide information about how well a measure of central tendency represents the middle/average value in the data set. The computed measure of central tendency will be most accurate when the data values vary only a little, but accuracy of the mean declines as the variation in data values increases. There are multiple ways of computing and presenting variability, but we describe the four that are most commonly used: range, interquartile range, variance, and standard deviation.

Range

Range is simply the difference between the largest and the smallest values in the data set. For example, suppose that a researcher measured patients' levels of pain after vascular surgery on a scale of 0 to 10. The data set is:

9 3 2 6 7 8 7 5

The first step is to sort the data from the smallest to the largest values, as it will make our job of finding these two values easy. After sorting, the range of this data set is 9 - 2 = 7.

Range is simple to calculate. As seen in the previous example, the range is calculated simply by subtracting the smallest value from the highest value. However, we should be cautious about using range as a measure of variability, as it only uses the highest and lowest values in computation. In other words, the range is extremely sensitive to unusual data values. Therefore, it does not accurately capture information about how data values in the set differ if the data set contains an unusual value(s).

Consider the following data set:

3 4 2 3 3 4 2 9

This data set is still a collection of pain level measurements of patients who underwent vascular surgery, but notice that the value of 9 seems unusual in this data set. Here, the range is 9 - 2 = 7 after sorting. Does this make sense? Most of the values are between 2 and 4, and claiming that the variability is 7 does not really make sense in the context of this data set. To get around this problem, researchers will sometimes simply report the range as the lowest and highest values (e.g., reports of pain intensity ranged from 2 to 9) rather than computing a range.
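
A quick Python sketch of both ways of reporting the range for the pain data above:

```python
pain = [3, 4, 2, 3, 3, 4, 2, 9]

print(max(pain) - min(pain))   # 7 -> the computed range, inflated by the unusual value 9
print(f"Pain intensity ranged from {min(pain)} to {max(pain)}")   # "from 2 to 9"
```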

Interquartile Range

Interquartile range is the difference between the 75th percentile and the 25th percentile. A percentile is a measure of location that tells us the value below which a given percentage of observations falls. Therefore, the 25th percentile is the value below which the bottom 25% of the data fall, and the 75th percentile is the value below which the bottom 75% fall. As a result, the interquartile range is less sensitive to unusual cases in the data set because, unlike the standard range, it does not use the smallest and largest values. For example, suppose the number of patient falls per week at a local nursing home has been measured:

1 1 2 2 2 3 3 3 4 4 5

Note that the data set has already been sorted from the smallest to the largest. It is easier to find the median first, and then to find the 25th and 75th percentiles, because it is more difficult to directly identify the percentiles.

The median of this data set is 3, because 3 is the middle value that divides this data set into two exact halves. From the median, the 25th percentile is equal to 2 and the 75th percentile is equal to 4, as they divide the lower and upper halves of the data set into two exact halves. The interquartile range is then the difference between the 75th percentile and the 25th percentile, which is 4 - 2 = 2.

Let us now consider the next data set:

1 1 2 2 2 3 3 3 4 4 24

As you can see, it is the same data set as before, except for the highest value, 24, which seems to be an unusual value. Notice that the interquartile range is still 4 - 2 = 2 and is not affected by the unusual data value. Therefore, interquartile range is less sensitive to unusual or outlying values.
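
Here is a small Python sketch of the interquartile range computed the way this chapter does it (the median of each half of the sorted data); note that statistical software packages may use slightly different percentile formulas, so results can differ at the margins:

```python
from statistics import median

def interquartile_range(values):
    # Sort, split at the median, and take the median of each half,
    # matching the approach described in this section.
    s = sorted(values)
    half = len(s) // 2
    return median(s[-half:]) - median(s[:half])

falls = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5]
print(interquartile_range(falls))           # 4 - 2 = 2

falls_outlier = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 24]
print(interquartile_range(falls_outlier))   # still 2: unaffected by the unusual value
```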

Variance and Standard Deviation

Although range provides a rough estimate of the variability of a data set, it does not use all of the data values in computation and is very sensitive to an unusual value in the data set. Interquartile range is an improvement, but it still does not account for every data value in the set. On the other hand, the next two measures of variability, variance and standard deviation, use all of the data values in the set when computing variability and may capture information about variability more precisely than the range or the interquartile range. As standard deviation is simply the square root of variance, we will explain variance first.

Variance is the average squared amount by which data values differ from the mean and is computed with the following formula:

σ² = Σ(X - μ)² / N

In this equation, we compute the difference between each raw value and the mean (X - μ), square the result, sum (Σ) those squared differences, and then divide by the total number of values in the data set (N). Note that the denominator will be changed to n - 1 when working with samples.
Degrees of Freedom

Calculations of variance and many other statistics require an estimate of the range of variability, known as degrees of freedom. For a sample, degrees of freedom are equal to n - 1. Here is an analogy that might help you understand degrees of freedom. Envision a beverage holder from any fast-food restaurant; most of these hold four drinks. In this case, the degrees of freedom would be equal to 4 - 1, or 3. As the sections of the holder are filled, there is freedom to vary the section in which any given drink is placed (top left or top right, for example) until three of the sections are occupied; at that point, there is only one section left where a drink may be placed, and no variation is possible. Each statistical test or calculation has its own degrees of freedom. Watch for these throughout the text.

To compute variance, let us consider the following data set of toddler weights in an outpatient clinic, assuming that the data values were taken from a population:

19 22 24 26 19

The computation steps are shown in Table 6-1.

Table 6-1 How to Compute the Variance

X      X - μ      (X - μ)²
19     -3         9
22     0          0
24     2          4
26     4          16
19     -3         9

μ = (19 + 22 + 24 + 26 + 19) / 5 = 110 / 5 = 22; σ² = (9 + 0 + 4 + 16 + 9) / 5 = 38 / 5 = 7.6

Computed variance for this data set is 7.6. What does this mean? Recall that the values represent toddler weights in an outpatient clinic, measured in pounds. Because the deviation of each observation from the mean has been squared, the unit for the variance is now pounds². What does pounds² mean? If we were to say that data values differ from the mean on average by about 7.6 pounds², would this claim make sense? It probably does not, because there is no such unit as a squared pound.

Why do we then take the square of each deviation if the squared unit will not make sense to interpret at the end? The answer is simple: if you do not square the deviations before summing them, they will always add up to zero, no matter what data set you work with. We suggest that you try this with small data sets in this text or other sources.

How can we then talk about variability if the measure of variability turns out to be equal to zero? This is why we take the square of the deviations to compute the variance first, and then take the square root of that result to compute the standard deviation, bringing us back to the original unit of measurement.

We get a standard deviation of 2.76 by taking the square root of 7.6; we can then say that the data values differ from the mean (22 pounds) by an average of about 2.76 pounds. We can interpret this finding to mean that, on average, about two-thirds of the weights fall between 19.24 and 24.76 pounds. This makes more sense when you look at the data set than the variance does. Note that the mean and standard deviation should always be reported together!
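
The whole computation, including the check that unsquared deviations sum to zero, can be reproduced in a few lines of Python (treating the five weights as a population, as in Table 6-1):

```python
from math import sqrt

weights = [19, 22, 24, 26, 19]       # toddler weights in pounds (a population)
n = len(weights)
mu = sum(weights) / n                # 22.0

deviations = [x - mu for x in weights]
print(sum(deviations))               # 0.0 -> unsquared deviations always sum to zero

variance = sum(d ** 2 for d in deviations) / n   # divide by N for a population
print(variance)                      # 7.6
print(round(sqrt(variance), 2))      # 2.76 -> standard deviation, back in pounds
# For a sample, divide by n - 1 (the degrees of freedom) instead of N.
```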

Choosing a Measure of Variability

We have shown you how to compute four measures of variability (range, interquartile range, variance, and standard deviation) and how they differ. The next question to ask is: When do we use which measure?

You should use the range only as a crude measure because it is extremely sensitive to unusual values in the data set. The interquartile range is not as sensitive to unusual data values, whereas the standard deviation is very sensitive to them. Therefore, the interquartile range should be used with the median when the data contain unusual data values, and the standard deviation should be used with the mean when the data are free of unusual data values.

Obtaining Measures of Central Tendency and Variability in Excel

In order to explain how to analyze data in Microsoft Excel, we will first explain how to enable and use the Data Analysis ToolPak. Go to File > Options, as shown in Figure 6-2. Then, click on the Add-Ins category, select Excel Add-ins in the Manage box, and click Go, as shown in Figure 6-3. Next, check the Analysis ToolPak check box in the Add-Ins box and then click OK (Figure 6-4). You should now see the Data Analysis ToolPak under the Data pull-down menu (Figure 6-5).

Figure 6-2. Selecting Options in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Figure 6-3. Selecting Excel Add-ins in the Manage box in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Figure 6-4. Selecting the Analysis ToolPak from the Add-ins list in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Figure 6-5. The Data Analysis ToolPak shown in the Analysis group under the Data menu in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

To obtain measures of central tendency in Excel, open Weight.xlsx and go to Data > Data Analysis and select Descriptive Statistics from the Analysis Tools list in the Data Analysis dialog box (Figure 6-6). Then, click OK. In the Descriptive Statistics dialog box, enter A1:A159 as the Input Range, check Labels in the first row, Summary Statistics, Kth Largest, and Kth Smallest (Figure 6-7). Clicking OK will then produce the output, as shown in Figure 6-8.

Figure 6-6. Selecting Descriptive Statistics in the Data Analysis dialog box in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Figure 6-7. Defining options in the Descriptive Statistics dialog box in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Figure 6-8. Example output of descriptive statistics in Excel.

Courtesy of Microsoft Excel © Microsoft 2020.

Obtaining Measures of Central Tendency and Variability in IBM SPSS

There are several places in SPSS where you can request measures of central tendency and variability. To obtain these measures, open Weight.sav and go to Analyze > Descriptive Statistics. In the next menu, choose Frequencies (Figure 6-9).

Figure 6-9. Selecting Analyze > Descriptive Statistics > Frequencies in SPSS.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation. IBM SPSS Statistics software (SPSS). IBM®, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation.

Move the variable(s) of interest, as shown in Figure 6-10. Of the three buttons on the right side of the window, select Statistics (Figure 6-11). You can select measures of both central tendency and variability to obtain the measures to suit your needs.

Figure 6-10. The Frequencies window in SPSS.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation. IBM SPSS Statistics software (SPSS). IBM®, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation.

Figure 6-11. The Statistics button in the Frequencies window in SPSS.

Reprint Courtesy of International Business Machines Corporation, © International Business Machines Corporation. IBM SPSS Statistics software (SPSS). IBM®, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation.

The same measures can be obtained by choosing Descriptives or Explore under the Analyze > Descriptive Statistics pull-down menu. Note also that these measures of central tendency and variability can be obtained within windows for several other statistical procedures.
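
If you prefer a scripting environment to Excel or SPSS, the same summary measures can be obtained with pandas. This is a hedged sketch: it assumes the chapter's Weight.xlsx file with a column named Weight (as in the Excel example above) and that the openpyxl package is installed for reading .xlsx files:

```python
import pandas as pd

# Assumes Weight.xlsx (from the Excel example) with a "Weight" column.
weights = pd.read_excel("Weight.xlsx")["Weight"]

print(weights.describe())            # count, mean, std, min, quartiles, max
print(weights.mode())                # mode(s)
print(weights.var(), weights.std())  # sample variance and SD (n - 1 denominator)
```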

Normal Distribution

Descriptive statistics help us understand whether the distribution of a continuously measured variable is normal. Figure 6-12 is an example of a normal distribution of a variable, age. Some notable characteristics of the normal distribution are summarized in the box that follows.

Figure 6-12. Histogram of age with an overlying normal curve.

When a distribution is normal, the distribution is symmetrical and the area on both sides of the distribution from the mean is equal; in other words, 50% of the data values in the set are smaller than the mean and the other 50% are larger than the mean. In a normal distribution, the mean is located at the highest peak of the distribution, and the spread of a normal distribution can be presented in terms of the standard deviation.

Why do we care about this normal distribution so much? The most important reason is that many human characteristics fall into an approximately normal distribution, and the measurement scores are assumed to be normally distributed when conducting most statistical analyses. Therefore, if the variable is not normally distributed, the statistical results may not be trustworthy. We will discuss this more in Chapter 8.

Note that no real data are ever exactly or perfectly normally distributed. If that is so, how do we know whether a collected data set is normally distributed? We can begin with a visual display of the data in a histogram to see whether the data set looks normally distributed. However, a visual check alone may not be sufficient. There are two statistical measures, skewness and kurtosis, that, along with a histogram, allow us to determine whether the data set is approximately normally distributed. Skewness is a measure of whether the distribution is symmetrical or off-center; in a skewed distribution, the probabilities on the two sides of the distribution are not the same. Kurtosis is a measure of how peaked a distribution is. A distribution is generally considered normal when both skewness and kurtosis fall within the -1 to +1 range, and nonnormal if either measure falls below -1 or above +1. Note that these measures can be selected in the same windows as the measures of central tendency and variability that we just discussed.
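
As an illustration, skewness and kurtosis can be computed with SciPy. A caution: different packages (and SPSS) use slightly different formulas, so values may not match SPSS output exactly; the data set below is hypothetical:

```python
from scipy.stats import kurtosis, skew

ages = [62, 65, 67, 68, 70, 70, 71, 72, 73, 75, 78]   # hypothetical ages

print(skew(ages))       # near 0 for a roughly symmetrical distribution
print(kurtosis(ages))   # Fisher definition: 0 corresponds to a normal curve
# If both values fall within -1 to +1, treat the data as approximately normal.
```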

Characteristics of Normal Distribution
  • It is bell-shaped and symmetrical.
  • The area under a normal curve is equal to 1.00 or 100%.
  • 68% of observations fall within one standard deviation from the mean in both directions.
  • 95% of observations fall within two standard deviations from the mean in both directions.
  • 99.7% of observations fall within three standard deviations from the mean in both directions.
  • Many normal distributions exist with different means and standard deviations.

Figure 6-13 shows the percentage of the data set that falls within one, two, and three standard deviations of the mean. If a variable follows a normal distribution, these rules can be applied to understand the distribution of the variable in terms of the mean and the standard deviation. In addition, different normal distributions are produced as the mean and the standard deviation change, as shown in Figures 6-14 and 6-15.

Figure 6-13. Area under a normal distribution.

Figure 6-14. Normal distributions with different means (75 and 79) and the same standard deviation (3.2).

Figure 6-15. Normal distributions with the same mean (75) and different standard deviations (2.4 and 3.8).

We can apply the principles of the normal distribution by computing a standardized score such as a z-score. These standardized scores are useful for comparing scores computed on different scales. For example, let us consider a student who is wondering about the final exam scores in statistics and research courses. The student scored 79 out of 100 on the final exam in the statistics course and 42 out of 60 in the research course. Can the student conclude that their performance was better in statistics? Before drawing such a conclusion, the student will need to examine the distribution of scores on the two final exams. Let us assume that the final exam in statistics had a mean of 75 with a standard deviation of 3, and the final exam in research had a mean of 40 with a standard deviation of 2.5. It seems that the student did better than the average in both classes, but it is still difficult to judge in what course the student performed better. This question cannot be directly answered using different normal distributions because they have different means and standard deviations (i.e., they are not on an identical scale, which is necessary to make direct comparisons).

We somehow need to put these two different distributions on the same scale so that we can make a legitimate comparison of the student's performance; a standard normal distribution is the solution. By definition, a standard normal distribution is one in which all scores have been put on the same scale (standardized). These standardized scores (also known as z-scores) represent how far below or above the mean a given score falls and allow us to determine the percentile/probability associated with a given score.

Figure 6-16 shows a graphical transition from a general normal distribution to a standard normal distribution. Characteristics of the standard normal distribution are summarized in the box that follows.

Figure 6-16. Transition from a general normal distribution (mean = 75, SD = 3.2) to the standard normal distribution (mean = 0, SD = 1): z = (X - μ) / σ = (75 - 75) / 3.2 = 0.

To compute a z-score, you will need two pieces of information about a distribution: the mean and the standard deviation. Z-scores (standardized scores) are computed using the following equation, in which the population mean (μ) is subtracted from the raw score (X) and the result is divided by the population standard deviation (σ). Z-scores are calculated so that positive values indicate how far a score is above the mean, and negative values indicate how far a score falls below the mean. Whether positive or negative, larger z-scores (in absolute value) mean that scores are far away from the mean, and smaller z-scores mean that scores are close to the mean:

z = (X - μ) / σ

Z-scores will be positive when a student performs better than the mean on a test (the numerator of the previous equation will be positive). In contrast, z-scores will be negative when a student performs below the mean. Let us consider an example test, again a statistics final exam, with a mean of 78 and a standard deviation of 3. Suppose that Brian has a final exam score of 84. His z-score will be:

z = (84 - 78) / 3 = 6 / 3 = 2

Characteristics of the Standard Normal Distribution
  • The standard normal distribution has a mean of 0 and a standard deviation of 1.
  • The area under the standard normal curve is equal to 1.00 or 100%.
  • Z-scores have associated probabilities, which are fixed and known.

What does Brian's z-score of 2 mean in terms of his performance relative to the average person who took this statistics final exam? First, we can see that Brian performed better than the average person on this final exam. Second, his z-score of 2 tells us that his score is two standard deviations above the mean of 0 (i.e., the average raw score of 78), because the standard normal distribution has a standard deviation of 1. However, this second point about Brian's score may not make perfect sense yet. From Figure 6-17, we can see that Brian seems to have performed better than some students in his class. However, we still do not know exactly how much better he did. To find the exact percentile rank, we need to use a z table, as shown in Figure 6-18. Steps in using the z table to find a corresponding percentile rank are summarized in the box that follows.

Figure 6-17. Brian's z-score of +2 on the standard normal curve.

Figure 6-18. The z table, with the corresponding area under the standard normal curve.

Using the z Table to Find a Corresponding Percentile Rank of a Score
  • Convert Brian's final exam score to a corresponding z-score.
  • Locate the row in the z table for a z-score of +2.00. Note that the z-scores in the first column are shown to only the first decimal place. Also, locate the column for 0.00 so that you get 2.00 when you add 2.0 and 0.00.
  • Brian's z-score of +2.00 gives a probability of 0.9772 to the left.
  • Therefore, Brian's final exam score corresponds to the 98th percentile. Brian did better than approximately 98% of the students in the class.
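
In place of a printed z table, Python's statistics.NormalDist can supply the same left-tail probability; here is a minimal sketch of Brian's example:

```python
from statistics import NormalDist

mu, sigma = 78, 3            # statistics final exam: mean 78, SD 3
z = (84 - mu) / sigma        # Brian's score of 84
print(z)                     # 2.0
print(NormalDist().cdf(z))   # 0.9772... -> about the 98th percentile
```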

Let us consider another example that will help us understand how to find the corresponding probability for a given score. The sodium intakes for a group of cardiac rehabilitation patients are known to have a mean of 4,500 mg/day and a standard deviation of 150 mg/day. Assuming that sodium intake is normally distributed, let us find the probability that a randomly selected patient will have a sodium intake level below 4,275 mg/day. First, we need to convert this value into a z-score. The corresponding z-score for 4,275 mg/day will be:

z = (4,275 - 4,500) / 150 = -225 / 150 = -1.5

Locating the row in the z table for a z-score of -1.5 and the column for 0.00, you should get a probability of 0.0668. Therefore, the probability that a randomly selected patient will have a sodium intake below 4,275 mg/day is 6.68%. How about the probability that a randomly selected patient will have a sodium intake between 4,350 mg/day and 4,725 mg/day? Notice here that we have two scores to transform. The corresponding z-score for the lower level, 4,350 mg/day, will be:

z = (4,350 - 4,500) / 150 = -150 / 150 = -1

and for the upper level, 4,725 mg/day:

z = (4,725 - 4,500) / 150 = 225 / 150 = +1.5

Therefore, we are looking at the area under the normal curve between -1 and +1.5 standard deviations, as shown in Figure 6-19. The probability to the left of +1.5 is 0.9332, and the probability to the left of -1 is 0.1587. To get the probability between -1 and +1.5, we will subtract 0.1587 from 0.9332 and should get 0.7745. Therefore, the probability that a randomly selected patient will have a sodium intake between 4,350 mg/day and 4,725 mg/day will be 77.45%. Finding the corresponding probabilities for a given score can be tricky, so we recommend that you work on as many examples as you can, including what is provided at the end of this chapter.

Figure 6-19. The area under the standard normal curve between -1 and +1.5 standard deviations from the mean.
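
The sodium-intake probabilities can be verified the same way; NormalDist takes the distribution's mean and standard deviation directly, so no manual z conversion is needed:

```python
from statistics import NormalDist

sodium = NormalDist(mu=4500, sigma=150)      # sodium intake in mg/day

print(sodium.cdf(4275))                      # ~0.0668 -> P(X < 4,275), z = -1.5
print(sodium.cdf(4725) - sodium.cdf(4350))   # ~0.7745 -> P(4,350 < X < 4,725)
```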

As a closing note about the standard normal distribution, recall that the following are true when a variable is normally distributed:

  • 68% of observations fall within one standard deviation from the mean in both directions.
  • 95% of observations fall within two standard deviations from the mean in both directions.
  • 99.7% of observations fall within three standard deviations from the mean in both directions.

This means that 68% of the z-scores will fall between -1 and +1, 95% of the z-scores will fall between -2 and +2, and 99.7% of the z-scores will fall between -3 and +3, because the standard normal distribution has a mean of 0 and a standard deviation of 1. This is important because any z-score that is greater than +3 or less than -3 can be treated as unusual.

Confidence Interval

Up to this point, all of the estimates we have calculated have been single numbers. Measures of both central tendency and variability were single numbers, which allowed us to describe the average value of a given variable and the spread of values around that average, respectively. These are called point estimates. However, we may not be lucky enough to hit exactly, or even close to, the actual average in the population, because we are likely to be working with a sample taken from the population. In other words, we can never be sure that our estimates accurately reflect values in the population as a whole, as shown in Figure 6-20.

Figure 6-20. Different sample means (X̄ = 112, 115, 107, 114, and 121) drawn from the same population.

To deal with this problem, we can create boundaries, or a range of estimates, that we think the population mean will fall between, instead of computing a single estimate from a sample; these boundaries are called confidence intervals. A confidence interval is another way of answering an important question: How well does the sample statistic represent the unknown population parameter?

Confidence intervals use confidence levels in their computation. The confidence level is determined by the researcher and reflects, as a percentage, how confident you want to be that the computed interval captures the parameter. Three confidence levels are commonly used: 90%, 95%, and 99% (the 95% confidence level is the most popular choice). What does a confidence interval mean? Let us say that you chose a 95% confidence level to compute a confidence interval; this means that if you were to hypothetically compute 100 confidence intervals, 95 of them would contain the population parameter and 5 would not. Another way of thinking about it is to say that if we calculated 100 confidence intervals, 5 of them would likely not be accurate. There are different equations for different parameters in the computation of confidence intervals, but we will introduce only one here, for a population mean, and focus on how to interpret the computed confidence interval.

Let us assume that you are a nurse educator and want to investigate the average number of hours that nursing students at a local university spent per week studying for statistics. The number of hours is measured on the ratio level of measurement, and we are looking at the mean hours. Because we need to compute a confidence interval for the mean, we will use the following equation:

X̄ ± zα/2 (s / √n)

where X̄ is the sample mean; zα/2 is the corresponding z-value for α/2, where α is equal to 1 - confidence level; s is the sample standard deviation; and n is the sample size.

Let us assume that we obtained a sample of 30 nursing students and that the distribution of the number of hours they study for statistics per week had a mean of 8 and a standard deviation of 2. We want to compute a 90% confidence interval, for which zα/2 = 1.645. Our α is 0.10 because we are using a 90% confidence level, so α/2 is 0.05. The corresponding z-score for the probability closest to 0.95 (1 - .05) inside the z table is 1.645 (the middle value between 1.64 and 1.65, as we cannot find the exact probability in the table). Using 1.645, the 90% confidence interval will be:

8 ± 1.645 (2 / √30) = 8 ± 0.60 → (7.40, 8.60)

We can conclude from this finding that 90% of the time, the mean will fall between 7.40 and 8.60 hours of studying for statistics.

Consider now that you want to compute a 95% confidence interval for the same example. Our zα/2 is 1.96 because α/2 is 0.025 for a 95% confidence level, and the 95% confidence interval will be:

8 ± 1.96 (2 / √30) = 8 ± 0.72 → (7.28, 8.72)

In this case, we can conclude that 95% of the time, the mean hours of studying for statistics fall between 7.28 and 8.72.

How about a 99% confidence interval for the same example? Our zα/2 is 2.58 because α/2 is 0.005 for a 99% confidence level, and the 99% confidence interval will be:

8 ± 2.58 (2 / √30) = 8 ± 0.94 → (7.06, 8.94)

In this example, we can conclude that 99% of the time, the mean hours that students spend studying for statistics is between 7.06 and 8.94.

As you look at these three confidence intervals, you will notice that the confidence interval gets wider as your desired confidence level increases. This makes sense because the wider the confidence interval, the more you can be sure that the interval will include the population parameter.
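
All three intervals can be generated in one loop; this sketch uses the exact z-values from the standard normal distribution (1.645, 1.960, 2.576), so the 99% interval differs from the hand calculation above only in rounding:

```python
from math import sqrt
from statistics import NormalDist

xbar, s, n = 8, 2, 30                 # mean, SD, and sample size of study hours
se = s / sqrt(n)                      # standard error of the mean

for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z-value for alpha/2
    print(level, (round(xbar - z * se, 2), round(xbar + z * se, 2)))
# 0.9  (7.4, 8.6)
# 0.95 (7.28, 8.72)
# 0.99 (7.06, 8.94)
```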

Summary

Descriptive statistics, such as measures of central tendency and variability, help us to understand typical cases in a sample and the distribution of values in a data set more clearly. Measures of central tendency include mode, median, and mean, and these provide us with an idea of what may be the typical/average data value in the data set. The mode should be used only for categorical data, as it basically counts frequency. Because the median is less sensitive to unusual data values, it should be reported when an unusual data value is present in the data set. Otherwise, the mean should be reported, as it possesses statistically preferable characteristics.

Measures of variability include the range, the interquartile range, the variance, and the standard deviation, and these convey the spread of values and give us an idea of the accuracy of the measures of central tendency. The range should be used as a crude measure of variability, as it is extremely sensitive to the presence of unusual data values. The interquartile range should be reported when an unusual or outlying data value is present in the data set. Otherwise, the standard deviation should be reported, as it possesses statistically preferable characteristics.

A normal distribution is a very important probability distribution that can represent many human characteristics, such as height, weight, and blood pressure. Skewness and kurtosis can be used to assess whether a variable is normally distributed; values should be between -1 and +1 in order to be normal. It is important that variables of interest be normally distributed, as most statistical analyses assume a normal distribution.

When a variable is normally distributed, 68% of observations will fall within one standard deviation from the mean, 95% of observations will fall within two standard deviations from the mean, and 99.7% of observations will fall within three standard deviations from the mean. Any value that falls outside of the three standard deviation range can be treated as an unusual value for the data set.

Z-scores are a good example of how we can compute standardized scores to determine where any given score falls in a normal distribution. We can use standardized scores to make comparisons of a single score, such as on a standardized test, with all other scores.

Instead of estimating an unknown population parameter with a single number, or point estimate, we can compute an interval, called a confidence interval, as a different way of answering the question: How well does the sample statistic represent an unknown population parameter? Confidence intervals are interpreted as the interval that will include the true parameter with a given confidence level, usually 90%, 95%, or 99%. As the confidence level goes up (i.e., increased confidence that the mean falls within that range), the likelihood of the confidence interval including the true population parameter increases.

Critical Thinking Questions

  1. What is the purpose of computing descriptive statistics? Why should we include these with visual displays of a data set?
  2. Which measure of central tendency and variability should be reported when an unusual data value is present in the data set? Why?
  3. The 95% confidence interval for sodium content level in 32 nursing home patients is (4,250 mg/day, 4,750 mg/day). What does this confidence interval tell us?

Self-Quiz

  1. True or false: Descriptive statistics are used to summarize characteristics of the sample and the measures in the data set.
  2. True or false: The variance of length of stay at a local hospital is 25. The standard deviation is 5, and this is how each value differs on average from the mean.
  3. Which of the following is not a measure of central tendency?
    1. Mode
    2. Interquartile range
    3. Mean
    4. Median
  4. Find the area under the normal distribution curve in the following locations:
    1. To the left of z = -0.59
    2. To the left of z = 2.41
    3. To the right of z = -1.32
    4. To the right of z = 0.27
    5. Between -0.87 and 0.87
    6. Between -2.99 and -1.34
  5. The average time it takes for emergency nurses to respond to an emergency call is known to be 3 minutes. Assume the variable is approximately normally distributed and the standard deviation is 1 minute. If we randomly select an emergency nurse, find the probability of the selected nurse responding to an emergency call in less than 2 minutes.
  6. Twenty-five local nursing home residents have an average age of 72, and the standard deviation is 8. The director of the nursing home wants to compute a 95% confidence interval to understand the accuracy of an estimate for the average age of all residents. What is the 95% confidence interval?
    1. (65.23, 78.77)
    2. (68.86, 75.14)
    3. (65.00, 74.00)
    4. (62.86, 82.14)

Reference

Blytt, K., Bjorvatn, B., Husebo, B., & Flo, E. (2018). Effects of pain treatment on sleep in nursing home patients with dementia and depression: A multicenter placebo-controlled randomized clinical trial. International Journal of Geriatric Psychiatry, 33, 663-670.

Johnson, K., Razo, S., Smith, J., Cain, A., & Soper, K. (2019). Optimize patient outcomes among females undergoing gynecological surgery: A randomized controlled trial. Applied Nursing Research, 45, 39-44.

Morin, L., Calderón-Larrañaga, A., Welmer, A. K., Rizzuto, D., Wastesson, J. W., & Johnell, K. (2019). Polypharmacy and injurious falls in older adults: A nationwide nested case-control study. Clinical Epidemiology, 11, 483-493. https://doi.org/10.2147/CLEP.S201614