Descriptive Statistics Using Microsoft Excel
These tutorials briefly explain the use and interpretation of standard statistical analysis techniques. The examples include how-to instructions for Excel. Although there are different version of Excel in use, these should work about the same for most recent versions. They also assume that you have installed the Excel Analysis Pak which is free and comes with Excel (Go to Tools, Addins... if it is not already installed in your version of Excel.)
See www.stattutorials.com/EXCELDATA for files mentioned in this tutorial, � TexaSoft, 2008
Descriptive Statistics in ExcelUsually when you create or acquire a set of numbers you will want to examine the data to learn more about its distribution and to discover information such as the minimum and maximum values and to determine if there are outliers. This is an important step in any analysis since it helps you understand if you data meet assumptions required by other analyses such as t-tests and regression.
For this example, we�ll look at the data set called EXAMPLE.XLS. The first few records are shown here:
5As an initial example, we�ll examine the variable AGE.
1. In Excel, select Tools/Data Analysis/Descriptive Statistics. (If the Data Analysis option is not on your Tools menu, you must first install it using Tools/Add ins�)
2. Select the input range for the AGE variable. In this case it is $B$2:$B$51.
3. Be sure to select the check boxes Summary Statistics and Confidence level for mean (95% is okay). The output created is shown here:
0.689499422Information you should notice includes:
1. Search for outliers: Look at the Minimum and Maximum values to see if these values fall within your expected range for these data. If a value is unexpectedly small or large, you should examine your original data to see if it was miscoded. If there are corrections that need to be made, make them before continuing. If you have values that are unexpectedly large are small, but are actual values, it may indicate that your data are not normally distributed. In that case you may consider using nonparametric procedures in further analyses. It may also indicate that the mean is not the best value to report to describe the central tendency of this data set.
2. Symmetry: Another measure that helps you decide normality is Skewness and Kurtosis. The Skewness measure indicates the level of non-symmetry. If the distribution of the data are symmetric then skewness will be close to 0 (zero). The further from 0, the more skewed the data. A negative value indicates a skew to the left. How do you tell if the skewness is large enough to case concern. Excel doesn�t give you this value, but a measure of the standard error of skewness can be calculated as =SQRT(6/N) or =SQRT(6/50) which is 0.346. If the skewness is more than twice this amount, then it indicates that the distribution of the data is non-symmetric. In this case 0.346 * 2 = 0.69. The skewness reported by Excel is -0.512, so the data can be assumed to be fairly symmetric (although somewhat marginally so.) However, this does NOT indicate that the data are normally distributed.
3. Kurtosis is a measure of the peakedness of the data. Again, for normally distributed data the kurtosis is 0 (zero). As with skewness, if the value of kurtosis is too big or too small, there is concern about the normality of the distribution. In this case, a rough formula for the standard error for kurtosis is =SQRT(24/N) = 0.692. Twice this amount is 1.39. Since the value of kurtosis falls within two standard errors (-0.26) the data may be considered to meet the criteria for normality by this measure. These measures of skewness and kurtosis are one method of examining the distribution of the data. However, they are not definitive in concluding normality. You should also examine a graph (histogram) of the data and consider performing other tests for normality such as the Shapiro-Wilk or Kolmogorov-Smirnov test (not provided by Excel).
4. Estimate of central tendency: For normally distributed data the mean (arithmetic average) is the typical value to use in a report or journal article. However, this point estimate is of limited value without some estimate of the varibility of the data. Therefore you should at least report three values � the mean, the standard error of the mean, and the sample size. In this case, those values are 10.5, 0.34 and 50. (Typically report means to one more decimal place than what was measured and the standard error to two decimal places beyond the unit measure.) Many journals also prefer that you report a confidence limit on the mean. This is calculated (estimated) as MEAN + or � t(.05,N-1) * SEM. The value of t is reported by Excel as the confidence level = 0.6895. Thus, the lower bound for the 95 % CI is given as = 10.46 � 0.6895 = 9.77 and the upper limit � 10.46 + 0.6895 = 11.15. The median is another measure of central tendency and is usually reported when the data are not normally distributed. The mode, or the most frequent value, is a third measure of central tendency. In the case of the median or mode, the range is often given as a measure of variability, although a better measure is the interquartile range (not reported by Excel).
5. Measures of variability: The measures of variability reported in Excel�s descriptive statistics include the standard error, the standard deviation and the variance as well as the range. The first three measures are related in the following ways:
Standard Deviation = SQRT(Variance)
Standard Error = Standard Deviation / SQRT(N)
6. Coefficient of Variation: Another measure often reported is the coefficient of variation. This measure provides a unitless measure of the variation of the sate by translating it into a percentage of the mean value. This measure not provided by Excel, but is easily calculated by the formula:
CV = (Standard Deviation / Mean) * 100
For this data, the value, expressed as a percentage, is
CV = (0.343/10.46) * 100 = 3.279
This CV value can be used when comparing two samples that have different means and standard deviations. When the mean is close to 0, the CV value becomes of little use.
7. Visualizing your data: It is always a good idea to examine your data visually to understand what�s going on. To produce a histogram of the AGE variable, select Tools/Data Analysis/Histogram. Select the range for the AGE variable, $B$2:$B$51, and check the option Chart output. This produces the following table and chart:
This is not the best looking histogram because Excel selects its own bin sizes (class intervals) on the X-axis, which in this case are less than desirable. You can improve on the plot by creating a list of bin sizes such as in the table below:
15.00Redo the histogram, this time indicating the bin sizes as the �Bin Range.� This produces the following histogram:
This plot is much cleaner looking and suitable for reporting. It shows that the data are visually not symmetric with more values appearing to the right of the mean (10.46) than we would expect in normally distributed data.
This tutorial shows ways for you to examine a variable for normality, symmetry, and to visually inspect the distribution of your data. These are important beginning exercises you should perform on your data before using it any other analysis.
End of tutorial
For more information... we recommend:
- WINKS -- a simple to use and affordable statistical software program that will help you analyze, interpret and write-up your results. Download a free trial copy. WINKS reads Excel files, so you can read your Excel data into WINKS and do many more types of analysis than in Excel alone.
- SAS Essentials: Mastering SAS for Research
- provides an introduction to SAS statistical software, the premiere statistical data analysis tool for scientific research. Through its straightforward approach, the text presents SAS with step-by-step examples. SAS Essentials introduces a step-by-step approach to mastering SAS software for statistical data analysis. It's also a valuable reference tool for any researcher currently using SAS. Designed for those new to SAS and filled with illustrative examples, the book shows how to read, write and import data; prepare data for analysis; use SAS procedures; evaluate quantitative data; analyze counts and crosstabulation tables; and compare means using the t-test. The book also provides instruction and examples on analysis of variance, correlation and regression, nonparametric analysis, logistic regression, creating graphs, controlling outputs using ODS, as well as advanced topics in SAS programming.ISBN: 0470461292. Order from publisher -Jossey-Bass/Wiley. Barnes & Noble. Amazon.
- Against All Odds VIDEOS - Now in DVD format -- Teaching Videos from Annenberg/PBS --"This highly engaging primer on statistical methods and inference introduces the practical applications of statistics. Produced by the Consortium for Mathematics and Its Applications and Chedd-Angier." Click here for info
- BeSmartNotes Reference sheets for SAS, SAS ODS, SAS Functions, SPSS and WINKS - Click here for info. (www.besmartnotes.com)
- Statistical Analysis Quick Reference Guidebook: With SPSS Examples is a practical "cut to the chase" handbook that quickly explains the when, where, and how of statistical data analysis as it is used for real-world decision-making in a wide variety of disciplines. It contains examples using SPSS Statistics software. In this one-stop reference, the authors provide succinct guidelines for performing an analysis, avoiding pitfalls, interpreting results, and reporting outcomes. Paperback. Sage Publishers ISBN: 1412925606 Order book from Amazon