Determination of normality in Excel - Pearson's goodness-of-fit test. Pair correlation coefficient in Excel

The PEARSON function (enter PEARSON in English) is designed to calculate the Pearson correlation coefficient r. This function used in work when it is necessary to reflect the degree linear dependence between two data sets. Excel has several functions that can be used to get the same result, but the versatility and simplicity of the Pearson function make it a choice.

How does the PEARSON function work in Excel?

Let's look at an example of calculating the Pearson correlation between two data sets using the PEARSON function in MS EXCEL. The first array represents temperature values, the second is pressure at a certain summer period. An example of a completed table is shown in the figure:

The task is as follows: it is necessary to determine the relationship between temperature and pressure for the month of June.

An example of a solution with the PEARSON function when analyzed in Excel


This indicator -0.14 according to Pearson, which was returned by the function, indicates an unfavorable relationship between temperature and pressure in the early hours of the day.



PEARSON function step by step instructions

The correlation coefficient is the most convenient indicator of the contingency of quantitative characteristics.

Task: Determine linear coefficient Pearson correlations.

Example solution:

Thus, according to the calculation result, the statistical conclusion of the experiment revealed a negative relationship between age and the number of cigarettes smoked per day.

Pearson Correlation Analysis in Excel

Task: schoolchildren were given tests of visual and verbal thinking. The average time for solving test tasks was measured in seconds. The psychologist is interested in the question: is there a relationship between the time it takes to solve these problems?

Example solution: let's present the initial data in the form of a table:


Interpretation of the result of the calculation according to Pearson

The value of the Pearson linear correlation coefficient cannot exceed +1 and be less than -1. These two numbers +1 and -1 are the boundaries for the correlation coefficient. When the calculation results in a value greater than +1 or less than -1, therefore, an error has occurred in the calculations.

If the correlation coefficient in modulus turns out to be close to 1, then this corresponds to high level connections between variables.

If a minus sign is received, then a larger value of one characteristic corresponds to a smaller value of another. In other words, if there is a minus sign, an increase in one variable (sign, value) corresponds to a decrease in another variable. This dependence is called inversely proportional dependence. It is very important to clearly understand these provisions for the correct interpretation of the resulting correlation dependence.

In today's article we'll talk about how variables can be related to each other. Using correlation, we can determine whether there is a relationship between the first and second variable. I hope you find this activity just as fun as the previous ones!

Correlation measures the strength and direction of the relationship between x and y. The figure shows Various types correlations in the form of scatter plots of ordered pairs (x, y). Traditionally, the x variable is placed on the horizontal axis and the y variable is placed on the vertical axis.

Graph A is an example of a positive linear correlation: as x increases, y also increases, and linearly. Graph B shows us an example of a negative linear correlation, where as x increases, y decreases linearly. In graph C we see that there is no correlation between x and y. These variables do not influence each other in any way.

Finally, Graph D is an example of non-linear relationships between variables. As x increases, y first decreases, then changes direction and increases.

The remainder of the article focuses on linear relationships between the dependent and independent variables.

Correlation coefficient

The correlation coefficient, r, provides us with both the strength and direction of the relationship between the independent and dependent variables. Values ​​of r range between - 1.0 and + 1.0. When r is positive, the relationship between x and y is positive (graph A in the figure), and when r is negative, the relationship is also negative (graph B). Correlation coefficient close to zero value, indicates that there is no relationship between x and y graph C).

The strength of the relationship between x and y is determined by whether the correlation coefficient is close to - 1.0 or +- 1.0. Study the following drawing.

Graph A shows a perfect positive correlation between x and y at r = + 1.0. Graph B - ideal negative correlation between x and y at r = - 1.0. Graphs C and D are examples of weaker relationships between the dependent and independent variables.

The correlation coefficient, r, determines both the strength and direction of the relationship between the dependent and independent variables. r values ​​range from - 1.0 (strong negative connection) to + 1.0 (strong positive relationship). When r = 0 there is no connection between the variables x and y.

We can calculate the actual correlation coefficient using the following equation:

Well well! I know this equation looks like a scary mess strange characters, but before we panic, let's apply the exam grade example to it. Let's say I want to determine whether there is a relationship between the number of hours a student devotes to studying statistics and a final exam score. The table below will help us break down this equation into several simple calculations and make them more manageable.

As you can see, there is a very strong positive correlation between the number of hours devoted to studying a subject and the exam grade. Teachers will be very happy to know about this.

What is the benefit of establishing relationships between similar variables? Great question. If a relationship is found to exist, we can predict exam results based on a certain number of hours spent studying the subject. Simply put, the stronger the connection, the more accurate our prediction will be.

Using Excel to Calculate Correlation Coefficients

I'm sure that when you look at these terrible calculations of correlation coefficients, you will be truly delighted to know that Excel program can do all this work for you using the CORREL function with the following characteristics:

CORREL (array 1; array 2),

array 1 = data range for first variable,

array 2 = data range for the second variable.

For example, the figure shows the CORREL function used to calculate the correlation coefficient for the exam grade example.

Task 1.

Using the Pearson test, at a significance level a= 0.05 check whether the hypothesis about normal distribution population X with empirical sample size distribution n = 200.

Solution.

1. Let's calculate and sample standard deviation .
2. Let's calculate the theoretical frequencies, taking into account that n = 200, h= 2, = 4.695, according to the formula
.

Let's create a calculation table (function values j(x) are given in Appendix 1).


i

3. Let's compare empirical and theoretical frequencies. Let's create a calculation table from which we will find the observed value of the criterion :


i
Sum

According to the table of critical distribution points (Appendix 6), by significance level a= 0.05 and the number of degrees of freedom k = s– 3 = 9 – 3 = 6 we find the critical point of the right-hand critical region (0.05; 6) = 12.6.
Since =22.2 > = 12.6, we reject the hypothesis about the normal distribution of the population. In other words, the empirical and theoretical frequencies differ significantly.

Problem 2

Statistical data is presented.

Diameter measurement results n= 200 rolls after grinding are summarized in table. (mm):
Table Frequency variation series of roll diameters

i

xi, mm

xi, mm

Required:

1) compile a discrete variation series, ordering it if necessary;

2) determine the main numerical characteristics of the series;

3) give graphical representation series in the form of a polygon (histogram) of distribution;

4) construct a theoretical normal distribution curve and check the correspondence of the empirical and theoretical distributions using the Pearson criterion. When testing the statistical hypothesis about the type of distribution, accept the significance level a = 0.05

Solution: We will find the main numerical characteristics of a given variation series by definition. The average diameter of the rolls is (mm):
x avg = = 6.753;
corrected dispersion (mm2):
D = = 0,0009166;
corrected mean square (standard) deviation (mm):
s = = 0,03028.


Rice. Frequency distribution of roll diameters

The original (“raw”) frequency distribution of the variation series, i.e. correspondence ni(xi), is distinguished by a fairly large spread of values ni relative to some hypothetical “averaging” curve (Fig.). In this case, it is preferable to construct and analyze an interval variation series, combining frequencies for diameters falling into the corresponding intervals.
Number of interval groups K Let's define it using the Sturgess formula:
K= 1 + log2 n= 1 + 3.322lg n,
Where n= 200 – sample size. In our case
K= 1 + 3.322×lg200 = 1 + 3.322×2.301 = 8.644 » 8.
The width of the interval is (6.83 – 6.68)/8 = 0.01875 » 0.02 mm.
The interval variation series is presented in table.

Table Frequency interval variation series of roll diameters.

k

xk, mm

An interval series can be visually presented in the form of a histogram of frequency distribution.


Rice. Frequency distribution of roll diameters. The solid line is a smoothing normal curve.

The appearance of the histogram allows us to make the assumption that the distribution of roll diameters obeys the normal law, according to which the theoretical frequencies can be found as
nk, theory = n× N(a; s; xk)×D xk,
where, in turn, the smoothing Gaussian curve of the normal distribution is determined by the expression:
N(a; s; xk) = .
In these expressions xk– centers of intervals in the frequency interval variation series.

For example, x 1 = (6.68 + 6.70)/2 = 6.69. As center assessments a and the parameter s of the Gaussian curve can be taken:
a = x Wed
From Fig. it can be seen that the Gaussian normal distribution curve generally corresponds to the empirical interval distribution. However, you should make sure statistical significance this correspondence. To check the correspondence of the empirical distribution to the empirical distribution, we use the Pearson goodness-of-fit criterion c2. To do this, calculate the empirical value of the criterion as the sum
= ,
Where nk And nk,theor – empirical and theoretical (normal) frequencies, respectively. It is convenient to present the calculation results in tabular form:
Table Pearson test calculations


[xk, xk+ 1), mm

xk, mm

nk,theor

Critical value we will find the criterion using the Pearson table for the significance level a = 0.05 and the number of degrees of freedom d.f. = K – 1 – r, Where K= 8 – number of intervals of the interval variation series; r= 2 – number of theoretical distribution parameters estimated based on sample data (in in this case, - options a and s). Thus, d.f. = 5. The critical value of the Pearson criterion is crit(a; d.f.) = 11.1. Since c2emp< c2крит, заключаем, что согласие между эмпирическим и теоретическим нормальным распределением является статистическим значимым. Иными словами, теоретическое нормальное распределение удовлетворительно описывает эмпирические данные.

Problem 3

Boxes of chocolate are packed automatically. According to the random non-repetitive sampling scheme, 130 of the 2000 packages contained in the batch were taken and the following data on their weight was obtained:

It is required to use the Pearson test at a significance level of a=0.05 to test the hypothesis that random value X – the weight of the packages – is distributed according to the normal law. Construct a histogram of the empirical distribution and the corresponding normal curve on one graph.

Solution

1012,5
= 615,3846

Note:

Basically as a variance normal law distributions should take the corrected sample variance. But because the number of observations - 130 is large enough, then the “ordinary” one will do.
Thus, the theoretical normal distribution is:

Interval

[xi ; xi+1]

Empirical frequencies

ni

Probabilities
pi

Theoretical frequencies
npi

(ni-npi)2

Laboratory work No. 6. Testing the hypothesis about the normal distribution of the sample using the Pearson criterion.

The lab work is performed in Excel 2007.

The purpose of the work is to provide skills in primary data processing, constructing histograms, selecting a suitable distribution law and calculating its parameters, checking the agreement between the empirical and hypothetical distribution law using the Pearson chi-square test using Excel.

1. Formation of a sample of normally distributed random numbers with given values mathematical expectation and standard deviation.

Data → Data Analysis → Random Number Generation → OK.

Rice. 1. Dialog box Data analysis

In the window that appears Random number generation enter:

Number of variables: 1 ;

Number of random numbers: 100 ;

Distribution: Normal.

Options:

Average = 15 (expected value);

Standard Deviation = 2 (standard deviation);

Random Scatter: do not fill out(or fill out as directed by the teacher);

Output interval: address of the first cell of the random number array column - $ A$1 . OK.

Rice. 2. Dialog box Random number generation with filled input fields

As a result of the operation Random number generation a column will appear $ A$1: $A$100 containing 100 random numbers.

Rice. 3. Fragment of an Excel sheet of the first few random numbers $A$1: $A$100.

2. Determination of sampling parameters, descriptive statistics

From the Excel main menu, select: Data → Data Analysis → Descriptive Statistics → OK.

In the window that appears Descriptive Statistics enter:

Input interval– 100 random numbers in cells $ A$1: $ A$100 ;

Grouping- by columns;

Output interval– address of the cell from which the table begins Descriptive Statistics - $C$1 ;

Summary statistics- tick. OK.

Rice. 4. Dialog box Descriptive Statistics with completed input fields.

A table will appear on the Excel sheet - Column 1

Rice. 5. Table Column 1 with procedure data Descriptive Statistics.

The table contains descriptive statistics, in particular:

Average– estimation of mathematical expectation;

Standard deviation– estimation of standard deviation;

Excess And Asymmetry– estimates of kurtosis and asymmetry.

The approximate equality of kurtosis and skewness estimates to zero, and the approximate equality of the average estimate to the median estimate gives preliminary grounds for choosingH 0 distribution of elements of the general population is a normal law.

Interval– sample range;

Minimumminimum value random variable in the sample;

Maximummaximum value random variable in the sample.

In a cell F15 - length of partial interval h, calculated as follows:

Number of grouping intervals k in Excel it is calculated automatically using the formula

where the parentheses mean rounding down to the integer part of the number.

In the variant under consideration n = 100 , hence, k = 11 . Really:

This formula is entered in the cell F15: =($D$13-$D$12)/10

Results of the procedure Descriptive Statistics will be required later when constructing a theoretical distribution law.

1.Open Excel

2.Create data columns. In our example, we will consider the relationship, or correlation, between aggression and self-doubt in first-graders. 30 children participated in the experiment, the data is presented in the Excel table:

1 column - subject number

2 column - aggressiveness in points

3 column - diffidence in points

3.Then you need to select an empty cell next to the table and click on the icon f(x) in the Excel panel

4.The function menu will open, you must select among the categories Statistical , and then among the list of functions alphabetically find CORREL and click OK

5.Then a menu of function arguments will open, which will allow you to select the data columns we need. To select the first column Aggressiveness you need to click on the blue button next to the line Array1

6.Select data for Array1 from the column Aggressiveness and click on the blue button in the dialog box

7. Then, similarly to Array 1, click on the blue button next to the line Array2

8.Select data for Array2- column Diffidence and press the blue button again, then OK

9. Here, the r-Pearson correlation coefficient has been calculated and written in the selected cell. In our case, it is positive and approximately equal to 0,225 . This speaks about moderate positive connections between aggressiveness and self-doubt in first-graders

Thus, statistical inference experiment will be: r = 0.225, a moderate positive relationship between the variables was revealed aggressiveness And diffidence.

Some studies require the p-level of significance of the correlation coefficient to be specified, but Excel, unlike SPSS, does not provide this option. It’s okay, there is (A.D. Nasledov).

You can also attach it to the research results.