Friday 24 March 2017

Checking for Normality


A normality test is used to determine whether our sample data have been drawn from a normally distributed population.
A number of statistical tests, such as the t-test and the one-way and two-way ANOVA, assume that the sample comes from a normally distributed population.
If the assumption of normality is not valid, the results of those tests will be unreliable.

When do we do a normality test?
Many statistical tests require that our data are normally distributed, so we should always check whether this assumption is violated before running them, for example before a t-test.

Example:

Given a set of data, we would like to check whether its distribution is normal or not.
In this example, the null hypothesis is that the data are normally distributed and the alternative hypothesis is that the data are not normally distributed. The data set can be obtained here.

The data to be tested are stored in the first column.


  •      Select Analyze > Descriptive Statistics > Explore





After clicking, a new window will pop up.
  • From the list on the left side, move the ‘Data’ variable to the “Dependent List”.



  • Click “Plots” on the right side and a new window will pop up. Select ‘None’ under “Boxplots”, untick everything else, but make sure ‘Normality plots with tests’ is ticked.
  • Click “Continue” to return to the Explore window.


  • Last, click “OK” and the results will appear in the output window.



  • Once the normality test results appear, we can interpret them.







The third table in the output reports the two tests that SPSS always runs for normality: the Kolmogorov-Smirnov test and the Shapiro-Wilk test. As a rule of thumb, if the data set has fewer than 30 observations we use the Shapiro-Wilk test, and if it has more than 30 we use the Kolmogorov-Smirnov test. Here we use Shapiro-Wilk, since the data set has 20 observations. The p-value is 0.316 > 0.05, so we fail to reject the null hypothesis: there is no evidence that the data depart from a normal distribution, and we can treat the data set as normally distributed.
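The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb used in this post; the function names and the handling of exactly 30 observations are assumptions, and the p-value still has to come from SPSS (or another package).

```python
def choose_test(n, cutoff=30):
    """Pick the normality test by sample size (rule of thumb from this post).

    Assumption: a sample of exactly `cutoff` observations is sent to
    Kolmogorov-Smirnov, since the post does not say which test to use there.
    """
    return "Shapiro-Wilk" if n < cutoff else "Kolmogorov-Smirnov"


def interpret_normality(p_value, alpha=0.05):
    """Apply the decision rule; the null hypothesis is 'the data are normal'."""
    if p_value > alpha:
        return "fail to reject H0: no evidence against normality"
    return "reject H0: the data depart from normality"


print(choose_test(20))             # the example data set has 20 observations
print(interpret_normality(0.316))  # p-value read from the SPSS output
```

Keeping alpha as a parameter makes it easy to see how the conclusion would change under a stricter or looser significance level.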


By: Nur Fariza

References:

Book

Pallant, J. (2013). SPSS survival manual: A step by step guide to data analysis using SPSS (5th ed.). Maidenhead: Open University Press/McGraw-Hill.



Analyzing Skewness & Kurtosis

Descriptive statistics are used to describe a data set. Many kinds of description are available, such as normality tests and measures of the shape of the distribution. Among the measures of distribution, two indicators are especially important to look at when analysing a data set: skewness and kurtosis.

Skewness
- Is a measure of symmetry, or more precisely the lack of symmetry. A distribution or data set is symmetric if it looks the same to the left and right of the centre point. A skewness value close to zero indicates that the distribution is symmetrical. A large positive value indicates a positively skewed distribution (a longer tail to the right), and a large negative value indicates a negatively skewed distribution (a longer tail to the left).




Kurtosis
- Is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. A kurtosis value close to zero is consistent with a normal distribution. A large positive value indicates a leptokurtic distribution (heavy tails, sharply peaked), while a large negative value indicates a platykurtic distribution (light tails, relatively flat).
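To make the two measures concrete, here is a minimal Python sketch that computes them by hand. It uses the bias-adjusted sample formulas that statistical packages commonly report; that these match the SPSS output exactly is an assumption, and both data sets are made up for illustration.

```python
from math import sqrt


def sample_skewness(data):
    """Adjusted sample skewness: 0 for a perfectly symmetric data set."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def sample_excess_kurtosis(data):
    """Adjusted sample kurtosis minus 3, so a normal distribution scores about 0."""
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]         # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, one large value on the right

print(sample_skewness(symmetric))         # essentially zero
print(sample_skewness(right_skewed))      # positive: tail to the right
print(sample_excess_kurtosis(symmetric))  # negative: flatter than normal
```

Both functions need at least three (skewness) or four (kurtosis) values and some spread in the data, otherwise the denominators are zero.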



How to analyze it?

To see the skewness and kurtosis values, you need to run a descriptive statistics procedure. This is the first step in checking the distribution and normality of the data.

Click 'Analyze' in the toolbar, then 'Descriptive Statistics', and then 'Descriptives'.


A 'Descriptives' box will pop up. Double-click the variable being measured, or drag it, to move it into the 'Variable(s)' box. For this example, let's assume that 'AGE OF RESPOND' is the variable to be tested. Then click the 'Options' button.


In the 'Options' box, tick only the 'Kurtosis' and 'Skewness' boxes, since the objective is to find both of them. You may also select 'Variable list'. Then click 'Continue' to go back to the 'Descriptives' box, and proceed with 'OK'.


The data will be processed and you will get one 'Descriptive Statistics' table in the output. The table shows the skewness and kurtosis values, which describe the shape of the distribution and help you judge the normality of the data.


Once you are familiar with the steps, it is simple to analyse and read the output. The example graphs above can be used as a guideline to judge the normality of the data.


by : Hanis Jefry



References

  1. https://www.researchgate.net/post/What_is_the_acceptable_range_of_skewness_and_kurtosis_for_normal_distribution_of_data
  2. http://libguides.library.kent.edu/SPSS/Explore
  3. http://stats.idre.ucla.edu/spss/output/descriptive-statistics/
  4. http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
  5. https://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_skewness_and_kurtosis.htm

Friday 17 March 2017

Statistics & Basic Terminology

Statistics
- can be defined as the science of collecting, describing and interpreting data.
- provides a process for managing data, including choosing the correct statistical technique.

Inferential Statistics
- draw conclusions from what the results indicate and allow the researcher to generalise the characteristics of a population from the observed characteristics of a sample.

Descriptive Statistics
- describe what happened in a particular study.

Population
- the collection of individuals or objects to be analysed, defined by the nature of the study.

Sample
- a part of the population, selected using a sampling technique.

Data
- can be in the form of facts, observations and information gathered during an investigation.
- are divided into two types: quantitative and qualitative data.

Variable
- is any characteristic that can take on different values.
Several types of variable are used:
- Categorical : nominal & ordinal
- Numerical : interval & ratio

Mode
- the most frequently occurring score in the distribution.

Median
- is the score that divides the ordered set of data in half. The median can also be defined as the score at the 50th percentile of the distribution.

Mean
- the most common measure of central tendency: the sum of the scores divided by the number of scores. It can be manipulated mathematically.
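As a quick illustration of the three measures of central tendency, here is a short Python sketch using the standard `statistics` module. The scores are made up for this example.

```python
import statistics

# Hypothetical test scores, chosen only for illustration.
scores = [3, 5, 5, 6, 7, 8, 8, 8, 13]

mode = statistics.mode(scores)      # most frequent score
median = statistics.median(scores)  # middle score of the ordered list
mean = statistics.mean(scores)      # sum of scores / number of scores

print(mode, median, mean)  # 8 7 7
```

Notice that the single large score (13) pulls the mean up, while the median depends only on the middle position of the ordered scores.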

by - Hanis Jefry

References
Book
Pallant, J. (2013). SPSS survival manual: A step by step guide to data analysis using  SPSS (5th ed.). Maidenhead: Open University Press/McGraw-Hill.

Websites
http://bobhall.tamu.edu/FiniteMath/Module8/Introduction.html
http://data-planet.libguides.com/dataterminology









Sunday 12 March 2017

What is Variance?

The variance (σ²) is a measure of how far each value in the data set is from the mean. Here is how it is calculated:
  1. Subtract the mean from each value in the data. This gives you a measure of the distance of each value from the mean.
  2. Square each of these distances (so that they are all positive values), and add all of the squares together.
  3. Divide the sum of the squares by the number of values in the data set.

I guess all of you are familiar enough with the average (mean), the variance and the standard deviation. They are the three most basic measures in statistics. This post is more about how to teach variance. Let’s say there are 8 test scores, the average is 46 and the variance is 16.

What if each test score is doubled? The average? Sure, still easy: it will be 92. How about the variance, or the standard deviation? I am not sure how many can answer this question right away.
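For readers who want to check their answer: multiplying every score by a constant a multiplies each deviation from the mean by a, so each squared deviation is multiplied by a². A short derivation for a = 2, using the figures from the example above (variance 16, so standard deviation 4):

```latex
\begin{align*}
\mathrm{Var}(aX) &= \frac{1}{N}\sum_{i=1}^{N} (aX_i - a\mu)^2
                  = a^2 \cdot \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2
                  = a^2\,\mathrm{Var}(X) \\
\mathrm{Var}(2X) &= 2^2 \times 16 = 64 \\
\mathrm{SD}(2X)  &= \sqrt{64} = 8 = 2 \times \mathrm{SD}(X)
\end{align*}
```

So doubling the scores doubles the mean and the standard deviation, but multiplies the variance by four.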

In order to write the equation that defines the variance, it is simplest to use the summation operator, "Σ". The summation operator is just a shorthand way to write, "Take the sum of a set of numbers."

Data     X1   X2   X3   X4   X5   X6   X7
Value     3    4    9   13   17   22   23

Think of the variable (X) as the measured quantity from your experiment and think of the subscript as indicating the trial number (1-7). To calculate the average, first we have to add up the values from each of the seven trials. Using the summation operator, we will write it like this:


ΣX = X1 + X2 + X3 + X4 + X5 + X6 + X7

or:

ΣX = 3 + 4 + 9 + 13 + 17 + 22 + 23 = 91


Defining Variance:

Now that you know how the summation operator works, you can understand the equation that defines the variance:
The variance (σ²) is defined as the sum of the squared distances of each term in the distribution from the mean (μ), divided by the number of terms in the distribution (N). In other words, you take the sum of the squared deviations from the mean and divide it by the number of terms in the distribution (N).
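Written with the summation operator, and using only the symbols named in the definition above (σ², μ, N, and X for each term), the definition reads:

```latex
\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}
```

Here the index i runs over every term in the distribution, from the first to the N-th.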

How to do the calculation:

    1)      First, add your data points together:
    3 + 4 + 9 + 13 + 17 + 22 + 23 = 91
    Next, divide your answer by the number of data points: 91 ÷ 7 = 13.
    Sample mean, x̅ = 13.

*You can think of the mean as the "centre-point" of the data. If the data clusters around the mean, variance is low. If it is spread out far from the mean, variance is high.

   2)     Subtract the mean from each data point. Each answer tells you that number's deviation from the mean, or in plain language, how far away it is from the mean.

X1 - x̅ = 3 - 13 = -10
X2 - x̅ = 4 - 13 = -9
X3 - x̅ = 9 - 13 = -4
X4 - x̅ = 13 - 13 = 0
X5 - x̅ = 17 - 13 = 4
X6 - x̅ = 22 - 13 = 9
X7 - x̅ = 23 - 13 = 10

   3)   Find the square of each deviation. The deviations always add up to zero (here -10 - 9 - 4 + 0 + 4 + 9 + 10 = 0), so their plain sum tells us nothing about spread. Squaring makes every value non-negative, so the negative and positive values no longer cancel out.

(-10)² = 100
(-9)² = 81
(-4)² = 16
0² = 0
4² = 16
9² = 81
10² = 100

   4)    Find the sum of the squared values. Now calculate the entire numerator of the formula, ∑(X - x̅)². The upper-case sigma, “∑”, tells you to sum the value of the following term for each value of X. You've already calculated (X - x̅)² for each value of X in your sample, so all you need to do is add the results together.

   100 + 81 + 16 + 0 + 16 + 81 + 100 = 394.

    5)      Divide by n - 1, where n is the number of data points. Dividing by “n - 1” instead of “n” gives you a better estimate of the variance of the larger population when your data are only a sample from it.

There are seven data points in the sample, so n = 7. Variance of the sample: s² = 394 ÷ 6 ≈ 65.67
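The five steps above can be followed line by line in Python. This is just a sketch that mirrors the hand calculation with the same seven values; the variable names are my own.

```python
data = [3, 4, 9, 13, 17, 22, 23]

n = len(data)
mean = sum(data) / n                        # step 1: 91 / 7
deviations = [x - mean for x in data]       # step 2: distance from the mean
squared = [d ** 2 for d in deviations]      # step 3: square each deviation
sum_of_squares = sum(squared)               # step 4: add the squares
sample_variance = sum_of_squares / (n - 1)  # step 5: divide by n - 1

# Dividing by n instead gives the population variance from the definition.
population_variance = sum_of_squares / n

print(mean, sum_of_squares, round(sample_variance, 2))  # 13.0 394.0 65.67
```

Comparing `sample_variance` with `population_variance` shows how much the choice between n - 1 and n matters for a small data set like this one.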



As an exercise, consider the following two data sets:

Data set 1: 3, 4, 4, 5, 6, 8, 10.

Data set 2: 1, 2, 4, 5, 7, 9, 11.


What is the variance of each data set above?

First, try to follow the steps above to find the variance, or construct a table to calculate the values.
(Answer: Data set 1: 6.24 and Data set 2: 13.29)

*Although the two data sets have similar means (about 5.71 and 5.57), the variance (s²) of the second data set, 13.29, is a little more than two times the variance of the first data set, 6.24.
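If you would like to check your answers, Python's standard `statistics` module can do the same calculation. Note that `statistics.variance` divides by n - 1 (the sample variance), which is the version used in the steps above.

```python
import statistics

data_set_1 = [3, 4, 4, 5, 6, 8, 10]
data_set_2 = [1, 2, 4, 5, 7, 9, 11]

var1 = statistics.variance(data_set_1)  # sample variance, divides by n - 1
var2 = statistics.variance(data_set_2)

for name, data, var in (("Data set 1", data_set_1, var1),
                        ("Data set 2", data_set_2, var2)):
    print(name, "mean:", round(statistics.mean(data), 2),
          "variance:", round(var, 2))

print("ratio:", round(var2 / var1, 2))  # how many times larger the second is
```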

It might be easy for you to memorise, but not for your students. Any questions? Post them in the comments. We will be happy to answer any statistics question and will try to help you solve the problem.


By: Nur Fariza

References:
  1. http://www.wikihow.com/Calculate-Variance
  2. http://www.sciencebuddies.org/science-fair-projects/projects_data_analysis_variance_std_deviation.shtml
  3. http://www.mathsisfun.com/data/standard-deviation.html




Thursday 9 March 2017

Definition

Standard Deviation
Standard deviation is a measure of the dispersion of a set of data from its mean. The further the data points are from the mean, the higher the deviation within the data set. Standard deviation is calculated as the square root of the variance, which is itself based on the deviation of each data point from the mean.
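A small Python sketch of the relationship between the two measures, using seven illustrative values (the same ones used in the variance post on this blog). Both calls below use the sample version, which divides by n - 1.

```python
import math
import statistics

data = [3, 4, 9, 13, 17, 22, 23]  # illustrative values

variance = statistics.variance(data)  # sample variance
std_dev = math.sqrt(variance)         # standard deviation = square root of it

# The library's own sample standard deviation gives the same number.
assert math.isclose(std_dev, statistics.stdev(data))

print(round(variance, 2), round(std_dev, 2))  # 65.67 8.1
```

Because it is a square root, the standard deviation is in the same units as the data, which is why it is usually easier to interpret than the variance.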

Type 1 Error
A Type 1 error is known as a “false positive”. It is the error of rejecting a null hypothesis when it is actually true. In other words, it is the error of accepting an alternative hypothesis (the real hypothesis of interest) when the result can be attributed to chance. Plainly speaking, it occurs when we observe a difference when in truth there is none (or more specifically, no real difference in the population).

Alpha
Alpha sets the standard for how extreme the data must be before we can reject the null hypothesis. The alpha level is the probability of rejecting the null hypothesis when the null hypothesis is true, that is, the probability of a Type 1 error. Once you have chosen alpha, you are ready to conduct your hypothesis test.
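The link between alpha and the Type 1 error can be seen in a small simulation. The sketch below is an illustration only: it uses a simple two-sided z-test with a known standard deviation (not a test discussed in this post), draws every sample from a population where the null hypothesis really is true, and counts how often the test rejects anyway. The function name, sample size, number of trials and seed are all arbitrary choices.

```python
import random
from statistics import NormalDist, mean


def type1_error_rate(alpha=0.05, n=20, trials=2000, seed=1):
    """Share of true-null samples that a two-sided z-test wrongly rejects."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # cut-off for |z|
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the population mean is 0 and the sd is 1.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / n ** 0.5)  # (sample mean - 0) / standard error
        if abs(z) > critical_z:
            rejections += 1                # a false positive
    return rejections / trials


rate = type1_error_rate()
print(rate)  # should land close to alpha
```

Lowering alpha (say to 0.01) makes false positives rarer, at the cost of needing more extreme data before the null hypothesis is rejected.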

Reliability
The reliability of a scale indicates how free it is from random error. Two frequently used indicators of a scale's reliability are test-retest reliability (also referred to as temporal stability) and internal consistency.

Validity
The validity of a scale refers to the degree to which it measures what it is supposed to measure. Unfortunately, there is no one clear-cut indicator of a scale’s validity. The validation of a scale involves the collection of empirical evidence concerning its use. The main types of validity are content validity, criterion validity and construct validity.

Repeated measures
Repeated measures are collected in a longitudinal study in which change over time is assessed. Other (non-repeated measures) studies compare the same measure under two or more different conditions.

ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of two or more independent (unrelated) groups (although you tend to see it used only when there are a minimum of three, rather than two, groups).
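To show what the one-way ANOVA actually compares, here is a minimal Python sketch of its F statistic: the variation between the group means divided by the variation within the groups. The three groups are made-up numbers and the function name is my own; SPSS additionally reports the p-value for this F.

```python
def one_way_anova_f(*groups):
    """F statistic: mean square between groups / mean square within groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: how far each score is from its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Three hypothetical independent groups.
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [6, 7, 8])
print(f_stat)  # 21.0
```

A large F means the group means differ by much more than the scatter inside each group would explain.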

T-test
There are two different types of t-test available in IBM SPSS:
  •  Independent-samples t-test, used when you want to compare the mean scores of two different groups of people or conditions.
  •  Paired-samples t-test, used when you want to compare the mean scores for the same group of people on two different occasions, or when you have matched pairs.
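The difference between the two t-tests shows up in how the t statistic is built. The Python sketch below is an illustration with made-up scores: the independent-samples version assumes equal variances in the two groups (pooled variance), and the paired version works on the difference between each person's two scores. The function names are my own, and the p-values that SPSS reports are not computed here.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Student's t for two independent groups, assuming equal variances."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))


def paired_t(first, second):
    """t for paired scores: mean difference divided by its standard error."""
    diffs = [y - x for x, y in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical scores for two different groups of people.
group_a = [10, 12, 9, 11, 13]
group_b = [12, 14, 10, 13, 15]
t_independent = independent_t(group_a, group_b)

# Hypothetical scores for the same five people on two occasions.
before = [70, 65, 80, 75, 60]
after = [72, 67, 81, 77, 62]
t_paired = paired_t(before, after)

print(round(t_independent, 2), round(t_paired, 2))  # -1.62 9.0
```

Small, consistent changes within the same people can give a large paired t, even when the spread between people is wide.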

Mixed between MANOVA
MANOVA is an extension of analysis of variance for use when there is more than one dependent variable. The dependent variables should be related in some way, or there should be some conceptual reason for considering them together. The independent variable should consist of two or more nominal or categorical, independent groups.

MANOVA
MANOVA tests for the difference between two or more vectors of means. MANOVA is useful in experimental situations where at least some of the independent variables are manipulated.

By - alias su'aidah

References


Book
Pallant, J. (2013). SPSS survival manual: A step by step guide to data analysis using  SPSS (5th ed.). Maidenhead: Open University Press/McGraw-Hill.
Website

http://userwww.sfsu.edu/efc/classes/biol710/manova/MANOVAnewest.pdf