WGU Intro to Probability and Statistics
Population: The entire group that is the target of interest. A population need not be people, e.g., "the population of 1-bedroom apartments."
Sample: A subgroup of the population, e.g., "the 1-bedroom apartments with dishwashers."
Steps in the statistics process
1. PRODUCE DATA (by studying a sample of the population)
2. EXPLORATORY DATA ANALYSIS (summarize the data)
3. PROBABILITY ANALYSIS (determine how the sample may differ from the population)
4. INFERENCE (draw conclusions)
Data: Pieces of information about individuals, organized into variables.
Individual: A particular person or object.
Variable: A particular characteristic of the individual.
Dataset: A set of data identified with particular circumstances. Typically displayed in a table with the individuals as rows and the variables as columns.
Quantitative vs. categorical (qualitative) variables
Quantitative: numerical values that represent a measurement.
Categorical: category or label values into which individuals are grouped.
Three steps in Exploratory Data Analysis
1. Organize and SUMMARIZE raw data.
2. DISCOVER important features, patterns, and striking deviations.
3. INTERPRET findings in the context of the problem.
Examining Distributions: Exploring data obtained from one variable at a time.
Examining Relationships: Exploring data obtained from two variables at a time.
Distribution: What values the variable takes and how often it takes them.
Three types of graphical displays of categorical distributions
1. Pie chart
2. Bar chart
3. Pictogram
Numerical summaries (of a categorical distribution): Category counts and percentages.
Four types of graphical displays of quantitative variables
1. Histogram
2. Stemplot
3. Dotplot
4. Boxplot
Histogram: Like a bar chart, but the x-axis is numerical and in order, split into bins, and the y-axis is the count (or percent) of observations in each bin. E.g., the x-axis is the number of hours studied and the y-axis is the number of students falling into each hours-studied bin.
Bins: Ranges of values that make charting easier. Each bar of a histogram covers one range, such as 70-80%.
Four ways to interpret a histogram
1. Shape: symmetry/skewness and peakedness (modality)
2. Center: the midpoint
3. Spread: the approximate range covered by all the data
4. Outliers: observations that fall outside the overall pattern
Symmetric distributions (on a histogram): The left and right sides are roughly mirror images. A symmetric distribution can have more than one peak.
Skewness (on a histogram): The data are skewed to the right or left when one tail is longer than the other, often because of outliers. (Careful: the histogram looks heavy on the side opposite the skew. Think of the outliers as pulling a long tail out from the main data, making it asymmetric. The skew is named for the side of the long tail.)
Peakedness (on a histogram) (three types)
1. Unimodal (single-peaked) distribution
2. Bimodal (double-peaked) distribution
3. Uniform distribution (no clear peak; the bars are all about the same height)
Stemplot (or stem-and-leaf plot)
1. Write all the "stems" down in a list, in ascending numerical order. (The stem is every digit but the rightmost one. E.g., for the dataset 34 35 36 347 367 the stems are 3, 3, 3, 34, 36, but each identical stem is written only once, so the list is 3, 34, 36.)
2. Draw a line to the right of the list.
3. Write all the leaves next to their stems, and arrange them in increasing order.
Two virtues of a stemplot
1. Preserves the data while sorting it
2. When rotated, looks like a histogram
Dotplot: Like a stemplot, but each observation is shown as a dot instead of a leaf.
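The stemplot steps described above can be sketched in a few lines of Python. This is only an illustration: the function name `stemplot` is made up here, it assumes non-negative whole numbers, and it follows the card's convention of listing only the stems that actually occur in the data.

```python
from collections import defaultdict

def stemplot(values):
    """Return the text lines of a stem-and-leaf plot (non-negative integers assumed)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        # Stem = every digit but the rightmost; leaf = the rightmost digit.
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    for stem in sorted(leaves_by_stem):
        # Leaves are already in increasing order because the values were sorted.
        leaves = "".join(str(leaf) for leaf in leaves_by_stem[stem])
        lines.append(f"{stem:>3} | {leaves}")
    return lines

for line in stemplot([34, 35, 36, 347, 367]):
    print(line)
```

For the dataset used in the card (34 35 36 347 367), the three stems 3, 34, and 36 each get one row, and the original values can still be read back from the plot.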
Boxplot: Shows the "five-number summary": Min, Q1, Median, Q3, Max.
The y-axis covers the range of the data. The drawn box is the interquartile range (Q1 to Q3). Outliers are drawn as points, and the whiskers reach toward the minimum and maximum. Most useful for showing side-by-side comparisons.
The five-number summary on a boxplot (each of the four segments holds about 25% of the data)
1. Upper whisker = Q3 (75th percentile) through Max
2. Upper part of the box = Median (50th percentile) through Q3
3. Lower part of the box = Q1 (25th percentile) through Median
4. Lower whisker = Min through Q1
Three measures of center
1. Mode: the value that occurs most often (not sensitive to outliers)
2. Median: the center value, or the average of the two center values (not sensitive to outliers)
3. Mean: the average (sensitive to outliers)
Three measures of spread
1. Range: the distance between the maximum and minimum values
2. Inter-quartile range (IQR): the range of the middle 50% of the data
3. Standard deviation: roughly how far the observations are from their mean. (The average may be 9, but the observations are, on average, about 4 away from 9.)
Calculate range: max - min
Calculate inter-quartile range
1. Find the median (by arranging the data in increasing order).
2. Find the median of the bottom 50% (Q1, "the first quartile").
3. Find the median of the top 50% (Q3, "the third quartile").
4. Q3 - Q1 = IQR
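As a rough sketch of the inter-quartile range steps above, the following Python finds the median, then the medians of the bottom and top halves, and subtracts. The function names are hypothetical, and one assumption is labeled in the code: when the number of observations is odd, the median itself is left out of both halves. Other textbooks and software use slightly different quartile conventions, so results can differ a little.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    xs = sorted(data)                 # arrange the data in increasing order
    n = len(xs)
    half = n // 2
    lower = xs[:half]                 # bottom 50%
    upper = xs[half + n % 2:]         # top 50% (assumption: skip the median when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def iqr(data):
    """Inter-quartile range: Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(data)
    return q3 - q1

scores = [1, 3, 4, 6, 7, 8, 9, 12, 15]    # made-up data for illustration
print(five_number_summary(scores))        # (1, 3.5, 7, 10.5, 15)
print(iqr(scores))                        # 7.0
```

The same five values are what a boxplot draws, so this helper also doubles as a five-number summary.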
The 1.5(IQR) criterion for outliers
1. Lower cutoff: Q1 - 1.5(IQR)
2. Upper cutoff: Q3 + 1.5(IQR)
3. Any data points below the lower cutoff or above the upper cutoff are possible outliers.
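The 1.5(IQR) criterion above amounts to two subtractions and a filter. A minimal sketch, with hypothetical function names and made-up data, taking Q1 and Q3 as already computed:

```python
def outlier_cutoffs(q1, q3):
    """Return the (lower, upper) cutoffs from the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def possible_outliers(data, q1, q3):
    """Return the data points that fall outside the two cutoffs."""
    low, high = outlier_cutoffs(q1, q3)
    return [x for x in data if x < low or x > high]

data = [2, 10, 11, 12, 13, 14, 15, 16, 40]   # made-up data; Q1 = 10.5, Q3 = 15.5
print(outlier_cutoffs(10.5, 15.5))           # (3.0, 23.0)
print(possible_outliers(data, 10.5, 15.5))   # [2, 40]
```

The criterion only flags points as possible outliers; whether to keep or discard them is a separate judgment.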
Outliers - when to keep, when to discard?
1. Keep if it could happen again, i.e., it was produced by essentially the same process.
2. Discard if it was produced by a different process and your purpose is to understand the process which produced most of the data.
3. Discard if it was produced by an error or typo that cannot be fixed.
Notations for standard deviation: SD, s, Sd, St Dev
Calculate standard deviation
1. Find the mean.
2. Find the deviations (the distances between each observation and the mean).
3. Square each deviation.
4. Add up the squared deviations and divide by the number of deviations minus 1.
5. Find the square root of the result.
EXPLANATION
We can't simply average the deviations because they always add up to zero. The reason we divide the sum of the squared deviations by the number of deviations minus 1 (rather than by the number of deviations) is beyond the scope of this course to explain. The result of that division, before the square root is taken, is called the variance of the data.
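The standard deviation calculation described above can be followed step by step in Python. This is a sketch with a hypothetical function name and made-up data; the last line compares it against `statistics.stdev` from the standard library, which also divides by the number of observations minus 1.

```python
import math
import statistics

def sample_standard_deviation(data):
    """Standard deviation computed step by step, dividing by n - 1."""
    n = len(data)
    mean = sum(data) / n                        # find the mean
    deviations = [x - mean for x in data]       # distance of each observation from the mean
    squared = [d ** 2 for d in deviations]      # square each deviation
    variance = sum(squared) / (n - 1)           # add up, divide by (number of deviations - 1)
    return math.sqrt(variance)                  # square root of the variance

data = [5, 7, 9, 11, 13]                        # made-up data; the mean is 9
print(sum(x - 9 for x in data))                 # 0, the raw deviations cancel out
print(sample_standard_deviation(data))
print(statistics.stdev(data))                   # library result for comparison
```

Here the squared deviations are 16, 4, 0, 4, 16, so the variance is 40 / 4 = 10 and the standard deviation is the square root of 10, about 3.16.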
Is the "standard deviation" or "variance of the data" influenced by outliers? Yes, strongly.
The "standard deviation rule" (applies to roughly bell-shaped, i.e., normal, distributions)
- Approx. 68% of observations fall within 1 standard deviation of the mean
- Approx. 95% of observations fall within 2 standard deviations of the mean
- Approx. 99.7% of observations fall within 3 standard deviations of the mean (3 standard deviations = the standard deviation x 3)
Notation for mean: x̄ (an x with a line over it)
Choose between using the mean and standard deviation versus the five-number summary
- Use the mean and SD for relatively symmetric distributions with no outliers
- Use the five-number summary for all others
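To make the standard deviation rule stated above concrete, the sketch below turns a mean and standard deviation into the three intervals. The function name and the example numbers (mean 100, SD 15) are made up for illustration, and the percentages are only approximate, for roughly bell-shaped data.

```python
def standard_deviation_rule(mean, sd):
    """Map each approximate coverage to the interval mean +/- k standard deviations."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in coverage.items()}

for pct, (low, high) in standard_deviation_rule(100, 15).items():
    print(f"about {pct} of observations fall between {low} and {high}")
```

With a mean of 100 and an SD of 15, the intervals are 85 to 115, 70 to 130, and 55 to 145.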
Steps to choose which data display and numerical summary is best
1. Identify the explanatory/independent variable (x) and the response/dependent variable (y).
2. Is the explanatory variable categorical or quantitative?
3. Is the response variable categorical or quantitative?
4. Notate it C-C, C-Q, Q-C, or Q-Q (explanatory first, response second).
5. Select the approach based on the above.
Select data display and numerical summary approach for case C-C, C-Q, Q-C, or Q-Q
1. Case C-C: Two-way table or double bar chart, using conditional percents.
2. Case C-Q: Side-by-side boxplots and the five-number summary.
3. Case Q-C: Not covered in the text.
4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labeled scatterplot.
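The case selection above is essentially a lookup on the two variable types. A small sketch, with a hypothetical function name, where "C" stands for categorical and "Q" for quantitative:

```python
def choose_display(explanatory, response):
    """Return (case, suggested approach) for the given variable types ('C' or 'Q')."""
    approaches = {
        "C-C": "two-way table or double bar chart with conditional percents",
        "C-Q": "side-by-side boxplots and five-number summary",
        "Q-C": "not covered in the text",
        "Q-Q": "scatterplot (explanatory on x, response on y)",
    }
    case = f"{explanatory}-{response}"
    if case not in approaches:
        raise ValueError("each variable type must be 'C' or 'Q'")
    return case, approaches[case]

# E.g., explanatory = hours studied (quantitative), response = exam score (quantitative)
print(choose_display("Q", "Q"))
```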
Correlation Coefficient: Measures the strength and direction of a linear relationship between two quantitative variables. It does not tell you IF a relationship is linear. A curvilinear relationship may or may not have a linear component; the correlation coefficient tells you the strength of the linear relationship only, not of the curvilinear relationship.
Notation of the correlation coefficient: r
Correlation Coefficient and Outliers: Outliers strongly affect the r-value, so the correlation coefficient should only be used after looking at the scatterplot.
Range of values of the correlation coefficient: -1 to 1
- r = -1 is the strongest negative linear relationship
- r = +1 is the strongest positive linear relationship
- r close to zero is a weaker linear relationship
Regression and Linear Regression: Regression is the technique that specifies the dependence of the response variable on the explanatory variable. If it's a linear dependence, then it's linear regression: finding the line that best fits the pattern of the linear relationship.
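The correlation coefficient and the best-fitting line described above can be sketched in plain Python. Two things here are assumptions rather than statements from the cards: "best fit" is taken to mean the usual least-squares line, and the function names and data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# A perfectly linear pattern: r is 1 and the line is y = 2x.
print(correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
print(least_squares_line([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))

# A perfectly curved pattern (y = x squared): r is 0 even though y depends on x.
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The last example is why r should be read alongside a scatterplot: a value near zero rules out a strong linear relationship, not a relationship of any kind.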