Introduction to Statistics

An introduction to analyzing mathematical relationships between variables. Underlined words are defined in the glossary in Part 3.

Part 1 – Data Collection

(Hypothesis – Data Sources – Sampling Data)

  Before analyzing data, one may first create a hypothesis: a statement that is either true or false and that allows predictions to be made before the theory is tested. For example, “Trees growing more than 1000 m above sea level have a longer average lifespan than those growing below that height”. Creating useful hypotheses can help a scientist or other researcher improve their deductive reasoning, once statistics have been applied to evaluate the theories created.

  To analyze data, the researcher must first gather enough accurate information. Generally, two types of data sources are available to researchers: primary and secondary. Primary data is gathered by the researchers themselves, whereas secondary data has been gathered by others, usually for another purpose.

Examples of Primary Data Sources :

  Wildlife observations, scientific experiments, census, surveying, etc.

Examples of Secondary Data Sources :

  Government statistical agencies, university reports, news reports, blogs, etc.

(Surveying is a primary source of data)

  Usually, a census of the entire population is too time-consuming and expensive. In such a case, a random sample should be considered (a non-random sample would be easier, but often results in bias). Some basic methods of random sampling include :

Simple random sampling

  Selecting a number of people or items at random from the entire population, so that every member has an equal chance of being chosen. For example, drawing numbered slips from a hat.

Systematic random sampling

  Selecting members at a fixed interval, starting from a randomly chosen point. For example, every 10th name on a list.

Stratified random sampling

  Dividing the population into groups and randomly selecting the same fraction from each group.
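The three sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered members, the two groups "low" and "high", the fixed seed, and all function names are hypothetical and chosen for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Pick n members so that every member is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Pick every step-th member, starting from a random point in the first interval."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(groups, fraction, rng):
    """Pick the same fraction at random from each group of the population."""
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population: members numbered 1 to 100
groups = {"low": population[:60], "high": population[60:]}  # hypothetical groups

simple = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(groups, 0.1, rng)
```

Each call returns 10 members here. The stratified sample takes 6 from the larger group and 4 from the smaller one, so each group is represented in proportion to its size.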

Part 2 – Analyzing Data

(Scatter Plots – Predictions – Distance Time Graphs)

  Once sufficient and accurate data has been gathered, we can organize it with graphs. A simple way of doing so is a scatter plot, which shows two-variable data as points on an xy-plot. A scatter plot should have the independent variable on the horizontal axis (x-axis) and the dependent variable on the vertical axis (y-axis).

  After the points are plotted on a scatter plot, inferences can be made from the given information. A line of best fit or curve of best fit (use a line of best fit only for a linear relation) can be drawn on the graph to support further predictions. The line or curve should follow the general direction of the points and should pass through or close to as many points as possible.
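A line of best fit is often drawn by eye, but it can also be calculated. The sketch below uses the least-squares method, which is one common way of choosing the line and is not the only possible one. The function name and the data values are hypothetical, made up for illustration.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical two-variable data that lies close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = line_of_best_fit(xs, ys)
```

For this data the line comes out at roughly y = 1.99x + 0.05, which follows the general direction of the points.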

  Now that the line or curve of best fit has been drawn, we can identify any outliers and remove them to avoid inaccurate data representation. An outlier is a point that differs significantly from the rest of the data: one whose distance to its closest point is much larger than that of the others.

(Can you find the outlier?)
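The idea of an outlier as a point that is unusually far from its closest neighbour can be checked with a short calculation. This is a rough sketch only: the cut-off of three times the typical (median) nearest-neighbour distance is an assumption, as are the function names and the data points.

```python
import math
from statistics import median

def nearest_neighbour_distances(points):
    """For each point, the distance to the closest other point."""
    distances = []
    for i, p in enumerate(points):
        distances.append(
            min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        )
    return distances

def find_outliers(points, factor=3.0):
    """Flag points whose nearest neighbour is much farther away than is typical.

    The factor of 3 is an arbitrary choice for this sketch, not a standard rule.
    """
    distances = nearest_neighbour_distances(points)
    typical = median(distances)
    return [p for p, d in zip(points, distances) if d > factor * typical]

# Hypothetical data: five points on the line y = 2x and one far away from it.
points = [(1, 2), (2, 4), (3, 6), (4, 20), (5, 10), (6, 12)]
outliers = find_outliers(points)
```

Here only the point (4, 20) is flagged. Note that the distances depend on the scales of the two axes, so a check like this is a guide and does not replace looking at the plot.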

  Correlations may also be observed from the scatter plot. If one variable increases as the other increases, there is a positive correlation. If one variable decreases while the other increases (and vice versa), there is a negative correlation.

(Positive Correlation and Negative Correlation)
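The direction of a correlation can also be read from a number. The sketch below calculates the Pearson correlation coefficient, a measure not defined in this text: it is positive when the variables rise together and negative when one falls as the other rises. The function names and data values are hypothetical.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Name the direction of the correlation from the sign of r."""
    if r > 0:
        return "positive correlation"
    if r < 0:
        return "negative correlation"
    return "no linear correlation"

# Hypothetical data sets sharing the same x values.
xs = [1, 2, 3, 4, 5]
rising = [1, 3, 2, 5, 4]   # tends to increase as x increases
falling = [5, 3, 4, 1, 2]  # tends to decrease as x increases

r_rising = correlation_coefficient(xs, rising)
r_falling = correlation_coefficient(xs, falling)
```

With these values the coefficients are about 0.8 and -0.8, so the first data set shows a positive correlation and the second a negative one.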

  The line or curve of best fit can be used for interpolation and extrapolation. Interpolation is prediction within the range of the given data set; extrapolation is prediction outside of it. To interpolate, one may take a value of one variable, follow it to the line or curve of best fit, and read off the value of the other variable. To extrapolate, the line or curve may be extended until it reaches the required value.
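When the line of best fit is known as an equation, both kinds of prediction use the same calculation, and the only difference is whether the chosen x value lies inside the range of the data. In this sketch the line y = 2x and the x values are hypothetical, as are the function names.

```python
def predict(slope, intercept, x):
    """Read the y value off the line of best fit at a given x."""
    return slope * x + intercept

def prediction_type(xs, x):
    """Interpolation if x lies within the observed range, otherwise extrapolation."""
    if min(xs) <= x <= max(xs):
        return "interpolation"
    return "extrapolation"

xs = [1, 2, 3, 4, 5]         # hypothetical observed x values
slope, intercept = 2.0, 0.0  # hypothetical line of best fit: y = 2x

inside = predict(slope, intercept, 3.5)   # within the data, so interpolation
outside = predict(slope, intercept, 8)    # beyond the data, so extrapolation
```

Extrapolated values are generally less reliable than interpolated ones, because nothing in the data confirms that the same trend continues outside the observed range.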

  Finally, a distance-time graph is another type of graph. It shows the relationship between time (the independent variable, on the horizontal axis) and distance (the dependent variable, on the vertical axis). A flat line shows that the person or object is stationary. The steeper the slope, the faster the object is travelling. A curve means that the speed is not constant: the person or object is either accelerating or slowing down.
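Since the slope of a distance-time graph is the speed, the speed over each section can be calculated as the change in distance divided by the change in time. The readings below are hypothetical, as is the function name.

```python
def segment_speeds(times, distances):
    """Speed over each section: change in distance divided by change in time."""
    speeds = []
    for i in range(1, len(times)):
        rise = distances[i] - distances[i - 1]
        run = times[i] - times[i - 1]
        speeds.append(rise / run)
    return speeds

# Hypothetical readings: time in seconds, distance in metres.
times = [0, 10, 20, 30]
distances = [0, 50, 50, 150]
speeds = segment_speeds(times, distances)
```

The middle section has a speed of 0 m/s (a flat line, so the object is stationary), and the last section, at 10 m/s, is steeper and therefore faster than the first, at 5 m/s.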

Part 3 – Glossary of Terms

Hypothesis : A theory or statement that is either true or false.

Statistics : Numerical data or the collection, organization, and analysis of numerical data.

Primary Data : Original data gathered by the researcher for a specific purpose.

Secondary Data : Data gathered by someone else, usually for a different purpose.

Census : A survey of the entire population.

Population : All items or people being studied.

Random Sample : A sample chosen by chance, so that every member of the population has an equal chance of being selected. This helps to avoid bias.

Non-Random Sample : A sample that is not chosen by chance. It often contains bias, but does not necessarily do so.

Bias : Inaccurate representation of the population caused by a poorly chosen sample.

Scatter Plot : A graph that shows the relationship between two different variables.

Independent Variable : A variable that influences the value of another variable.

Dependent Variable : A variable that is influenced by an independent variable.

Inference : Reasoning based on evidence and information given.

Line of Best Fit : The straight line that passes through or lies as close as possible to the points of a scatter plot.

Curve of Best Fit : A curve that closely represents the positions of the points on a scatter plot.

Linear Relation : A relationship where a straight line can be formed to represent the data.

Outlier : A member of a data set that does not fit with the rest of the data.

Interpolation : Predicting values within the range of the given data set.

Extrapolation : Predicting values outside the range of the given data set.

Distance-Time Graph : A graph that visually represents the relationship between distance and time.
