Applied statistics


Applied statistics provides both a familiar source of information and a notorious source of error and misinformation. Errors commonly arise from misplaced confidence in intuitive interpretations, but some of the most serious have arisen from misuse by mathematicians and other professionals. Deliberate misinterpretation of statistics by politicians and marketing professionals is such a commonplace that even its genuine use is often treated with suspicion. To those unfamiliar with it, statistics can seem impenetrably arcane, but its pitfalls can be avoided given a grasp of a few readily understood concepts.

(Terms shown in italics are defined in the glossary on the Related Articles subpage.)

Overview: the basics

Statistics are observations that are recorded in numerical form. Essential to their successful handling is the recognition that statistics are not facts, and therefore incontrovertible, but observations about facts, and therefore fallible. The reliability of the information that they provide depends not only upon their successful interpretation, but also upon the accuracy with which the facts are observed and the extent to which they truly represent the subject matter of that information. An appreciation of the means by which statistics are collected is thus an essential part of the understanding of statistics, and is at least as important as a familiarity with the tools that are used in their interpretation.

Although the derivation of those tools involved advanced mathematics, the laws of chance on which much of statistical theory is based are no more than a formalisation of intuitive concepts, and the use of the resulting algorithms and computer software requires only a grasp of basic mathematical principles.

The collection of statistics

The methodology adopted for the collection of observations has a profound influence upon the problem of extracting useful information from the resulting statistics. That problem is at its easiest when the collecting authority can minimise disturbing influences by conducting a "controlled experiment"[1]. A range of more complex methodologies (and associated software packages), referred to as "the design of experiments"[2], is available for use when the collecting authority has various lesser degrees of control. The object of the design in each case is to facilitate the testing of an hypothesis by helping to remove the influence of factors that the hypothesis does not take into account. At the furthest extreme from the controlled experiment, no such help can be provided by the physical elimination of extraneous influences; if they are to be eliminated, it must be done, after they have been identified, by a purely analytical technique termed the "analysis of variance"[3]. For example, the rôle of the authorities that collect economic statistics is necessarily passive, and the testing of economic hypotheses involves the use of a version of the analysis of variance termed "econometrics"[4] (sometimes confused with economic modelling, which is a purely deterministic technique).
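As a concrete illustration (not drawn from the article's own examples), the following Python sketch applies a one-way analysis of variance to three hypothetical groups of crop yields, using the f_oneway function from the SciPy library; all figures and treatment names are invented.

from scipy import stats

# Hypothetical crop yields (tonnes per hectare) under three fertiliser treatments.
treatment_a = [4.1, 3.8, 4.4, 4.0, 3.9]
treatment_b = [4.6, 4.9, 4.3, 4.7, 4.5]
treatment_c = [4.0, 4.2, 3.7, 4.1, 3.9]

# One-way analysis of variance: is the variation between treatment means
# larger than the chance variation within treatments?
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A small p-value would indicate that the differences between the treatment means are unlikely to be attributable to chance variation within the treatments alone.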

The taking of samples[5] reduces the cost of collecting observations and increases the opportunities to generate false information. One source of error arises from the fact that every time a sample is taken there will be a different result. That source of error is readily quantified as the sample's standard error, or as the confidence interval within which the mean observation may be expected to lie[6]. It cannot be eliminated, but it can be reduced to an acceptable level by increasing the size of the sample. The other source of error arises from the likelihood that the characteristics of the sample differ from those of the "population" that it is intended to represent. That source of error does not diminish with sample size and cannot be estimated by a mathematical formula. Careful attention to what is known about the composition of the "population", and to the reflection of that composition in the sample, is the only available precaution. The composition of the respondents to an opinion poll, for example, is normally chosen to reflect as far as possible the composition of the intended "population" as regards sex, age, income bracket, etc. The remaining difference is referred to as the sample bias, and undetected bias has sometimes been a major source of misinformation.
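The sampling error described above can be illustrated with a short Python sketch; the sample here is simulated, so the numbers are purely illustrative. It computes the standard error of a sample mean and the 95% confidence interval derived from it.

import numpy as np
from scipy import stats

# Simulated sample of 50 household incomes (thousands per year) - illustrative only.
rng = np.random.default_rng(0)
sample = rng.normal(loc=35.0, scale=8.0, size=50)

mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

# 95% confidence interval for the mean, using the t distribution.
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
ci_low, ci_high = mean - t_crit * std_err, mean + t_crit * std_err
print(f"mean = {mean:.1f}, standard error = {std_err:.2f}")
print(f"95% confidence interval: ({ci_low:.1f}, {ci_high:.1f})")

Quadrupling the sample size roughly halves the standard error, but it does nothing to reduce sample bias: a larger unrepresentative sample remains unrepresentative.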

Statisticians use the term "population" to refer, not to people in general, but to the category of things or people about which information is sought. A precise definition of the target population is an essential starting point in a statistical investigation, and also a possible source of misinformation. Difficulty can arise when, as often happens, the definition has to be arbitrary. If the intended population were the output of the country's farmers, for example, it might be necessary to draw an arbitrary dividing line between farmers and owners of smallholdings such as market gardens. Any major change over time in the relative output of farm products by the included and excluded categories might then lead to misleading conclusions. Technological change, such as the change from typewriters to word processors, has sometimes given rise to serious difficulties in the construction of the price indexes used in the correction of GDP for inflation[7]. Since there is no objective solution to those problems, it is inevitable that national statistics embody an element of judgement exercised by the professional statisticians in the statistics authorities.
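The effect of such definitional judgements can be sketched with a deliberately simplified price-relative calculation; the products, years and prices below are invented. Whether the "office equipment" component of a price index rises or falls depends entirely on which product is chosen to represent the category.

# Hypothetical prices for two candidate products representing "office equipment".
prices_typewriter = {"1990": 200.0, "2000": 220.0}      # modest price rise
prices_word_processor = {"1990": 500.0, "2000": 300.0}  # sharp price fall

def price_relative(prices, base="1990", current="2000"):
    # Price of the item in the current year, expressed with the base year as 100.
    return 100.0 * prices[current] / prices[base]

print(price_relative(prices_typewriter))       # 110.0: a 10% rise for the item
print(price_relative(prices_word_processor))   # 60.0: a 40% fall for the item

The contribution that the category makes to measured inflation, and hence to the inflation-corrected GDP series, turns on an essentially arbitrary choice.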



Statistical inference

The laws of chance

Probability distributions

Risks and faults

Correlation and association

Popular errors

An eminent authority has claimed that the results of most medical research are flawed because of statistical misinterpretation[8].

Accuracy and reliability

Applications

Surveys

Quality control

Econometrics

Forecasting

Risk management

References

  1. In a controlled experiment, a "control group" that is in all relevant respects similar to the experimental group receives a "placebo", while the experimental group receives the treatment that is on trial.
  2. Valerie Easton and John McCall: The Design of Experiments and ANOVA, STEPS 1997.
  3. Anova Manova.
  4. Econometrics, 2005.
  5. Valerie Easton and John McCall: Sampling, STEPS 1997. http://www.stats.gla.ac.uk/steps/glossary/sampling.html#stratsamp
  6. Robin Levine-Wissing and David Thiel: Confidence Intervals, AP Statistics Tutorial.
  7. See the article on Gross domestic product.
  8. John P. A. Ioannidis: Why Most Published Research Findings Are False, PLoS Med 2(8): e124, doi:10.1371/journal.pmed.0020124, August 2005.