**“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know”** (Donald Rumsfeld)

BRIEF HISTORY

14th-15th century: the first insurance contracts are written in Genoa (Italy)

1500s-1690s: study of gambling (Cardano…) and of mortality data (John Graunt, William Petty, Edmund Halley…). William Petty’s “Political Arithmetick” is published in 1690.

-1654 A correspondence between Fermat and Pascal about a gambling problem posed by the Chevalier de Méré leads to the mathematical study of probability.

-1713 Jacob Bernoulli’s ‘Law of large numbers’.

-1718 De Moivre’s ‘The Doctrine of Chances’

-1763 Bayes’ theorem

-1770s Condorcet’s *social mathematics*

-1809 Gauss’ normal distribution, method of least squares

-1811 Laplace’s contributions to the central limit theorem

-1814 Laplace’s insights on how probability should be used when making inductive inferences (Laplace believed everything could be explained deterministically)

-1835 the Belgian statistician Quetelet publishes a work on the concept of the ‘average man’ (‘l’homme moyen’): the idea that comparing individuals against the ‘average man’ would give scientists the data needed to identify normal and abnormal characteristics. Quetelet initially used this concept to produce what we would now call the body mass index. His theories also influenced the French positivists, encouraging such thinkers to use data to explain social issues (a subject called ‘social physics’).

Quetelet believed that the same accuracy obtained in the calculations of celestial mechanics could be obtained in explaining social issues.

-1880s Galton discovers correlation while testing whether data would let him predict the size of a person’s head from the measurement of the forearm. Pearson would later say of Galton’s discovery *“this new conception of correlation brought psychology, anthropology, medicine, and sociology in large parts into the field of mathematical treatment.”*

-Karl Pearson (1857-1936): chi-square test. He also exposes what is now called the Simpson problem (later insights into ‘reverse regression’ will show that statistical conclusions are sensitive to which variables we choose to hold constant).

-1925 Fisher’s ‘Statistical Methods for Research Workers’

-1930 The Econometric Society is founded (Ragnar Frisch is one of the key proponents).

BASICS

Three common ways to calculate the average: mean, median and mode (the distribution is bimodal when two values or frequency classes share the same highest frequency)

Range of a distribution = largest value – smallest value

The values can also be divided into quartiles: the lower quartile, the upper quartile and the interquartile range (the difference between the previous two, which is the range of the central 50% of the data)

Values can also be divided into percentiles.
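The measures above can be computed with Python’s standard `statistics` module. This is only a minimal sketch: the sample is hypothetical, and the "inclusive" quartile method is one of several conventions for placing the cut points.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 5, 7, 8, 8, 12, 15]

mean = statistics.mean(data)          # arithmetic mean
median = statistics.median(data)      # middle value of the sorted data
modes = statistics.multimode(data)    # every value tied for the highest frequency

value_range = max(data) - min(data)   # largest value minus smallest value

# quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                         # spread of the central 50% of the data

print(mean, median, modes, value_range, q1, q3, iqr)
```

Here `modes` contains two values (3 and 8), so this sample is bimodal.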

In statistics a *false positive* (type 1 error) occurs when the statistician interprets the analysed data as evidence of an effect which in reality does not exist, whereas a *false negative* (type 2 error) occurs when the statistician fails to detect a real and relevant effect in the data.

Type 1 error: when the null hypothesis is true but the statistician rejects it (the probability of making this error is the significance level α chosen by the statistician)

Type 2 error: when the null hypothesis is false but the statistician fails to reject it (the probability of making this error is β; statistical power is 1 − β).

Statistical power: the probability of rejecting a null hypothesis that is false, for example of detecting a real difference between the group undergoing an experiment and the control group (the bigger the sample, the greater the statistical power)
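The link between sample size and power can be made concrete with a small calculation. The sketch below assumes a one-sided, one-sample z-test with known unit variance; the function name `z_test_power` and the effect size of 0.2 standard deviations are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test with known unit variance.

    effect_size is the true shift of the mean in standard-deviation units
    (an assumption of this sketch, not a value from the text).
    """
    std_normal = NormalDist()
    # Critical value: reject the null hypothesis when the z statistic exceeds it.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is N(effect_size * sqrt(n), 1),
    # so beta is the probability that it still falls below the critical value.
    beta = std_normal.cdf(z_crit - effect_size * n ** 0.5)
    return 1 - beta


# Power for the same small effect at three hypothetical sample sizes.
powers = {n: z_test_power(0.2, n) for n in (10, 50, 200)}
for n, power in powers.items():
    print(n, round(power, 3))
```

With no true effect (`effect_size=0`) the function returns α itself: the only way to reject is a false positive.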

Confidence intervals are more informative than p-values, because they show both the size of an effect and the uncertainty around it. Yet many scientists prefer using p-values (perhaps because a potentially wide confidence interval may undermine the credibility of a statistical finding).

BAYES vs FISHER

In the Bayesian view, confidence in our hypotheses is built up through successive approximations. Our probability estimates are constantly updated whenever new evidence challenges our previous beliefs (this matters all the more because the ‘prior probability’ in Bayes’ theorem may be too subjective).

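Bayes’ theorem:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where $P(H)$ is the prior probability of the hypothesis $H$ and $P(H \mid E)$ is the posterior probability after observing the evidence $E$.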
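A short numerical sketch of Bayesian updating follows. All numbers are hypothetical: a hypothesis with a 1% prior, and evidence that appears 90% of the time when the hypothesis is true and 10% of the time when it is false. The helper `bayes_update` is an illustrative name, and the second update assumes the two pieces of evidence are independent given the hypothesis.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    # Total probability of the evidence, summed over H true and H false.
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


prior = 0.01                                    # hypothetical prior belief
after_one = bayes_update(prior, 0.9, 0.1)       # about 0.083 (exactly 1/12)
after_two = bayes_update(after_one, 0.9, 0.1)   # 0.45

print(round(after_one, 3), round(after_two, 3))
```

Each posterior becomes the prior for the next update. Even fairly strong evidence leaves the hypothesis unlikely after one observation, because the prior was low to begin with.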

Ronald Fisher (1890-1962) was an English statistician and biologist who wrote influential works on how statistics should be used for scientific research. Fisher established many important elements of modern statistical terminology, such as the p-value (used to determine ‘statistical significance’). Later Egon Pearson and Jerzy Neyman made Fisher’s system more rigorous by formalising the use of null and alternative hypotheses, for instance by fixing in advance the accepted probability (α) of a false positive. The null hypothesis may be rejected if the following condition is met: p < α
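As an illustration of the rule p < α, here is a minimal sketch of a two-sided z-test. It assumes a known population standard deviation, and the figures (a sample mean of 103 against a null value of 100, σ = 15, n = 100) are hypothetical; `two_sided_p_value` is an illustrative helper, not a library function.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """p-value of a two-sided z-test of H0: population mean == mu0."""
    # Standardise the observed difference by the standard error of the mean.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability, under H0, of a z statistic at least this extreme.
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05                                    # chosen before seeing the data
p = two_sided_p_value(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
reject_null = p < alpha

print(round(p, 4), reject_null)                 # 0.0455 True
```

Here z = 2, so the result is only just significant at α = 0.05; with α = 0.01 the same data would not lead to rejection.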

Fisher opposed the Bayesian approach because of the potential for subjective bias in the choice of prior probabilities.

Fisher’s statistical school is now known as ‘frequentism’. Its essential tenet is that uncertainty (the ‘margin of error’) is due to the fact that samples may not reflect the population as a whole. ‘Frequentists’ generally assume that sample statistics can be approximated by a normal distribution.

PROBLEMS RELATED TO STATISTICS

-TRUTH INFLATION: the drive to publish sensational papers ‘encourages’ researchers to report significant findings even when their studies are underpowered, which tends to exaggerate the size of the effect (for instance the Kanazawa fiasco).

-WILL ROGERS PARADOX: it occurs when moving one class of elements from one group to another increases the average value of both groups.

-BASE RATE FALLACY: a good explanation is found in Alex Reinhart’s book “Statistics Done Wrong”. It occurs when one makes a judgement from case-specific or irrelevant information while ignoring the prior probability (the base rate) of an event.

-ANSCOMBE’S QUARTET: four datasets that appear similar in their summary statistics but look quite different when graphed. This example shows the importance of visualising data before making rushed inferences.

-BERKSON’S PARADOX: when two independent and unrelated variables seem to have a negative correlation because of the way the sample was selected.

-PSEUDOREPLICATION: repeating measurements on the same unit and treating them as independent observations, with the danger that the new measurements are conditional on the previous data.

-SIMPSON’S PARADOX: when doing statistics with sub-groups, a trend that appears within each sub-group may weaken, disappear or reverse when the sub-groups are combined (for instance UC Berkeley’s alleged gender bias)

-Outliers distort the mean

-CIRCULAR ANALYSIS (DOUBLE DIPPING): using the same data set (or a closely related one) both to find a correlation and to prove its validity

-CONFOUNDING VARIABLE: a hidden variable that actually explains the correlation between a dependent and an independent variable (also known as the ‘third variable problem’, because in an experiment there is always the possibility of a third variable explaining a correlation). More specifically, confounding arises when the result of an experiment is actually attributable to a variable we did not factor in. For instance, an increase in ice cream sales may not be due to a good marketing strategy but to a rise in temperature. (LURKING VARIABLE: a variable that increases or decreases both the independent and the dependent variable.)

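-SPURIOUS RELATIONSHIPS: false correlations, where two variables move together without any causal link between them

-REGRESSION TOWARD THE MEAN (found by Galton): an extreme measurement tends to be followed by one closer to the average

-CALCULATION MISTAKES: one figure reported for the journal *Nature* is “roughly 38% of papers making typos and calculation errors in their p values” [2]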

-JOHN IOANNIDIS “Why Most Published Research Findings Are False”: in medical research there is a risk of conflicts of interest (due to the scientists’ financial backers), bias, abuse of the p-value, non-repeatable experiments, unreliable findings from small samples (which have less statistical power), and the Proteus phenomenon (“rapidly alternating extreme research claims and extremely opposite refutations”).

-EXCEL MISTAKES: in 2013 a graduate student found that an important paper by Reinhart and Rogoff (it gave intellectual support to the idea that low growth is due to high public debt) had omitted five rows from a formula that calculated an average. In 2016 it was reported in the journal *Genome Biology* that one-fifth of the papers (that contained data) published between 2005 and 2015 contained errors in their Excel spreadsheets [1]

-DOUGLAS JOHNSON’S “The insignificance of statistical significance testing”: on the abuse of the p-value and misconceptions.

-DATA DREDGING, P-HACKING: manipulating the data or the analysis, for example by trying many tests and reporting only the favourable ones, to obtain the desired result

-MULTIPLE COMPARISONS FALLACY: unexpected trends may be found solely due to random chance in a data set that is large enough to contain a large number of variables
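Three of the problems listed above (the Will Rogers paradox, Simpson’s paradox and the multiple comparisons fallacy) can be checked with simple arithmetic. The sketch below uses hypothetical numbers throughout: the admission counts are invented and are not the Berkeley data, and the error-rate formula assumes the tests are independent.

```python
from statistics import mean

# --- Will Rogers paradox: moving one element raises both averages ---
group_a = [1, 2, 3, 4]
group_b = [5, 6, 7, 8, 9]
before = (mean(group_a), mean(group_b))
moved = min(group_b)  # below B's average but above A's average
after = (mean(group_a + [moved]),
         mean([x for x in group_b if x != moved]))
print(before, after)

# --- Simpson's paradox: (admitted, applied) per department, hypothetical ---
admissions = {
    "easy_dept": {"men": (80, 100), "women": (18, 20)},
    "hard_dept": {"men": (4, 20), "women": (30, 100)},
}


def rate(admitted, applied):
    return admitted / applied


def overall_rate(group):
    """Admission rate for one group, pooling all departments."""
    admitted = sum(dept[group][0] for dept in admissions.values())
    applied = sum(dept[group][1] for dept in admissions.values())
    return admitted / applied


# Women have the higher rate inside every department...
women_ahead_in_each_dept = all(
    rate(*dept["women"]) > rate(*dept["men"]) for dept in admissions.values()
)
# ...yet men have the higher rate once departments are pooled, because
# most women applied to the department that rejects most applicants.
men_ahead_overall = overall_rate("men") > overall_rate("women")
print(women_ahead_in_each_dept, men_ahead_overall)


# --- Multiple comparisons: chance of at least one false positive ---
def familywise_error_rate(alpha, n_tests):
    """P(at least one type 1 error) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


print(round(familywise_error_rate(0.05, 20), 3))  # 0.642
```

With 20 independent tests at α = 0.05, the chance of at least one spurious ‘significant’ result is about 64%, even if no real effect exists.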

KEYNES CRITIQUE OF ECONOMETRICS

Keynes criticised econometrics in a 1940 review of Tinbergen’s work. Keynes believed econometrics cannot deliver comprehensive results about the business cycle, because it is impossible to have a complete list of all the factors that determine such a cycle; hence there is a danger of spurious correlations.

Keynes also believed that, since data are incomplete, there is a danger that economists may build subjective models in order to get the results they expect:

“It will be remembered that the seventy translators of the Septuagint were shut up in seventy separate rooms with the Hebrew text and brought out with them, when they emerged, seventy identical translations. Would the same miracle be vouchsafed if seventy multiple correlators were shut up with the same statistical material? And anyhow, I suppose, if each had a different economist perched on his *a priori*, that would make a difference to the outcome.” (Keynes, 1940).

Making an argument similar to the one in his ‘Treatise on Probability’, Keynes claims that precise inductive inferences are not as reliable as we think because economic data lack ‘homogeneity’ across time: it is not possible to take the same samples from a fixed population at all times.

Keynes also criticises Tinbergen’s idea that everything can be measured (what about psychological motivations?).

Keynes also believed that it is questionable to have independent variables in economics since the business cycle of a country is generally a highly complex organic system, where all variables affect each other.

Keynes also criticises the assumption that all economic phenomena can be captured by linear models.

BIG DATA

According to IBM, every day we generate 2.5 quintillion bytes of data. According to neuroscientists, our brain may store up to three terabytes of data (that is roughly one millionth of the information being generated each day).

SOURCES

[1] http://www.economist.com/blogs/graphicdetail/2016/09/daily-chart-3

[2] *“Statistics Done Wrong”* by Alex Reinhart (ch10).