This time the student notes I post are coming from the reading of a scientific paper written by Aaron Clauset, from Santa Fe Institute, Cosma Rohilla Shalizi, from Carnegie Mellon University and Mark Newman, Professor of Physics in the University of Michigan.
The link to the paper is this one. It is a 43-page paper with 70 references on Power-Law distributions in empirical data. It complements the previous notes that I published on the Network Science book authored by Albert-László Barabási and the student paper notes I published on the structure and function of complex networks by Mark Newman.
This paper deals with power-law distributions and the difficulty to characterise them. The authors present a "principled statistical framework to deal with power-law behaviour in empirical data. The paper has a considerable mathematical component.
The following points are notes mostly literally extracted from this paper. The intent of this post is not to create new content but rather to provide Network Science students with a (literal and non-comprehensive) summary of this paper.
Why this Information Security blog touches the field of Network Science? I am convinced that we can apply Network Science learning points to our Information Security real-life scenarios.
As always, a little disclaimer: These notes do not replace the reading of the paper, they are just that, some student notes (or fragments) of the paper.
Many empirical quantities cluster around a typical value. Those distributions can be well characterised with their mean and standard deviation (the square root of its variance).
Population of cities, earthquake intensities and power outage sizes are considered to have power-law distributions.
A quantity x obeys a power law if its probability distribution p(x) proportional to x exp (-alpha), where alpha is a constant exponent (also named scaling parameter).
Alpha typically lies between 2 and 3. Typically, the power law applies only for values greater than some minimum Xmin i.e. the tail of the distribution follows a power law.
How can we recognise a power law?
There are two flavours of power-law distributions: Continuous distributions and discrete ones. Formulas for continuous distributions tend to be easier.
In many cases the Cummulative Distribution Function (CDF) of a power-law distributed variable is important. The visual form of the CDF is more robust than the Probability Density Function (PDF).
A common way to probe for power-law behaviour is to measure x, construct a histogram representing its frequency distribution and plot the histogram on doubly logarithmic axes. The distribution should approximately fall on a straight line, being alpha the absolute slope of the line. Unfortunately, this method creates systematic errors. It is a necessary but not sufficient condition.
We can use maximum likelihood estimators (MLEs) to approximate alpha. We assume that alpha is > 1.
It is important to identify the value Xmin. One approximation is the Bayesian Information Criterion (BIC). Another method, proposed by Clauset et al consists of choosing an Xmin that makes the probability distribution of the measured data and the best-fit power law model as similar as possible. A way to quantify the distance between two probability distributions is the Kolmogorov-Smirnov or KS statistic. This is the maximum distance between the CDF of the data and the fitted model.
Other not so predominant methods are the Hill plot and also simply to limit the analysis to the largest observed samples.
Interestingly, regardless of the true distribution from which data was drawn, we can always fit a power law. The question is whether the fit is a good match to the data.
The basic approach is to compare the fluctuation between a power-law form and synthetic data sets from a true power-law distribution with similar measures with empirical data (using the KS distance statistic). With large statistical samples we have a bigger chance to verify these hypotheses.
Goodness of fit test
It generates a p value that quantifies the plausibility of the hypothesis. The p-value is defined as the fraction of the synthetic distances that are larger than the empirical distance. If p is large the distance can be attributed to statistical fluctuations, if p is small the model is not a plausible fit.
For each synthetic data we compute the KS statistic relative to the best-fit powe law for that data set, not relative to the original distribution from which the data set was drawn.
Typically, the power law is ruled out if p<=0.1. It is a necesary but no sufficient condition. High p values should be treated with caution when n is small.
Let's remember we fit the power-law form to only the part of the distribution above Xmin.
We would also need to rule out the competing exponential or log-normal distributions. We would then act similarly with synthetic values coming from several plausible competing distributions and obtaining the respective p-values. It those p-values for the competing distributions are small, then they are discarded (although not with 100% certainty).
The likelihood ratio test
An easier to implement than the KS distance test to compare two distributions is the likelihood ratio test. The one with the higher likelihood is the better fit (or, if we calculate the logarithm R of the ratio, then it could be positive, negative or zero if there is a tie). If this p-value is small, the sign is a reliable indicator or which model is the better fit to the data.
A distribution is nested if it is a subset of the other distribution type. When distributions are nested, the larger family of distributions provides a fit at least as good as the subset one.
The p-value of the likelihood ratio test is effective at identifying cases in which the data are insufficient to make a firm distinction.
Application to real world data
In general, it is extremely difficult to tell the difference between log-normal, stretched exponential and power-law behaviours.
|Happy mathematical description!|