Book Review: The Wisdom of Crowds - A Leveraging Tool

"The Wisdom of Crowds" by James Surowiecki fell into my hands and I read it during the summer of 2015. These are the main learning points that I drew from reading it.

Disclaimer: By no means does this personal and non-comprehensive post aim to replace the reading of the book. This post is a biased set of thoughts, most of them extracted from the book, that went through my mind while reading it.

- Diversity and independence are key ingredients to preserve in collective decisions.
- To these two points, add also decentralization and aggregation.
- A decision market is a workable method for capturing collective wisdom.
- Making a group diverse makes it better at problem solving.
- Expertise is spectacularly narrow.
- A large group of diverse and independent people will come up with better decisions.
- Collective decisions are only wise when they draw from very different information sources.
- Centralization is never the answer. Aggregation is.
- Betting markets are very good at predicting outcomes.
- Crowds find ways to benefit collectively, even without speaking to each other, as long as everyone knows that everybody else is also trying to reach a decision.
- We live in a society in which convention has won over rationality (e.g. why do all films cost almost the same at the cinema?).
- Maybe as individuals we do not know where we are going, but as a group we can achieve great things.
- People think that others should end up where they deserve to be. Merit is a key element in accepting outcomes.
- Vehicle traffic: traffic jams are very easy to create and very hard to dissolve. As a swarm, we drive faster if we coordinate with the surrounding vehicles.
- If the traffic jam is massive, there is no easy solution. Personal thought: maybe then stop the car and read a book.
- Academic challenges in a collaborative environment are a morale booster.
- Reputation should not become the basis of a scientific hierarchy.
- Sometimes, being a member of a group can make people dumber (especially if the group is small and has leaders in it).
- Sometimes small groups start with the conclusions already drawn instead of reaching them through an evidence-gathering process.
- View polarization exists in small groups. Hierarchies make it even worse.
- The order in which people speak plays an important role.
- People who think of themselves as leaders will influence groups more than others, even when they lack expertise in what they are talking about.
- Groups need an efficient way to aggregate their members' opinions.
- Investors do not always behave rationally.
- Investors get emotionally attached to their shares.
- Individual irrationality can create collective rationality.
- On average, a crowd will give you a better answer than an individual (see the sketch after this list).
- Healthy markets are driven by both fear and greed.
- Bubbles and crashes are examples of crowd decisions going wrong.
- Groups are smart only when the ownership of their information sources is balanced.
- All these points can be applied (and are actually being applied) in the business world as well.
- These thoughts justify why democracy is preferred to other organisational systems.
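
To make the aggregation point above concrete, here is a minimal sketch (my own toy illustration, not from the book): many diverse, independent guesses of a quantity, whose simple average beats the typical individual guess.

```python
# A toy illustration (not from the book): independent noisy guesses of a
# true quantity, compared against their aggregated mean.
import random

random.seed(42)
TRUE_VALUE = 1000          # e.g. the number of beans in a jar
N_PEOPLE = 500

# Each diverse, independent guess is the truth plus personal noise.
guesses = [TRUE_VALUE + random.gauss(0, 300) for _ in range(N_PEOPLE)]

crowd_estimate = sum(guesses) / len(guesses)
avg_individual_error = sum(abs(g - TRUE_VALUE) for g in guesses) / len(guesses)

print(f"Crowd estimate:           {crowd_estimate:.1f}")
print(f"Crowd error:              {abs(crowd_estimate - TRUE_VALUE):.1f}")
print(f"Average individual error: {avg_individual_error:.1f}")
```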

As infosec professionals, if we keep these points in mind when designing security controls and security awareness sessions, the value we deliver will be higher.

Happy crowded reading!

Groups fly!

Economics Book Review: Global Financial Systems: Stability & Risk by Jon Danielsson

How come an Information Security blog now posts a review of a book dealing with the foundations of modern finance?

If you wonder why, then you are probably starting out as an Information Security professional. Good luck to you! Train your psychological resilience.

If you are reading this post to find out why this book is recommendable, then surely you have wondered how Information Security can provide value to the business.

This book, titled Global Financial Systems: Stability and Risk, is used by its author, Jon Danielsson, in his lectures on Global Financial Systems at the London School of Economics.

In 19 chapters and several weeks' reading time, readers get a first comprehensive idea of what has happened in the last decade and what is currently happening in the global financial crisis. Not only that, readers also gain an understanding of key financial concepts.

This information will be of great help in understanding the business functionality of the IT systems that you will probably pen-test, secure, harden or white-hat hack. And not only in the financial sector, but literally in any industry sector somehow related to or affected by banks, i.e. in all industries.

Chapter 1 deals with systemic risk. Worth highlighting are the interlinks among different risks and the concept of fractional reserve banking.

I identified four concepts that could also have a reflection in the Information Security field: procyclicality, information asymmetry, interdependence and perverse incentives.

Chapter 2 talks about the Great Depression from 1929 to 1933 and four of its potential causes: trade restrictions, misguided monetary policies, competitive devaluations and agricultural overproduction.

Chapter 3 talks about a very special type of risk: endogenous risk. The author mentions a graph showing how perceived risk lags behind actual risk over time. A very interesting concept to apply in Information Security as well.

Chapter 4 deals with liquidity and the different models banks follow (or should follow). Liquidity is essential but, as this chapter shows, complex. The distinction between funding liquidity and market liquidity is also an eye-opener.

Chapter 5 describes central banking and banking supervision. The origin of central banking dates from 1668 in Sweden and from 1694 in England. The author mentions two key roles in central banking: monetary policy and financial stability.

Chapter 6 teaches us why short-term foreign currency borrowing is a bad idea.

Chapter 7 describes the importance of the fractional reserve system and a concept that is almost the opposite of what information security professionals face on a daily basis: moral hazard (literally, "it is what happens when those taking risks do not have to face the full consequences of failure but they enjoy all benefits of success").

Chapter 8 deals with the complexity of coming up with a smart deposit insurance policy that would avoid "moral hazard" possibilities in a fractional reserve banking system.

Chapter 9 describes the problems that trading actions like short selling can bring into the financial system. An impartial reader of this chapter would see the need to come up with an effective and worldwide trading regulation. Concepts such as a "clearing house" and a "central counterparty" are mentioned.

Chapters 10 and 15: market participants need to know probabilities of default when engaging in credit activities. These chapters explain securitisation concepts such as Special Purpose Vehicles (SPV), Collateralised Debt Obligations (CDO), Asset-Backed Securities (ABS) and Credit Default Swaps (CDS). Could you think of similar concepts being used in Information Security?

Chapter 11 presents the "impossible trinity", i.e. no country is able to pursue these three goals simultaneously: a fixed exchange rate, free capital movement and an independent monetary policy. Remember that the biggest market is the foreign exchange market.

Chapter 12 focuses on mathematical models of currency crises. The reader can see how these models evolved and how the global games model was proposed.

Chapter 13 goes through the different sets of international financial regulation, i.e. Basel I and Basel II. There is also an appendix on the Value-at-Risk model.
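
Since the Value-at-Risk appendix is mentioned, here is a minimal sketch of historical Value-at-Risk (my own illustration, with simulated returns standing in for real market data):

```python
# A minimal sketch (my own illustration, not from the book) of historical
# Value-at-Risk: the loss threshold that daily returns only exceed with
# probability alpha, estimated from an empirical sample of returns.
import random

random.seed(1)
# Hypothetical daily portfolio returns (in practice, observed market data).
returns = [random.gauss(0.0005, 0.02) for _ in range(1000)]

def historical_var(returns, alpha=0.01):
    """Loss exceeded on only a fraction alpha of days, as a positive number."""
    ordered = sorted(returns)           # worst returns first
    index = int(alpha * len(ordered))
    return -ordered[index]

print(f"1-day 99% VaR: {historical_var(returns, 0.01):.2%}")
```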

Chapter 14 could trigger some discussions. There is a patent political element in bailing out banks. Should governments help move private-sector bank losses onto the public sector or not?

Chapter 16 shows the need to take into account concepts such as tail risk, endogenous risk and systemic risk. Very, very interesting reading for us information security professionals.

Chapters 17, 18 and 19 deal with current developments. Chapter 17 studies the 2007 to 2009 period of the latest financial crisis, chapter 18 describes the efforts made in developing financial regulations and chapter 19 talks about the current sovereign debt crisis, its relation to the common currency and the challenge of a transfer union, i.e. a higher degree of unification.

In addition, the website of the book offers slides for every chapter, a link to modelsandrisk.org and three additional chapters with updated information on the European crisis, financial regulations and current challenges in financial policy.

Happy risk management!

Risky times

Book Review: Executive Data Science by Brian Caffo, Roger D. Peng and Jeffrey Leek

When being introduced to the Data Science world, one needs to build the right frame around the topic. This is usually done via a set of straight-to-the-point books that I mention or summarise in this blog. This is the third one. All of them appear under the "data science" label.

This third book is written by Brian Caffo, Roger D. Peng and Jeffrey Leek. Its title is "Executive Data Science". You can get it here. If you need to choose only one among the three books I have talked about in this blog, the most comprehensive one is probably this one.

The collection of bullet points that I have extracted from the book is a way to acknowledge value in a twofold manner: first, I praise the book and congratulate the authors; and second, I try to condense into a few lines a very personal collection of points extracted from the book.

As always, here goes my personal disclaimer: reading this very personal and non-comprehensive list of bullet points by no means replaces reading the book it refers to; on the contrary, this post is an invitation to read the entire work.

In approximately 150 pages, the book literally provides the following key points (please consider all bullet points as quoted, i.e. they show text coming from the book):

- "Descriptive statistics have many uses, most notably helping us get familiar with a data set".
- Inference is the process of making conclusions about populations from samples.
- The most notable example of experimental design is randomization.
- Two types of learning: supervised and unsupervised.
- Machine Learning focuses on learning.
- Code and software play an important role to see if the data that you have is suitable for answering the question that you have.
- The five phases of a data science project are: question, exploratory data analysis, formal modeling, interpretation and communication.
- There are two common languages for analyzing data. The first one is the R programming language. R is a statistical programming language that allows you to pull data out of a database, analyze it, and produce visualizations very quickly. The other major programming language used for this type of analysis is Python. Python is another similar language that allows you to pull data out of databases, analyze and manipulate it, visualize it, and connect it to downstream production (a minimal sketch of this workflow follows this list).
- Documentation basically implies a way to integrate the analysis code and the figures and the plots that have been created by the data scientist with plain text that can be used to explain what’s going on. One example is the R Markdown framework. Another example is IPython notebooks.
- Shiny by RStudio is a way to build data products that you can share with people who don’t necessarily have a lot of data science experience.
- Data Engineer and Data Scientist: A data engineer builds out your system for actually computing on that infrastructure. A data scientist needs to be able to do statistics.
- Data scientists: They usually know how to use R or Python, which are general purpose data science languages that people use to analyze data. They know how to do some kind of visualization, often interactive visualization with something like D3.js. And they’ll likely know SQL in order to pull data out of a relational database.
- kaggle.com is also mentioned as a data science website.
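
As a minimal sketch of that pull-analyze-visualize workflow (the database, table and column names below are hypothetical, my own invention), this is roughly what it could look like in Python:

```python
# A minimal sketch (hypothetical database, table and column names) of the
# workflow described above: pull data out of a relational database with
# SQL, analyze it, and produce a quick visualization.
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("incidents.db")      # assumed example database
df = pd.read_sql_query(
    "SELECT severity, hours_to_resolve FROM incidents", conn)
conn.close()

print(df.describe())                        # descriptive statistics first
df.plot.scatter(x="severity", y="hours_to_resolve")
plt.title("Resolution time by incident severity")
plt.show()
```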

The authors also provide useful comments on creating, managing and growing a data science team. They start with the basics e.g. "It’s very helpful to right up front have a policy on the Code of Conduct".

- Data science is an iterative process.
- The authors also mention the different types of data science questions (as already covered in the summary of the book titled "The Elements of Data Analytic Style").
- They also provide an exploratory data analysis checklist.
- Some words on how to start with modeling.
- Instead of starting to discuss causal analysis, they talk about associational analysis.
- They also provide some tips on data cleaning, interpretation and communication.
- Confounding: The apparent relationship or lack of relationship between A and B may be due to their joint relationship with C.
- A/B testing: offering two options and comparing them (see the sketch after this list).
- It’s important not to confuse randomization, a strategy used to combat lurking and confounding variables, with random sampling, a strategy used to help with generalizability.
- p-value and null hypothesis are also mentioned.
- Finally, they link to knitr.
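
Here is a minimal sketch of the A/B testing and p-value points above (the numbers are made up, and the plain two-proportion z-test is my choice of method, not something prescribed by the book):

```python
# A minimal sketch of an A/B test (made-up numbers): two variants shown
# to different users, compared with a two-proportion z-test.
from statistics import NormalDist

# Hypothetical results: conversions out of visitors for each variant.
conv_a, n_a = 120, 2400   # variant A
conv_b, n_b = 150, 2400   # variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_b - p_a) / se
# Two-sided p-value under the null hypothesis "no difference".
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```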

Happy data-ing!

Find your way

Book Review: The Elements of Data Analytic Style by @jtleek i.e. Jeffrey Leek

When being introduced to the Data Science world, one needs to build the right frame around the topic. This is usually done via a set of straight-to-the-point books that I will be summarising in this blog.

This second book is written by Jeffrey Leek. Its title is "The Elements of Data Analytic Style". You can get it here. It is a primer on basic statistical concepts that are worth having in mind when embarking on a scientific journey.

This summary is a way to acknowledge value in a twofold manner: first, I praise the book and congratulate the author; and second, I share with the community a very personal summary of the book.

Let me try to share with you the main learning points I collected from this book. As always, here goes my personal disclaimer: reading this very personal and non-comprehensive summary by no means replaces reading the book it refers to; on the contrary, this post is an invitation to read the entire work.

In approximately 100 pages, the book provides the following key points:

Type of analysis
Figure 2.1, titled "the data analysis question type flow chart", is the foundation of the book. It classifies the different types of data analysis. The most basic one is the descriptive analysis (reporting results with no interpretation). A step further is the exploratory analysis (would the proposed statements still hold, in a qualitative way, with a different sample?).

If this also holds true in a quantitative manner, then we are doing inferential analysis. If we can use a subset of measures to predict some others, then we can talk about predictive analysis. The next step, certainly less frequent, is seeking a cause: then we are doing causal analysis. Finally, and very rarely, if we go beyond statistics and find a deterministic relation, then we are doing mechanistic analysis.

Correlation does not imply causation
This is key to understand. The additional element needed to really grasp it is the existence of confounders, i.e. additional variables, not touched by the statistical work we are embarked on, that connect the variables we are studying. Two telling examples are mentioned in the book (a simulation sketch follows them):
- The consumption of ice cream and the murder rate are correlated. However, there is no causality. There is a confounder: the temperature.
- Shoe size and literacy are correlated. However, there is a confounder here: age.
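
Here is that simulation sketch (synthetic data, my own illustration): temperature drives both variables, so they correlate strongly even though neither causes the other.

```python
# A minimal sketch (synthetic data): temperature drives both ice cream
# sales and the murder rate, so the two correlate with no causal link.
import random
from statistics import correlation   # Python 3.10+

random.seed(7)
temperature = [random.uniform(0, 35) for _ in range(365)]   # daily °C
ice_cream = [5 * t + random.gauss(0, 20) for t in temperature]
murders   = [0.1 * t + random.gauss(0, 0.5) for t in temperature]

# Strong positive correlation, entirely due to the confounder.
print(f"corr(ice cream, murders) = {correlation(ice_cream, murders):.2f}")
```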

Other typical mistakes
Overfitting: using a single unsplit data set for both model building and testing (see the sketch below).
Data dredging: Fitting a large number of models to a data set.
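
Here is a minimal sketch of that overfitting trap (synthetic data; scikit-learn is my choice of toolkit, not the book's): a flexible model looks perfect when scored on the same unsplit data it was built on, and much worse on data it has never seen.

```python
# A minimal sketch (synthetic data) of overfitting: scoring a flexible
# model on its own training data wildly overstates its real performance.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=200)   # noisy signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor().fit(X_train, y_train)

print(f"R^2 on the data used for fitting: {model.score(X_train, y_train):.2f}")
print(f"R^2 on held-out data:             {model.score(X_test, y_test):.2f}")
```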

Components of a data set
A data set is not only the raw data, but also the tidy data set, a code book describing each of the variables and their values in the tidy data set, and a script showing how to reach the tidy data set from the raw data.
The data set should be understandable even if you, as producer or curator of the data set, are not there.

Type of variables
Continuous, ordinal, categorical, missing and censored.

Some useful tips
- The preferred way to graphically represent data: plot your data.
- Explore your data thoroughly before jumping to statistical analysis.
- Fit a linear regression and compare it against the initial scatterplot of the original data (a sketch follows this list).
- More data usually beats better algorithms.
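
Here is a minimal sketch of those last two tips (synthetic data, my own illustration): plot the raw data first, then overlay a fitted regression line to check whether a linear summary matches what the scatterplot shows.

```python
# A minimal sketch (synthetic data): scatterplot of raw data with an
# ordinary least squares line overlaid for visual comparison.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 2, 100)

slope, intercept = np.polyfit(x, y, 1)     # ordinary least squares fit

plt.scatter(x, y, alpha=0.6, label="data")
plt.plot(np.sort(x), slope * np.sort(x) + intercept, "r-",
         label=f"fit: y = {slope:.2f}x + {intercept:.2f}")
plt.legend()
plt.show()
```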

Section 9 provides some hints on how to write up an analysis. Section 10 plays a similar role for creating graphs. Section 11 hints at how to present the analysis to the community. Section 12 deals with how to make the entire analysis reproducible. Section 14 provides a checklist and Section 15 additional references.

Happy analysis!

Happy stats!

Book Review: How to be a modern scientist by @jtleek i.e. Jeffrey Leek

When being introduced to the Data Science world, one needs to build the right frame around the topic. This is usually done via a set of straight-to-the-point books that I will be summarising in this blog.

The first book I start with is written by Jeffrey Leek. It is not a Data Science book by itself but rather an introductory set of tips for anyone aspiring to do science today.

The title of the book is "How to be a modern scientist" and you can get it here. Actually, the series of posts that I start with this one is a consequence of reading this book. It is a way to acknowledge value in a twofold manner: first, I praise the book and congratulate the author; and second, I share with the community a biased version of the value they could obtain by reading this book. These two processes are also present in today's scientific community, together with more traditional aspects such as scientific paper reading and, certainly, writing.

Let me try to share with you the main learning points I collected from this book. As always, here goes my personal disclaimer: reading this very personal and non-comprehensive summary by no means replaces reading the book it refers to; on the contrary, this post is an invitation to read the entire work.

Paper writing and publishing
There are currently three elements in modern science: what you can write, what you can code and the data you can share (the data on which you have based your investigation).
The four parts that, according to the author, a scientific paper consists of are great: a set of methodologies, a description of data, a set of results and, finally, a set of claims.
A key point is that your paper should tell or explain a story. That is why the author talks about "selecting your plot" for your paper, i.e. you start writing your paper once you have an answer to your question.
These chapters distinguish between posting a preprint of a paper (for example on arxiv.org) and submitting the paper to a peer-reviewed journal. For junior scientists, the mix the author mentions of using a preprint server together with a closed-access journal is very adequate.


Peer review and data sharing
The author proposes some elegant ways to review papers carefully and in a timely manner, and also mentions the use of e.g. blogs to start sharing a serious and constructive review.
Regarding data sharing, his suggestion is to use a public repository that will remain accessible over time, such as e.g. figshare.com.


Scientific blogging and coding
A way to market your papers can be via blogging. The three recommended platforms are blogger.com, medium.com and wordpress.org.
The author also reminds us that the Internet is a medium in which controversy flourishes.
In terms of code, the suggestion for general code is github.com and bitbucket.org.


Social media in science
A useful way to promote the work of others as well as your own.


Teaching in science
Post your lectures on the Internet (but be aware of any non-disclosure agreements with the university or educational institution you teach at). Share videos of your lectures and, if resources allow it, create your own online course.


Books and internal scientific communication
Three platforms are suggested: leanpub.com, gitbook.com and Amazon Kindle Direct Publishing.

Regarding internal communication, slack.com is one of the proposed tools to keep teams in sync.


Scientific talks and reading scientific papers, credit and career planning and online identity
These are the last sections of the book: some hints on preparing scientific talks, reading papers constructively and, very importantly, giving credit to all those community members who have helped you out, either by writing something you use or by creating frameworks you build on. A key suggestion is to use as many related metrics as possible in your CV and in your presentations.

Finally, the book ends with some useful (and common-sense-based) tips on career planning and online identity.

Thanks to the author of "How to be a modern scientist"!



Happy revealing!