Book Review: Site Reliability Engineering: How Google Runs Production Systems

The following points come from a book by many Googlers and related colleagues, such as Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, titled "Site Reliability Engineering: How Google Runs Production Systems".

Disclaimer: As always, these book reviews are just invitations to read the book, not a replacement!

"Traditionally, system developers ended their task once they threw their creation into production." This brought nothing but trouble, both to the final customers and to the staff in charge of providing the service.

This book is basically Google's attempt to revamp the role of the system administrator and operator in production: to place it at the same level as system developers.

No magic solution, just plain common sense, i.e. giving production sysadmins the possibility to improve the system themselves, to automate and to scale. The authors describe their proposal as one specific implementation of DevOps.

The path goes from manual steps to externally maintained automation, both system-specific and generic, then to internally maintained automation and finally to autonomous systems.
How do they define reliability? "The probability that a system performs a function with no failure under stated conditions for a period of time." For the SRE, a planned outage is a change to improve the system, to innovate.

Service reliability hierarchy
Bottom-up: Monitoring, incident response, post-mortem/root cause analysis, testing and release procedures, capacity planning, development and product.

"Hope is not a valid strategy"
70% of outages come from changes in a live system.

Monitoring software should do the interpretation; humans should be notified via alerts, tickets or logging, according to criticality. No email alerts: use a dashboard with flashy colours. Nowadays monitoring is mostly a collection of time series (more powerful than plain SNMP), i.e. sequences of values and timestamps that serve as the data source for automated evaluation rules.
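The idea of an automated evaluation rule over a time series can be sketched in a few lines. This is a minimal illustration, not Google's actual monitoring stack; the thresholds and the `evaluate_rule` helper are hypothetical:

```python
# Minimal sketch of a monitoring rule over (timestamp, value) samples.
# Hypothetical thresholds; a real system would also route to tickets/logs.
def evaluate_rule(samples, threshold, min_consecutive):
    """Page a human when `min_consecutive` successive samples exceed `threshold`."""
    streak = 0
    for _, value in sorted(samples):          # sort by timestamp
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return "page"                      # critical: notify a human now
    return "ok"

# Error rate sampled once a minute; page only on a sustained breach, not a blip.
series = [(0, 0.01), (60, 0.09), (120, 0.12), (180, 0.11), (240, 0.13)]
print(evaluate_rule(series, threshold=0.10, min_consecutive=3))  # → page
```

Requiring several consecutive breaches is what separates an actionable alert from noise.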

Black box monitoring (how is the user experience?) and white box (monitoring system internals).

This way we increase the MTTF (mean time to failure) and reduce the MTTR (mean time to repair).
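The two metrics combine into the classic steady-state availability relation, availability = MTTF / (MTTF + MTTR). A quick sketch with illustrative numbers:

```python
# Steady-state availability from MTTF and MTTR (illustrative numbers).
mttf_hours = 999.0   # mean time to failure
mttr_hours = 1.0     # mean time to repair
availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"{availability:.3%}")  # → 99.900%
```

Either failing less often (higher MTTF) or repairing faster (lower MTTR) moves the same needle.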

Latency vs throughput
System engineers need to understand what is best for their system: the smart mix between latency (how long) and throughput (how many). Think about cost versus projected increase in revenue. Key point: aim for the right Service Level Objective and do not overachieve. Over-achievement in terms of availability prevents you from innovating and improving the system.
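The "do not overachieve" point is usually made concrete through an error budget: the SLO itself defines how much unreliability you are allowed to spend on change and innovation. An illustrative calculation:

```python
# Error budget implied by an availability SLO (illustrative figures).
slo = 0.999                       # 99.9% availability target
minutes_per_month = 30 * 24 * 60  # 43200 minutes in a 30-day month
budget_minutes = (1 - slo) * minutes_per_month
print(round(budget_minutes, 1))   # → 43.2 minutes of allowed downtime per month
```

While the budget is not exhausted, risky rollouts are acceptable; once it is spent, the team freezes changes and focuses on stability.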

Avoid toil
Manual, repetitive work needs to be automated. Monitoring data that is not being used is a candidate for removal. Blending too many results together is complex. In a team of 10 to 12 SREs, 1 or 2 people are devoted to monitoring.

Release engineering
It also includes config management at the beginning of the product lifecycle. Frequent releases result in fewer changes between versions. Distinguish between inherent complexity and accidental complexity, and avoid the latter.
In software, less is more (and more expensive). Versioning APIs is a good idea.

Incident management teams
Multi-site teams incur a communication overhead. How do you know the team is in the sweet spot? When handling an incident, including root cause analysis and post-mortem, takes about 6 hours. Prefer the rational, focused, cognitive (procedure-based) process to the intuitive, fast, automatic one. Provide clear escalation paths and follow a blameless post-mortem culture. Use a web-based incident management tool.

Avoid operational overhead. If there are too many alerts, give the pager back to the original developers. Prepare for outages: drill them, test the "what if...?" scenarios. Team members should be on-call at least once or twice per quarter.

Separation of duties in incident management: ops (rotating roles among teams and time zones), communication and planning.

Testing is continuous. Testing reduces uncertainty, and reliability decreases with each change. Include configuration tests.

Team size
It should not scale directly with service growth.

Best practices
Fail safely. Make progressive rollouts. Define your error/bug budget. Follow the monitoring principles (hierarchy), make post-mortems and include capacity planning.

Look not only at mean latency but also at distribution of latencies. Prevent server overload by means of built-in graceful degradation.
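A toy example shows why the distribution matters more than the mean. The data here is made up; the `percentile` helper uses the simple nearest-rank method:

```python
# Mean latency hides the tail: compare the mean with the 99th percentile.
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of values."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

latencies_ms = [10] * 98 + [2000] * 2   # two slow requests in a hundred
mean = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(mean, p99)  # → 49.8 2000
```

The mean suggests ~50 ms, yet 1 in 100 users waits two full seconds; that tail is what SLOs should capture.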

Leader election is a reformulation of the distributed asynchronous consensus problem. It cannot be solved using heartbeats; replicated state machines are used instead. A byzantine failure is, e.g., an incorrect message due to a bug or malicious activity.

Production readiness review
Early SRE involvement is desired. To scale, SRE can only work with frameworks. Data integrity is the means; data availability is the goal.

Rocky landscape
Happy reliable reading!
Interested in the mindmap of it? Here you are part 1.

And part 2.

Book Review: Practical Data Science with R by Nina Zumel and John Mount

This is a very, very brief collection of points extracted from the book titled "Practical Data Science with R". For those starting in the field of Data Science, it is a recommendable foundational reference.

The main parts: An introduction to Data Science, modelling methods and delivering results.

As always, an important disclaimer when talking about a book review: The reading of this very personal and non-comprehensive list of points, mostly taken verbatim from the book, by no means replaces the reading of the book it refers to; on the contrary, this post is an invite to read the entire work.

Part 1 - Intro to Data Science

I would highlight the method the authors propose to deal with data investigations:

- Define the goal - What problem are you solving?
- Collect and manage data - What info do you need?
- Build the model - Find patterns in the data that lead to a solution
- Evaluate and critique the model - Does the model solve my problem?
- Present results and document - Establish that you can solve the data problem and explain how
- Deploy the model - Deploy the model to solve the problem in the real world.

Part 2 - Models

Common classification methods such as e.g. Naive Bayes classifier, Decision trees, Logistic regression, Support vector machine.
To forecast is to assign a probability (the key is how to map data into a model).

Model types: Classification, scoring, probability estimation, ranking and clustering.
For most model evaluations, it is usual to compute one or two summary scores using a few ideal models: a null model, a Bayes rate model and the best single variable model.

Evaluating scoring models:
- Always try single variable models before trying more complicated techniques.
- Single variable modelling techniques give a useful start on variable selection.
- Consider decision trees, nearest neighbour and naive Bayes models as basic data memorization techniques.
- Functional models let you better explore how changes in inputs affect predictions.
- Linear regression is a good first technique to model quantities.
- Logistic regression is a good first technique to model probabilities.
- Models with simple forms come with very powerful summaries and diagnostics.
- Unsupervised methods find structure (e.g. discovered clusters, discovered rules) in the data, often as a prelude to predictive modelling.
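The advice to try single-variable models first, scored against a null model, can be sketched in plain Python. The data is a toy, hypothetical example, not from the book:

```python
# Minimal sketch: score a single-variable model against a null model
# before trying anything more complicated (toy, hypothetical data).
from collections import Counter

# (feature_value, outcome) pairs
data = [("a", 1), ("a", 1), ("a", 0), ("a", 1), ("a", 1),
        ("b", 0), ("b", 0), ("b", 1), ("b", 0)]

# Null model: always predict the overall majority outcome.
majority = Counter(y for _, y in data).most_common(1)[0][0]
null_acc = sum(y == majority for _, y in data) / len(data)

# Single-variable model: predict the majority outcome per feature value.
by_feature = {}
for x, y in data:
    by_feature.setdefault(x, []).append(y)
table = {x: Counter(ys).most_common(1)[0][0] for x, ys in by_feature.items()}
single_acc = sum(table[x] == y for x, y in data) / len(data)

print(round(null_acc, 2), round(single_acc, 2))  # → 0.56 0.78
```

If a more complex model cannot clearly beat the single-variable baseline, it is probably not worth its cost.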

Part 3 - Delivering results 

Nowadays information systems are built on large databases. Most systems are online, mistakes in data interpretation are common, and almost none of these systems are concerned with cause.

Enjoy the data forest

WannaCry-related interim timeline

Let me share a timeline I constructed regarding WannaCry over the last few days. The interesting point I shared with some colleagues is that the infection vector for patient zero (or patients zero) is not referenced or described as of now.

15th February 2017 Microsoft cancels its monthly patching for that month

9th March 2017 Wikileaks press release regarding Vault7, "the largest-ever publication of confidential documents on the agency" according to Wikileaks.

14th March 2017 Microsoft publishes security update MS17-010 for SMB Server

14th April 2017 The Shadow Brokers release some Equation Group exploits, EternalBlue among them. EternalBlue took advantage of the vulnerability that Microsoft patch MS17-010 fixed.

14th April 2017 Microsoft publishes its triage analysis of the exploits

15th April 2017 Security companies analyse the exploits. One example of an analysis of EternalBlue is the following:

15th April 2017 Some news sites start to wonder how the patch could exist before the exploits were released, e.g.

12th May 2017 WannaCry appears in the wild

Some sources mention that the infection vector was a phishing email.

However, no analysis yet of that mentioned phishing email, its attachment and its modus operandi in general.

Update 1: Response and proposals from Microsoft

Rocky days

Book Review: Bitcoin and other virtual currencies for the 21st Century by J. Anthony Malone

A very handy book to approach Bitcoin.

Let me try to share with you the main learning points I collected from this book. As always, here goes my personal disclaimer: the reading of this very personal and non-comprehensive summary by no means replaces the reading of the book it refers to; on the contrary, this post is an invitation to read the entire work.

The book starts with the concept of money: how money was an innovation in itself, and the functions of money as a medium of exchange, a unit of account, a store of value, a standard of deferred payment and a measure of value. It also provides some insights into the history of money and how credit is older than cash and, finally, a key concept: the monopolistic role of government in currency issuance.

There are some hints in the book to consider Bitcoin a starting point to end the monopoly of central banks. It claims that the Bitcoin value scheme is inspired by the old gold standard. It is interesting to read the links that the author sees between the Austrian School of Economics and Bitcoin.

The point that Bitcoin does not have a centralised clearing house is certainly key in the book. It also mentions that the blockchain public ledger is the heart of the Bitcoin technology, and that Bitcoin is inflation-free (there is a fixed number of bitcoins that can eventually be minted): the supply of bitcoins does not depend on the monetary policy of a central authority. It also recalls the Keynesian line of thought on deflation and how it encourages individuals and businesses to save money.

To use bitcoins, you just need a Bitcoin wallet and a Bitcoin address. Technically, Bitcoin currently has a limit of around 7 transactions per second.
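The oft-quoted ~7 transactions per second follows from simple arithmetic on the protocol's parameters. A back-of-the-envelope sketch, using approximate, pre-SegWit figures:

```python
# Back-of-the-envelope for Bitcoin's ~7 tx/s ceiling (approximate figures).
block_size_bytes = 1_000_000      # 1 MB pre-SegWit block size limit
avg_tx_bytes = 250                # rough average transaction size
block_interval_s = 600            # target of one block every 10 minutes

tx_per_block = block_size_bytes / avg_tx_bytes   # ≈ 4000 transactions per block
tx_per_second = tx_per_block / block_interval_s  # ≈ 6.7 tx/s
print(round(tx_per_second, 1))
```

The exact ceiling depends on the average transaction size, which is why quoted figures range between roughly 3 and 7 tx/s.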

There is a section of the book on the legal aspects of Bitcoin. Apparently, virtual currencies do not have legal tender status in any jurisdiction. Bitcoin has the properties of a payment system, a currency and a commodity. There is still some regulatory ambiguity around Bitcoin. The book's appendixes include a very useful glossary of terms, legal guidance issued by FinCEN in the US, material from the US GAO (Government Accountability Office) and the Internal Revenue Service, input from relevant regulators and legal documentation on different Bitcoin-related cases.

Happy growing!

Book Review: Bitcoin and Mobile Payments: Constructing a European Union Framework (Palgrave Studies in Financial Services Technology) edited by Gabriella Gimigliano

This book sheds some light on how Bitcoin and mobile payments interact with EU rules and regulations. Key points are certainly the PSD and PSD2 directives on payment services in the internal market.

Let me try to share with you the main learning points I collected from this book. As always, here goes my personal disclaimer: the reading of this very personal and non-comprehensive summary by no means replaces the reading of the book it refers to; on the contrary, this post is an invitation to read the entire work.

The book has been built into 4 parts:

- Institutional strategy and economic background
The institutional strategy can be an enabling factor for a sound growth of new instruments and certainly for the security of payments. The definition of an effective “cyber security strategy” at national and European level is one of the pillars of the creation of the “digital single market”. The financial services and the payment industry are an essential component. Certainly the role of SEPA (Single Euro Payment Area) is considered. Interestingly, Bitcoin is an alternative payment scheme without fiat or banking money. There is an interesting statement, “Bitcoin has a tendency to create an oligopoly in terms of miners”.

- The framework – a European outline and a comparison with other frameworks
There is a lack of specific regulations in terms of virtual currencies. Can they be considered payment instruments? What are they really? What is the role of self-regulation in all this? In Europe we see a technological fragmentation of the payment chain. It is still too early to know which path will be followed. Experts suggest an adaptation of the laws for newcomers such as bitcoin.

- Regulatory challenges (e.g. protection of customers’ funds, data integrity, soundness of payment and financial system, competitiveness of European market)
A basic requirement is to have an adequate security that encourages the usability of the system. What happens when there is no central service provider? The increasingly stronger general rules for data protection in the EU will eventually require equally strong sector-based rules.
The legal situation of mobile payments regarding anti-money-laundering rules is certain; that of virtual currencies is not.
Interesting detail: Bitcoin does not attract too many VAT complications within the EU.
For the time being, there is a lack of a fully implemented and integrated business model in the mobile payments ecosystem in Europe.

- Evolution of payment services
Only two sentences on this topic: Bitcoin is really a conceptual revolution; mobile payments are really an evolution.

Happy constructing!

Quick Book Review: Value Web by Chris Skinner

I thought I would share with my readers a selection of the points mentioned in the book (modest disclaimer: it is a non-comprehensive and personal quick summary that does not replace the reading of the book).

The book is titled "Value Web" by Chris Skinner.

- The author is an independent commentator on the financial industry.
- Summary in a sentence: there is a network transformation of how we exchange value.
- This network transformation is linked to our secure digital identities.
- The author describes blockchain technology also as an authentication technology.
- He also touches upon the history of money and how farming created money as an instrument to keep value.
- A detail: it was China that invented paper money.
- An interesting thought: "Simplification comes from kids and complexity comes from incumbents".
- Clear statement: banks don't trust each other anymore.
- Interesting story of an attempt to regulate:
- The author sees banks not so much as money stores but as value stores. His stance: value stores need regulation.
- Three different roles played by fintech players in the banking industry: wrappers, replacers and reformers (vis-à-vis traditional banking).
- How can free apps make money? By creating additional (not currently existing) value and by being relevant.
- The potential to re-invent banking (rather than to disrupt it).
- However, let's be realistic: in the UK, 62% of the population still prefer face-to-face interaction in a branch as their channel to access banking services.
- Banks already require a digital core, a platform, so that channels are replaced by access. In the digital era, they talk about access (to that digital core), not channels anymore.

The last part of the book includes interviews with key players in this field. My 2 cents: follow these three names on Twitter: @jonmatonis, @brockpierce and @chrislarsensf

Happy valuing!

Security sites to bookmark: fireeye, darkmatters.norsecorp and blueliv

New trends in security intelligence services

A traditional marketing element, already present in most security providers' Internet presence, is a blog on current topics of interest: a smart way to attract readers while showcasing their added value as a security company.

This is the case for three international players. They are relatively new in this sector and all combine technology solutions with intelligence services: FireEye, founded in 2004; Norse, created in 2010; and Blueliv, founded in 2009. The first two have even teamed up for customers as relevant as the US Department of Energy.

FireEye, the veteran in this field, is a company that quickly grasped, as early as 2004, the relevance to the business world of advanced persistent threats (customised cyber attacks, at the end of the day). By the time these attacks were hitting the mass media, FireEye had already devised a product and a service to protect companies.

FireEye offers two blogs:

- ThreatResearch covers current Internet threats. I recommend a visit to those who want to know the technical details of new malware campaigns and espionage operations that come to light.

- ExecutivePerspectives, less technical, is focused on business matters. It raises awareness among executive managers and budget decision-makers in terms of cyber (in)security.

Let's remember that in 2014 FireEye acquired Mandiant, the security consulting firm led by Richard Bejtlich.

Norse Corporation also offers both an appliance to install and security intelligence services for hire. Its blog presents news on current cyber attacks together with its executives' public appearances, such as those of Sam Glines, Norse co-founder. It also links to a colourful world map of current Internet attacks that seems to be updated in real time: a very effective way to amaze those who do not work in our sector.

An example of a typical blog post is the one showing the use of Splunk, the popular and successful log search engine, with their security intelligence data feed i.e. the product that provides the data presented in the attack map mentioned above.

Blueliv was founded by Daniel Solis, and its value proposition is innovative: Gartner named it a "cool vendor" in 2015. Its blog targets business people, researchers and industry practitioners. There are also some free resources, ranging from datasheets to reports and videos, and they also display an impressive cyber threat map.

In short, visiting these three blogs could be a first step for those security professionals who want an introduction to the security intelligence services arena.

Happy security intelligence gathering!


Book review: The wisdom of crowds - A leveraging tool

The book by James Surowiecki titled "The wisdom of crowds" fell into my hands and I read it during the summer of 2015. These are the main learning points that I drew from its reading.

Disclaimer: By no means does this personal and non-comprehensive post aim to replace the reading of the book. It is a biased set of thoughts, most of them extracted from the book, that went through my mind while reading it.

- Diversity and independence are key to be kept in collective decisions.
- To these two points, add also decentralization and aggregation.
- A decision market is a working method to capture collective wisdom.
- Making a group diverse makes it better at problem solving.
- Expertise is spectacularly narrow.
- A large group of diverse and independent people will come up with better decisions.
- Collective decisions are only wise when they draw from very different information sources.
- Centralization is never the answer. Aggregation is.
- Betting markets are very good at predicting markets.
- Crowds find the way to collectively benefit even without speaking to each other if everyone knows that everybody is trying to make a decision.
- We live in a society in which convention has won over rationality (e.g. why do all films cost almost the same at the cinema?).
- Maybe as individuals we do not know where we are going but as a group we can achieve great accomplishments.
- People think that people should be where they deserve to be. Merit is a key element in accepting reality.
- Vehicle traffic: Very easy to create traffic jams. Very complex to get rid of them. As a swarm, we drive quicker if we coordinate with surrounding vehicles.
- If the traffic jam is massive, no easy solution. Personal thought: Maybe then stop the car and read a book.
- Academic challenges in a collaborative environment are a morale booster.
- Reputation should not become the basis of a scientific hierarchy.
- Sometimes, being a member of a group can make people dumber (especially if the group is small and has leaders in it).
- Sometimes small groups start already with the conclusions instead of reaching them after an evidence-gathering based process.
- Small-group view polarization exists; hierarchies make it even worse.
- The order in which people speak plays an important role.
- People who think of themselves as leaders will influence groups more than others, even if they lack expertise on what they talk about.
- Groups need an efficient way to aggregate their members' opinions.
- Investors do not always behave rationally.
- Investors get emotionally attached to their shares.
- Individual irrationality can create collective rationality.
- On average crowds will give you a better answer than individuals.
- Healthy markets are led both by fear and greed.
- Bubbles and crashes are examples of crowd decisions going wrong.
- Groups are smart only when their information sources are balanced in terms of their ownership.
- All these points can be applied (and they are actually being applied) also into the business world.
- These thoughts justify why democracy is preferred to other organisational systems.

As infosec professionals, if we keep these points in mind when designing security controls and security awareness sessions, the value we deliver will be higher.

Happy crowded reading!

Groups fly!