Security and risk: Book Review: Practical Data Science with R by Nina Zumel and John Mount

This is a very brief collection of points extracted from the book "Practical Data Science with R". For those starting out in data science, it is a recommendable foundational reference.

The book has three main parts: an introduction to data science, modelling methods and delivering results.

As always, an important disclaimer when talking about a book review: this very personal and non-comprehensive list of points, mostly taken verbatim from the book, by no means replaces reading the book itself. On the contrary, this post is an invitation to read the entire work.

Part 1 - Intro to Data Science

I would highlight the method the authors propose to deal with data investigations:

- Define the goal - What problem are you solving?
- Collect and manage data - What info do you need?
- Build the model - Find patterns in the data that lead to a solution
- Evaluate and critique the model - Does the model solve my problem?
- Present results and document - Establish that you can solve the data problem and explain how
- Deploy the model - Deploy the model to solve the problem in the real world

Part 2 - Models

Common classification methods include the naive Bayes classifier, decision trees, logistic regression and support vector machines.
To forecast is to assign a probability (the key is how to map data into a model).

Model types: Classification, scoring, probability estimation, ranking and clustering.
For most model evaluations, it is usual to compute one or two summary scores and compare them against a few ideal models: a null model, a Bayes rate model and the best single variable model.
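
The comparison against a null model and a best single variable model can be sketched in a few lines. This is a minimal illustration and not code from the book, which uses R: the choice of Python, the function names and the toy churn data are all assumptions made for this post. Accuracy is measured on the same rows used to build each model, so the numbers are optimistic.

```python
from collections import Counter, defaultdict


def null_model_accuracy(labels):
    """Accuracy of always predicting the most common label (the null model)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def single_variable_accuracy(values, labels):
    """Accuracy of a model that, for each level of one categorical variable,
    predicts the majority label seen with that level."""
    labels_by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        labels_by_value[value][label] += 1
    # For each level, the majority label is right as often as it occurs.
    correct = sum(c.most_common(1)[0][1] for c in labels_by_value.values())
    return correct / len(labels)


# Made-up toy data: subscription plan and whether the customer churned.
plan = ["basic"] * 6 + ["premium"] * 4
churned = [True, True, True, True, False, True, False, False, False, True]

null_acc = null_model_accuracy(churned)
single_acc = single_variable_accuracy(plan, churned)
print("null model accuracy:", null_acc)
print("single variable model accuracy:", single_acc)
```

On this toy data the null model is right 6 times out of 10 and the single variable model 8 times out of 10, so any more complicated model would need to beat 0.8 to justify its extra complexity.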

Evaluating scoring models:
- Always try single variable models before trying more complicated techniques.
- Single variable modelling techniques give a useful start on variable selection.
- Consider decision trees, nearest neighbour and naive Bayes models as basic data memorization techniques.

- Functional models allow you to better explore how changes in inputs affect predictions.
- Linear regression is a good first technique to model quantities.
- Logistic regression is a good first technique to model probabilities.
- Models with simple forms come with very powerful summaries and diagnostics.
- Unsupervised methods find structure (e.g. discovered clusters, discovered rules) in the data, often as a prelude to predictive modelling.
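
To make the points about linear and logistic regression concrete, here is a minimal sketch, again in Python rather than the book's R. It fits a one-variable least-squares line in closed form and shows the logistic function that logistic regression uses to turn a linear score into a probability. The function names and the toy numbers are illustrative assumptions, not taken from the book.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def logistic(score):
    """Map a linear score (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)

# A score of 0 sits on the decision boundary; larger scores push the
# probability towards 1, smaller ones towards 0.
print("probability at score 0:", logistic(0.0))
```

The fitted slope says how much the prediction moves per unit change in the input, which is the kind of direct summary that models with simple forms provide.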

Part 3 - Delivering results

Nowadays information systems are built on large databases. Most systems are online, mistakes in data interpretation are common, and almost none of these systems are concerned with cause.

Enjoy the data forest
