Machine Learning with R
TABLE OF CONTENTS ⌘
- Sources and further reading
- Machine Learning vs. Statistical Learning
- Linear regression
- Exercise for linear regression
- Exercise for linear regression (contd.)
- R best practices
- Logistic regression
- Testing, cross-validation
- Classification exercise
- Presenting the results
- Deploying your results
- Generalized Linear Models
- Generalized Linear Model (cont.)
- Regularization
- Regularization more generally
- Regularization exercise
- Tree based methods
- Unsupervised learning
- Principal Components Analysis
- Clustering
- K-means clustering
1 SOURCES AND FURTHER READING ⌘
Source materials
- “An Introduction to Statistical Learning”
- Available for free in PDF form online
- Online course by Trevor Hastie and Rob Tibshirani
- Andrew Ng's “Machine Learning” online course
Further reading
- “Think Stats” and “Think Bayes”
- both by Allen B. Downey
- both available for free online
- programming in Python
2 MACHINE LEARNING VS. STATISTICAL LEARNING ⌘
- Different origins
- Different focus
- Highly convergent in recent years
3 LINEAR REGRESSION ⌘
The simplest model for estimating a numerical response:

Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε

Details
- Understanding the results
- Assessing the accuracy
- Interpreting the coefficients
- Understanding factors
- Adding higher order terms and interactions
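The bullets above can be walked through in a few lines of base R. A minimal sketch on simulated data (the variable names and true coefficients are made up for illustration):

```r
# Simulate data with known coefficients: y = 2 + 3*x1 - 1.5*x2 + noise
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.3)
d  <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = d)
summary(fit)     # coefficients, std. errors, t-values, R^2 -- assessing accuracy
coef(fit)        # the estimated β0, β1, β2
confint(fit)     # confidence intervals for the coefficients

# Higher-order terms and interactions via formula notation:
fit2 <- lm(y ~ x1 + I(x1^2) + x1:x2, data = d)
```

Factors (categorical predictors) can be included directly in the formula; `lm()` expands them into dummy variables automatically.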
4 EXERCISE FOR LINEAR REGRESSION ⌘
- Data file: Advertising.csv (from ISLR)
- Multivariate linear regression
- Which variables are important?
- Do the others really have no predictive power?
- How much precision do we lose by dropping the "unimportant" variables?
5 EXERCISE FOR LINEAR REGRESSION (CONTD.) ⌘
- Interactions between variables
Regression with all interactions
- Comparing results
- What are interactions?
- Visualizing interactions
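A self-contained sketch of the comparison above, on simulated data with a genuine interaction (the true coefficients are invented for the example):

```r
# True model includes an x1*x2 interaction
set.seed(2)
n  <- 300
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 + x2 + 4 * x1 * x2 + rnorm(n, sd = 0.3)
d  <- data.frame(y, x1, x2)

fit_add <- lm(y ~ x1 + x2, data = d)   # main effects only
fit_int <- lm(y ~ x1 * x2, data = d)   # expands to x1 + x2 + x1:x2

anova(fit_add, fit_int)   # F-test: does the interaction term help?

# Visualizing: predicted effect of x1 at a few fixed values of x2
grid <- expand.grid(x1 = seq(0, 1, 0.1), x2 = c(0.1, 0.5, 0.9))
grid$yhat <- predict(fit_int, newdata = grid)
```

When the interaction is real, the lines of `yhat` against `x1` have different slopes for each value of `x2`; in the additive model they are parallel.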
6 R BEST PRACTICES ⌘
- Organizing your work (and data)
- Reusable work
- Plotting
- Learning
7 LOGISTIC REGRESSION ⌘
Response is categorical: Yes or No.

f(X) = β0 + β1X1 + β2X2 + ⋯ + βpXp

Find a suitable f(X) and classify to Yes if f(X) > 0 and to No otherwise. What is a good f?
- Minimizes the training error? Not fine-grained enough; hard to optimize for.
- Map f(X) to probabilities and maximize for the likelihood of training data.
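In R this mapping-to-probabilities approach is `glm()` with `family = binomial`. A minimal sketch on simulated data (the true coefficients are made up):

```r
# Simulate two-class data: P(Y = 1) = logistic(-1 + 2x)
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))

fit  <- glm(y ~ x, family = binomial)   # maximizes the likelihood
prob <- predict(fit, type = "response") # fitted probabilities in (0, 1)
pred <- ifelse(prob > 0.5, "Yes", "No") # classify at the 0.5 threshold
```

`predict(fit)` without `type = "response"` returns f(X) itself (the log-odds); classifying at f(X) > 0 is the same as classifying at probability > 0.5.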
8 TESTING, CROSS-VALIDATION ⌘
- Training vs. Test-set performance
- Bias-Variance trade-off (under/overfitting)
- Strategies for estimating test error; Cross-Validation
- Bootstrap
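Both a hold-out split and k-fold cross-validation can be written out in a few lines of base R (simulated data; the degree-3 polynomial model is just an example):

```r
set.seed(4)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
d <- data.frame(x, y)

# Hold-out: train on 70% of the data, estimate test error on the rest
idx   <- sample(n, size = 0.7 * n)
train <- d[idx, ]
test  <- d[-idx, ]
fit   <- lm(y ~ poly(x, 3), data = train)
mean((test$y - predict(fit, test))^2)   # test MSE

# 5-fold cross-validation for the same model
folds  <- sample(rep(1:5, length.out = n))
cv_mse <- sapply(1:5, function(k) {
  f <- lm(y ~ poly(x, 3), data = d[folds != k, ])
  mean((d$y[folds == k] - predict(f, d[folds == k, ]))^2)
})
mean(cv_mse)   # CV estimate of the test error
```

The `boot` package automates this (`cv.glm`), but writing it out once makes the bias-variance discussion concrete: refit the model k times, each time scoring it on data it never saw.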
9 CLASSIFICATION EXERCISE ⌘
- Data: the Default ("defaulters") data set from the ISLR package
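The exercise uses ISLR's Default data (columns `default`, `balance`, `income`, among others). To keep this sketch self-contained, the code below simulates a stand-in data set of the same shape; the generating coefficients are invented:

```r
# Simulated stand-in for ISLR's Default data
set.seed(5)
n       <- 2000
balance <- rnorm(n, mean = 800, sd = 400)
income  <- rnorm(n, mean = 35000, sd = 10000)
p       <- plogis(-8 + 0.006 * balance)   # risk driven by balance only
default <- factor(ifelse(rbinom(n, 1, p) == 1, "Yes", "No"))
d       <- data.frame(default, balance, income)

fit <- glm(default ~ balance + income, data = d, family = binomial)
summary(fit)   # balance should be significant, income should not

# Confusion table at the 0.5 threshold
table(predicted = predict(fit, type = "response") > 0.5, actual = d$default)
```

With the real Default data the call is the same: `glm(default ~ balance + income, data = Default, family = binomial)` after `library(ISLR)`.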
10 PRESENTING THE RESULTS ⌘
- Session in R Markdown
11 DEPLOYING YOUR RESULTS ⌘
- Exporting a model to a spreadsheet
- Porting to a different programming environment
- Using R as a library
- Deploying R applications to web
- Shiny: http://shiny.rstudio.com/
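Exporting a model to a spreadsheet and porting it to another environment both come down to the same thing: write out the coefficients, then re-implement the linear predictor, which is just a dot product. A minimal sketch (file name and data are invented):

```r
set.seed(6)
d   <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = d)

# Export: coefficients as a plain table any spreadsheet can open
f <- file.path(tempdir(), "model_coefficients.csv")
write.csv(data.frame(term = names(coef(fit)), estimate = coef(fit)),
          f, row.names = FALSE)

# Porting: prediction is b0 + b1*x1 + b2*x2 -- trivial in any language
b      <- coef(fit)
manual <- b[1] + b[2] * d$x1 + b[3] * d$x2
all.equal(unname(manual), unname(predict(fit)))   # same numbers
```

For a logistic model the only extra step is applying the inverse link, `1 / (1 + exp(-f))`, to the ported linear predictor.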
12 GENERALIZED LINEAR MODELS ⌘
- What's common in linear regression and logistic regression?
- How do they fit under one common assumption?
- What is the family parameter in glm?
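The `family` parameter is exactly where the two models meet: `lm()` is `glm()` with `family = gaussian`, and logistic regression is `glm()` with `family = binomial`. A quick check on simulated data:

```r
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian)   # same model, same coefficients
all.equal(coef(fit_lm), coef(fit_glm))

# Logistic regression: binomial family, logit link by default
z         <- rbinom(100, 1, plogis(x))
fit_logit <- glm(z ~ x, family = binomial)
```

Other families (`poisson`, `Gamma`, …) follow the same pattern: choose the response distribution, and `glm()` fits the linear predictor by maximum likelihood.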
13 GENERALIZED LINEAR MODEL (CONT.) ⌘
- Common underlying assumption: a linear function of the predictors determines the distribution of the response.
- The parameters of the linear function are determined in a way to maximize the likelihood of the observations.
f(X) = β0 + β1X1 + β2X2 + ⋯ + βpXp

For example, given the value of the predictors X, we assume that the distribution of the response depends only on f(X):
- Linear regression: N(f(X), σ²) with a constant σ² (its value doesn't matter)
- Two-class classification: binomial, with the probability of the Yes class being p, where log(p / (1 − p)) = f(X)
Deviance: negative log likelihood (times two :)). This is what we actually minimize in practice. In the case of linear regression…
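This can be checked directly in R. For logistic regression on 0/1 data the residual deviance is exactly −2 × log-likelihood (the saturated model's log-likelihood is zero); for the gaussian family the deviance is the residual sum of squares, so minimizing it is least squares:

```r
set.seed(8)
x   <- rnorm(200)
y   <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# Binomial: deviance == -2 * log-likelihood
all.equal(deviance(fit), -2 * as.numeric(logLik(fit)))

# Gaussian: deviance == residual sum of squares
g <- glm(y ~ x, family = gaussian)
all.equal(deviance(g), sum(resid(g)^2))
```

(For the gaussian family the log-likelihood also carries σ-dependent constants, which is why the deviance equals −2 × log-likelihood only up to a constant there.)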
14 REGULARIZATION ⌘
- Prediction accuracy, especially when p > n (more predictors than observations)
- Model interpretability: removes irrelevant features (feature selection)
15 REGULARIZATION MORE GENERALLY ⌘
Methods
- Subset selection
- Shrinkage (a.k.a. regularization): ridge regression, lasso
- Dimension reduction: principal components regression; partial least squares
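In practice shrinkage is usually fit with the glmnet package (`alpha = 0` for ridge, `alpha = 1` for lasso). As a self-contained illustration of what shrinkage does, ridge regression (without intercept) has the closed form β̂ = (XᵀX + λI)⁻¹Xᵀy, which is a few lines of base R:

```r
set.seed(9)
n <- 100; p <- 5
X    <- matrix(rnorm(n * p), n, p)
beta <- c(3, 0, 0, -2, 0)            # true coefficients: mostly zero
y    <- drop(X %*% beta) + rnorm(n)

# Closed-form ridge estimate: (X'X + lambda*I)^-1 X'y
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge(X, y, 0)    # lambda = 0 recovers ordinary least squares
b_ridge <- ridge(X, y, 10)   # larger lambda shrinks coefficients toward 0

sum(b_ridge^2) < sum(b_ols^2)   # shrinkage: smaller coefficient norm
```

The lasso has no closed form (its penalty is |β| rather than β²), which is exactly why one reaches for glmnet; its L1 penalty sets some coefficients to exactly zero, giving feature selection.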
16 REGULARIZATION EXERCISE ⌘
- Data: regul.csv
17 TREE BASED METHODS ⌘
- Decision trees
- Random forests (bagging, bootstrap)
- Boosting
18 UNSUPERVISED LEARNING ⌘
- Reasons, goals
- Methods
19 PRINCIPAL COMPONENTS ANALYSIS ⌘
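A minimal base-R sketch of PCA with `prcomp()` on simulated data (two correlated variables plus one independent one, so the first component captures the shared direction):

```r
set.seed(10)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent noise dimension
d  <- data.frame(x1, x2, x3)

pc <- prcomp(d, scale. = TRUE)  # scale variables before PCA
summary(pc)      # proportion of variance explained per component
pc$rotation      # loadings: how each variable contributes to each PC
head(pc$x)       # scores: the data in the new coordinates
```

With this setup the first component loads on `x1` and `x2` together and explains well over half the variance; a scree plot is just `plot(pc)`.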
20 CLUSTERING ⌘
- Goals
- Examples
- Challenges
21 K-MEANS CLUSTERING ⌘
Demonstration of R magic
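The demonstration can be sketched with base R's `kmeans()` on two well-separated simulated clusters (cluster locations are invented for the example):

```r
set.seed(11)
a <- matrix(rnorm(100, mean = 0), ncol = 2)   # 50 points near (0, 0)
b <- matrix(rnorm(100, mean = 5), ncol = 2)   # 50 points near (5, 5)
X <- rbind(a, b)

# nstart = 20: rerun from 20 random initializations, keep the best
km <- kmeans(X, centers = 2, nstart = 20)
km$centers                       # estimated cluster centers
table(km$cluster)                # cluster sizes
plot(X, col = km$cluster, pch = 19)
```

`nstart` matters in practice: k-means only finds a local optimum, so a single random start can land on a poor clustering.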