Machine Learning with R

Title: Introduction to R with exercises
Author: Mihaly Barasz, for NobleProg Ltd

TABLE OF CONTENTS ⌘

  • Sources and further reading
  • Machine Learning vs. Statistical Learning
  • Linear regression
  • Exercise for linear regression
  • Exercise for linear regression (contd.)
  • R best practices
  • Logistic regression
  • Testing, cross-validation
  • Classification exercise
  • Presenting the results
  • Deploying your results
  • Generalized Linear Models
  • Generalized Linear Model (cont.)
  • Regularization
  • Regularization more generally
  • Regularization exercise
  • Tree-based methods
  • Unsupervised learning
  • Principal Components Analysis
  • Clustering
  • K-means clustering

1 SOURCES AND FURTHER READING ⌘

Source materials

  • “An Introduction to Statistical Learning”
    • Available for free in PDF form online
    • Online course by Trevor Hastie and Rob Tibshirani
  • Andrew Ng's “Machine Learning” online course

Further reading

  • “Think Stats” and “Think Bayes”
    • both by Allen B. Downey
    • both available for free online
    • programming in Python

2 MACHINE LEARNING VS. STATISTICAL LEARNING ⌘

  • Different origins
  • Different focus
  • Highly convergent in recent years

3 LINEAR REGRESSION ⌘

The simplest model for estimating a numerical response:

Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε

Details

  • Understanding the results
  • Assessing the accuracy
  • Interpreting the coefficients
  • Understanding factors
  • Adding higher-order terms and interactions
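
As a quick reference, here is a minimal sketch of fitting and inspecting such a model in R; the data below are simulated purely for illustration:

    # Simulate data from a known linear model
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)   # true betas: 2, 1.5, -0.5
    d  <- data.frame(y, x1, x2)

    fit <- lm(y ~ x1 + x2, data = d)
    summary(fit)    # coefficients, standard errors, R^2, F-statistic
    coef(fit)       # estimated beta_0, beta_1, beta_2
    confint(fit)    # confidence intervals for the coefficients

    lm(y ~ x1 + I(x1^2) + x2, data = d)   # adding a quadratic (higher-order) term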

4 EXERCISE FOR LINEAR REGRESSION ⌘

  • Data file: Advertising.csv (from ISLR)
  • Multivariate linear regression
    • Which variables are important?
    • Do the others really lack predictive power?
    • How much precision do we lose by dropping the "unimportant" variables?
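
One possible way to work through the exercise, assuming the usual ISLR column names (TV, radio, newspaper, sales; adjust to your copy of the file):

    adv <- read.csv("Advertising.csv")

    full    <- lm(sales ~ TV + radio + newspaper, data = adv)
    reduced <- lm(sales ~ TV + radio, data = adv)   # drop the weak predictor

    summary(full)          # which coefficients are significant?
    anova(reduced, full)   # does newspaper add any predictive power?
    summary(full)$r.squared - summary(reduced)$r.squared   # R^2 lost by dropping it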

5 EXERCISE FOR LINEAR REGRESSION (CONTD.) ⌘

  • Interactions between variables
  • Regression with all interactions
    • Comparing results
    • What are interactions?
    • Visualizing interactions
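
R's formula syntax makes these interaction models one-liners; a sketch with the same (assumed) Advertising data:

    adv <- read.csv("Advertising.csv")   # columns assumed as in the previous exercise
    fit <- lm(sales ~ TV * radio, data = adv)   # expands to TV + radio + TV:radio
    all <- lm(sales ~ (TV + radio + newspaper)^2, data = adv)   # all pairwise interactions
    anova(fit, all)   # compare the two fits
    summary(fit)      # is the TV:radio interaction significant?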

6 R BEST PRACTICES ⌘

  • Organizing your work (and data)
  • Reusable work
  • Plotting
  • Learning

7 LOGISTIC REGRESSION ⌘

Response is categorical: Yes or No.

f(X) = β0 + β1X1 + β2X2 + ⋯ + βpXp

Find a suitable f(X) and classify to Yes if f(X) > 0 and to No otherwise. What is a good f?

  • Minimizes the training error? Not fine-grained enough; hard to optimize for.
  • Map f(X) to probabilities and maximize for the likelihood of training data.
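
A minimal sketch on simulated data; glm() with family = binomial fits exactly this model by maximum likelihood:

    set.seed(1)
    n <- 200
    x <- rnorm(n)
    p <- 1 / (1 + exp(-(-1 + 2 * x)))    # true f(x) = -1 + 2x
    y <- rbinom(n, size = 1, prob = p)

    fit  <- glm(y ~ x, family = binomial)     # maximizes the likelihood
    prob <- predict(fit, type = "response")   # fitted P(Yes | x)
    pred <- prob > 0.5                        # f(X) > 0  <=>  P(Yes) > 0.5
    mean(pred == (y == 1))                    # training accuracy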

8 TESTING, CROSS-VALIDATION ⌘

  • Training vs. Test-set performance
  • Bias-Variance trade-off (under/overfitting)
  • Strategies for estimating test error; Cross-Validation
  • Bootstrap
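
A sketch of k-fold cross-validation using cv.glm() from the boot package, on the same kind of simulated data as above:

    library(boot)
    set.seed(1)
    x <- rnorm(200)
    y <- rbinom(200, size = 1, prob = 1 / (1 + exp(-(-1 + 2 * x))))
    d <- data.frame(y = y, x = x)

    fit  <- glm(y ~ x, family = binomial, data = d)
    cost <- function(obs, prob) mean((prob > 0.5) != obs)   # misclassification rate
    cv.glm(d, fit, cost = cost, K = 10)$delta[1]            # 10-fold CV error estimate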

9 CLASSIFICATION EXERCISE ⌘

  • Data: the Default data set (credit-card defaulters) from the ISLR package
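
One possible workflow for the exercise (a sketch, not the canonical solution): split Default into training and test halves, fit on one, evaluate on the other:

    library(ISLR)
    set.seed(1)
    train <- sample(nrow(Default), nrow(Default) / 2)   # random 50/50 split

    fit  <- glm(default ~ balance + income + student,
                data = Default, subset = train, family = binomial)
    prob <- predict(fit, Default[-train, ], type = "response")
    pred <- ifelse(prob > 0.5, "Yes", "No")
    table(pred, Default$default[-train])    # confusion matrix on the test set
    mean(pred != Default$default[-train])   # test error rate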

10 PRESENTING THE RESULTS ⌘

  • Session in R Markdown
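
A minimal .Rmd skeleton (file name and contents are illustrative); knit it with rmarkdown::render("report.Rmd"):

    ---
    title: "Advertising analysis"
    output: html_document
    ---

    ```{r}
    adv <- read.csv("Advertising.csv")
    summary(lm(sales ~ TV + radio, data = adv))
    ```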

11 DEPLOYING YOUR RESULTS ⌘

  • Exporting a model to a spreadsheet
  • Porting to a different programming environment
  • Using R as a library
  • Deploying R applications to web
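
For the spreadsheet route, one simple approach is to export the fitted coefficients (the output file name here is hypothetical):

    adv <- read.csv("Advertising.csv")   # same assumed data as before
    fit <- lm(sales ~ TV + radio, data = adv)
    write.csv(data.frame(term     = names(coef(fit)),
                         estimate = unname(coef(fit))),
              "model_coefficients.csv", row.names = FALSE)
    # in the spreadsheet: prediction = intercept + b_TV * TV + b_radio * radio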

12 GENERALIZED LINEAR MODELS ⌘

  • What's common in linear regression and logistic regression?
  • How do they fit under one common assumption?
  • What is the family parameter in glm?

13 GENERALIZED LINEAR MODEL (CONT.) ⌘

  • Common underlying assumption: a linear function of the predictors determines the distribution of the response.
  • The parameters of the linear function are chosen to maximize the likelihood of the observations.

f(X) = β0 + β1X1 + β2X2 + ⋯ + βpXp

For example, given the value of the predictors X, we assume that the distribution of the response depends only on f(X):

  • Linear regression: N(f(X), σ²) with a constant σ² (its value does not affect the fitted coefficients)
  • Two-class classification: binomial, with the probability of the Yes class being p, where log(p / (1 − p)) = f(X)

Deviance: the negative log-likelihood (times two). This is what we actually minimize in practice. In the case of linear regression it reduces to the residual sum of squares, so maximum likelihood coincides with least squares.
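
A quick check in R that ties this together: with family = gaussian, glm() reproduces lm(), and the gaussian deviance is exactly the residual sum of squares (Advertising data assumed as before):

    adv <- read.csv("Advertising.csv")
    g <- glm(sales ~ TV + radio, data = adv, family = gaussian)
    l <- lm(sales ~ TV + radio, data = adv)
    all.equal(coef(g), coef(l))               # identical coefficients
    all.equal(deviance(g), sum(resid(l)^2))   # gaussian deviance = RSS
    # logistic regression is the family = binomial case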

14 REGULARIZATION ⌘

  • Prediction accuracy: especially important when p > n (more predictors than observations)
  • Model interpretability: removing irrelevant features, i.e. feature selection

15 REGULARIZATION MORE GENERALLY ⌘

Methods

  • Subset selection
  • Shrinkage (a.k.a. regularization): ridge regression and the lasso; see the sketch after this list
  • Dimension reduction: principal components regression, partial least squares
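
A shrinkage sketch using the glmnet package (assumed installed); glmnet takes a numeric predictor matrix rather than a formula:

    library(glmnet)
    adv <- read.csv("Advertising.csv")   # assumed data as before
    x <- model.matrix(sales ~ TV + radio + newspaper, adv)[, -1]   # drop intercept column
    y <- adv$sales

    ridge <- cv.glmnet(x, y, alpha = 0)   # alpha = 0: ridge penalty
    lasso <- cv.glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty
    coef(lasso, s = "lambda.min")         # lasso sets weak coefficients to zero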

16 REGULARIZATION EXERCISE ⌘

  • Data: regul.csv

17 TREE-BASED METHODS ⌘

  • Decision trees
  • Random forests (bagging: bootstrap aggregation)
  • Boosting
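
A random-forest sketch with the randomForest package (assumed installed), reusing the Default data from the classification exercise:

    library(randomForest)
    library(ISLR)
    set.seed(1)
    rf <- randomForest(default ~ balance + income + student,
                       data = Default, ntree = 200)
    rf              # prints the OOB (out-of-bag) error estimate from bagging
    importance(rf)  # variable importance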

18 UNSUPERVISED LEARNING ⌘

  • Reasons, goals
  • Methods

19 PRINCIPAL COMPONENTS ANALYSIS ⌘
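
A minimal PCA sketch on the built-in USArrests data:

    pc <- prcomp(USArrests, scale. = TRUE)   # standardize variables first
    summary(pc)   # proportion of variance explained by each component
    biplot(pc)    # observations and variable loadings on the first two PCs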

20 CLUSTERING ⌘

  • Goals
  • Examples
  • Challenges

21 K-MEANS CLUSTERING ⌘

Demonstration of R magic
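
A sketch of the demonstration: k-means on simulated data with two well-separated groups:

    set.seed(1)
    x <- rbind(matrix(rnorm(100), ncol = 2),            # cluster around (0, 0)
               matrix(rnorm(100, mean = 3), ncol = 2))  # cluster around (3, 3)
    km <- kmeans(x, centers = 2, nstart = 20)   # nstart: try several random starts
    table(km$cluster)                           # cluster sizes
    plot(x, col = km$cluster, pch = 19)         # color points by assigned cluster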