Coursera UW Machine Learning Specialization Notebook

For quick searching
Course can be found here
Notes can be found in my Github

This Specialization from leading researchers at the University of Washington introduces you to the exciting, high-demand field of Machine Learning. Through a series of practical case studies, you will gain applied experience in major areas of Machine Learning including Prediction, Classification, Clustering, and Information Retrieval. You will learn to analyze large and complex datasets, create systems that adapt and improve over time, and build intelligent applications that can make predictions from data.

Machine Learning Foundations: A Case Study Approach

Course can be found here
Lecture slides can be found here
Notes can be found in my Github

About this course: Do you have data and wonder what it can tell you? Do you need a deeper understanding of the core ways in which machine learning can improve your business? Do you want to be able to converse with specialists about anything from regression and classification to deep learning and recommender systems?

In this course, you will get hands-on experience with machine learning from a series of practical case-studies. At the end of the first course you will have studied

  1. how to predict house prices based on house-level features,
  2. analyze sentiment from user reviews,
  3. retrieve documents of interest,
  4. recommend products,
  5. and search for images.

Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains.

This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms. Together, these pieces form the machine learning pipeline, which you will use in developing intelligent applications.

Learning Outcomes: By the end of this course, you will be able to:
-Identify potential applications of machine learning in practice.
-Describe the core differences in analyses enabled by regression, classification, and clustering.
-Select the appropriate machine learning task for a potential application.
-Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
-Represent your data as features to serve as input to machine learning models.
-Assess the model quality in terms of relevant error metrics for each task.
-Utilize a dataset to fit a model to analyze new data.
-Build an end-to-end application that uses machine learning at its core.
-Implement these techniques in Python.

Welcome to Machine Learning Foundations: A Case Study Approach! By joining this course, you’ve taken a first step in becoming a machine learning expert. You will learn a broad range of machine learning methods for deriving intelligence from data, and by the end of the course you will be able to implement actual intelligent applications. These applications will allow you to perform predictions, personalized recommendations and retrieval, and much more. If you continue with the subsequent courses in the Machine Learning specialization, you will delve deeper into the methods and algorithms, giving you the power to develop and deploy new machine learning services.

To begin, we recommend taking a few minutes to explore the course site. Review the material we’ll cover each week, and preview the assignments you’ll need to complete to pass the course. These assignments, one for each of Modules 2 through 6, will walk you through Python implementations of intelligent applications for:

  • Predicting house prices
  • Analyzing the sentiment of product reviews
  • Retrieving Wikipedia articles
  • Recommending songs
  • Classifying images with deep learning

Week 1 Welcome

Machine learning is everywhere, but is often operating behind the scenes.

This introduction to the specialization provides you with insights into the power of machine learning, and the multitude of intelligent applications you personally will be able to develop and deploy upon completion.

We also discuss who we are, how we got here, and our view of the future of intelligent applications.

Why you should learn machine learning with us

Important Update regarding the Machine Learning Specialization10 min

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here: intro.pdf

Welcome to this course and specialization41 sec

Who we are5 min

Machine learning is changing the world3 min

Why a case study approach?7 min

Specialization overview6 min

Who this specialization is for and what you will be able to do

How we got into ML3 min

Who is this specialization for?4 min

What you’ll be able to do57 sec

The capstone and an example intelligent application6 min

The future of intelligent applications2 min

Getting started with the tools for the course

Reading: Getting started with Python, IPython Notebook & GraphLab Create10 min

Reading: where should my files go?10 min

Getting started with Python and the IPython Notebook

Download the IPython Notebook used in this lesson to follow along10 min

Starting an IPython Notebook5 min

Creating variables in Python7 min

Conditional statements and loops in Python8 min

Creating functions and lambdas in Python3 min

Getting started with SFrames for data engineering and analysis

Download the IPython Notebook used in this lesson to follow along10 min

Starting GraphLab Create & loading an SFrame4 min

Canvas for data visualization4 min

Interacting with columns of an SFrame4 min

Using .apply() for data transformation5 min

Week 2 Regression: Predicting House Prices

This week you will build your first intelligent application that makes predictions from data.

We will explore this idea within the context of our first case study, predicting house prices, where you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…).

This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.

You will also examine how to analyze the performance of your predictive model and implement regression in practice using an IPython notebook.

Linear regression modeling

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here: regression-intro-annotated.pdf

Predicting house prices: A case study in regression1 min

What is the goal and how might you naively address it?3 min

Linear Regression: A Model-Based Approach5 min

Adding higher order effects4 min

Evaluating regression models

Evaluating overfitting via training/test split6 min

Training/test curves4 min

Adding other features2 min

Other regression examples3 min

Summary of regression

Regression ML block diagram5 min

Quiz: Regression9 questions

Predicting house prices: IPython Notebook

Download the IPython Notebook used in this lesson to follow along10 min

Loading & exploring house sale data7 min

Splitting the data into training and test sets2 min

Learning a simple regression model to predict house prices from house size3 min

Evaluating error (RMSE) of the simple model2 min

Visualizing predictions of simple model with Matplotlib4 min

Inspecting the model coefficients learned1 min

Exploring other features of the data6 min

Learning a model to predict house prices from more features3 min

Applying learned models to predict price of an average house5 min

Applying learned models to predict price of two fancy houses7 min

Programming assignment

Reading: Predicting house prices assignment10 min

Quiz: Predicting house prices3 questions

Week 3 Classification: Analyzing Sentiment

How do you guess whether a person felt positively or negatively about an experience, just from a short review they wrote?

In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including

  • ad targeting,
  • spam detection,
  • medical diagnosis,
  • image classification.

You will analyze the accuracy of your classifier, implement an actual classifier in an IPython notebook, and take a first stab at a core piece of the intelligent application you will build and deploy in your capstone.

Classification modeling

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here: classification-annotated.pdf

Analyzing the sentiment of reviews: A case study in classification38 sec

What is an intelligent restaurant review system?4 min

Examples of classification tasks4 min

Linear classifiers5 min

Decision boundaries3 min

Evaluating classification models

Training and evaluating a classifier4 min

What’s a good accuracy?3 min

False positives, false negatives, and confusion matrices6 min

Learning curves5 min

Class probabilities1 min

Summary of classification

Classification ML block diagram3 min

Quiz: Classification7 questions

Analyzing sentiment: IPython Notebook

Download the IPython Notebook used in this lesson to follow along10 min

Loading & exploring product review data2 min

Creating the word count vector2 min

Defining which reviews have positive or negative sentiment4 min

Training a sentiment classifier3 min

Evaluating a classifier & the ROC curve4 min

Applying model to find most positive & negative reviews for a product4 min

Exploring the most positive & negative aspects of a product4 min

Programming assignment

Reading: Analyzing product sentiment assignment10 min

Quiz: Analyzing product sentiment11 questions

Week 4 Clustering and Similarity: Retrieving Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? How do I automatically search over documents to find the one that is most similar? How do I quantitatively represent the documents in the first place?

In this third case study, retrieving documents, you will examine various document representations and an algorithm to retrieve the most similar subset. You will also consider structured representations of the documents that automatically group articles by similarity (e.g., document topic).

You will actually build an intelligent document retrieval system for Wikipedia entries in an IPython notebook.
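The representation and retrieval questions above can be illustrated with a small sketch. It is plain Python rather than the course's own tooling, and it makes assumptions that are not from the course: the function names are hypothetical, the four toy documents are made up, the inverse document frequency is the unsmoothed log(N / document frequency), and similarity is measured with cosine distance.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words representation: word -> number of occurrences."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document.

    Assumes idf = log(N / document frequency), so a word that appears in
    every document gets weight 0.
    """
    counts = [word_counts(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
        for c in counts
    ]


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor(query_index, vectors):
    """Index of the document closest to vectors[query_index], excluding itself."""
    candidates = [i for i in range(len(vectors)) if i != query_index]
    return min(
        candidates,
        key=lambda i: cosine_distance(vectors[query_index], vectors[i]),
    )


# Made-up toy corpus: two sports articles and two politics articles.
docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a goal in the football final",
    "the senate passed the budget bill",
    "the president signed the budget bill into law",
]
vectors = tf_idf(docs)
print(nearest_neighbor(0, vectors))
```

Because "the" occurs in every toy document, its tf-idf weight is zero, so the retrieval is driven by the rarer words the documents share.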

Algorithms for retrieval and measuring similarity of documents

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here: clustering-intro-annotated.pdf

Document retrieval: A case study in clustering and measuring similarity35 sec

What is the document retrieval task?1 min

Word count representation for measuring similarity6 min

Prioritizing important words with tf-idf3 min

Calculating tf-idf vectors5 min

Retrieving similar documents using nearest neighbor search2 min

Clustering models and algorithms

Clustering documents task overview2 min

Clustering documents: An unsupervised learning task4 min

k-means: A clustering algorithm3 min

Other examples of clustering6 min

Summary of clustering and similarity

Clustering and similarity ML block diagram7 min

Quiz: Clustering and Similarity6 questions

Document retrieval: IPython Notebook

Download the IPython Notebook used in this lesson to follow along10 min

Loading & exploring Wikipedia data5 min

Exploring word counts5 min

Computing & exploring TF-IDFs7 min

Computing distances between Wikipedia articles5 min

Building & exploring a nearest neighbors model for Wikipedia articles3 min

Examples of document retrieval in action4 min

Programming assignment

Reading: Retrieving Wikipedia articles assignment10 min

Quiz: Retrieving Wikipedia articles9 questions

Machine Learning: Regression

Course can be found here
Summary can be found in my Github

About this course: Case Study - Predicting Housing Prices

In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.

In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data – such as outliers – on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets.

Learning Outcomes: By the end of this course, you will be able to:
-Describe the input and output of a regression model.
-Compare and contrast bias and variance when modeling data.
-Estimate model parameters using optimization algorithms.
-Tune parameters with cross validation.
-Analyze the performance of the model.
-Describe the notion of sparsity and how LASSO leads to sparse solutions.
-Deploy methods to select between models.
-Exploit the model to form predictions.
-Build a regression model to predict prices using a housing dataset.
-Implement these techniques in Python.

Week 1

Welcome

Regression is one of the most important and broadly used machine learning and statistics tools out there. It allows you to make predictions from data by learning the relationship between features of your data and some observed, continuous-valued response. Regression is used in a massive number of applications ranging from predicting stock prices to understanding gene regulatory networks.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

What is this course about?

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

Welcome!1 min
What is the course about?3 min
Outlining the first half of the course5 min
Outlining the second half of the course5 min
Assumed background4 min
Reading: Software tools you’ll need10 min

Simple Linear Regression

Our course starts from the most basic regression model: Just fitting a line to data. This simple model for forming predictions from a single, univariate feature of the data is appropriately called “simple linear regression”.

In this module, we describe the high-level regression task and then specialize these concepts to the simple linear regression case. You will learn how to formulate a simple regression model and fit the model to data using both a closed-form solution as well as an iterative optimization algorithm called gradient descent. Based on this fitted function, you will interpret the estimated model parameters and form predictions. You will also analyze the sensitivity of your fit to outlying observations.

You will examine all of these concepts in the context of a case study of predicting house prices from the square feet of the house.
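As a rough illustration of the two fitting approaches named above, the sketch below fits a line by minimizing the residual sum of squares (RSS), once with the closed-form solution and once with gradient descent. The function names, the step size, the tolerance, and the five data points are assumptions made for this example; they are not taken from the course materials.

```python
def fit_closed_form(x, y):
    """Least squares line via the closed-form solution.

    slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    intercept = mean_y - slope * mean_x
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_gradient_descent(x, y, step_size=0.01, tolerance=1e-8, max_iter=100000):
    """Least squares line via gradient descent on RSS.

    Stops when the magnitude of the gradient falls below `tolerance`.
    The step size is an assumed value that suits this small example only.
    """
    intercept, slope = 0.0, 0.0
    for _ in range(max_iter):
        errors = [(intercept + slope * xi) - yi for xi, yi in zip(x, y)]
        grad_intercept = 2 * sum(errors)
        grad_slope = 2 * sum(e * xi for e, xi in zip(errors, x))
        intercept -= step_size * grad_intercept
        slope -= step_size * grad_slope
        if (grad_intercept ** 2 + grad_slope ** 2) ** 0.5 < tolerance:
            break
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

print(fit_closed_form(x, y))
print(fit_gradient_descent(x, y))
```

Both approaches should agree up to the convergence tolerance. On real data such as square footage, the step size usually needs to be much smaller, or the feature rescaled, for gradient descent to converge.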

Regression fundamentals

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week1_simpleregression-annotated.pdf

A case study in predicting house prices1 min
Regression fundamentals: data & model8 min
Regression fundamentals: the task2 min
Regression ML block diagram4 min

The simple linear regression model, its use, and interpretation

The simple linear regression model2 min
The cost of using a given line6 min
Using the fitted line6 min
Interpreting the fitted line6 min

An aside on optimization: one dimensional objectives

Defining our least squares optimization objective3 min
Finding maxima or minima analytically7 min
Maximizing a 1d function: a worked example2 min
Finding the max via hill climbing6 min
Finding the min via hill descent3 min
Choosing stepsize and convergence criteria6 min

An aside on optimization: multidimensional objectives

Gradients: derivatives in multiple dimensions5 min
Gradient descent: multidimensional hill descent6 min

Finding the least squares line

Computing the gradient of RSS7 min
Approach 1: closed-form solution5 min
Optional reading: worked-out example for closed-form solution10 min
Approach 2: gradient descent7 min
Optional reading: worked-out example for gradient descent10 min
Comparing the approaches1 min

Discussion and summary of simple linear regression

Download notebooks to follow along10 min
Influence of high leverage points: exploring the data4 min
Influence of high leverage points: removing Center City7 min
Influence of high leverage points: removing high-end towns3 min
Asymmetric cost functions3 min
A brief recap1 min
Quiz: Simple Linear Regression7 questions

Programming assignment

Reading: Fitting a simple linear regression model on housing data10 min
Quiz: Fitting a simple linear regression model on housing data4 questions


Week 2 Multiple Regression

The next step in moving beyond simple linear regression is to consider “multiple regression” where multiple features of the data are used to form predictions.

More specifically, in this module, you will learn how to build models of more complex relationships between a single variable (e.g., ‘square feet’) and the observed response (like ‘house sales price’). This includes things like fitting a polynomial to your data, or capturing seasonal changes in the response value. You will also learn how to incorporate multiple input variables (e.g., ‘square feet’, ‘# bedrooms’, ‘# bathrooms’). You will then be able to describe how all of these models can still be cast within the linear regression framework, but now using multiple “features”. Within this multiple regression framework, you will fit models to data, interpret estimated coefficients, and form predictions.

Here, you will also implement a gradient descent algorithm for fitting a multiple regression model.
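To make the idea of "multiple features of one input" concrete, here is a minimal sketch that turns a single input into polynomial features and fits the weights by gradient descent on RSS. The helper names, step size, tolerance, and data are assumptions chosen for illustration, not the assignment's code.

```python
def polynomial_features(x, degree):
    """Map a single input x to the features [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def predict(features, weights):
    """Linear model: weighted sum of the features."""
    return sum(f * w for f, w in zip(features, weights))


def fit_multiple_regression(feature_rows, outputs, step_size, tolerance,
                            max_iter=200000):
    """Gradient descent on RSS for a model with several features.

    The partial derivative of RSS with respect to weight j is
    2 * sum_i (prediction_i - output_i) * feature_ij.
    """
    weights = [0.0] * len(feature_rows[0])
    for _ in range(max_iter):
        errors = [predict(row, weights) - y
                  for row, y in zip(feature_rows, outputs)]
        gradient = [
            2 * sum(e * row[j] for e, row in zip(errors, feature_rows))
            for j in range(len(weights))
        ]
        weights = [w - step_size * g for w, g in zip(weights, gradient)]
        if sum(g * g for g in gradient) ** 0.5 < tolerance:
            break
    return weights


# Made-up data generated from y = 1 + 2x + 3x^2 with no noise.
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
outputs = [1 + 2 * x + 3 * x ** 2 for x in inputs]
feature_rows = [polynomial_features(x, degree=2) for x in inputs]

weights = fit_multiple_regression(feature_rows, outputs,
                                  step_size=0.05, tolerance=1e-9)
print(weights)
```

The model is still linear in the weights even though it is quadratic in the input, which is why the same gradient descent update applies unchanged.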

Multiple features of one input

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week2_multipleregression-annotated.pdf

Multiple regression intro30 sec

Polynomial regression3 min

Modeling seasonality8 min

Where we see seasonality3 min

Regression with general features of 1 input2 min

Incorporating multiple inputs

Motivating the use of multiple inputs4 min

Defining notation3 min

Regression with features of multiple inputs3 min

Interpreting the multiple regression fit7 min

Setting the stage for computing the least squares fit

Optional reading: review of matrix algebra10 min

This section involves some use of matrix algebra. If you’d like to brush up on it, we recommend a short tutorial.

Rewriting the single observation model in vector notation6 min

Rewriting the model for all observations in matrix notation4 min

Computing the cost of a D-dimensional curve9 min

Computing the least squares D-dimensional curve

Computing the gradient of RSS3 min

Approach 1: closed-form solution3 min

Discussing the closed-form solution4 min

Approach 2: gradient descent2 min

Feature-by-feature update9 min

Algorithmic summary of gradient descent approach4 min

Summarizing multiple regression

A brief recap1 min

Quiz: Multiple Regression9 questions

Programming assignment 1

Reading: Exploring different multiple regression models for house price prediction10 min

Quiz: Exploring different multiple regression models for house price prediction8 questions

Programming assignment 2

Numpy tutorial10 min

More information on Numpy, beyond this tutorial, can be found in the Numpy getting started guide.

Reading: Implementing gradient descent for multiple regression10 min

Quiz: Implementing gradient descent for multiple regression5 questions

Week 3 Assessing Performance

Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best performing.

This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the “loss” of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.

The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.
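The select-then-assess procedure described above can be sketched as follows: fit each candidate model on a training set, pick the one with the lowest validation error, and report the error of that single chosen model on a held-out test set. Everything specific here is an assumption for illustration: the 60/20/20 split, the two candidate models, the synthetic data, and the use of mean squared error as the loss.

```python
import random


def mean_squared_error(model, xs, ys):
    """Average squared-error loss of `model` on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Simplest candidate: always predict the mean training output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """More flexible candidate: least squares line (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Synthetic data (an assumption): y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

# 60/20/20 split into training, validation, and test sets.
train = (xs[:60], ys[:60])
valid = (xs[60:80], ys[60:80])
test = (xs[80:], ys[80:])

candidates = {"constant": fit_constant, "line": fit_line}
fitted = {name: fit(*train) for name, fit in candidates.items()}

# Model selection uses the validation set only.
validation_error = {name: mean_squared_error(model, *valid)
                    for name, model in fitted.items()}
best_name = min(validation_error, key=validation_error.get)

# Assessment uses the test set only, and only for the selected model.
test_error = mean_squared_error(fitted[best_name], *test)
print(best_name, validation_error, test_error)
```

Keeping the test set out of the selection step is the point of the three-way split: the validation error of the chosen model is optimistically biased because it was used to make the choice.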

Defining how we assess performance

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week3_assessingperformance-annotated.pdf

Assessing performance intro32 sec

What do we mean by “loss”?4 min

Training error: assessing loss on the training set7 min

Generalization error: what we really want8 min

Test error: what we can actually compute4 min

Defining overfitting2 min

Training/test split1 min

3 sources of error and the bias-variance tradeoff

Irreducible error and bias6 min

Variance and the bias-variance tradeoff6 min

Error vs. amount of data6 min

OPTIONAL ADVANCED MATERIAL: Formally defining and deriving the 3 sources of error

Formally defining the 3 sources of error14 min

Formally deriving why 3 sources of error20 min

Putting the pieces together

Training/validation/test split for model selection, fitting, and assessment7 min

A brief recap1 min

Quiz: Assessing Performance13 questions

Programming assignment

Reading: Exploring the bias-variance tradeoff10 min

Quiz: Exploring the bias-variance tradeoff4 questions


Week 4 Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called “ridge regression”. You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called “cross validation”.

You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.
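A small sketch of the two ideas in this module, under simplifying assumptions: the ridge objective is specialized to one feature with an unpenalized intercept, where the closed-form slope is S_xy / (S_xx + lambda), and the penalty is chosen by k-fold cross validation over a hand-picked list of candidates. The function names, the data, and the candidate penalties are hypothetical.

```python
def fit_ridge_1d(xs, ys, l2_penalty):
    """Ridge fit for one feature with an unpenalized intercept.

    Minimizes sum((y - w0 - w1 * x)^2) + l2_penalty * w1^2, which gives
    w1 = S_xy / (S_xx + l2_penalty) and w0 = mean_y - w1 * mean_x,
    where S_xy and S_xx are computed on centered data.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + l2_penalty)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def k_fold_error(xs, ys, l2_penalty, k):
    """Average squared validation error over k contiguous folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        start, end = fold * n // k, (fold + 1) * n // k
        train_x = xs[:start] + xs[end:]
        train_y = ys[:start] + ys[end:]
        intercept, slope = fit_ridge_1d(train_x, train_y, l2_penalty)
        total += sum((intercept + slope * x - y) ** 2
                     for x, y in zip(xs[start:end], ys[start:end]))
    return total / n


# Made-up data, roughly y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {penalty: k_fold_error(xs, ys, penalty, k=5)
             for penalty in candidates}
best_penalty = min(cv_errors, key=cv_errors.get)

print(best_penalty)
print(fit_ridge_1d(xs, ys, 0.0), fit_ridge_1d(xs, ys, 100.0))
```

Setting the penalty to zero recovers ordinary least squares, and increasing it shrinks the slope toward zero, which matches the two extremes of the ridge objective. In practice the data would be shuffled before forming folds; the contiguous folds here are kept for brevity.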

Characteristics of overfit models

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week4_ridgeregression-annotated.pdf

Symptoms of overfitting in polynomial regression2 min

Download the notebook and follow along10 min

Next, we will see a demo illustrating the concept of overfitting. We recommend you download the IPython Notebook used in the demo to follow along. (The second and third parts of this notebook will be used to demonstrate ridge regression and LASSO; two techniques to address overfitting.)

IPython Notebook:
Overfitting_Demo_Ridge_Lasso.ipynb.zip

Overfitting demo7 min

Overfitting for more general multiple regression models3 min

The ridge objective

Balancing fit and magnitude of coefficients7 min

The resulting ridge objective and its extreme solutions5 min

How ridge regression balances bias and variance1 min

Download the notebook and follow along10 min

Ridge regression demo9 min

The ridge coefficient path4 min

Optimizing the ridge objective

Computing the gradient of the ridge objective5 min

Approach 1: closed-form solution6 min

Discussing the closed-form solution5 min

Approach 2: gradient descent9 min

Tying up the loose ends

Selecting tuning parameters via cross validation3 min

K-fold cross validation5 min

How to handle the intercept6 min

A brief recap1 min

Quiz: Ridge Regression9 questions

Programming Assignment 1

Reading: Observing effects of L2 penalty in polynomial regression10 min

Quiz: Observing effects of L2 penalty in polynomial regression7 questions

Programming Assignment 2

Reading: Implementing ridge regression via gradient descent10 min

Quiz: Implementing ridge regression via gradient descent8 questions

Week 5 Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.
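The following is a minimal sketch of coordinate descent for the lasso objective RSS(w) + lambda * ||w||_1, assuming each feature column has been normalized to unit 2-norm and that the intercept (column 0) is left unpenalized. Under those assumptions each coordinate update is a soft-thresholding step. The function names, data, and penalty value are illustrative only.

```python
def normalize_columns(rows):
    """Scale each feature column to unit 2-norm; also return the norms."""
    n_features = len(rows[0])
    norms = [sum(row[j] ** 2 for row in rows) ** 0.5
             for j in range(n_features)]
    normalized = [[row[j] / norms[j] for j in range(n_features)]
                  for row in rows]
    return normalized, norms


def predict(features, weights):
    return sum(f * w for f, w in zip(features, weights))


def soft_threshold(rho, half_penalty):
    """Shrink rho toward zero; values within the band become exactly 0."""
    if rho < -half_penalty:
        return rho + half_penalty
    if rho > half_penalty:
        return rho - half_penalty
    return 0.0


def lasso_coordinate_descent(rows, outputs, l1_penalty,
                             tolerance=1e-8, max_sweeps=10000):
    """Cycle over coordinates until no weight changes by more than tolerance.

    `rows` must already have unit-norm columns; column 0 is the intercept.
    """
    weights = [0.0] * len(rows[0])
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(len(weights)):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                row[j] * (y - predict(row, weights) + weights[j] * row[j])
                for row, y in zip(rows, outputs)
            )
            new_weight = rho if j == 0 else soft_threshold(rho, l1_penalty / 2)
            largest_change = max(largest_change, abs(new_weight - weights[j]))
            weights[j] = new_weight
        if largest_change < tolerance:
            break
    return weights


# Made-up data: the output depends strongly on x1 and only weakly on x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
outputs = [3.1, 5.9, 9.2, 11.8, 15.1, 17.9]
rows, norms = normalize_columns([[1.0, a, b] for a, b in zip(x1, x2)])

sparse_weights = lasso_coordinate_descent(rows, outputs, l1_penalty=1.0)
dense_weights = lasso_coordinate_descent(rows, outputs, l1_penalty=0.0)
print(sparse_weights)
print(dense_weights)
```

With a zero penalty the updates reduce to coordinate descent for least squares and every weight is generally nonzero; with a positive penalty, weakly relevant features can be set exactly to zero, which is the sparsity that makes lasso a feature selection method. The returned weights are on the normalized scale, so dividing each by its entry in `norms` gives weights for the raw features.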

Feature selection via explicit model enumeration

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week5_lassoregression-annotated.pdf

The feature selection task3 min

All subsets6 min

Complexity of all subsets3 min

Greedy algorithms7 min

Complexity of the greedy forward stepwise algorithm2 min

Feature selection implicitly via regularized regression

Can we use regularization for feature selection?3 min

Thresholding ridge coefficients?4 min

The lasso objective and its coefficient path7 min

Geometric intuition for sparsity of lasso solutions

Visualizing the ridge cost7 min

Visualizing the ridge solution6 min

Visualizing the lasso cost and solution7 min

Download the notebook and follow along10 min

Lasso demo5 min

Setting the stage for solving the lasso

What makes the lasso objective different3 min

Coordinate descent5 min

Normalizing features3 min

Coordinate descent for least squares regression (normalized features)8 min

Optimizing the lasso objective

Coordinate descent for lasso (normalized features)5 min

Assessing convergence and other lasso solvers2 min

Coordinate descent for lasso (unnormalized features)1 min

OPTIONAL ADVANCED MATERIAL: Deriving the lasso coordinate descent update

Deriving the lasso coordinate descent update19 min

Tying up loose ends

Choosing the penalty strength and other practical issues with lasso5 min

A brief recap3 min

Quiz: Feature Selection and Lasso7 questions

Programming Assignment 1

Reading: Using LASSO to select features10 min

Quiz: Using LASSO to select features6 questions

Programming Assignment 2

Reading: Implementing LASSO using coordinate descent10 min

Quiz: Implementing LASSO using coordinate descent8 questions

Week 6

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions—like polynomials and hyperplanes—to the entire dataset. In this module, we instead turn our attention to a class of “nonparametric” methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.

We start by considering a simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.
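The contrast between the two methods can be sketched for a single input feature. The k-nearest neighbors prediction averages the outputs of the k closest training points, while kernel regression takes a weighted average over all training points. The Gaussian kernel, the bandwidth value, the function names, and the data are assumptions made for this example.

```python
import math


def knn_predict(query, xs, ys, k):
    """Average the outputs of the k training points closest to `query`."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k


def gaussian_kernel(distance, bandwidth):
    """Weight that decays smoothly with distance (one common kernel choice)."""
    return math.exp(-(distance ** 2) / (2 * bandwidth ** 2))


def kernel_regression_predict(query, xs, ys, bandwidth):
    """Weighted average of all outputs, weighted by similarity to `query`.

    Note: a very small bandwidth can make every weight underflow to zero
    for queries far from the data; this sketch does not guard against that.
    """
    weights = [gaussian_kernel(abs(x - query), bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Made-up training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]

print(knn_predict(3.1, xs, ys, k=1))
print(knn_predict(3.1, xs, ys, k=3))
print(kernel_regression_predict(3.0, xs, ys, bandwidth=1.0))
```

The number of neighbors k and the kernel bandwidth play similar roles: small values give fits that follow the data closely, and large values give smoother fits.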

Motivating local fits

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week6_NNkernelregression-annotated.pdf

Limitations of parametric regression3 min

Nearest neighbor regression

1-Nearest neighbor regression approach8 min
Distance metrics4 min
1-Nearest neighbor algorithm3 min

k-Nearest neighbors and weighted k-nearest neighbors

k-Nearest neighbors regression7 min
k-Nearest neighbors in practice3 min
Weighted k-nearest neighbors4 min

Kernel regression

From weighted k-NN to kernel regression6 min
Global fits of parametric models vs. local fits of kernel regression6 min

k-NN and kernel regression wrapup

Performance of NN as amount of data grows7 min
Issues with high-dimensions, data scarcity, and computational complexity3 min
k-NN for classification1 min
A brief recap1 min
Quiz: Nearest Neighbors & Kernel Regression7 questions

Programming Assignment

Reading: Predicting house prices using k-nearest neighbors regression10 min
Quiz: Predicting house prices using k-nearest neighbors regression8 questions

Closing Remarks

In the conclusion of the course, we will recap what we have covered. This represents both techniques specific to regression, as well as foundational machine learning concepts that will appear throughout the specialization. We also briefly discuss some important regression techniques we did not cover in this course.

We conclude with an overview of what’s in store for you in the rest of the specialization.

What we’ve learned

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

closing.pdf

Simple and multiple regression 4 min
Assessing performance and ridge regression 7 min
Feature selection, lasso, and nearest neighbor regression 4 min

Summary and what’s ahead in the specialization

What we covered and what we didn’t cover 5 min
Thank you! 1 min

Machine Learning: Classification

Course can be found here
Lecture slides can be found here
Summary can be found in my Github

About this course: Case Studies: Analyzing Sentiment & Loan Default Prediction

In our case study on analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). In our second case study for this course, loan default prediction, you will tackle financial data, and predict when a loan is likely to be risky or safe for the bank. These tasks are examples of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification.

In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent. You will implement these techniques on real-world, large-scale machine learning tasks. You will also address significant tasks you will face in real-world applications of ML, including handling missing data and measuring precision and recall to evaluate a classifier. This course is hands-on, action-packed, and full of visualizations and illustrations of how these techniques will behave on real data. We’ve also included optional content in every module, covering advanced topics for those who want to go even deeper!

Learning Objectives: By the end of this course, you will be able to:
- Describe the input and output of a classification model.
- Tackle both binary and multiclass classification problems.
- Implement a logistic regression model for large-scale classification.
- Create a non-linear model using decision trees.
- Improve the performance of any model using boosting.
- Scale your methods with stochastic gradient ascent.
- Describe the underlying decision boundaries.
- Build a classification model to predict sentiment in a product review dataset.
- Analyze financial data to predict loan defaults.
- Use techniques for handling missing data.
- Evaluate your models using precision-recall metrics.
- Implement these techniques in Python (or in the language of your choice, though Python is highly recommended).

Week 1

Welcome !

Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x. Through this course, you will become familiar with the fundamental models and algorithms used in classification, as well as a number of core machine learning concepts. Rather than covering all aspects of classification, you will focus on a few core techniques, which are widely used in the real world to get state-of-the-art performance. By following our hands-on approach, you will implement your own algorithms on multiple real-world tasks, and deeply grasp the core techniques needed to be successful with these approaches in practice. This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

Welcome to the course

Important Update regarding the Machine Learning Specialization 10 min

Hello Machine Learning learners,

Please know that due to unforeseen circumstances, courses 5 and 6 - Recommender Systems & Dimensionality Reduction and An Intelligent Application with Deep Learning - will not be launching as part of the Machine Learning Specialization. We understand this may come as very disappointing news and we’re deeply sorry for this inconvenience. If you have paid for these courses or have received financial aid from Coursera, you will remain eligible to earn your Specialization Certificate upon successfully completing courses 1-4 of the Specialization. If you paid for courses 5 & 6 via a pre-payment toward the Specialization, Coursera has provided you with free access to two other courses offered by the University of Washington: Computational Neuroscience and Data Manipulation at Scale: Systems and Algorithms. An email has been sent out with specific instructions on how to enroll in these courses. If you individually paid for either x or y course, you will receive a refund within the next two weeks.

If you have any questions or would like to request a refund, please feel free to contact Coursera’s 24/7 learner support team via the Request a Refund article in the Learner Help Center. The last day to request a refund will be April 30, 2017. We value you as a Coursera learner and want to ensure that your experience with the Machine Learning Specialization remains a positive one.

Regards,

The Coursera Team

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

Welcome to the classification course, a part of the Machine Learning Specialization 1 min

https://www.coursera.org/learn/ml-classification/lecture/YMpzf/welcome-to-the-classification-course-a-part-of-the-machine-learning

What is this course about? 6 min

https://www.coursera.org/learn/ml-classification/lecture/qZhKx/what-is-this-course-about

Impact of classification 1 min

https://www.coursera.org/learn/ml-classification/lecture/OnpWH/impact-of-classification

Course overview and details

Course overview 3 min

https://www.coursera.org/learn/ml-classification/lecture/84fuF/course-overview

Outline of first half of course 5 min

https://www.coursera.org/learn/ml-classification/lecture/LyubT/outline-of-first-half-of-course

Outline of second half of course 5 min

https://www.coursera.org/learn/ml-classification/lecture/z1g9k/outline-of-second-half-of-course

Assumed background 3 min

https://www.coursera.org/learn/ml-classification/lecture/IindM/assumed-background

Let’s get started! 45 sec

https://www.coursera.org/learn/ml-classification/lecture/AktDn/lets-get-started

Reading: Software tools you’ll need 10 min
Software tools you’ll need for this course

How this specialization was designed. The learning approach in this specialization is to start from use cases and then dig into algorithms and methods, what we call a case-studies approach. We are very excited about this approach, since it has worked well in several other courses. The first course, Machine Learning: Foundations, was focused on understanding how ML can be used in various case studies. The second course, Machine Learning: Regression, was focused on models that predict a continuous value from input features. The follow-on courses will dig into more details of algorithms and methods of other ML areas. We expect all learners to have taken the first and second courses before taking this course.

Classification - A Machine Learning Approach. This course focuses on classification, one of the most important types of data analysis, with a wide range of applications. After successfully completing this course, you will be able to use classification methods in practice, implement some of the most fundamental algorithms in this area, and choose the right model for your task. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent.

Programming assignment format

Almost every module will be associated with one or two programming assignments. The goal of these assignments is to have hands-on experience on the techniques we discuss in lectures. To test your implementations, you will be asked questions in a quiz following the assignment.

You will be implementing core classification techniques or other ML concepts from scratch in most modules. In a few modules, you will also explore fundamental ML concepts, such as regularization or precision-recall, using existing implementations of ML algorithms, with the goal of gaining proficiency in the ML concepts.

Why Python

In this course, we are going to use the Python programming language to build several intelligent applications that use machine learning. Python is a simple scripting language that makes it easy to interact with data. Furthermore, Python has a wide range of packages that make it easy to get started and build applications, from the simplest ones to the most complex. Python is widely used in industry, and is becoming the de facto language for data science. (R is another alternative language. However, R tends to be significantly less scalable and has very few deployment tools, thus it is seldom used for production code in industry. It is possible, but discouraged, to use R in this specialization.)

We will also encourage the use of the IPython Notebook in our assignments. The IPython Notebook is a simple interactive environment for programming with Python, which makes it really easy to share your results. Think about it as a combination of a Python terminal and a wiki page. Thus, you can combine code, plots and text to explain what you did. (You are not required to use IPython Notebook in the assignments, and should have no problem using straight up Python if you prefer.)

Useful software tools

Although you will be implementing algorithms from scratch in various assignments, some software tools will be useful in the process. In particular, there are four types of data tools that would be helpful:

  • Data manipulation: to help you slice-and-dice the data, create new features, and clean the data.
  • Matrix operations: in the inner loops of your algorithms, you will do various matrix operations, and libraries focused on these will speed up your code significantly.
  • Plotting library: so you can visualize data and models.
  • Pre-implemented ML algorithms: in some assignments where we are focusing on exploring ML classification models, you will use pre-implemented ML algorithms to help focus your efforts on the fundamentals.

1. Tools for data manipulation

For data manipulation, we recommend using SFrame, an open-source, highly-scalable Python library for data manipulation. An alternative is the Pandas library. A huge advantage of SFrame over Pandas is that with SFrame, you are not limited to datasets that fit in memory, which allows you to deal with large datasets, even on a laptop. (The SFrame API is very similar to Pandas’ API. Here is a doc showing the relationship between the two of them.)

2. Tools for matrix operations

For matrix operations, we strongly recommend Numpy, an open-source Python library that provides fast performance, for data that fits in memory.

3. Tools for plotting

For plotting, we strongly recommend you use Matplotlib, an open-source Python library with extensive plotting functionality.

4. Tools with pre-implemented ML algorithms

For the few assignments where you will be using pre-implemented ML algorithms, we recommend you use GraphLab Create, which we used in the first and second courses. It is a package we have been working on for many years now, and it has seen an exciting adoption curve, especially in industry with folks building real applications. A popular alternative is scikit-learn. GraphLab Create is more scalable than scikit-learn and simpler to use when your data is not numeric vectors. On the other hand, scikit-learn is open-source.

In this course, most of the assignments are about implementing algorithms from scratch, so this choice is more flexible than in the first course. We are happy, however, for you to use any tool(s) of your liking. As you will notice, we are only grading the output of your programs, so the specific software tool is not the focus of the course. More details on using other tools are at the end of this doc.

It’s important to emphasize that this specialization is not about providing training for a specific software package. The goal of the specialization is for your effort to be spent on learning the fundamental concepts and algorithms behind machine learning in a hands-on fashion. These concepts transcend any single package. What you learn here you can use whether you write code from scratch, use any existing ML packages out there, or any that may be developed in the future. We are happy to hear that so many of you are enjoying this approach so far!

5. Licenses for SFrame & GraphLab Create

The SFrame package is available as open source under a permissive BSD license. So, you will always be able to use SFrames for free. GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. The reason we suggest you use GraphLab Create for this course is that this software will make it much easier for you to see machine learning in action and will help you complete your assignments quickly.

Upgrade GraphLab Create

If you are using GraphLab Create and already have it installed, please make sure you upgrade to the latest version! The simplest way to do this is to:

Open the GraphLab Launcher.
Click on ‘TERMINAL’.
On the terminal window, type:
pip install --upgrade graphlab-create

Resources

These are some good resources you can explore, if you are using the recommended software tools:

In the first course of this ML specialization, Machine Learning Foundations, we provided many tutorials and getting started guides. We recommend you go over those before tackling this course.
There are many Python resources available online. Here is a good place for documentation.
For SFrame & GraphLab Create, there is also a lot of information available online. Here are some starting points: the User Guide and detailed API docs.
For Numpy, here is a getting started guide. We will also provide a tutorial when it’s time to use it.

If you choose to use the recommended tools, you have two options: downloading and installing the required software or using a prepackaged version on a free instance on Amazon EC2.

1. Option 1: Downloading and installing all software on your own machine

Download and install Python, IPython Notebook, Numpy, SFrame and GraphLab Create. You can find the instructions here.

2. Option 2: Using a free Amazon EC2 instance with all the software pre-installed

If you do not have a 64-bit computer, you will not be able to run GraphLab Create. Additionally, some of you may want a simple experience where you don’t have to download the course content and install everything locally. Here, we’ll address these situations!

Amazon EC2 offers free cloud computing hours with what they call micro instances. These instances are all we need to do the work for this course. We have created an image for one such instance that is easy to launch and contains all the course content. This will allow you to run everything you need for this course in the cloud for free, without having to install anything locally. (You do need to create an Amazon EC2 account and have internet access.)

You can find step-by-step instructions here:

https://turi.com/download/install-graphlab-create-aws-coursera.html

We note that installing all the software on your own local machine may be the right option for most people, especially since you can then run everything locally without needing to be online to do the homework. But the option using Amazon EC2 should be a great alternative.

Github repository with starter code

In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:

https://github.com/learnml/machine-learning-specialization

Using other software packages

We strongly encourage you to use the recommended software packages for this course, since they will allow you to learn the fundamental concepts more quickly. However, you are welcome to use others. Here are a few notes if you do so.

1. Installing other software tools

In the instructions above, you will be using the GraphLab Launcher, which will automatically install Python, IPython Notebook, Numpy, Matplotlib, SFrame and GraphLab Create. If you don’t use the GraphLab Launcher, you will need to install each of these tools separately, by following the pages linked above. Anaconda is a good tool to help simplify some of this installation.

2. If you are using SFrame, but not GraphLab Create

GraphLab Create uses SFrame under the hood, but you can use just SFrame for most assignments. If you choose to do so, in the starter code for the assignments, you should change the line

import graphlab

to

import sframe

and everything should work with just some small modifications, e.g., the calls:

graphlab.SFrame(...)

will become

sframe.SFrame(...)

3. If you are using other software tools out there

You are welcome to use other packages, e.g., scikit-learn instead of GraphLab Create, or Pandas instead of SFrame, or even R instead of Python. If you choose to use all these different packages, we will provide the datasets (in standard CSV format) and the assignment questions will not depend specifically on the recommended tools.

Linear Classifiers & Logistic Regression

Linear classifiers are amongst the most practical classification methods. For example, in our sentiment analysis case-study, a linear classifier associates a coefficient with the counts of each word in the sentence. In this module, you will become proficient in this type of representation. You will focus on a particularly useful type of linear classifier called logistic regression, which, in addition to allowing you to predict a class, provides a probability associated with the prediction. These probabilities are extremely useful, since they provide a degree of confidence in the predictions. In this module, you will also be able to construct features from categorical inputs, and to tackle classification problems with more than two classes (multiclass problems). You will examine the results of these techniques on a real-world product sentiment analysis task.
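As a rough illustration of the representation described above, the sketch below scores a sentence from its word counts and turns the score into a probability with the sigmoid function. The three words, their coefficient values, and the `predict` helper are hypothetical and chosen only for the example.

```python
import numpy as np

def sigmoid(score):
    """Squeeze a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical vocabulary and coefficients (illustrative values only).
words = ["awesome", "awful", "great"]
coefficients = np.array([1.0, -1.5, 1.2])
intercept = 0.0

def predict(word_counts):
    """Return (score, P(y=+1|x), predicted class) for a vector of word counts."""
    score = intercept + np.dot(coefficients, word_counts)
    prob_positive = sigmoid(score)
    label = +1 if score > 0 else -1   # linear classifier: class is the sign of the score
    return score, prob_positive, label

# A sentence where "awesome" appears twice, "awful" once, and "great" not at all.
score, prob_positive, label = predict(np.array([2, 1, 0]))
prob_negative = 1.0 - prob_positive   # the two class probabilities sum to 1
print(score)
print(prob_positive)
print(label)
```

A positive score always maps to a probability above 0.5, so thresholding the probability at 0.5 and taking the sign of the score give the same predicted class.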

Linear classifiers

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-model-annotated.pdf

Linear classifiers: A motivating example 2 min

https://www.coursera.org/learn/ml-classification/lecture/HNKIj/linear-classifiers-a-motivating-example

Intuition behind linear classifiers 3 min

https://www.coursera.org/learn/ml-classification/lecture/lCBwS/intuition-behind-linear-classifiers

Decision boundaries 3 min

https://www.coursera.org/learn/ml-classification/lecture/NIdE0/decision-boundaries

Linear classifier model 5 min

https://www.coursera.org/learn/ml-classification/lecture/XBc9n/linear-classifier-model

Effect of coefficient values on decision boundary 2 min

https://www.coursera.org/learn/ml-classification/lecture/Qy2js/effect-of-coefficient-values-on-decision-boundary

Using features of the inputs 2 min

https://www.coursera.org/learn/ml-classification/lecture/WHIMY/using-features-of-the-inputs

Class probabilities

Predicting class probabilities 1 min

https://www.coursera.org/learn/ml-classification/lecture/j4Ji0/predicting-class-probabilities

Review of basics of probabilities 6 min

https://www.coursera.org/learn/ml-classification/lecture/p6rtM/review-of-basics-of-probabilities

Review of basics of conditional probabilities 8 min

https://www.coursera.org/learn/ml-classification/lecture/Cun2N/review-of-basics-of-conditional-probabilities

Using probabilities in classification 2 min

https://www.coursera.org/learn/ml-classification/lecture/f0nhO/using-probabilities-in-classification

Logistic regression

Predicting class probabilities with (generalized) linear models 5 min

https://www.coursera.org/learn/ml-classification/lecture/OV5Kt/predicting-class-probabilities-with-generalized-linear-models

The sigmoid (or logistic) link function

https://www.coursera.org/learn/ml-classification/lecture/KXvGC/the-sigmoid-or-logistic-link-function

Logistic regression model 5 min

https://www.coursera.org/learn/ml-classification/lecture/OJQXu/logistic-regression-model

Effect of coefficient values on predicted probabilities 7 min

https://www.coursera.org/learn/ml-classification/lecture/JkEEH/effect-of-coefficient-values-on-predicted-probabilities

Overview of learning logistic regression models 2 min

https://www.coursera.org/learn/ml-classification/lecture/GuxAJ/overview-of-learning-logistic-regression-models

Practical issues for classification

Encoding categorical inputs 4 min

https://www.coursera.org/learn/ml-classification/lecture/kCY0D/encoding-categorical-inputs

Multiclass classification with 1 versus all 7 min

https://www.coursera.org/learn/ml-classification/lecture/N7QA6/multiclass-classification-with-1-versus-all

Summarizing linear classifiers & logistic regression

Recap of logistic regression classifier 1 min

https://www.coursera.org/learn/ml-classification/lecture/laPcB/recap-of-logistic-regression-classifier

Quiz: Linear Classifiers & Logistic Regression 5 questions

QUIZ
Linear Classifiers & Logistic Regression
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 20, 11:59 PM PDT

1 point
1. (True/False) A linear classifier assigns the predicted class based on the sign of Score(x) = w^T h(x).

1 point
2. (True/False) For a conditional probability distribution over y|x, where y takes on two values (+1, -1, i.e. good review, bad review), P(y=+1|x) + P(y=−1|x) = 1.

1 point
3. Which function does logistic regression use to “squeeze” the real line to [0, 1]?

1 point
4. If Score(x) = w^T h(x) > 0, which of the following is true about P(y=+1|x)?

1 point
5. Consider training a 1 vs. all multiclass classifier for the problem of digit recognition using logistic regression. There are 10 digits, thus there are 10 classes. How many logistic regression classifiers will we have to train?
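For reference, the 1 versus all scheme can be sketched as follows: one binary logistic regression model is trained per class (that class against everything else), and the predicted class is the one whose model reports the highest probability. The helper names, the plain gradient ascent trainer, and the three-class toy data are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def train_binary_logistic(H, is_positive, step_size=0.01, n_iter=500):
    """Plain gradient ascent on the log likelihood; is_positive holds 0/1 indicators."""
    w = np.zeros(H.shape[1])
    for _ in range(n_iter):
        errors = is_positive - sigmoid(H.dot(w))
        w = w + step_size * H.T.dot(errors)
    return w

def train_one_vs_all(H, labels):
    """Train one binary classifier per class: that class versus all the others."""
    return {c: train_binary_logistic(H, (labels == c).astype(float))
            for c in np.unique(labels)}

def predict_one_vs_all(models, H):
    """Pick, for each row of H, the class whose classifier gives the highest probability."""
    classes = sorted(models)
    probs = np.column_stack([sigmoid(H.dot(models[c])) for c in classes])
    return np.array(classes)[probs.argmax(axis=1)]

# Toy data made up for illustration: column 0 is the constant (intercept) feature.
H = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.2],
              [1.0, 5.0, 0.0], [1.0, 5.2, 0.3],
              [1.0, 0.0, 5.0], [1.0, 0.4, 5.1]])
labels = np.array([0, 0, 1, 1, 2, 2])

models = train_one_vs_all(H, labels)
predictions = predict_one_vs_all(models, H)
print(len(models))
print(predictions)
```

The number of trained models always equals the number of distinct classes in `labels`.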

Programming Assignment

Predicting sentiment from product reviews 10 min
Quiz: Predicting sentiment from product reviews 12 questions

QUIZ
Predicting sentiment from product reviews
12 questions
To Pass: 70% or higher
Attempts: 3 every 8 hours
Deadline: August 20, 11:59 PM PDT

1 point
1. How many weights are greater than or equal to 0?

1 point
2. Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?

1 point
3. Which of the following products are represented in the 20 most positive reviews?

1 point
4. Which of the following products are represented in the 20 most negative reviews?

1 point
5. What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

1 point
6. Does a higher accuracy value on the training_data always imply that the classifier is better?

1 point
7. Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words.

How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?

1 point
8. Are the positive words in the simple_model also positive words in the sentiment_model?

1 point
9. Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

1 point
10. Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

1 point
11. Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

1 point
12. Is the sentiment_model definitely better than the majority class classifier (the baseline)?
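The majority class classifier used as a baseline in the last two questions always predicts the most common label in the training data. A minimal sketch, with made-up labels and hypothetical helper names (this is not the assignment's data or starter code):

```python
from collections import Counter

def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    return correct / float(len(true_labels))

def majority_class(train_labels):
    """Most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

# Made-up labels for illustration (+1 = positive review, -1 = negative review).
train_labels = [+1, +1, +1, -1, +1, -1, +1, +1]
test_labels = [+1, -1, +1, +1, -1]

baseline_label = majority_class(train_labels)
baseline_predictions = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)
print(baseline_label)
print(baseline_accuracy)
```

A learned classifier is only useful if its test accuracy clearly exceeds this baseline, which can already be high when the classes are imbalanced.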



Week 2

Learning Linear Classifiers

Once familiar with linear classifiers and logistic regression, you can now dive in and write your first learning algorithm for classification. In particular, you will use gradient ascent to learn the coefficients of your classifier from data. You first will need to define the quality metric for these tasks using an approach called maximum likelihood estimation (MLE). You will also become familiar with a simple technique for selecting the step size for gradient ascent. An optional, advanced part of this module will cover the derivation of the gradient for logistic regression. You will implement your own learning algorithm for logistic regression from scratch, and use it to learn a sentiment analysis classifier.
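The learning procedure described above can be sketched as follows: the quality metric is the log likelihood of the data, and gradient ascent repeatedly steps the coefficients in the direction of its gradient. The function names, the fixed step size, the number of iterations, and the tiny dataset are assumptions for illustration, not the assignment's starter code.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def log_likelihood(H, y, w):
    """Sum over data points of log P(y_i | x_i, w), with labels y_i in {+1, -1}."""
    return -np.sum(np.log(1.0 + np.exp(-y * H.dot(w))))

def gradient(H, y, w):
    """d(log likelihood)/dw_j = sum_i h_j(x_i) * (1[y_i = +1] - P(y=+1 | x_i, w))."""
    indicator = (y == 1).astype(float)
    return H.T.dot(indicator - sigmoid(H.dot(w)))

def gradient_ascent(H, y, step_size=0.1, n_iter=100):
    """Start from all-zero coefficients and repeatedly step uphill."""
    w = np.zeros(H.shape[1])
    history = []
    for _ in range(n_iter):
        w = w + step_size * gradient(H, y, w)
        history.append(log_likelihood(H, y, w))
    return w, history

# Small toy dataset: column 0 is the constant feature for the intercept.
H = np.array([[1.0, 2.5], [1.0, 0.3], [1.0, 2.8], [1.0, 0.5]])
y = np.array([1, -1, 1, 1])

w, history = gradient_ascent(H, y)
print(w)
print(history[0])
print(history[-1])
```

With a sufficiently small step size the recorded log likelihood never decreases from one iteration to the next; a step size that is too large can make it oscillate or diverge, which is why the module spends time on choosing it.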

Maximum likelihood estimation

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-learning-annotated.pdf

Goal: Learning parameters of logistic regression 2 min

https://www.coursera.org/learn/ml-classification/lecture/uxALW/goal-learning-parameters-of-logistic-regression

Intuition behind maximum likelihood estimation 4 min
Data likelihood 8 min
Finding best linear classifier with gradient ascent 3 min

Gradient ascent algorithm for learning logistic regression classifier

Review of gradient ascent 6 min
Learning algorithm for logistic regression 3 min
Example of computing derivative for logistic regression 5 min
Interpreting derivative for logistic regression 5 min
Summary of gradient ascent for logistic regression 2 min

Choosing step size for gradient ascent/descent

Choosing step size 5 min
Careful with step sizes that are too large 4 min
Rule of thumb for choosing step size 3 min

(VERY OPTIONAL LESSON) Deriving gradient of logistic regression

(VERY OPTIONAL) Deriving gradient of logistic regression: Log trick 4 min
(VERY OPTIONAL) Expressing the log-likelihood 3 min
(VERY OPTIONAL) Deriving probability y=-1 given x 2 min
(VERY OPTIONAL) Rewriting the log likelihood into a simpler form 8 min
(VERY OPTIONAL) Deriving gradient of log likelihood 8 min

Summarizing learning linear classifiers

Recap of learning logistic regression classifiers 1 min
Quiz: Learning Linear Classifiers 6 questions

QUIZ
Learning Linear Classifiers
6 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT

1 point
1. (True/False) A linear classifier can only learn positive coefficients.

1 point
2. (True/False) In order to train a logistic regression model, we find the weights that maximize the likelihood of the model.

1 point
3. (True/False) The data likelihood is the product of the probability of the inputs x given the weights w and response y.

1 point
4. Questions 4 and 5 refer to the following scenario.

Consider the setting where our inputs are 1-dimensional. We have data

x     y
2.5   +1
0.3   -1
2.8   +1
0.5   +1

and the current estimates of the weights are w0=0 and w1=1. (w0: the intercept, w1: the weight for x).

Calculate the likelihood of this data. Round your answer to 2 decimal places.



import numpy as np

# Data from Question 4; y uses 1 for the +1 class and 0 for the -1 class.
x = np.array([2.5, 0.3, 2.8, 0.5])
y = np.array([1, 0, 1, 1])
w0 = 0  # intercept
w1 = 1  # weight for x

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def ypre(x, w0, w1):
    # P(y=+1 | x, w) for a single 1-dimensional input
    score = w0 + x * w1
    return sigmoid(score)

# Likelihood: multiply P(y=+1|x) for positive points and 1 - P(y=+1|x) for negative ones.
lik = 1.0
for i in range(len(x)):
    p = ypre(x[i], w0, w1)
    lik *= p if y[i] == 1 else (1 - p)
print("%.12g" % lik)
# 0.230765141474



1 point
5. Refer to the scenario given in Question 4 to answer the following:

Calculate the derivative of the log likelihood with respect to w1. Round your answer to 2 decimal places.



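def der(hx, ytrue, yhat):
    # Contribution of one data point to d(log likelihood)/dw1
    return hx * (ytrue - yhat)

# Uses x, y, w0, w1 and ypre from the Question 4 code.
total = 0.0
for i in range(len(x)):
    total += der(x[i], y[i], ypre(x[i], w0, w1))
print("%.12g" % total)
# 0.366590721926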



1 point
6. Which of the following is true about gradient ascent? Select all that apply.



Programming Assignment

Implementing logistic regression from scratch 10 min
Quiz: Implementing logistic regression from scratch 8 questions

QUIZ
Implementing logistic regression from scratch
8 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT

1 point
1. How many reviews in amazon_baby_subset.gl contain the word perfect?

1 point
2. Consider the feature_matrix that was obtained by converting our data to NumPy format.

How many features are there in the feature_matrix?

1 point
3. Assuming that the intercept is present, how does the number of features in feature_matrix relate to the number of features in the logistic regression model? Let x = [number of features in feature_matrix] and y = [number of features in logistic regression model].

1 point
4. Run your logistic regression solver with provided parameters.

As each iteration of gradient ascent passes, does the log-likelihood increase or decrease?

1 point
5. We make predictions using the weights just learned.

How many reviews were predicted to have positive sentiment?

1 point
6. What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)

1 point
7. We look at “most positive” words, the words that correspond most strongly with positive reviews.

Which of the following words is not present in the top 10 “most positive” words?

1 point
8. Similarly, we look at “most negative” words, the words that correspond most strongly with negative reviews.

Which of the following words is not present in the top 10 “most negative” words?



Overfitting & Regularization in Logistic Regression

As we saw in the regression course, overfitting is perhaps the most significant challenge you will face as you apply machine learning approaches in practice. This challenge can be particularly significant for logistic regression, as you will discover in this module, since you not only risk getting an overly complex decision boundary, but your classifier can also become overly confident about the probabilities it predicts. In this module, you will investigate overfitting in classification in significant detail, and obtain broad practical insights from some interesting visualizations of the classifiers’ outputs. You will then add a regularization term to your optimization to mitigate overfitting. You will investigate both L2 regularization to penalize large coefficient values, and L1 regularization to obtain additional sparsity in the coefficients. Finally, you will modify your gradient ascent algorithm to learn regularized logistic regression classifiers. You will implement your own regularized logistic regression classifier from scratch, and investigate the impact of the L2 penalty on real-world sentiment analysis data.
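The modification to gradient ascent mentioned above can be sketched as follows: the objective becomes the log likelihood minus an L2 penalty times the squared magnitude of the coefficients, so each gradient step gains a shrinkage term. The function name, the choice not to penalize the intercept, the step size, and the toy data are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def l2_regularized_logistic_regression(H, y, l2_penalty, step_size=0.01, n_iter=1000):
    """Gradient ascent on: log likelihood - l2_penalty * ||w[1:]||^2.

    Assumes column 0 of H is the constant feature, so the intercept w[0]
    is left unpenalized (an assumption of this sketch).
    """
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_iter):
        grad = H.T.dot(indicator - sigmoid(H.dot(w)))  # gradient of the log likelihood
        grad[1:] -= 2.0 * l2_penalty * w[1:]           # gradient of the L2 penalty term
        w = w + step_size * grad
    return w

# Toy data made up for illustration: column 0 is the intercept feature.
H = np.array([[1.0, 2.5], [1.0, 0.3], [1.0, 2.8], [1.0, 0.5]])
y = np.array([1, -1, 1, 1])

w_none = l2_regularized_logistic_regression(H, y, l2_penalty=0.0)
w_strong = l2_regularized_logistic_regression(H, y, l2_penalty=10.0)
print(w_none)
print(w_strong)
```

On this toy data the coefficient on the input feature ends up much smaller in magnitude with the larger penalty, which is the shrinkage effect that counteracts overconfident predictions.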

Overfitting in classification

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-overfitting-annotated.pdf

Evaluating a classifier 3 min

https://www.coursera.org/learn/ml-classification/lecture/RzxaQ/evaluating-a-classifier

Review of overfitting in regression 3 min
Overfitting in classification 5 min
Visualizing overfitting with high-degree polynomial features 3 min

Overconfident predictions due to overfitting

Overfitting in classifiers leads to overconfident predictions 5 min
Visualizing overconfident predictions 4 min
(OPTIONAL) Another perspective on overfitting in logistic regression 8 min

L2 regularized logistic regression

Penalizing large coefficients to mitigate overfitting 5 min
L2 regularized logistic regression 4 min
Visualizing effect of L2 regularization in logistic regression 5 min
Learning L2 regularized logistic regression with gradient ascent 7 min

Sparse logistic regression

Sparse logistic regression with L1 regularization7 min

Summarizing overfitting & regularization in logistic regression

Recap of overfitting & regularization in logistic regression58 sec
Quiz: Overfitting & Regularization in Logistic Regression8 questions

QUIZ
Overfitting & Regularization in Logistic Regression
8 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT

1 point
1.Consider four classifiers, whose classification performance is given by the following table:

              Classification error    Classification error
              on training set         on validation set
Classifier 1  0.2                     0.6
Classifier 2  0.8                     0.6
Classifier 3  0.2                     0.2
Classifier 4  0.5                     0.4

Which of the four classifiers is most likely overfit?




1 point
2.Suppose a classifier classifies 23100 examples correctly and 1900 examples incorrectly. Compute error by hand. Round your answer to 3 decimal places.




1 point
3.(True/False) Accuracy and error measured on the same dataset always sum to 1.
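
Questions 2 and 3 rest on the definitions error = mistakes / total and accuracy = correct / total. A minimal sketch of both, using the counts from Question 2 (the function names are chosen for this example only):

```python
def classification_error(num_correct, num_incorrect):
    """Fraction of examples that were classified incorrectly."""
    return num_incorrect / (num_correct + num_incorrect)


def accuracy(num_correct, num_incorrect):
    """Fraction of examples that were classified correctly."""
    return num_correct / (num_correct + num_incorrect)


error = classification_error(23100, 1900)
acc = accuracy(23100, 1900)
print(round(error, 3))  # 0.076
print(round(acc, 3))    # 0.924
```

Because both quantities share the same denominator and every example is counted as either correct or incorrect, the two values computed on the same dataset add up to 1.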




1 point
4.Which of the following is NOT a correct description of complex models?




1 point
5.Which of the following is a symptom of overfitting in the context of logistic regression? Select all that apply.




1 point
6.Suppose we perform L2 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.

Note. Assume that the algorithm runs for a wide range of L2 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.




1 point
7.Suppose we perform L1 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.

Note. Assume that the algorithm runs for a wide range of L1 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.




1 point
8.In the context of L2 regularized logistic regression, which of the following occurs as we increase the L2 penalty λ? Choose all that apply.



Programming Assignment

Logistic Regression with L2 regularization 10 min
Quiz: Logistic Regression with L2 regularization 8 questions

QUIZ
Logistic Regression with L2 regularization
8 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT

1 point
1.In the function feature_derivative_with_L2, was the intercept term regularized?




1 point
2.Does the term with L2 regularization increase or decrease the log likelihood ℓℓ(w)?




1 point
3.Which of the following words is not listed in either positive_words or negative_words?




1 point
4.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.

(True/False) All coefficients consistently get smaller in size as the L2 penalty is increased.




1 point
5.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.

(True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for ‘cat’ was more positive than that for ‘dog’, this remains true as the L2 penalty increases.)




1 point
6.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Which of the following models has the highest accuracy on the training data?




1 point
7.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Which of the following models has the highest accuracy on the validation data?




1 point
8.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Does the highest accuracy on the training data imply that the model is the best one?



Week 3 Decision Trees

Along with linear classifiers, decision trees are amongst the most widely used classification techniques in the real world. This method is extremely intuitive, simple to implement and provides interpretable predictions.

  • In this module, you will become familiar with the core decision trees representation.
  • You will then design a simple, recursive greedy algorithm to learn decision trees from data.
  • Finally, you will extend this approach to deal with continuous inputs, a fundamental requirement for practical problems.
  • In this module, you will investigate a brand new case-study in the financial sector: predicting the risk associated with a bank loan. You will implement your own decision tree learning algorithm on real loan data.

Intuition behind decision trees

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-annotated.pdf

Predicting loan defaults with decision trees 3 min

Intuition behind decision trees 1 min

Task of learning decision trees from data 3 min

Learning decision trees

Recursive greedy algorithm 4 min

Learning a decision stump 3 min

Selecting best feature to split on 6 min

When to stop recursing 4 min

Using the learned decision tree

Making predictions with decision trees 1 min

Multiclass classification with decision trees 2 min

Learning decision trees with continuous inputs

Threshold splits for continuous inputs 6 min

(OPTIONAL) Picking the best threshold to split on 3 min

Visualizing decision boundaries 5 min

Summarizing decision trees

Recap of decision trees 56 sec

Quiz: Decision Trees 11 questions

QUIZ
Decision Trees
11 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 3, 11:59 PM PDT

1 point
1.Questions 1 to 6 refer to the following common scenario:

Consider the following dataset:

x1 x2 x3 y
1 1 1 +1
0 1 0 -1
1 0 1 -1
0 0 1 +1

Let us train a decision tree with this data. Let’s call this tree T1. What feature will we split on at the root?




Classification error after splitting on each feature:
x1: 0.5
x2: 0.5
x3: 0.25
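
The per-feature values noted above can be reproduced with a short sketch. It assumes each side of a split predicts its majority class and that features are binary (0 or 1); the helper names are made up for this example.

```python
from collections import Counter


def node_mistakes(labels):
    """Mistakes made by predicting the majority class at one node."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def split_error(rows, labels, feature_index):
    """Classification error after splitting on one binary feature."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature_index] == 1]
    return (node_mistakes(left) + node_mistakes(right)) / len(labels)


# Dataset from Question 1: columns are (x1, x2, x3), labels are y.
rows = [(1, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 1)]
labels = [+1, -1, -1, +1]

errors = {f"x{j + 1}": split_error(rows, labels, j) for j in range(3)}
print(errors)  # {'x1': 0.5, 'x2': 0.5, 'x3': 0.25}
```

A greedy learner that minimizes classification error would compare these three numbers and pick the feature with the smallest one.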


1 point
2.Refer to the dataset presented in Question 1 to answer the following.

Fully train T1 (until each leaf has data points of the same output label). What is the depth of T1?




1 point
3.Refer to the dataset presented in Question 1 to answer the following.

What is the training error of T1?




1 point
4.Refer to the dataset presented in Question 1 to answer the following.

Now consider a tree T2, which splits on x1 at the root, and splits on x2 in the 1st level, and has leaves at the 2nd level. Note: this is the XOR function on features 1 and 2. What is the depth of T2?




1 point
5.Refer to the dataset presented in Question 1 to answer the following.

What is the training error of T2?




1 point
6.Refer to the dataset presented in Question 1 to answer the following.

Which has smaller depth, T1 or T2?




1 point
7.(True/False) When deciding to split a node, we find the best feature to split on that minimizes classification error.




1 point
8.If you are learning a decision tree, and you are at a node in which all of its data has the same y value, you should




3: False


1 point
8.Let’s say we have learned a decision tree on dataset D. Consider the split learned at the root of the decision tree. Which of the following is true if one of the data points in D is removed and we re-train the tree?




1 point
9.Consider two datasets D1 and D2, where D2 has the same data points as D1, but has an extra feature for each data point. Let T1 be the decision tree trained with D1, and T2 be the tree trained with D2. Which of the following is true?




1 point
10.(True/False) Logistic regression with polynomial degree 1 features will always have equal or lower training error than decision stumps (depth 1 decision trees).




1 point
11.(True/False) Decision trees (with depth > 1) are always linear classifiers.




1 point
11.(True/False) Decision stumps (depth 1 decision trees) are always linear classifiers.



Programming Assignment 1

Identifying safe loans with decision trees 10 min

Quiz: Identifying safe loans with decision trees 7 questions

QUIZ
Identifying safe loans with decision trees
7 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 3, 11:59 PM PDT

1 point
1.What percentage of the predictions on sample_validation_data did decision_tree_model get correct?




1 point
2.Which loan has the highest probability of being classified as a safe loan?




1 point
3.Notice that the probability predictions are the exact same for the 2nd and 3rd loans. Why would this happen?




1 point
4.Based on the visualized tree, what prediction would you make for this data point?




1 point
5.What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01 (e.g. 0.76)?




1 point
6.How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?




1 point
7.Let us assume that each mistake costs money:

Assume a cost of $10,000 per false negative.
Assume a cost of $20,000 per false positive.
What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.
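
The cost calculation in Question 7 can be sketched as below. The labels in the usage lines are hypothetical and are not the assignment's validation_data; the sketch also assumes +1 marks the positive class, so a false negative is a +1 example predicted as -1.

```python
def cost_of_mistakes(true_labels, predicted_labels,
                     cost_false_negative=10000, cost_false_positive=20000):
    """Total cost given per-mistake costs (defaults taken from Question 7)."""
    false_negatives = sum(
        1 for y, p in zip(true_labels, predicted_labels) if y == +1 and p == -1
    )
    false_positives = sum(
        1 for y, p in zip(true_labels, predicted_labels) if y == -1 and p == +1
    )
    return (cost_false_negative * false_negatives
            + cost_false_positive * false_positives)


# Hypothetical labels: 2 false negatives and 1 false positive.
true_labels = [+1, +1, -1, -1, +1]
predicted_labels = [+1, -1, +1, -1, -1]
print(cost_of_mistakes(true_labels, predicted_labels))  # 40000
```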



Programming Assignment 2

Implementing binary decision trees 10 min

Quiz: Implementing binary decision trees 7 questions

QUIZ
Implementing binary decision trees
7 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 3, 11:59 PM PDT

1 point
1.What was the feature that my_decision_tree first split on while making the prediction for test_data[0]?




1 point
2.What was the first feature that led to a right split of test_data[0]?




1 point
3.What was the last feature split on before reaching a leaf node for test_data[0]?




1 point
4.Rounded to 2nd decimal point (e.g. 0.76), what is the classification error of my_decision_tree on the test_data?




1 point
5.What is the feature that is used for the split at the root node?




1 point
6.What is the path of the first 3 feature splits considered along the left-most branch of my_decision_tree?




1 point
7.What is the path of the first 3 feature splits considered along the right-most branch of my_decision_tree?



Week 4

Preventing Overfitting in Decision Trees

Out of all machine learning techniques, decision trees are amongst the most prone to overfitting. No practical implementation is possible without including approaches that mitigate this challenge. In this module, through various visualizations and investigations, you will investigate why decision trees suffer from significant overfitting problems. Using the principle of Occam’s razor, you will mitigate overfitting by learning simpler trees. At first, you will design algorithms that stop the learning process before the decision trees become overly complex. In an optional segment, you will design a very practical approach that learns an overly-complex tree, and then simplifies it with pruning. Your implementation will investigate the effect of these techniques on mitigating overfitting on our real-world loan data set.

Overfitting in decision trees

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-overfitting-annotated.pdf

A review of overfitting 2 min
Overfitting in decision trees 5 min

Early stopping to avoid overfitting

Principle of Occam’s razor: Learning simpler decision trees 5 min
Early stopping in learning decision trees 6 min

(OPTIONAL LESSON) Pruning decision trees

(OPTIONAL) Motivating pruning 8 min
(OPTIONAL) Pruning decision trees to avoid overfitting 6 min
(OPTIONAL) Tree pruning algorithm 3 min

Summarizing preventing overfitting in decision trees

Recap of overfitting and regularization in decision trees 1 min
Quiz: Preventing Overfitting in Decision Trees 11 questions

QUIZ
Preventing Overfitting in Decision Trees
11 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 10, 11:59 PM PDT

1 point
1.(True/False) When learning decision trees, smaller depth USUALLY translates to lower training error.




1 point
2.(True/False) If no two data points have the same input values, we can always learn a decision tree that achieves 0 training error.




1 point
3.(True/False) If decision tree T1 has lower training error than decision tree T2, then T1 will always have better test error than T2.




1 point
4.Which of the following is true for decision trees?




1 point
5.Pruning and early stopping in decision trees are used to




1 point
6.Which of the following is NOT an early stopping method?




1 point
7.Consider decision tree T1 learned with minimum node size parameter = 1000. Now consider decision tree T2 trained on the same dataset and parameters, except that the minimum node size parameter is now 100. Which of the following is always true?




1 point
8.Questions 8 to 11 refer to the following common scenario:

Imagine we are training a decision tree, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. The data at this node is:

x1 x2 y
0 1 +1
1 0 +1
0 1 +1
1 1 -1

What is the classification error at this node (assuming a majority class classifier)?


1 point
9.Refer to the scenario presented in Question 8.

If we split on x1, what is the classification error?


1 point
10.Refer to the scenario presented in Question 8.

If we split on x2, what is the classification error?


1 point
11.Refer to the scenario presented in Question 8.

If our parameter for minimum gain in error reduction is 0.1, do we split or stop early?
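
The minimum-gain stopping rule from Questions 8 to 11 can be sketched as follows, using the node data from Question 8. The helper names are hypothetical, and whether the comparison with the threshold is strict or non-strict is an assumption of this sketch.

```python
from collections import Counter


def majority_error(labels):
    """Classification error of a majority-class prediction at a node."""
    mistakes = len(labels) - Counter(labels).most_common(1)[0][1]
    return mistakes / len(labels)


def error_after_split(rows, labels, feature_index):
    """Classification error after splitting on one binary feature."""
    mistakes = 0
    for value in (0, 1):
        side = [y for x, y in zip(rows, labels) if x[feature_index] == value]
        if side:
            mistakes += len(side) - Counter(side).most_common(1)[0][1]
    return mistakes / len(labels)


def should_stop(rows, labels, min_error_reduction):
    """Stop if even the best split does not reduce the error enough."""
    before = majority_error(labels)
    best_after = min(
        error_after_split(rows, labels, j) for j in range(len(rows[0]))
    )
    # Assumption: a gain exactly equal to the threshold also stops.
    return (before - best_after) <= min_error_reduction


# Node data from Question 8: columns are (x1, x2), labels are y.
rows = [(0, 1), (1, 0), (0, 1), (1, 1)]
labels = [+1, +1, +1, -1]

print(majority_error(labels))             # 0.25
print(error_after_split(rows, labels, 0)) # 0.25
print(error_after_split(rows, labels, 1)) # 0.25
print(should_stop(rows, labels, 0.1))     # True
```

Neither split improves on the majority-class error here, so the gain is 0 and falls below the 0.1 threshold.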



Programming Assignment

Decision Trees in Practice 10 min
Quiz: Decision Trees in Practice 14 questions

QUIZ
Decision Trees in Practice
14 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 10, 11:59 PM PDT

1 point
1.Given an intermediate node with 6 safe loans and 3 risky loans, if the min_node_size parameter is 10, what should the tree learning algorithm do next?




1 point
2.Assume an intermediate node has 6 safe loans and 3 risky loans. For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the minimum gain in error reduction parameter is set to 0.2, what should the tree learning algorithm do next?




1 point
3.Consider the prediction path validation_set[0] with my_decision_tree_old and my_decision_tree_new. For my_decision_tree_new trained with

max_depth = 6, min_node_size = 100, min_error_reduction=0.0

is the prediction path shorter, longer, or the same as the prediction path using my_decision_tree_old that ignored the early stopping conditions 2 and 3?




1 point
4.Consider the prediction path for ANY new data point. For my_decision_tree_new trained with

max_depth = 6, min_node_size = 100, min_error_reduction=0.0

is the prediction path for a data point always shorter, always longer, always the same, shorter or the same, or longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?




1 point
5.For a tree trained on any dataset using parameters

max_depth = 6, min_node_size = 100, min_error_reduction=0.0

what is the maximum possible number of splits encountered while making a single prediction?




1 point
6.Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?




1 point
7.Which tree has the smallest error on the validation data?




1 point
8.Does the tree with the smallest error in the training data also have the smallest error in the validation data?




1 point
9.Is it always true that the tree with the lowest classification error on the training set will result in the lowest classification error in the validation set?




1 point
10.Which tree has the largest complexity?




1 point
11.Is it always true that the most complex tree will result in the lowest classification error in the validation_set?




1 point
12.Using the complexity definition, which model (model_4, model_5, or model_6) has the largest complexity?




1 point
13.model_4 and model_5 have similar classification error on the validation set but model_5 has lower complexity. Should you pick model_5 over model_4?




1 point
14.Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use?



Handling Missing Data

Real-world machine learning problems are fraught with missing data. That is, very often, some of the inputs are not observed for all data points. This challenge is very significant, arises in most real datasets, and needs to be addressed carefully to obtain great performance. Yet this issue is rarely discussed in machine learning courses. In this module, you will tackle the missing data challenge head on. You will start with the two most basic techniques to convert a dataset with missing data into a clean dataset, namely skipping missing values and imputing missing values. In an advanced section, you will also design a modification of the decision tree learning algorithm that builds decisions about missing data right into the model. You will also explore these techniques in your real-data implementation.
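
The two basic strategies named above, skipping and imputing, might look like this on a tiny table. The data, the use of None as the missing marker, and imputing with the most common observed value are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumption: missing entries are represented by None


def skip_rows(rows):
    """Drop every data point that has at least one missing value."""
    return [row for row in rows if all(v is not MISSING for v in row)]


def skip_columns(rows):
    """Drop every feature that is missing for at least one data point."""
    keep = [j for j in range(len(rows[0]))
            if all(row[j] is not MISSING for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def impute_mode(rows):
    """Fill each missing value with the most common observed value in its
    column. Assumes every column has at least one observed value."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        mode = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[j] is MISSING:
                row[j] = mode
    return filled


# Hypothetical loan-like table: (grade, term in years, home ownership).
rows = [["A", 3, "own"], ["B", None, "rent"], [None, 5, "rent"], ["A", 3, "own"]]

print(skip_rows(rows))     # 2 of the 4 rows survive
print(skip_columns(rows))  # only the third column survives
print(impute_mode(rows))   # same shape as the input, no missing values
```

Skipping shrinks the dataset along one axis, while imputing keeps the original number of rows and columns at the price of filling in guessed values.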

Basic strategies for handling missing data

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-missing-values-annotated.pdf

Challenge of missing data 3 min
Strategy 1: Purification by skipping missing data 4 min
Strategy 2: Purification by imputing missing data 4 min

Strategy 3: Modify learning algorithm to explicitly handle missing data

Modifying decision trees to handle missing data 4 min
Feature split selection with missing data 5 min

Summarizing handling missing data

Recap of handling missing data 1 min
Quiz: Handling Missing Data 7 questions

QUIZ
Handling Missing Data
7 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 10, 11:59 PM PDT

1 point
1.(True/False) Skipping data points (i.e., skipping rows of the data) that have missing features only works when the learning algorithm we are using is decision tree learning.




1 point
2.What are potential downsides of skipping features with missing values (i.e., skipping columns of the data) to handle missing data?




1 point
3.(True/False) It’s always better to remove missing data points (i.e., rows) as opposed to removing missing features (i.e., columns).




1 point
4.Consider a dataset with N training points. After imputing missing values, the number of data points in the data set is




1 point
5.Consider a dataset with D features. After imputing missing values, the number of features in the data set is




1 point
6.Which of the following are always true when imputing missing data? Select all that apply.




1 point
7.Consider data that has binary features (i.e. the feature values are 0 or 1) with some feature values of some data points missing. When learning the best feature split at a node, how would we best modify the decision tree learning algorithm to handle data points with missing values for a feature?



Week 5 Boosting

One of the most exciting theoretical questions that have been asked about machine learning is whether simple classifiers can be combined into a highly accurate ensemble. This question led to the development of boosting, one of the most important and practical techniques in machine learning today. This simple approach can boost the accuracy of any classifier, and is widely used in practice, e.g., it’s used by more than half of the teams who win the Kaggle machine learning competitions. In this module, you will first define the ensemble classifier, where multiple models vote on the best prediction. You will then explore a boosting algorithm called AdaBoost, which provides a great approach for boosting classifiers. Through visualizations, you will become familiar with many of the practical aspects of this technique. You will create your very own implementation of AdaBoost, from scratch, and use it to boost the performance of your loan risk predictor on real data.

The amazing idea of boosting a classifier

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

boosting-annotated.pdf

The boosting question 3 min

Ensemble classifiers 5 min

Boosting 5 min

AdaBoost

AdaBoost overview 3 min

Weighted error 4 min

Computing coefficient of each ensemble component 4 min

Reweighing data to focus on mistakes 4 min

Normalizing weights 2 min

Applying AdaBoost

Example of AdaBoost in action 5 min

Learning boosted decision stumps with AdaBoost 4 min

Programming Assignment 1

Exploring Ensemble Methods 10 min

Quiz: Exploring Ensemble Methods 9 questions

QUIZ
Exploring Ensemble Methods
9 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 17, 11:59 PM PDT

1 point
1.What percentage of the predictions on sample_validation_data did model_5 get correct?




1 point
2.According to model_5, which loan is the least likely to be a safe loan?




1 point
3.What is the number of false positives on the validation data?




1 point
4.Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (model_5) as evaluated on the validation_set?




1 point
5.What grades are the top 5 loans?




1 point
6.Which model has the best accuracy on the validation_data?




1 point
7.Is it always true that the model with the most trees will perform best on the test/validation set?




1 point
8.Does the training error reduce as the number of trees increases?




1 point
9.Is it always true that the test/validation error will reduce as the number of trees increases?



Convergence and overfitting in boosting

The Boosting Theorem 3 min

Overfitting in boosting 5 min

Summarizing boosting

Ensemble methods, impact of boosting & quick recap 4 min

Quiz: Boosting 11 questions

QUIZ
Boosting
11 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 17, 11:59 PM PDT

1 point
1.Which of the following is NOT an ensemble method?




1 point
2.Each binary classifier in an ensemble makes predictions on an input x as listed in the table below. Based on the ensemble coefficients also listed in the table, what is the final ensemble model’s prediction for x?

Classifier      Coefficient w_t    Prediction for x
Classifier 1    0.61               +1
Classifier 2    0.53               -1
Classifier 3    0.88               -1
Classifier 4    0.34               +1
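
An ensemble of this kind predicts the sign of the coefficient-weighted sum of its members' votes. A minimal sketch using the values from the table (the function name and the tie-breaking toward -1 are assumptions):

```python
def ensemble_predict(coefficients, predictions):
    """Return the sign of the weighted vote of the ensemble members."""
    score = sum(w * p for w, p in zip(coefficients, predictions))
    return +1 if score > 0 else -1  # assumption: a tie (score == 0) maps to -1


coefficients = [0.61, 0.53, 0.88, 0.34]
predictions = [+1, -1, -1, +1]

# Weighted sum: 0.61 - 0.53 - 0.88 + 0.34 = -0.46
print(ensemble_predict(coefficients, predictions))  # -1
```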




1 point
3.(True/False) Boosted decision stumps is a linear classifier.




1 point
4.(True/False) For AdaBoost, test error is an appropriate criterion for choosing the optimal number of iterations.




1 point
5.In an iteration of AdaBoost, recall that the coefficient w_t for the learned weak learner f_t is calculated by

$$\displaystyle{\frac{1}{2}\ln{\left( \frac{1-\mathtt{weighted\_error}(f_t)}{\mathtt{weighted\_error}(f_t)} \right)}}$$
If the weighted error of f_t is equal to .25, what is the value of w_t? Round your answer to 2 decimal places.
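
The coefficient formula above translates directly into code. The function name is chosen for this sketch, and the input is assumed to lie strictly between 0 and 1.

```python
import math


def adaboost_coefficient(weighted_error):
    """w_t = 1/2 * ln((1 - weighted_error) / weighted_error)."""
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)


print(round(adaboost_coefficient(0.25), 2))  # 0.55
print(adaboost_coefficient(0.5))             # 0.0
```

A weak learner no better than chance on the weighted data (weighted error 0.5) gets coefficient 0, and the coefficient grows as the weighted error shrinks.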




1 point
6.Which of the following classifiers is most accurate as computed on a weighted dataset? A classifier with:




1 point
7.Imagine we are training a decision stump in an iteration of AdaBoost, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. Also included are the weights of the data. The data at this node is:

Weight x1 x2 y
0.3 0 1 +1
0.35 1 0 -1
0.1 0 1 +1
0.25 1 1 +1

Suppose we assign the same class label to all data in this node. (Pick the class label with the greater total weight.) What is the weighted error at the node? Round your answer to 2 decimal places.
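
The weighted error at a node is the total weight of the misclassified points divided by the total weight. A sketch using the weights and labels from the table in Question 7 (helper name hypothetical):

```python
def weighted_node_error(weights, labels):
    """Predict the class with greater total weight; return that class and
    the weighted error of the prediction."""
    weight_positive = sum(a for a, y in zip(weights, labels) if y == +1)
    weight_negative = sum(a for a, y in zip(weights, labels) if y == -1)
    predicted = +1 if weight_positive >= weight_negative else -1
    weight_of_mistakes = min(weight_positive, weight_negative)
    return predicted, weight_of_mistakes / (weight_positive + weight_negative)


weights = [0.3, 0.35, 0.1, 0.25]
labels = [+1, -1, +1, +1]

predicted, error = weighted_node_error(weights, labels)
print(predicted, round(error, 2))  # 1 0.35
```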




1 point
8.After each iteration of AdaBoost, the weights on the data points are typically normalized to sum to 1. This is used because




1 point
9.Consider the following 2D dataset with binary labels.

We train a series of weak binary classifiers using AdaBoost. In one iteration, the weak binary classifier produces the decision boundary as follows:

Which of the five points (indicated in the second figure) will receive higher weight in the following iteration? Choose all that apply.




1 point
10.Suppose we are running AdaBoost using decision tree stumps. At a particular iteration, the data points have weights according to the figure. (Large points indicate heavy weights.)

Which of the following decision tree stumps is most likely to be fit in the next iteration?




1 point
11.(True/False) AdaBoost achieves zero training error after a sufficient number of iterations, as long as we can find weak learners that perform better than random chance at each iteration of AdaBoost (i.e., on weighted data).



Programming Assignment 2

Boosting a decision stump 10 min

Quiz: Boosting a decision stump 5 questions

QUIZ
Boosting a decision stump
5 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 17, 11:59 PM PDT

1 point
1.Recall that the classification error for unweighted data is defined as follows:

$$\text{classification error} = \frac{\#\text{ mistakes}}{\#\text{ all data points}}$$
Meanwhile, the weight of mistakes for weighted data is given by

$$\mathrm{WM}(\alpha, \hat{y}) = \sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y}_i].$$
If we set the weights α=1 for all data points, how is the weight of mistakes WM(α,ŷ) related to the classification error?
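
As a worked step, substituting α_i = 1 for every data point into the definition of the weight of mistakes turns the sum into a plain count of mistakes, where n is the number of data points:

$$\mathrm{WM}(\mathbf{1}, \hat{y}) = \sum_{i=1}^{n} 1[y_i \neq \hat{y}_i] = \#\text{ mistakes} = n \times \text{classification error}$$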




1 point
2.Refer to section Example: Training a weighted decision tree.

Will you get the same model as small_data_decision_tree_subset_20 if you trained a decision tree with only 20 data points from the set of points in subset_20?




1 point
3.Refer to the 10-component ensemble of tree stumps trained with AdaBoost.

As each component is trained sequentially, are the component weights monotonically decreasing, monotonically increasing, or neither?




1 point
4.Which of the following best describes a general trend in accuracy as we add more and more components? Answer based on the 30 components learned so far.




1 point
5.From this plot (with 30 trees), is there massive overfitting as the # of iterations increases?



Week 6 Precision-Recall

In many real-world settings, accuracy or error are not the best quality metrics for classification. You will explore a case-study that significantly highlights this issue: using sentiment analysis to display positive reviews on a restaurant website. Instead of accuracy, you will define two metrics: precision and recall, which are widely used in real-world applications to measure the quality of classifiers. You will explore how the probabilities output by your classifier can be used to trade off precision against recall, and dive into this spectrum, using precision-recall curves. In your hands-on implementation, you will compute these metrics with your learned classifier on real-world sentiment analysis data.
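
The trade-off described above can be sketched by thresholding predicted probabilities at different levels. The probabilities and labels are hypothetical; treating a score equal to the threshold as positive, and reporting precision as 1.0 when nothing is predicted positive, are conventions assumed for this example.

```python
def predict_with_threshold(probabilities, threshold):
    """Predict +1 when P(y = +1 | x) is at least the threshold."""
    return [+1 if p >= threshold else -1 for p in probabilities]


def precision_and_recall(true_labels, predictions):
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for y, p in pairs if y == +1 and p == +1)
    fp = sum(1 for y, p in pairs if y == -1 and p == +1)
    fn = sum(1 for y, p in pairs if y == +1 and p == -1)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0     # assumed convention
    return precision, recall


# Hypothetical classifier outputs and true sentiment labels.
probabilities = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
true_labels = [+1, +1, -1, +1, -1, -1]

for threshold in (0.5, 0.9):
    predictions = predict_with_threshold(probabilities, threshold)
    print(threshold, precision_and_recall(true_labels, predictions))
```

On this toy data, raising the threshold from 0.5 to 0.9 lifts precision from 0.75 to 1.0 while recall drops from 1.0 to one third, which is the trade-off that a precision-recall curve traces out.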

Why use precision & recall as quality metrics

Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

precision-recall.pdf

Case-study where accuracy is not best metric for classification 3 min

What is good performance for a classifier? 3 min

Precision & recall explained

Precision: Fraction of positive predictions that are actually positive 5 min

Recall: Fraction of positive data predicted to be positive 3 min

The precision-recall tradeoff

Precision-recall extremes 2 min

Trading off precision and recall 4 min

Precision-recall curve 5 min

Summarizing precision-recall

Recap of precision-recall 1 min

Quiz: Precision-Recall 9 questions

QUIZ
Precision-Recall
9 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 1, 11:59 PM PDT

1 point
1.Questions 1 to 5 refer to the following scenario:

Suppose a binary classifier produced the following confusion matrix.

                  Predicted Positive    Predicted Negative
Actual Positive   5600                  40
Actual Negative   1900                  2460

What is the recall of this classifier? Round your answer to 2 decimal places.
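
Precision, recall and accuracy all follow from the four cells of the confusion matrix in this scenario. A minimal sketch (function name chosen for this example):

```python
def confusion_matrix_metrics(true_positives, false_negatives,
                             false_positives, true_negatives):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    total = true_positives + false_negatives + false_positives + true_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    accuracy = (true_positives + true_negatives) / total
    return precision, recall, accuracy


# Counts from the confusion matrix in Question 1.
precision, recall, accuracy = confusion_matrix_metrics(5600, 40, 1900, 2460)
print(round(precision, 2), round(recall, 2), round(accuracy, 3))  # 0.75 0.99 0.806
```

Recall divides by the actual positives (the first row of the matrix), while precision divides by the predicted positives (the first column).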




1 point
2.Refer to the scenario presented in Question 1 to answer the following:

(True/False) This classifier is better than random guessing.




1 point
3.Refer to the scenario presented in Question 1 to answer the following:

(True/False) This classifier is better than the majority class classifier.




1 point
4.Refer to the scenario presented in Question 1 to answer the following:

Which of the following points in the precision-recall space corresponds to this classifier?


(1)


(2)


(3)


(4)


(5)


1 point
5.Refer to the scenario presented in Question 1 to answer the following:

Which of the following best describes this classifier?


It is optimistic


It is pessimistic


None of the


1 point
6.Suppose we are fitting a logistic regression model on a dataset where the vast majority of the data points are labeled as positive. To compensate for overfitting to the dominant class, we should


Require higher confidence level for positive predictions


Require lower confidence level for positive predictions


1 point
7.It is often the case that false positives and false negatives incur different costs. In situations where false negatives cost much more than false positives, we should


Require higher confidence level for positive predictions


Require lower confidence level for positive predictions


1 point
8.We are interested in reducing the number of false negatives. Which of the following metrics should we primarily look at?


Accuracy


Precision


Recall


1 point
9.Suppose we set the threshold for positive predictions at 0.9. What is the lowest score that is classified as positive? Round your answer to 2 decimal places.



Programming Assignment

Exploring precision and recall 10 min

Quiz: Exploring precision and recall 13 questions

QUIZ
Exploring precision and recall
13 questions
To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 1, 11:59 PM PDT

1 point
1.Consider the logistic regression model trained on amazon_baby.gl using GraphLab Create.

Using accuracy as the evaluation metric, was our logistic regression model better than the majority class classifier?




1 point
2.How many predicted values in the test set are false positives?




1 point
3.Consider the scenario where each false positive costs $100 and each false negative $1.

Given the stipulation, what is the cost associated with the logistic regression classifier’s performance on the test set?


Between $0 and $100,000


Between $100,000 and $200,000


Between $200,000 and $300,000


Above $300,000


1 point
4.Out of all reviews in the test set that are predicted to be positive, what fraction of them are false positives? (Round to the second decimal place e.g. 0.25)




1 point
5.Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would:


Discard a sufficient number of positive predictions


Discard a sufficient number of negative predictions


Increase threshold for predicting the positive class (ŷ = +1)


Decrease threshold for predicting the positive class (ŷ = +1)


1 point
6.What fraction of the positive reviews in the test_set were correctly predicted as positive by the classifier? Round your answer to 2 decimal places.




1 point
7.What is the recall value for a classifier that predicts +1 for all data points in the test_data?




1 point
8.What happens to the number of positive predicted reviews as the threshold is increased from 0.5 to 0.9?


More reviews are predicted to be positive.


Fewer reviews are predicted to be positive.


1 point
9.Consider the metrics obtained from setting the threshold to 0.5 and to 0.9.

Does the recall increase with a higher threshold?




1 point
10.Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.




1 point
11.Using threshold = 0.98, how many false negatives do we get on the test_data? (Hint: You may use the graphlab.evaluation.confusion_matrix function implemented in GraphLab Create.)




1 point
12.Questions 13 and 14 are concerned with the reviews that contain the word baby.

Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better for the reviews of data in baby_reviews? Round your answer to 3 decimal places.




1 point
13.Questions 12 and 13 are concerned with the reviews that contain the word baby.

Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?


Larger


Smaller
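
The precision and recall quantities that this quiz keeps returning to can be sketched in a few lines of NumPy. This is only an illustration: the function name `precision_recall`, the toy labels and scores, the `>=` comparison against the threshold, and the convention of returning precision 1.0 when nothing is predicted positive are all assumptions, not the assignment's GraphLab Create code.

```python
import numpy as np

def precision_recall(y_true, probabilities, threshold=0.5):
    """Precision and recall when predicting +1 iff probability >= threshold.

    Labels are assumed to be +1 / -1. Returning precision 1.0 when there are
    no positive predictions is a convention chosen for this sketch.
    """
    y_true = np.asarray(y_true)
    y_pred = np.where(np.asarray(probabilities) >= threshold, 1, -1)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == -1))  # false positives
    fn = np.sum((y_pred == -1) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels and scores, made up for illustration only
y = [1, 1, 1, -1, -1, 1]
p = [0.95, 0.80, 0.60, 0.70, 0.30, 0.40]
for t in (0.5, 0.9):
    print(t, precision_recall(y, p, t))
```

On this toy data, raising the threshold leaves fewer positive predictions, which tends to trade recall for precision.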

Week 7 Scaling to Huge Datasets & Online Learning

With the advent of the internet, the growth of social media, and the embedding of sensors in the world, the magnitudes of data that our machine learning algorithms must handle have grown tremendously over the last decade. This effect is sometimes called “Big Data”. Thus, our learning algorithms must scale to bigger and bigger datasets. In this module, you will develop a small modification of gradient ascent called stochastic gradient, which provides significant speedups in the running time of our algorithms. This simple change can drastically improve scaling, but makes the algorithm less stable and harder to use in practice. In this module, you will investigate the practical techniques needed to make stochastic gradient viable, and thus to obtain learning algorithms that scale to huge datasets. You will also address a new kind of machine learning problem, online learning, where the data streams in over time, and we must learn the coefficients as the data arrives. This task can also be solved with stochastic gradient. You will implement your very own stochastic gradient ascent algorithm for logistic regression from scratch, and evaluate it on sentiment analysis data.
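
As a rough illustration of the idea described above, here is a minimal mini-batch stochastic gradient ascent loop for logistic regression in NumPy. It is a sketch under stated assumptions, not the assignment's solution: the function name `logistic_regression_sg_sketch`, the default hyperparameters, the per-pass shuffling, and the division of the gradient by the batch size are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_sg_sketch(X, y, step_size=0.1, batch_size=10,
                                  n_passes=5, seed=0):
    """Hypothetical helper. X: (N, D) features including an intercept column;
    y: labels in {+1, -1}. Returns the learned coefficient vector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        order = rng.permutation(n)  # shuffle the data before each pass
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Per-example error: 1[y_i = +1] - P(y = +1 | x_i, w)
            errors = indicator[batch] - sigmoid(X[batch] @ w)
            gradient = X[batch].T @ errors  # sum of gradients over the batch
            # Dividing by the batch size keeps the step scale comparable
            # across batch sizes (an assumption of this sketch)
            w += step_size * gradient / len(batch)
    return w
```

Setting `batch_size` to the number of rows makes each update use the full gradient, which is the batch gradient ascent this module compares against.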

Scaling ML to huge datasets

Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

online-learning-annotated.pdf

Gradient ascent won’t scale to today’s huge datasets (3 min)

Timeline of scalable machine learning & stochastic gradient (4 min)

Scaling ML with stochastic gradient

Why gradient ascent won’t scale (3 min)

Stochastic gradient: Learning one data point at a time (3 min)

Comparing gradient to stochastic gradient (3 min)

Understanding why stochastic gradient works

Why would stochastic gradient ever work? (4 min)

Convergence paths (2 min)

Stochastic gradient: Practical tricks

Shuffle data before running stochastic gradient (2 min)

Choosing step size (3 min)

Don’t trust last coefficients (1 min)

(OPTIONAL) Learning from batches of data (3 min)

(OPTIONAL) Measuring convergence (4 min)

(OPTIONAL) Adding regularization (3 min)

Online learning: Fitting models from streaming data

The online learning task (3 min)

Using stochastic gradient for online learning (3 min)

Summarizing scaling to huge datasets & online learning

Scaling to huge datasets through parallelization & module recap (1 min)

Quiz: Scaling to Huge Datasets & Online Learning (10 questions)

QUIZ
Scaling to Huge Datasets & Online Learning
10 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 8, 11:59 PM PDT

1 point
1.(True/False) Stochastic gradient ascent often requires fewer passes over the dataset than batch gradient ascent to achieve a similar log likelihood.




1 point
2.(True/False) Choosing a large batch size results in less noisy gradients




1 point
3.(True/False) The set of coefficients obtained at the last iteration represents the best coefficients found so far.




1 point
4.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.

Which of the following actions would help the most to improve the rate of convergence?


Increase step size


Decrease step size


Decrease batch size


1 point
5.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.

Which of the following actions would help to improve the rate of convergence?


Increase batch size


Increase step size


Decrease step size


1 point
6.Suppose it takes about 1 millisecond to compute a gradient for a single example. You run an online advertising company and would like to do online learning via mini-batch stochastic gradient ascent. If you aim to update the coefficients once every 5 minutes, how many examples can you cover in each update? Overhead and other operations take up 2 minutes, so you only have 3 minutes for the coefficient update.




1 point
7.In search for an optimal step size, you experiment with multiple step sizes and obtain the following convergence plot.

Which line corresponds to the best step size?


(1)


(2)


(3)


(4)


(5)


1 point
8.Suppose you run stochastic gradient ascent with two different batch sizes. Which of the two lines below corresponds to the smaller batch size (assuming both use the same step size)?


(1)


(2)


1 point
9.Which of the following is NOT a benefit of stochastic gradient ascent over batch gradient ascent? Choose all that apply.


Each coefficient step is very fast.


Log likelihood of data improves monotonically.


Stochastic gradient ascent can be used for online learning.


Stochastic gradient ascent can achieve higher likelihood than batch gradient ascent for the same amount of running time.


Stochastic gradient ascent is highly robust with respect to parameter choices.


1 point
10.Suppose we run the stochastic gradient ascent algorithm described in the lecture with batch size of 100. To make 10 passes over a dataset consisting of 15400 examples, how many iterations does it need to run?
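
The relationship between passes, batch size, and iterations used in this question can be written as a small helper. The function name `num_iterations` is hypothetical, and the sketch assumes the batch size divides the dataset size evenly, as it does here.

```python
def num_iterations(num_examples, batch_size, num_passes):
    """Number of coefficient updates needed for num_passes over the data,
    assuming batch_size divides num_examples evenly."""
    updates_per_pass = num_examples // batch_size
    return num_passes * updates_per_pass

print(num_iterations(15400, 100, 10))  # 154 updates per pass, 10 passes: 1540
```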



Programming Assignment

Training Logistic Regression via Stochastic Gradient Ascent (10 min)

Quiz: Training Logistic Regression via Stochastic Gradient Ascent (12 questions)

QUIZ
Training Logistic Regression via Stochastic Gradient Ascent
12 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 8, 11:59 PM PDT

1 point
1.In the Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does changing the solver to stochastic gradient ascent affect the number of features?


Increases


Decreases


Stays the same


1 point
2.Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by

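$$\ell\ell(\mathbf{w}) = \sum_{i=1}^{N} \Big( (\mathbf{1}[y_i = +1] - 1)\,\mathbf{w}^T h(\mathbf{x}_i) - \ln\big(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\big) \Big)$$
whereas the average log likelihood is given by

$$\ell\ell_A(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \Big( (\mathbf{1}[y_i = +1] - 1)\,\mathbf{w}^T h(\mathbf{x}_i) - \ln\big(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\big) \Big)$$
How are the functions ℓℓ(w) and ℓℓ_A(w) related?


$$\ell\ell_A(\mathbf{w}) = \ell\ell(\mathbf{w})$$


$$\ell\ell_A(\mathbf{w}) = \frac{1}{N} \cdot \ell\ell(\mathbf{w})$$


$$\ell\ell_A(\mathbf{w}) = N \cdot \ell\ell(\mathbf{w})$$


$$\ell\ell_A(\mathbf{w}) = \ell\ell(\mathbf{w}) - \|\mathbf{w}\|$$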
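
The two formulas in this question can be evaluated directly with NumPy. The sketch below is illustrative only: the function names are hypothetical, `X` stands for the matrix whose rows are h(x_i), and labels are assumed to be +1 / -1.

```python
import numpy as np

def log_likelihood(X, y, w):
    """Log likelihood of logistic regression, without the averaging term."""
    indicator = (y == 1).astype(float)
    scores = X @ w
    # np.logaddexp(0, -s) evaluates ln(1 + exp(-s)) in a numerically stable way
    return np.sum((indicator - 1.0) * scores - np.logaddexp(0.0, -scores))

def average_log_likelihood(X, y, w):
    """Same sum as log_likelihood, scaled by the number of data points."""
    return log_likelihood(X, y, w) / len(y)
```

Because the average version only rescales the sum by a constant, both functions are maximized by the same coefficients; the averaged form simply makes values comparable across different batch sizes.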


1 point
3.Refer to the sub-section Computing the gradient for a single data point.

The code block above computed

$$\frac{\partial \ell_i(\mathbf{w})}{\partial w_j}$$
for j = 1 and i = 10. Is this quantity a scalar or a 194-dimensional vector?


A scalar


A 194-dimensional vector


1 point
4.Refer to the sub-section Modifying the derivative for using a batch of data points.

The code block computed

$$\sum_{s=i}^{i+B} \frac{\partial \ell_s(\mathbf{w})}{\partial w_j}$$
for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?


A scalar


A 194-dimensional vector


1 point
5.For what value of B is the term

$$\sum_{s=1}^{B} \frac{\partial \ell_s(\mathbf{w})}{\partial w_j}$$
the same as the full gradient

$$\frac{\partial \ell(\mathbf{w})}{\partial w_j}$$
? A numeric answer is expected for this question. Hint: consider the training set we are using now.




1 point
6.For what value of batch size B above does the stochastic gradient ascent function logistic_regression_SG act as a standard gradient ascent algorithm? A numeric answer is expected for this question. Hint: consider the training set we are using now.




1 point
7.When you set batch_size = 1, as each iteration passes, how does the average log likelihood in the batch change?


Increases


Decreases


Fluctuates


1 point
8.When you set batch_size = len(feature_matrix_train), as each iteration passes, how does the average log likelihood in the batch change?


Increases


Decreases


Fluctuates


1 point
9.Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?




1 point
10.Refer to the section Stochastic gradient ascent vs gradient ascent.

In the first figure, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?


It’s always better


10 passes


20 passes


150 passes or more


1 point
11.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.

Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.


1e-2


1e-1


1e0


1e1


1e2


1 point
12.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.

Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.


1e-4


1e-2


1e0


1e1


1e2

Machine Learning: Clustering & Retrieval

Course can be found here
Lecture slides can be found here
Summary can be found in my Github

About this course: Case Studies: Finding Similar Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?

In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.

Learning Outcomes: By the end of this course, you will be able to:
  • Create a document retrieval system using k-nearest neighbors.
  • Identify various similarity metrics for text data.
  • Reduce computations in k-nearest neighbor search by using KD-trees.
  • Produce approximate nearest neighbors using locality sensitive hashing.
  • Compare and contrast supervised and unsupervised learning tasks.
  • Cluster documents by topic using k-means.
  • Describe how to parallelize k-means using MapReduce.
  • Examine probabilistic clustering approaches using mixture models.
  • Fit a mixture of Gaussians model using expectation maximization (EM).
  • Perform mixed membership modeling using latent Dirichlet allocation (LDA).
  • Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
  • Compare and contrast initialization techniques for non-convex optimization objectives.
  • Implement these techniques in Python.

Week 1 Welcome

Clustering and retrieval are some of the most high-impact machine learning tools out there. Retrieval is used in almost every application and device we interact with, such as providing a set of products related to one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but is a more broadly useful tool for automatically discovering structure in data, like uncovering groups of similar patients.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

What is this course about?

Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

Welcome and introduction to clustering and retrieval tasks (6 min)

Course overview (3 min)

Module-by-module topics covered (8 min)

Assumed background (6 min)

Software tools you’ll need for this course (10 min)

Github repository with starter code

In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:

https://github.com/learnml/machine-learning-specialization

A big week ahead! (10 min)

Week 2 Nearest Neighbor Search

We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
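
For reference, the naive nearest neighbor search mentioned above can be sketched as a brute-force scan over the corpus. The function name `k_nearest_neighbors`, the toy corpus, and the choice of Euclidean distance are assumptions for illustration; the assignments work with Wikipedia feature vectors and also consider other metrics.

```python
import numpy as np

def k_nearest_neighbors(query, corpus, k=3):
    """Brute-force k-NN: compute the distance from the query to every row,
    then keep the k smallest. Cost grows linearly with the corpus size."""
    distances = np.linalg.norm(corpus - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# Toy 2-D corpus, made up for illustration
corpus = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]])
indices, dists = k_nearest_neighbors(np.array([0.0, 0.0]), corpus, k=2)
print(indices, dists)
```

Every query touches every document, which is the computational burden that KD-trees and LSH are meant to reduce.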

Introduction to nearest neighbor search and algorithms

Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

retrieval-intro-annotated.pdf

Retrieval as k-nearest neighbor search (2 min)

1-NN algorithm (2 min)

k-NN algorithm (6 min)

The importance of data representations and distance metrics

Document representation (5 min)

Distance metrics: Euclidean and scaled Euclidean (6 min)

Writing (scaled) Euclidean distance using (weighted) inner products (4 min)

Distance metrics: Cosine similarity (9 min)

To normalize or not and other distance considerations (6 min)

Quiz: Representations and metrics (6 questions)

QUIZ
Representations and metrics
6 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.Consider three data points with two features as follows:

Among the three points, which two are closest to each other in terms of having the smallest Euclidean distance?


A and B


A and C


B and C


1 point
2.Consider three data points with two features as follows:

Among the three points, which two are closest to each other in terms of having the largest cosine similarity (or equivalently, smallest cosine distance)?


A and B


A and C


B and C


1 point
3.Consider the following two sentences.

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Compute the Euclidean distance using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.

Round your answer to 3 decimal places.




sum = 13


1 point
4.Consider the following two sentences.

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Recall that

$$\text{cosine distance} = 1 - \text{cosine similarity} = 1 - \frac{x^T y}{\|x\|\,\|y\|}$$
Compute the cosine distance between sentence 1 and sentence 2 using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.

Round your answer to 3 decimal places.
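
The word-count representation and the two distances used in the sentence questions above can be sketched with the standard library alone. The helper names `word_counts`, `euclidean_distance`, and `cosine_distance` are hypothetical, and the tokenization (lower-casing, stripping ASCII punctuation, splitting on whitespace) is this sketch's reading of the instructions.

```python
import math
import string
from collections import Counter

def word_counts(text):
    """Lower-case, strip punctuation, and count whitespace-separated tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

def euclidean_distance(a, b):
    vocab = set(a) | set(b)  # a Counter returns 0 for words it has not seen
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

s1 = word_counts("The quick brown fox jumps over the lazy dog.")
s2 = word_counts("A quick brown dog outpaces a quick fox.")
print(round(euclidean_distance(s1, s2), 3))  # sqrt(13), about 3.606
print(round(cosine_distance(s1, s2), 3))     # 1 - 5/sqrt(132), about 0.565
```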




1 point
5.(True/False) For positive features, cosine similarity is always between 0 and 1.




1 point
6.Which of the following does not describe the word count document representation? (Note: this is different from TF-IDF document representation.)


Ignores the order of the words


Assigns a high score to a frequently occurring word


Penalizes words that appear in every document

Programming Assignment 1

Choosing features and metrics for nearest neighbor search (10 min)

Quiz: Choosing features and metrics for nearest neighbor search (5 questions)

QUIZ
Choosing features and metrics for nearest neighbor search
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?




1 point
2.Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?


Between Obama and Biden


Between Obama and Bush


Between Biden and Bush


1 point
3.Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama’s page. Which of the following is NOT one of the 10 words?


the


presidential


in


act


his


1 point
4.Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?




1 point
5.Compute the Euclidean distance between TF-IDF features of Obama and Biden. Round your answer to 3 decimal places. Use American-style decimals (e.g. 110.921).


Scaling up k-NN search using KD-trees

Complexity of brute force search (1 min)

KD-tree representation (9 min)

NN search with KD-trees (7 min)

Complexity of NN search with KD-trees (5 min)

Visualizing scaling behavior of KD-trees (4 min)

Approximate k-NN search using KD-trees (7 min)

(OPTIONAL) A worked-out example for KD-trees (10 min)

Quiz: KD-trees (5 questions)

QUIZ
KD-trees
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.Which of the following is not true about KD-trees?


It divides the feature space into nested axis-aligned boxes.


It can be used only for approximate nearest neighbor search but not for exact nearest neighbor search.


It prunes parts of the feature space away from consideration by inspecting smallest possible distances that can be achieved.


The query time scales sublinearly with the number of data points and exponentially with the number of dimensions.


It works best in low to medium dimension settings.


1 point
2.Questions 2, 3, 4, and 5 involve training a KD-tree on the following dataset:

                 X1      X2
Data point 1   -1.58   -2.01
Data point 2    0.91    3.98
Data point 3   -0.73    4.00
Data point 4   -4.22    1.16
Data point 5    4.19   -2.02
Data point 6   -0.33    2.15

Train a KD-tree by hand as follows:

  • First split using X1 and then using X2. Alternate between X1 and X2 in order.
  • Use “middle-of-the-range” heuristic for each split. Take the maximum and minimum of the coordinates of the member points.
  • Keep subdividing until every leaf node contains two or fewer data points.

What is the split value used for the first split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).
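
The construction procedure listed above can be sketched as a short recursive function. This is an illustration under assumptions: the name `build_kdtree`, the dictionary-based node layout, and the tie rule (points strictly below the split go left, the rest go right) are choices made for this example and are not specified by the quiz.

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0, leaf_size=2):
    """Split recursively with the middle-of-the-range heuristic, alternating
    between features, until each leaf holds at most leaf_size points."""
    if indices is None:
        indices = np.arange(len(points))
    axis = depth % points.shape[1]
    values = points[indices, axis]
    # Stop at small nodes, or when the range is empty and no split is possible
    if len(indices) <= leaf_size or values.min() == values.max():
        return {"leaf": True, "indices": indices.tolist()}
    split = (values.min() + values.max()) / 2.0
    return {
        "leaf": False, "axis": axis, "split": split,
        "left": build_kdtree(points, indices[values < split], depth + 1, leaf_size),
        "right": build_kdtree(points, indices[values >= split], depth + 1, leaf_size),
    }

points = np.array([[-1.58, -2.01], [0.91, 3.98], [-0.73, 4.00],
                   [-4.22, 1.16], [4.19, -2.02], [-0.33, 2.15]])
tree = build_kdtree(points)
print(tree["axis"], tree["split"])  # split feature and value at the root
```

Indices in the tree are zero-based, so index 0 corresponds to "Data point 1" in the table.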




1 point
3.Refer to Question 2 for context.

What is the split value used for the second split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).




1 point
4.Refer to Question 2 for context.

Given a query point (-3, 1.5), which of the data points belong to the same leaf node as the query point? Choose all that apply.


Data point 1


Data point 2


Data point 3


Data point 4


Data point 5


Data point 6


1 point
5.Refer to Question 2 for context.

Perform backtracking with the query point (-3, 1.5) to perform exact nearest neighbor search. Which of the data points would be pruned from the search? Choose all that apply.

Hint: Assume that each node in the KD-tree remembers the tight bound on the coordinates of its member points, as follows:


Data point 1


Data point 2


Data point 3


Data point 4


Data point 5


Data point 6

Limitations of KD-trees (3 min)

LSH as an alternative to KD-trees (4 min)

Using random lines to partition points (5 min)

Defining more bins (3 min)

Searching neighboring bins (8 min)

LSH in higher dimensions (4 min)

(OPTIONAL) Improving efficiency through multiple tables (22 min)

Quiz: Locality Sensitive Hashing (5 questions)

QUIZ
Locality Sensitive Hashing
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.(True/False) Like KD-trees, Locality Sensitive Hashing lets us compute exact nearest neighbors while inspecting only a fraction of the data points in the training set.




1 point
2.(True/False) Given two data points with high cosine similarity, the probability that a randomly drawn line would separate the two points is small.




1 point
3.(True/False) The true nearest neighbor of the query is guaranteed to fall into the same bin as the query.




1 point
4.(True/False) Locality Sensitive Hashing is more efficient than KD-trees in high-dimensional settings.




1 point
5.Suppose you trained an LSH model and performed a lookup using the bin index of the query. You notice that the candidates returned are not at all similar to the query item. Which of the following changes would not produce a more relevant list of candidates?


Use multiple tables.


Increase the number of random lines/hyperplanes.


Inspect more neighboring bins to the bin containing the query.


Decrease the number of random lines/hyperplanes.
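
The random-hyperplane binning that these questions refer to can be sketched as follows. The function name `lsh_bin_indices`, the Gaussian hyperplanes through the origin, the `>= 0` sign test, and the most-significant-bit-first encoding are assumptions of this sketch rather than the assignment's exact implementation.

```python
import numpy as np

def lsh_bin_indices(data, num_bits=16, seed=0):
    """Hash each row to an integer bin: one bit per random hyperplane,
    set to 1 when the row lies on the non-negative side."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((data.shape[1], num_bits))
    bits = (data @ hyperplanes >= 0).astype(np.int64)
    # Interpret the bit vector as a binary number (first bit most significant)
    powers_of_two = 1 << np.arange(num_bits - 1, -1, -1)
    return bits @ powers_of_two, bits

# Toy rows: the first two point in nearly the same direction
data = np.array([[1.0, 2.0, 0.5], [1.1, 2.1, 0.4], [-3.0, 0.2, -1.0]])
bins, bits = lsh_bin_indices(data, num_bits=8)
print(bins)
print((bits[0] == bits[1]).sum(), "of 8 bits agree for the two similar rows")
```

More hyperplanes give more, smaller bins, so each bin holds fewer candidates; fewer hyperplanes give larger bins that are more likely to contain the true neighbor at the cost of more candidates to check.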

Programming Assignment 2

Implementing Locality Sensitive Hashing from scratch (10 min)

Quiz: Implementing Locality Sensitive Hashing from scratch (5 questions)

QUIZ
Implementing Locality Sensitive Hashing from scratch
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.What is the document ID of Barack Obama’s article?




1 point
2.Which bin contains Barack Obama’s article? Enter its integer index.




1 point
3.Examine the bit representations of the bins containing Barack Obama and Joe Biden. In how many places do they agree?


16 out of 16 places (Barack Obama and Joe Biden fall into the same bin)


14 out of 16 places


12 out of 16 places


10 out of 16 places


8 out of 16 places


1 point
4.Refer to the section “Effect of nearby bin search”. What was the smallest search radius that yielded the correct nearest neighbor for Obama, namely Joe Biden?




1 point
5.Suppose our goal was to produce 10 approximate nearest neighbors whose average distance from the query document is within 0.01 of the average for the true 10 nearest neighbors. For Barack Obama, the true 10 nearest neighbors are on average about 0.77. What was the smallest search radius for Barack Obama that produced an average distance of 0.78 or better?



A brief recap (2 min)

Week 3 Clustering with k-means

In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by “topic”. These topics are not provided in this unsupervised learning task; rather, the idea is to output such cluster labels that can be post-facto associated with known topics like “Science”, “World News”, etc. Even without such post-facto labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, which is the most widely used clustering algorithm out there. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterates of k-means can utilize this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.

Introduction to clustering

Slides presented in this module (10 min)

The goal of clustering (3 min)

An unsupervised task (6 min)

Hope for unsupervised learning, and some challenge cases (4 min)

Clustering via k-means

The k-means algorithm (7 min)

k-means as coordinate descent (6 min)

Smart initialization via k-means++ (4 min)

Assessing the quality and choosing the number of clusters (9 min)

Quiz: k-means (9 questions)

QUIZ
k-means
9 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 29, 11:59 PM PDT

1 point
1.(True/False) k-means always converges to a local optimum.




1 point
2.(True/False) The clustering objective is non-increasing throughout a run of k-means.




1 point
3.(True/False) Running k-means with a larger value of k always enables a lower possible final objective value than running k-means with smaller k.




1 point
4.(True/False) Any initialization of the centroids in k-means is just as good as any other.




1 point
5.(True/False) Initializing centroids using k-means++ guarantees convergence to a global optimum.




1 point
6.(True/False) Initializing centroids using k-means++ costs more than random initialization in the beginning, but can pay off eventually by speeding up convergence.




1 point
7.(True/False) Using k-means++ can only influence the number of iterations to convergence, not the quality of the final assignments (i.e., objective value at convergence).




4 points
8.Consider the following dataset:

                 X1      X2
Data point 1   -1.88    2.05
Data point 2   -0.71    0.42
Data point 3    2.41   -0.67
Data point 4    1.85   -3.80
Data point 5   -3.69   -1.33

Perform k-means with k=2 until the cluster assignment does not change between successive iterations. Use the following initialization for the centroids:

              X1      X2
Cluster 1    2.00    2.00
Cluster 2   -2.00   -2.00

Which of the five data points changed its cluster assignment most often during the k-means run?


Data point 1


Data point 2


Data point 3


Data point 4


Data point 5
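
The procedure this question asks for (alternate assignment and centroid updates until assignments stop changing) can be sketched in NumPy. The function name `kmeans`, the per-point change counter, and the handling of empty clusters (their centroid is left where it is) are assumptions for this illustration.

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm. Returns final assignments, centroids, and how many
    times each point switched clusters after its first assignment."""
    centroids = centroids.astype(float).copy()
    assignments = None
    changes = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignments = distances.argmin(axis=1)
        if assignments is not None:
            if np.array_equal(new_assignments, assignments):
                break  # converged: no point changed cluster
            changes += new_assignments != assignments
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = points[assignments == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return assignments, centroids, changes

points = np.array([[-1.88, 2.05], [-0.71, 0.42], [2.41, -0.67],
                   [1.85, -3.80], [-3.69, -1.33]])
initial_centroids = np.array([[2.00, 2.00], [-2.00, -2.00]])
assignments, centroids, changes = kmeans(points, initial_centroids)
print(assignments, changes)
```

Cluster indices are zero-based here, so index 0 corresponds to "Cluster 1" in the table.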


1 point
9.Suppose we initialize k-means with the following centroids

Which of the following best describes the cluster assignment in the first iteration of k-means?

Programming Assignment

Clustering text data with k-means (10 min)

Quiz: Clustering text data with K-means (8 questions)

QUIZ
Clustering text data with K-means
8 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 29, 11:59 PM PDT

1 point
1.Make sure you have the latest versions of the notebook and the file kmeans-arrays.npz. Read this post if

… you downloaded the files before September 10
… you created an Amazon EC2 instance before October 1


I acknowledge.
1 point
2.(True/False) The clustering objective (heterogeneity) is non-increasing for this example.




1 point
3.Let’s step back from this particular example. If the clustering objective (heterogeneity) would ever increase when running K-means, that would indicate: (choose one)


K-means algorithm got stuck in a bad local minimum


There is a bug in the K-means code


All data points consist of exact duplicates


Nothing is wrong. The objective should generally go down sooner or later.
1 point
4.Refer to the output of K-means for K=3 and seed=0. Which of the three clusters contains the greatest number of data points in the end?


Cluster #0


Cluster #1


Cluster #2
1 point
5.Another way to capture the effect of changing initialization is to look at the distribution of cluster assignments. Compute the size (# of member data points) of clusters for each of the multiple runs of K-means.

Look at the size of the largest cluster (most # of member data points) across multiple runs, with seeds 0, 20000, …, 120000. What is the maximum value this quantity takes?




1 point
6.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the greatest number of articles?


Cluster 0: artists, actors, film directors, playwrights


Cluster 4: professors, researchers, scholars


Cluster 5: Australian rules football players, American football players


Cluster 7: composers, songwriters, singers, music producers


Cluster 9: politicians


1 point
7.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the least number of articles?


Cluster 1: soccer (association football) players, rugby players


Cluster 3: baseball players


Cluster 6: female figures from various fields


Cluster 7: composers, songwriters, singers, music producers


Cluster 8: ice hockey players

1 point
8.Another sign of too large K is having lots of small clusters. Look at the distribution of cluster sizes (by number of member data points). How many of the 100 clusters have fewer than 236 articles, i.e. 0.4% of the dataset?


MapReduce for scaling k-means

Motivating MapReduce (8 min)

The general MapReduce abstraction (5 min)

MapReduce execution overview and combiners (6 min)

MapReduce for k-means (7 min)

Quiz: MapReduce for k-means (5 questions)

QUIZ
MapReduce for k-means
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 29, 11:59 PM PDT

1 point
1.Suppose we are operating on a 1D vector. Which of the following operations is not data parallel over the vector elements?


Add a constant to every element.


Multiply the vector by a constant.


Increment the vector by another vector of the same dimension.


Compute the average of the elements.


Compute the sign of each element.


1 point
2.(True/False) A single mapper call can emit multiple (key,value) pairs.




1 point
3.(True/False) More than one reducer can emit (key,value) pairs with the same key simultaneously.




1 point
4.(True/False) Suppose we are running k-means using MapReduce. Some mappers may be launched for a new k-means iteration even if some reducers from the previous iteration are still running.




1 point
5.Consider the following list of binary operations. Which can be used for the reduce step of MapReduce? Choose all that apply.

Hints: The reduce step requires a binary operator that satisfies both of the following conditions.

Commutative: OP(x1,x2)=OP(x2,x1)
Associative: OP(OP(x1,x2),x3)=OP(x1,OP(x2,x3))


OP1(x1,x2)=max(x1,x2)


OP2(x1,x2)=x1+x2−2


OP3(x1,x2)=3x1+2x2


OP4(x1,x2)=x1^2+x2


OP5(x1,x2)=(x1+x2)/2
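
The two conditions in the hint can be spot-checked numerically. The sketch below is only a sanity check on a handful of sample values: a failed check disproves a property, while a passed check is evidence rather than proof. The helper names are hypothetical, and OP4 is read here as x1 squared plus x2.

```python
import itertools

def is_commutative(op, samples, tol=1e-9):
    return all(abs(op(a, b) - op(b, a)) < tol
               for a, b in itertools.product(samples, repeat=2))

def is_associative(op, samples, tol=1e-9):
    return all(abs(op(op(a, b), c) - op(a, op(b, c))) < tol
               for a, b, c in itertools.product(samples, repeat=3))

operations = {
    "OP1": lambda x1, x2: max(x1, x2),
    "OP2": lambda x1, x2: x1 + x2 - 2,
    "OP3": lambda x1, x2: 3 * x1 + 2 * x2,
    "OP4": lambda x1, x2: x1 ** 2 + x2,  # assumed reading of the garbled formula
    "OP5": lambda x1, x2: (x1 + x2) / 2,
}
samples = [-2.0, -0.5, 0.0, 1.0, 3.0]
results = {name: (is_commutative(op, samples), is_associative(op, samples))
           for name, op in operations.items()}
for name, (commutative, associative) in results.items():
    print(name, "commutative:", commutative, "associative:", associative)
```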

Summarizing clustering with k-means

Other applications of clustering (7 min)

Week 4 Mixture Models

In k-means, observations are each hard-assigned to a single cluster, and these assignments are based just on the cluster centers, rather than also incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a “cluster” and (2) accounts for uncertainty in assignments of datapoints to clusters via “soft assignments”. You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation considered.
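
To make the soft-assignment idea concrete, here is a minimal sketch of one EM iteration for a one-dimensional mixture of Gaussians. Everything in it is an assumption for illustration: the function names, the restriction to 1-D data, the synthetic two-cluster sample, and the initial parameters. The assignment itself works with multivariate Gaussians.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_step(x, weights, means, variances):
    """One EM iteration. Also returns the log likelihood of the data under
    the parameters passed in (that is, before the update)."""
    # E-step: responsibility of cluster k for point i is proportional to
    # weight_k * N(x_i | mean_k, var_k)
    likelihoods = weights * gaussian_pdf(x[:, None], means, variances)
    responsibilities = likelihoods / likelihoods.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the soft counts
    soft_counts = responsibilities.sum(axis=0)
    new_weights = soft_counts / len(x)
    new_means = (responsibilities * x[:, None]).sum(axis=0) / soft_counts
    new_vars = (responsibilities * (x[:, None] - new_means) ** 2).sum(axis=0) / soft_counts
    log_likelihood = np.log(likelihoods.sum(axis=1)).sum()
    return new_weights, new_means, new_vars, log_likelihood

# Synthetic data from two well-separated clusters, made up for illustration
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(3.0, 1.0, 100)])
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])
history = []
for _ in range(20):
    weights, means, variances, ll = em_step(x, weights, means, variances)
    history.append(ll)
print(means, history[-1])
```

The recorded log likelihood should not decrease from one iteration to the next, which is a useful check when debugging an EM implementation.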

Motivating and setting the foundation for mixture models

Slides presented in this module10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

mixmodel-EM-annotated.pdf

Motiving probabilistic clustering models8 min

Aggregating over unknown classes in an image dataset6 min

Univariate Gaussian distributions2 min

Bivariate and multivariate Gaussians7 min

Mixtures of Gaussians for clustering

Mixture of Gaussians6 min

Interpreting the mixture of Gaussian terms5 min

Scaling mixtures of Gaussians for document clustering5 min

Expectation Maximization (EM) building blocks

Computing soft assignments from known cluster parameters7 min

(OPTIONAL) Responsibilities as Bayes’ rule5 min

Estimating cluster parameters from known cluster assignments6 min

Estimating cluster parameters from soft assignments8 min

The EM algorithm

EM iterates in equations and pictures (6 min)

Convergence, initialization, and overfitting of EM (9 min)

Relationship to k-means (3 min)

(OPTIONAL) A worked-out example for EM (10 min)

Quiz: EM for Gaussian mixtures (9 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 5, 11:59 PM PST

1 point
1.(True/False) While the EM algorithm maintains uncertainty about the cluster assignment for each observation via soft assignments, the model assumes that every observation comes from only one cluster.




1 point
2.(True/False) In high dimensions, the EM algorithm runs the risk of setting cluster variances to zero.




1 point
3.In the EM algorithm, what do the E step and M step represent, respectively?


Estimate cluster responsibilities, Maximize likelihood over parameters


Estimate likelihood over parameters, Maximize cluster responsibilities


Estimate number of parameters, Maximize likelihood over parameters


Estimate likelihood over parameters, Maximize number of parameters


1 point
4.Suppose we have data that come from a mixture of 6 Gaussians (i.e., that is the true data structure). Which model would we expect to have the highest log-likelihood after fitting via the EM algorithm?


A mixture of Gaussians with 2 component clusters


A mixture of Gaussians with 4 component clusters


A mixture of Gaussians with 6 component clusters


A mixture of Gaussians with 7 component clusters


A mixture of Gaussians with 10 component clusters
6


1 point
5.Which of the following correctly describes the differences between EM for mixtures of Gaussians and k-means? Choose all that apply.


k-means often gets stuck in a local minimum, while EM tends not to


EM is better at capturing clusters of different sizes and orientations


EM is better at capturing clusters with overlaps


EM is less prone to overfitting than k-means


k-means is equivalent to running EM with infinitesimally small diagonal covariances.
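
The last option, on the link between k-means and EM, can be explored with a small numerical sketch. It assumes a one-dimensional mixture with equal weights and a single shared variance, which is a simplification chosen for this illustration, and shows how responsibilities approach a hard, nearest-center assignment as that variance shrinks. The function names are made up.

```python
import math


def responsibilities(x, means, variance):
    """Responsibilities for a 1-D mixture with equal weights and a shared variance."""
    # Work with log-scores and subtract the maximum so tiny variances do not
    # turn every score into zero.
    log_scores = [-0.5 * (x - m) ** 2 / variance for m in means]
    top = max(log_scores)
    scores = [math.exp(s - top) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]


def nearest_center(x, means):
    """The k-means assignment: index of the closest center."""
    return min(range(len(means)), key=lambda k: (x - means[k]) ** 2)


if __name__ == "__main__":
    centers = [0.0, 4.0]
    for variance in (1.0, 0.1, 1e-4):
        print(variance, responsibilities(1.8, centers, variance))
    print("k-means assignment:", nearest_center(1.8, centers))
```

With a large variance the point is shared between both clusters; with a very small one, nearly all responsibility goes to the closest center.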


1 point
6.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

Cluster responsibilities   Cluster A   Cluster B   Cluster C
Data point 1               0.20        0.40        0.40
Data point 2               0.50        0.10        0.40
Data point 3               0.70        0.20        0.10

Which is the least probable cluster for data point 1?


Cluster A


Cluster B


Cluster C


1 point
7.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

Cluster responsibilities   Cluster A   Cluster B   Cluster C
Data point 1               0.20        0.40        0.40
Data point 2               0.50        0.10        0.40
Data point 3               0.70        0.20        0.10

Suppose also that the data points are as follows:

Dataset        X   Y   Z
Data point 1   3   1   2
Data point 2   0   0   3
Data point 3   1   3   7

Let us compute the new mean for Cluster A. What is the Z coordinate of the new mean? Round your answer to 3 decimal places.
(2*0.2 + 3*0.5 + 7*0.7)/(0.2 + 0.5 + 0.7) = 6.8/1.4 ≈ 4.857
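
The same responsibility-weighted average can be written as a short Python sketch. The data and responsibilities are copied from the two tables in Question 7; the helper name `weighted_mean` is made up for this illustration.

```python
# Responsibility of Cluster A for each data point (first column of the matrix).
resp_a = [0.20, 0.50, 0.70]

# Data points as (X, Y, Z) rows, copied from the quiz table.
points = [
    (3, 1, 2),
    (0, 0, 3),
    (1, 3, 7),
]


def weighted_mean(points, weights):
    """Responsibility-weighted average of the points, one value per coordinate."""
    total = sum(weights)
    dims = len(points[0])
    return [
        sum(w * p[d] for w, p in zip(weights, points)) / total
        for d in range(dims)
    ]


new_mean_a = weighted_mean(points, resp_a)
print([round(value, 3) for value in new_mean_a])
```

The third entry of `new_mean_a` is the Z coordinate asked for in the question.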




1 point
8.Which of the following contour plots describes a Gaussian distribution with diagonal covariance? Choose all that apply.


(1)


(2)


(3)


(4)


(5)


2 points
9.Suppose we initialize EM for mixtures of Gaussians (using full covariance matrices) with the following clusters:

Which of the following best describes the updated clusters after the first iteration of EM?


Summarizing mixture models

A brief recap (1 min)

Programming Assignment 1

Implementing EM for Gaussian mixtures (10 min)

Quiz: Implementing EM for Gaussian mixtures (6 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 5, 11:59 PM PST

1 point
1.What is the weight that EM assigns to the first component after running the above codeblock? Round your answer to 3 decimal places.




1 point
2.Using the same set of results, obtain the mean that EM assigns the second component. What is the mean in the first dimension? Round your answer to 3 decimal places.




1 point
3.Using the same set of results, obtain the covariance that EM assigns the third component. What is the variance in the first dimension? Round your answer to 3 decimal places.




1 point
4.Is the loglikelihood plot monotonically increasing, monotonically decreasing, or neither?


Monotonically increasing


Monotonically decreasing


Neither


1 point
5.Calculate the likelihood (score) of the first image in our data set (img[0]) under each Gaussian component through a call to multivariate_normal.pdf. Given these values, what cluster assignment should we make for this image?


Cluster 0


Cluster 1


Cluster 2


Cluster 3
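
The scoring step in Question 5 can be sketched without the assignment's data. The example below is a stand-in under stated assumptions: it uses a hand-written diagonal-covariance Gaussian log-density instead of the `multivariate_normal.pdf` call named in the question, and the component means, variances and the query point are invented values, not results from the assignment.

```python
import math


def diag_gaussian_logpdf(x, mean, variances):
    """Log-density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, variances)
    )


def most_likely_component(x, means, variances):
    """Return (index, scores): the component with the highest log-density at x."""
    scores = [diag_gaussian_logpdf(x, m, v) for m, v in zip(means, variances)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical four-component model over average (red, green, blue) intensities.
example_means = [(0.1, 0.1, 0.8), (0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.5, 0.5, 0.5)]
example_vars = [(0.01, 0.01, 0.01)] * 4
example_point = (0.75, 0.15, 0.12)

best_index, log_scores = most_likely_component(example_point, example_means, example_vars)
print("assigned cluster:", best_index)
```

This compares densities only, as the question describes. Weighting each density by its mixture weight would give the responsibilities instead.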


1 point
6.Four of the following images are not in the list of top 5 images in the first cluster. Choose these four.


Image 1


Image 2


Image 3


Image 4


Image 5


Image 6


Image 7

Programming Assignment 2

Clustering text data with Gaussian mixtures (10 min)

Quiz: Clustering text data with Gaussian mixtures (4 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 5, 11:59 PM PST

1 point
1.Select all the topics that have a cluster in the model created above.


Baseball


Basketball


Soccer/football


Music


Politics


Law


Finance


1 point
2.Try fitting EM with the random initial parameters you created above. What is the final loglikelihood that the algorithm converges to? Choose the range that contains this value.


Less than 2.2e9


Between 2.2e9 and 2.3e9


Between 2.3e9 and 2.4e9


Between 2.4e9 and 2.5e9


Greater than 2.5e9


1 point
3.Is the final loglikelihood larger or smaller than the final loglikelihood we obtained above when initializing EM with the results from running k-means?


Initializing EM with k-means led to a larger final loglikelihood


Initializing EM with k-means led to a smaller final loglikelihood


1 point
4.For the above model, out_random_init, use the visualize_EM_clusters method you created above. Are the clusters more or less interpretable than the ones found after initializing using k-means?


More interpretable


Less interpretable

Week 5 Mixed Membership Modeling via Latent Dirichlet Allocation

The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. But often our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model, one that is particularly useful in document analysis. You will interpret the output of LDA and explore ways that output can be used, such as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations.

Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.

Introduction to latent Dirichlet allocation

Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

LDA-annotated.pdf

Mixed membership models for documents (3 min)

An alternative document clustering model (4 min)

Components of latent Dirichlet allocation model (2 min)

Goal of LDA inference (5 min)

Quiz: Latent Dirichlet Allocation (5 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST

1 point
1.(True/False) According to the assumptions of LDA, each document in the corpus contains words about a single topic.




1 point
2.(True/False) Using LDA to analyze a set of documents is an example of a supervised learning task.




1 point
3.(True/False) When training an LDA model, changing the ordering of words in a document does not affect the overall joint probability.




1 point
4.(True/False) Suppose in a trained LDA model two documents have no topics in common (i.e., one document has 0 weight on any topic with non-zero weight in the other document). As a result, a single word in the vocabulary cannot have high probability of occurring in both documents.




1 point
5.(True/False) Topic models are guaranteed to produce weights on words that are coherent and easily interpretable by humans.



Bayesian inference via Gibbs sampling

The need for Bayesian inference (4 min)

Gibbs sampling from 10,000 feet (5 min)

A standard Gibbs sampler for LDA (9 min)

Collapsed Gibbs sampling for LDA

What is collapsed Gibbs sampling? (3 min)

A worked example for LDA: Initial setup (4 min)

A worked example for LDA: Deriving the resampling distribution (7 min)

Using the output of collapsed Gibbs sampling (4 min)

Summarizing latent Dirichlet allocation

A brief recap (1 min)

Quiz: Learning LDA model via Gibbs sampling (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST

1 point
1.(True/False) Each iteration of Gibbs sampling for Bayesian inference in topic models is guaranteed to yield a higher joint model probability than the previous sample.




1 point
2.(Check all that are true) Bayesian methods such as Gibbs sampling can be advantageous because they


Account for uncertainty over parameters when making predictions


Are faster than methods such as EM


Maximize the log probability of the data under the model


Regularize parameter estimates to avoid extreme values


1 point
3.For the standard LDA model discussed in the lectures, how many parameters are required to represent the distributions defining the topics?


[# unique words]


[# unique words] * [# topics]


[# documents] * [# unique words]


[# documents] * [# topics]


2 points
4.Suppose we have a collection of documents, and we are restricting our analysis to the following 10 words. We ran several iterations of collapsed Gibbs sampling for an LDA model with K=2 topics, alpha=10.0, and gamma=0.1 (with notation as in the collapsed Gibbs sampling lecture). The corpus-wide assignments at our most recent collapsed Gibbs iteration are summarized in the following table of counts:

Word       Count in topic 1   Count in topic 2
baseball   52                 0
homerun    15                 0
ticket     9                  2
price      9                  25
manager    20                 37
owner      17                 32
company    1                  23
stock      0                  75
bankrupt   0                  19
taxes      0                  29

We also have a single document i with the following topic assignments for each word:

topic   1          2         1        2       1
word    baseball   manager   ticket   price   owner

Suppose we want to re-compute the topic assignment for the word “manager”. To sample a new topic, we need to compute several terms to determine how much the document likes each topic, and how much each topic likes the word “manager”. The following questions will all relate to this situation.

First, using the notation in the slides, what is the value of $m_{\text{manager},1}$ (i.e., the number of times the word “manager” has been assigned to topic 1)?




1 point
5.Consider the situation described in Question 4.

What is the value of $\sum_w m_{w,1}$, where the sum is taken over all words in the vocabulary?




1 point
6.Consider the situation described in Question 4.

Following the notation in the slides, what is the value of $n_{i,1}$ for this document i (i.e., the number of words in document i assigned to topic 1)?




1 point
7.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $n_{i,2}$?




1 point
8.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $m_{\text{manager},2}$?




1 point
9.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $\sum_w m_{w,2}$?




2 points
10.Consider the situation described in Question 4.

As discussed in the slides, the unnormalized probability of assigning to topic 1 is

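$$p_1 = \frac{n_{i,1} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{\text{manager},1} + \gamma}{\sum_w m_{w,1} + V\gamma}$$
where V is the total size of the vocabulary.

Similarly the unnormalized probability of assigning to topic 2 is

$$p_2 = \frac{n_{i,2} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{\text{manager},2} + \gamma}{\sum_w m_{w,2} + V\gamma}$$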
Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.

(Reminder: Normalize across the two topic options so that the probabilities of all possible assignments—topic 1 and topic 2—sum to 1.)

Round your answer to 3 decimal places.




p1 = (3+10)/(4+2*10) * (20+0.1)/(123+10*0.1)
p2 = (1+10)/(4+2*10) * (36+0.1)/(241+10*0.1)
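
The resampling step walked through in Questions 4 to 10 can be sketched in Python as follows. The counts, the document, and the hyperparameters are copied from Question 4. The function name `resample_probabilities` and its arguments are made up for this illustration, and this is a sketch of one conditional computation, not a full Gibbs sampler.

```python
ALPHA, GAMMA, K = 10.0, 0.1, 2

# Corpus-wide counts of each word in (topic 1, topic 2), copied from the table.
word_topic_counts = {
    "baseball": [52, 0], "homerun": [15, 0], "ticket": [9, 2], "price": [9, 25],
    "manager": [20, 37], "owner": [17, 32], "company": [1, 23], "stock": [0, 75],
    "bankrupt": [0, 19], "taxes": [0, 29],
}

# Document i as (word, topic) pairs; topics are 1-based as in the quiz.
document = [("baseball", 1), ("manager", 2), ("ticket", 1), ("price", 2), ("owner", 1)]


def resample_probabilities(word, position):
    """Normalized topic probabilities for the word at `position` in the document."""
    counts = {w: list(c) for w, c in word_topic_counts.items()}  # leave the table intact
    doc_counts = [sum(1 for _, t in document if t == k + 1) for k in range(K)]

    # Remove the current assignment before computing the conditional.
    old_topic = document[position][1] - 1
    counts[word][old_topic] -= 1
    doc_counts[old_topic] -= 1

    vocab_size = len(counts)
    doc_len_minus_one = len(document) - 1
    unnormalized = []
    for k in range(K):
        topic_total = sum(c[k] for c in counts.values())
        doc_term = (doc_counts[k] + ALPHA) / (doc_len_minus_one + K * ALPHA)
        word_term = (counts[word][k] + GAMMA) / (topic_total + vocab_size * GAMMA)
        unnormalized.append(doc_term * word_term)
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


probs = resample_probabilities("manager", position=1)
print([round(p, 3) for p in probs])
```

The first entry of `probs` is the normalized probability of assigning “manager” to topic 1.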

Programming Assignment

Modeling text topics with Latent Dirichlet Allocation (10 min)

Quiz: Modeling text topics with Latent Dirichlet Allocation (12 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST

1 point
1.Identify the top 3 most probable words for the first topic.


institute


university


professor


research


studies


game


coach


1 point
2.What is the sum of the probabilities assigned to the top 50 words in the 3rd topic? Round your answer to 3 decimal places.




1 point
3.What is the topic most closely associated with the article about former US President George W. Bush? Use the average results from 100 topic predictions.




1 point
4.What are the top 3 topics corresponding to the article about English football (soccer) player Steven Gerrard? Use the average results from 100 topic predictions.


science and research


team sports


music, TV, and film


international athletics


Great Britain and Australia


1 point
5.Using the LDA representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the k-th nearest neighbor to Alex Rodriguez?




1 point
6.Using the TF-IDF representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the k-th nearest neighbor to Alex Rodriguez?




1 point
7.What was the value of alpha used to fit our original topic model?




1 point
8.What was the value of gamma used to fit our original topic model? Remember that GraphLab Create uses “beta” instead of “gamma” to refer to the hyperparameter that influences topic distributions over words.




1 point
9.How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the low alpha model? Use the average results from 100 topic predictions.




1 point
10.How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the high alpha model? Use the average results from 100 topic predictions.




1 point
11.For each topic of the low gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument.)




1 point
12.For each topic of the high gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument).
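
The quantity asked for in Questions 11 and 12 can be illustrated independently of GraphLab Create. The sketch below counts how many of a topic's most probable words are needed for their probabilities to reach a cutoff. The two topic distributions are invented for illustration, and the helper `words_to_reach` is a made-up name. How the library's `cdf_cutoff` argument treats the word that crosses the threshold may differ from this sketch, so check its documentation before comparing numbers.

```python
def words_to_reach(word_probs, cutoff=0.5):
    """Smallest number of top words whose probabilities sum to at least `cutoff`."""
    running = 0.0
    for count, p in enumerate(sorted(word_probs, reverse=True), start=1):
        running += p
        if running >= cutoff:
            return count
    return len(word_probs)


# Hypothetical topics: one concentrated on a few words, one spread evenly.
peaked_topic = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]
flat_topic = [0.125] * 8

counts = [words_to_reach(topic) for topic in (peaked_topic, flat_topic)]
average_words = sum(counts) / len(counts)
print(counts, average_words)
```

A topic with probability spread over many words needs a longer list to reach the cutoff than a topic concentrated on a few words.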



Week 6 Hierarchical Clustering & Closing Remarks

In the conclusion of the course, we will recap what we have covered. This includes both techniques specific to clustering and retrieval and foundational machine learning concepts that are more broadly useful.

We provide a quick tour of an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course.

We conclude with an overview of what’s in store for you in the rest of the specialization.

What we’ve learned

Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

closing-annotated.pdf

Module 1 recap (10 min)

Module 2 recap (3 min)

Module 3 recap (6 min)

Module 4 recap (7 min)

Hierarchical clustering and clustering for time series segmentation

Why hierarchical clustering? (2 min)

Divisive clustering (4 min)

Agglomerative clustering (2 min)

The dendrogram (4 min)

Agglomerative clustering details (7 min)

Hidden Markov models (9 min)

Programming Assignment

Modeling text data with a hierarchy of clusters (10 min)

Quiz: Modeling text data with a hierarchy of clusters (3 questions)

To pass: 33% or higher
Attempts: 3 every 8 hours
Deadline: November 19, 11:59 PM PST

1 point
1.Make sure you have the latest versions of the notebook. Read this post if

… you downloaded the notebook before September 10
… you created an Amazon EC2 instance before October 1

I acknowledge.

1 point
2.Which diagram best describes the hierarchy right after splitting the ice_hockey_football cluster?
football golf

1 point
3.Let us bipartition the clusters male_non_athletes and female_non_athletes. Which diagram best describes the resulting hierarchy of clusters for the non-athletes?

Note. The clusters for the athletes are not shown to save space.

Summary and what’s ahead in the specialization

What we didn’t cover (2 min)

Thank you! (1 min)