This Specialization from leading researchers at the University of Washington introduces you to the exciting, high-demand field of Machine Learning. Through a series of practical case studies, you will gain applied experience in major areas of Machine Learning including Prediction, Classification, Clustering, and Information Retrieval. You will learn to analyze large and complex datasets, create systems that adapt and improve over time, and build intelligent applications that can make predictions from data.
Machine Learning Foundations: A Case Study Approach
Course can be found here
Lecture slides can be found here
Notes can be found in my Github
About this course: Do you have data and wonder what it can tell you? Do you need a deeper understanding of the core ways in which machine learning can improve your business? Do you want to be able to converse with specialists about anything from regression and classification to deep learning and recommender systems?
In this course, you will get hands-on experience with machine learning from a series of practical case studies. At the end of the first course you will have studied
 how to predict house prices based on house-level features,
 analyze sentiment from user reviews,
 retrieve documents of interest,
 recommend products,
 and search for images.
Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains.
This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms. Together, these pieces form the machine learning pipeline, which you will use in developing intelligent applications.
Learning Outcomes: By the end of this course, you will be able to:
Identify potential applications of machine learning in practice.
Describe the core differences in analyses enabled by regression, classification, and clustering.
Select the appropriate machine learning task for a potential application.
Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
Represent your data as features to serve as input to machine learning models.
Assess the model quality in terms of relevant error metrics for each task.
Utilize a dataset to fit a model to analyze new data.
Build an end-to-end application that uses machine learning at its core.
Implement these techniques in Python.
Welcome to Machine Learning Foundations: A Case Study Approach! By joining this course, you’ve taken a first step in becoming a machine learning expert. You will learn a broad range of machine learning methods for deriving intelligence from data, and by the end of the course you will be able to implement actual intelligent applications. These applications will allow you to perform predictions, personalized recommendations and retrieval, and much more. If you continue with the subsequent courses in the Machine Learning specialization, you will delve deeper into the methods and algorithms, giving you the power to develop and deploy new machine learning services.
To begin, we recommend taking a few minutes to explore the course site. Review the material we’ll cover each week, and preview the assignments you’ll need to complete to pass the course. These assignments—one per Module 2 through 6—will walk you through Python implementations of intelligent applications for:
 Predicting house prices
 Analyzing the sentiment of product reviews
 Retrieving Wikipedia articles
 Recommending songs
 Classifying images with deep learning
Week 1 Welcome
Machine learning is everywhere, but is often operating behind the scenes.
This introduction to the specialization provides you with insights into the power of machine learning, and the multitude of intelligent applications you personally will be able to develop and deploy upon completion.
We also discuss who we are, how we got here, and our view of the future of intelligent applications.
Why you should learn machine learning with us
Important Update regarding the Machine Learning Specialization (10 min)
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here: intro.pdf
Welcome to this course and specialization (41 sec)
Who we are (5 min)
Machine learning is changing the world (3 min)
Why a case study approach? (7 min)
Specialization overview (6 min)
Who this specialization is for and what you will be able to do
How we got into ML (3 min)
Who is this specialization for? (4 min)
What you’ll be able to do (57 sec)
The capstone and an example intelligent application (6 min)
The future of intelligent applications (2 min)
Getting started with the tools for the course
Reading: Getting started with Python, IPython Notebook & GraphLab Create (10 min)
Reading: where should my files go? (10 min)
Getting started with Python and the IPython Notebook
Download the IPython Notebook used in this lesson to follow along (10 min)
Starting an IPython Notebook (5 min)
Creating variables in Python (7 min)
Conditional statements and loops in Python (8 min)
Creating functions and lambdas in Python (3 min)
Getting started with SFrames for data engineering and analysis
Download the IPython Notebook used in this lesson to follow along (10 min)
Starting GraphLab Create & loading an SFrame (4 min)
Canvas for data visualization (4 min)
Interacting with columns of an SFrame (4 min)
Using .apply() for data transformation (5 min)
Week 2 Regression: Predicting House Prices
This week you will build your first intelligent application that makes predictions from data.
We will explore this idea within the context of our first case study, predicting house prices, where you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…).
This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.
You will also examine how to analyze the performance of your predictive model and implement regression in practice using an IPython notebook.
Linear regression modeling
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here: regressionintroannotated.pdf
Predicting house prices: A case study in regression (1 min)
What is the goal and how might you naively address it? (3 min)
Linear Regression: A Model-Based Approach (5 min)
Adding higher order effects (4 min)
Evaluating regression models
Evaluating overfitting via training/test split (6 min)
Training/test curves (4 min)
Adding other features (2 min)
Other regression examples (3 min)
Summary of regression
Regression ML block diagram (5 min)
Quiz: Regression (9 questions)
Predicting house prices: IPython Notebook
Download the IPython Notebook used in this lesson to follow along (10 min)
Loading & exploring house sale data (7 min)
Splitting the data into training and test sets (2 min)
Learning a simple regression model to predict house prices from house size (3 min)
Evaluating error (RMSE) of the simple model (2 min)
Visualizing predictions of simple model with Matplotlib (4 min)
Inspecting the model coefficients learned (1 min)
Exploring other features of the data (6 min)
Learning a model to predict house prices from more features (3 min)
Applying learned models to predict price of an average house (5 min)
Applying learned models to predict price of two fancy houses (7 min)
Programming assignment
Reading: Predicting house prices assignment (10 min)
Quiz: Predicting house prices (3 questions)
Week 3 Classification: Analyzing Sentiment
How do you guess whether a person felt positively or negatively about an experience, just from a short review they wrote?
In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including
 ad targeting,
 spam detection,
 medical diagnosis,
 image classification.
You will analyze the accuracy of your classifier, implement an actual classifier in an IPython notebook, and take a first stab at a core piece of the intelligent application you will build and deploy in your capstone.
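As a rough illustration of what analyzing a classifier's accuracy can involve, the Python sketch below tallies true/false positives and negatives and compares accuracy with a majority-class baseline. The labels are invented for the example, and the helper names (`confusion_counts`, `accuracy`, `majority_baseline`) are hypothetical rather than taken from the course notebooks.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions that match the true label."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def majority_baseline(y_true):
    """Accuracy obtained by always guessing the most common class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Hypothetical labels: +1 = positive review, -1 = negative review.
y_true = [1, 1, 1, -1, -1, 1, -1, 1]
y_pred = [1, -1, 1, -1, 1, 1, -1, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)                     # tp=4, fp=1, tn=2, fn=1
print(accuracy(counts))           # 0.75
print(majority_baseline(y_true))  # 0.625
```

An accuracy is only "good" relative to a baseline: here the classifier's 0.75 beats the 0.625 obtained by always predicting the majority class.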
Classification modeling
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here: classificationannotated.pdf
Analyzing the sentiment of reviews: A case study in classification (38 sec)
What is an intelligent restaurant review system? (4 min)
Examples of classification tasks (4 min)
Linear classifiers (5 min)
Decision boundaries (3 min)
Evaluating classification models
Training and evaluating a classifier (4 min)
What’s a good accuracy? (3 min)
False positives, false negatives, and confusion matrices (6 min)
Learning curves (5 min)
Class probabilities (1 min)
Summary of classification
Classification ML block diagram (3 min)
Quiz: Classification (7 questions)
Analyzing sentiment: IPython Notebook
Download the IPython Notebook used in this lesson to follow along (10 min)
Loading & exploring product review data (2 min)
Creating the word count vector (2 min)
Exploring the most popular product (4 min)
Defining which reviews have positive or negative sentiment (4 min)
Training a sentiment classifier (3 min)
Evaluating a classifier & the ROC curve (4 min)
Applying model to find most positive & negative reviews for a product (4 min)
Exploring the most positive & negative aspects of a product (4 min)
Programming assignment
Reading: Analyzing product sentiment assignment (10 min)
Quiz: Analyzing product sentiment (11 questions)
Week 4 Clustering and Similarity: Retrieving Documents
A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? How do I automatically search over documents to find the one that is most similar? How do I quantitatively represent the documents in the first place?
In this third case study, retrieving documents, you will examine various document representations and an algorithm to retrieve the most similar subset. You will also consider structured representations of the documents that automatically group articles by similarity (e.g., document topic).
You will actually build an intelligent document retrieval system for Wikipedia entries in an IPython notebook.
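To make the retrieval idea concrete, here is a small Python sketch that represents documents as tf-idf vectors and retrieves the nearest neighbor by cosine similarity. It is only one possible setup: the three toy documents are invented, the idf variant used (log of the number of documents divided by the number of documents containing the word) is one common choice and may differ from the formula used in the course, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Map each document name to a {word: tf-idf weight} dictionary."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())  # each document counts once per word
    n_docs = len(docs)
    return {
        name: {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in word_counts.items()}
        for name, word_counts in counts.items()
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor(query, vectors):
    """Name of the document most similar to `query`, excluding itself."""
    others = (name for name in vectors if name != query)
    return max(others, key=lambda name: cosine_similarity(vectors[query], vectors[name]))


# Hypothetical toy corpus.
docs = {
    "soccer": "the striker scored a goal in the soccer match",
    "football": "the team scored a late goal in the football match",
    "cooking": "the chef cooked a pasta dish in the kitchen",
}
vectors = tf_idf(docs)
print(nearest_neighbor("soccer", vectors))  # football
```

Words that occur in every document ("the", "a", "in") receive weight zero, so similarity is driven by the rarer, more informative words.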
Algorithms for retrieval and measuring similarity of documents
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here: clusteringintroannotated.pdf
Document retrieval: A case study in clustering and measuring similarity (35 sec)
What is the document retrieval task? (1 min)
Word count representation for measuring similarity (6 min)
Prioritizing important words with tf-idf (3 min)
Calculating tf-idf vectors (5 min)
Retrieving similar documents using nearest neighbor search (2 min)
Clustering models and algorithms
Clustering documents task overview (2 min)
Clustering documents: An unsupervised learning task (4 min)
k-means: A clustering algorithm (3 min)
Other examples of clustering (6 min)
Summary of clustering and similarity
Clustering and similarity ML block diagram (7 min)
Quiz: Clustering and Similarity (6 questions)
Document retrieval: IPython Notebook
Download the IPython Notebook used in this lesson to follow along (10 min)
Loading & exploring Wikipedia data (5 min)
Exploring word counts (5 min)
Computing & exploring TF-IDFs (7 min)
Computing distances between Wikipedia articles (5 min)
Building & exploring a nearest neighbors model for Wikipedia articles (3 min)
Examples of document retrieval in action (4 min)
Programming assignment
Reading: Retrieving Wikipedia articles assignment (10 min)
Quiz: Retrieving Wikipedia articles (9 questions)
Machine Learning: Regression
Course can be found here
Summary can be found in my Github
About this course: Case Study - Predicting Housing Prices
In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.
In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data – such as outliers – on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets.
Learning Outcomes: By the end of this course, you will be able to:
Describe the input and output of a regression model.
Compare and contrast bias and variance when modeling data.
Estimate model parameters using optimization algorithms.
Tune parameters with cross validation.
Analyze the performance of the model.
Describe the notion of sparsity and how LASSO leads to sparse solutions.
Deploy methods to select between models.
Exploit the model to form predictions.
Build a regression model to predict prices using a housing dataset.
Implement these techniques in Python.
Week 1
Welcome
Regression is one of the most important and broadly used machine learning and statistics tools out there. It allows you to make predictions from data by learning the relationship between features of your data and some observed, continuous-valued response. Regression is used in a massive number of applications ranging from predicting stock prices to understanding gene regulatory networks.
This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.
What is this course about?
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
Welcome! (1 min)
What is the course about? (3 min)
Outlining the first half of the course (5 min)
Outlining the second half of the course (5 min)
Assumed background (4 min)
Reading: Software tools you’ll need (10 min)
Simple Linear Regression
Our course starts from the most basic regression model: Just fitting a line to data. This simple model for forming predictions from a single, univariate feature of the data is appropriately called “simple linear regression”.
In this module, we describe the high-level regression task and then specialize these concepts to the simple linear regression case. You will learn how to formulate a simple regression model and fit the model to data using both a closed-form solution as well as an iterative optimization algorithm called gradient descent. Based on this fitted function, you will interpret the estimated model parameters and form predictions. You will also analyze the sensitivity of your fit to outlying observations.
You will examine all of these concepts in the context of a case study of predicting house prices from the square feet of the house.
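The two fitting approaches described above can be sketched in plain Python as follows. This is a minimal illustration, not the course's reference implementation: the data are invented (house size in thousands of square feet, price in units of $100k), and the function names, step size, and tolerance are assumptions chosen so the example converges.

```python
import math


def fit_closed_form(x, y):
    """Least squares intercept and slope from the closed-form solution."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


def fit_gradient_descent(x, y, step_size=0.01, tolerance=1e-8, max_iters=100000):
    """Minimize the residual sum of squares (RSS) by gradient descent."""
    w0 = w1 = 0.0
    for _ in range(max_iters):
        errors = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]
        grad_w0 = -2.0 * sum(errors)                               # dRSS/dw0
        grad_w1 = -2.0 * sum(e * xi for e, xi in zip(errors, x))   # dRSS/dw1
        if math.sqrt(grad_w0 ** 2 + grad_w1 ** 2) < tolerance:
            break  # converged: gradient magnitude is close to zero
        w0 -= step_size * grad_w0
        w1 -= step_size * grad_w1
    return w0, w1


# Hypothetical data: size in 1000s of sq ft, price in $100k.
sqft = [1.0, 1.5, 2.0, 2.5, 3.0]
price = [2.1, 2.9, 4.2, 4.8, 6.0]

intercept, slope = fit_closed_form(sqft, price)
gd_intercept, gd_slope = fit_gradient_descent(sqft, price)
print(intercept, slope)        # approximately 0.12 and 1.94
print(gd_intercept, gd_slope)  # should agree closely with the closed form
```

The slope is read as the predicted change in price per unit change in size. The step size matters: too large a value makes gradient descent diverge, while too small a value makes it slow.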
Regression fundamentals
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week1_simpleregressionannotated.pdf
A case study in predicting house prices (1 min)
Regression fundamentals: data & model (8 min)
Regression fundamentals: the task (2 min)
Regression ML block diagram (4 min)
The simple linear regression model, its use, and interpretation
The simple linear regression model (2 min)
The cost of using a given line (6 min)
Using the fitted line (6 min)
Interpreting the fitted line (6 min)
An aside on optimization: one dimensional objectives
Defining our least squares optimization objective (3 min)
Finding maxima or minima analytically (7 min)
Maximizing a 1d function: a worked example (2 min)
Finding the max via hill climbing (6 min)
Finding the min via hill descent (3 min)
Choosing stepsize and convergence criteria (6 min)
An aside on optimization: multidimensional objectives
Gradients: derivatives in multiple dimensions (5 min)
Gradient descent: multidimensional hill descent (6 min)
Finding the least squares line
Computing the gradient of RSS (7 min)
Approach 1: closed-form solution (5 min)
Optional reading: worked-out example for closed-form solution (10 min)
Approach 2: gradient descent (7 min)
Optional reading: worked-out example for gradient descent (10 min)
Comparing the approaches (1 min)
Discussion and summary of simple linear regression
Download notebooks to follow along (10 min)
Influence of high leverage points: exploring the data (4 min)
Influence of high leverage points: removing Center City (7 min)
Influence of high leverage points: removing high-end towns (3 min)
Asymmetric cost functions (3 min)
A brief recap (1 min)
Quiz: Simple Linear Regression (7 questions)
Programming assignment
Reading: Fitting a simple linear regression model on housing data (10 min)
Quiz: Fitting a simple linear regression model on housing data (4 questions)
Week 2 Multiple Regression
The next step in moving beyond simple linear regression is to consider “multiple regression” where multiple features of the data are used to form predictions.
More specifically, in this module, you will learn how to build models of more complex relationships between a single variable (e.g., ‘square feet’) and the observed response (like ‘house sales price’). This includes things like fitting a polynomial to your data, or capturing seasonal changes in the response value. You will also learn how to incorporate multiple input variables (e.g., ‘square feet’, ‘# bedrooms’, ‘# bathrooms’). You will then be able to describe how all of these models can still be cast within the linear regression framework, but now using multiple “features”. Within this multiple regression framework, you will fit models to data, interpret estimated coefficients, and form predictions.
Here, you will also implement a gradient descent algorithm for fitting a multiple regression model.
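One way to picture the "multiple features" framing is to treat each feature as a function of the inputs and fit one weight per feature by gradient descent on the residual sum of squares. The sketch below assumes a tiny synthetic dataset generated from known weights, and the names (`fit_multiple_regression`, `features`) and hyperparameters are illustrative choices, not the course's code.

```python
def fit_multiple_regression(features, inputs, outputs, step_size=0.01, n_iters=5000):
    """Fit one weight per feature function by gradient descent on RSS."""
    # Feature matrix H: one row per observation, one column per feature.
    H = [[h(x) for h in features] for x in inputs]
    weights = [0.0] * len(features)
    for _ in range(n_iters):
        # Errors use the weights from the start of the iteration, so every
        # weight is updated from the same gradient.
        errors = [y - sum(w * hij for w, hij in zip(weights, row))
                  for row, y in zip(H, outputs)]
        for j in range(len(weights)):
            partial = -2.0 * sum(e * row[j] for e, row in zip(errors, H))
            weights[j] -= step_size * partial
    return weights


# Each feature is a function of the raw inputs x = (size, bedrooms).
features = [
    lambda x: 1.0,    # constant feature (intercept)
    lambda x: x[0],   # first input, e.g. size
    lambda x: x[1],   # second input, e.g. number of bedrooms
]

# Synthetic, noise-free data generated as y = 1 + 2 * x[0] + 3 * x[1].
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
outputs = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0]

weights = fit_multiple_regression(features, inputs, outputs)
print(weights)  # close to [1.0, 2.0, 3.0]
```

Appending a function such as `lambda x: x[0] ** 2` to `features` would add a polynomial term while the model stays linear in the weights, although a smaller step size or feature scaling may then be needed for convergence.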
Multiple features of one input
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week2_multipleregressionannotated.pdf
Multiple regression intro (30 sec)
Polynomial regression (3 min)
Modeling seasonality (8 min)
Where we see seasonality (3 min)
Regression with general features of 1 input (2 min)
Incorporating multiple inputs
Motivating the use of multiple inputs (4 min)
Defining notation (3 min)
Regression with features of multiple inputs (3 min)
Interpreting the multiple regression fit (7 min)
Setting the stage for computing the least squares fit
Optional reading: review of matrix algebra (10 min)
This section involves some use of matrix algebra. If you’d like to brush up on it, we recommend a short tutorial.
Rewriting the single observation model in vector notation (6 min)
Rewriting the model for all observations in matrix notation (4 min)
Computing the cost of a D-dimensional curve (9 min)
Computing the least squares D-dimensional curve
Computing the gradient of RSS (3 min)
Approach 1: closed-form solution (3 min)
Discussing the closed-form solution (4 min)
Approach 2: gradient descent (2 min)
Feature-by-feature update (9 min)
Algorithmic summary of gradient descent approach (4 min)
Summarizing multiple regression
A brief recap (1 min)
Quiz: Multiple Regression (9 questions)
Programming assignment 1
Reading: Exploring different multiple regression models for house price prediction (10 min)
Quiz: Exploring different multiple regression models for house price prediction (8 questions)
Programming assignment 2
Numpy tutorial (10 min)
More information on Numpy, beyond this tutorial, can be found in the Numpy getting started guide.
Reading: Implementing gradient descent for multiple regression (10 min)
Quiz: Implementing gradient descent for multiple regression (5 questions)
Week 3 Assessing Performance
Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best performing.
This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the “loss” of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.
The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.
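A small Python sketch may help fix the difference between training and test error. It compares two models of different complexity, a constant predictor and a least squares line, on a random training/test split. The data are synthetic (a noisy line generated with a fixed seed), the 16/4 split is arbitrary, and all names are hypothetical. Squared error is used as the loss and summarized as RMSE.

```python
import math
import random


def fit_constant(points):
    """Simplest model: always predict the mean training output."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Least squares line fitted to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, points):
    """Root mean squared error of `model` on a set of (x, y) pairs."""
    return math.sqrt(sum((y - model(x)) ** 2 for x, y in points) / len(points))


# Synthetic data: y = 2x + 1 plus Gaussian noise, then a random 16/4 split.
rng = random.Random(0)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in range(1, 21)]
rng.shuffle(data)
train, test = data[:16], data[16:]

constant_model = fit_constant(train)
line_model = fit_line(train)
for name, model in [("constant", constant_model), ("line", line_model)]:
    print(name, "train RMSE:", rmse(model, train), "test RMSE:", rmse(model, test))
```

Training error can only decrease as the model family grows (the line never fits the training set worse than the constant), which is why it is an overly optimistic estimate of generalization error. The held-out test set gives the error that can actually be computed as a stand-in.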
Defining how we assess performance
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week3_assessingperformanceannotated.pdf
Assessing performance intro (32 sec)
What do we mean by “loss”? (4 min)
3 measures of loss and their trends with model complexity
Training error: assessing loss on the training set (7 min)
Generalization error: what we really want (8 min)
Test error: what we can actually compute (4 min)
Defining overfitting (2 min)
Training/test split (1 min)
3 sources of error and the bias-variance tradeoff
Irreducible error and bias (6 min)
Variance and the bias-variance tradeoff (6 min)
Error vs. amount of data (6 min)
OPTIONAL ADVANCED MATERIAL: Formally defining and deriving the 3 sources of error
Formally defining the 3 sources of error (14 min)
Formally deriving why 3 sources of error (20 min)
Putting the pieces together
Training/validation/test split for model selection, fitting, and assessment (7 min)
A brief recap (1 min)
Quiz: Assessing Performance (13 questions)
Programming assignment
Reading: Exploring the bias-variance tradeoff (10 min)
Quiz: Exploring the bias-variance tradeoff (4 questions)
Week 4 Ridge Regression
You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called “ridge regression”. You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and a gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called “cross validation”.
You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.
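For intuition, the sketch below works through the simplest special case: a single feature with no intercept, where the ridge objective (residual sum of squares plus an L2 penalty times the squared weight) has a one-line closed-form minimizer. It then picks the penalty by k-fold cross validation. The data, the candidate penalties, and the function names are all assumptions made for this illustration, and the course itself fits the general multi-feature case.

```python
def fit_ridge_1d(x, y, l2_penalty):
    """Minimize sum((y - w*x)^2) + l2_penalty * w^2 for a single weight w."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + l2_penalty)


def k_fold_validation_error(x, y, l2_penalty, k):
    """Average squared validation error over k contiguous folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        start, end = fold * n // k, (fold + 1) * n // k
        # Train on everything outside [start, end), validate on the fold.
        w = fit_ridge_1d(x[:start] + x[end:], y[:start] + y[end:], l2_penalty)
        total += sum((yi - w * xi) ** 2 for xi, yi in zip(x[start:end], y[start:end]))
    return total / n


# Hypothetical data, roughly y = x with a little noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_penalty = min(candidates, key=lambda lam: k_fold_validation_error(x, y, lam, k=3))
print("selected L2 penalty:", best_penalty)
print("weights:", [fit_ridge_1d(x, y, lam) for lam in candidates])
```

A penalty of zero recovers ordinary least squares, and larger penalties shrink the weight toward zero, which is the bias away from overfitted functions described above. In practice the data should be shuffled before splitting into folds if their order is not random.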
Characteristics of overfit models
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week4_ridgeregressionannotated.pdf
Symptoms of overfitting in polynomial regression (2 min)
Download the notebook and follow along (10 min)
Next, we will see a demo illustrating the concept of overfitting. We recommend you download the IPython Notebook used in the demo to follow along. (The second and third parts of this notebook will be used to demonstrate ridge regression and LASSO, two techniques to address overfitting.)
IPython Notebook:
Overfitting_Demo_Ridge_Lasso.ipynb.zip
Overfitting demo (7 min)
Overfitting for more general multiple regression models (3 min)
The ridge objective
Balancing fit and magnitude of coefficients (7 min)
The resulting ridge objective and its extreme solutions (5 min)
How ridge regression balances bias and variance (1 min)
Download the notebook and follow along (10 min)
Ridge regression demo (9 min)
The ridge coefficient path (4 min)
Optimizing the ridge objective
Computing the gradient of the ridge objective (5 min)
Approach 1: closed-form solution (6 min)
Discussing the closed-form solution (5 min)
Approach 2: gradient descent (9 min)
Tying up the loose ends
Selecting tuning parameters via cross validation (3 min)
K-fold cross validation (5 min)
How to handle the intercept (6 min)
A brief recap (1 min)
Quiz: Ridge Regression (9 questions)
Programming Assignment 1
Reading: Observing effects of L2 penalty in polynomial regression (10 min)
Quiz: Observing effects of L2 penalty in polynomial regression (7 questions)
Programming Assignment 2
Reading: Implementing ridge regression via gradient descent (10 min)
Quiz: Implementing ridge regression via gradient descent (8 questions)
Week 5 Feature Selection & Lasso
A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.
To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.
Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.
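A compact Python sketch of coordinate descent for the lasso is given below, under stated assumptions: the objective is taken to be the residual sum of squares plus an L1 penalty times the sum of absolute weights, every weight is penalized (a real implementation would typically leave the intercept unpenalized), and a fixed number of sweeps stands in for a proper convergence check. The toy data and function names are invented for the example.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero, returning exactly zero inside [-threshold, threshold]."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0


def lasso_coordinate_descent(feature_rows, outputs, l1_penalty, n_sweeps=100):
    """Minimize RSS + l1_penalty * sum(|w_j|) one coordinate at a time."""
    n_features = len(feature_rows[0])
    weights = [0.0] * n_features
    # z[j] = sum of squared values of feature j (1 if features are normalized).
    z = [sum(row[j] ** 2 for row in feature_rows) for j in range(n_features)]
    for _ in range(n_sweeps):
        for j in range(n_features):
            # rho: correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, y in zip(feature_rows, outputs):
                prediction = sum(w * h for w, h in zip(weights, row))
                rho += row[j] * (y - prediction + weights[j] * row[j])
            weights[j] = soft_threshold(rho, l1_penalty / 2.0) / z[j]
    return weights


# Toy data: the first feature matters a lot, the second only slightly.
feature_rows = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
outputs = [3.1, 2.9, 3.1, 2.9]

unpenalized = lasso_coordinate_descent(feature_rows, outputs, l1_penalty=0.0)
sparse = lasso_coordinate_descent(feature_rows, outputs, l1_penalty=2.0)
print(unpenalized)  # close to [3.0, 0.1]
print(sparse)       # close to [2.75, 0.0]: the weak feature is dropped
```

The soft-thresholding step is what sets small coefficients to exactly zero, which is how the lasso performs feature selection implicitly, whereas ridge regression only shrinks them.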
Feature selection via explicit model enumeration
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week5_lassoregressionannotated.pdf
The feature selection task (3 min)
All subsets (6 min)
Complexity of all subsets (3 min)
Greedy algorithms (7 min)
Complexity of the greedy forward stepwise algorithm (2 min)
Feature selection implicitly via regularized regression
Can we use regularization for feature selection? (3 min)
Thresholding ridge coefficients? (4 min)
The lasso objective and its coefficient path (7 min)
Geometric intuition for sparsity of lasso solutions
Visualizing the ridge cost (7 min)
Visualizing the ridge solution (6 min)
Visualizing the lasso cost and solution (7 min)
Download the notebook and follow along (10 min)
Lasso demo (5 min)
Setting the stage for solving the lasso
What makes the lasso objective different (3 min)
Coordinate descent (5 min)
Normalizing features (3 min)
Coordinate descent for least squares regression (normalized features) (8 min)
Optimizing the lasso objective
Coordinate descent for lasso (normalized features) (5 min)
Assessing convergence and other lasso solvers (2 min)
Coordinate descent for lasso (unnormalized features) (1 min)
OPTIONAL ADVANCED MATERIAL: Deriving the lasso coordinate descent update
Deriving the lasso coordinate descent update (19 min)
Tying up loose ends
Choosing the penalty strength and other practical issues with lasso (5 min)
A brief recap (3 min)
Quiz: Feature Selection and Lasso (7 questions)
Programming Assignment 1
Reading: Using LASSO to select features (10 min)
Quiz: Using LASSO to select features (6 questions)
Programming Assignment 2
Reading: Implementing LASSO using coordinate descent (10 min)
Quiz: Implementing LASSO using coordinate descent (8 questions)
Week 6
Nearest Neighbors & Kernel Regression
Up to this point, we have focused on methods that fit parametric functions—like polynomials and hyperplanes—to the entire dataset. In this module, we instead turn our attention to a class of “nonparametric” methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.
We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.
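The contrast between the two nonparametric methods can be sketched in a few lines of Python. This assumes a single input feature, absolute difference as the distance, and a Gaussian kernel of the form exp(-distance² / bandwidth). The kernel form, the bandwidth, the toy data, and the function names are illustrative assumptions rather than the course's exact choices.

```python
import math


def knn_regression(train, query, k):
    """Average the outputs of the k training points closest to the query."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)


def gaussian_kernel(distance, bandwidth):
    """Weight that decays smoothly as the distance to the query grows."""
    return math.exp(-(distance ** 2) / bandwidth)


def kernel_regression(train, query, bandwidth):
    """Weighted average of ALL training outputs, weighted by similarity."""
    weights = [gaussian_kernel(abs(x - query), bandwidth) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)


# Hypothetical (size, price) pairs; the last point is far from the rest.
train = [(1.0, 150.0), (2.0, 200.0), (3.0, 260.0), (4.0, 310.0), (10.0, 900.0)]

print(knn_regression(train, query=2.4, k=1))               # 200.0
print(knn_regression(train, query=2.4, k=3))               # mean of 200, 260, 150
print(kernel_regression(train, query=2.4, bandwidth=1.0))  # dominated by nearby points
```

With k-nearest neighbors, points outside the k closest have no influence at all, while kernel regression lets every observation contribute with a weight that falls off with distance. A very small bandwidth can make every weight underflow to zero, so a production version would need to guard against dividing by zero.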
Motivating local fits
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
week6_NNkernelregressionannotated.pdf
Limitations of parametric regression (3 min)
Nearest neighbor regression
1-Nearest neighbor regression approach (8 min)
Distance metrics (4 min)
1-Nearest neighbor algorithm (3 min)
k-Nearest neighbors and weighted k-nearest neighbors
k-Nearest neighbors regression (7 min)
k-Nearest neighbors in practice (3 min)
Weighted k-nearest neighbors (4 min)
Kernel regression
From weighted k-NN to kernel regression (6 min)
Global fits of parametric models vs. local fits of kernel regression (6 min)
k-NN and kernel regression wrap-up
Performance of NN as amount of data grows (7 min)
Issues with high-dimensions, data scarcity, and computational complexity (3 min)
k-NN for classification (1 min)
A brief recap (1 min)
Quiz: Nearest Neighbors & Kernel Regression (7 questions)
Programming Assignment
Reading: Predicting house prices using k-nearest neighbors regression (10 min)
Quiz: Predicting house prices using k-nearest neighbors regression (8 questions)
Closing Remarks
In the conclusion of the course, we will recap what we have covered. This represents both techniques specific to regression, as well as foundational machine learning concepts that will appear throughout the specialization. We also briefly discuss some important regression techniques we did not cover in this course.
We conclude with an overview of what’s in store for you in the rest of the specialization.
What we’ve learned
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
Simple and multiple regression (4 min)
Assessing performance and ridge regression (7 min)
Feature selection, lasso, and nearest neighbor regression (4 min)
Summary and what’s ahead in the specialization
What we covered and what we didn’t cover (5 min)
Thank you! (1 min)
Machine Learning: Classification
Course can be found here
Lecture slides can be found here
Summary can be found in my Github
About this course: Case Studies: Analyzing Sentiment & Loan Default Prediction
In our case study on analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). In our second case study for this course, loan default prediction, you will tackle financial data, and predict when a loan is likely to be risky or safe for the bank. These tasks are examples of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification.
In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent. You will implement these techniques on real-world, large-scale machine learning tasks. You will also address significant tasks you will face in real-world applications of ML, including handling missing data and measuring precision and recall to evaluate a classifier. This course is hands-on, action-packed, and full of visualizations and illustrations of how these techniques will behave on real data. We’ve also included optional content in every module, covering advanced topics for those who want to go even deeper!
Learning Objectives: By the end of this course, you will be able to:
Describe the input and output of a classification model.
Tackle both binary and multiclass classification problems.
Implement a logistic regression model for large-scale classification.
Create a nonlinear model using decision trees.
Improve the performance of any model using boosting.
Scale your methods with stochastic gradient ascent.
Describe the underlying decision boundaries.
Build a classification model to predict sentiment in a product review dataset.
Analyze financial data to predict loan defaults.
Use techniques for handling missing data.
Evaluate your models using precision-recall metrics.
Implement these techniques in Python (or in the language of your choice, though Python is highly recommended).
Week 1
Welcome !
Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x. Through this course, you will become familiar with the fundamental models and algorithms used in classification, as well as a number of core machine learning concepts. Rather than covering all aspects of classification, you will focus on a few core techniques, which are widely used in the real world to get state-of-the-art performance. By following our hands-on approach, you will implement your own algorithms on multiple real-world tasks, and deeply grasp the core techniques needed to be successful with these approaches in practice. This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.
Welcome to the course
Important Update regarding the Machine Learning Specialization 10 min
Hello Machine Learning learners,
Please know that due to unforeseen circumstances, courses 5 and 6 (Recommender Systems & Dimensionality Reduction and An Intelligent Application with Deep Learning) will not be launching as part of the Machine Learning Specialization. We understand this may come as very disappointing news and we’re deeply sorry for this inconvenience. If you have paid for these courses or have received financial aid from Coursera, you will remain eligible to earn your Specialization Certificate upon successfully completing courses 1-4 of the Specialization. If you paid for courses 5 & 6 via a pre-payment toward the Specialization, Coursera has provided you with free access to two other courses offered by the University of Washington: Computational Neuroscience and Data Manipulation at Scale: Systems and Algorithms. An email has been sent out with specific instructions on how to enroll in these courses. If you individually paid for either x or y course, you will receive a refund within the next two weeks.
If you have any questions or would like to request a refund, please feel free to contact Coursera’s 24/7 learner support team via the Request a Refund article in the Learner Help Center. The last day to request a refund will be April 30, 2017. We value you as a Coursera learner and want to ensure that your experience with the Machine Learning Specialization remains a positive one.
Regards,
The Coursera Team
Slides presented in this module 10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
Welcome to the classification course, a part of the Machine Learning Specialization 1 min
What is this course about? 6 min
https://www.coursera.org/learn/mlclassification/lecture/qZhKx/whatisthiscourseabout
Impact of classification 1 min
https://www.coursera.org/learn/mlclassification/lecture/OnpWH/impactofclassification
Course overview and details
Course overview 3 min
https://www.coursera.org/learn/mlclassification/lecture/84fuF/courseoverview
Outline of first half of course 5 min
https://www.coursera.org/learn/mlclassification/lecture/LyubT/outlineoffirsthalfofcourse
Outline of second half of course 5 min
https://www.coursera.org/learn/mlclassification/lecture/z1g9k/outlineofsecondhalfofcourse
Assumed background 3 min
https://www.coursera.org/learn/mlclassification/lecture/IindM/assumedbackground
Let’s get started! 45 sec
https://www.coursera.org/learn/mlclassification/lecture/AktDn/letsgetstarted
Reading: Software tools you’ll need 10 min
Software tools you’ll need for this course
How this specialization was designed. The learning approach in this specialization is to start from use cases and then dig into algorithms and methods, what we call a case-studies approach. We are very excited about this approach, since it has worked well in several other courses. The first course, Machine Learning: Foundations, was focused on understanding how ML can be used in various case studies. The second course, Machine Learning: Regression, was focused on models that predict a continuous value from input features. The follow-on courses will dig into more details of algorithms and methods of other ML areas. We expect all learners to have taken the first and second courses before taking this course.
Classification - A Machine Learning Approach. This course focuses on classification, one of the most important types of data analysis, with a wide range of applications. After successfully completing this course, you will be able to use classification methods in practice, implement some of the most fundamental algorithms in this area, and choose the right model for your task. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent.
Programming assignment format
Almost every module will be associated with one or two programming assignments. The goal of these assignments is to give you hands-on experience with the techniques we discuss in lectures. To test your implementations, you will be asked questions in a quiz following the assignment.
You will be implementing core classification techniques or other ML concepts from scratch in most modules. In a few modules, you will also explore fundamental ML concepts, such as regularization or precision-recall, using existing implementations of ML algorithms, with the goal of gaining proficiency in the ML concepts.
Why Python
In this course, we are going to use the Python programming language to build several intelligent applications that use machine learning. Python is a simple scripting language that makes it easy to interact with data. Furthermore, Python has a wide range of packages that make it easy to get started and build applications, from the simplest ones to the most complex. Python is widely used in industry, and is becoming the de facto language for data science in industry. (R is another alternative language. However, R tends to be significantly less scalable and has very few deployment tools, thus it is seldom used for production code in industry. It is possible, but discouraged, to use R in this specialization.)
We will also encourage the use of the IPython Notebook in our assignments. The IPython Notebook is a simple interactive environment for programming with Python, which makes it really easy to share your results. Think about it as a combination of a Python terminal and a wiki page. Thus, you can combine code, plots and text to explain what you did. (You are not required to use IPython Notebook in the assignments, and should have no problem using straight up Python if you prefer.)
Useful software tools
Although you will be implementing algorithms from scratch in various assignments, some software tools will be useful in the process. In particular, there are four types of data tools that would be helpful:
 Data manipulation: to help you slice-and-dice the data, create new features, and clean the data.
 Matrix operations: in the inner loops of your algorithms, you will do various matrix operations, and libraries focused on these will speed up your code significantly.
 Plotting library: so you can visualize data and models.
 Pre-implemented ML algorithms: in some assignments where we are focusing on exploring ML classification models, you will use pre-implemented ML algorithms to help focus your efforts on the fundamentals.
1.Tools for data manipulation
For data manipulation, we recommend using SFrame, an open-source, highly-scalable Python library for data manipulation. An alternative is the Pandas library. A huge advantage of SFrame over Pandas is that with SFrame, you are not limited to datasets that fit in memory, which allows you to deal with large datasets, even on a laptop. (The SFrame API is very similar to Pandas’ API. Here is a doc showing the relationship between the two of them.)
2.Tools for matrix operation
For matrix operations, we strongly recommend Numpy, an open-source Python library that provides fast performance, for data that fits in memory.
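To illustrate why a matrix library helps in the inner loops, here is a minimal sketch that computes the same weighted sums twice: once with explicit Python loops and once with a single NumPy matrix-vector product. The data, the coefficient values, and the function names are hypothetical and are not taken from the course assignments.

```python
import numpy as np

# Hypothetical toy data: 4 data points with 3 features each
# (the first column plays the role of a constant feature).
feature_matrix = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 3.0, 2.0],
    [1.0, 1.0, 1.0],
])
coefficients = np.array([0.5, 1.0, -2.0])


def scores_with_loops(features, coeffs):
    """Compute one weighted sum per row using explicit Python loops."""
    result = []
    for row in features:
        total = 0.0
        for value, coeff in zip(row, coeffs):
            total += value * coeff
        result.append(total)
    return result


def scores_vectorized(features, coeffs):
    """Compute the same weighted sums with one matrix-vector product."""
    return np.dot(features, coeffs)


loop_scores = scores_with_loops(feature_matrix, coefficients)
fast_scores = scores_vectorized(feature_matrix, coefficients)
print(loop_scores)
print(fast_scores)
```

Both versions return the same numbers. The vectorized call pushes the loop into compiled code, which is where the speedup on larger datasets typically comes from.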
3.Tools for plotting
For plotting, we strongly recommend you use Matplotlib, an open-source Python library with extensive plotting functionality.
4.Tools with pre-implemented ML algorithms
For the few assignments where you will be using pre-implemented ML algorithms, we recommend you use GraphLab Create, which we used in the first and second courses. It is a package we have been working on for many years now, and it has seen an exciting adoption curve, especially in industry with folks building real applications. A popular alternative is to use scikit-learn. GraphLab Create is more scalable than scikit-learn and simpler to use when your data is not numeric vectors. On the other hand, scikit-learn is open-source.
In this course, most of the assignments are about implementing algorithms from scratch, so this choice is more flexible than in the first course. We are happy, however, for you to use any tool(s) of your liking. As you will notice, we are only grading the output of your programs, so the specific software tool is not the focus of the course. More details on using other tools are at the end of this doc.
It’s important to emphasize that this specialization is not about providing training for a specific software package. The goal of the specialization is for your effort to be spent on learning the fundamental concepts and algorithms behind machine learning in a handson fashion. These concepts transcend any single package. What you learn here you can use whether you write code from scratch, use any existing ML packages out there, or any that may be developed in the future. We are happy to hear that so many of you are enjoying this approach so far!
5.Licenses for SFrame & GraphLab Create
The SFrame package is available in open-source under a permissive BSD license. So, you will always be able to use SFrames for free. GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. The reason we suggest you use GraphLab Create for this course is that this software will make it much easier for you to see machine learning in action and help you complete your assignments quickly.
Upgrade GraphLab Create
If you are using GraphLab Create and already have it installed, please make sure you upgrade to the latest version! The simplest way to do this is to:
Open the GraphLab Launcher.
Click on ‘TERMINAL’.
On the terminal window, type: pip install --upgrade graphlab-create
Resources
These are some good resources you can explore, if you are using the recommended software tools:
In the first course of this ML specialization, Machine Learning Foundations, we provided many tutorials and getting started guides. We recommend you go over those before tackling this course.
There are many Python resources available online. Here is a good place for documentation.
For SFrame & GraphLab Create, there is also a lot of information available online. Here are some starting points: the User Guide and detailed API docs.
For Numpy, here is a getting started guide. We will also provide a tutorial when it’s time to use it.
Installing the recommended software tools
If you choose to use the recommended tools, you have two options: downloading and installing the required software or using a prepackaged version on a free instance on Amazon EC2.
1.Option 1: Downloading and installing all software on your own machine
Download and install Python, IPython Notebook, Numpy, SFrame and GraphLab Create. You can find the instructions here.
2.Option 2: Using a free Amazon EC2 with all the software preinstalled
If you do not have a 64-bit computer, you will not be able to run GraphLab Create. Additionally, some of you may want a simple experience where you don’t have to download the course content and install everything locally. Here, we’ll address these situations!
Amazon EC2 offers free cloud computing hours with what they call micro instances. These instances are all we need to do the work for this course. We have created an image for one such instance that is easy to launch and contains all the course content. This will allow you to run everything you need for this course in the cloud for free, without having to install anything locally. (You do need to create an Amazon EC2 account and have internet access.)
You can find step-by-step instructions here:
https://turi.com/download/installgraphlabcreateawscoursera.html
We note that installing all the software on your own local machine may be the right option for most people, especially since you can then run everything locally without needing to be online to do the homework. But the option using Amazon EC2 should be a great alternative.
Github repository with starter code
In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:
https://github.com/learnml/machinelearningspecialization
Using other software packages
We strongly encourage you to use the recommended software packages for this course, since they will allow you to learn the fundamental concepts more quickly. However, you are welcome to use others. Here are a few notes if you do so.
1.Installing other software tools
In the instructions above, you will be using the GraphLab Launcher, which will automatically install Python, IPython Notebook, Numpy, Matplotlib, SFrame and GraphLab Create. If you don’t use the GraphLab Launcher, you will need to install each of these tools separately, by following the pages linked above. Anaconda is a good tool to help simplify some of this installation.
2.If you are using SFrame, but not GraphLab Create
GraphLab Create uses SFrame under the hood, but you can use just SFrame for most assignments. If you choose to do so, in the starter code for the assignments, you should change the line
import graphlab
to
import sframe
and everything should work with just some small modifications, e.g., the calls:
graphlab.SFrame(...)
will become
sframe.SFrame(...)
3.If you are using other software tools out there
You are welcome to use other packages, e.g., scikit-learn instead of GraphLab Create, or Pandas instead of SFrame, or even R instead of Python. If you choose to use all these different packages, we will provide the datasets (in standard CSV format) and the assignment questions will not depend specifically on the recommended tools.
Linear Classifiers & Logistic Regression
Linear classifiers are amongst the most practical classification methods. For example, in our sentiment analysis case study, a linear classifier associates a coefficient with the counts of each word in the sentence. In this module, you will become proficient in this type of representation. You will focus on a particularly useful type of linear classifier called logistic regression, which, in addition to allowing you to predict a class, provides a probability associated with the prediction. These probabilities are extremely useful, since they provide a degree of confidence in the predictions. In this module, you will also be able to construct features from categorical inputs, and to tackle classification problems with more than two classes (multiclass problems). You will examine the results of these techniques on a real-world product sentiment analysis task.
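As a rough illustration of the representation described above, the sketch below scores a sentence by multiplying each word count by a coefficient, predicts the class from the sign of the score, and passes the score through the sigmoid to get a probability. The vocabulary, the coefficient values, and the function names are made up for this example; a real classifier would learn the coefficients from data.

```python
import math

# Hypothetical coefficients for a tiny vocabulary (illustrative values only).
coefficients = {"awesome": 1.5, "great": 1.0, "awful": -2.0, "bad": -1.0}
intercept = 0.0


def word_counts(sentence):
    """Count how often each word appears in a lower-cased sentence."""
    counts = {}
    for word in sentence.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def score(sentence):
    """Weighted sum of word counts; words without a coefficient contribute 0."""
    counts = word_counts(sentence)
    return intercept + sum(
        coefficients.get(word, 0.0) * count for word, count in counts.items()
    )


def probability_positive(sentence):
    """Sigmoid of the score, read as an estimate of P(y = +1 | x)."""
    return 1.0 / (1.0 + math.exp(-score(sentence)))


def predict(sentence):
    """Predict +1 for a positive score, -1 otherwise (ties go to -1 here)."""
    return 1 if score(sentence) > 0 else -1


review = "the food was great but the service was awful"
print(score(review), predict(review), probability_positive(review))
```

A score far from zero gives a probability close to 0 or 1, while a score of exactly zero gives 0.5, which is how the probability expresses confidence in the prediction.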
Linear classifiers
Slides presented in this module 10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
logisticregressionmodelannotated.pdf
Linear classifiers: A motivating example 2 min
Intuition behind linear classifiers 3 min
https://www.coursera.org/learn/mlclassification/lecture/lCBwS/intuitionbehindlinearclassifiers
Decision boundaries 3 min
https://www.coursera.org/learn/mlclassification/lecture/NIdE0/decisionboundaries
Linear classifier model 5 min
https://www.coursera.org/learn/mlclassification/lecture/XBc9n/linearclassifiermodel
Effect of coefficient values on decision boundary 2 min
Using features of the inputs 2 min
https://www.coursera.org/learn/mlclassification/lecture/WHIMY/usingfeaturesoftheinputs
Class probabilities
Predicting class probabilities 1 min
https://www.coursera.org/learn/mlclassification/lecture/j4Ji0/predictingclassprobabilities
Review of basics of probabilities 6 min
https://www.coursera.org/learn/mlclassification/lecture/p6rtM/reviewofbasicsofprobabilities
Review of basics of conditional probabilities 8 min
Using probabilities in classification 2 min
https://www.coursera.org/learn/mlclassification/lecture/f0nhO/usingprobabilitiesinclassification
Logistic regression
Predicting class probabilities with (generalized) linear models 5 min
The sigmoid (or logistic) link function 4 min
https://www.coursera.org/learn/mlclassification/lecture/KXvGC/thesigmoidorlogisticlinkfunction
Logistic regression model 5 min
https://www.coursera.org/learn/mlclassification/lecture/OJQXu/logisticregressionmodel
Effect of coefficient values on predicted probabilities 7 min
Overview of learning logistic regression models 2 min
Practical issues for classification
Encoding categorical inputs 4 min
https://www.coursera.org/learn/mlclassification/lecture/kCY0D/encodingcategoricalinputs
Multiclass classification with 1 versus all 7 min
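The two practical issues listed above can be sketched in a few lines. This is a minimal illustration, assuming a one-hot encoding for categorical inputs and a 1 versus all scheme in which one binary classifier is trained per class and the most confident one wins. The function names and the example values are hypothetical.

```python
import numpy as np


def one_hot_encode(values):
    """Turn categorical values into a 0/1 matrix with one column per category."""
    categories = sorted(set(values))
    column_of = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for i, value in enumerate(values):
        encoded[i, column_of[value]] = 1.0
    return categories, encoded


def one_versus_all_labels(labels, target_class):
    """Relabel a multiclass problem as binary: +1 for target_class, -1 otherwise."""
    return np.array([1 if label == target_class else -1 for label in labels])


def predict_one_versus_all(class_probabilities):
    """Pick the class whose binary classifier reports the highest probability.

    class_probabilities maps each class name to that classifier's estimate of
    P(y = class | x) for a single input.
    """
    return max(class_probabilities, key=class_probabilities.get)


categories, encoded = one_hot_encode(["Brazil", "Kenya", "Brazil", "Japan"])
print(categories)
print(encoded)
print(one_versus_all_labels(["cat", "dog", "bird"], "dog"))
print(predict_one_versus_all({"cat": 0.2, "dog": 0.7, "bird": 0.4}))
```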
Summarizing linear classifiers & logistic regression
Recap of logistic regression classifier 1 min
Quiz: Linear Classifiers & Logistic Regression 5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 20, 11:59 PM PDT
1 point
1.(True/False) A linear classifier assigns the predicted class based on the sign of Score(x) = w^T h(x).
1 point
2.(True/False) For a conditional probability distribution over y | x, where y takes on two values (+1, -1, i.e. good review, bad review), P(y = +1 | x) + P(y = -1 | x) = 1.
1 point
3.Which function does logistic regression use to “squeeze” the real line to [0, 1]?
1 point
4.If Score(x) = w^T h(x) > 0, which of the following is true about P(y = +1 | x)?
1 point
5.Consider training a 1 vs. all multiclass classifier for the problem of digit recognition using logistic regression. There are 10 digits, thus there are 10 classes. How many logistic regression classifiers will we have to train?
Programming Assignment
Predicting sentiment from product reviews 10 min
Quiz: Predicting sentiment from product reviews 12 questions
To Pass: 70% or higher
Attempts: 3 every 8 hours
Deadline: August 20, 11:59 PM PDT
1 point
1.How many weights are greater than or equal to 0?
1 point
2.Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?
1 point
3.Which of the following products are represented in the 20 most positive reviews?
1 point
4.Which of the following products are represented in the 20 most negative reviews?
1 point
5.What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).
1 point
6.Does a higher accuracy value on the training_data always imply that the classifier is better?
1 point
7.Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words.
How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?
1 point
8.Are the positive words in the simple_model also positive words in the sentiment_model?
1 point
9.Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?
1 point
10.Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?
1 point
11.Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).
1 point
12.Is the sentiment_model definitely better than the majority class classifier (the baseline)?
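Several of the questions above compare a model's accuracy against a majority class baseline. The sketch below shows one way to compute both quantities. The labels and predictions are hypothetical toy values and have nothing to do with the assignment data or its answers.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority_label] * len(test_labels))


# Hypothetical labels and predictions, not taken from the assignment.
train_labels = [1, 1, 1, -1, 1, -1, 1, 1]
test_labels = [1, -1, 1, 1, -1]
model_predictions = [1, -1, 1, -1, -1]

model_accuracy = accuracy(test_labels, model_predictions)
baseline_accuracy = majority_class_accuracy(train_labels, test_labels)
print(model_accuracy, baseline_accuracy)
```

A classifier is only interesting if it clearly beats this baseline, which is why the accuracy number alone says little when one class dominates the data.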
Week 2
Learning Linear Classifiers
Once familiar with linear classifiers and logistic regression, you can now dive in and write your first learning algorithm for classification. In particular, you will use gradient ascent to learn the coefficients of your classifier from data. You first will need to define the quality metric for these tasks using an approach called maximum likelihood estimation (MLE). You will also become familiar with a simple technique for selecting the step size for gradient ascent. An optional, advanced part of this module will cover the derivation of the gradient for logistic regression. You will implement your own learning algorithm for logistic regression from scratch, and use it to learn a sentiment analysis classifier.
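The learning procedure described above can be sketched as follows. This is a minimal illustration rather than the assignment's reference solution: it assumes labels in {+1, -1}, a feature matrix whose first column is a constant 1 for the intercept, and a fixed step size. The toy data and all function names are hypothetical.

```python
import numpy as np


def predict_probability(feature_matrix, coefficients):
    """P(y = +1 | x, w) for every row, using the sigmoid of the score."""
    scores = feature_matrix @ coefficients
    return 1.0 / (1.0 + np.exp(-scores))


def log_likelihood(feature_matrix, labels, coefficients):
    """Sum over data points of log P(y_i | x_i, w), with labels in {+1, -1}."""
    scores = feature_matrix @ coefficients
    # log sigmoid(y * score), written with logaddexp for numerical stability
    return -np.sum(np.logaddexp(0.0, -labels * scores))


def gradient_ascent(feature_matrix, labels, step_size, num_iterations):
    """Repeatedly move the coefficients in the direction of the gradient."""
    coefficients = np.zeros(feature_matrix.shape[1])
    indicator = (labels == 1).astype(float)
    for _ in range(num_iterations):
        errors = indicator - predict_probability(feature_matrix, coefficients)
        gradient = feature_matrix.T @ errors
        coefficients = coefficients + step_size * gradient
    return coefficients


# Hypothetical toy data: a constant column plus one input feature.
features = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0],
                     [1.0, -2.0], [1.0, 0.5], [1.0, -0.5]])
labels = np.array([1, 1, -1, -1, -1, 1])

initial_ll = log_likelihood(features, labels, np.zeros(2))
learned = gradient_ascent(features, labels, step_size=0.1, num_iterations=200)
final_ll = log_likelihood(features, labels, learned)
print(initial_ll, final_ll, learned)
```

With a step size this small the log likelihood should rise from its starting value; a step size that is too large can make it oscillate or diverge instead.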
Maximum likelihood estimation
Slides presented in this module 10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
logisticregressionlearningannotated.pdf
Goal: Learning parameters of logistic regression 2 min
Intuition behind maximum likelihood estimation 4 min
Data likelihood 8 min
Finding best linear classifier with gradient ascent 3 min
Gradient ascent algorithm for learning logistic regression classifier
Review of gradient ascent 6 min
Learning algorithm for logistic regression 3 min
Example of computing derivative for logistic regression 5 min
Interpreting derivative for logistic regression 5 min
Summary of gradient ascent for logistic regression 2 min
Choosing step size for gradient ascent/descent
Choosing step size 5 min
Careful with step sizes that are too large 4 min
Rule of thumb for choosing step size 3 min
(VERY OPTIONAL LESSON) Deriving gradient of logistic regression
(VERY OPTIONAL) Deriving gradient of logistic regression: Log trick 4 min
(VERY OPTIONAL) Expressing the log-likelihood 3 min
(VERY OPTIONAL) Deriving probability y=1 given x 2 min
(VERY OPTIONAL) Rewriting the log likelihood into a simpler form 8 min
(VERY OPTIONAL) Deriving gradient of log likelihood 8 min
Summarizing learning linear classifiers
Recap of learning logistic regression classifiers 1 min
Quiz: Learning Linear Classifiers 6 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT
1 point
1.(True/False) A linear classifier can only learn positive coefficients.
1 point
2.(True/False) In order to train a logistic regression model, we find the weights that maximize the likelihood of the model.
1 point
3.(True/False) The data likelihood is the product of the probability of the inputs x given the weights w and response y.
1 point
4.Questions 4 and 5 refer to the following scenario.
Consider the setting where our inputs are 1-dimensional. We have data
x | y
2.5 | +1
0.3 | -1
2.8 | +1
0.5 | +1
and the current estimates of the weights are w0 = 0 and w1 = 1. (w0: the intercept, w1: the weight for x).
Calculate the likelihood of this data. Round your answer to 2 decimal places.
1 point
5.Refer to the scenario given in Question 4 to answer the following:
Calculate the derivative of the log likelihood with respect to w1. Round your answer to 2 decimal places.


1 point
6.Which of the following is true about gradient ascent? Select all that apply.
Programming Assignment
Implementing logistic regression from scratch 10 min
Quiz: Implementing logistic regression from scratch 8 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT
1 point
1.How many reviews in amazon_baby_subset.gl contain the word perfect?
1 point
2.Consider the feature_matrix that was obtained by converting our data to NumPy format.
How many features are there in the feature_matrix?
1 point
3.Assuming that the intercept is present, how does the number of features in feature_matrix relate to the number of features in the logistic regression model? Let x = [number of features in feature_matrix] and y = [number of features in logistic regression model].
1 point
4.Run your logistic regression solver with provided parameters.
As each iteration of gradient ascent passes, does the log-likelihood increase or decrease?
1 point
5.We make predictions using the weights just learned.
How many reviews were predicted to have positive sentiment?
1 point
6.What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)
1 point
7.We look at “most positive” words, the words that correspond most strongly with positive reviews.
Which of the following words is not present in the top 10 “most positive” words?
1 point
8.Similarly, we look at “most negative” words, the words that correspond most strongly with negative reviews.
Which of the following words is not present in the top 10 “most negative” words?
Overfitting & Regularization in Logistic Regression
As we saw in the regression course, overfitting is perhaps the most significant challenge you will face as you apply machine learning approaches in practice. This challenge can be particularly significant for logistic regression, as you will discover in this module, since you not only risk getting an overly complex decision boundary, but your classifier can also become overly confident about the probabilities it predicts. In this module, you will investigate overfitting in classification in significant detail, and obtain broad practical insights from some interesting visualizations of the classifiers’ outputs. You will then add a regularization term to your optimization to mitigate overfitting. You will investigate both L2 regularization to penalize large coefficient values, and L1 regularization to obtain additional sparsity in the coefficients. Finally, you will modify your gradient ascent algorithm to learn regularized logistic regression classifiers. You will implement your own regularized logistic regression classifier from scratch, and investigate the impact of the L2 penalty on real-world sentiment analysis data.
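To make the L2 idea concrete, here is a minimal sketch of gradient ascent on the log likelihood minus an L2 penalty. It assumes labels in {+1, -1}, a constant first column for the intercept, and that the intercept is left out of the penalty, which is a common convention and is labeled as an assumption in the code. The toy data and function names are hypothetical.

```python
import numpy as np


def l2_regularized_gradient(feature_matrix, labels, coefficients, l2_penalty):
    """Gradient of  log_likelihood(w) - l2_penalty * ||w||^2."""
    scores = feature_matrix @ coefficients
    probabilities = 1.0 / (1.0 + np.exp(-scores))
    indicator = (labels == 1).astype(float)
    gradient = feature_matrix.T @ (indicator - probabilities)
    penalty_gradient = 2.0 * l2_penalty * coefficients
    penalty_gradient[0] = 0.0  # assumption: column 0 is an unpenalized intercept
    return gradient - penalty_gradient


def fit(feature_matrix, labels, l2_penalty, step_size=0.1, num_iterations=500):
    """Gradient ascent on the L2-regularized objective, starting from zeros."""
    coefficients = np.zeros(feature_matrix.shape[1])
    for _ in range(num_iterations):
        coefficients = coefficients + step_size * l2_regularized_gradient(
            feature_matrix, labels, coefficients, l2_penalty)
    return coefficients


# Hypothetical toy data: a constant column plus one input feature.
features = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0],
                     [1.0, -2.0], [1.0, 0.5], [1.0, -0.5]])
labels = np.array([1, 1, -1, -1, -1, 1])

w_unregularized = fit(features, labels, l2_penalty=0.0)
w_regularized = fit(features, labels, l2_penalty=1.0)
print(w_unregularized, w_regularized)
```

On this toy data the penalized coefficient on the input feature ends up smaller in magnitude than the unpenalized one, which is the shrinking effect the penalty is meant to have.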
Overfitting in classification
Slides presented in this module 10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
logisticregressionoverfittingannotated.pdf
Evaluating a classifier 3 min
https://www.coursera.org/learn/mlclassification/lecture/RzxaQ/evaluatingaclassifier
Review of overfitting in regression 3 min
Overfitting in classification 5 min
Visualizing overfitting with high-degree polynomial features 3 min
Overconfident predictions due to overfitting
Overfitting in classifiers leads to overconfident predictions 5 min
Visualizing overconfident predictions 4 min
(OPTIONAL) Another perspective on overfitting in logistic regression 8 min
L2 regularized logistic regression
Penalizing large coefficients to mitigate overfitting 5 min
L2 regularized logistic regression 4 min
Visualizing effect of L2 regularization in logistic regression 5 min
Learning L2 regularized logistic regression with gradient ascent 7 min
Sparse logistic regression
Sparse logistic regression with L1 regularization 7 min
Summarizing overfitting & regularization in logistic regression
Recap of overfitting & regularization in logistic regression 58 sec
Quiz: Overfitting & Regularization in Logistic Regression 8 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT
1 point
1.Consider four classifiers, whose classification performance is given by the following table:
Classifier | Classification error on training set | Classification error on validation set
Classifier 1 | 0.2 | 0.6
Classifier 2 | 0.8 | 0.6
Classifier 3 | 0.2 | 0.2
Classifier 4 | 0.5 | 0.4
Which of the four classifiers is most likely overfit?
1 point
2.Suppose a classifier classifies 23100 examples correctly and 1900 examples incorrectly. Compute error by hand. Round your answer to 3 decimal places.
1 point
3.(True/False) Accuracy and error measured on the same dataset always sum to 1.
1 point
4.Which of the following is NOT a correct description of complex models?
1 point
5.Which of the following is a symptom of overfitting in the context of logistic regression? Select all that apply.
1 point
6.Suppose we perform L2 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.
Note. Assume that the algorithm runs for a wide range of L2 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.
1 point
7.Suppose we perform L1 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.
Note. Assume that the algorithm runs for a wide range of L1 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.
1 point
8.In the context of L2 regularized logistic regression, which of the following occurs as we increase the L2 penalty λ? Choose all that apply.
Programming Assignment
Logistic Regression with L2 regularization 10 min
Quiz: Logistic Regression with L2 regularization 8 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: August 27, 11:59 PM PDT
1 point
1.In the function feature_derivative_with_L2, was the intercept term regularized?
1 point
2.Does the term with L2 regularization increase or decrease the log likelihood ℓℓ(w)?
1 point
3.Which of the following words is not listed in either positive_words or negative_words?
1 point
4.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.
(True/False) All coefficients consistently get smaller in size as the L2 penalty is increased.
1 point
5.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.
(True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for ‘cat’ was more positive than that for ‘dog’, this remains true as the L2 penalty increases.)
1 point
6.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.
Which of the following models has the highest accuracy on the training data?
1 point
7.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.
Which of the following models has the highest accuracy on the validation data?
1 point
8.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.
Does the highest accuracy on the training data imply that the model is the best one?
Week 3 Decision Trees
Along with linear classifiers, decision trees are amongst the most widely used classification techniques in the real world. This method is extremely intuitive, simple to implement and provides interpretable predictions.
 In this module, you will become familiar with the core decision tree representation.
 You will then design a simple, recursive greedy algorithm to learn decision trees from data.
 Finally, you will extend this approach to deal with continuous inputs, a fundamental requirement for practical problems.
 In this module, you will investigate a brand new case study in the financial sector: predicting the risk associated with a bank loan. You will implement your own decision tree learning algorithm on real loan data.
Intuition behind decision trees
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
decisiontreesannotated.pdf
Predicting loan defaults with decision trees3 min
Intuition behind decision trees1 min
Task of learning decision trees from data3 min
Learning decision trees
Recursive greedy algorithm4 min
Learning a decision stump3 min
Selecting best feature to split on6 min
When to stop recursing4 min
Using the learned decision tree
Making predictions with decision trees1 min
Multiclass classification with decision trees2 min
Learning decision trees with continuous inputs
Threshold splits for continuous inputs6 min
(OPTIONAL) Picking the best threshold to split on3 min
Visualizing decision boundaries5 min
Summarizing decision trees
Recap of decision trees56 sec
Quiz: Decision Trees11 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 3, 11:59 PM PDT
1 point
1.Questions 1 to 6 refer to the following common scenario:
Consider the following dataset:
x1  x2  x3  y
1   1   1   +1
0   1   0   -1
1   0   1   -1
0   0   1   +1
Let us train a decision tree with this data. Let’s call this tree T1. What feature will we split on at the root?
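Classification error after splitting on x1: 0.5
Classification error after splitting on x2: 0.5
Classification error after splitting on x3: 0.25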
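The three values listed above match the classification error of each candidate root split. A minimal sketch of that calculation follows, assuming the labels printed as "1" in the dataset are really -1 (the minus signs appear to have been lost) and that each branch predicts its majority label. The helper name `split_error` is hypothetical.

```python
from collections import Counter

# Dataset from Question 1. Assumption: labels shown as "1" are -1.
data = [
    # (x1, x2, x3, y)
    (1, 1, 1, +1),
    (0, 1, 0, -1),
    (1, 0, 1, -1),
    (0, 0, 1, +1),
]

def split_error(rows, feature_index):
    """Classification error after splitting on one binary feature,
    predicting the majority label in each branch."""
    mistakes = 0
    for value in (0, 1):
        labels = [row[-1] for row in rows if row[feature_index] == value]
        if labels:
            majority_count = Counter(labels).most_common(1)[0][1]
            mistakes += len(labels) - majority_count
    return mistakes / len(rows)

errors = {f"x{i + 1}": split_error(data, i) for i in range(3)}
print(errors)  # {'x1': 0.5, 'x2': 0.5, 'x3': 0.25}
```

Under these assumptions x3 gives the lowest error, so a greedy learner would pick it for the root.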
1 point
2.Refer to the dataset presented in Question 1 to answer the following.
Fully train T1 (until each leaf has data points of the same output label). What is the depth of T1?
1 point
3.Refer to the dataset presented in Question 1 to answer the following.
What is the training error of T1?
1 point
4.Refer to the dataset presented in Question 1 to answer the following.
Now consider a tree T2, which splits on x1 at the root, and splits on x2 in the 1st level, and has leaves at the 2nd level. Note: this is the XOR function on features 1 and 2. What is the depth of T2?
1 point
5.Refer to the dataset presented in Question 1 to answer the following.
What is the training error of T2?
1 point
6.Refer to the dataset presented in Question 1 to answer the following.
Which has smaller depth, T1 or T2?
1 point
7.(True/False) When deciding to split a node, we find the best feature to split on that minimizes classification error.
1 point
8.If you are learning a decision tree, and you are at a node in which all of its data has the same y value, you should
3: False
1 point
8.Let’s say we have learned a decision tree on dataset D. Consider the split learned at the root of the decision tree. Which of the following is true if one of the data points in D is removed and we retrain the tree?
1 point
9.Consider two datasets D1 and D2, where D2 has the same data points as D1, but has an extra feature for each data point. Let T1 be the decision tree trained with D1, and T2 be the tree trained with D2. Which of the following is true?
1 point
10.(True/False) Logistic regression with polynomial degree 1 features will always have equal or lower training error than decision stumps (depth 1 decision trees).
1 point
11.(True/False) Decision trees (with depth > 1) are always linear classifiers.
1 point
11.(True/False) Decision stumps (depth 1 decision trees) are always linear classifiers.
Programming Assignment 1
Identifying safe loans with decision trees10 min
Quiz: Identifying safe loans with decision trees7 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 3, 11:59 PM PDT
1 point
1.What percentage of the predictions on sample_validation_data did decision_tree_model get correct?
1 point
2.Which loan has the highest probability of being classified as a safe loan?
1 point
3.Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?
1 point
4.Based on the visualized tree, what prediction would you make for this data point?
1 point
5.What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01 (e.g. 0.76)?
1 point
6.How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?
1 point
7.Let us assume that each mistake costs money:
Assume a cost of $10,000 per false negative.
Assume a cost of $20,000 per false positive.
What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.
Programming Assignment 2
Implementing binary decision trees10 min
Quiz: Implementing binary decision trees7 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 3, 11:59 PM PDT
1 point
1.What was the feature that my_decision_tree first split on while making the prediction for test_data[0]?
1 point
2.What was the first feature that led to a right split of test_data[0]?
1 point
3.What was the last feature split on before reaching a leaf node for test_data[0]?
1 point
4.Rounded to 2nd decimal point (e.g. 0.76), what is the classification error of my_decision_tree on the test_data?
1 point
5.What is the feature that is used for the split at the root node?
1 point
6.What is the path of the first 3 feature splits considered along the leftmost branch of my_decision_tree?
1 point
7.What is the path of the first 3 feature splits considered along the rightmost branch of my_decision_tree?
Week 4
Preventing Overfitting in Decision Trees
Out of all machine learning techniques, decision trees are amongst the most prone to overfitting. No practical implementation is possible without including approaches that mitigate this challenge. In this module, through various visualizations and investigations, you will explore why decision trees suffer from significant overfitting problems. Using the principle of Occam’s razor, you will mitigate overfitting by learning simpler trees. At first, you will design algorithms that stop the learning process before the decision trees become overly complex. In an optional segment, you will design a very practical approach that learns an overly complex tree and then simplifies it with pruning. Your implementation will investigate the effect of these techniques on mitigating overfitting on our real-world loan data set.
Overfitting in decision trees
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
decisiontreesoverfittingannotated.pdf
A review of overfitting2 min
Overfitting in decision trees5 min
Early stopping to avoid overfitting
Principle of Occam’s razor: Learning simpler decision trees5 min
Early stopping in learning decision trees6 min
(OPTIONAL LESSON) Pruning decision trees
(OPTIONAL) Motivating pruning8 min
(OPTIONAL) Pruning decision trees to avoid overfitting6 min
(OPTIONAL) Tree pruning algorithm3 min
Summarizing preventing overfitting in decision trees
Recap of overfitting and regularization in decision trees1 min
Quiz: Preventing Overfitting in Decision Trees11 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 10, 11:59 PM PDT
1 point
1.(True/False) When learning decision trees, smaller depth USUALLY translates to lower training error.
1 point
2.(True/False) If no two data points have the same input values, we can always learn a decision tree that achieves 0 training error.
1 point
3.(True/False) If decision tree T1 has lower training error than decision tree T2, then T1 will always have better test error than T2.
1 point
4.Which of the following is true for decision trees?
1 point
5.Pruning and early stopping in decision trees is used to
1 point
6.Which of the following is NOT an early stopping method?
1 point
7.Consider decision tree T1 learned with minimum node size parameter = 1000. Now consider decision tree T2 trained on the same dataset and parameters, except that the minimum node size parameter is now 100. Which of the following is always true?
1 point
8.Questions 8 to 11 refer to the following common scenario:
Imagine we are training a decision tree, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. The data at this node is:
x1  x2  y
0   1   +1
1   0   +1
0   1   +1
1   1   -1
What is the classification error at this node (assuming a majority class classifier)?
1 point
9.Refer to the scenario presented in Question 8.
If we split on x1, what is the classification error?
1 point
10.Refer to the scenario presented in Question 8.
If we split on x2, what is the classification error?
1 point
11.Refer to the scenario presented in Question 8.
If our parameter for minimum gain in error reduction is 0.1, do we split or stop early?
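The scenario in Questions 8 to 11 can be worked through in a few lines. This is a sketch under two assumptions: the last label in the table is -1 (its minus sign looks lost), and the early stopping rule is "stop when the best error reduction is below the minimum gain parameter". The helper names are hypothetical.

```python
from collections import Counter

# Node data from Question 8. Assumption: the last label is -1.
node = [
    # (x1, x2, y)
    (0, 1, +1),
    (1, 0, +1),
    (0, 1, +1),
    (1, 1, -1),
]

def majority_mistakes(labels):
    """Number of points a majority-class prediction gets wrong."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def node_error(rows):
    """Classification error of a majority class classifier at the node."""
    return majority_mistakes([r[-1] for r in rows]) / len(rows)

def split_error(rows, feature_index):
    """Classification error after splitting on one binary feature."""
    mistakes = sum(
        majority_mistakes([r[-1] for r in rows if r[feature_index] == v])
        for v in (0, 1)
    )
    return mistakes / len(rows)

error_before = node_error(node)
error_after = {name: split_error(node, i) for i, name in enumerate(["x1", "x2"])}
best_gain = error_before - min(error_after.values())

min_error_reduction = 0.1  # parameter value from Question 11
decision = "split" if best_gain >= min_error_reduction else "stop"
print(error_before, error_after, best_gain, decision)
```

With these numbers neither split reduces the error, so the gain falls below the threshold and the sketch stops early.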
Programming Assignment
Decision Trees in Practice10 min
Quiz: Decision Trees in Practice14 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 10, 11:59 PM PDT
1 point
1.Given an intermediate node with 6 safe loans and 3 risky loans, if the min_node_size parameter is 10, what should the tree learning algorithm do next?
1 point
2.Assume an intermediate node has 6 safe loans and 3 risky loans. For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the minimum gain in error reduction parameter is set to 0.2, what should the tree learning algorithm do next?
1 point
3.Consider the prediction path validation_set[0] with my_decision_tree_old and my_decision_tree_new. For my_decision_tree_new trained with


is the prediction path shorter, longer, or the same as the prediction path using my_decision_tree_old that ignored the early stopping conditions 2 and 3?
1 point
4.Consider the prediction path for ANY new data point. For my_decision_tree_new trained with


is the prediction path for a data point always shorter, always longer, always the same, shorter or the same, or longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?
1 point
5.For a tree trained on any dataset using parameters


what is the maximum possible number of splits encountered while making a single prediction?
1 point
6.Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?
1 point
7.Which tree has the smallest error on the validation data?
1 point
8.Does the tree with the smallest error in the training data also have the smallest error in the validation data?
1 point
9.Is it always true that the tree with the lowest classification error on the training set will result in the lowest classification error in the validation set?
1 point
10.Which tree has the largest complexity?
1 point
11.Is it always true that the most complex tree will result in the lowest classification error in the validation_set?
1 point
12.Using the complexity definition, which model (model_4, model_5, or model_6) has the largest complexity?
1 point
13.model_4 and model_5 have similar classification error on the validation set but model_5 has lower complexity. Should you pick model_5 over model_4?
1 point
14.Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use?
Handling Missing Data
Real-world machine learning problems are fraught with missing data. That is, very often, some of the inputs are not observed for all data points. This challenge is very significant, happens in most cases, and needs to be addressed carefully to obtain great performance. Yet this issue is rarely discussed in machine learning courses. In this module, you will tackle the missing data challenge head on. You will start with the two most basic techniques to convert a dataset with missing data into a clean dataset, namely skipping missing values and imputing missing values. In an advanced section, you will also design a modification of the decision tree learning algorithm that builds decisions about missing data right into the model. You will also explore these techniques in your real-data implementation.
Basic strategies for handling missing data
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
decisiontreesmissingvaluesannotated.pdf
Challenge of missing data3 min
Strategy 1: Purification by skipping missing data4 min
Strategy 2: Purification by imputing missing data4 min
Strategy 3: Modify learning algorithm to explicitly handle missing data
Modifying decision trees to handle missing data4 min
Feature split selection with missing data5 min
Summarizing handling missing data
Recap of handling missing data1 min
Quiz: Handling Missing Data7 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 10, 11:59 PM PDT
1 point
1.(True/False) Skipping data points (i.e., skipping rows of the data) that have missing features only works when the learning algorithm we are using is decision tree learning.
1 point
2.What are potential downsides of skipping features with missing values (i.e., skipping columns of the data) to handle missing data?
1 point
3.(True/False) It’s always better to remove missing data points (i.e., rows) as opposed to removing missing features (i.e., columns).
1 point
4.Consider a dataset with N training points. After imputing missing values, the number of data points in the data set is
1 point
5.Consider a dataset with D features. After imputing missing values, the number of features in the data set is
1 point
6.Which of the following are always true when imputing missing data? Select all that apply.
1 point
7.Consider data that has binary features (i.e. the feature values are 0 or 1) with some feature values of some data points missing. When learning the best feature split at a node, how would we best modify the decision tree learning algorithm to handle data points with missing values for a feature?
Week 5 Boosting
One of the most exciting theoretical questions that have been asked about machine learning is whether simple classifiers can be combined into a highly accurate ensemble. This question led to the development of boosting, one of the most important and practical techniques in machine learning today. This simple approach can boost the accuracy of any classifier, and is widely used in practice, e.g., it’s used by more than half of the teams who win the Kaggle machine learning competitions. In this module, you will first define the ensemble classifier, where multiple models vote on the best prediction. You will then explore a boosting algorithm called AdaBoost, which provides a great approach for boosting classifiers. Through visualizations, you will become familiar with many of the practical aspects of this technique. You will create your very own implementation of AdaBoost, from scratch, and use it to boost the performance of your loan risk predictor on real data.
The amazing idea of boosting a classifier
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
The boosting question3 min
Ensemble classifiers5 min
Boosting5 min
AdaBoost
AdaBoost overview3 min
Weighted error4 min
Computing coefficient of each ensemble component4 min
Reweighing data to focus on mistakes4 min
Normalizing weights2 min
Applying AdaBoost
Example of AdaBoost in action5 min
Learning boosted decision stumps with AdaBoost4 min
Programming Assignment 1
Exploring Ensemble Methods10 min
Quiz: Exploring Ensemble Methods9 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 17, 11:59 PM PDT
1 point
1.What percentage of the predictions on sample_validation_data did model_5 get correct?
1 point
2.According to model_5, which loan is the least likely to be a safe loan?
1 point
3.What is the number of false positives on the validation data?
1 point
4.Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (model_5) as evaluated on the validation_set?
1 point
5.What grades are the top 5 loans?
1 point
6.Which model has the best accuracy on the validation_data?
1 point
7.Is it always true that the model with the most trees will perform best on the test/validation set?
1 point
8.Does the training error reduce as the number of trees increases?
1 point
9.Is it always true that the test/validation error will reduce as the number of trees increases?
Convergence and overfitting in boosting
The Boosting Theorem3 min
Overfitting in boosting5 min
Summarizing boosting
Ensemble methods, impact of boosting & quick recap4 min
Quiz:Boosting11 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 17, 11:59 PM PDT
1 point
1.Which of the following is NOT an ensemble method?
1 point
2.Each binary classifier in an ensemble makes predictions on an input x as listed in the table below. Based on the ensemble coefficients also listed in the table, what is the final ensemble model’s prediction for x?
              Classifier coefficient w_t  Prediction for x
Classifier 1  0.61                        +1
Classifier 2  0.53                        -1
Classifier 3  0.88                        -1
Classifier 4  0.34                        +1
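An ensemble prediction of this kind is the sign of the coefficient-weighted sum of the individual votes. The sketch below applies that rule to the table, assuming the predictions printed as "1" are -1 and that all four coefficients are positive as shown.

```python
# Values from the table. Assumption: predictions shown as "1" are -1.
coefficients = [0.61, 0.53, 0.88, 0.34]
predictions = [+1, -1, -1, +1]

# Weighted vote: each classifier contributes coefficient * prediction.
score = sum(w * f for w, f in zip(coefficients, predictions))

# The ensemble predicts the sign of the weighted vote.
ensemble_prediction = +1 if score > 0 else -1
print(round(score, 2), ensemble_prediction)  # -0.46 -1
```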
1 point
3.(True/False) Boosted decision stumps is a linear classifier.
1 point
4.(True/False) For AdaBoost, test error is an appropriate criterion for choosing the optimal number of iterations.
1 point
5.In an iteration in AdaBoost, recall that the coefficient w_t for the learned weak learner f_t is calculated by
$$\displaystyle{\frac{1}{2}\ln{\left( \frac{1 - \mathtt{weighted\_error}(f_t)}{\mathtt{weighted\_error}(f_t)} \right)}}$$
If the weighted error of f_t is equal to .25, what is the value of w_t? Round your answer to 2 decimal places.
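The coefficient formula is half the log-odds of the weak learner being right on the weighted data. A small sketch evaluates it for the weighted error given in the question; the function name `adaboost_coefficient` is hypothetical.

```python
import math

def adaboost_coefficient(weighted_error):
    """w_t = 1/2 * ln((1 - weighted_error) / weighted_error)."""
    return 0.5 * math.log((1 - weighted_error) / weighted_error)

w_t = adaboost_coefficient(0.25)
print(round(w_t, 2))  # 0.55
```

A weighted error of exactly 0.5 gives a coefficient of 0, so a weak learner no better than chance contributes nothing to the vote.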
1 point
6.Which of the following classifiers is most accurate as computed on a weighted dataset? A classifier with:
1 point
7.Imagine we are training a decision stump in an iteration of AdaBoost, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. Also included are the weights of the data. The data at this node is:
Weight  x1  x2  y
0.3     0   1   +1
0.35    1   0   -1
0.1     0   1   +1
0.25    1   1   +1
Suppose we assign the same class label to all data in this node. (Pick the class label with the greater total weight.) What is the weighted error at the node? Round your answer to 2 decimal places.
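The weighted error at a node is the total weight of the points that disagree with the predicted label, divided by the total weight. The sketch below assumes the second row's label is -1 (its minus sign looks lost in the table).

```python
# (weight, label) pairs from the table. Assumption: the second label is -1.
rows = [(0.3, +1), (0.35, -1), (0.1, +1), (0.25, +1)]

# Total weight carried by each class label.
weight_by_label = {}
for weight, label in rows:
    weight_by_label[label] = weight_by_label.get(label, 0.0) + weight

# Predict the label with the greater total weight.
predicted_label = max(weight_by_label, key=weight_by_label.get)

# Weighted error = weight of the other label(s) / total weight.
total_weight = sum(weight_by_label.values())
weighted_error = (total_weight - weight_by_label[predicted_label]) / total_weight
print(predicted_label, round(weighted_error, 2))  # 1 0.35
```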
1 point
8.After each iteration of AdaBoost, the weights on the data points are typically normalized to sum to 1. This is used because
1 point
9.Consider the following 2D dataset with binary labels.
We train a series of weak binary classifiers using AdaBoost. In one iteration, the weak binary classifier produces the decision boundary as follows:
Which of the five points (indicated in the second figure) will receive higher weight in the following iteration? Choose all that apply.
1 point
10.Suppose we are running AdaBoost using decision tree stumps. At a particular iteration, the data points have weights according to the figure. (Large points indicate heavy weights.)
Which of the following decision tree stumps is most likely to be fit in the next iteration?
1 point
11.(True/False) AdaBoost achieves zero training error after a sufficient number of iterations, as long as we can find weak learners that perform better than random chance at each iteration of AdaBoost (i.e., on weighted data).
Programming Assignment 2
Boosting a decision stump10 min
Quiz:Boosting a decision stump5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
September 17, 11:59 PM PDT
1 point
1.Recall that the classification error for unweighted data is defined as follows:
$$\text{classification error} = \frac{\#\,\text{mistakes}}{\#\,\text{all data points}}$$
Meanwhile, the weight of mistakes for weighted data is given by
$$\mathrm{WM}(\alpha, \hat{y}) = \sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y}_i].$$
If we set the weights α=1 for all data points, how is the weight of mistakes WM(α,ŷ) related to the classification error?
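To make the two definitions concrete, the sketch below computes both quantities on a small made-up example. The labels and predictions are hypothetical and do not come from the assignment data.

```python
def weight_of_mistakes(weights, labels, predictions):
    """WM = sum of alpha_i over the points where y_i != y_hat_i."""
    return sum(a for a, y, y_hat in zip(weights, labels, predictions) if y != y_hat)

def classification_error(labels, predictions):
    """Fraction of points where y_i != y_hat_i."""
    mistakes = sum(1 for y, y_hat in zip(labels, predictions) if y != y_hat)
    return mistakes / len(labels)

# Hypothetical labels and predictions, for illustration only.
labels = [+1, -1, +1, +1, -1]
predictions = [+1, +1, +1, -1, -1]
weights = [1.0] * len(labels)  # alpha_i = 1 for every data point

wm = weight_of_mistakes(weights, labels, predictions)
err = classification_error(labels, predictions)
print(wm, err)  # 2.0 0.4
```

With unit weights, the weight of mistakes is simply the number of mistakes, so dividing it by the number of data points recovers the classification error.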
1 point
2.Refer to section Example: Training a weighted decision tree.
Will you get the same model as small_data_decision_tree_subset_20 if you trained a decision tree with only 20 data points from the set of points in subset_20?
1 point
3.Refer to the 10-component ensemble of tree stumps trained with AdaBoost.
As each component is trained sequentially, are the component weights monotonically decreasing, monotonically increasing, or neither?
1 point
4.Which of the following best describes a general trend in accuracy as we add more and more components? Answer based on the 30 components learned so far.
1 point
5.From this plot (with 30 trees), is there massive overfitting as the # of iterations increases?
Week 6 Precision-Recall
In many real-world settings, accuracy or error are not the best quality metrics for classification. You will explore a case study that significantly highlights this issue: using sentiment analysis to display positive reviews on a restaurant website. Instead of accuracy, you will define two metrics: precision and recall, which are widely used in real-world applications to measure the quality of classifiers. You will explore how the probabilities output by your classifier can be used to trade off precision with recall, and dive into this spectrum, using precision-recall curves. In your hands-on implementation, you will compute these metrics with your learned classifier on real-world sentiment analysis data.
Why use precision & recall as quality metrics
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
Case study where accuracy is not best metric for classification3 min
What is good performance for a classifier?3 min
Precision & recall explained
Precision: Fraction of positive predictions that are actually positive5 min
Recall: Fraction of positive data predicted to be positive3 min
The precision-recall tradeoff
Precision-recall extremes2 min
Trading off precision and recall4 min
Precision-recall curve5 min
Summarizing precision-recall
Recap of precision-recall1 min
Quiz: Precision-Recall9 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 1, 11:59 PM PDT
1 point
1.Questions 1 to 5 refer to the following scenario:
Suppose a binary classifier produced the following confusion matrix.
                 Predicted Positive  Predicted Negative
Actual Positive  5600                40
Actual Negative  1900                2460
What is the recall of this classifier? Round your answer to 2 decimal places.
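Precision, recall, and accuracy all follow directly from the four cells of the confusion matrix. The sketch below computes them for the matrix above, along with the accuracy of a majority class classifier as a baseline; the variable names are illustrative.

```python
# Cells of the confusion matrix from the scenario.
tp, fn = 5600, 40     # actual positive: predicted positive / predicted negative
fp, tn = 1900, 2460   # actual negative: predicted positive / predicted negative
total = tp + fn + fp + tn

recall = tp / (tp + fn)        # fraction of actual positives predicted positive
precision = tp / (tp + fp)     # fraction of positive predictions that are correct
accuracy = (tp + tn) / total

# Baseline: always predict whichever class is more common.
majority_accuracy = max(tp + fn, fp + tn) / total

print(round(recall, 2), round(precision, 2), round(accuracy, 3), round(majority_accuracy, 3))
# 0.99 0.75 0.806 0.564
```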
1 point
2.Refer to the scenario presented in Question 1 to answer the following:
(True/False) This classifier is better than random guessing.
1 point
3.Refer to the scenario presented in Question 1 to answer the following:
(True/False) This classifier is better than the majority class classifier.
1 point
4.Refer to the scenario presented in Question 1 to answer the following:
Which of the following points in the precision-recall space corresponds to this classifier?
(1)
(2)
(3)
(4)
(5)
1 point
5.Refer to the scenario presented in Question 1 to answer the following:
Which of the following best describes this classifier?
It is optimistic
It is pessimistic
None of the
1 point
6.Suppose we are fitting a logistic regression model on a dataset where the vast majority of the data points are labeled as positive. To compensate for overfitting to the dominant class, we should
Require higher confidence level for positive predictions
Require lower confidence level for positive predictions
1 point
7.It is often the case that false positives and false negatives incur different costs. In situations where false negatives cost much more than false positives, we should
Require higher confidence level for positive predictions
Require lower confidence level for positive predictions
1 point
8.We are interested in reducing the number of false negatives. Which of the following metrics should we primarily look at?
Accuracy
Precision
Recall
1 point
9.Suppose we set the threshold for positive predictions at 0.9. What is the lowest score that is classified as positive? Round your answer to 2 decimal places.
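A probability threshold can be converted into a threshold on the score by inverting the link function. The sketch below assumes a logistic regression classifier, where the predicted probability is the sigmoid of the score, so the matching score is the log-odds of the threshold. The function name `probability_to_score` is hypothetical.

```python
import math

def probability_to_score(p):
    """Invert the sigmoid: the score s with 1 / (1 + exp(-s)) == p."""
    return math.log(p / (1 - p))

threshold = 0.9
min_score = probability_to_score(threshold)
print(round(min_score, 2))  # 2.2
```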
Programming Assignment
Exploring precision and recall10 min
Quiz: Exploring precision and recall13 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 1, 11:59 PM PDT
1 point
1.Consider the logistic regression model trained on amazon_baby.gl using GraphLab Create.
Using accuracy as the evaluation metric, was our logistic regression model better than the majority class classifier?
1 point
2.How many predicted values in the test set are false positives?
1 point
3.Consider the scenario where each false positive costs $100 and each false negative $1.
Given the stipulation, what is the cost associated with the logistic regression classifier’s performance on the test set?
Between $0 and $100,000
Between $100,000 and $200,000
Between $200,000 and $300,000
Above $300,000
1 point
4.Out of all reviews in the test set that are predicted to be positive, what fraction of them are false positives? (Round to the second decimal place e.g. 0.25)
1 point
5.Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would:
Discard a sufficient number of positive predictions
Discard a sufficient number of negative predictions
Increase threshold for predicting the positive class (ŷ = +1)
Decrease threshold for predicting the positive class (ŷ = +1)
1 point
6.What fraction of the positive reviews in the test_set were correctly predicted as positive by the classifier? Round your answer to 2 decimal places.
1 point
7.What is the recall value for a classifier that predicts +1 for all data points in the test_data?
1 point
8.What happens to the number of positive predicted reviews as the threshold increased from 0.5 to 0.9?
More reviews are predicted to be positive.
Fewer reviews are predicted to be positive.
1 point
9.Consider the metrics obtained from setting the threshold to 0.5 and to 0.9.
Does the recall increase with a higher threshold?
1 point
10.Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.
1 point
11.Using threshold = 0.98, how many false negatives do we get on the test_data? (Hint: You may use the graphlab.evaluation.confusion_matrix function implemented in GraphLab Create.)
1 point
12.Questions 13 and 14 are concerned with the reviews that contain the word baby.
Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better for the reviews of data in baby_reviews? Round your answer to 3 decimal places.
1 point
13.Questions 13 and 14 are concerned with the reviews that contain the word baby.
Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?
Larger
Smaller
Week 7 Scaling to Huge Datasets & Online Learning
With the advent of the internet, the growth of social media, and the embedding of sensors in the world, the magnitudes of data that our machine learning algorithms must handle have grown tremendously over the last decade. This effect is sometimes called “Big Data”. Thus, our learning algorithms must scale to bigger and bigger datasets. In this module, you will develop a small modification of gradient ascent called stochastic gradient, which provides significant speedups in the running time of our algorithms. This simple change can drastically improve scaling, but makes the algorithm less stable and harder to use in practice. In this module, you will investigate the practical techniques needed to make stochastic gradient viable, and thus to obtain learning algorithms that scale to huge datasets. You will also address a new kind of machine learning problem, online learning, where the data streams in over time, and we must learn the coefficients as the data arrives. This task can also be solved with stochastic gradient. You will implement your very own stochastic gradient ascent algorithm for logistic regression from scratch, and evaluate it on sentiment analysis data.
Scaling ML to huge datasets
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
onlinelearningannotated.pdf
Gradient ascent won’t scale to today’s huge datasets3 min
Timeline of scalable machine learning & stochastic gradient4 min
Scaling ML with stochastic gradient
Why gradient ascent won’t scale3 min
Stochastic gradient: Learning one data point at a time3 min
Comparing gradient to stochastic gradient3 min
Understanding why stochastic gradient works
Why would stochastic gradient ever work?4 min
Convergence paths2 min
Stochastic gradient: Practical tricks
Shuffle data before running stochastic gradient2 min
Choosing step size3 min
Don’t trust last coefficients1 min
(OPTIONAL) Learning from batches of data3 min
(OPTIONAL) Measuring convergence4 min
(OPTIONAL) Adding regularization3 min
Online learning: Fitting models from streaming data
The online learning task3 min
Using stochastic gradient for online learning3 min
Summarizing scaling to huge datasets & online learning
Scaling to huge datasets through parallelization & module recap1 min
Quiz: Scaling to Huge Datasets & Online Learning10 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 8, 11:59 PM PDT
1 point
1.(True/False) Stochastic gradient ascent often requires fewer passes over the dataset than batch gradient ascent to achieve a similar log likelihood.
1 point
2.(True/False) Choosing a large batch size results in less noisy gradients.
1 point
3.(True/False) The set of coefficients obtained at the last iteration represents the best coefficients found so far.
1 point
4.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.
Which of the following actions would help the most to improve the rate of convergence?
Increase step size
Decrease step size
Decrease batch size
1 point
5.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.
Which of the following actions would help to improve the rate of convergence?
Increase batch size
Increase step size
Decrease step size
1 point
6.Suppose it takes about 1 millisecond to compute a gradient for a single example. You run an online advertising company and would like to do online learning via minibatch stochastic gradient ascent. If you aim to update the coefficients once every 5 minutes, how many examples can you cover in each update? Overhead and other operations take up 2 minutes, so you only have 3 minutes for the coefficient update.
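The time budget in this question is plain unit conversion. The sketch below spells it out, assuming gradients are computed one example at a time with no parallelism; the variable names are illustrative.

```python
ms_per_gradient = 1        # time to compute the gradient for one example
update_interval_min = 5    # one coefficient update every 5 minutes
overhead_min = 2           # time taken by overhead and other operations

# Time left for gradient computation, converted to milliseconds.
budget_ms = (update_interval_min - overhead_min) * 60 * 1000

examples_per_update = budget_ms // ms_per_gradient
print(examples_per_update)  # 180000
```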
1 point
7.In search for an optimal step size, you experiment with multiple step sizes and obtain the following convergence plot.
Which line corresponds to the best step size?
(1)
(2)
(3)
(4)
(5)
1 point
8.Suppose you run stochastic gradient ascent with two different batch sizes. Which of the two lines below corresponds to the smaller batch size (assuming both use the same step size)?
(1)
(2)
1 point
9.Which of the following is NOT a benefit of stochastic gradient ascent over batch gradient ascent? Choose all that apply.
Each coefficient step is very fast.
Log likelihood of data improves monotonically.
Stochastic gradient ascent can be used for online learning.
Stochastic gradient ascent can achieve higher likelihood than batch gradient ascent for the same amount of running time.
Stochastic gradient ascent is highly robust with respect to parameter choices.
1 point
10.Suppose we run the stochastic gradient ascent algorithm described in the lecture with batch size of 100. To make 10 passes over a dataset consisting of 15400 examples, how many iterations does it need to run?
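The relationship between passes, batch size, and iterations is simple counting: one pass over N examples with batch size B takes N / B updates. A minimal sketch, assuming the batch size divides the dataset size evenly (the helper name `iterations_needed` is hypothetical):

```python
def iterations_needed(num_examples, batch_size, num_passes):
    """Number of mini-batch updates needed for `num_passes` full passes.

    Assumes batch_size divides num_examples evenly, as in the quiz figures.
    """
    if num_examples % batch_size != 0:
        raise ValueError("this sketch assumes batch_size divides num_examples")
    updates_per_pass = num_examples // batch_size
    return updates_per_pass * num_passes

print(iterations_needed(num_examples=15400, batch_size=100, num_passes=10))
```

The same helper covers the later question about two passes over 50000 data points with a batch size of 100.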
Programming Assignment
Training Logistic Regression via Stochastic Gradient Ascent10 min
Quiz: Training Logistic Regression via Stochastic Gradient Ascent12 questions
QUIZ
Training Logistic Regression via Stochastic Gradient Ascent
12 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 8, 11:59 PM PDT
1 point
1.In the Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does changing the solver to stochastic gradient ascent affect the number of features?
Increases
Decreases
Stays the same
1 point
2.Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by
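$$\ell\ell(w)=\sum_{i=1}^{N}\Big(\big(\mathbf{1}[y_i=+1]-1\big)\,w^T h(x_i)-\ln\big(1+\exp(-w^T h(x_i))\big)\Big)$$
whereas the average log likelihood is given by
$$\ell\ell_A(w)=\frac{1}{N}\sum_{i=1}^{N}\Big(\big(\mathbf{1}[y_i=+1]-1\big)\,w^T h(x_i)-\ln\big(1+\exp(-w^T h(x_i))\big)\Big)$$
How are the functions $$\ell\ell(w)$$ and $$\ell\ell_A(w)$$ related?
$$\ell\ell_A(w)=\ell\ell(w)$$
$$\ell\ell_A(w)=(1/N)\cdot\ell\ell(w)$$
$$\ell\ell_A(w)=N\cdot\ell\ell(w)$$
$$\ell\ell_A(w)=\ell\ell(w)-\|w\|$$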
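The two formulas can be compared numerically. The sketch below implements both as written above with NumPy; the feature matrix `H`, labels `y`, and coefficients `w` are random placeholders, not the assignment's data, and the function names are chosen for illustration.

```python
import numpy as np

def log_likelihood(H, y, w):
    """Sum over data points of (1[y=+1] - 1) * score - ln(1 + exp(-score))."""
    scores = H @ w
    indicator = (y == +1).astype(float)
    return np.sum((indicator - 1.0) * scores - np.log1p(np.exp(-scores)))

def average_log_likelihood(H, y, w):
    """Same quantity with the 1/N averaging term in front."""
    return log_likelihood(H, y, w) / len(y)

# Hypothetical data: 8 points, 3 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3))
y = rng.choice([-1, +1], size=8)
w = rng.normal(size=3)

ll = log_likelihood(H, y, w)
ll_avg = average_log_likelihood(H, y, w)
print(ll, ll_avg, ll_avg * len(y))
```

Averaging only rescales the objective, which is why step sizes tuned for the averaged version are easier to reuse across batch sizes.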
1 point
3.Refer to the subsection Computing the gradient for a single data point.
The code block above computed
$$\frac{\partial \ell_i(w)}{\partial w_j}$$
for j = 1 and i = 10. Is this quantity a scalar or a 194-dimensional vector?
A scalar
A 194-dimensional vector
1 point
4.Refer to the subsection Modifying the derivative for using a batch of data points.
The code block computed
$$\sum_{s=i}^{i+B}\frac{\partial \ell_s(w)}{\partial w_j}$$
for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?
A scalar
A 194-dimensional vector
1 point
5.For what value of B is the term
$$\sum_{s=1}^{B}\frac{\partial \ell_s(w)}{\partial w_j}$$
the same as the full gradient
$$\frac{\partial \ell(w)}{\partial w_j}?$$
A numeric answer is expected for this question. Hint: consider the training set we are using now.
1 point
6.For what value of batch size B above does the stochastic gradient ascent function logistic_regression_SG act as a standard gradient ascent algorithm? A numeric answer is expected for this question. Hint: consider the training set we are using now.
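To see how a batch derivative relates to the full one, here is a minimal sketch. It assumes the standard logistic regression derivative, where the contribution of a data point to coefficient j is its j-th feature times (1[y=+1] - P(y=+1 | x, w)). The helper names and the random data are hypothetical, and the slice sums exactly B rows.

```python
import numpy as np

def predict_probability(H, w):
    """P(y = +1 | x, w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(H @ w)))

def batch_derivative(H, y, w, j, start, B):
    """Derivative w.r.t. w[j], summed over the B rows beginning at `start`."""
    rows = slice(start, start + B)
    errors = (y[rows] == +1).astype(float) - predict_probability(H[rows], w)
    return float(errors @ H[rows, j])

# Hypothetical data: N = 12 points, 4 features.
rng = np.random.default_rng(1)
N = 12
H = rng.normal(size=(N, 4))
y = rng.choice([-1, +1], size=N)
w = rng.normal(size=4)

# One batch covering the whole dataset versus four batches of three rows each.
full = batch_derivative(H, y, w, j=2, start=0, B=N)
by_batches = sum(batch_derivative(H, y, w, j=2, start=s, B=3) for s in range(0, N, 3))
print(full, by_batches)
```

For a fixed j the result is a single number, and the batch contributions add up to the full-data derivative.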
1 point
7.When you set batch_size = 1, as each iteration passes, how does the average log likelihood in the batch change?
Increases
Decreases
Fluctuates
1 point
8.When you set batch_size = len(feature_matrix_train), as each iteration passes, how does the average log likelihood in the batch change?
Increases
Decreases
Fluctuates
1 point
9.Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?
1 point
10.Refer to the section Stochastic gradient ascent vs gradient ascent.
In the first figure, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?
It’s always better
10 passes
20 passes
150 passes or more
1 point
11.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.
Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.
1e-2
1e-1
1e0
1e1
1e2
1 point
12.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.
Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.
1e-4
1e-2
1e0
1e1
1e2
Machine Learning: Clustering & Retrieval
Course can be found here
Lecture slides can be found here
Summary can be found in my Github
About this course: Case Studies: Finding Similar Documents
A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?
In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.
Learning Outcomes: By the end of this course, you will be able to:
Create a document retrieval system using k-nearest neighbors.
Identify various similarity metrics for text data.
Reduce computations in k-nearest neighbor search by using KD-trees.
Produce approximate nearest neighbors using locality sensitive hashing.
Compare and contrast supervised and unsupervised learning tasks.
Cluster documents by topic using k-means.
Describe how to parallelize k-means using MapReduce.
Examine probabilistic clustering approaches using mixture models.
Fit a mixture of Gaussians model using expectation maximization (EM).
Perform mixed membership modeling using latent Dirichlet allocation (LDA).
Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
Compare and contrast initialization techniques for nonconvex optimization objectives.
Implement these techniques in Python.
Week 1 Welcome
Clustering and retrieval are some of the most high-impact machine learning tools out there. Retrieval is used in almost every application and device we interact with, such as in providing a set of products related to the one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but is a more broadly useful tool for automatically discovering structure in data, like uncovering groups of similar patients.
This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.
What is this course about?
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
Welcome and introduction to clustering and retrieval tasks6 min
Course overview3 min
Module-by-module topics covered8 min
Assumed background6 min
Software tools you’ll need for this course10 min
Github repository with starter code
In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:
https://github.com/learnml/machinelearningspecialization
A big week ahead!10 min
Week 2 Nearest Neighbor Search
We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and the metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
Introduction to nearest neighbor search and algorithms
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
Retrieval as k-nearest neighbor search2 min
1-NN algorithm2 min
k-NN algorithm6 min
The importance of data representations and distance metrics
Document representation5 min
Distance metrics: Euclidean and scaled Euclidean6 min
Writing (scaled) Euclidean distance using (weighted) inner products4 min
Distance metrics: Cosine similarity9 min
To normalize or not and other distance considerations6 min
Quiz: Representations and metrics6 questions
QUIZ
Representations and metrics
6 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 22, 11:59 PM PDT
1 point
1.Consider three data points with two features as follows:
Among the three points, which two are closest to each other in terms of having the smallest Euclidean distance?
A and B
A and C
B and C
1 point
2.Consider three data points with two features as follows:
Among the three points, which two are closest to each other in terms of having the largest cosine similarity (or equivalently, smallest cosine distance)?
A and B
A and C
B and C
1 point
3.Consider the following two sentences.
Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Compute the Euclidean distance using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as
x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.
Round your answer to 3 decimal places.
Sum of squared differences = 13
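The word-count recipe described in the question can be sketched with the standard library. The tokenizer below (lower-casing and keeping only runs of letters) is one reasonable reading of "strip all punctuation", and the function names are made up for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count alphabetic tokens, dropping punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def euclidean_distance(a, b):
    """Euclidean distance between two word-count vectors over their joint vocabulary."""
    vocab = set(a) | set(b)
    # A Counter returns 0 for words it has not seen, so missing words need no special case.
    return math.sqrt(sum((a[word] - b[word]) ** 2 for word in vocab))

s1 = word_counts("The quick brown fox jumps over the lazy dog.")
s2 = word_counts("A quick brown dog outpaces a quick fox.")
print(round(euclidean_distance(s1, s2), 3))
```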
1 point
4.Consider the following two sentences.
Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Recall that
$$\text{cosine distance} = 1 - \text{cosine similarity} = 1-\frac{x^T y}{\|x\|\,\|y\|}$$
Compute the cosine distance between sentence 1 and sentence 2 using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as
x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.
Round your answer to 3 decimal places.
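Cosine distance on the same word-count vectors can be sketched as follows. As before, the tokenizer and helper names are assumptions made for this example, not part of the course materials.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count alphabetic tokens, dropping punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a, b):
    """1 - (x . y) / (||x|| * ||y||) for two word-count vectors."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(count ** 2 for count in a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

s1 = word_counts("The quick brown fox jumps over the lazy dog.")
s2 = word_counts("A quick brown dog outpaces a quick fox.")
print(round(cosine_distance(s1, s2), 3))
```

Only words shared by both sentences contribute to the dot product, while every word contributes to the norms.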
1 point
5.(True/False) For positive features, cosine similarity is always between 0 and 1.
1 point
6.Which of the following does not describe the word count document representation? (Note: this is different from TF-IDF document representation.)
Ignores the order of the words
Assigns a high score to a frequently occurring word
Penalizes words that appear in every document
Programming Assignment 1
Choosing features and metrics for nearest neighbor search10 min
Quiz: Choosing features and metrics for nearest neighbor search5 questions
QUIZ
Choosing features and metrics for nearest neighbor search
5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 22, 11:59 PM PDT
1 point
1.Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?
1 point
2.Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?
Between Obama and Biden
Between Obama and Bush
Between Biden and Bush
1 point
3.Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama’s page. Which of the following is NOT one of the 10 words?
the
presidential
in
act
his
1 point
4.Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?
1 point
5.Compute the Euclidean distance between TF-IDF features of Obama and Biden. Round your answer to 3 decimal places. Use American-style decimals (e.g. 110.921).
Scaling up k-NN search using KD-trees
Complexity of brute force search1 min
KD-tree representation9 min
NN search with KD-trees7 min
Complexity of NN search with KD-trees5 min
Visualizing scaling behavior of KD-trees4 min
Approximate k-NN search using KD-trees7 min
(OPTIONAL) A worked-out example for KD-trees10 min
Quiz: KD-trees5 questions
QUIZ
KD-trees
5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 22, 11:59 PM PDT
1 point
1.Which of the following is not true about KD-trees?
It divides the feature space into nested axis-aligned boxes.
It can be used only for approximate nearest neighbor search but not for exact nearest neighbor search.
It prunes parts of the feature space away from consideration by inspecting smallest possible distances that can be achieved.
The query time scales sublinearly with the number of data points and exponentially with the number of dimensions.
It works best in low to medium dimension settings.
1 point
2.Questions 2, 3, 4, and 5 involve training a KD-tree on the following dataset:

| | X1 | X2 |
|---|---|---|
| Data point 1 | 1.58 | 2.01 |
| Data point 2 | 0.91 | 3.98 |
| Data point 3 | 0.73 | 4.00 |
| Data point 4 | 4.22 | 1.16 |
| Data point 5 | 4.19 | 2.02 |
| Data point 6 | 0.33 | 2.15 |

Train a KD-tree by hand as follows:
- First split using X1 and then using X2. Alternate between X1 and X2 in order.
- Use the “middle-of-the-range” heuristic for each split. Take the maximum and minimum of the coordinates of the member points.
- Keep subdividing until every leaf node contains two or fewer data points.

What is the split value used for the first split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).
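The construction rules above can be sketched as a short recursive function. The toy points below are made up for illustration and are not the quiz dataset. Sending points with a coordinate below the split value to the left child is an assumed convention, and `build_kdtree` is a hypothetical helper name.

```python
def build_kdtree(points, depth=0, leaf_size=2):
    """Build a KD-tree from a list of (label, coords) pairs.

    Splits alternate over the coordinates in order, and each split value is
    the midpoint between the smallest and largest coordinate of the member
    points (the middle-of-the-range heuristic).
    """
    if len(points) <= leaf_size:
        return {"leaf": [label for label, _ in points]}
    dim = depth % len(points[0][1])                  # X1, X2, X1, ...
    values = [coords[dim] for _, coords in points]
    split = (min(values) + max(values)) / 2
    left = [p for p in points if p[1][dim] < split]
    right = [p for p in points if p[1][dim] >= split]
    if not left or not right:                        # all values identical: stop splitting
        return {"leaf": [label for label, _ in points]}
    return {
        "dim": dim,
        "split": split,
        "left": build_kdtree(left, depth + 1, leaf_size),
        "right": build_kdtree(right, depth + 1, leaf_size),
    }

# Hypothetical toy data with two features per point.
toy_points = [("a", (0.0, 0.0)), ("b", (1.0, 4.0)), ("c", (2.0, 1.0)),
              ("d", (8.0, 3.0)), ("e", (9.0, 0.5))]
tree = build_kdtree(toy_points)
print(tree["split"], tree["left"]["split"])
```

On the toy data the first split is on the first coordinate at the midpoint of its range, and only the left child needs a second split because the right child already holds two points.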
1 point
3.Refer to Question 2 for context.
What is the split value used for the second split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).
1 point
4.Refer to Question 2 for context.
Given a query point (3, 1.5), which of the data points belong to the same leaf node as the query point? Choose all that apply.
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
Data point 6
1 point
5.Refer to Question 2 for context.
Perform backtracking with the query point (3, 1.5) to perform exact nearest neighbor search. Which of the data points would be pruned from the search? Choose all that apply.
Hint: Assume that each node in the KDtree remembers the tight bound on the coordinates of its member points, as follows:
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
Data point 6
Locality sensitive hashing for approximate NN search
Limitations of KD-trees3 min
LSH as an alternative to KD-trees4 min
Using random lines to partition points5 min
Defining more bins3 min
Searching neighboring bins8 min
LSH in higher dimensions4 min
(OPTIONAL) Improving efficiency through multiple tables22 min
Quiz: Locality Sensitive Hashing5 questions
QUIZ
Locality Sensitive Hashing
5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 22, 11:59 PM PDT
1 point
1.(True/False) Like KD-trees, Locality Sensitive Hashing lets us compute exact nearest neighbors while inspecting only a fraction of the data points in the training set.
1 point
2.(True/False) Given two data points with high cosine similarity, the probability that a randomly drawn line would separate the two points is small.
1 point
3.(True/False) The true nearest neighbor of the query is guaranteed to fall into the same bin as the query.
1 point
4.(True/False) Locality Sensitive Hashing is more efficient than KD-trees in high-dimensional settings.
1 point
5.Suppose you trained an LSH model and performed a lookup using the bin index of the query. You notice that the list of candidates returned are not at all similar to the query item. Which of the following changes would not produce a more relevant list of candidates?
Use multiple tables.
Increase the number of random lines/hyperplanes.
Inspect more neighboring bins to the bin containing the query.
Decrease the number of random lines/hyperplanes.
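For intuition about bins and random hyperplanes, here is a minimal sketch of random-hyperplane LSH with NumPy. It assumes the common scheme where each bit records which side of a random hyperplane a point falls on, and the bits are read as a binary number to get an integer bin index. The data are random placeholders, and the 16-bit width simply mirrors the bin representation mentioned in the programming assignment quiz.

```python
import numpy as np

def bin_bits(X, hyperplanes):
    """One bit per hyperplane: 1 if the point lies on its non-negative side."""
    return (X @ hyperplanes >= 0).astype(int)

def bin_index(bits):
    """Read each row of bits as a binary number, most significant bit first."""
    powers = 2 ** np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ powers

rng = np.random.default_rng(0)
num_bits, dim = 16, 50
hyperplanes = rng.normal(size=(dim, num_bits))   # each column is a random direction
X = rng.normal(size=(5, dim))                    # hypothetical data, one row per item

bits = bin_bits(X, hyperplanes)
indices = bin_index(bits)

# Number of bit positions in which items 0 and 1 agree (16 would mean same bin).
agree = int(np.sum(bits[0] == bits[1]))
print(indices, agree)
```

Adding hyperplanes makes bins smaller, so fewer candidates share a bin with the query; searching neighboring bins means flipping one or more bits of the query's bit vector.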
Programming Assignment 2
Implementing Locality Sensitive Hashing from scratch10 min
Quiz: Implementing Locality Sensitive Hashing from scratch5 questions
QUIZ
Implementing Locality Sensitive Hashing from scratch
5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 22, 11:59 PM PDT
1 point
1.What is the document ID of Barack Obama’s article?
1 point
2.Which bin contains Barack Obama’s article? Enter its integer index.
1 point
3.Examine the bit representations of the bins containing Barack Obama and Joe Biden. In how many places do they agree?
16 out of 16 places (Barack Obama and Joe Biden fall into the same bin)
14 out of 16 places
12 out of 16 places
10 out of 16 places
8 out of 16 places
1 point
4.Refer to the section “Effect of nearby bin search”. What was the smallest search radius that yielded the correct nearest neighbor for Obama, namely Joe Biden?
1 point
5.Suppose our goal was to produce 10 approximate nearest neighbors whose average distance from the query document is within 0.01 of the average for the true 10 nearest neighbors. For Barack Obama, the true 10 nearest neighbors are on average about 0.77. What was the smallest search radius for Barack Obama that produced an average distance of 0.78 or better?
Summarizing nearest neighbor search
A brief recap2 min
Week 3 Clustering with k-means
In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by “topic”. These topics are not provided in this unsupervised learning task; rather, the idea is to output cluster labels that can be associated post facto with known topics like “Science”, “World News”, etc. Even without such post-facto labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, which is the most widely used clustering algorithm out there. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterates of k-means can utilize this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.
Introduction to clustering
Slides presented in this module10 min
The goal of clustering3 min
An unsupervised task6 min
Hope for unsupervised learning, and some challenge cases4 min
Clustering via k-means
The k-means algorithm7 min
k-means as coordinate descent6 min
Smart initialization via k-means++4 min
Assessing the quality and choosing the number of clusters9 min
Quiz: k-means9 questions
QUIZ
k-means
9 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 29, 11:59 PM PDT
1 point
1.(True/False) k-means always converges to a local optimum.
1 point
2.(True/False) The clustering objective is non-increasing throughout a run of k-means.
1 point
3.(True/False) Running k-means with a larger value of k always enables a lower possible final objective value than running k-means with smaller k.
1 point
4.(True/False) Any initialization of the centroids in k-means is just as good as any other.
1 point
5.(True/False) Initializing centroids using k-means++ guarantees convergence to a global optimum.
1 point
6.(True/False) Initializing centroids using k-means++ costs more than random initialization in the beginning, but can pay off eventually by speeding up convergence.
1 point
7.(True/False) Using k-means++ can only influence the number of iterations to convergence, not the quality of the final assignments (i.e., objective value at convergence).
4 points
8.Consider the following dataset:

| | X1 | X2 |
|---|---|---|
| Data point 1 | 1.88 | 2.05 |
| Data point 2 | 0.71 | 0.42 |
| Data point 3 | 2.41 | 0.67 |
| Data point 4 | 1.85 | 3.80 |
| Data point 5 | 3.69 | 1.33 |

Perform k-means with k=2 until the cluster assignment does not change between successive iterations. Use the following initialization for the centroids:

| | X1 | X2 |
|---|---|---|
| Cluster 1 | 2.00 | 2.00 |
| Cluster 2 | 2.00 | 2.00 |

Which of the five data points changed its cluster assignment most often during the k-means run?
Data point 1
Data point 2
Data point 3
Data point 4
Data point 5
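A run like the one this question asks for can be traced with a small NumPy sketch of the standard k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its members. The function also counts how often each point switches cluster. The four one-dimensional-looking points and the initial centroids below are made up for illustration and are not the quiz data.

```python
import numpy as np

def kmeans(X, centroids, max_iter=100):
    """Run k-means until assignments stop changing; count per-point switches."""
    centroids = np.asarray(centroids, dtype=float).copy()
    assign = None
    changes = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None:
            if np.array_equal(new_assign, assign):
                break                                # converged: nothing moved
            changes += (new_assign != assign)
        assign = new_assign
        # Update step: each centroid becomes the mean of its member points.
        for k in range(len(centroids)):
            if np.any(assign == k):                  # leave an empty cluster's centroid in place
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids, changes

# Hypothetical toy data and initialization.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
init = np.array([[0.0, 0.0], [1.0, 0.0]])
assign, centroids, changes = kmeans(X, init)
print(assign, centroids, changes)
```

In this toy run the second point starts in the second cluster and moves to the first after the centroids are updated once.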
1 point
9.Suppose we initialize k-means with the following centroids
Which of the following best describes the cluster assignment in the first iteration of k-means?
Programming Assignment
Clustering text data with k-means10 min
Quiz: Clustering text data with K-means8 questions
QUIZ
Clustering text data with K-means
8 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 29, 11:59 PM PDT
1 point
1.Make sure you have the latest versions of the notebook and the file kmeansarrays.npz. Read this post if
… you downloaded the files before September 10
… you created an Amazon EC2 instance before October 1
I acknowledge.
1 point
2.(True/False) The clustering objective (heterogeneity) is nonincreasing for this example.
1 point
3.Let’s step back from this particular example. If the clustering objective (heterogeneity) ever increased when running K-means, that would indicate: (choose one)
K-means algorithm got stuck in a bad local minimum
There is a bug in the K-means code
All data points consist of exact duplicates
Nothing is wrong. The objective should generally go down sooner or later.
1 point
4.Refer to the output of K-means for K=3 and seed=0. Which of the three clusters contains the greatest number of data points in the end?
Cluster #0
Cluster #1
Cluster #2
1 point
5.Another way to capture the effect of changing initialization is to look at the distribution of cluster assignments. Compute the size (# of member data points) of clusters for each of the multiple runs of K-means.
Look at the size of the largest cluster (most # of member data points) across multiple runs, with seeds 0, 20000, …, 120000. What is the maximum value this quantity takes?
1 point
6.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the greatest number of articles?
Cluster 0: artists, actors, film directors, playwrights
Cluster 4: professors, researchers, scholars
Cluster 5: Australian rules football players, American football players
Cluster 7: composers, songwriters, singers, music producers
Cluster 9: politicians
1 point
7.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the least number of articles?
Cluster 1: soccer (association football) players, rugby players
Cluster 3: baseball players
Cluster 6: female figures from various fields
Cluster 7: composers, songwriters, singers, music producers
Cluster 8: ice hockey players
1 point
8.Another sign of too large a K is having lots of small clusters. Look at the distribution of cluster sizes (by number of member data points). How many of the 100 clusters have fewer than 236 articles, i.e. 0.4% of the dataset?
MapReduce for scaling k-means
Motivating MapReduce8 min
The general MapReduce abstraction5 min
MapReduce execution overview and combiners6 min
MapReduce for k-means7 min
Quiz: MapReduce for k-means5 questions
QUIZ
MapReduce for k-means
5 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
October 29, 11:59 PM PDT
1 point
1.Suppose we are operating on a 1D vector. Which of the following operations is not data parallel over the vector elements?
Add a constant to every element.
Multiply the vector by a constant.
Increment the vector by another vector of the same dimension.
Compute the average of the elements.
Compute the sign of each element.
1 point
2.(True/False) A single mapper call can emit multiple (key,value) pairs.
1 point
3.(True/False) More than one reducer can emit (key,value) pairs with the same key simultaneously.
1 point
4.(True/False) Suppose we are running k-means using MapReduce. Some mappers may be launched for a new k-means iteration even if some reducers from the previous iteration are still running.
1 point
5.Consider the following list of binary operations. Which can be used for the reduce step of MapReduce? Choose all that apply.
Hints: The reduce step requires a binary operator that satisfies both of the following conditions.
Commutative: OP(x1,x2)=OP(x2,x1)
Associative: OP(OP(x1,x2),x3)=OP(x1,OP(x2,x3))
OP1(x1,x2)=max(x1,x2)
OP2(x1,x2)=x1+x2−2
OP3(x1,x2)=3x1+2x2
OP4(x1,x2)=x1^2+x2
OP5(x1,x2)=(x1+x2)/2
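The two conditions can be probed numerically. The sketch below tests each operator on a handful of exact rational samples; a failed check is a genuine counterexample, while a passed check is only evidence and not a proof. It reads OP4 as x1 squared plus x2, which is an assumption about the flattened formula, and the sample values are arbitrary.

```python
from fractions import Fraction
from itertools import product

ops = {
    "OP1": lambda a, b: max(a, b),
    "OP2": lambda a, b: a + b - 2,
    "OP3": lambda a, b: 3 * a + 2 * b,
    "OP4": lambda a, b: a ** 2 + b,      # assumed reading: x1 squared plus x2
    "OP5": lambda a, b: (a + b) / 2,
}

# Exact rational arithmetic avoids false alarms from floating-point rounding.
samples = [Fraction(v) for v in (-3, 0, 1, 2, 7)]

def check(op):
    """Return (commutative, associative) as observed on the sample values."""
    commutative = all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))
    associative = all(op(op(a, b), c) == op(a, op(b, c))
                      for a, b, c in product(samples, repeat=3))
    return commutative, associative

results = {name: check(op) for name, op in ops.items()}
for name, (comm, assoc) in results.items():
    print(name, "commutative:", comm, "associative:", assoc)
```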
Summarizing clustering with k-means
Other applications of clustering7 min
Week 4 Mixture Models
In k-means, observations are each hard-assigned to a single cluster, and these assignments are based just on the cluster centers, rather than also incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a “cluster” and (2) accounts for uncertainty in assignments of datapoints to clusters via “soft assignments”. You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation considered.
Motivating and setting the foundation for mixture models
Slides presented in this module10 min
For those interested, the slides presented in the videos for this module can be downloaded here:
Motivating probabilistic clustering models8 min
Aggregating over unknown classes in an image dataset6 min
Univariate Gaussian distributions2 min
Bivariate and multivariate Gaussians7 min
Mixtures of Gaussians for clustering
Mixture of Gaussians6 min
Interpreting the mixture of Gaussian terms5 min
Scaling mixtures of Gaussians for document clustering5 min
Expectation Maximization (EM) building blocks
Computing soft assignments from known cluster parameters7 min
(OPTIONAL) Responsibilities as Bayes’ rule5 min
Estimating cluster parameters from known cluster assignments6 min
Estimating cluster parameters from soft assignments8 min
The EM algorithm
EM iterates in equations and pictures6 min
Convergence, initialization, and overfitting of EM9 min
Relationship to k-means3 min
(OPTIONAL) A worked-out example for EM10 min
Quiz: EM for Gaussian mixtures9 questions
QUIZ
EM for Gaussian mixtures
9 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
November 5, 11:59 PM PST
1 point
1.(True/False) While the EM algorithm maintains uncertainty about the cluster assignment for each observation via soft assignments, the model assumes that every observation comes from only one cluster.
1 point
2.(True/False) In high dimensions, the EM algorithm runs the risk of setting cluster variances to zero.
1 point
3.In the EM algorithm, what do the E step and M step represent, respectively?
Estimate cluster responsibilities, Maximize likelihood over parameters
Estimate likelihood over parameters, Maximize cluster responsibilities
Estimate number of parameters, Maximize likelihood over parameters
Estimate likelihood over parameters, Maximize number of parameters
1 point
4.Suppose we have data that come from a mixture of 6 Gaussians (i.e., that is the true data structure). Which model would we expect to have the highest log-likelihood after fitting via the EM algorithm?
A mixture of Gaussians with 2 component clusters
A mixture of Gaussians with 4 component clusters
A mixture of Gaussians with 6 component clusters
A mixture of Gaussians with 7 component clusters
A mixture of Gaussians with 10 component clusters
6
1 point
5.Which of the following correctly describes the differences between EM for mixtures of Gaussians and k-means? Choose all that apply.
k-means often gets stuck in a local minimum, while EM tends not to
EM is better at capturing clusters of different sizes and orientations
EM is better at capturing clusters with overlaps
EM is less prone to overfitting than k-means
k-means is equivalent to running EM with infinitesimally small diagonal covariances.
1 point
6.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

| Cluster responsibilities | Cluster A | Cluster B | Cluster C |
|---|---|---|---|
| Data point 1 | 0.20 | 0.40 | 0.40 |
| Data point 2 | 0.50 | 0.10 | 0.40 |
| Data point 3 | 0.70 | 0.20 | 0.10 |

Which is the least probable cluster for data point 1?
Cluster A
Cluster B
Cluster C
1 point
7.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

| Cluster responsibilities | Cluster A | Cluster B | Cluster C |
|---|---|---|---|
| Data point 1 | 0.20 | 0.40 | 0.40 |
| Data point 2 | 0.50 | 0.10 | 0.40 |
| Data point 3 | 0.70 | 0.20 | 0.10 |

Suppose also that the data points are as follows:

| Dataset | X | Y | Z |
|---|---|---|---|
| Data point 1 | 3 | 1 | 2 |
| Data point 2 | 0 | 0 | 3 |
| Data point 3 | 1 | 3 | 7 |

Let us compute the new mean for Cluster A. What is the Z coordinate of the new mean? Round your answer to 3 decimal places.
(2*0.2 + 3*0.5 + 7*0.7)/(0.2 + 0.5 + 0.7) = 6.8/1.4 ≈ 4.857
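The M-step mean update is a responsibility-weighted average, which can be sketched in plain Python. The responsibilities for Cluster A and the Z coordinates come from the tables in this question; the variable names are made up for illustration.

```python
# Responsibilities of Cluster A for data points 1 to 3, and the points' Z coordinates.
resp_a = [0.20, 0.50, 0.70]
z = [2, 3, 7]

# Soft count: the effective number of points assigned to Cluster A.
soft_count = sum(resp_a)

# New mean: each coordinate weighted by how much the cluster is responsible for it.
z_mean = sum(r * value for r, value in zip(resp_a, z)) / soft_count
print(round(z_mean, 3))
```

The X and Y coordinates of the new mean follow from the same weighted average applied to the other columns.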
1 point
8.Which of the following contour plots describes a Gaussian distribution with diagonal covariance? Choose all that apply.
(1)
(2)
(3)
(4)
(5)
2 points
9.Suppose we initialize EM for mixtures of Gaussians (using full covariance matrices) with the following clusters:
Which of the following best describes the updated clusters after the first iteration of EM?
Summarizing mixture models
A brief recap1 min
Programming Assignment 1
Implementing EM for Gaussian mixtures10 min
Quiz: Implementing EM for Gaussian mixtures6 questions
QUIZ
Implementing EM for Gaussian mixtures
6 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
November 5, 11:59 PM PST
1 point
1.What is the weight that EM assigns to the first component after running the above code block? Round your answer to 3 decimal places.
1 point
2.Using the same set of results, obtain the mean that EM assigns the second component. What is the mean in the first dimension? Round your answer to 3 decimal places.
1 point
3.Using the same set of results, obtain the covariance that EM assigns the third component. What is the variance in the first dimension? Round your answer to 3 decimal places.
1 point
4.Is the log-likelihood plot monotonically increasing, monotonically decreasing, or neither?
Monotonically increasing
Monotonically decreasing
Neither
1 point
5.Calculate the likelihood (score) of the first image in our data set (img[0]) under each Gaussian component through a call to multivariate_normal.pdf. Given these values, what cluster assignment should we make for this image?
Cluster 0
Cluster 1
Cluster 2
Cluster 3
1 point
6.Four of the following images are not in the list of top 5 images in the first cluster. Choose these four.
Image 1
Image 2
Image 3
Image 4
Image 5
Image 6
Image 7
Programming Assignment 2
Clustering text data with Gaussian mixtures10 min
Quiz: Clustering text data with Gaussian mixtures4 questions
QUIZ
Clustering text data with Gaussian mixtures
4 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
November 5, 11:59 PM PST
1 point
1.Select all the topics that have a cluster in the model created above.
Baseball
Basketball
Soccer/football
Music
Politics
Law
Finance
1 point
2. Try fitting EM with the random initial parameters you created above. What is the final log-likelihood that the algorithm converges to? Choose the range that contains this value.
Less than 2.2e9
Between 2.2e9 and 2.3e9
Between 2.3e9 and 2.4e9
Between 2.4e9 and 2.5e9
Greater than 2.5e9
1 point
3. Is the final log-likelihood larger or smaller than the final log-likelihood we obtained above when initializing EM with the results from running k-means?
Initializing EM with k-means led to a larger final log-likelihood
Initializing EM with k-means led to a smaller final log-likelihood
1 point
4. For the above model, out_random_init, use the visualize_EM_clusters method you created above. Are the clusters more or less interpretable than the ones found after initializing using k-means?
More interpretable
Less interpretable
Week 5 Mixed Membership Modeling via Latent Dirichlet Allocation
The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. Often, however, our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model that is particularly useful in document analysis. You will interpret the output of LDA and explore ways to use it, such as treating it as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations.
Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.
Introduction to latent Dirichlet allocation
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
Mixed membership models for documents (3 min)
An alternative document clustering model (4 min)
Components of latent Dirichlet allocation model (2 min)
Goal of LDA inference (5 min)
Quiz: Latent Dirichlet Allocation (5 questions)
QUIZ
Latent Dirichlet Allocation
5 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST
1 point
1. (True/False) According to the assumptions of LDA, each document in the corpus contains words about a single topic.
1 point
2. (True/False) Using LDA to analyze a set of documents is an example of a supervised learning task.
1 point
3. (True/False) When training an LDA model, changing the ordering of words in a document does not affect the overall joint probability.
1 point
4. (True/False) Suppose in a trained LDA model two documents have no topics in common (i.e., one document has 0 weight on any topic with nonzero weight in the other document). As a result, a single word in the vocabulary cannot have high probability of occurring in both documents.
1 point
5. (True/False) Topic models are guaranteed to produce weights on words that are coherent and easily interpretable by humans.
Bayesian inference via Gibbs sampling
The need for Bayesian inference (4 min)
Gibbs sampling from 10,000 feet (5 min)
A standard Gibbs sampler for LDA (9 min)
Collapsed Gibbs sampling for LDA
What is collapsed Gibbs sampling? (3 min)
A worked example for LDA: Initial setup (4 min)
A worked example for LDA: Deriving the resampling distribution (7 min)
Using the output of collapsed Gibbs sampling (4 min)
Summarizing latent Dirichlet allocation
A brief recap (1 min)
Quiz: Learning LDA model via Gibbs sampling (10 questions)
QUIZ
Learning LDA model via Gibbs sampling
10 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST
1 point
1. (True/False) Each iteration of Gibbs sampling for Bayesian inference in topic models is guaranteed to yield a higher joint model probability than the previous sample.
1 point
2. (Check all that are true) Bayesian methods such as Gibbs sampling can be advantageous because they
Account for uncertainty over parameters when making predictions
Are faster than methods such as EM
Maximize the log probability of the data under the model
Regularize parameter estimates to avoid extreme values
1 point
3. For the standard LDA model discussed in the lectures, how many parameters are required to represent the distributions defining the topics?
[# unique words]
[# unique words] * [# topics]
[# documents] * [# unique words]
[# documents] * [# topics]
2 points
4. Suppose we have a collection of documents, and we are restricting our analysis to the following 10 words. We ran several iterations of collapsed Gibbs sampling for an LDA model with K=2 topics, alpha=10.0, and gamma=0.1 (with notation as in the collapsed Gibbs sampling lecture). The corpus-wide assignments at our most recent collapsed Gibbs iteration are summarized in the following table of counts:
Word       Count in topic 1   Count in topic 2
baseball   52                 0
homerun    15                 0
ticket     9                  2
price      9                  25
manager    20                 37
owner      17                 32
company    1                  23
stock      0                  75
bankrupt   0                  19
taxes      0                  29
We also have a single document i with the following topic assignments for each word:
topic   1          2         1        2       1
word    baseball   manager   ticket   price   owner
Suppose we want to recompute the topic assignment for the word “manager”. To sample a new topic, we need to compute several terms to determine how much the document likes each topic, and how much each topic likes the word “manager”. The following questions will all relate to this situation.
First, using the notation in the slides, what is the value of $m_{\text{manager},1}$ (i.e., the number of times the word “manager” has been assigned to topic 1)?
1 point
5. Consider the situation described in Question 4.
What is the value of $\sum_w m_{w,1}$, where the sum is taken over all words in the vocabulary?
1 point
6. Consider the situation described in Question 4.
Following the notation in the slides, what is the value of $n_{i,1}$ for this document i (i.e., the number of words in document i assigned to topic 1)?
1 point
7. In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.
After decrementing, what is the value of $n_{i,2}$?
1 point
8. In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.
After decrementing, what is the value of $m_{\text{manager},2}$?
1 point
9. In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.
After decrementing, what is the value of $\sum_w m_{w,2}$?
2 points
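10. Consider the situation described in Question 4.
As discussed in the slides, the unnormalized probability of assigning to topic 1 is
$$p_1 = \frac{n_{i,1} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{\text{manager},1} + \gamma}{\sum_w m_{w,1} + V\gamma}$$
where V is the total size of the vocabulary.
Similarly the unnormalized probability of assigning to topic 2 is
$$p_2 = \frac{n_{i,2} + \alpha}{N_i - 1 + K\alpha} \cdot \frac{m_{\text{manager},2} + \gamma}{\sum_w m_{w,2} + V\gamma}$$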
Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.
(Reminder: Normalize across the two topic options so that the probabilities of all possible assignments—topic 1 and topic 2—sum to 1.)
Round your answer to 3 decimal places.
p1 = (3 + 10)/(4 + 2*10) * (20 + 0.1)/(123 + 10*0.1)
p2 = (1 + 10)/(4 + 2*10) * (36 + 0.1)/(241 + 10*0.1)
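The resampling step in Questions 4 to 10 can be sketched in a few lines of Python. The counts, the document's topic assignments, and the hyperparameters below are copied from Question 4. The function name `resample_probs` and the data layout are illustrative assumptions, not the course's code.

```python
ALPHA, GAMMA, K = 10.0, 0.1, 2

# Corpus-wide counts from Question 4: word -> [count in topic 1, count in topic 2].
counts = {
    "baseball": [52, 0], "homerun": [15, 0], "ticket": [9, 2],
    "price": [9, 25], "manager": [20, 37], "owner": [17, 32],
    "company": [1, 23], "stock": [0, 75], "bankrupt": [0, 19],
    "taxes": [0, 29],
}

# Document i from Question 4 as (word, assigned topic) pairs.
doc = [("baseball", 1), ("manager", 2), ("ticket", 1), ("price", 2), ("owner", 1)]


def resample_probs(counts, doc, position):
    """Normalized topic probabilities for the word at `position` in `doc`."""
    word, old_topic = doc[position]
    V, N_i = len(counts), len(doc)
    # Remove the current assignment before computing the resampling distribution.
    m = {w: list(c) for w, c in counts.items()}
    m[word][old_topic - 1] -= 1
    n = [sum(1 for j, (_, t) in enumerate(doc) if j != position and t == k + 1)
         for k in range(K)]
    totals = [sum(c[k] for c in m.values()) for k in range(K)]
    unnorm = [(n[k] + ALPHA) / (N_i - 1 + K * ALPHA)
              * (m[word][k] + GAMMA) / (totals[k] + V * GAMMA)
              for k in range(K)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


probs = resample_probs(counts, doc, position=1)  # position 1 is "manager"
print([round(p, 3) for p in probs])
```

The first factor measures how much the document likes each topic and the second how much each topic likes the word. Dividing by the sum of the two unnormalized values gives the probabilities Question 10 asks for.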
Programming Assignment
Modeling text topics with Latent Dirichlet Allocation (10 min)
Quiz: Modeling text topics with Latent Dirichlet Allocation (12 questions)
QUIZ
Modeling text topics with Latent Dirichlet Allocation
12 questions
To Pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 12, 11:59 PM PST
1 point
1. Identify the top 3 most probable words for the first topic.
institute
university
professor
research
studies
game
coach
1 point
2. What is the sum of the probabilities assigned to the top 50 words in the 3rd topic? Round your answer to 3 decimal places.
1 point
3. What is the topic most closely associated with the article about former US President George W. Bush? Use the average results from 100 topic predictions.
1 point
4. What are the top 3 topics corresponding to the article about English football (soccer) player Steven Gerrard? Use the average results from 100 topic predictions.
science and research
team sports
music, TV, and film
international athletics
Great Britain and Australia
1 point
5. Using the LDA representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the kth nearest neighbor to Alex Rodriguez?
1 point
6. Using the TF-IDF representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the kth nearest neighbor to Alex Rodriguez?
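Questions 5 and 6 both reduce to finding the rank of one document in another document's neighbor list. A minimal sketch of that lookup is shown below. It assumes cosine distance and a 1-based rank that excludes the query itself. Both conventions, the helper names, and the toy vectors are assumptions for illustration, so check them against the nearest-neighbor model used in the notebook.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def neighbor_rank(query, target, vectors):
    """1-based rank of `target` among the neighbors of `query` (query excluded)."""
    ordered = sorted(
        (name for name in vectors if name != query),
        key=lambda name: cosine_distance(vectors[query], vectors[name]),
    )
    return ordered.index(target) + 1


# Hypothetical document vectors (e.g., topic weights or TF-IDF values).
vectors = {
    "query": [1.0, 0.0],
    "near": [0.9, 0.1],
    "mid": [0.5, 0.5],
    "far": [0.0, 1.0],
}
print(neighbor_rank("query", "mid", vectors))
```

Running the same lookup once on LDA topic vectors and once on TF-IDF vectors shows how the choice of representation changes which documents count as close.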
1 point
7. What was the value of alpha used to fit our original topic model?
1 point
8. What was the value of gamma used to fit our original topic model? Remember that GraphLab Create uses “beta” instead of “gamma” to refer to the hyperparameter that influences topic distributions over words.
1 point
9. How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the low alpha model? Use the average results from 100 topic predictions.
1 point
10. How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the high alpha model? Use the average results from 100 topic predictions.
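Several questions above ask for the average of 100 topic predictions followed by a threshold count. The sketch below shows one way to do both steps on plain Python lists. The function names and the sample predictions are hypothetical; in the assignment each prediction would come from the fitted topic model rather than from hard-coded values.

```python
def average_predictions(samples):
    """Element-wise mean of a list of equal-length topic-weight vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def count_extreme_weights(weights, high=0.3, low=0.05):
    """Number of topics with weight above `high` or below `low`."""
    return sum(1 for w in weights if w > high or w < low)


# Hypothetical repeated predictions for one document over four topics.
samples = [
    [0.50, 0.02, 0.10, 0.38],
    [0.46, 0.04, 0.14, 0.36],
    [0.54, 0.00, 0.06, 0.40],
]
averaged = average_predictions(samples)
print(averaged, count_extreme_weights(averaged))
```

Averaging first matters because each individual prediction is a random draw, so counting on a single draw would give an unstable answer.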
1 point
11. For each topic of the low gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument.)
1 point
12. For each topic of the high gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument.)
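The quantity in Questions 11 and 12 can be illustrated without GraphLab Create. The sketch below counts how many of a topic's most probable words are needed before their cumulative probability reaches 0.5, then averages over topics. It is a stand-in under stated assumptions: the helper `words_to_reach` is hypothetical, the topic distributions are invented, and the boundary rule (stop once the running total is at least the cutoff) may differ from how cdf_cutoff behaves.

```python
def words_to_reach(probabilities, cutoff=0.5):
    """Number of top words needed for cumulative probability >= cutoff."""
    total = 0.0
    for count, p in enumerate(sorted(probabilities, reverse=True), start=1):
        total += p
        if total >= cutoff:
            return count
    return len(probabilities)  # cutoff never reached


# Hypothetical word distributions for three topics over a 4-word vocabulary.
topics = [
    [0.4, 0.3, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.1, 0.1, 0.1],
]
counts_per_topic = [words_to_reach(t) for t in topics]
average_words = sum(counts_per_topic) / len(counts_per_topic)
print(counts_per_topic, average_words)
```

A topic that concentrates its mass on a few words needs a short list, while a flatter topic needs a longer one, which is the effect the low gamma and high gamma models are meant to contrast.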
Week 6 Hierarchical Clustering & Closing Remarks
In the conclusion of the course, we will recap what we have covered. This includes techniques specific to clustering and retrieval as well as foundational machine learning concepts that are more broadly useful.
We provide a quick tour of an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course.
We conclude with an overview of what’s in store for you in the rest of the specialization.
What we’ve learned
Slides presented in this module (10 min)
For those interested, the slides presented in the videos for this module can be downloaded here:
Module 1 recap (10 min)
Module 2 recap (3 min)
Module 3 recap (6 min)
Module 4 recap (7 min)
Hierarchical clustering and clustering for time series segmentation
Why hierarchical clustering? (2 min)
Divisive clustering (4 min)
Agglomerative clustering (2 min)
The dendrogram (4 min)
Agglomerative clustering details (7 min)
Hidden Markov models (9 min)
Programming Assignment
Modeling text data with a hierarchy of clusters (10 min)
Quiz: Modeling text data with a hierarchy of clusters (3 questions)
QUIZ
Modeling text data with a hierarchy of clusters
3 questions
To Pass: 33% or higher
Attempts: 3 every 8 hours
Deadline: November 19, 11:59 PM PST
1 point
1. Make sure you have the latest versions of the notebook. Read this post if
… you downloaded the notebook before September 10
… you created an Amazon EC2 instance before October 1
I acknowledge.
1 point
2. Which diagram best describes the hierarchy right after splitting the ice_hockey_football cluster?
football golf
1 point
3. Let us bipartition the clusters male_non_athletes and female_non_athletes. Which diagram best describes the resulting hierarchy of clusters for the non-athletes?
Note. The clusters for the athletes are not shown to save space.