For quick searching
Course can be found here
Notes can be found in my Github

This Specialization from leading researchers at the University of Washington introduces you to the exciting, high-demand field of Machine Learning. Through a series of practical case studies, you will gain applied experience in major areas of Machine Learning including Prediction, Classification, Clustering, and Information Retrieval. You will learn to analyze large and complex datasets, create systems that adapt and improve over time, and build intelligent applications that can make predictions from data.

# Machine Learning Foundations: A Case Study Approach

Course can be found here
Lecture slides can be found here
Notes can be found in my Github

About this course: Do you have data and wonder what it can tell you? Do you need a deeper understanding of the core ways in which machine learning can improve your business? Do you want to be able to converse with specialists about anything from regression and classification to deep learning and recommender systems?

In this course, you will get hands-on experience with machine learning through a series of practical case studies. By the end of this first course, you will have studied how to:

1. predict house prices based on house-level features,
2. analyze sentiment from user reviews,
3. retrieve documents of interest,
4. recommend products,
5. search for images.

Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains.

This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms. Together, these pieces form the machine learning pipeline, which you will use in developing intelligent applications.

Learning Outcomes: By the end of this course, you will be able to:
- Identify potential applications of machine learning in practice.
- Describe the core differences in analyses enabled by regression, classification, and clustering.
- Select the appropriate machine learning task for a potential application.
- Apply regression, classification, clustering, retrieval, recommender systems, and deep learning.
- Represent your data as features to serve as input to machine learning models.
- Assess the model quality in terms of relevant error metrics for each task.
- Utilize a dataset to fit a model to analyze new data.
- Build an end-to-end application that uses machine learning at its core.
- Implement these techniques in Python.

Welcome to Machine Learning Foundations: A Case Study Approach! By joining this course, you’ve taken a first step in becoming a machine learning expert. You will learn a broad range of machine learning methods for deriving intelligence from data, and by the end of the course you will be able to implement actual intelligent applications. These applications will allow you to perform predictions, personalized recommendations and retrieval, and much more. If you continue with the subsequent courses in the Machine Learning specialization, you will delve deeper into the methods and algorithms, giving you the power to develop and deploy new machine learning services.

To begin, we recommend taking a few minutes to explore the course site. Review the material we’ll cover each week, and preview the assignments you’ll need to complete to pass the course. These assignments—one per Module 2 through 6—will walk you through Python implementations of intelligent applications for:

• Predicting house prices
• Analyzing the sentiment of product reviews
• Retrieving Wikipedia articles
• Recommending songs
• Classifying images with deep learning

## Week 1 Welcome

Machine learning is everywhere, but is often operating behind the scenes.

This introduction to the specialization provides you with insights into the power of machine learning, and the multitude of intelligent applications you personally will be able to develop and deploy upon completion.

We also discuss who we are, how we got here, and our view of the future of intelligent applications.

### Why you should learn machine learning with us

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here: intro.pdf

## Week 2 Regression: Predicting House Prices

This week you will build your first intelligent application that makes predictions from data.

We will explore this idea within the context of our first case study, predicting house prices, where you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…).

This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.

You will also examine how to analyze the performance of your predictive model and implement regression in practice using an IPython notebook.

### Linear regression modeling

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here: regression-intro-annotated.pdf

## Week 3 Classification: Analyzing Sentiment

How do you guess whether a person felt positively or negatively about an experience, just from a short review they wrote?

In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including

• spam detection,
• medical diagnosis,
• image classification.

You will analyze the accuracy of your classifier, implement an actual classifier in an IPython notebook, and take a first stab at a core piece of the intelligent application you will build and deploy in your capstone.
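
As a rough illustration of the idea above (not the course's own notebook code), a classifier can be as simple as counting positive and negative words in a review and thresholding the difference. The word lists, the names `predict_sentiment` and `accuracy`, and the toy reviews are all assumptions made up for this sketch.

```python
# Hypothetical word lists; a learned classifier would estimate word weights from data.
POSITIVE_WORDS = {"great", "awesome", "love", "good"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate"}


def predict_sentiment(review):
    """Return +1 (positive) or -1 (negative) from a simple word-count score."""
    words = review.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return +1 if score > 0 else -1


def accuracy(reviews, labels):
    """Fraction of reviews whose predicted class matches the true label."""
    correct = sum(predict_sentiment(r) == y for r, y in zip(reviews, labels))
    return correct / len(labels)


reviews = [
    "great sushi and awesome service",
    "terrible food and awful staff",
    "good price but bad terrible ambiance",
    "not good",  # negation fools a pure word-count classifier
]
labels = [+1, -1, -1, -1]

print(accuracy(reviews, labels))
```

The last review shows why accuracy has to be measured rather than assumed: counting words ignores negation, so "not good" is misclassified.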

### Classification modeling

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here: classification-annotated.pdf

## Week 4 Clustering and Similarity: Retrieving Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? How do I automatically search over documents to find the one that is most similar? How do I quantitatively represent the documents in the first place?

In this third case study, retrieving documents, you will examine various document representations and an algorithm to retrieve the most similar subset. You will also consider structured representations of the documents that automatically group articles by similarity (e.g., document topic).

You will actually build an intelligent document retrieval system for Wikipedia entries in an IPython notebook.
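
To make the questions above concrete, here is a minimal sketch of one possible answer: represent each document as a TF-IDF weighted bag of words and retrieve the document with the highest cosine similarity. The function names and the three toy documents are assumptions for illustration; the course notebooks use their own tooling and may choose a different distance.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Index of the document closest to the query document (excluding itself)."""
    scores = [
        (cosine_similarity(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


docs = [
    "the striker scored a goal in the football match",
    "the goalkeeper saved a penalty in the football match",
    "the senate passed the budget bill after a long debate",
]
vectors = tf_idf_vectors(docs)
nearest = most_similar(0, vectors)
print(docs[nearest])
```

Words that appear in every document (here "the" and "a") get weight zero, so only the informative shared words drive the similarity.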

### Algorithms for retrieval and measuring similarity of documents

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here: clustering-intro-annotated.pdf

# Machine Learning: Regression

Course can be found here
Summary can be found in my Github

In our first case study, predicting house prices, you will create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,…). This is just one of the many places where regression can be applied. Other applications range from predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, to analyzing which regulators are important for gene expression.

In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data – such as outliers – on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets.

Learning Outcomes: By the end of this course, you will be able to:
- Describe the input and output of a regression model.
- Compare and contrast bias and variance when modeling data.
- Estimate model parameters using optimization algorithms.
- Tune parameters with cross validation.
- Analyze the performance of the model.
- Describe the notion of sparsity and how LASSO leads to sparse solutions.
- Deploy methods to select between models.
- Exploit the model to form predictions.
- Build a regression model to predict prices using a housing dataset.
- Implement these techniques in Python.

## Week 1

### Welcome

Regression is one of the most important and broadly used machine learning and statistics tools out there. It allows you to make predictions from data by learning the relationship between features of your data and some observed, continuous-valued response. Regression is used in a massive number of applications ranging from predicting stock prices to understanding gene regulatory networks.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

#### What is this course about?

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

### Simple Linear Regression

Our course starts from the most basic regression model: Just fitting a line to data. This simple model for forming predictions from a single, univariate feature of the data is appropriately called “simple linear regression”.

In this module, we describe the high-level regression task and then specialize these concepts to the simple linear regression case. You will learn how to formulate a simple regression model and fit the model to data using both a closed-form solution as well as an iterative optimization algorithm called gradient descent. Based on this fitted function, you will interpret the estimated model parameters and form predictions. You will also analyze the sensitivity of your fit to outlying observations.

You will examine all of these concepts in the context of a case study of predicting house prices from the square feet of the house.
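
The two fitting strategies mentioned above can be sketched in a few lines of plain Python. This is only an illustrative sketch on noise-free toy data (square feet in thousands, price in thousands of dollars); the function names, step size, and iteration count are assumptions, not values from the course.

```python
def fit_closed_form(x, y):
    """Least squares intercept and slope from the closed-form solution."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_gradient_descent(x, y, step_size=0.01, n_iterations=5000):
    """Minimize the residual sum of squares with fixed-step gradient descent."""
    intercept, slope = 0.0, 0.0
    for _ in range(n_iterations):
        errors = [intercept + slope * xi - yi for xi, yi in zip(x, y)]
        # Partial derivatives of RSS with respect to intercept and slope.
        grad_intercept = 2 * sum(errors)
        grad_slope = 2 * sum(e * xi for e, xi in zip(errors, x))
        intercept -= step_size * grad_intercept
        slope -= step_size * grad_slope
    return intercept, slope


sqft = [1.0, 2.0, 3.0, 4.0]            # thousands of square feet (toy data)
price = [150.0, 250.0, 350.0, 450.0]   # thousands of dollars (toy data)

print(fit_closed_form(sqft, price))
print(fit_gradient_descent(sqft, price))
```

Both routes should land on approximately the same line; the step size has to be small enough for the iteration to converge, which is something you would tune on real data.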

#### Regression fundamentals

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week1_simpleregression-annotated.pdf

## Week 2 Multiple Regression

The next step in moving beyond simple linear regression is to consider “multiple regression” where multiple features of the data are used to form predictions.

More specifically, in this module, you will learn how to build models of more complex relationships between a single variable (e.g., ‘square feet’) and the observed response (like ‘house sales price’). This includes things like fitting a polynomial to your data, or capturing seasonal changes in the response value. You will also learn how to incorporate multiple input variables (e.g., ‘square feet’, ‘# bedrooms’, ‘# bathrooms’). You will then be able to describe how all of these models can still be cast within the linear regression framework, but now using multiple “features”. Within this multiple regression framework, you will fit models to data, interpret estimated coefficients, and form predictions.

Here, you will also implement a gradient descent algorithm for fitting a multiple regression model.
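
As a sketch of how a polynomial in one input is still a linear model over several features, the example below builds the features 1, x, x² and fits their weights with gradient descent. The helper names, the toy data (generated from 1 + 2x + 3x²), the step size, and the iteration count are all assumptions chosen for illustration.

```python
def polynomial_features(x, degree):
    """Map a single input x to the features [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]


def predict(weights, features):
    """Linear model: weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def fit_multiple_regression(feature_rows, outputs, step_size=0.05, n_iterations=1000):
    """Fixed-step gradient descent on the residual sum of squares."""
    weights = [0.0] * len(feature_rows[0])
    for _ in range(n_iterations):
        # Errors are computed once per iteration, so all weights move together.
        errors = [predict(weights, row) - y for row, y in zip(feature_rows, outputs)]
        for j in range(len(weights)):
            gradient_j = 2 * sum(e * row[j] for e, row in zip(errors, feature_rows))
            weights[j] -= step_size * gradient_j
    return weights


inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
outputs = [1 + 2 * x + 3 * x ** 2 for x in inputs]  # noise-free toy response

feature_rows = [polynomial_features(x, degree=2) for x in inputs]
weights = fit_multiple_regression(feature_rows, outputs)
print(weights)
```

The same `fit_multiple_regression` function would work unchanged if each row held several different inputs (square feet, bedrooms, bathrooms) instead of powers of one input.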

### Multiple features of one input

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week2_multipleregression-annotated.pdf

### Setting the stage for computing the least squares fit

#### Optional reading: review of matrix algebra 10 min

This section involves some use of matrix algebra. If you’d like to brush up on it, we recommend a short tutorial.

### Programming assignment 2

#### Numpy tutorial 10 min

More information on Numpy, beyond this tutorial, can be found in the Numpy getting started guide.

## Week 3 Assessing Performance

Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best-performing one.

This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the “loss” of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.

The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.
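
The gap between training error and test error can be seen with a deliberately extreme pair of models. The sketch below uses squared error as the loss and compares a model that memorizes the training set with one that always predicts the training mean; the model constructors and the toy data are assumptions made up for this illustration.

```python
def squared_error(prediction, actual):
    """Loss for a single prediction."""
    return (prediction - actual) ** 2


def average_loss(model, inputs, outputs):
    """Average squared error of a model over a dataset."""
    return sum(squared_error(model(x), y) for x, y in zip(inputs, outputs)) / len(inputs)


def make_memorizing_model(train_x, train_y):
    """Very complex model: return the output of the closest training input."""
    def model(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return model


def make_constant_model(train_y):
    """Very simple model: always predict the mean training output."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.3, 3.8]
test_x, test_y = [1.4, 2.6, 3.4], [1.4, 2.6, 3.4]

for name, model in [
    ("memorizing", make_memorizing_model(train_x, train_y)),
    ("constant", make_constant_model(train_y)),
]:
    print(name, average_loss(model, train_x, train_y), average_loss(model, test_x, test_y))
```

The memorizing model has zero training error by construction, yet it still makes mistakes on the held-out points, which is why training error alone is an overly optimistic assessment.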

### Defining how we assess performance

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week3_assessingperformance-annotated.pdf

## Week 4 Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called “ridge regression”. You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called “cross validation”.

You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.
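
A minimal sketch of those two pieces follows: gradient descent on the ridge objective (residual sum of squares plus an L2 penalty on the weights) and k-fold cross validation to choose the penalty. Several simplifications are assumptions of this sketch rather than the course's method: the intercept is penalized like any other weight, folds are contiguous slices with no shuffling, and the step size and iteration count are fixed with no convergence check.

```python
def predict(weights, row):
    return sum(w * f for w, f in zip(weights, row))


def fit_ridge(rows, outputs, l2_penalty, step_size=0.01, n_iterations=2000):
    """Gradient descent on RSS(w) + l2_penalty * ||w||^2."""
    weights = [0.0] * len(rows[0])
    for _ in range(n_iterations):
        errors = [predict(weights, row) - y for row, y in zip(rows, outputs)]
        gradient = [
            2 * sum(e * row[j] for e, row in zip(errors, rows)) + 2 * l2_penalty * weights[j]
            for j in range(len(weights))
        ]
        weights = [w - step_size * g for w, g in zip(weights, gradient)]
    return weights


def cross_validation_error(rows, outputs, l2_penalty, n_folds):
    """Average squared validation error over contiguous folds."""
    n = len(rows)
    total = 0.0
    for fold in range(n_folds):
        start, end = fold * n // n_folds, (fold + 1) * n // n_folds
        train_rows = rows[:start] + rows[end:]
        train_outputs = outputs[:start] + outputs[end:]
        weights = fit_ridge(train_rows, train_outputs, l2_penalty)
        total += sum(
            (predict(weights, r) - y) ** 2
            for r, y in zip(rows[start:end], outputs[start:end])
        )
    return total / n


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
rows = [[1.0, x] for x in xs]              # constant feature plus one input
outputs = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # toy data, roughly 1 + 2x

candidate_penalties = [0.0, 1.0, 10.0]
best_penalty = min(
    candidate_penalties,
    key=lambda p: cross_validation_error(rows, outputs, p, n_folds=3),
)
print(best_penalty, fit_ridge(rows, outputs, best_penalty))
```

Larger penalties shrink the weights toward zero; cross validation picks the penalty by validation error rather than training error, which would always favor no penalty at all.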

### Characteristics of overfit models

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week4_ridgeregression-annotated.pdf

#### Symptoms of overfitting in polynomial regression 2 min

Next, we will see a demo illustrating the concept of overfitting. We recommend you download the IPython Notebook used in the demo to follow along. (The second and third parts of this notebook will be used to demonstrate ridge regression and LASSO, two techniques to address overfitting.)

IPython Notebook:
Overfitting_Demo_Ridge_Lasso.ipynb.zip

## Week 5 Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.
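
Here is a small sketch of coordinate descent for the lasso objective, residual sum of squares plus `l1_penalty` times the sum of absolute weights. Each weight is updated in turn with a soft-thresholding step while the others are held fixed. The simplifications are assumptions of this sketch: there is no intercept, every weight is penalized, features are not normalized (so the update divides by the feature's squared norm), and a fixed number of sweeps replaces a convergence test.

```python
def predict(weights, row):
    return sum(w * f for w, f in zip(weights, row))


def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold, snapping small values to exactly zero."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0


def lasso_coordinate_descent(rows, outputs, l1_penalty, n_sweeps=100):
    n_features = len(rows[0])
    weights = [0.0] * n_features
    # Squared norm of each feature column.
    z = [sum(row[j] ** 2 for row in rows) for j in range(n_features)]
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                row[j] * (y - predict(weights, row) + weights[j] * row[j])
                for row, y in zip(rows, outputs)
            )
            weights[j] = soft_threshold(rho, l1_penalty / 2) / z[j]
    return weights


# Toy data: the output depends strongly on feature 0 and only weakly on feature 1.
rows = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
outputs = [2.1, 3.9, 6.1, 7.9]  # equals 2 * feature0 + 0.2 * feature1

dense_weights = lasso_coordinate_descent(rows, outputs, l1_penalty=0.0)
sparse_weights = lasso_coordinate_descent(rows, outputs, l1_penalty=1.0)
print(dense_weights, sparse_weights)
```

With no penalty the fit keeps both features; with the penalty turned on, the weak feature's weight is set exactly to zero, which is the sparsity that makes lasso a feature selection method.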

### Feature selection via explicit model enumeration

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week5_lassoregression-annotated.pdf

## Week 6

### Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions—like polynomials and hyperplanes—to the entire dataset. In this module, we instead turn our attention to a class of “nonparametric” methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.

We start by considering a simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.
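
The contrast between the two approaches can be sketched for a single input feature. The k-nearest-neighbor version averages the outputs of the k closest training points; the kernel version averages all outputs, weighted by a Gaussian kernel of the distance to the query. The function names, the Gaussian kernel with a `bandwidth` parameter, and the toy data are assumptions for this sketch; other kernels and distance measures are equally valid.

```python
import math


def knn_regression(query, train_x, train_y, k):
    """Average the outputs of the k training points closest to the query."""
    by_distance = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return sum(train_y[i] for i in by_distance[:k]) / k


def kernel_regression(query, train_x, train_y, bandwidth):
    """Weighted average of all outputs, with Gaussian weights on distance."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 2.0, 3.0, 4.0, 5.0]

print(knn_regression(2.9, train_x, train_y, k=1))
print(knn_regression(2.9, train_x, train_y, k=3))
print(kernel_regression(2.9, train_x, train_y, bandwidth=0.1))
print(kernel_regression(2.9, train_x, train_y, bandwidth=1.0))
```

A small bandwidth makes kernel regression behave much like 1-nearest-neighbor, while a large bandwidth spreads the weight over many points and smooths the fit.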

#### Motivating local fits

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

week6_NNkernelregression-annotated.pdf

### Closing Remarks

In the conclusion of the course, we will recap what we have covered. This includes both techniques specific to regression and foundational machine learning concepts that will appear throughout the specialization. We also briefly discuss some important regression techniques we did not cover in this course.

We conclude with an overview of what’s in store for you in the rest of the specialization.

#### What we’ve learned

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

closing.pdf

# Machine Learning: Classification

Course can be found here
Lecture slides can be found here
Summary can be found in my Github

In our case study on analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information,…). In our second case study for this course, loan default prediction, you will tackle financial data, and predict when a loan is likely to be risky or safe for the bank. These tasks are examples of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification.

In this course, you will create classifiers that provide state-of-the-art performance on a variety of tasks. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent. You will implement these techniques on real-world, large-scale machine learning tasks. You will also address significant tasks you will face in real-world applications of ML, including handling missing data and measuring precision and recall to evaluate a classifier. This course is hands-on, action-packed, and full of visualizations and illustrations of how these techniques will behave on real data. We’ve also included optional content in every module, covering advanced topics for those who want to go even deeper!
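
As a rough sketch of what learning a logistic regression model with stochastic gradient ascent looks like, the example below updates the weights one data point at a time to increase the log-likelihood. The feature layout (a constant, a count of "awesome", a count of "awful"), the toy data, the step size, and the number of passes are all assumptions for illustration, not the course's assignment code.

```python
import math
import random


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def predict_probability(weights, features):
    """P(y = +1 | x) under the logistic regression model."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


def fit_logistic_sga(rows, labels, step_size=0.1, n_passes=200, seed=0):
    """Stochastic gradient ascent on the log-likelihood, one point per update."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    indices = list(range(len(rows)))
    for _ in range(n_passes):
        rng.shuffle(indices)
        for i in indices:
            # Per-point gradient: (indicator[y = +1] - P(y = +1 | x)) * x
            indicator = 1.0 if labels[i] == +1 else 0.0
            error = indicator - predict_probability(weights, rows[i])
            weights = [w + step_size * error * f for w, f in zip(weights, rows[i])]
    return weights


# Each row: [constant, count of "awesome", count of "awful"] (hypothetical features).
rows = [[1, 2, 0], [1, 1, 0], [1, 0, 1], [1, 0, 2], [1, 3, 1], [1, 1, 3]]
labels = [+1, +1, -1, -1, +1, -1]

weights = fit_logistic_sga(rows, labels)
predictions = [+1 if predict_probability(weights, r) > 0.5 else -1 for r in rows]
print(weights, predictions)
```

Because each update touches a single data point, the cost per step does not grow with the dataset size, which is what makes this style of algorithm attractive at scale.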

Learning Objectives: By the end of this course, you will be able to:
- Describe the input and output of a classification model.
- Tackle both binary and multiclass classification problems.
- Implement a logistic regression model for large-scale classification.
- Create a non-linear model using decision trees.
- Improve the performance of any model using boosting.
- Describe the underlying decision boundaries.
- Build a classification model to predict sentiment in a product review dataset.
- Analyze financial data to predict loan defaults.
- Use techniques for handling missing data.
- Evaluate your models using precision-recall metrics.
- Implement these techniques in Python (or in the language of your choice, though Python is highly recommended).
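
For the precision-recall objective in the list above, the two metrics can be computed directly from counts of true positives, false positives, and false negatives. The function name and the toy labels below are assumptions for this sketch.

```python
def precision_recall(true_labels, predicted_labels):
    """Precision and recall for the positive class (+1)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(t == +1 and p == +1 for t, p in pairs)
    false_pos = sum(t == -1 and p == +1 for t, p in pairs)
    false_neg = sum(t == +1 and p == -1 for t, p in pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


true_labels = [+1, +1, +1, -1, -1, +1]
predicted_labels = [+1, -1, +1, +1, -1, +1]

precision, recall = precision_recall(true_labels, predicted_labels)
print(precision, recall)
```

Precision asks how many of the predicted positives were right, while recall asks how many of the actual positives were found; a classifier can score well on one and poorly on the other.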

## Week 1

### Welcome!

Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x. Through this course, you will become familiar with the fundamental models and algorithms used in classification, as well as a number of core machine learning concepts. Rather than covering all aspects of classification, you will focus on a few core techniques, which are widely used in the real-world to get state-of-the-art performance. By following our hands-on approach, you will implement your own algorithms on multiple real-world tasks, and deeply grasp the core techniques needed to be successful with these approaches in practice. This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

#### Welcome to the course

##### Important Update regarding the Machine Learning Specialization 10 min

Hello Machine Learning learners,

Please know that due to unforeseen circumstances, courses 5 and 6 - Recommender Systems & Dimensionality Reduction and An Intelligent Application with Deep Learning - will not be launching as part of the Machine Learning Specialization. We understand this may come as very disappointing news and we’re deeply sorry for this inconvenience. If you have paid for these courses or have received financial aid from Coursera, you will remain eligible to earn your Specialization Certificate upon successfully completing courses 1-4 of the Specialization. If you paid for courses 5 & 6 via a pre-payment toward the Specialization, Coursera has provided you with free access to two other courses offered by the University of Washington: Computational Neuroscience and Data Manipulation at Scale: Systems and Algorithms. An email has been sent out with specific instructions on how to enroll in these courses. If you individually paid for either x or y course, you will receive a refund within the next two weeks.

If you have any questions or would like to request a refund, please feel free to contact Coursera’s 24/7 learner support team via the Request a Refund article in the Learner Help Center. The last day to request a refund will be April 30, 2017. We value you as a Coursera learner and want to ensure that your experience with the Machine Learning Specialization remains a positive one.

Regards,

The Coursera Team

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

##### Welcome to the classification course, a part of the Machine Learning Specialization 1 min

https://www.coursera.org/learn/ml-classification/lecture/YMpzf/welcome-to-the-classification-course-a-part-of-the-machine-learning

##### Impact of classification 1 min

https://www.coursera.org/learn/ml-classification/lecture/OnpWH/impact-of-classification

#### Course overview and details

##### Course overview 3 min

https://www.coursera.org/learn/ml-classification/lecture/84fuF/course-overview

##### Outline of first half of course 5 min

https://www.coursera.org/learn/ml-classification/lecture/LyubT/outline-of-first-half-of-course

##### Outline of second half of course 5 min

https://www.coursera.org/learn/ml-classification/lecture/z1g9k/outline-of-second-half-of-course

##### Assumed background 3 min

https://www.coursera.org/learn/ml-classification/lecture/IindM/assumed-background

##### Let’s get started! 45 sec

https://www.coursera.org/learn/ml-classification/lecture/AktDn/lets-get-started

##### Reading: Software tools you’ll need 10 min
###### Software tools you’ll need for this course

How this specialization was designed. The learning approach in this specialization is to start from use cases and then dig into algorithms and methods, what we call a case-studies approach. We are very excited about this approach, since it has worked well in several other courses. The first course, Machine Learning: Foundations, was focused on understanding how ML can be used in various case studies. The second course, Machine Learning: Regression, was focused on models that predict a continuous value from input features. The follow-on courses will dig into more details of algorithms and methods of other ML areas. We expect all learners to have taken the first and second course, before taking this course.

Classification - A Machine Learning Approach. This course focuses on classification, one of the most important types of data analysis, with a wide range of applications. After successfully completing this course, you will be able to use classification methods in practice, implement some of the most fundamental algorithms in this area, and choose the right model for your task. You will become familiar with the most successful techniques, which are most widely used in practice, including logistic regression, decision trees and boosting. In addition, you will be able to design and implement the underlying algorithms that can learn these models at scale, using stochastic gradient ascent.

###### Programming assignment format

Almost every module will be associated with one or two programming assignments. The goal of these assignments is to give you hands-on experience with the techniques we discuss in lectures. To test your implementations, you will be asked questions in a quiz following the assignment.

You will be implementing core classification techniques or other ML concepts from scratch in most modules. In a few modules, you will also explore fundamental ML concepts, such as regularization or precision-recall, using existing implementations of ML algorithms, with the goal of gaining proficiency in the ML concepts.

###### Why Python

In this course, we are going to use the Python programming language to build several intelligent applications that use machine learning. Python is a simple scripting language that makes it easy to interact with data. Furthermore, Python has a wide range of packages that make it easy to get started and build applications, from the simplest ones to the most complex. Python is widely used in industry, and is becoming the de facto language for data science in industry. (R is another alternative language. However, R tends to be significantly less scalable and has very few deployment tools, thus it is seldom used for production code in industry. It is possible, but discouraged, to use R in this specialization.)

We will also encourage the use of the IPython Notebook in our assignments. The IPython Notebook is a simple interactive environment for programming with Python, which makes it really easy to share your results. Think about it as a combination of a Python terminal and a wiki page. Thus, you can combine code, plots and text to explain what you did. (You are not required to use IPython Notebook in the assignments, and should have no problem using straight up Python if you prefer.)

###### Useful software tools

Although you will be implementing algorithms from scratch in various assignments, some software tools will be useful in the process. In particular, there are four types of data tools that would be helpful:

• Data manipulation: to help you slice-and-dice the data, create new features, and clean the data.
• Matrix operations: in the inner loops of your algorithms, you will do various matrix operations, and libraries focused on these will speed up your code significantly.
• Plotting library: so you can visualize data and models.
• Pre-implemented ML algorithms: in some assignments where we are focusing on exploring ML classification models, you will use pre-implemented ML algorithms to help focus your efforts on the fundamentals.

1. Tools for data manipulation

For data manipulation, we recommend using SFrame, an open-source, highly-scalable Python library for data manipulation. An alternative is the Pandas library. A huge advantage of SFrame over Pandas is that with SFrame, you are not limited to datasets that fit in memory, which allows you to deal with large datasets, even on a laptop. (The SFrame API is very similar to Pandas’ API. Here is a doc showing the relationship between the two of them.)

2. Tools for matrix operations

For matrix operations, we strongly recommend Numpy, an open-source Python library that provides fast performance, for data that fits in memory.

3. Tools for plotting

For plotting, we strongly recommend you use Matplotlib, an open-source Python library with extensive plotting functionality.

4. Tools with pre-implemented ML algorithms

For the few assignments where you will be using pre-implemented ML algorithms, we recommend you use GraphLab Create, which we used in the first and second courses. It is a package we have been working on for many years now, and it has seen an exciting adoption curve, especially in industry with folks building real applications. A popular alternative is to use scikit-learn. GraphLab Create is more scalable than scikit-learn and simpler to use when your data is not numeric vectors. On the other hand, scikit-learn is open-source.

In this course, most of the assignments are about implementing algorithms from scratch, so this choice is more flexible than in the first course. We are happy, however, for you to use any tool(s) of your liking. As you will notice, we are only grading the output of your programs, so the specific software tool is not the focus of the course. More details on using other tools are at the end of this doc.

It’s important to emphasize that this specialization is not about providing training for a specific software package. The goal of the specialization is for your effort to be spent on learning the fundamental concepts and algorithms behind machine learning in a hands-on fashion. These concepts transcend any single package. What you learn here you can use whether you write code from scratch, use any existing ML packages out there, or any that may be developed in the future. We are happy to hear that so many of you are enjoying this approach so far!

5.Licenses for SFrame & GraphLab Create

The SFrame package is available in open-source under a permissive BSD license. So, you will always be able to use SFrames for free. GraphLab Create is free on a 1-year, renewable license for educational purposes, including Coursera. We suggest GraphLab Create for this course because it will make it much easier for you to see machine learning in action and help you complete your assignments quickly.

If you are using GraphLab Create and already have it installed, please make sure you upgrade to the latest version! The simplest way to do this is to:

1. Open the GraphLab Launcher.
2. Click on ‘TERMINAL’.
3. In the terminal window, type: `pip install --upgrade graphlab-create`

###### Resources

These are some good resources you can explore, if you are using the recommended software tools:

In the first course of this ML specialization, Machine Learning Foundations, we provided many tutorials and getting started guides. We recommend you go over those before tackling this course.
There are many Python resources available online. Here is a good place for documentation.
For SFrame & GraphLab Create, there is also a lot of information available online. Here are some starting points: the User Guide and detailed API docs.
For Numpy, here is a getting started guide. We will also provide a tutorial when it’s time to use it.

If you choose to use the recommended tools, you have two options: downloading and installing the required software or using a prepackaged version on a free instance on Amazon EC2.

1.Option 1: Downloading and installing the software on your local machine

Download and install Python, IPython Notebook, Numpy, SFrame and GraphLab Create. You can find the instructions here.

2.Option 2: Using a free Amazon EC2 with all the software pre-installed

If you do not have a 64-bit computer, you will not be able to run GraphLab Create. Additionally, some of you may want a simple experience where you don’t have to download the course content and install everything locally. Here, we’ll address these situations!

Amazon EC2 offers free cloud computing hours with what they call micro instances. These instances are all we need to do the work for this course. We have created an image for one such instance that is easy to launch and contains all the course content. This will allow you to run everything you need for this course in the cloud for free, without having to install anything locally. (You do need to create an Amazon EC2 account and have internet access.)

You can find step-by-step instructions here:

We note that installing all the software on your own local machine may be the right option for most people, especially since you can then run everything locally and do the homework without being online. Still, the Amazon EC2 option should be a great alternative.

###### Github repository with starter code

In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:

https://github.com/learnml/machine-learning-specialization

###### Using other software packages

We strongly encourage you to use the recommended software packages for this course, since they will allow you to learn the fundamental concepts more quickly. However, you are welcome to use others. Here are a few notes if you do so.

1.Installing other software tools

In the instructions above, you will be using the GraphLab Launcher, which will automatically install Python, IPython Notebook, Numpy, Matplotlib, SFrame and GraphLab Create. If you don’t use the GraphLab Launcher, you will need to install each of these tools separately, by following the pages linked above. Anaconda is a good tool to help simplify some of this installation.

2.If you are using SFrame, but not GraphLab Create

GraphLab Create uses SFrame under the hood, but you can use just SFrame for most assignments. If you choose to do so, in the starter code for the assignments, you should change the line

import graphlab

to

import sframe

and everything should work with just some small modifications, e.g., the calls:

graphlab.SFrame(...)
will become

sframe.SFrame(...)
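
As an alternative to editing the import line by hand, the swap described above can be sketched as a small fallback loader. This is only an illustration: the helper name `load_sframe_backend` is hypothetical and not part of the starter code, and it assumes nothing beyond the fact that both packages expose `SFrame` at the top level.

```python
import importlib


def load_sframe_backend(candidates=("graphlab", "sframe")):
    """Return the first importable module from `candidates`, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This package is not installed; try the next candidate.
            continue
    return None


backend = load_sframe_backend()
if backend is None:
    print("Neither graphlab nor sframe is installed.")
else:
    # Later code can call backend.SFrame(...) regardless of which
    # of the two packages was actually imported.
    print("Using", backend.__name__)
```

With this pattern, the rest of a notebook refers to `backend.SFrame(...)` instead of `graphlab.SFrame(...)` or `sframe.SFrame(...)`.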

3.If you are using other software tools out there

You are welcome to use other packages, e.g., scikit-learn instead of GraphLab Create, or Pandas instead of SFrame, or even R instead of Python. If you choose to use any of these other packages, we will provide the datasets (in standard CSV format), and the assignment questions will not depend specifically on the recommended tools.

### Linear Classifiers & Logistic Regression

Linear classifiers are amongst the most practical classification methods. For example, in our sentiment analysis case-study, a linear classifier associates a coefficient with the counts of each word in the sentence. In this module, you will become proficient in this type of representation. You will focus on a particularly useful type of linear classifier called logistic regression, which, in addition to allowing you to predict a class, provides a probability associated with the prediction. These probabilities are extremely useful, since they provide a degree of confidence in the predictions. In this module, you will also be able to construct features from categorical inputs, and to tackle classification problems with more than two classes (multiclass problems). You will examine the results of these techniques on a real-world product sentiment analysis task.
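
To make the representation concrete, here is a minimal sketch of how a linear classifier scores an input and how logistic regression turns that score into a probability. The coefficients and word counts are made up for illustration; they are not taken from the course's sentiment model.

```python
import math


def score(weights, features):
    """Score(x) = w^T h(x): the weighted sum of the feature values."""
    return sum(w * h for w, h in zip(weights, features))


def prob_positive(weights, features):
    """P(y = +1 | x) under logistic regression: the sigmoid of the score."""
    return 1.0 / (1.0 + math.exp(-score(weights, features)))


def predict_class(weights, features):
    """Predict +1 when the score is positive, -1 otherwise."""
    return +1 if score(weights, features) > 0 else -1


# Hypothetical coefficients for [intercept, #awesome, #awful].
w = [0.0, 1.0, -1.5]
# h(x) for a sentence with 2 "awesome" and 1 "awful";
# the leading 1 is the constant feature for the intercept.
h = [1, 2, 1]

print(score(w, h), prob_positive(w, h), predict_class(w, h))
```

A score of exactly 0 maps to a probability of 0.5, which is why the sign of the score and the 0.5 probability threshold give the same decision boundary.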

#### Linear classifiers

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-model-annotated.pdf

##### Linear classifiers: A motivating example 2 min

https://www.coursera.org/learn/ml-classification/lecture/HNKIj/linear-classifiers-a-motivating-example

##### Intuition behind linear classifiers 3 min

https://www.coursera.org/learn/ml-classification/lecture/lCBwS/intuition-behind-linear-classifiers

##### Decision boundaries 3 min

https://www.coursera.org/learn/ml-classification/lecture/NIdE0/decision-boundaries

##### Linear classifier model 5 min

https://www.coursera.org/learn/ml-classification/lecture/XBc9n/linear-classifier-model

##### Effect of coefficient values on decision boundary 2 min

https://www.coursera.org/learn/ml-classification/lecture/Qy2js/effect-of-coefficient-values-on-decision-boundary

##### Using features of the inputs 2 min

https://www.coursera.org/learn/ml-classification/lecture/WHIMY/using-features-of-the-inputs

#### Class probabilities

##### Predicting class probabilities 1 min

https://www.coursera.org/learn/ml-classification/lecture/j4Ji0/predicting-class-probabilities

##### Review of basics of probabilities 6 min

https://www.coursera.org/learn/ml-classification/lecture/p6rtM/review-of-basics-of-probabilities

##### Review of basics of conditional probabilities 8 min

https://www.coursera.org/learn/ml-classification/lecture/Cun2N/review-of-basics-of-conditional-probabilities

##### Using probabilities in classification 2 min

https://www.coursera.org/learn/ml-classification/lecture/f0nhO/using-probabilities-in-classification

#### Logistic regression

##### Predicting class probabilities with (generalized) linear models 5 min

https://www.coursera.org/learn/ml-classification/lecture/OV5Kt/predicting-class-probabilities-with-generalized-linear-models

##### Logistic regression model 5 min

https://www.coursera.org/learn/ml-classification/lecture/OJQXu/logistic-regression-model

##### Effect of coefficient values on predicted probabilities 7 min

https://www.coursera.org/learn/ml-classification/lecture/JkEEH/effect-of-coefficient-values-on-predicted-probabilities

##### Overview of learning logistic regression models 2 min

https://www.coursera.org/learn/ml-classification/lecture/GuxAJ/overview-of-learning-logistic-regression-models

#### Practical issues for classification

##### Encoding categorical inputs 4 min

https://www.coursera.org/learn/ml-classification/lecture/kCY0D/encoding-categorical-inputs

##### Multiclass classification with 1 versus all 7 min

https://www.coursera.org/learn/ml-classification/lecture/N7QA6/multiclass-classification-with-1-versus-all

#### Summarizing linear classifiers & logistic regression

##### Recap of logistic regression classifier 1 min

https://www.coursera.org/learn/ml-classification/lecture/laPcB/recap-of-logistic-regression-classifier

##### Quiz: Linear Classifiers & Logistic Regression 5 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: August 20, 11:59 PM PDT.

1 point
1.(True/False) A linear classifier assigns the predicted class based on the sign of $\text{Score}(x) = w^T h(x)$.

1 point
2.(True/False) For a conditional probability distribution over y|x, where y takes on two values (+1 and -1, i.e., good review and bad review), $P(y=+1 \mid x) + P(y=-1 \mid x) = 1$.

1 point
3.Which function does logistic regression use to “squeeze” the real line to [0, 1]?

1 point
4.If $\text{Score}(x) = w^T h(x) > 0$, which of the following is true about $P(y=+1 \mid x)$?

1 point
5.Consider training a 1 vs. all multiclass classifier for the problem of digit recognition using logistic regression. There are 10 digits, thus there are 10 classes. How many logistic regression classifiers will we have to train?
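
As a rough sketch of the 1 versus all idea, each class gets its own binary logistic regression classifier, and the predicted class is the one whose classifier reports the highest probability. The three classes and their coefficients below are hypothetical and hand-picked, not learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def one_vs_all_predict(class_weights, features):
    """class_weights maps each class label to that class's coefficient list."""
    probs = {
        label: sigmoid(sum(w * h for w, h in zip(weights, features)))
        for label, weights in class_weights.items()
    }
    # Pick the class whose "this class vs. everything else" model is most confident.
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical coefficients for a 3-class problem with features [1, x1, x2].
class_weights = {
    "triangle": [-1.0, 2.0, 0.0],
    "heart": [-1.0, 0.0, 2.0],
    "donut": [0.5, -1.0, -1.0],
}

label, probs = one_vs_all_predict(class_weights, [1, 2.0, 0.1])
print(label, probs)
```

Note that the per-class probabilities come from separate models, so they need not sum to 1.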

#### Programming Assignment

##### Quiz: Predicting sentiment from product reviews 12 questions

To pass: 70% or higher. Attempts: 3 every 8 hours. Deadline: August 20, 11:59 PM PDT.

1 point
1.How many weights are greater than or equal to 0?

1 point
2.Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?

1 point
3.Which of the following products are represented in the 20 most positive reviews?

1 point
4.Which of the following products are represented in the 20 most negative reviews?

1 point
5.What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

1 point
6.Does a higher accuracy value on the training_data always imply that the classifier is better?

1 point
7.Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words.

How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?

1 point
8.Are the positive words in the simple_model also positive words in the sentiment_model?

1 point
9.Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

1 point
10.Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

1 point
11.Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

1 point
12.Is the sentiment_model definitely better than the majority class classifier (the baseline)?

## Week 2

### Learning Linear Classifiers

Once familiar with linear classifiers and logistic regression, you can now dive in and write your first learning algorithm for classification. In particular, you will use gradient ascent to learn the coefficients of your classifier from data. You first will need to define the quality metric for these tasks using an approach called maximum likelihood estimation (MLE). You will also become familiar with a simple technique for selecting the step size for gradient ascent. An optional, advanced part of this module will cover the derivation of the gradient for logistic regression. You will implement your own learning algorithm for logistic regression from scratch, and use it to learn a sentiment analysis classifier.
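
The learning procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the function names, the fixed step size, the iteration count, and the toy data are all chosen here for the example.

```python
import numpy as np


def predict_probability(feature_matrix, coefficients):
    """P(y = +1 | x, w) for every row of the feature matrix."""
    scores = feature_matrix @ coefficients
    return 1.0 / (1.0 + np.exp(-scores))


def log_likelihood(feature_matrix, labels, coefficients):
    """Log of the data likelihood for labels in {+1, -1}."""
    indicator = (labels == +1).astype(float)
    scores = feature_matrix @ coefficients
    return np.sum((indicator - 1.0) * scores - np.log1p(np.exp(-scores)))


def fit_logistic_regression(feature_matrix, labels, step_size=0.1, max_iter=500):
    """Plain gradient ascent on the log likelihood with a fixed step size."""
    coefficients = np.zeros(feature_matrix.shape[1])
    indicator = (labels == +1).astype(float)
    for _ in range(max_iter):
        errors = indicator - predict_probability(feature_matrix, coefficients)
        gradient = feature_matrix.T @ errors  # one partial derivative per coefficient
        coefficients = coefficients + step_size * gradient
    return coefficients


# Toy 1-dimensional data (made up); column 0 is the constant feature.
x = np.array([-2.0, -1.0, 1.0, 2.0, 0.5, -0.5])
y = np.array([-1, -1, +1, +1, -1, +1])
X = np.column_stack([np.ones_like(x), x])

w_hat = fit_logistic_regression(X, y)
print(w_hat, log_likelihood(X, y, w_hat))
```

Because this is ascent rather than descent, the update adds the gradient, and the log likelihood should rise from its starting value as long as the step size is small enough.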

#### Maximum likelihood estimation

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-learning-annotated.pdf

##### Goal: Learning parameters of logistic regression 2 min

https://www.coursera.org/learn/ml-classification/lecture/uxALW/goal-learning-parameters-of-logistic-regression

#### Summarizing learning linear classifiers

##### Quiz: Learning Linear Classifiers 6 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: August 27, 11:59 PM PDT.

1 point
1.(True/False) A linear classifier can only learn positive coefficients.

1 point
2.(True/False) In order to train a logistic regression model, we find the weights that maximize the likelihood of the model.

1 point
3.(True/False) The data likelihood is the product of the probability of the inputs x given the weights w and response y.

1 point
4.Questions 4 and 5 refer to the following scenario.

Consider the setting where our inputs are 1-dimensional. We have data

| x | y |
|---|---|
| 2.5 | +1 |
| 0.3 | -1 |
| 2.8 | +1 |
| 0.5 | +1 |

and the current estimates of the weights are $w_0 = 0$ and $w_1 = 1$ ($w_0$: the intercept, $w_1$: the weight for $x$).

Calculate the likelihood of this data. Round your answer to 2 decimal places.

1 point
5.Refer to the scenario given in Question 4 to answer the following:

Calculate the derivative of the log likelihood with respect to $w_1$. Round your answer to 2 decimal places.

1 point
6.Which of the following is true about gradient ascent? Select all that apply.
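
For the scenario in Questions 4 and 5, the two quantities can be computed directly from their definitions: the likelihood is the product of P(y_i | x_i, w) over the data points, and the derivative of the log likelihood with respect to w1 is the sum of x_i times (1[y_i = +1] minus P(y = +1 | x_i, w)). The sketch below assumes the standard logistic regression model from the lectures; the function names are made up for illustration.

```python
import math

# Data and current weights from the quiz scenario.
data = [(2.5, +1), (0.3, -1), (2.8, +1), (0.5, +1)]
w0, w1 = 0.0, 1.0


def prob_positive(x):
    """P(y = +1 | x, w) for a 1-dimensional input."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))


def likelihood(points):
    """Product over data points of P(y_i | x_i, w)."""
    result = 1.0
    for x, y in points:
        p = prob_positive(x)
        result *= p if y == +1 else (1.0 - p)
    return result


def dlog_likelihood_dw1(points):
    """Sum over data points of x_i * (1[y_i = +1] - P(y = +1 | x_i, w))."""
    return sum(x * ((1.0 if y == +1 else 0.0) - prob_positive(x)) for x, y in points)


print(round(likelihood(data), 2), round(dlog_likelihood_dw1(data), 2))
```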

#### Programming Assignment

##### Quiz: Implementing logistic regression from scratch 8 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: August 27, 11:59 PM PDT.

1 point
1.How many reviews in amazon_baby_subset.gl contain the word perfect?

1 point
2.Consider the feature_matrix that was obtained by converting our data to NumPy format.

How many features are there in the feature_matrix?

1 point
3.Assuming that the intercept is present, how does the number of features in feature_matrix relate to the number of features in the logistic regression model? Let x = [number of features in feature_matrix] and y = [number of features in logistic regression model].

1 point
4.Run your logistic regression solver with provided parameters.

As each iteration of gradient ascent passes, does the log-likelihood increase or decrease?

1 point
5.We make predictions using the weights just learned.

How many reviews were predicted to have positive sentiment?

1 point
6.What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)

1 point
7.We look at “most positive” words, the words that correspond most strongly with positive reviews.

Which of the following words is not present in the top 10 “most positive” words?

1 point
8.Similarly, we look at “most negative” words, the words that correspond most strongly with negative reviews.

Which of the following words is not present in the top 10 “most negative” words?

### Overfitting & Regularization in Logistic Regression

As we saw in the regression course, overfitting is perhaps the most significant challenge you will face as you apply machine learning approaches in practice. This challenge can be particularly significant for logistic regression, as you will discover in this module, since you not only risk getting an overly complex decision boundary, but your classifier can also become overly confident about the probabilities it predicts. In this module, you will investigate overfitting in classification in significant detail, and obtain broad practical insights from some interesting visualizations of the classifiers’ outputs. You will then add a regularization term to your optimization to mitigate overfitting. You will investigate both L2 regularization to penalize large coefficient values, and L1 regularization to obtain additional sparsity in the coefficients. Finally, you will modify your gradient ascent algorithm to learn regularized logistic regression classifiers. You will implement your own regularized logistic regression classifier from scratch, and investigate the impact of the L2 penalty on real-world sentiment analysis data.
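
The change that L2 regularization makes to gradient ascent can be sketched as follows. Each non-intercept partial derivative gets an extra term of minus 2 times the penalty times the coefficient. Leaving the intercept unpenalized is a common convention and is assumed here; the function name, step size, and toy data are likewise illustrative rather than taken from the assignment.

```python
import numpy as np


def fit_l2_logistic_regression(X, y, l2_penalty, step_size=0.05, max_iter=500):
    """Gradient ascent on: log likelihood - l2_penalty * ||w[1:]||^2."""
    w = np.zeros(X.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(max_iter):
        errors = indicator - 1.0 / (1.0 + np.exp(-(X @ w)))
        gradient = X.T @ errors
        # Assumption: column 0 is the intercept and is not regularized.
        gradient[1:] -= 2.0 * l2_penalty * w[1:]
        w = w + step_size * gradient
    return w


# Toy 1-dimensional data (made up); column 0 is the constant feature.
x = np.array([-2.0, -1.0, 1.0, 2.0, 0.5, -0.5])
y = np.array([-1, -1, +1, +1, -1, +1])
X = np.column_stack([np.ones_like(x), x])

w_plain = fit_l2_logistic_regression(X, y, l2_penalty=0.0)
w_reg = fit_l2_logistic_regression(X, y, l2_penalty=5.0)
print(w_plain, w_reg)
```

On this toy data the penalized fit ends with a smaller coefficient on x than the unpenalized fit, which is the shrinkage effect the module describes.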

#### Overfitting in classification

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

logistic-regression-overfitting-annotated.pdf

##### Evaluating a classifier 3 min

https://www.coursera.org/learn/ml-classification/lecture/RzxaQ/evaluating-a-classifier
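
Classification error and accuracy are simple counts over a labeled dataset. The sketch below uses made-up labels and hypothetical function names to show both quantities computed on the same data.

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of data points that were classified incorrectly."""
    mistakes = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return mistakes / len(true_labels)


def accuracy(true_labels, predicted_labels):
    """Fraction of data points that were classified correctly."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Made-up labels for illustration.
y_true = [+1, -1, +1, +1, -1]
y_pred = [+1, +1, +1, -1, -1]

print(classification_error(y_true, y_pred), accuracy(y_true, y_pred))
```

Every data point is counted as either a mistake or a correct prediction, so the two fractions are complementary when measured on the same dataset.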

#### Summarizing overfitting & regularization in logistic regression

##### Quiz: Overfitting & Regularization in Logistic Regression 8 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: August 27, 11:59 PM PDT.

1 point
1.Consider four classifiers, whose classification performance is given by the following table:

| | Classification error on training set | Classification error on validation set |
|---|---|---|
| Classifier 1 | 0.2 | 0.6 |
| Classifier 2 | 0.8 | 0.6 |
| Classifier 3 | 0.2 | 0.2 |
| Classifier 4 | 0.5 | 0.4 |

Which of the four classifiers is most likely overfit?

1 point
2.Suppose a classifier classifies 23100 examples correctly and 1900 examples incorrectly. Compute error by hand. Round your answer to 3 decimal places.

1 point
3.(True/False) Accuracy and error measured on the same dataset always sum to 1.

1 point
4.Which of the following is NOT a correct description of complex models?

1 point
5.Which of the following is a symptom of overfitting in the context of logistic regression? Select all that apply.

1 point
6.Suppose we perform L2 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.

Note. Assume that the algorithm runs for a wide range of L2 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.

1 point
7.Suppose we perform L1 regularized logistic regression to fit a sentiment classifier. Which of the following plots does NOT describe a possible coefficient path? Choose all that apply.

Note. Assume that the algorithm runs for a wide range of L1 penalty values and each coefficient plot is zoomed out enough to capture all long-term trends.

1 point
8.In the context of L2 regularized logistic regression, which of the following occurs as we increase the L2 penalty λ? Choose all that apply.

#### Programming Assignment

##### Quiz: Logistic Regression with L2 regularization 8 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: August 27, 11:59 PM PDT.

1 point
1.In the function feature_derivative_with_L2, was the intercept term regularized?

1 point
2.Does the term with L2 regularization increase or decrease the log likelihood ℓℓ(w)?

1 point
3.Which of the following words is not listed in either positive_words or negative_words?

1 point
4.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.

(True/False) All coefficients consistently get smaller in size as the L2 penalty is increased.

1 point
5.Questions 5 and 6 use the coefficient plot of the words in positive_words and negative_words.

(True/False) The relative order of coefficients is preserved as the L2 penalty is increased. (For example, if the coefficient for ‘cat’ was more positive than that for ‘dog’, this remains true as the L2 penalty increases.)

1 point
6.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Which of the following models has the highest accuracy on the training data?

1 point
7.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Which of the following models has the highest accuracy on the validation data?

1 point
8.Questions 7, 8, and 9 ask you about the 6 models trained with different L2 penalties.

Does the highest accuracy on the training data imply that the model is the best one?

## Week 3 Decision Trees

Along with linear classifiers, decision trees are amongst the most widely used classification techniques in the real world. This method is extremely intuitive, simple to implement and provides interpretable predictions.

• In this module, you will become familiar with the core decision tree representation.
• You will then design a simple, recursive greedy algorithm to learn decision trees from data.
• Finally, you will extend this approach to deal with continuous inputs, a fundamental requirement for practical problems.
• In this module, you will investigate a brand new case-study in the financial sector: predicting the risk associated with a bank loan. You will implement your own decision tree learning algorithm on real loan data.
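
The core step of the greedy algorithm, choosing the feature whose split gives the lowest classification error, can be sketched in plain Python. The example uses the small binary dataset from the Decision Trees quiz later in this section; the function names are hypothetical and this is not the assignment's reference implementation.

```python
from collections import Counter


def node_mistakes(labels):
    """Mistakes made by predicting the majority class at a node."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def split_error(rows, labels, feature):
    """Classification error after splitting on a binary (0/1) feature."""
    left = [y for row, y in zip(rows, labels) if row[feature] == 0]
    right = [y for row, y in zip(rows, labels) if row[feature] == 1]
    return (node_mistakes(left) + node_mistakes(right)) / len(labels)


def best_splitting_feature(rows, labels, features):
    """Greedy choice: the feature with the lowest error after the split."""
    return min(features, key=lambda f: split_error(rows, labels, f))


rows = [
    {"x1": 1, "x2": 1, "x3": 1},
    {"x1": 0, "x2": 1, "x3": 0},
    {"x1": 1, "x2": 0, "x3": 1},
    {"x1": 0, "x2": 0, "x3": 1},
]
labels = [+1, -1, -1, +1]

for f in ["x1", "x2", "x3"]:
    print(f, split_error(rows, labels, f))
print(best_splitting_feature(rows, labels, ["x1", "x2", "x3"]))
```

A full learner would apply this choice recursively to the left and right subsets until a stopping condition is met.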

### Intuition behind decision trees

#### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-annotated.pdf

### Summarizing decision trees

#### Quiz: Decision Trees 11 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: September 3, 11:59 PM PDT.

1 point
1.Questions 1 to 6 refer to the following common scenario:

Consider the following dataset:

| x1 | x2 | x3 | y |
|---|---|---|---|
| 1 | 1 | 1 | +1 |
| 0 | 1 | 0 | -1 |
| 1 | 0 | 1 | -1 |
| 0 | 0 | 1 | +1 |

Let us train a decision tree with this data. Let’s call this tree T1. What feature will we split on at the root?

Classification error after splitting on each feature: x1: 0.5, x2: 0.5, x3: 0.25

1 point
2.Refer to the dataset presented in Question 1 to answer the following.

Fully train T1 (until each leaf has data points of the same output label). What is the depth of T1?

1 point
3.Refer to the dataset presented in Question 1 to answer the following.

What is the training error of T1?

1 point
4.Refer to the dataset presented in Question 1 to answer the following.

Now consider a tree T2, which splits on x1 at the root, and splits on x2 in the 1st level, and has leaves at the 2nd level. Note: this is the XOR function on features 1 and 2. What is the depth of T2?

1 point
5.Refer to the dataset presented in Question 1 to answer the following.

What is the training error of T2?

1 point
6.Refer to the dataset presented in Question 1 to answer the following.

Which has smaller depth, T1 or T2?

1 point
7.(True/False) When deciding to split a node, we find the best feature to split on that minimizes classification error.

1 point
8.If you are learning a decision tree, and you are at a node in which all of its data has the same y value, you should

3: False

1 point
8.Let’s say we have learned a decision tree on dataset D. Consider the split learned at the root of the decision tree. Which of the following is true if one of the data points in D is removed and we re-train the tree?

1 point
9.Consider two datasets D1 and D2, where D2 has the same data points as D1, but has an extra feature for each data point. Let T1 be the decision tree trained with D1, and T2 be the tree trained with D2. Which of the following is true?

1 point
10.(True/False) Logistic regression with polynomial degree 1 features will always have equal or lower training error than decision stumps (depth 1 decision trees).

1 point
11.(True/False) Decision trees (with depth > 1) are always linear classifiers.

1 point
11.(True/False) Decision stumps (depth 1 decision trees) are always linear classifiers.

### Programming Assignment 1

#### Quiz: Identifying safe loans with decision trees 7 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: September 3, 11:59 PM PDT.

1 point
1.What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

1 point
2.Which loan has the highest probability of being classified as a safe loan?

1 point
3.Notice that the probability predictions are the exact same for the 2nd and 3rd loans. Why would this happen?

1 point
4.Based on the visualized tree, what prediction would you make for this data point?

1 point
5.What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01 (e.g. 0.76)?

1 point
6.How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?

1 point
7.Let us assume that each mistake costs money:

Assume a cost of $10,000 per false negative. Assume a cost of $20,000 per false positive.
What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.
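
The cost calculation amounts to counting each kind of mistake and weighting it by its cost. The sketch below assumes +1 (safe loan) is the positive class, so a false positive is a risky loan predicted as safe. The labels are made up and the function name is hypothetical.

```python
def total_mistake_cost(true_labels, predicted_labels,
                       cost_false_negative=10000, cost_false_positive=20000):
    """Total cost of mistakes, assuming +1 is the positive (safe) class."""
    false_positives = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t == -1 and p == +1
    )
    false_negatives = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t == +1 and p == -1
    )
    return (cost_false_negative * false_negatives
            + cost_false_positive * false_positives)


# Made-up labels: two false negatives and one false positive.
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, +1, -1, -1]

print(total_mistake_cost(y_true, y_pred))
```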

### Programming Assignment 2

#### Quiz: Implementing binary decision trees 7 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: September 3, 11:59 PM PDT.

1 point
1.What was the feature that my_decision_tree first split on while making the prediction for test_data[0]?

1 point
2.What was the first feature that led to a right split of test_data[0]?

1 point
3.What was the last feature split on before reaching a leaf node for test_data[0]?

1 point
4.Rounded to 2nd decimal point (e.g. 0.76), what is the classification error of my_decision_tree on the test_data?

1 point
5.What is the feature that is used for the split at the root node?

1 point
6.What is the path of the first 3 feature splits considered along the left-most branch of my_decision_tree?

1 point
7.What is the path of the first 3 feature splits considered along the right-most branch of my_decision_tree?

## Week 4

### Preventing Overfitting in Decision Trees

Out of all machine learning techniques, decision trees are amongst the most prone to overfitting. No practical implementation is possible without including approaches that mitigate this challenge. In this module, through various visualizations and investigations, you will investigate why decision trees suffer from significant overfitting problems. Using the principle of Occam’s razor, you will mitigate overfitting by learning simpler trees. At first, you will design algorithms that stop the learning process before the decision trees become overly complex. In an optional segment, you will design a very practical approach that learns an overly-complex tree, and then simplifies it with pruning. Your implementation will investigate the effect of these techniques on mitigating overfitting on our real-world loan data set.
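
Early stopping can be sketched as a single check performed before each split. The three conditions below (maximum depth, minimum node size, minimum gain in error reduction) match the parameters named in this module's quizzes, but the exact comparisons (`>=`, `<=`) and the function names are assumptions made for this illustration.

```python
from collections import Counter


def node_error(labels):
    """Classification error of a majority class prediction at this node."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return (len(labels) - majority_count) / len(labels)


def should_stop(labels, depth, best_split_error,
                max_depth, min_node_size, min_error_reduction):
    """Return True if the node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:
        return True  # All points share one label: nothing left to separate.
    if depth >= max_depth:
        return True  # Early stopping 1: the tree is already deep enough.
    if len(labels) <= min_node_size:
        return True  # Early stopping 2: too few data points at this node.
    if node_error(labels) - best_split_error <= min_error_reduction:
        return True  # Early stopping 3: the best split barely helps.
    return False


# A node with three +1 labels and one -1 label, where the best available
# split still leaves a classification error of 0.25 (no improvement).
labels = [+1, +1, +1, -1]
print(node_error(labels))
print(should_stop(labels, depth=1, best_split_error=0.25,
                  max_depth=6, min_node_size=1, min_error_reduction=0.1))
```

Pruning, the alternative covered in the optional segment, instead grows the tree first and removes splits afterwards.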

#### Overfitting in decision trees

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-overfitting-annotated.pdf

#### Summarizing preventing overfitting in decision trees

##### Quiz: Preventing Overfitting in Decision Trees 11 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: September 10, 11:59 PM PDT.

1 point
1.(True/False) When learning decision trees, smaller depth USUALLY translates to lower training error.

1 point
2.(True/False) If no two data points have the same input values, we can always learn a decision tree that achieves 0 training error.

1 point
3.(True/False) If decision tree T1 has lower training error than decision tree T2, then T1 will always have better test error than T2.

1 point
4.Which of the following is true for decision trees?

1 point
5.Pruning and early stopping in decision trees is used to

1 point
6.Which of the following is NOT an early stopping method?

1 point
7.Consider decision tree T1 learned with minimum node size parameter = 1000. Now consider decision tree T2 trained on the same dataset and parameters, except that the minimum node size parameter is now 100. Which of the following is always true?

1 point
8.Questions 8 to 11 refer to the following common scenario:

Imagine we are training a decision tree, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. The data at this node is:

| x1 | x2 | y |
|---|---|---|
| 0 | 1 | +1 |
| 1 | 0 | +1 |
| 0 | 1 | +1 |
| 1 | 1 | -1 |

What is the classification error at this node (assuming a majority class classifier)?

1 point
9.Refer to the scenario presented in Question 8.

If we split on x1, what is the classification error?

1 point
10.Refer to the scenario presented in Question 8.

If we split on x2, what is the classification error?

1 point
11.Refer to the scenario presented in Question 8.

If our parameter for minimum gain in error reduction is 0.1, do we split or stop early?

#### Programming Assignment

##### Quiz: Decision Trees in Practice 14 questions

To pass: 80% or higher. Attempts: 3 every 8 hours. Deadline: September 10, 11:59 PM PDT.

1 point
1.Given an intermediate node with 6 safe loans and 3 risky loans, if the min_node_size parameter is 10, what should the tree learning algorithm do next?

1 point
2.Assume an intermediate node has 6 safe loans and 3 risky loans. For each of 4 possible features to split on, the error reduction is 0.0, 0.05, 0.1, and 0.14, respectively. If the minimum gain in error reduction parameter is set to 0.2, what should the tree learning algorithm do next?

1 point
3.Consider the prediction path validation_set[0] with my_decision_tree_old and my_decision_tree_new. For my_decision_tree_new trained with

is the prediction path shorter, longer, or the same as the prediction path using my_decision_tree_old that ignored the early stopping conditions 2 and 3?

1 point
4.Consider the prediction path for ANY new data point. For my_decision_tree_new trained with

is the prediction path for a data point always shorter, always longer, always the same, shorter or the same, or longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?

1 point
5.For a tree trained on any dataset using parameters

what is the maximum possible number of splits encountered while making a single prediction?

1 point
6.Is the validation error of the new decision tree (using early stopping conditions 2 and 3) lower than, higher than, or the same as that of the old decision tree from the previous assignment?

1 point
7.Which tree has the smallest error on the validation data?

1 point
8.Does the tree with the smallest error in the training data also have the smallest error in the validation data?

1 point
9.Is it always true that the tree with the lowest classification error on the training set will result in the lowest classification error in the validation set?

1 point
10.Which tree has the largest complexity?

1 point
11.Is it always true that the most complex tree will result in the lowest classification error in the validation_set?

1 point
12.Using the complexity definition, which model (model_4, model_5, or model_6) has the largest complexity?

1 point
13.model_4 and model_5 have similar classification error on the validation set but model_5 has lower complexity. Should you pick model_5 over model_4?

1 point
14.Using the results obtained in this section, which model (model_7, model_8, or model_9) would you choose to use?

### Handling Missing Data

Real-world machine learning problems are fraught with missing data. That is, very often, some of the inputs are not observed for all data points. This challenge is very significant, happens in most cases, and needs to be addressed carefully to obtain great performance. And, this issue is rarely discussed in machine learning courses. In this module, you will tackle the missing data challenge head on. You will start with the two most basic techniques to convert a dataset with missing data into a clean dataset, namely skipping missing values and imputing missing values. In an advanced section, you will also design a modification of the decision tree learning algorithm that builds decisions about missing data right into the model. You will also explore these techniques in your real-data implementation.
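
The two basic strategies can be sketched with NumPy, representing a missing value as `NaN`. The data is made up, the function names are hypothetical, and imputing with the column mean is just one simple choice assumed here for numeric features.

```python
import numpy as np


def skip_rows_with_missing(X):
    """Drop every data point (row) that has at least one missing value."""
    return X[~np.isnan(X).any(axis=1)]


def skip_columns_with_missing(X):
    """Drop every feature (column) that has at least one missing value."""
    return X[:, ~np.isnan(X).any(axis=0)]


def impute_with_column_mean(X):
    """Replace each missing value with the mean of the observed values in its column."""
    column_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), column_means, X)


# Made-up data: 3 data points, 3 numeric features, 2 missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])

print(skip_rows_with_missing(X).shape)
print(skip_columns_with_missing(X).shape)
print(impute_with_column_mean(X))
```

Skipping shrinks the dataset along one axis, while imputing keeps its shape unchanged and fills in the gaps.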

#### Basic strategies for handling missing data

##### Slides presented in this module 10 min

For those interested, the slides presented in the videos for this module can be downloaded here:

decision-trees-missing-values-annotated.pdf

#### Summarizing handling missing data

##### Quiz: Handling Missing Data (7 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: September 10, 11:59 PM PDT.

1 point
1.(True/False) Skipping data points (i.e., skipping rows of the data) that have missing features only works when the learning algorithm we are using is decision tree learning.

1 point
2.What are potential downsides of skipping features with missing values (i.e., skipping columns of the data) to handle missing data?

1 point
3.(True/False) It’s always better to remove missing data points (i.e., rows) as opposed to removing missing features (i.e., columns).

1 point
4.Consider a dataset with N training points. After imputing missing values, the number of data points in the data set is

1 point
5.Consider a dataset with D features. After imputing missing values, the number of features in the data set is

1 point
6.Which of the following are always true when imputing missing data? Select all that apply.

1 point
7.Consider data that has binary features (i.e. the feature values are 0 or 1) with some feature values of some data points missing. When learning the best feature split at a node, how would we best modify the decision tree learning algorithm to handle data points with missing values for a feature?

## Week 5 Boosting

One of the most exciting theoretical questions asked about machine learning is whether simple classifiers can be combined into a highly accurate ensemble. This question led to the development of boosting, one of the most important and practical techniques in machine learning today. This simple approach can boost the accuracy of any classifier and is widely used in practice; for example, it is used by more than half of the teams who win Kaggle machine learning competitions. In this module, you will first define the ensemble classifier, where multiple models vote on the best prediction. You will then explore a boosting algorithm called AdaBoost, which provides a great approach for boosting classifiers. Through visualizations, you will become familiar with many of the practical aspects of this technique. You will create your very own implementation of AdaBoost, from scratch, and use it to boost the performance of your loan risk predictor on real data.

### The amazing idea of boosting a classifier

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

boosting-annotated.pdf

### Programming Assignment 1

#### Quiz: Exploring Ensemble Methods (9 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: September 17, 11:59 PM PDT.

1 point
1.What percentage of the predictions on sample_validation_data did model_5 get correct?

1 point
2.According to model_5, which loan is the least likely to be a safe loan?

1 point
3.What is the number of false positives on the validation data?

1 point
4.Using the same costs of the false positives and false negatives, what is the cost of the mistakes made by the boosted tree model (model_5) as evaluated on the validation_set?

1 point
5.What grades are the top 5 loans?

1 point
6.Which model has the best accuracy on the validation_data?

1 point
7.Is it always true that the model with the most trees will perform best on the test/validation set?

1 point
8.Does the training error reduce as the number of trees increases?

1 point
9.Is it always true that the test/validation error will reduce as the number of trees increases?

### Summarizing boosting

#### Quiz: Boosting (11 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: September 17, 11:59 PM PDT.

1 point
1.Which of the following is NOT an ensemble method?

1 point
2.Each binary classifier in an ensemble makes predictions on an input x as listed in the table below. Based on the ensemble coefficients also listed in the table, what is the final ensemble model’s prediction for x?

| Classifier | Coefficient w_t | Prediction for x |
|---|---|---|
| Classifier 1 | 0.61 | +1 |
| Classifier 2 | 0.53 | -1 |
| Classifier 3 | 0.88 | -1 |
| Classifier 4 | 0.34 | +1 |
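An ensemble of this kind predicts the sign of the weighted sum of its members' votes. The sketch below applies that rule to the values in the table; the function name `ensemble_predict` is hypothetical, and breaking an exact tie toward -1 is an arbitrary assumption.

```python
def ensemble_predict(coefficients, predictions):
    """Return sign(sum_t w_t * f_t(x)) for member predictions in {+1, -1}."""
    score = sum(w * f for w, f in zip(coefficients, predictions))
    return +1 if score > 0 else -1  # assumption: a score of exactly 0 maps to -1

# Coefficients and predictions from the table above.
w = [0.61, 0.53, 0.88, 0.34]
f = [+1, -1, -1, +1]
print(ensemble_predict(w, f))
```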

1 point
3.(True/False) Boosted decision stumps is a linear classifier.

1 point
4.(True/False) For AdaBoost, test error is an appropriate criterion for choosing the optimal number of iterations.

1 point
5.In an iteration in AdaBoost, recall that learning the coefficient w_t for learned weak learner f_t is calculated by

$$w_t = \frac{1}{2}\ln\left( \frac{1-\mathtt{weighted\_error}(f_t)}{\mathtt{weighted\_error}(f_t)} \right)$$
If the weighted error of f_t is equal to .25, what is the value of w_t? Round your answer to 2 decimal places.
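The formula can be evaluated directly. The short sketch below (the function name `adaboost_coefficient` is hypothetical) shows how the coefficient behaves: a weak learner with low weighted error gets a large positive coefficient, one at exactly 0.5 gets zero, and one worse than chance gets a negative coefficient.

```python
import math

def adaboost_coefficient(weighted_error):
    """w_t = 0.5 * ln((1 - weighted_error) / weighted_error)."""
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)

for err in (0.1, 0.25, 0.5, 0.9):
    print(err, round(adaboost_coefficient(err), 2))
```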

1 point
6.Which of the following classifiers is most accurate as computed on a weighted dataset? A classifier with:

1 point
7.Imagine we are training a decision stump in an iteration of AdaBoost, and we are at a node. Each data point is (x1, x2, y), where x1,x2 are features, and y is the label. Also included are the weights of the data. The data at this node is:

| Weight | x1 | x2 | y |
|---|---|---|---|
| 0.3 | 0 | 1 | +1 |
| 0.35 | 1 | 0 | -1 |
| 0.1 | 0 | 1 | +1 |
| 0.25 | 1 | 1 | +1 |

Suppose we assign the same class label to all data in this node. (Pick the class label with the greater total weight.) What is the weighted error at the node? Round your answer to 2 decimal places.
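The weighted error of a single-label node is the total weight of the points that disagree with the chosen label, divided by the total weight. A minimal sketch using the rows from the table (the helper name `majority_weighted_error` is hypothetical):

```python
# (weight, x1, x2, y) rows from the table above.
node = [
    (0.30, 0, 1, +1),
    (0.35, 1, 0, -1),
    (0.10, 0, 1, +1),
    (0.25, 1, 1, +1),
]

def majority_weighted_error(rows):
    """Predict the label with the larger total weight; return (label, weighted error)."""
    weight_pos = sum(w for w, _, _, y in rows if y == +1)
    weight_neg = sum(w for w, _, _, y in rows if y == -1)
    label = +1 if weight_pos >= weight_neg else -1
    mistakes = weight_neg if label == +1 else weight_pos
    return label, mistakes / (weight_pos + weight_neg)

label, error = majority_weighted_error(node)
print(label, round(error, 2))
```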

1 point
8.After each iteration of AdaBoost, the weights on the data points are typically normalized to sum to 1. This is used because

1 point
9.Consider the following 2D dataset with binary labels.

We train a series of weak binary classifiers using AdaBoost. In one iteration, the weak binary classifier produces the decision boundary as follows:

Which of the five points (indicated in the second figure) will receive higher weight in the following iteration? Choose all that apply.

1 point
10.Suppose we are running AdaBoost using decision tree stumps. At a particular iteration, the data points have weights according to the figure. (Large points indicate heavy weights.)

Which of the following decision tree stumps is most likely to be fit in the next iteration?

1 point
11.(True/False) AdaBoost achieves zero training error after a sufficient number of iterations, as long as we can find weak learners that perform better than random chance at each iteration of AdaBoost (i.e., on weighted data).

### Programming Assignment 2

#### Quiz: Boosting a decision stump (5 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: September 17, 11:59 PM PDT.

1 point
1.Recall that the classification error for unweighted data is defined as follows:

$$\text{classification error} = \frac{\text{\# mistakes}}{\text{\# all data points}}$$
Meanwhile, the weight of mistakes for weighted data is given by

$$\mathrm{WM}(\alpha, \hat{y}) = \sum_{i=1}^{n} \alpha_i \times 1[y_i \neq \hat{y}_i].$$
If we set the weights α=1 for all data points, how is the weight of mistakes WM(α,ŷ) related to the classification error?
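Both quantities are easy to compute side by side. The sketch below uses hypothetical labels and predictions, and the helper names `classification_error` and `weight_of_mistakes` are illustrative rather than taken from the assignment.

```python
def classification_error(y, y_hat):
    """Fraction of data points that are misclassified."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)

def weight_of_mistakes(alpha, y, y_hat):
    """Sum of the weights alpha_i over the misclassified points."""
    return sum(a for a, yi, pi in zip(alpha, y, y_hat) if yi != pi)

# Hypothetical labels and predictions.
y     = [+1, -1, +1, +1, -1]
y_hat = [+1, +1, +1, -1, -1]
ones  = [1.0] * len(y)  # alpha_i = 1 for every data point

wm = weight_of_mistakes(ones, y, y_hat)
print(wm, classification_error(y, y_hat), wm / len(y))
```

With unit weights, the weight of mistakes is simply the number of mistakes, so dividing it by the number of data points recovers the classification error.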

1 point
2.Refer to section Example: Training a weighted decision tree.

Will you get the same model as small_data_decision_tree_subset_20 if you trained a decision tree with only 20 data points from the set of points in subset_20?

1 point
3.Refer to the 10-component ensemble of tree stumps trained with Adaboost.

As each component is trained sequentially, are the component weights monotonically decreasing, monotonically increasing, or neither?

1 point
4.Which of the following best describes a general trend in accuracy as we add more and more components? Answer based on the 30 components learned so far.

1 point
5.From this plot (with 30 trees), is there massive overfitting as the # of iterations increases?

## Week 6 Precision-Recall

In many real-world settings, accuracy or error are not the best quality metrics for classification. You will explore a case study that highlights this issue: using sentiment analysis to display positive reviews on a restaurant website. Instead of accuracy, you will define two metrics, precision and recall, which are widely used in real-world applications to measure the quality of classifiers. You will explore how the probabilities output by your classifier can be used to trade off precision against recall, and dive into this spectrum using precision-recall curves. In your hands-on implementation, you will compute these metrics with your learned classifier on real-world sentiment analysis data.
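The threshold trade-off can be sketched as follows. The labels and probabilities are hypothetical toy values, the function name `precision_recall` is illustrative, and reporting precision as 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall(labels, probabilities, threshold):
    """Predict +1 when P(y=+1|x) >= threshold, then compute precision and recall."""
    preds = [+1 if p >= threshold else -1 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, preds) if y == +1 and p == +1)
    fp = sum(1 for y, p in zip(labels, preds) if y == -1 and p == +1)
    fn = sum(1 for y, p in zip(labels, preds) if y == +1 and p == -1)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical true labels and predicted probabilities of the positive class.
labels = [+1, +1, +1, +1, -1, -1, -1, +1]
probs  = [0.95, 0.85, 0.70, 0.55, 0.60, 0.30, 0.20, 0.40]

for t in (0.5, 0.9):
    print(t, precision_recall(labels, probs, t))
```

On this toy data, raising the threshold from 0.5 to 0.9 produces fewer positive predictions, which raises precision and lowers recall.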

### Why use precision & recall as quality metrics

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

precision-recall.pdf

### Summarizing precision-recall

#### Quiz: Precision-Recall (9 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 1, 11:59 PM PDT.

1 point
1.Questions 1 to 5 refer to the following scenario:

Suppose a binary classifier produced the following confusion matrix.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 5600 | 40 |
| Actual Negative | 1900 | 2460 |

What is the recall of this classifier? Round your answer to 2 decimal places.
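The standard metrics follow directly from the four cells of a confusion matrix. A small sketch using the counts above (variable names are illustrative):

```python
# Counts from the confusion matrix above.
tp, fn = 5600, 40      # actual positives: predicted positive / predicted negative
fp, tn = 1900, 2460    # actual negatives: predicted positive / predicted negative
total = tp + fn + fp + tn

recall    = tp / (tp + fn)   # fraction of actual positives that were found
precision = tp / (tp + fp)   # fraction of positive predictions that are correct
accuracy  = (tp + tn) / total
majority  = max(tp + fn, fp + tn) / total  # accuracy of always predicting the larger class

print(round(recall, 2), round(precision, 2), round(accuracy, 3), round(majority, 3))
```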

1 point
2.Refer to the scenario presented in Question 1 to answer the following:

(True/False) This classifier is better than random guessing.

1 point
3.Refer to the scenario presented in Question 1 to answer the following:

(True/False) This classifier is better than the majority class classifier.

1 point
4.Refer to the scenario presented in Question 1 to answer the following:

Which of the following points in the precision-recall space corresponds to this classifier?

(1)

(2)

(3)

(4)

(5)

1 point
5.Refer to the scenario presented in Question 1 to answer the following:

Which of the following best describes this classifier?

It is optimistic

It is pessimistic

None of the

1 point
6.Suppose we are fitting a logistic regression model on a dataset where the vast majority of the data points are labeled as positive. To compensate for overfitting to the dominant class, we should

Require higher confidence level for positive predictions

Require lower confidence level for positive predictions

1 point
7.It is often the case that false positives and false negatives incur different costs. In situations where false negatives cost much more than false positives, we should

Require higher confidence level for positive predictions

Require lower confidence level for positive predictions

1 point
8.We are interested in reducing the number of false negatives. Which of the following metrics should we primarily look at?

Accuracy

Precision

Recall

1 point
9.Suppose we set the threshold for positive predictions at 0.9. What is the lowest score that is classified as positive? Round your answer to 2 decimal places.

### Programming Assignment

#### Quiz: Exploring precision and recall (13 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 1, 11:59 PM PDT.

1 point
1.Consider the logistic regression model trained on amazon_baby.gl using GraphLab Create.

Using accuracy as the evaluation metric, was our logistic regression model better than the majority class classifier?

1 point
2.How many predicted values in the test set are false positives?

1 point
3.Consider the scenario where each false positive costs \$100 and each false negative \$1.

Given the stipulation, what is the cost associated with the logistic regression classifier’s performance on the test set?

Between \$0 and \$100,000

Between \$100,000 and \$200,000

Between \$200,000 and \$300,000

Above \$300,000

1 point
4.Out of all reviews in the test set that are predicted to be positive, what fraction of them are false positives? (Round to the second decimal place e.g. 0.25)

1 point
5.Based on what we learned in lecture, if we wanted to reduce this fraction of false positives to be below 3.5%, we would:

Discard a sufficient number of positive predictions

Discard a sufficient number of negative predictions

Increase threshold for predicting the positive class (ŷ = +1)

Decrease threshold for predicting the positive class (ŷ = +1)

1 point
6.What fraction of the positive reviews in the test_set were correctly predicted as positive by the classifier? Round your answer to 2 decimal places.

1 point
7.What is the recall value for a classifier that predicts +1 for all data points in the test_data?

1 point
8.What happens to the number of positive predicted reviews as the threshold increased from 0.5 to 0.9?

More reviews are predicted to be positive.

Fewer reviews are predicted to be positive.

1 point
9.Consider the metrics obtained from setting the threshold to 0.5 and to 0.9.

Does the recall increase with a higher threshold?

1 point
10.Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better? Round your answer to 3 decimal places.

1 point
11.Using threshold = 0.98, how many false negatives do we get on the test_data? (Hint: You may use the graphlab.evaluation.confusion_matrix function implemented in GraphLab Create.)

1 point
12.Questions 12 and 13 are concerned with the reviews that contain the word baby.

Among all the threshold values tried, what is the smallest threshold value that achieves a precision of 96.5% or better for the reviews of data in baby_reviews? Round your answer to 3 decimal places.

1 point
13.Questions 12 and 13 are concerned with the reviews that contain the word baby.

Is this threshold value smaller or larger than the threshold used for the entire dataset to achieve the same specified precision of 96.5%?

Larger

Smaller

## Week 7 Scaling to Huge Datasets & Online Learning

With the advent of the internet, the growth of social media, and the embedding of sensors in the world, the magnitudes of data that our machine learning algorithms must handle have grown tremendously over the last decade. This effect is sometimes called “Big Data”. Thus, our learning algorithms must scale to bigger and bigger datasets. In this module, you will develop a small modification of gradient ascent called stochastic gradient, which provides significant speedups in the running time of our algorithms. This simple change can drastically improve scaling, but makes the algorithm less stable and harder to use in practice. In this module, you will investigate the practical techniques needed to make stochastic gradient viable, and thus to obtain learning algorithms that scale to huge datasets. You will also address a new kind of machine learning problem, online learning, where the data streams in over time and we must learn the coefficients as the data arrives. This task can also be solved with stochastic gradient. You will implement your very own stochastic gradient ascent algorithm for logistic regression from scratch, and evaluate it on sentiment analysis data.
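As a rough illustration of the idea, here is a minimal mini-batch stochastic gradient ascent for logistic regression in plain Python. It is a sketch under stated assumptions, not the assignment's `logistic_regression_SG`: the function names, the toy data, the per-pass shuffling, and the choice to average the gradient over the batch are all assumptions made for this example.

```python
import math
import random

def predict_probability(w, x):
    """P(y = +1 | x, w) under logistic regression."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-score))

def stochastic_gradient_ascent(X, y, step_size, batch_size, passes, seed=0):
    """Mini-batch stochastic gradient ascent on the logistic log likelihood."""
    rng = random.Random(seed)
    data = list(zip(X, y))
    w = [0.0] * len(X[0])
    for _ in range(passes):
        rng.shuffle(data)  # visit the points in a fresh random order each pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [0.0] * len(w)
            for x, label in batch:
                # Per-point derivative: h_j(x) * (1[y = +1] - P(y = +1 | x, w))
                error = (1.0 if label == +1 else 0.0) - predict_probability(w, x)
                for j, xj in enumerate(x):
                    grad[j] += xj * error
            # Average over the batch so the step size is comparable across batch sizes.
            w = [wj + step_size * gj / len(batch) for wj, gj in zip(w, grad)]
    return w

# Hypothetical data: an intercept feature plus one feature; positive when it exceeds 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [-1, -1, -1, +1, +1, +1]
w = stochastic_gradient_ascent(X, y, step_size=0.5, batch_size=2, passes=50)
print(w)
```

Setting `batch_size` to the number of data points makes every update use the full gradient, which turns this loop into ordinary batch gradient ascent.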

### Scaling ML to huge datasets

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

online-learning-annotated.pdf

### Summarizing scaling to huge datasets & online learning

#### Quiz: Scaling to Huge Datasets & Online Learning (10 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 8, 11:59 PM PDT.

1 point
1.(True/False) Stochastic gradient ascent often requires fewer passes over the dataset than batch gradient ascent to achieve a similar log likelihood.

1 point
2.(True/False) Choosing a large batch size results in less noisy gradients

1 point
3.(True/False) The set of coefficients obtained at the last iteration represents the best coefficients found so far.

1 point
4.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.

Which of the following actions would help the most to improve the rate of convergence?

Increase step size

Decrease step size

Decrease batch size

1 point
5.Suppose you obtained the plot of log likelihood below after running stochastic gradient ascent.

Which of the following actions would help to improve the rate of convergence?

Increase batch size

Increase step size

Decrease step size

1 point
6.Suppose it takes about 1 millisecond to compute a gradient for a single example. You run an online advertising company and would like to do online learning via mini-batch stochastic gradient ascent. If you aim to update the coefficients once every 5 minutes, how many examples can you cover in each update? Overhead and other operations take up 2 minutes, so you only have 3 minutes for the coefficient update.

1 point
7.In search for an optimal step size, you experiment with multiple step sizes and obtain the following convergence plot.

Which line corresponds to the best step size?

(1)

(2)

(3)

(4)

(5)

1 point
8.Suppose you run stochastic gradient ascent with two different batch sizes. Which of the two lines below corresponds to the smaller batch size (assuming both use the same step size)?

(1)

(2)

1 point
9.Which of the following is NOT a benefit of stochastic gradient ascent over batch gradient ascent? Choose all that apply.

Each coefficient step is very fast.

Log likelihood of data improves monotonically.

Stochastic gradient ascent can be used for online learning.

Stochastic gradient ascent can achieve higher likelihood than batch gradient ascent for the same amount of running time.

Stochastic gradient ascent is highly robust with respect to parameter choices.

1 point
10.Suppose we run the stochastic gradient ascent algorithm described in the lecture with batch size of 100. To make 10 passes over a dataset consisting of 15400 examples, how many iterations does it need to run?
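The bookkeeping between passes, batch size, and iterations is simple arithmetic: each iteration consumes one batch, so one pass takes (number of examples) / (batch size) iterations. A small sketch, with the hypothetical helper `num_iterations` and the assumption that a final partial batch still counts as one update:

```python
import math

def num_iterations(num_examples, batch_size, passes):
    """Number of mini-batch updates needed for the given number of passes."""
    updates_per_pass = math.ceil(num_examples / batch_size)  # assumption: partial batch counts
    return passes * updates_per_pass

print(num_iterations(15400, 100, 10))
print(num_iterations(50000, 100, 2))
```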

### Programming Assignment

#### Quiz: Training Logistic Regression via Stochastic Gradient Ascent (12 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 8, 11:59 PM PDT.

1 point
1.In the Module 3 assignment, there were 194 features (an intercept + one feature for each of the 193 important words). In this assignment, we will use stochastic gradient ascent to train the classifier using logistic regression. How does changing the solver to stochastic gradient ascent affect the number of features?

Increases

Decreases

Stays the same

1 point
2.Recall from the lecture and the earlier assignment, the log likelihood (without the averaging term) is given by

$$\ell\ell(w) = \sum_{i=1}^{N} \Big( (1[y_i = +1] - 1)\, w^T h(x_i) - \ln\big(1 + \exp(-w^T h(x_i))\big) \Big)$$
whereas the average log likelihood is given by

$$\ell\ell_A(w) = \frac{1}{N} \sum_{i=1}^{N} \Big( (1[y_i = +1] - 1)\, w^T h(x_i) - \ln\big(1 + \exp(-w^T h(x_i))\big) \Big)$$
How are the functions ℓℓ(w) and ℓℓ_A(w) related?

ℓℓ_A(w) = ℓℓ(w)

ℓℓ_A(w) = (1/N)⋅ℓℓ(w)

ℓℓ_A(w) = N⋅ℓℓ(w)

ℓℓ_A(w) = ℓℓ(w) − ∥w∥
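The two definitions above can be written out term by term. This is a minimal sketch in plain Python with hypothetical feature rows, labels, and coefficients; the function names are illustrative and not the ones used in the assignment notebook.

```python
import math

def log_likelihood(w, H, y):
    """Sum over points of (1[y_i=+1] - 1) * w^T h(x_i) - ln(1 + exp(-w^T h(x_i)))."""
    total = 0.0
    for h, label in zip(H, y):
        score = sum(wj * hj for wj, hj in zip(w, h))
        indicator = 1.0 if label == +1 else 0.0
        total += (indicator - 1.0) * score - math.log(1.0 + math.exp(-score))
    return total

def average_log_likelihood(w, H, y):
    """The same sum, scaled by 1/N."""
    return log_likelihood(w, H, y) / len(y)

# Hypothetical feature rows h(x_i) (first entry is the intercept) and labels.
H = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [+1, -1, +1]
w = [0.1, 0.8]
print(log_likelihood(w, H, y), average_log_likelihood(w, H, y))
```

Averaging keeps the scale of the objective independent of the number of points summed, which makes values computed on batches of different sizes comparable.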

1 point
3.Refer to the sub-section Computing the gradient for a single data point.

The code block above computed

$$\frac{\partial \ell_i(w)}{\partial w_j}$$
for j = 1 and i = 10. Is this quantity a scalar or a 194-dimensional vector?

A scalar

A 194-dimensional vector

1 point
4.Refer to the sub-section Modifying the derivative for using a batch of data points.

The code block computed

$$\sum_{s=i}^{i+B} \frac{\partial \ell_s(w)}{\partial w_j}$$
for j = 10, i = 10, and B = 10. Is this a scalar or a 194-dimensional vector?

A scalar

A 194-dimensional vector

1 point
5.For what value of B is the term

$$\sum_{s=1}^{B} \frac{\partial \ell_s(w)}{\partial w_j}$$
the same as the full gradient

$$\frac{\partial \ell(w)}{\partial w_j}\,?$$
A numeric answer is expected for this question. Hint: consider the training set we are using now.

1 point
6.For what value of batch size B above does the stochastic gradient ascent function logistic_regression_SG act as a standard gradient ascent algorithm? A numeric answer is expected for this question. Hint: consider the training set we are using now.

1 point
7.When you set batch_size = 1, as each iteration passes, how does the average log likelihood in the batch change?

Increases

Decreases

Fluctuates

1 point
8.When you set batch_size = len(feature_matrix_train), as each iteration passes, how does the average log likelihood in the batch change?

Increases

Decreases

Fluctuates

1 point
9.Suppose that we run stochastic gradient ascent with a batch size of 100. How many gradient updates are performed at the end of two passes over a dataset consisting of 50000 data points?

1 point
10.In the first figure, how many passes does batch gradient ascent need to achieve a similar log likelihood as stochastic gradient ascent?

It’s always better

10 passes

20 passes

150 passes or more

1 point
11.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.

Which of the following is the worst step size? Pick the step size that results in the lowest log likelihood in the end.

1e-2

1e-1

1e0

1e1

1e2

1 point
12.Questions 11 and 12 refer to the section Plotting the log likelihood as a function of passes for each step size.

Which of the following is the best step size? Pick the step size that results in the highest log likelihood in the end.

1e-4

1e-2

1e0

1e1

1e2

# Machine Learning: Clustering & Retrieval

Course can be found here
Lecture slides can be found here
Summary can be found in my Github

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?

In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.

Learning Outcomes: By the end of this course, you will be able to:
- Create a document retrieval system using k-nearest neighbors.
- Identify various similarity metrics for text data.
- Reduce computations in k-nearest neighbor search by using KD-trees.
- Produce approximate nearest neighbors using locality sensitive hashing.
- Compare and contrast supervised and unsupervised learning tasks.
- Cluster documents by topic using k-means.
- Describe how to parallelize k-means using MapReduce.
- Examine probabilistic clustering approaches using mixture models.
- Fit a mixture of Gaussians model using expectation maximization (EM).
- Perform mixed membership modeling using latent Dirichlet allocation (LDA).
- Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
- Compare and contrast initialization techniques for non-convex optimization objectives.
- Implement these techniques in Python.

## Week 1 Welcome

Clustering and retrieval are some of the most high-impact machine learning tools out there. Retrieval is used in almost every application and device we interact with, for example in providing a set of products related to the one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but is a more broadly useful tool for automatically discovering structure in data, like uncovering groups of similar patients.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

### What is this course about?

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

intro.pdf

#### Software tools you’ll need for this course (10 min)

Github repository with starter code

In each module of the course, we have a reading with the assignments for that module as well as some starter code. For those interested, the starter code and demos used in this course are also available in a public Github repository:

https://github.com/learnml/machine-learning-specialization

#### A big week ahead! (10 min)

We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.

### Introduction to nearest neighbor search and algorithms

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

retrieval-intro-annotated.pdf

### The importance of data representations and distance metrics

#### Quiz: Representations and metrics (6 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 22, 11:59 PM PDT.

1 point
1.Consider three data points with two features as follows:

Among the three points, which two are closest to each other in terms of having the smallest Euclidean distance?

A and B

A and C

B and C

1 point
2.Consider three data points with two features as follows:

Among the three points, which two are closest to each other in terms of having the largest cosine similarity (or equivalently, smallest cosine distance)?

A and B

A and C

B and C

1 point
3.Consider the following two sentences.

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Compute the Euclidean distance using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.

sum = 13

1 point
4.Consider the following two sentences.

Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A quick brown dog outpaces a quick fox.
Recall that

$$\text{cosine distance} = 1 - \text{cosine similarity} = 1 - \frac{x^T y}{\|x\| \, \|y\|}$$
Compute the cosine distance between sentence 1 and sentence 2 using word counts. To compute word counts, turn all words into lower case and strip all punctuation, so that “The” and “the” are counted as the same token. That is, document 1 would be represented as

x=[# the,# a,# quick,# brown,# fox,# jumps,# over,# lazy,# dog,# outpaces]
where # word is the count of that word in the document.
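Both distances on word counts can be computed with the standard library. The sketch below follows the preprocessing described above (lower-casing and stripping punctuation); the helper names `word_counts`, `euclidean_distance`, and `cosine_distance` are hypothetical.

```python
import math
import string
from collections import Counter

def word_counts(text):
    """Lowercase, strip punctuation, and count whitespace-separated tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

def euclidean_distance(a, b):
    """Euclidean distance between two count vectors over their joint vocabulary."""
    vocab = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))  # Counter gives 0 for absent words

def cosine_distance(a, b):
    """1 - x^T y / (||x|| ||y||) on the count vectors."""
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

s1 = word_counts("The quick brown fox jumps over the lazy dog.")
s2 = word_counts("A quick brown dog outpaces a quick fox.")
print(round(euclidean_distance(s1, s2), 3), round(cosine_distance(s1, s2), 3))
```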

1 point
5.(True/False) For positive features, cosine similarity is always between 0 and 1.

1 point
6.Which of the following does not describe the word count document representation? (Note: this is different from TF-IDF document representation.)

Ignores the order of the words

Assigns a high score to a frequently occurring word

Penalizes words that appear in every document

### Programming Assignment 1

#### Quiz: Choosing features and metrics for nearest neighbor search (5 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 22, 11:59 PM PDT.

1 point
1.Among the words that appear in both Barack Obama and Francisco Barrio, take the 5 that appear most frequently in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

1 point
2.Measure the pairwise distance between the Wikipedia pages of Barack Obama, George W. Bush, and Joe Biden. Which of the three pairs has the smallest distance?

Between Obama and Biden

Between Obama and Bush

Between Biden and Bush

1 point
3.Collect all words that appear both in Barack Obama and George W. Bush pages. Out of those words, find the 10 words that show up most often in Obama’s page. Which of the following is NOT one of the 10 words?

the

presidential

in

act

his

1 point
4.Among the words that appear in both Barack Obama and Phil Schiliro, take the 5 that have largest weights in Obama. How many of the articles in the Wikipedia dataset contain all of those 5 words?

1 point
5.Compute the Euclidean distance between TF-IDF features of Obama and Biden. Round your answer to 3 decimal places. Use American-style decimals (e.g. 110.921).

### Scaling up k-NN search using KD-trees

#### Quiz: KD-trees (5 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 22, 11:59 PM PDT.

1 point
1.Which of the following is not true about KD-trees?

It divides the feature space into nested axis-aligned boxes.

It can be used only for approximate nearest neighbor search but not for exact nearest neighbor search.

It prunes parts of the feature space away from consideration by inspecting smallest possible distances that can be achieved.

The query time scales sublinearly with the number of data points and exponentially with the number of dimensions.

It works best in low to medium dimension settings.

1 point
2.Questions 2, 3, 4, and 5 involve training a KD-tree on the following dataset:

| | X1 | X2 |
|---|---|---|
| Data point 1 | -1.58 | -2.01 |
| Data point 2 | 0.91 | 3.98 |
| Data point 3 | -0.73 | 4.00 |
| Data point 4 | -4.22 | 1.16 |
| Data point 5 | 4.19 | -2.02 |
| Data point 6 | -0.33 | 2.15 |

Train a KD-tree by hand as follows:

• First split using X1 and then using X2. Alternate between X1 and X2 in order.
• Use “middle-of-the-range” heuristic for each split. Take the maximum and minimum of the coordinates of the member points.
• Keep subdividing until every leaf node contains two or fewer data points.

What is the split value used for the first split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).
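The by-hand procedure above can be mirrored in a short recursive function. This is a sketch under assumptions: the names `build_kdtree` and `find_leaf` are hypothetical, points exactly on a split value are sent to the right child, and a node whose points cannot be separated along the current axis is kept as a leaf.

```python
# (X1, X2) coordinates from the table above, keyed by data point number.
points = {
    1: (-1.58, -2.01),
    2: (0.91, 3.98),
    3: (-0.73, 4.00),
    4: (-4.22, 1.16),
    5: (4.19, -2.02),
    6: (-0.33, 2.15),
}

def build_kdtree(ids, depth=0, max_leaf_size=2):
    """Recursively split on alternating features at the middle of the range."""
    if len(ids) <= max_leaf_size:
        return {"leaf": sorted(ids)}
    dim = depth % 2  # 0 -> X1, 1 -> X2
    values = [points[i][dim] for i in ids]
    split = (min(values) + max(values)) / 2.0  # middle-of-the-range heuristic
    left = [i for i in ids if points[i][dim] < split]
    right = [i for i in ids if points[i][dim] >= split]
    if not left or not right:  # all values identical along this axis
        return {"leaf": sorted(ids)}
    return {
        "dim": dim,
        "split": split,
        "left": build_kdtree(left, depth + 1, max_leaf_size),
        "right": build_kdtree(right, depth + 1, max_leaf_size),
    }

def find_leaf(tree, query):
    """Descend to the leaf whose box contains the query point."""
    while "leaf" not in tree:
        side = "left" if query[tree["dim"]] < tree["split"] else "right"
        tree = tree[side]
    return tree["leaf"]

tree = build_kdtree(list(points))
print(round(tree["split"], 3), round(tree["left"]["split"], 3))
print(find_leaf(tree, (-3, 1.5)))
```

Finding the query's leaf is only the first step of an exact search; backtracking with the bounding boxes is still needed to confirm or improve on the candidate found there.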

1 point
3.Refer to Question 2 for context.

What is the split value used for the second split? Enter the exact value, as you are expected to obtain a finite number of decimals. Use American-style decimals (e.g. 0.026).

1 point
4.Refer to Question 2 for context.

Given a query point (-3, 1.5), which of the data points belong to the same leaf node as the query point? Choose all that apply.

Data point 1

Data point 2

Data point 3

Data point 4

Data point 5

Data point 6

1 point
5.Refer to Question 2 for context.

Perform backtracking with the query point (-3, 1.5) to perform exact nearest neighbor search. Which of the data points would be pruned from the search? Choose all that apply.

Hint: Assume that each node in the KD-tree remembers the tight bound on the coordinates of its member points, as follows:

Data point 1

Data point 2

Data point 3

Data point 4

Data point 5

Data point 6

#### Quiz: Locality Sensitive Hashing (5 questions)

To pass: 80% or higher. Attempts: 3 every 8 hours. Due: October 22, 11:59 PM PDT.

1 point
1.(True/False) Like KD-trees, Locality Sensitive Hashing lets us compute exact nearest neighbors while inspecting only a fraction of the data points in the training set.

1 point
2.(True/False) Given two data points with high cosine similarity, the probability that a randomly drawn line would separate the two points is small.

1 point
3.(True/False) The true nearest neighbor of the query is guaranteed to fall into the same bin as the query.

1 point
4.(True/False) Locality Sensitive Hashing is more efficient than KD-trees in high dimensional setting.

1 point
5.Suppose you trained an LSH model and performed a lookup using the bin index of the query. You notice that the list of candidates returned are not at all similar to the query item. Which of the following changes would not produce a more relevant list of candidates?

Use multiple tables.

Increase the number of random lines/hyperplanes.

Inspect more neighboring bins to the bin containing the query.

Decrease the number of random lines/hyperplanes.
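
As a rough illustration of the binning these questions refer to, the sketch below hashes a vector with random hyperplanes: each hyperplane contributes one bit (the side of the hyperplane the point falls on), and the bits are packed into an integer bin index. The function names, the pure-Python vectors, and the choice of 16 hyperplanes in 5 dimensions are assumptions made for illustration; this is not the assignment's code.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw num_planes random normal vectors in dim dimensions (illustrative)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def bin_bits(point, hyperplanes):
    """One bit per hyperplane: 1 if the dot product with its normal is non-negative."""
    return [1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
            for plane in hyperplanes]

def bin_index(bits):
    """Pack the bits into an integer, most significant bit first."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index

planes = make_hyperplanes(num_planes=16, dim=5)
query = [0.3, -1.2, 0.8, 0.0, 2.1]   # made-up query vector
query_bits = bin_bits(query, planes)
query_bin = bin_index(query_bits)
```

Only the direction of a vector matters here, so two points with high cosine similarity are unlikely to be separated by any single random hyperplane and tend to share most of their bits.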

### Programming Assignment 2

#### Quiz: Implementing Locality Sensitive Hashing from scratch (5 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; October 22, 11:59 PM PDT

1 point
1.What is the document ID of Barack Obama’s article?

1 point
2.Which bin contains Barack Obama’s article? Enter its integer index.

1 point
3.Examine the bit representations of the bins containing Barack Obama and Joe Biden. In how many places do they agree?

16 out of 16 places (Barack Obama and Joe Biden fall into the same bin)

14 out of 16 places

12 out of 16 places

10 out of 16 places

8 out of 16 places

1 point
4.Refer to the section “Effect of nearby bin search”. What was the smallest search radius that yielded the correct nearest neighbor for Obama, namely Joe Biden?

1 point
5.Suppose our goal was to produce 10 approximate nearest neighbors whose average distance from the query document is within 0.01 of the average for the true 10 nearest neighbors. For Barack Obama, the true 10 nearest neighbors are on average about 0.77. What was the smallest search radius for Barack Obama that produced an average distance of 0.78 or better?
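
The "search radius" in Questions 4 and 5 can be read as a Hamming distance between bin bit patterns: radius r means inspecting every bin whose bits differ from the query's bin in at most r positions. The sketch below enumerates those bins by flipping bits. The helper names and the 16-bit pattern are made up for illustration and are not taken from the assignment notebook.

```python
from itertools import combinations

def nearby_bins(query_bits, search_radius):
    """Yield bit tuples at Hamming distance exactly search_radius from query_bits."""
    for flip_positions in combinations(range(len(query_bits)), search_radius):
        bits = list(query_bits)
        for pos in flip_positions:
            bits[pos] = 1 - bits[pos]   # flip the chosen bit
        yield tuple(bits)

def bins_within(query_bits, max_radius):
    """Collect all bins at Hamming distance 0..max_radius from the query bin."""
    found = []
    for radius in range(max_radius + 1):
        found.extend(nearby_bins(query_bits, radius))
    return found

def agreement(bits_a, bits_b):
    """Number of positions where two bit patterns agree."""
    return sum(a == b for a, b in zip(bits_a, bits_b))

query_bits = (1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0)  # made-up 16-bit bin
candidates = bins_within(query_bits, max_radius=2)
```

The number of bins to inspect grows combinatorially with the radius, which is the trade-off between search cost and neighbor quality that the last two questions probe.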

## Week 3 Clustering with k-means

In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by “topic”. These topics are not provided in this unsupervised learning task; rather, the idea is to output cluster labels that can be associated post facto with known topics like “Science”, “World News”, etc. Even without such post-facto labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, one of the most widely used clustering algorithms. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterations of k-means can use this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.

### Clustering via k-means

#### Quiz: k-means (9 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; October 29, 11:59 PM PDT

1 point
1.(True/False) k-means always converges to a local optimum.

1 point
2.(True/False) The clustering objective is non-increasing throughout a run of k-means.

1 point
3.(True/False) Running k-means with a larger value of k always enables a lower possible final objective value than running k-means with smaller k.

1 point
4.(True/False) Any initialization of the centroids in k-means is just as good as any other.

1 point
5.(True/False) Initializing centroids using k-means++ guarantees convergence to a global optimum.

1 point
6.(True/False) Initializing centroids using k-means++ costs more than random initialization in the beginning, but can pay off eventually by speeding up convergence.

1 point
7.(True/False) Using k-means++ can only influence the number of iterations to convergence, not the quality of the final assignments (i.e., objective value at convergence).

4 points
8.Consider the following dataset:

| | X1 | X2 |
|---|---|---|
| Data point 1 | -1.88 | 2.05 |
| Data point 2 | -0.71 | 0.42 |
| Data point 3 | 2.41 | -0.67 |
| Data point 4 | 1.85 | -3.80 |
| Data point 5 | -3.69 | -1.33 |

Perform k-means with k=2 until the cluster assignment does not change between successive iterations. Use the following initialization for the centroids:

| | X1 | X2 |
|---|---|---|
| Cluster 1 | 2.00 | 2.00 |
| Cluster 2 | -2.00 | -2.00 |

Which of the five data points changed its cluster assignment most often during the k-means run?

Data point 1

Data point 2

Data point 3

Data point 4

Data point 5

1 point
9.Suppose we initialize k-means with the following centroids

Which of the following best describes the cluster assignment in the first iteration of k-means?
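
For the hand computation in Question 8, a small script can serve as a cross-check. The sketch below runs plain k-means (assign each point to its closest centroid, then move each centroid to the mean of its members) on the five data points and the given initialization, recording the assignment at every iteration. The function names and the choice to leave an empty cluster's centroid where it is are assumptions of this sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_clusters(points, centroids):
    """Assign each point to the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, centroids):
    """Run until assignments stop changing; return the assignment history and centroids."""
    history = [assign_clusters(points, centroids)]
    while True:
        centroids = update_centroids(points, history[-1], centroids)
        labels = assign_clusters(points, centroids)
        if labels == history[-1]:
            return history, centroids
        history.append(labels)

points = [(-1.88, 2.05), (-0.71, 0.42), (2.41, -0.67), (1.85, -3.80), (-3.69, -1.33)]
initial_centroids = [(2.00, 2.00), (-2.00, -2.00)]
history, final_centroids = kmeans(points, initial_centroids)

# Number of times each point switched clusters between successive iterations.
changes = [sum(a[i] != b[i] for a, b in zip(history, history[1:]))
           for i in range(len(points))]
print(history, changes)
```

Cluster indices are 0-based here, so index 0 corresponds to "Cluster 1" in the question.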

### Programming Assignment

#### Quiz: Clustering text data with K-means (8 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; October 29, 11:59 PM PDT

1 point
1.Make sure you have the latest versions of the notebook and the file kmeans-arrays.npz. Read this post if

… you created an Amazon EC2 instance before October 1

I acknowledge.
1 point
2.(True/False) The clustering objective (heterogeneity) is non-increasing for this example.

1 point
3.Let’s step back from this particular example. If the clustering objective (heterogeneity) were ever to increase while running K-means, that would indicate: (choose one)

K-means algorithm got stuck in a bad local minimum

There is a bug in the K-means code

All data points consist of exact duplicates

Nothing is wrong. The objective should generally go down sooner or later.
1 point
4.Refer to the output of K-means for K=3 and seed=0. Which of the three clusters contains the greatest number of data points in the end?

Cluster #0

Cluster #1

Cluster #2
1 point

5.Another way to capture the effect of changing initialization is to look at the distribution of cluster assignments. Compute the size (# of member data points) of clusters for each of the multiple runs of K-means.

Look at the size of the largest cluster (most # of member data points) across multiple runs, with seeds 0, 20000, …, 120000. What is the maximum value this quantity takes?

1 point
6.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the greatest number of articles?

Cluster 0: artists, actors, film directors, playwrights

Cluster 4: professors, researchers, scholars

Cluster 5: Australian rules football players, American football players

Cluster 7: composers, songwriters, singers, music producers

Cluster 9: politicians

1 point
7.Refer to the section “Visualize clusters of documents”. Which of the 10 clusters above contains the fewest articles?

Cluster 1: soccer (association football) players, rugby players

Cluster 3: baseball players

Cluster 6: female figures from various fields

Cluster 7: composers, songwriters, singers, music producers

Cluster 8: ice hockey players

1 point

8.Another sign of too large a K is having lots of small clusters. Look at the distribution of cluster sizes (by number of member data points). How many of the 100 clusters have fewer than 236 articles, i.e. 0.4% of the dataset?

### MapReduce for scaling k-means

#### Quiz: MapReduce for k-means (5 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; October 29, 11:59 PM PDT

1 point
1.Suppose we are operating on a 1D vector. Which of the following operations is not data parallel over the vector elements?

Add a constant to every element.

Multiply the vector by a constant.

Increment the vector by another vector of the same dimension.

Compute the average of the elements.

Compute the sign of each element.

1 point
2.(True/False) A single mapper call can emit multiple (key,value) pairs.

1 point
3.(True/False) More than one reducer can emit (key,value) pairs with the same key simultaneously.

1 point
4.(True/False) Suppose we are running k-means using MapReduce. Some mappers may be launched for a new k-means iteration even if some reducers from the previous iteration are still running.

1 point
5.Consider the following list of binary operations. Which can be used for the reduce step of MapReduce? Choose all that apply.

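Hint: The reduce step requires a binary operator that satisfies both of the following conditions.

Commutative: $\mathrm{OP}(x_1,x_2)=\mathrm{OP}(x_2,x_1)$

Associative: $\mathrm{OP}(\mathrm{OP}(x_1,x_2),x_3)=\mathrm{OP}(x_1,\mathrm{OP}(x_2,x_3))$

$\mathrm{OP}_1(x_1,x_2)=\max(x_1,x_2)$

$\mathrm{OP}_2(x_1,x_2)=x_1+x_2-2$

$\mathrm{OP}_3(x_1,x_2)=3x_1+2x_2$

$\mathrm{OP}_4(x_1,x_2)=x_1^2+x_2$

$\mathrm{OP}_5(x_1,x_2)=(x_1+x_2)/2$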
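
The two conditions in the hint can be probed numerically. The sketch below tests each candidate operator for commutativity and associativity over a small set of sample values. A failure on any sample is a genuine counterexample, while passing on a finite sample is only evidence, not a proof. OP4 is read here as x1 squared plus x2, and the helper names and sample values are assumptions of this sketch.

```python
from itertools import product

def is_commutative(op, samples):
    """True if op(a, b) == op(b, a) for every pair of sample values."""
    return all(op(a, b) == op(b, a) for a, b in product(samples, repeat=2))

def is_associative(op, samples):
    """True if op(op(a, b), c) == op(a, op(b, c)) for every sample triple."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a, b, c in product(samples, repeat=3))

operators = {
    "OP1": lambda x1, x2: max(x1, x2),
    "OP2": lambda x1, x2: x1 + x2 - 2,
    "OP3": lambda x1, x2: 3 * x1 + 2 * x2,
    "OP4": lambda x1, x2: x1 ** 2 + x2,
    "OP5": lambda x1, x2: (x1 + x2) / 2,
}

samples = [-2, 0, 1, 3]  # small integers so the arithmetic stays exact

# Maps each operator name to (commutative on samples, associative on samples).
report = {name: (is_commutative(op, samples), is_associative(op, samples))
          for name, op in operators.items()}
for name, (commutative, associative) in report.items():
    print(name, "commutative:", commutative, "associative:", associative)
```

Both properties matter because a MapReduce runtime may combine partial results in any order and any grouping.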

## Week 4 Mixture Models

In k-means, observations are each hard-assigned to a single cluster, and these assignments are based just on the cluster centers, rather than also incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a “cluster” and (2) accounts for uncertainty in assignments of datapoints to clusters via “soft assignments”. You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation considered.
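
To make the contrast between hard and soft assignments concrete, here is a minimal one-dimensional sketch of the E-step for a single observation: each cluster's responsibility is its mixture weight times its Gaussian density at the point, normalized over clusters. The two-cluster model and all of its parameter values are hypothetical, chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)

def responsibilities(x, weights, means, variances):
    """E-step for one observation: posterior probability of each cluster."""
    scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical model: a tight cluster at 0 and a broader cluster at 4.
weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 4.0]

soft = responsibilities(1.5, weights, means, variances)
hard = max(range(len(soft)), key=soft.__getitem__)  # what a hard assignment would keep
print(soft, hard)
```

The soft assignment keeps both numbers, so the second cluster's variance (its "shape") influences the result, whereas a k-means-style hard assignment would keep only the winning index.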

### Motivating and setting the foundation for mixture models

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

mixmodel-EM-annotated.pdf

### The EM algorithm

#### Quiz: EM for Gaussian mixtures (9 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 5, 11:59 PM PST

1 point
1.(True/False) While the EM algorithm maintains uncertainty about the cluster assignment for each observation via soft assignments, the model assumes that every observation comes from only one cluster.

1 point
2.(True/False) In high dimensions, the EM algorithm runs the risk of setting cluster variances to zero.

1 point
3.In the EM algorithm, what do the E step and M step represent, respectively?

Estimate cluster responsibilities, Maximize likelihood over parameters

Estimate likelihood over parameters, Maximize cluster responsibilities

Estimate number of parameters, Maximize likelihood over parameters

Estimate likelihood over parameters, Maximize number of parameters

1 point
4.Suppose we have data that come from a mixture of 6 Gaussians (i.e., that is the true data structure). Which model would we expect to have the highest log-likelihood after fitting via the EM algorithm?

A mixture of Gaussians with 2 component clusters

A mixture of Gaussians with 4 component clusters

A mixture of Gaussians with 6 component clusters

A mixture of Gaussians with 7 component clusters

A mixture of Gaussians with 10 component clusters
6

1 point
5.Which of the following correctly describes the differences between EM for mixtures of Gaussians and k-means? Choose all that apply.

k-means often gets stuck in a local minimum, while EM tends not to

EM is better at capturing clusters of different sizes and orientations

EM is better at capturing clusters with overlaps

EM is less prone to overfitting than k-means

k-means is equivalent to running EM with infinitesimally small diagonal covariances.

1 point
6.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

| Cluster responsibilities | Cluster A | Cluster B | Cluster C |
|---|---|---|---|
| Data point 1 | 0.20 | 0.40 | 0.40 |
| Data point 2 | 0.50 | 0.10 | 0.40 |
| Data point 3 | 0.70 | 0.20 | 0.10 |

Which is the least probable cluster for data point 1?

Cluster A

Cluster B

Cluster C

1 point
7.Suppose we are running the EM algorithm. After an E-step, we obtain the following responsibility matrix:

| Cluster responsibilities | Cluster A | Cluster B | Cluster C |
|---|---|---|---|
| Data point 1 | 0.20 | 0.40 | 0.40 |
| Data point 2 | 0.50 | 0.10 | 0.40 |
| Data point 3 | 0.70 | 0.20 | 0.10 |

Suppose also that the data points are as follows:

| Dataset | X | Y | Z |
|---|---|---|---|
| Data point 1 | 3 | 1 | 2 |
| Data point 2 | 0 | 0 | 3 |
| Data point 3 | 1 | 3 | 7 |

Let us compute the new mean for Cluster A. What is the Z coordinate of the new mean? Round your answer to 3 decimal places.
(2 × 0.2 + 3 × 0.5 + 7 × 0.7) / (0.2 + 0.5 + 0.7) = 6.8 / 1.4 ≈ 4.857

1 point
8.Which of the following contour plots describes a Gaussian distribution with diagonal covariance? Choose all that apply.

(1)

(2)

(3)

(4)

(5)

2 points
9.Suppose we initialize EM for mixtures of Gaussians (using full covariance matrices) with the following clusters:

Which of the following best describes the updated clusters after the first iteration of EM?
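
The mean update asked for in Question 7 of this quiz is a responsibility-weighted average of the data points. The sketch below computes it for Cluster A using the responsibility matrix and dataset from that question. The function name and the list-based data layout are assumptions of this sketch.

```python
def weighted_mean(points, responsibilities):
    """M-step mean update for one cluster: responsibility-weighted average."""
    total = sum(responsibilities)
    dims = len(points[0])
    return [sum(r * p[d] for r, p in zip(responsibilities, points)) / total
            for d in range(dims)]

points = [(3, 1, 2), (0, 0, 3), (1, 3, 7)]   # X, Y, Z for data points 1-3
resp_cluster_a = [0.20, 0.50, 0.70]          # Cluster A column of the responsibility matrix

mean_a = weighted_mean(points, resp_cluster_a)
print([round(coordinate, 3) for coordinate in mean_a])
```

The denominator is the cluster's total responsibility (its "soft count"), not the number of data points, which is what distinguishes this update from the k-means centroid update.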

### Programming Assignment 1

#### Quiz: Implementing EM for Gaussian mixtures (6 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 5, 11:59 PM PST

1 point
1.What is the weight that EM assigns to the first component after running the above codeblock? Round your answer to 3 decimal places.

1 point
2.Using the same set of results, obtain the mean that EM assigns the second component. What is the mean in the first dimension? Round your answer to 3 decimal places.

1 point
3.Using the same set of results, obtain the covariance that EM assigns the third component. What is the variance in the first dimension? Round your answer to 3 decimal places.

1 point
4.Is the loglikelihood plot monotonically increasing, monotonically decreasing, or neither?

Monotonically increasing

Monotonically decreasing

Neither

1 point
5.Calculate the likelihood (score) of the first image in our data set (img[0]) under each Gaussian component through a call to multivariate_normal.pdf. Given these values, what cluster assignment should we make for this image?

Cluster 0

Cluster 1

Cluster 2

Cluster 3

1 point
6.Four of the following images are not in the list of top 5 images in the first cluster. Choose these four.

Image 1

Image 2

Image 3

Image 4

Image 5

Image 6

Image 7

### Programming Assignment 2

#### Quiz: Clustering text data with Gaussian mixtures (4 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 5, 11:59 PM PST

1 point
1.Select all the topics that have a cluster in the model created above.

Baseball

Soccer/football

Music

Politics

Law

Finance

1 point
2.Try fitting EM with the random initial parameters you created above. What is the final loglikelihood that the algorithm converges to? Choose the range that contains this value.

Less than 2.2e9

Between 2.2e9 and 2.3e9

Between 2.3e9 and 2.4e9

Between 2.4e9 and 2.5e9

Greater than 2.5e9

1 point
3.Is the final loglikelihood larger or smaller than the final loglikelihood we obtained above when initializing EM with the results from running k-means?

Initializing EM with k-means led to a larger final loglikelihood

Initializing EM with k-means led to a smaller final loglikelihood

1 point
4.For the above model, out_random_init, use the visualize_EM_clusters method you created above. Are the clusters more or less interpretable than the ones found after initializing using k-means?

More interpretable

Less interpretable

## Week 5 Mixed Membership Modeling via Latent Dirichlet Allocation

The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. But often our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model that is particularly useful in document analysis. You will interpret the output of LDA and explore various ways the output can be used, such as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations.

Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.

### Introduction to latent Dirichlet allocation

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

LDA-annotated.pdf

#### Quiz: Latent Dirichlet Allocation (5 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 12, 11:59 PM PST

1 point
1.(True/False) According to the assumptions of LDA, each document in the corpus contains words about a single topic.

1 point
2.(True/False) Using LDA to analyze a set of documents is an example of a supervised learning task.

1 point
3.(True/False) When training an LDA model, changing the ordering of words in a document does not affect the overall joint probability.

1 point
4.(True/False) Suppose in a trained LDA model two documents have no topics in common (i.e., one document has 0 weight on any topic with non-zero weight in the other document). As a result, a single word in the vocabulary cannot have high probability of occurring in both documents.

1 point
5.(True/False) Topic models are guaranteed to produce weights on words that are coherent and easily interpretable by humans.

### Summarizing latent Dirichlet allocation

#### Quiz: Learning LDA model via Gibbs sampling (10 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 12, 11:59 PM PST

1 point
1.(True/False) Each iteration of Gibbs sampling for Bayesian inference in topic models is guaranteed to yield a higher joint model probability than the previous sample.

1 point
2.(Check all that are true) Bayesian methods such as Gibbs sampling can be advantageous because they

Account for uncertainty over parameters when making predictions

Are faster than methods such as EM

Maximize the log probability of the data under the model

Regularize parameter estimates to avoid extreme values

1 point
3.For the standard LDA model discussed in the lectures, how many parameters are required to represent the distributions defining the topics?

[# unique words]

[# unique words] * [# topics]

[# documents] * [# unique words]

[# documents] * [# topics]

2 points
4.Suppose we have a collection of documents, and we are focusing our analysis on the use of the following 10 words. We ran several iterations of collapsed Gibbs sampling for an LDA model with K=2 topics, alpha=10.0, and gamma=0.1 (with notation as in the collapsed Gibbs sampling lecture). The corpus-wide assignments at our most recent collapsed Gibbs iteration are summarized in the following table of counts:

| Word | Count in topic 1 | Count in topic 2 |
|---|---|---|
| baseball | 52 | 0 |
| homerun | 15 | 0 |
| ticket | 9 | 2 |
| price | 9 | 25 |
| manager | 20 | 37 |
| owner | 17 | 32 |
| company | 1 | 23 |
| stock | 0 | 75 |
| bankrupt | 0 | 19 |
| taxes | 0 | 29 |

We also have a single document i with the following topic assignments for each word:

| topic | 1 | 2 | 1 | 2 | 1 |
|---|---|---|---|---|---|
| word | baseball | manager | ticket | price | owner |

Suppose we want to re-compute the topic assignment for the word “manager”. To sample a new topic, we need to compute several terms to determine how much the document likes each topic, and how much each topic likes the word “manager”. The following questions will all relate to this situation.

First, using the notation in the slides, what is the value of $m_{\text{manager},1}$ (i.e., the number of times the word “manager” has been assigned to topic 1)?

1 point
5.Consider the situation described in Question 4.

What is the value of $\sum_w m_{w,1}$, where the sum is taken over all words in the vocabulary?

1 point
6.Consider the situation described in Question 4.

Following the notation in the slides, what is the value of $n_{i,1}$ for this document i (i.e., the number of words in document i assigned to topic 1)?

1 point
7.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $n_{i,2}$?

1 point
8.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $m_{\text{manager},2}$?

1 point
9.In the situation described in Question 4, “manager” was assigned to topic 2. When we remove that assignment prior to sampling, we need to decrement the associated counts.

After decrementing, what is the value of $\sum_w m_{w,2}$?

2 points
10.Consider the situation described in Question 4.

As discussed in the slides, the unnormalized probability of assigning to topic 1 is

$$p_1=\frac{n_{i,1}+\alpha}{N_i-1+K\alpha}\cdot\frac{m_{\text{manager},1}+\gamma}{\sum_w m_{w,1}+V\gamma}$$

where $V$ is the total size of the vocabulary.

Similarly the unnormalized probability of assigning to topic 2 is

$$p_2=\frac{n_{i,2}+\alpha}{N_i-1+K\alpha}\cdot\frac{m_{\text{manager},2}+\gamma}{\sum_w m_{w,2}+V\gamma}$$

Using the above equations and the results computed in previous questions, compute the probability of assigning the word “manager” to topic 1.

(Reminder: Normalize across the two topic options so that the probabilities of all possible assignments—topic 1 and topic 2—sum to 1.)

$p_1 = \frac{3+10}{4+2\cdot 10}\cdot\frac{20+0.1}{123+10\cdot 0.1}$

$p_2 = \frac{1+10}{4+2\cdot 10}\cdot\frac{36+0.1}{241+10\cdot 0.1}$
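
The bookkeeping in Questions 4 through 10 can be reproduced with a short script. The sketch below rebuilds the counts from the table in Question 4, removes the current assignment of “manager” (topic 2), evaluates the two unnormalized probabilities from Question 10, and normalizes them. The variable and function names are assumptions of this sketch, and it relies on “manager” appearing only once in document i.

```python
def unnormalized_topic_prob(n_doc_topic, doc_len_minus_one, m_word_topic, m_topic_total,
                            alpha, gamma, num_topics, vocab_size):
    """(How much the document likes the topic) x (how much the topic likes the word)."""
    doc_term = (n_doc_topic + alpha) / (doc_len_minus_one + num_topics * alpha)
    word_term = (m_word_topic + gamma) / (m_topic_total + vocab_size * gamma)
    return doc_term * word_term

counts = {  # word: (count in topic 1, count in topic 2), from the table in Question 4
    "baseball": (52, 0), "homerun": (15, 0), "ticket": (9, 2), "price": (9, 25),
    "manager": (20, 37), "owner": (17, 32), "company": (1, 23), "stock": (0, 75),
    "bankrupt": (0, 19), "taxes": (0, 29),
}
doc = [("baseball", 1), ("manager", 2), ("ticket", 1), ("price", 2), ("owner", 1)]
alpha, gamma, K, V = 10.0, 0.1, 2, len(counts)

target = "manager"
current_topic = dict(doc)[target]

# Decrement every count that involves the assignment being resampled.
n_doc = {k: sum(1 for w, t in doc if t == k and w != target) for k in (1, 2)}
m_word = {k: counts[target][k - 1] - (1 if k == current_topic else 0) for k in (1, 2)}
m_total = {k: sum(c[k - 1] for c in counts.values()) - (1 if k == current_topic else 0)
           for k in (1, 2)}

p = {k: unnormalized_topic_prob(n_doc[k], len(doc) - 1, m_word[k], m_total[k],
                                alpha, gamma, K, V) for k in (1, 2)}
prob_topic_1 = p[1] / (p[1] + p[2])
print(round(prob_topic_1, 3))
```

A Gibbs sampler would then draw the new topic for “manager” at random according to these normalized probabilities, rather than picking the larger one.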

### Programming Assignment

#### Quiz: Modeling text topics with Latent Dirichlet Allocation (12 questions)

To pass: 80% or higher; attempts: 3 every 8 hours; November 12, 11:59 PM PST

1 point
1.Identify the top 3 most probable words for the first topic.

institute

university

professor

research

studies

game

coach

1 point
2.What is the sum of the probabilities assigned to the top 50 words in the 3rd topic? Round your answer to 3 decimal places.

1 point
3.What is the topic most closely associated with the article about former US President George W. Bush? Use the average results from 100 topic predictions.

1 point
4.What are the top 3 topics corresponding to the article about English football (soccer) player Steven Gerrard? Use the average results from 100 topic predictions.

science and research

team sports

music, TV, and film

international athletics

Great Britain and Australia

1 point
5.Using the LDA representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the k-th nearest neighbor to Alex Rodriguez?

1 point
6.Using the TF-IDF representation, compute the 5000 nearest neighbors for American baseball player Alex Rodriguez. For what value of k is Mariano Rivera the k-th nearest neighbor to Alex Rodriguez?

1 point
7.What was the value of alpha used to fit our original topic model?

1 point
8.What was the value of gamma used to fit our original topic model? Remember that GraphLab Create uses “beta” instead of “gamma” to refer to the hyperparameter that influences topic distributions over words.

1 point
9.How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the low alpha model? Use the average results from 100 topic predictions.

1 point
10.How many topics are assigned a weight greater than 0.3 or less than 0.05 for the article on Paul Krugman in the high alpha model? Use the average results from 100 topic predictions.

1 point
11.For each topic of the low gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument.)

1 point
12.For each topic of the high gamma model, compute the number of words required to make a list with total probability 0.5. What is the average number of words required across all topics? (HINT: use the get_topics() function from GraphLab Create with the cdf_cutoff argument).

## Week 6 Hierarchical Clustering & Closing Remarks

In the conclusion of the course, we will recap what we have covered. This includes both techniques specific to clustering and retrieval and foundational machine learning concepts that are more broadly useful.

We provide a quick tour of an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course.

We conclude with an overview of what’s in store for you in the rest of the specialization.

### What we’ve learned

#### Slides presented in this module (10 min)

For those interested, the slides presented in the videos for this module can be downloaded here:

closing-annotated.pdf

### Programming Assignment

#### Quiz: Modeling text data with a hierarchy of clusters (3 questions)

To pass: 33% or higher; attempts: 3 every 8 hours; November 19, 11:59 PM PST

1 point

… you created an Amazon EC2 instance before October 1

I acknowledge.

1 point
2.Which diagram best describes the hierarchy right after splitting the ice_hockey_football cluster?
football golf

1 point
3.Let us bipartition the clusters male_non_athletes and female_non_athletes. Which diagram best describes the resulting hierarchy of clusters for the non-athletes?

Note. The clusters for the athletes are not shown to save space.