For quick searching
Course can be found here
Video in YouTube
Lecture slides can be found in my GitHub (PDF version)

If you want to break into AI, this Specialization will help you do so. Deep Learning is one of the most highly sought after skills in tech. We will help you become good at Deep Learning.

In five courses, you will learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about Convolutional networks, RNNs, LSTM, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry. You will practice all these ideas in Python and in TensorFlow, which we will teach.

You will also hear from many top leaders in Deep Learning, who will share with you their personal stories and give you career advice.

AI is transforming multiple industries. After finishing this specialization, you will likely find creative ways to apply it to your work.

We will help you master Deep Learning, understand how to apply it, and build a career in AI.

Neural Networks and Deep Learning

Course can be found here
Lecture slides can be found here
About this course: If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought after, and mastering deep learning will give you numerous new career opportunities. Deep learning is also a new “superpower” that will let you build AI systems that just weren’t possible a few years ago.

In this course, you will learn the foundations of deep learning. When you finish this class, you will:

Understand the major technology trends driving Deep Learning
Be able to build, train and apply fully connected deep neural networks
Know how to implement efficient (vectorized) neural networks
Understand the key parameters in a neural network’s architecture

This course also teaches you how Deep Learning actually works, rather than presenting only a cursory or surface-level description. So after completing it, you will be able to apply deep learning to your own applications. If you are looking for a job in AI, after this course you will also be able to answer basic interview questions.

This is the first course of the Deep Learning Specialization.

Who is this class for: Prerequisites:
Expected:
- Programming: Basic Python programming skills, with the capability to work effectively with data structures.
Recommended:
- Mathematics: Matrix vector operations and notation.
- Machine Learning: Understanding how to frame a machine learning problem, including how data is represented, will be beneficial. If you have taken my Machine Learning Course here, you have much more than the needed level of knowledge.

Week 1 Introduction to deep learning

Be able to explain the major trends driving the rise of deep learning, and understand where and how it is applied today.

Learning Objectives

Understand the major trends driving the rise of deep learning.
Be able to explain how deep learning is applied to supervised learning.
Understand what are the major categories of models (such as CNNs and RNNs), and when they should be applied.
Be able to recognize the basics of when deep learning will (or will not) work well.

Welcome to the Deep Learning Specialization

Welcome (5 min)

Introduction to Deep Learning

What is a neural network? (7 min)

Supervised Learning with Neural Networks (8 min)

Why is Deep Learning taking off? (10 min)

About this Course (2 min)

Frequently Asked Questions (10 min)

Course Resources (1 min)

How to use Discussion Forums (10 min)

Practice Questions

Quiz: Introduction to deep learning (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 17, 11:59 PM PDT

1 point
1.What does the analogy “AI is the new electricity” refer to?

AI is powering personal devices in our homes and offices, similar to electricity.

AI runs on computers and is thus powered by electricity, but it is letting computers do things not possible before.

Through the “smart grid”, AI is delivering a new wave of electricity.

Similar to electricity starting about 100 years ago, AI is transforming multiple industries.

Correct
Yes. AI is transforming many fields from the car industry to agriculture to supply-chain…

1 point
2.Which of these are reasons for Deep Learning recently taking off? (Check the three options that apply.)

Deep learning has resulted in significant improvements in important applications such as online advertising, speech recognition, and image recognition.

Correct
These were all examples discussed in lecture 3.

Neural Networks are a brand new field.

Un-selected is correct

We have access to a lot more data.

Correct
Yes! The digitalization of our society has played a huge role in this.

We have access to a lot more computational power.

Correct
Yes! The development of hardware, perhaps especially GPU computing, has significantly improved deep learning algorithms’ performance.

1 point
3.Recall this diagram of iterating over different ML ideas. Which of the statements below are true? (Check all that apply.)

Being able to try out ideas quickly allows deep learning engineers to iterate more quickly.

Faster computation can help speed up how long a team takes to iterate to a good idea.

It is faster to train on a big dataset than a small dataset.

Recent progress in deep learning algorithms has allowed us to train good models faster (even without changing the CPU/GPU hardware).

1 point
4.When an experienced deep learning engineer works on a new problem, they can usually use insight from previous problems to train a good model on the first try, without needing to iterate multiple times through different models. True/False?

True

This should not be selected
No. Finding the characteristics of a model is key to have good performance. Although experience can help, it requires multiple iterations to build a good model.

False

1 point
5.Which one of these plots represents a ReLU activation function?

Figure 1:

Figure 2:

Figure 3:

Figure 4:
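The four plots for this question did not survive in these notes. As a reference point, here is a minimal sketch of ReLU, which returns zero for negative inputs and the input itself otherwise. The function name `relu` and the sample points are illustrative choices, not part of the quiz.

```python
def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

# Sample a few points: the output is flat at 0 for z < 0 and has slope 1 for z > 0.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
values = [relu(z) for z in samples]
print(values)
```

So the ReLU plot is the one that stays on the horizontal axis for negative inputs and then rises as a straight line.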

1 point
6.Images for cat recognition is an example of “structured” data, because it is represented as a structured array in a computer. True/False?

True

This should not be selected
No. Images for cat recognition is an example of “unstructured” data.

False

1 point
7.A demographic dataset with statistics on different cities’ population, GDP per capita, economic growth is an example of “unstructured” data because it contains data coming from different sources. True/False?

True

False

1 point
8.Why is an RNN (Recurrent Neural Network) used for machine translation, say translating English to French? (Check all that apply.)

It can be trained as a supervised learning problem.

Correct
Yes. We can train it on many pairs of sentences x (English) and y (French).

It is strictly more powerful than a Convolutional Neural Network (CNN).

This should not be selected
No. RNN and CNN are two distinct classes of models, with their own advantages and disadvantages.

It is applicable when the input/output is a sequence (e.g., a sequence of words).

Correct
Yes. An RNN can map from a sequence of english words to a sequence of french words.

RNNs represent the recurrent process of Idea->Code->Experiment->Idea->….

1 point
9.In this diagram which we hand-drew in lecture, what do the horizontal axis (x-axis) and vertical axis (y-axis) represent?

x-axis is the amount of data
y-axis (vertical axis) is the performance of the algorithm.

x-axis is the performance of the algorithm
y-axis (vertical axis) is the amount of data.

x-axis is the amount of data
y-axis is the size of the model you train.

x-axis is the input to the algorithm
y-axis is outputs.

1 point
10.Assuming the trends described in the previous question’s figure are accurate (and hoping you got the axis labels right), which of the following are true? (Check all that apply.)

Increasing the size of a neural network generally does not hurt an algorithm’s performance, and it may help significantly.

Decreasing the size of a neural network generally does not hurt an algorithm’s performance, and it may help significantly.

Increasing the training set size generally does not hurt an algorithm’s performance, and it may help significantly.

Decreasing the training set size generally does not hurt an algorithm’s performance, and it may help significantly.

Heroes of Deep Learning (Optional)

Geoffrey Hinton interview (40 min)

Week 2 Neural Networks Basics

Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your models.

Learning Objectives

Build a logistic regression model, structured as a shallow neural network
Implement the main steps of an ML algorithm, including making predictions, derivative computation, and gradient descent.
Implement computationally efficient, highly vectorized, versions of models.
Understand how to compute derivatives for logistic regression, using a backpropagation mindset.
Become familiar with Python and Numpy
Work with iPython Notebooks
Be able to implement vectorization across multiple training examples

Logistic Regression as a Neural Network

Binary Classification (8 min)

Logistic Regression (5 min)

Logistic Regression Cost Function (8 min)

Gradient Descent (11 min)

Derivatives (7 min)

More Derivative Examples (10 min)

Computation graph (3 min)

Derivatives with a Computation Graph (14 min)

Logistic Regression Gradient Descent (6 min)

Gradient Descent on m Examples (8 min)

Python and Vectorization

Vectorization (8 min)

More Vectorization Examples (6 min)

Vectorizing Logistic Regression (7 min)

Vectorizing Logistic Regression’s Gradient Output (9 min)

Broadcasting in Python (11 min)

A note on python/numpy vectors (6 min)

Quick tour of Jupyter/iPython Notebooks (3 min)

Explanation of logistic regression cost function (optional) (7 min)

Practice Questions

Quiz: Neural Network Basics (10 questions)

To pass: 80% or higher
Deadline: September 24, 11:59 PM PDT

1 point
1.What does a neuron compute?

A neuron computes an activation function followed by a linear function (z = Wx + b)

A neuron computes a function g that scales the input x linearly (Wx + b)

A neuron computes a linear function (z = Wx + b) followed by an activation function

A neuron computes the mean of all features before applying the output to an activation function

1 point
2.Which of these is the “Logistic Loss”?

$\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = -( y^{(i)}\log(\hat{y}^{(i)}) + (1- y^{(i)})\log(1-\hat{y}^{(i)}))$

$\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \mid y^{(i)} - \hat{y}^{(i)} \mid$

$\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \mid y^{(i)} - \hat{y}^{(i)} \mid^{2}$

$\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = max(0, y^{(i)} - \hat{y}^{(i)})$
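To get a feel for the logistic (cross-entropy) loss, the sketch below evaluates $-(y\log\hat{y} + (1-y)\log(1-\hat{y}))$ for a single example. The function name `logistic_loss` and the probabilities used are illustrative assumptions.

```python
import math

def logistic_loss(y_hat, y):
    """Logistic loss for one example; y is 0 or 1 and 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label is 1 in both cases; only the predicted probability differs.
confident_right = logistic_loss(0.9, 1)   # small loss
confident_wrong = logistic_loss(0.1, 1)   # much larger loss
print(confident_right, confident_wrong)
```

The loss grows without bound as the predicted probability for the true class approaches zero, which is what penalizes confident mistakes.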

1 point
3.Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?

x = img.reshape((1,32*32,*3))

x = img.reshape((32*32,3))

x = img.reshape((32*32*3,1))

x = img.reshape((3,32*32))
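A column vector has shape (n, 1), so every one of the 32 x 32 x 3 values has to end up along the first dimension. A quick check, using an all-zero array as a stand-in for a real image:

```python
import numpy as np

img = np.zeros((32, 32, 3))         # stand-in for a 32x32 RGB image
x = img.reshape((32 * 32 * 3, 1))   # flatten every value into a single column
print(x.shape)
```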

1 point
4.Consider the two following random arrays “a” and “b”:

a = np.random.randn(2, 3) # a.shape = (2, 3)
b = np.random.randn(2, 1) # b.shape = (2, 1)
c = a + b

What will be the shape of “c”?

c.shape = (2, 1)

c.shape = (2, 3)

The computation cannot happen because the sizes don’t match. It’s going to be “Error”!

c.shape = (3, 2)
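This case can be checked directly. In the sketch below, NumPy broadcasting stretches the single column of `b` across the three columns of `a`, so the sum keeps the shape of `a`.

```python
import numpy as np

a = np.random.randn(2, 3)
b = np.random.randn(2, 1)
c = a + b   # b's one column is added to each of a's three columns

print(c.shape)
```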

1 point
5.Consider the two following random arrays “a” and “b”:

a = np.random.randn(4, 3) # a.shape = (4, 3)
b = np.random.randn(3, 2) # b.shape = (3, 2)
c = a*b

What will be the shape of “c”?

The computation cannot happen because the sizes don’t match. It’s going to be “Error”!

c.shape = (4,2)

c.shape = (3, 3)

c.shape = (4, 3)
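The difference between `*` and `np.dot` shows up clearly with these shapes. The sketch below tries the element-wise product and records what happens, then computes the matrix product for comparison. The variable names `outcome` and `matmul_shape` are illustrative.

```python
import numpy as np

a = np.random.randn(4, 3)
b = np.random.randn(3, 2)

# Element-wise product: (4, 3) and (3, 2) cannot be broadcast to a common shape.
try:
    c = a * b
    outcome = c.shape
except ValueError:
    outcome = "ValueError"

# Matrix product: inner dimensions match (3 and 3), result is (4, 2).
matmul_shape = np.dot(a, b).shape
print(outcome, matmul_shape)
```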

1 point
6.Suppose you have $$n_x$$ input features per example. Recall that $$X = [x^{(1)} x^{(2)} … x^{(m)}]$$. What is the dimension of X?

$$(m,n_x)$$

$$(m,1)$$

$$(n_x, m)$$

$$(1,m)$$

1 point
7.Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an element-wise multiplication.

Consider the two following random arrays “a” and “b”:

a = np.random.randn(12288, 150) # a.shape = (12288, 150)
b = np.random.randn(150, 45) # b.shape = (150, 45)
c = np.dot(a,b)

What is the shape of c?

The computation cannot happen because the sizes don’t match. It’s going to be “Error”!

c.shape = (12288, 45)

c.shape = (12288, 150)

c.shape = (150,150)

1 point
8.Consider the following code snippet:

# a.shape = (3,4)
# b.shape = (4,1)
for i in range(3):
  for j in range(4):
    c[i][j] = a[i][j] + b[j]

How do you vectorize this?

c = a.T + b.T

c = a + b.T

c = a + b

c = a.T + b
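One way to check a candidate vectorization is to run it next to the explicit loop and compare results. The sketch below does this for `a + b.T`; it indexes `b[j, 0]` rather than `b[j]` so that a scalar is assigned into each cell, which is a small adjustment to the quiz snippet.

```python
import numpy as np

a = np.random.randn(3, 4)
b = np.random.randn(4, 1)

# Explicit double loop, as in the question.
c_loop = np.zeros((3, 4))
for i in range(3):
    for j in range(4):
        c_loop[i, j] = a[i, j] + b[j, 0]

# Vectorized form: b.T has shape (1, 4) and is broadcast over the 3 rows of a.
c_vec = a + b.T
print(np.allclose(c_loop, c_vec))
```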

1 point
9.Consider the following code:

a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a*b

What will be c? (If you’re not sure, feel free to run this in python to find out).

This will invoke broadcasting, so b is copied three times to become (3,3), and $$*$$ is an element-wise product so c.shape will be (3, 3)

This will invoke broadcasting, so b is copied three times to become (3, 3), and $$*$$ invokes a matrix multiplication operation of two 3x3 matrices so c.shape will be (3, 3)

This will multiply a 3x3 matrix a with a 3x1 vector, thus resulting in a 3x1 vector. That is, c.shape = (3,1).

It will lead to an error since you cannot use “*” to operate on these two matrices. You need to instead use np.dot(a,b)

1 point
10.Consider the following computation graph.

What is the output J?

J = (c - 1)*(b + a)

J = (a - 1) * (b + c)

J = a*b + b*c + a*c

J = (b - 1) * (c + a)

Programming Assignments

Deep Learning Honor Code (2 min)

Deep Learning Honor Code

We strongly encourage students to form study groups, and discuss the lecture videos (including in-video questions). We also encourage you to get together with friends to watch the videos together as a group. However, the answers that you submit for the review questions should be your own work. For the programming exercises, you are welcome to discuss them with other students, discuss specific algorithms, properties of algorithms, etc.; we ask only that you not look at any source code written by a different student, nor show your solution code to other students.

You are also not allowed to post your code publicly on github.

Programming Assignment FAQ (10 min)

Python Basics with numpy (optional) (1h)

Python Basics with numpy (optional)
Welcome to your first (Optional) programming exercise of the deep learning specialization. In this assignment you will:

Learn how to use numpy.
Implement some basic core deep learning functions such as the softmax, sigmoid, dsigmoid, etc…
Learn how to handle data by normalizing inputs and reshaping images.
Recognize the importance of vectorization.
Understand how python broadcasting works.

This assignment prepares you well for the upcoming assignment. Take your time to complete it and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify it. After you are done, submit your work and check your results. You need to score 70% to pass. Good luck :) !

Practice Programming Assignment: Python Basics with numpy (optional) (1h)

Logistic Regression with a Neural Network mindset (2h)

Logistic Regression with a Neural Network mindset
Welcome to the first (required) programming exercise of the deep learning specialization. In this notebook you will build your first image recognition algorithm. You will build a cat classifier that recognizes cats with 70% accuracy!

As you keep learning new techniques you will increase it to 80+ % accuracy on cat vs. non-cat datasets. By completing this assignment you will:

Work with logistic regression in a way that builds intuition relevant to neural networks.
Learn how to minimize the cost function.
Understand how derivatives of the cost are used to update parameters.

Take your time to complete this assignment and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify these comments. After you are done, submit your work and check your results. You need to score 70% to pass. Good luck :) !

Programming Assignment: Logistic Regression with a Neural Network mindset

Heroes of Deep Learning (Optional)

Pieter Abbeel interview (16 min)

Week 3 Shallow neural networks

Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

Learning Objectives

Understand hidden units and hidden layers
Be able to apply a variety of activation functions in a neural network.
Build your first forward and backward propagation with a hidden layer
Apply random initialization to your neural network
Become fluent with Deep Learning notations and Neural Network Representations
Build and train a neural network with one hidden layer.

Shallow Neural Network

Neural Networks Overview (4 min)

Neural Network Representation (5 min)

Computing a Neural Network’s Output (9 min)

Vectorizing across multiple examples (9 min)

Explanation for Vectorized Implementation (7 min)

Activation functions (10 min)

Why do you need non-linear activation functions? (5 min)

Derivatives of activation functions (7 min)

Gradient descent for Neural Networks (9 min)

Backpropagation intuition (optional) (15 min)

Random Initialization (7 min)

Practice Questions

Quiz: Shallow Neural Networks (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 1, 11:59 PM PDT

1 point
1.Which of the following are true? (Check all that apply.)

X is a matrix in which each column is one training example.

$a^{[2]}$ denotes the activation vector of the 2nd layer.

$a^{[2](12)}$ denotes the activation vector of the 2nd layer for the 12th training example.

$a^{[2]}_4$ is the activation output by the 4th neuron of the 2nd layer

X is a matrix in which each row is one training example.

$a^{[2](12)}$ denotes activation vector of the 12th layer on the 2nd training example.

$a^{[2]}_4$ is the activation output of the 2nd layer for the 4th training example

1 point
2.The tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True/False?

True

False

1 point
3.Which of these is a correct vectorized implementation of forward propagation for layer l, where 1≤l≤L?

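$Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]}$
$A^{[l+1]} = g^{[l+1]}(Z^{[l]})$

$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
$A^{[l]} = g^{[l]}(Z^{[l]})$

$Z^{[l]} = W^{[l-1]} A^{[l]} + b^{[l-1]}$
$A^{[l]} = g^{[l]}(Z^{[l]})$

$Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]}$
$A^{[l+1]} = g^{[l]}(Z^{[l]})$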
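As a concrete reference, here is a minimal sketch of one vectorized forward step: layer l takes the previous layer's activations, applies its own weights and bias, then its own activation function. The helper name `layer_forward`, the choice of tanh, and the layer sizes (3 units feeding 4 units, 5 examples) are assumptions made for illustration.

```python
import numpy as np

def layer_forward(A_prev, W, b, g=np.tanh):
    """One forward step: Z = W A_prev + b, then A = g(Z)."""
    Z = np.dot(W, A_prev) + b   # (n_l, n_prev) x (n_prev, m) + (n_l, 1) -> (n_l, m)
    A = g(Z)
    return Z, A

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))     # activations of layer l-1, one column per example
W = rng.standard_normal((4, 3)) * 0.01   # weights of layer l
b = np.zeros((4, 1))                     # bias of layer l, broadcast over the 5 examples
Z, A = layer_forward(A_prev, W, b)
print(Z.shape, A.shape)
```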

1 point
4.You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?

ReLU

Leaky ReLU

sigmoid

tanh

1 point
5.Consider the following code:

A = np.random.randn(4,3)
B = np.sum(A, axis = 1, keepdims = True)

What will be B.shape? (If you’re not sure, feel free to run this in python to find out).

(1, 3)

(, 3)

(4, )

(4, 1)
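The effect of `keepdims` is easiest to see side by side. In the sketch below, `B_flat` is an extra variable added for comparison and is not part of the quiz code.

```python
import numpy as np

A = np.random.randn(4, 3)
B = np.sum(A, axis=1, keepdims=True)   # summed axis is kept with length 1
B_flat = np.sum(A, axis=1)             # summed axis is dropped entirely

print(B.shape, B_flat.shape)
```

Keeping the reduced axis avoids rank-1 arrays, so the result broadcasts predictably against the original matrix.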

1 point
6.Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

Each neuron in the first hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.

Each neuron in the first hidden layer will compute the same thing, but neurons in different layers will compute different things, thus we have accomplished “symmetry breaking” as described in lecture.

The first hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
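The symmetry problem can be observed numerically. The sketch below trains a tiny network from all-zero parameters and then compares the hidden units. Everything about the setup is an assumption for illustration: 2 inputs, 3 sigmoid hidden units, 8 made-up examples, a learning rate of 0.5, and 50 iterations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8))                      # made-up inputs
Y = np.array([[1., 0., 1., 1., 0., 1., 1., 1.]])     # made-up labels
m = X.shape[1]

# All-zero initialization for both layers.
W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))

for _ in range(50):
    # Forward pass
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Backward pass for the logistic loss
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    # Gradient descent update
    W1 -= 0.5 * dW1
    b1 -= 0.5 * db1
    W2 -= 0.5 * dW2
    b2 -= 0.5 * db2

# Every hidden unit has ended up with the same weights as the first one.
rows_identical = np.allclose(W1, W1[0]) and np.allclose(b1, b1[0])
print(rows_identical)
```

Because every hidden unit starts identical and receives an identical gradient, the units never diverge, which is why random initialization is used for the weights.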

1 point
7.Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?

True

False

1 point
8.You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?

This will cause the inputs of the tanh to also be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values.

It doesn’t matter. So long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small.

This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.

This will cause the inputs of the tanh to also be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning.
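The slope of tanh is $1 - \tanh^2(z)$, so it helps to look at how that value behaves for small and large inputs. The helper name `tanh_grad` and the two sample inputs below are illustrative.

```python
import math

def tanh_grad(z):
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

grad_small = tanh_grad(0.5)    # input of moderate size
grad_large = tanh_grad(50.0)   # the kind of input produced by very large weights
print(grad_small, grad_large)
```

With large pre-activations the unit sits on the flat part of the curve, so the gradient that flows back through it is essentially zero.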

1 point
9.Consider the following 1 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

W[1] will have shape (2, 4)

b[1] will have shape (4, 1)

W[1] will have shape (4, 2)

b[1] will have shape (2, 1)

W[2] will have shape (1, 4)

b[2] will have shape (4, 1)

W[2] will have shape (4, 1)

b[2] will have shape (1, 1)

1 point
10.In the same network as the previous question, what are the dimensions of Z[1] and A[1]?

Z[1] and A[1] are (1,4)

Z[1] and A[1] are (4,m)

Z[1] and A[1] are (4,1)

Z[1] and A[1] are (4,2)

Programming Assignment

Planar data classification with a hidden layer (2h 30m)

Programming Assignment: Planar data classification with a hidden layer

Heroes of Deep Learning (Optional)

Ian Goodfellow interview (14 min)

Week 4 Deep Neural Networks

Understand the key computations underlying deep learning, use them to build and train deep neural networks, and apply them to computer vision.

See deep neural networks as successive blocks put one after another
Build and train a deep L-layer Neural Network
Analyze matrix and vector dimensions to check neural network implementations.
Understand how to use a cache to pass information from forward propagation to back propagation.
Understand the role of hyperparameters in deep learning

Deep Neural Network

Deep L-layer neural network (5 min)

Forward Propagation in a Deep Network (7 min)

Getting your matrix dimensions right (11 min)

Why deep representations? (10 min)

Building blocks of deep neural networks (8 min)

Forward and Backward Propagation (10 min)

Parameters vs Hyperparameters (7 min)

What does this have to do with the brain? (3 min)

Practice Questions

Quiz: Key concepts on Deep Neural Networks (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 5, 11:59 PM PST

1 point
1.What is the “cache” used for in our implementation of forward propagation and backward propagation?

We use it to pass variables computed during backward propagation to the corresponding forward propagation step. It contains useful values for forward propagation to compute activations.

It is used to cache the intermediate values of the cost function during training.

It is used to keep track of the hyperparameters that we are searching over, to speed up computation.

We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.

1 point
2.Among the following, which ones are “hyperparameters”? (Check all that apply.)

learning rate α

number of iterations

bias vectors b[l]

number of layers L in the neural network

activation values a[l]

weight matrices W[l]

size of the hidden layers n[l]

1 point
3.Which of the following statements is true?

The deeper layers of a neural network are typically computing more complex features of the input than the earlier layers.

The earlier layers of a neural network are typically computing more complex features of the input than the deeper layers.

1 point
4.Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l=1, 2, …,L. True/False?

True

False

1 point
5.Assume we store the values for n[l] in an array called layer_dims, as follows: layer_dims = [nx, 4,3,2,1]. So layer 1 has four hidden units, layer 2 has 3 hidden units and so on. Which of the following for-loops will allow you to initialize the parameters for the model?

for i in range(1, len(layer_dims)//2):
  parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
  parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01

for i in range(1, len(layer_dims)//2):
  parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
  parameter['b' + str(i)] = np.random.randn(layer_dims[i-1], 1) * 0.01

for i in range(1, len(layer_dims)):
  parameter['W' + str(i)] = np.random.randn(layer_dims[i-1], layer_dims[i]) * 0.01
  parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01

for i in range(1, len(layer_dims)):
  parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
  parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
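A quick way to evaluate loops like these is to run one and inspect the resulting shapes: layer i needs a weight matrix with one row per unit in layer i and one column per unit in layer i-1, plus a bias column with one entry per unit in layer i. The sketch below wraps such a loop in a hypothetical helper, `initialize_parameters`, and picks n_x = 5 arbitrarily.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Small random parameters for every layer 1..L (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for i in range(1, len(layer_dims)):
        # W_i maps layer i-1 activations (layer_dims[i-1]) to layer i units (layer_dims[i]).
        parameters["W" + str(i)] = rng.standard_normal((layer_dims[i], layer_dims[i - 1])) * 0.01
        parameters["b" + str(i)] = rng.standard_normal((layer_dims[i], 1)) * 0.01
    return parameters

params = initialize_parameters([5, 4, 3, 2, 1])   # n_x = 5 is an arbitrary choice
for name in sorted(params):
    print(name, params[name].shape)
```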

1 point
6.Consider the following neural network.

How many layers does this network have?

The number of layers L is 4. The number of hidden layers is 3.

The number of layers L is 3. The number of hidden layers is 3.

The number of layers L is 4. The number of hidden layers is 4.

The number of layers L is 5. The number of hidden layers is 4.

1 point
7.During forward propagation, in the forward function for a layer l you need to know what is the activation function in a layer (Sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know what is the activation function for layer l, since the gradient depends on it. True/False?

True

False

1 point
8.There are certain functions with the following properties:

(i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an exponentially smaller network. True/False?

True

False

1 point
9.Consider the following 2 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

W[1] will have shape (4, 4)

b[1] will have shape (4, 1)

W[1] will have shape (3, 4)

b[1] will have shape (3, 1)

W[2] will have shape (3, 4)

b[2] will have shape (1, 1)

W[2] will have shape (3, 1)

b[2] will have shape (3, 1)

W[3] will have shape (3, 1)

b[3] will have shape (1, 1)

W[3] will have shape (1, 3)

b[3] will have shape (3, 1)

1 point
10.Whereas the previous question used a specific network, in the general case what is the dimension of $W^{[l]}$, the weight matrix associated with layer l?

W[l] has shape (n[l−1],n[l])

W[l] has shape (n[l],n[l+1])

W[l] has shape (n[l+1],n[l])

W[l] has shape (n[l],n[l−1])

Programming Assignments

Building your Deep Neural Network: Step by Step (2h 30m)

Programming Assignment: Building your deep neural network: Step by Step

Deep Neural Network - Application (1h)

Programming Assignment: Deep Neural Network Application

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Course can be found here
Lecture slides can be found here

About this course: This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

After 3 weeks, you will:

Understand industry best-practices for building deep learning applications.
Be able to effectively use the common neural network “tricks”, including initialization, L2 and dropout regularization, Batch normalization, and gradient checking.
Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
Be able to implement a neural network in TensorFlow.

This is the second course of the Deep Learning Specialization.

Who is this class for: This class is for:
- Learners that took the first course of the specialization: “Neural Networks and Deep Learning”
- Anyone that already understands fully-connected neural networks, and wants to learn the practical aspects of making them work well.

Week 1 Practical aspects of Deep Learning

Learning Objectives

Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation

Setting up your Machine Learning Application

Train / Dev / Test sets (12 min)

Bias / Variance (8 min)

Basic Recipe for Machine Learning (6 min)

Regularizing your neural network

Regularization (9 min)

Why regularization reduces overfitting? (7 min)

Dropout Regularization (9 min)

Understanding Dropout (7 min)

Other regularization methods (8 min)

Setting up your optimization problem

Normalizing inputs (5 min)

Vanishing / Exploding gradients (6 min)

Weight Initialization for Deep Networks (6 min)

Numerical approximation of gradients (6 min)

Gradient checking (6 min)

Gradient Checking Implementation Notes (5 min)

Practice Questions

Quiz: Practical aspects of deep learning (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 15, 11:59 PM PDT

1 point
1.If you have 10,000,000 examples, how would you split the train/dev/test set?

60% train . 20% dev . 20% test

98% train . 1% dev . 1% test

33% train . 33% dev . 33% test

1 point
2.The dev and test set should:

Come from the same distribution

Come from different distributions

Be identical to each other (same (x,y) pairs)

Have the same number of examples

1 point
3.If your Neural Network model seems to have high variance, which of the following would be promising things to try?

Add regularization

Get more training data

Increase the number of units in each hidden layer

Get more test data

Make the Neural Network deeper

1 point
4.You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

Increase the regularization parameter lambda

Decrease the regularization parameter lambda

Get more training data

Use a bigger neural network

1 point
5.What is weight decay?

A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.

A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.

The process of gradually decreasing the learning rate during training.

Gradual corruption of the weights in the neural network if it is trained on noisy data.

1 point
6.What happens when you increase the regularization hyperparameter lambda?

Weights are pushed toward becoming smaller (closer to 0)

Weights are pushed toward becoming bigger (further from 0)

Doubling lambda should roughly result in doubling the weights

Gradient descent taking bigger steps with each iteration (proportional to lambda)
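The link between L2 regularization and "weight decay" can be written out in one step. This is a sketch in the course's usual notation, assuming a cost with the Frobenius-norm penalty, learning rate $\alpha$, $m$ training examples, and $dW^{[l]}$ standing for the gradient of the unregularized cost:

```latex
\begin{aligned}
J_{\text{reg}} &= J + \frac{\lambda}{2m} \sum_{l} \lVert W^{[l]} \rVert_F^2 \\
\frac{\partial J_{\text{reg}}}{\partial W^{[l]}} &= dW^{[l]} + \frac{\lambda}{m} W^{[l]} \\
W^{[l]} &:= W^{[l]} - \alpha \left( dW^{[l]} + \frac{\lambda}{m} W^{[l]} \right)
        = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[l]} - \alpha \, dW^{[l]}
\end{aligned}
```

For small $\alpha\lambda/m$ the factor $(1 - \alpha\lambda/m)$ is slightly below 1, so each step first multiplies the weights by a number a little less than 1 and then applies the usual gradient step. A larger $\lambda$ makes that shrinkage stronger.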

1 point
7.With the inverted dropout technique, at test time:

You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training

You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
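A minimal sketch of inverted dropout for one layer's activations, assuming NumPy. The function name `dropout_forward` and its `training` flag are illustrative and not taken from the course assignment. During training the random mask is applied and the result is divided by `keep_prob`, so the expected activation is unchanged; at test time the activations pass through untouched.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng=None):
    """Apply inverted dropout to activations `a` (hypothetical helper)."""
    if not training:
        # Test time: no units are dropped and no 1/keep_prob scaling is applied.
        return a
    rng = np.random.default_rng() if rng is None else rng
    # Keep each unit independently with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    # Scale the surviving units so the expected value of the output matches `a`.
    return a * mask / keep_prob

a = np.ones((4, 1000))
train_out = dropout_forward(a, keep_prob=0.8, training=True,
                            rng=np.random.default_rng(0))
test_out = dropout_forward(a, keep_prob=0.8, training=False)
```

Raising `keep_prob` toward 1 drops fewer units, so the mask perturbs the network less.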

1 point
8.Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)

Increasing the regularization effect

Reducing the regularization effect

Causing the neural network to end up with a higher training set error

Causing the neural network to end up with a lower training set error

1 point
9.Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)

Dropout

Data augmentation

Vanishing gradient

Xavier initialization

Gradient Checking

Exploding gradient

L2 regularization

1 point
10.Why do we normalize the inputs x?

It makes the parameter initialization faster

It makes the cost function faster to optimize

It makes it easier to visualize the data

Normalization is another word for regularization; it helps to reduce variance
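As an illustration of input normalization, here is a small sketch assuming NumPy and the course convention that examples are stored as columns. The helper names `fit_normalizer` and `normalize` are made up for this example. The statistics are computed on the training set and then reused for the test set, so both go through the same transformation.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Return per-feature mean and std computed on the training set only.

    X_train has shape (n_features, m_examples).
    """
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma + eps  # eps guards against constant features

def normalize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales.
X_train = np.vstack([rng.normal(1000.0, 300.0, 500), rng.normal(0.5, 0.1, 500)])
X_test = np.vstack([rng.normal(1000.0, 300.0, 100), rng.normal(0.5, 0.1, 100)])

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)  # reuse the training statistics
```

After this step both features have roughly zero mean and unit variance on the training set, which tends to make the cost surface less elongated.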

Programming assignments

Initialization1h

Programming Assignment: Initialization

Regularization1h 30m

Programming Assignment: Regularization1h

https://www.coursera.org/learn/deep-neural-network/programming/SXQaI

Gradient Checking1h

Programming Assignment: Gradient Checking1h

https://www.coursera.org/learn/deep-neural-network/programming/n6NBD

Heroes of Deep Learning (Optional)

Yoshua Bengio interview25 min

Week 2 Optimization algorithms

Learning Objectives

Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
Use random minibatches to accelerate the convergence and improve the optimization
Know the benefits of learning rate decay and apply it to your optimization

Optimization algorithms

Mini-batch gradient descent11 min

Understanding mini-batch gradient descent11 min

Exponentially weighted averages5 min

Understanding exponentially weighted averages9 min

Bias correction in exponentially weighted averages4 min

Gradient descent with momentum9 min

RMSprop7 min

Adam optimization algorithm7 min

Learning rate decay6 min

The problem of local optima5 min

Practice Questions

Quiz: Optimization algorithms10 questions

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

$a^{[8]\{3\}(7)}$

$a^{[8]\{7\}(3)}$

$a^{[3]\{8\}(7)}$

$a^{[3]\{7\}(8)}$

1 point
2.Which of these statements about mini-batch gradient descent do you agree with?

You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.

One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

1 point
3.Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

If the mini-batch size is 1, you end up having to process the entire training set before making any progress.

If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.

If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.

If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
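To make the two extremes concrete, here is a sketch of how a training set might be shuffled and cut into mini-batches, assuming NumPy and examples stored as columns. The function `random_mini_batches` is written for this illustration and is not the assignment's reference code.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples (columns) and split them into mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    # The last mini-batch may be smaller when batch_size does not divide m.
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

X = np.arange(20, dtype=float).reshape(2, 10)   # 2 features, m = 10 examples
Y = np.arange(10).reshape(1, 10)

batches = random_mini_batches(X, Y, batch_size=4)
sizes = [xb.shape[1] for xb, _ in batches]
```

With `batch_size=1` the list has one entry per example (stochastic gradient descent), and with `batch_size=m` it has a single entry (batch gradient descent).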

1 point
4.Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

Which of the following do you agree with?

If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.

If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.

Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.

Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.

1 point
5.Suppose the temperature in Casablanca over the first three days of January is the same:

Jan 1st: $\theta_1 = 10^\circ\text{C}$
Jan 2nd: $\theta_2 = 10^\circ\text{C}$
(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)

Say you use an exponentially weighted average with β = 0.5 to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)

$v_2 = 10$, $v_2^{corrected} = 10$

$v_2 = 7.5$, $v_2^{corrected} = 7.5$

$v_2 = 10$, $v_2^{corrected} = 7.5$

$v_2 = 7.5$, $v_2^{corrected} = 10$
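The recurrence and its bias correction can be traced step by step. The sketch below is plain Python; the function name `ewa` is chosen for this example. It returns both the raw value and the corrected value (the raw value divided by 1 − β^t) after each day.

```python
def ewa(thetas, beta):
    """Return (raw, corrected) exponentially weighted averages after each step."""
    v = 0.0  # v_0 = 0
    raw, corrected = [], []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        # Bias correction divides by (1 - beta^t) to undo the pull toward v_0 = 0.
        corrected.append(v / (1 - beta ** t))
    return raw, corrected

raw, corrected = ewa([10.0, 10.0], beta=0.5)
```

With a constant input, the corrected average matches the input at every step, while the uncorrected one starts low and climbs toward it.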

1 point
6.Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.

$\alpha = \frac{1}{\sqrt{t}} \alpha_0$

$\alpha = \frac{1}{1 + 2t} \alpha_0$

$\alpha = e^{t} \alpha_0$

$\alpha = 0.95^{t} \alpha_0$
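One quick way to compare such schedules is to evaluate them over a few epochs and check whether the learning rate actually shrinks. This is a plain-Python sketch; the initial rate `alpha0 = 0.1` and the schedule labels are arbitrary choices for illustration.

```python
import math

alpha0 = 0.1  # illustrative initial learning rate

schedules = {
    "inverse sqrt": lambda t: alpha0 / math.sqrt(t),
    "inverse time": lambda t: alpha0 / (1 + 2 * t),
    "exponential growth": lambda t: math.exp(t) * alpha0,
    "exponential decay": lambda t: (0.95 ** t) * alpha0,
}

epochs = range(1, 6)
values = {name: [f(t) for t in epochs] for name, f in schedules.items()}

def is_decreasing(seq):
    """True when every value is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(seq, seq[1:]))

decays = {name: is_decreasing(v) for name, v in values.items()}
```

A schedule whose entry in `decays` is `False` grows with the epoch number, which is the opposite of what learning rate decay is meant to do.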

1 point
7.You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)

Decreasing β will shift the red line slightly to the right.

Increasing β will shift the red line slightly to the right.

Decreasing β will create more oscillation within the red line.

Increasing β will create more oscillations within the red line.

1 point
8.Consider this figure:

These plots were generated with gradient descent; with gradient descent with momentum (β = 0.5) and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

(1) is gradient descent. (2) is gradient descent with momentum (large β) . (3) is gradient descent with momentum (small β)

(1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (small β), (3) is gradient descent

(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)

(1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)

1 point
9.Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $J$? (Check all that apply)

Try better random initialization for the weights

Try tuning the learning rate α

Try mini-batch gradient descent

Try using Adam

Try initializing all the weights to zero

1 point
10.Which of the following statements about Adam is False?

Adam combines the advantages of RMSProp and momentum

Adam should be used with batch gradient computations, not with mini-batches.

We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$ and $\varepsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$)

The learning rate hyperparameter α in Adam usually needs to be tuned.
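For reference, a single Adam update can be sketched as below, assuming NumPy. The function `adam_step` and the toy quadratic cost are illustrative only. The update keeps a momentum-style average of the gradients and an RMSprop-style average of their squares, applies bias correction to both, and uses the default β1, β2 and ε listed above. In practice the gradient passed in is usually computed on a mini-batch.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `w`; `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum-style first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop-style second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize J(w) = sum(w^2), whose gradient is 2w.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
start_cost = float(np.sum(w ** 2))
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.05)
end_cost = float(np.sum(w ** 2))
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly α. This is one reason α is the hyperparameter that usually needs tuning.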

Programming assignment

Optimization2h

Programming Assignment: Optimization30 min

https://www.coursera.org/learn/deep-neural-network/programming/Ckiv2

Heroes of Deep Learning (Optional)

Yuanqing Lin interview13 min

Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks

Master the process of hyperparameter tuning

Hyperparameter tuning

Tuning process7 min

Using an appropriate scale to pick hyperparameters8 min

Hyperparameters tuning in practice: Pandas vs. Caviar6 min

Batch Normalization

Normalizing activations in a network8 min

Fitting Batch Norm into a neural network12 min

Why does Batch Norm work?11 min

Batch Norm at test time5 min

Multi-class classification

Softmax Regression11 min

Training a softmax classifier10 min

Introduction to programming frameworks

Deep learning frameworks4 min

TensorFlow16 min

Practice Questions

Quiz:Hyperparameter tuning, Batch Normalization, Programming Frameworks10 questions

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 29, 11:59 PM PDT

1 point
1.If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?

True

False

1 point
2.Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?

True

False

1 point
3.During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:

Whether you use batch or mini-batch optimization

The presence of local minima (and saddle points) in your neural network

The amount of computational power you can access

The number of hyperparameters you have to tune

1 point
4.If you think β (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?

r = np.random.rand()
beta = r*0.09 + 0.9

r = np.random.rand()
beta = 1-10**(- r - 1)

r = np.random.rand()
beta = 1-10**(- r + 1)

r = np.random.rand()
beta = r*0.9 + 0.09
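The idea of picking hyperparameters on an appropriate scale can be tried out directly. The sketch below assumes NumPy; `sample_beta` is a hypothetical helper that samples 1 − β uniformly on a log scale between 0.01 and 0.1, then measures how the samples are spread over the linear range.

```python
import numpy as np

def sample_beta(rng, low=0.9, high=0.99):
    """Sample beta so that (1 - beta) is uniform on a log scale."""
    # 1 - beta ranges over [1 - high, 1 - low], i.e. [0.01, 0.1] by default.
    log_lo, log_hi = np.log10(1 - high), np.log10(1 - low)
    r = rng.uniform(log_lo, log_hi)
    return 1 - 10 ** r

rng = np.random.default_rng(0)
samples = np.array([sample_beta(rng) for _ in range(10_000)])
# Fraction of samples in the lower half of the linear range, [0.9, 0.945).
frac_low_half = float(np.mean(samples < 0.945))
```

Roughly three quarters of the samples should land in the upper half of the interval, where small changes in β matter most: β = 0.9 averages over about 10 values, while β = 0.99 averages over about 100.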

1 point
5.Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?

True

False

1 point
6.In batch normalization as presented in the videos, if you apply it on the $l$-th layer of your neural network, what are you normalizing?

$z^{[l]}$

$b^{[l]}$

$W^{[l]}$

$a^{[l]}$

1 point
7.In the normalization formula $z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, why do we use epsilon?

To speed up convergence

To avoid division by zero

To have a more accurate normalization

In case μ is too small

1 point
8.Which of the following statements about γ and β in Batch Norm are true?

There is one global value of $\gamma \in \mathbb{R}$ and one global value of $\beta \in \mathbb{R}$ for each layer, which applies to all the hidden units in that layer.

β and γ are hyperparameters of the algorithm, which we tune via random sampling.

They set the mean and variance of the linear variable $z^{[l]}$ of a given layer.

They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.

The optimal values are $\gamma = \sqrt{\sigma^2 + \varepsilon}$, and $\beta = \mu$.

1 point
9.After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:

Perform the needed normalizations, use $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across mini-batches seen during training.

Skip the step where you normalize using $\mu$ and $\sigma^2$ since a single test example cannot be normalized.

Use the most recent mini-batch’s value of $\mu$ and $\sigma^2$ to perform the needed normalizations.

If you implemented Batch Norm on mini-batches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a mini-batch the same size as during training.
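The difference between training-time and test-time Batch Norm is easier to see in code. The class below is a minimal sketch assuming NumPy; the name `BatchNorm1D`, the `momentum` value and the forward-only design are assumptions made for illustration. During training it normalizes with the current mini-batch statistics and keeps exponentially weighted averages of them; at test time it uses those running averages.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch norm over z of shape (n_units, batch_size); illustrative only."""

    def __init__(self, n_units, momentum=0.9, eps=1e-8):
        self.gamma = np.ones((n_units, 1))    # learnable scale, one per unit
        self.beta = np.zeros((n_units, 1))    # learnable shift, one per unit
        self.running_mu = np.zeros((n_units, 1))
        self.running_var = np.ones((n_units, 1))
        self.momentum, self.eps = momentum, eps

    def forward(self, z, training):
        if training:
            mu = z.mean(axis=1, keepdims=True)
            var = z.var(axis=1, keepdims=True)
            # Exponentially weighted averages of the mini-batch statistics.
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Test time: use the running estimates, so a single example works.
            mu, var = self.running_mu, self.running_var
        z_norm = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_norm + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(n_units=3)
for _ in range(200):
    bn.forward(rng.normal(5.0, 2.0, size=(3, 64)), training=True)

train_out = bn.forward(rng.normal(5.0, 2.0, size=(3, 64)), training=True)
single_out = bn.forward(np.full((3, 1), 5.0), training=False)
```

Here `gamma` and `beta` have one entry per hidden unit and would be updated by the optimizer along with the weights; this sketch leaves out the backward pass.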

1 point
10.Which of these statements about deep learning programming frameworks are true? (Check all that apply)

Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than become closed or modified to benefit only one company.

Deep learning programming frameworks require cloud-based machines to run.

A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.

Programming assignment

Tensorflow3h

Programming Assignment: Tensorflow

Structuring Machine Learning Projects

About this course: You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.

Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.

After 2 weeks, you will:

Understand how to diagnose errors in a machine learning system, and
Be able to prioritize the most promising directions for reducing error
Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
Know how to apply end-to-end learning, transfer learning, and multi-task learning

I’ve seen teams waste months or years through not understanding the principles taught in this course. I hope this two week course will save you months of time.

This is a standalone course, and you can take this so long as you have basic machine learning knowledge. This is the third course in the Deep Learning Specialization.

Who is this class for: Pre-requisites:
This course is aimed at individuals with basic knowledge of machine learning, who want to know how to set technical direction and prioritization for their work.
It is recommended that you take course one and two of this specialization (Neural Networks and Deep Learning, and Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization) prior to beginning this course.

Course can be found here

Week 1 ML Strategy (1)

Learning Objectives

Understand why Machine Learning strategy is important
Apply satisficing and optimizing metrics to set up your goal for ML projects
Choose a correct train/dev/test split of your dataset
Understand how to define human-level performance
Use human-level performance to define your key priorities in ML projects
Take the correct ML Strategic decision based on observations of performances and dataset

Introduction to ML Strategy

Why ML Strategy2 min

Orthogonalization10 min

Setting up your goal

Single number evaluation metric7 min

Satisficing and Optimizing metric5 min

Train/dev/test distributions6 min

Size of the dev and test sets5 min

When to change dev/test sets and metrics11 min

Comparing to human-level performance

Why human-level performance?5 min

Avoidable bias6 min

Understanding human-level performance11 min

Surpassing human-level performance6 min

Improving your model performance6 min

Machine Learning flight simulator

Machine Learning flight simulator2 min

Quiz: Bird recognition in the city of Peacetopia (case study)15 questions

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 26, 11:59 PM PST

Question 1 (1 point)
Problem Statement

This example is adapted from a real production application, but with details disguised to protect confidentiality.

You are a famous researcher in the City of Peacetopia. The people of Peacetopia have a common characteristic: they are afraid of birds. To save them, you have to build an algorithm that will detect any bird flying over Peacetopia and alert the population.

The City Council gives you a dataset of 10,000,000 images of the sky above Peacetopia, taken from the city’s security cameras. They are labelled:

y = 0: There is no bird on the image
y = 1: There is a bird on the image

Your goal is to build an algorithm able to classify new images taken by security cameras from Peacetopia.

There are a lot of decisions to make:

What is the evaluation metric?
How do you structure your data into train/dev/test sets?

Metric of success

The City Council tells you that they want an algorithm that

Has high accuracy
Runs quickly and takes only a short time to classify a new image.
Can fit in a small amount of memory, so that it can run in a small processor that the city will attach to many different security cameras.

Note: Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate. True/False?

True

False

Question 2 (1 point)
After further discussions, the city narrows down its criteria to:

“We need an algorithm that can let us know a bird is flying over Peacetopia as accurately as possible.”
“We want the trained model to take no more than 10sec to classify a new image.”
“We want the model to fit in 10MB of memory.”

If you had the three following models, which one would you choose?

Test Accuracy	Runtime	Memory size
97%	1 sec	3MB

Test Accuracy	Runtime	Memory size
99%	13 sec	9MB

Test Accuracy	Runtime	Memory size
97%	3 sec	2MB

Test Accuracy	Runtime	Memory size
98%	9 sec	9MB

Question 3 (1 point)
Based on the city’s requests, which of the following would you say is true?

Accuracy is an optimizing metric; running time and memory size are satisficing metrics.

Accuracy is a satisficing metric; running time and memory size are optimizing metrics.

Accuracy, running time and memory size are all optimizing metrics because you want to do well on all three.

Accuracy, running time and memory size are all satisficing metrics because you have to do sufficiently well on all three for your system to be acceptable.
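The split between satisficing and optimizing metrics translates into a simple selection rule: discard every model that misses a threshold, then take the best remaining one on the single metric you optimize. The sketch below is plain Python with made-up candidates (not the ones from the quiz table); `Model` and `pick_model` are names chosen for this illustration.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    accuracy: float    # optimizing metric: higher is better
    runtime_s: float   # satisficing metric: must be <= max_runtime_s
    memory_mb: float   # satisficing metric: must be <= max_memory_mb

def pick_model(models, max_runtime_s, max_memory_mb):
    """Keep models that meet every satisficing threshold, then maximize accuracy."""
    feasible = [
        m for m in models
        if m.runtime_s <= max_runtime_s and m.memory_mb <= max_memory_mb
    ]
    return max(feasible, key=lambda m: m.accuracy) if feasible else None

# Hypothetical candidates, not the ones from the quiz table.
candidates = [
    Model("A", accuracy=0.90, runtime_s=2, memory_mb=4),
    Model("B", accuracy=0.95, runtime_s=20, memory_mb=8),   # too slow
    Model("C", accuracy=0.93, runtime_s=8, memory_mb=9),
    Model("D", accuracy=0.94, runtime_s=5, memory_mb=50),   # too large
]
best = pick_model(candidates, max_runtime_s=10, max_memory_mb=10)
```

Once a model is under the runtime and memory limits, being even faster or smaller earns nothing extra under this rule; only accuracy breaks ties.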

Question 4 (1 point)
Structuring your data

Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?

Train	Dev	Test
6,000,000	1,000,000	3,000,000

Train	Dev	Test
9,500,000	250,000	250,000

Train	Dev	Test
3,333,334	3,333,333	3,333,333

Train	Dev	Test
6,000,000	3,000,000	1,000,000

Question 5 (1 point)
After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the “citizens’ data”. Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.

You should not add the citizens’ data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?

True

False

Question 6 (1 point)
One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens’ data images to the test set. You object because:

This would cause the dev and test set distributions to become different. This is a bad idea because you’re not aiming where you want to hit.

A bigger test set will slow down the speed of iterating because of the computational expense of evaluating models on the test set.

The 1,000,000 citizens’ data images do not have a consistent x → y mapping as the rest of the data (similar to the New York City/Detroit housing prices example from lecture).

The test set no longer reflects the distribution of data (security cameras) you most care about.

Question 7 (1 point)
You train a system, and its errors are as follows (error = 100%-Accuracy):

Training set error	4.0%
Dev set error	4.5%

This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?

Yes, because having 4.0% training error shows you have high bias.

Yes, because this shows your bias is higher than your variance.

No, because this shows your variance is higher than your bias.

No, because there is insufficient information to tell.

Question 8 (1 point)
You ask a few people to label the dataset so as to find out what is human-level performance. You find the following levels of accuracy:

Bird watching expert #1	0.3% error
Bird watching expert #2	0.5% error
Normal person #1 (not a bird watching expert)	1.0% error
Normal person #2 (not a bird watching expert)	1.2% error

If your goal is to have “human-level performance” be a proxy (or estimate) for Bayes error, how would you define “human-level performance”?

0.0% (because it is impossible to do better than this)

0.3% (accuracy of expert #1)

0.4% (average of 0.3 and 0.5)

0.75% (average of all four numbers above)

Question 9 (1 point)
Which of the following statements do you agree with?

A learning algorithm’s performance can be better than human-level performance but it can never be better than Bayes error.

A learning algorithm’s performance can never be better than human-level performance but it can be better than Bayes error.

A learning algorithm’s performance can never be better than human-level performance nor better than Bayes error.

A learning algorithm’s performance can be better than human-level performance and better than Bayes error.

Question 10 (1 point)
You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following:

Human-level performance	0.1%
Training set error	2.0%
Dev set error	2.1%

Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)

Try decreasing regularization.

Train a bigger model to try to do better on the training set.

Try increasing regularization.

Get a bigger training set to reduce variance.

Question 11 (1 point)
You also evaluate your model on the test set, and find the following:

Human-level performance	0.1%
Training set error	2.0%
Dev set error	2.1%
Test set error	7.0%

What does this mean? (Check the two best options.)

You should try to get a bigger dev set.

You have underfit to the dev set.

You have overfit to the dev set.

You should get a bigger test set.

Question 12 (1 point)
After working on this project for a year, you finally achieve:

Human-level performance	0.10%
Training set error	0.05%
Dev set error	0.05%

What can you conclude? (Check all that apply.)

If the test set is big enough for the 0.05% error estimate to be accurate, this implies Bayes error is ≤ 0.05%

This is a statistical anomaly (or must be the result of statistical noise) since it should not be possible to surpass human-level performance.

With only 0.09% further progress to make, you should quickly be able to close the remaining gap to 0%

It is now harder to measure avoidable bias, thus progress will be slower going forward.

Question 13 (1 point)
It turns out Peacetopia has hired one of your competitors to build a system as well. Your system and your competitor both deliver systems with about the same running time and memory size. However, your system has higher accuracy! However, when Peacetopia tries out your and your competitor’s systems, they conclude they actually like your competitor’s system better, because even though you have higher overall accuracy, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?

Look at all the models you’ve developed during the development process and find the one with the lowest false negative error rate.

Ask your team to take into account both accuracy and false negative rate during development.

Rethink the appropriate metric for this task, and ask your team to tune to the new metric.

Pick false negative rate as the new metric, and use this new metric to drive all further development.

Question 14 (1 point)
You’ve handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your system is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?

Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.

Put the 1,000 images into the training set so as to try to do better on these birds.

Try data augmentation/data synthesis to get more images of the new type of bird.

Add the 1,000 images into your dataset and reshuffle into a new train/dev/test split.

Question 15 (1 point)
The City Council thinks that having more Cats in the city would help scare off birds. They are so happy with your work on the Bird detector that they also hire you to build a Cat detector. (Wow, Cat detectors are just incredibly useful, aren’t they?) Because of years of working on Cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that apply.)

Having built a good Bird detector, you should be able to take the same model and hyperparameters and just apply it to the Cat dataset, so there is no need to iterate.

Needing two weeks to train will limit the speed at which you can iterate.

If 100,000,000 examples is enough to build a good enough Cat detector, you might be better off training with just 10,000,000 examples to gain a ≈10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it’s trained on less data.

Buying faster computers could speed up your team’s iteration speed and thus your team’s productivity.

Heroes of Deep Learning (Optional)

Andrej Karpathy interview15 min

Week 2 ML Strategy (2)

Learning Objectives

Understand what multi-task learning and transfer learning are
Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

Error Analysis

Carrying out error analysis10 min

Cleaning up incorrectly labeled data13 min

Build your first system quickly, then iterate6 min

Mismatched training and dev/test set

Training and testing on different distributions10 min

Bias and Variance with mismatched data distributions18 min

Addressing data mismatch10 min

Learning from multiple tasks

Transfer learning11 min

Multi-task learning12 min

End-to-end deep learning

What is end-to-end deep learning?11 min

Whether to use end-to-end deep learning10 min

Machine Learning flight simulator

Quiz: Autonomous driving (case study)15 questions

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: December 3, 11:59 PM PST

Question 1 (1 point)
To help you practice strategies for machine learning, in this week we’ll present another scenario and ask how you would act. We think this “simulator” of working in a machine learning project will give a taste of what leading a machine learning project could be like!

You are employed by a startup building self-driving cars. You are in charge of detecting road signs (stop sign, pedestrian crossing sign, construction ahead sign) and traffic signals (red and green lights) in images. The goal is to recognize which of these objects appear in each image. As an example, the above image contains a pedestrian crossing sign and red traffic lights.

Your 100,000 labeled images are taken using the front-facing camera of your car. This is also the distribution of data you care most about doing well on. You think you might be able to get a much larger dataset off the internet, that could be helpful for training even if the distribution of internet data is not the same.

You are just getting started on this project. What is the first thing you do? Assume each of the steps below would take about an equal amount of time (a few days).

Spend a few days collecting more data using the front-facing camera of your car, to better understand how much data per unit time you can collect.

Spend a few days getting the internet data, so that you understand better what data is available.

Spend a few days checking what is human-level performance for these tasks so that you can get an accurate estimate of Bayes error.

Spend a few days training a basic model and see what mistakes it makes.

Question 2 (1 point)
Your goal is to detect road signs (stop sign, pedestrian crossing sign, construction ahead sign) and traffic signals (red and green lights) in images. The goal is to recognize which of these objects appear in each image. You plan to use a deep neural network with ReLU units in the hidden layers.

A softmax activation would be a good choice for the output layer because this is a multi-task learning problem. True/False?

True

False

Question 3 (1 point)
You are carrying out error analysis and counting up what errors the algorithm makes. Which of these datasets do you think you should manually go through and carefully examine, one image at a time?

500 images on which the algorithm made a mistake

10,000 images on which the algorithm made a mistake

500 randomly chosen images

10,000 randomly chosen images

Question 4 (1 point)
After working on the data for several weeks, your team ends up with the following data:

100,000 labeled images taken using the front-facing camera of your car.
900,000 labeled images of roads downloaded from the internet.
Each image’s labels precisely indicate the presence of any specific road signs and traffic signals or combinations of them. For example, $y^{(i)} = [1, 0, 0, 1, 0]^T$ means the image contains a stop sign and a red traffic light.

Because this is a multi-task learning problem, you need to have all your $y^{(i)}$ vectors fully labeled. If one example is equal to $[0, ?, 1, 1, ?]^T$ then the learning algorithm will not be able to use that example. True/False?

True

False
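One way to handle label vectors with unknown entries (the ones marked "?") is to sum the loss only over the entries that are labeled. The sketch below assumes NumPy, encodes "?" as `np.nan`, and treats each output as an independent yes/no prediction (a sigmoid per label rather than a softmax across labels). The function name `masked_multitask_loss` is hypothetical.

```python
import numpy as np

def masked_multitask_loss(y_hat, y):
    """Mean logistic loss over labeled entries only.

    y_hat: predicted probabilities, shape (n_tasks, m)
    y:     labels in {0, 1}, with np.nan marking unknown ("?") entries
    """
    labeled = ~np.isnan(y)
    y_filled = np.where(labeled, y, 0.0)          # placeholder value, masked out below
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)      # avoid log(0)
    per_entry = -(y_filled * np.log(y_hat) + (1 - y_filled) * np.log(1 - y_hat))
    return float(np.sum(per_entry * labeled) / np.sum(labeled))

# One example with 5 tasks; tasks 2 and 5 are unlabeled.
y = np.array([[0.0], [np.nan], [1.0], [1.0], [np.nan]])
y_hat = np.array([[0.1], [0.7], [0.8], [0.9], [0.3]])
loss = masked_multitask_loss(y_hat, y)
```

Because the unlabeled entries are multiplied by zero, the predictions for those tasks do not affect the loss, and the example still contributes through its three labeled entries.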

Question 5 (1 point)
The distribution of data you care about contains images from your car’s front-facing camera; which comes from a different distribution than the images you were able to find and download off the internet. How should you split the dataset into train/dev/test sets?

Choose the training set to be the 900,000 images from the internet along with 20,000 images from your car’s front-facing camera. The 80,000 remaining images will be split equally in dev and test sets.

Mix all the 100,000 images with the 900,000 images you found online. Shuffle everything. Split the 1,000,000 images dataset into 600,000 for the training set, 200,000 for the dev set and 200,000 for the test set.

Choose the training set to be the 900,000 images from the internet along with 80,000 images from your car’s front-facing camera. The 20,000 remaining images will be split equally in dev and test sets.

Mix all the 100,000 images with the 900,000 images you found online. Shuffle everything. Split the 1,000,000 images dataset into 980,000 for the training set, 10,000 for the dev set and 10,000 for the test set.
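
To compare the four options, it can help to count how many dev-set images would come from the car camera in each case. The sketch below does only that arithmetic, using the image counts from the question; the helper names `mixed_dev_car_images` and `car_only_dev_size` are hypothetical.

```python
CAR, WEB = 100_000, 900_000   # image counts given in the question

def mixed_dev_car_images(dev_size):
    """Expected number of car images in a dev set drawn from the shuffled pool."""
    return dev_size * CAR / (CAR + WEB)

def car_only_dev_size(car_in_train):
    """Dev-set size when the car images left out of training are split equally into dev and test."""
    return (CAR - car_in_train) // 2

options = {
    "internet + 20,000 car images in train": (car_only_dev_size(20_000), 1.0),
    "shuffle, 600k/200k/200k": (200_000, mixed_dev_car_images(200_000) / 200_000),
    "internet + 80,000 car images in train": (car_only_dev_size(80_000), 1.0),
    "shuffle, 980k/10k/10k": (10_000, mixed_dev_car_images(10_000) / 10_000),
}
for name, (dev_size, car_share) in options.items():
    print(f"{name}: dev set of {dev_size:,} images, {car_share:.0%} from the car camera")
```

Under either shuffled option, only about 10% of the dev set is expected to come from the distribution the question says you care about.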

Question 6 (1 point)
Assume you’ve finally chosen the following split of the data:

Dataset:	Contains:	Error of the algorithm:
Training	940,000 images randomly picked from (900,000 internet images + 60,000 car’s front-facing camera images)	8.8%
Training-Dev	20,000 images randomly picked from (900,000 internet images + 60,000 car’s front-facing camera images)	9.1%
Dev	20,000 images from your car’s front-facing camera	14.3%
Test	20,000 images from the car’s front-facing camera	14.8%

You also know that human-level error on the road sign and traffic signals classification task is around 0.5%. Which of the following are True? (Check all that apply).

You have a large avoidable-bias problem because your training error is quite a bit higher than the human-level error.

You have a large variance problem because your model is not generalizing well to data from the same training distribution that it has never seen before.

Your algorithm overfits the dev set because the error of the dev and test sets are very close.

You have a large data-mismatch problem because your model does a lot better on the training-dev set than on the dev set

You have a large variance problem because your training error is quite a bit higher than the human-level error.
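
The table can be read as a chain of gaps between consecutive error levels. The sketch below computes those gaps from the numbers in the question, treating human-level error as a proxy for Bayes error (an assumption carried over from the course's framing); the dictionary keys are labels chosen here for readability.

```python
# Error rates in percent, taken from the table and the question text.
errors = {"human": 0.5, "train": 8.8, "train_dev": 9.1, "dev": 14.3, "test": 14.8}

gaps = {
    "avoidable bias (train - human)": errors["train"] - errors["human"],
    "variance (train-dev - train)": errors["train_dev"] - errors["train"],
    "data mismatch (dev - train-dev)": errors["dev"] - errors["train_dev"],
    "dev-set overfitting (test - dev)": errors["test"] - errors["dev"],
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points")
```

The four gaps come out to 8.3, 0.3, 5.2 and 0.5 percentage points respectively, which makes it easier to see which ones are large relative to the others.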

Question 7 (1 point)
Based on the table from the previous question, a friend thinks that the training data distribution is much easier than the dev/test distribution. What do you think?

Your friend is right. (I.e., Bayes error for the training data distribution is probably lower than for the dev/test distribution.)

Your friend is wrong. (I.e., Bayes error for the training data distribution is probably higher than for the dev/test distribution.)

There’s insufficient information to tell if your friend is right or wrong.

Question 8 (1 point)
You decide to focus on the dev set and check by hand what the errors are due to. Here is a table summarizing your discoveries:

Overall dev set error	14.3%
Errors due to incorrectly labeled data	4.1%
Errors due to foggy pictures	8.0%
Errors due to rain drops stuck on your car’s front-facing camera	2.2%
Errors due to other causes	1.0%

In this table, 4.1%, 8.0%, etc. are a fraction of the total dev set (not just examples your algorithm mislabeled). I.e., about 8.0/14.3 = 56% of your errors are due to foggy pictures.

The results from this analysis imply that the team’s highest priority should be to bring more foggy pictures into the training set so as to address the 8.0% of errors in that category. True/False?

True because it is the largest category of errors. As discussed in lecture, we should prioritize the largest category of error to avoid wasting the team’s time.

True because it is greater than the other error categories added together (8.0 > 4.1+2.2+1.0).

False because this would depend on how easy it is to add this data and how much your team thinks it’ll help.

False because data augmentation (synthesizing foggy images from clean/non-foggy images) is more efficient.
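
The arithmetic behind the error-analysis table can be sketched as follows. Each category's share of the errors is its percentage divided by the overall dev error, and the best-case dev error after fully fixing a category is the overall error minus that percentage. The variable names are chosen here for illustration.

```python
overall = 14.3   # overall dev set error, in percent of the dev set
categories = {   # percent of the dev set, from the table
    "incorrectly labeled": 4.1,
    "foggy": 8.0,
    "rain drops": 2.2,
    "other": 1.0,
}

# Share of all errors attributable to each category.
shares = {name: pct / overall for name, pct in categories.items()}
# Best-case dev error if a category's errors were eliminated entirely.
ceilings = {name: overall - pct for name, pct in categories.items()}

for name in categories:
    print(f"{name}: {shares[name]:.0%} of errors, best-case dev error {ceilings[name]:.1f}%")

# The categories add up to 15.3%, more than 14.3%. A plausible reading
# (an assumption here) is that some images fall into more than one category.
print(f"sum of categories: {sum(categories.values()):.1f}%")
```

The per-category percentage is an upper bound on the improvement from working on that category, not a prediction of what a particular fix will achieve.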

Question 9 (1 point)
You can buy a specially designed windshield wiper that helps wipe off some of the raindrops on the front-facing camera. Based on the table from the previous question, which of the following statements do you agree with?

2.2% would be a reasonable estimate of the maximum amount this windshield wiper could improve performance.

2.2% would be a reasonable estimate of the minimum amount this windshield wiper could improve performance.

2.2% would be a reasonable estimate of how much this windshield wiper will improve performance.

2.2% would be a reasonable estimate of how much this windshield wiper could worsen performance in the worst case.

Question 10 (1 point)
You decide to use data augmentation to address foggy images. You find 1,000 pictures of fog off the internet, and “add” them to clean images to synthesize foggy days, like this:

Which of the following statements do you agree with?

So long as the synthesized fog looks realistic to the human eye, you can be confident that the synthesized data is accurately capturing the distribution of real foggy images, since human vision is very accurate for the problem you’re solving.

There is little risk of overfitting to the 1,000 pictures of fog so long as you are combining them with a much larger set (>>1,000) of clean/non-foggy images.

Adding synthesized images that look like real foggy pictures taken from the front-facing camera of your car to the training dataset won’t help the model improve because it will introduce avoidable bias.
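
The question does not say how the fog pictures are "added" to the clean images. As one possible reading, the sketch below alpha-blends a fog picture over a clean image with NumPy. The function name `add_fog`, the `alpha` parameter, and the tiny random arrays standing in for real images are all assumptions made for illustration.

```python
import numpy as np

def add_fog(clean, fog, alpha=0.5):
    """Alpha-blend a fog picture over a clean image.

    clean, fog: uint8 arrays of identical shape (H, W, 3).
    alpha: fog opacity in [0, 1]; 0 returns the clean image unchanged.
    """
    if clean.shape != fog.shape:
        raise ValueError("clean and fog images must have the same shape")
    blended = (1.0 - alpha) * clean.astype(np.float32) + alpha * fog.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)

# Stand-in data: a random 4x4 RGB "clean image" and a uniform grey "fog picture".
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
fog = np.full((4, 4, 3), 200, dtype=np.uint8)
foggy = add_fog(clean, fog, alpha=0.6)
print(foggy.shape, foggy.dtype)
```

However many clean images are used, every synthesized example in a scheme like this draws its fog from the same 1,000 pictures, which is worth keeping in mind when weighing the statements above.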

Question 11 (1 point)
After working further on the problem, you’ve decided to correct the incorrectly labeled data on the dev set. Which of these statements do you agree with? (Check all that apply).

You should also correct the incorrectly labeled data in the test set, so that the dev and test sets continue to come from the same distribution

You should correct incorrectly labeled data in the training set as well so as to avoid your training set now being even more different from your dev set.

You should not correct the incorrectly labeled data in the test set, so that the dev and test sets continue to come from the same distribution

You should not correct incorrectly labeled data in the training set as well so as to avoid your training set now being even more different from your dev set.

Question 12 (1 point)
So far your algorithm only recognizes red and green traffic lights. One of your colleagues in the startup is starting to work on recognizing a yellow traffic light. (Some countries call it an orange light rather than a yellow light; we’ll use the US convention of calling it yellow.) Images containing yellow lights are quite rare, and she doesn’t have enough data to build a good model. She hopes you can help her out using transfer learning.

What do you tell your colleague?

She should try using weights pre-trained on your dataset, and fine-tuning further with the yellow-light dataset.

If she has (say) 10,000 images of yellow lights, randomly sample 10,000 images from your dataset and put your and her data together. This prevents your dataset from “swamping” the yellow lights dataset.

You cannot help her because the distribution of data you have is different from hers, and is also lacking the yellow label.

Recommend that she try multi-task learning instead of transfer learning using all the data.

Question 13 (1 point)
Another colleague wants to use microphones placed outside the car to better hear if there are other vehicles around you. For example, if there is a police vehicle behind you, you would be able to hear its siren. However, they don’t have much data to train this audio system. How can you help?

Transfer learning from your vision dataset could help your colleague get going faster. Multi-task learning seems significantly less promising.

Multi-task learning from your vision dataset could help your colleague get going faster. Transfer learning seems significantly less promising.

Either transfer learning or multi-task learning could help your colleague get going faster.

Neither transfer learning nor multi-task learning seems promising.

Question 14 (1 point)
To recognize red and green lights, you have been using this approach:

(A) Input an image (x) to a neural network and have it directly learn a mapping to make a prediction as to whether there’s a red light and/or green light (y).
A teammate proposes a different, two-step approach:
(B) In this two-step approach, you would first (i) detect the traffic light in the image (if any), then (ii) determine the color of the illuminated lamp in the traffic light.

Between these two, Approach B is more of an end-to-end approach because it has distinct steps for the input end and the output end. True/False?

True

False

Question 15 (1 point)
Approach A (in the question above) tends to be more promising than approach B if you have a __ (fill in the blank).

Large training set

Multi-task learning problem.

Large bias problem.

Problem with a high Bayes error.

Heroes of Deep Learning (Optional)

Ruslan Salakhutdinov interview (17 min)

Convolutional Neural Networks

About this course: This course will teach you how to build convolutional neural networks and apply them to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.

You will:

Understand how to build a convolutional neural network, including recent variations such as residual networks.
Know how to apply convolutional networks to visual detection and recognition tasks.
Know how to use neural style transfer to generate art.
Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.

This is the fourth course of the Deep Learning Specialization.

Who is this class for:

Learners that took the first two courses of the specialization. The third course is recommended.
Anyone that already has a solid understanding of densely connected neural networks, and wants to learn convolutional neural networks or work with image data.

Week 1 Foundations of Convolutional Neural Networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Learning Objectives

Understand the convolution operation
Understand the pooling operation
Remember the vocabulary used in convolutional neural networks (padding, stride, filter, …)
Build a convolutional neural network for image multi-class classification