For quick searching
Course can be found here
Video in YouTube
Lecture Slides can be found in my Github(PDF version)

If you want to break into AI, this Specialization will help you do so. Deep Learning is one of the most highly sought after skills in tech. We will help you become good at Deep Learning.

In five courses, you will learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about Convolutional networks, RNNs, LSTM, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry. You will practice all these ideas in Python and in TensorFlow, which we will teach.

You will also hear from many top leaders in Deep Learning, who will share with you their personal stories and give you career advice.

AI is transforming multiple industries. After finishing this specialization, you will likely find creative ways to apply it to your work.

We will help you master Deep Learning, understand how to apply it, and build a career in AI.

Neural Networks and Deep Learning

Course can be found here
Lecture slides can be found here
About this course: If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought after, and mastering deep learning will give you numerous new career opportunities. Deep learning is also a new “superpower” that will let you build AI systems that just weren’t possible a few years ago.

In this course, you will learn the foundations of deep learning. When you finish this class, you will:

• Understand the major technology trends driving Deep Learning
• Be able to build, train and apply fully connected deep neural networks
• Know how to implement efficient (vectorized) neural networks
• Understand the key parameters in a neural network’s architecture

This course also teaches you how Deep Learning actually works, rather than presenting only a cursory or surface-level description. So after completing it, you will be able to apply deep learning to your own applications. If you are looking for a job in AI, after this course you will also be able to answer basic interview questions.

This is the first course of the Deep Learning Specialization.

Who is this class for:

Prerequisites:

Expected:
• Programming: Basic Python programming skills, with the capability to work effectively with data structures.

Recommended:
• Mathematics: Matrix vector operations and notation.
• Machine Learning: Understanding how to frame a machine learning problem, including how data is represented, will be beneficial. If you have taken my Machine Learning Course here, you have much more than the needed level of knowledge.

Week 1 Introduction to deep learning

Be able to explain the major trends driving the rise of deep learning, and understand where and how it is applied today.

Learning Objectives

Understand the major trends driving the rise of deep learning.
Be able to explain how deep learning is applied to supervised learning.
Understand what are the major categories of models (such as CNNs and RNNs), and when they should be applied.
Be able to recognize the basics of when deep learning will (or will not) work well.

Practice Questions

Quiz: Introduction to deep learning (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: September 17, 11:59 PM PDT

1 point
1.What does the analogy “AI is the new electricity” refer to?

Correct
Yes. AI is transforming many fields from the car industry to agriculture to supply-chain…

1 point
2.Which of these are reasons for Deep Learning recently taking off? (Check the three options that apply.)

Correct
These were all examples discussed in lecture 3.

Un-selected is correct

Correct
Yes! The digitalization of our society has played a huge role in this.

Correct
Yes! The development of hardware, perhaps especially GPU computing, has significantly improved deep learning algorithms’ performance.

1 point
3.Recall this diagram of iterating over different ML ideas. Which of the statements below are true? (Check all that apply.)

1 point
4.When an experienced deep learning engineer works on a new problem, they can usually use insight from previous problems to train a good model on the first try, without needing to iterate multiple times through different models. True/False?

This should not be selected
No. Finding the characteristics of a model is key to have good performance. Although experience can help, it requires multiple iterations to build a good model.

1 point
5.Which one of these plots represents a ReLU activation function?

1 point
6.Images for cat recognition is an example of “structured” data, because it is represented as a structured array in a computer. True/False?

This should not be selected
No. Images for cat recognition is an example of “unstructured” data.

1 point
7.A demographic dataset with statistics on different cities’ population, GDP per capita, economic growth is an example of “unstructured” data because it contains data coming from different sources. True/False?

1 point
8.Why is an RNN (Recurrent Neural Network) used for machine translation, say translating English to French? (Check all that apply.)

Correct
Yes. We can train it on many pairs of sentences x (English) and y (French).

This should not be selected
No. RNN and CNN are two distinct classes of models, with their own advantages and disadvantages.

Correct
Yes. An RNN can map from a sequence of English words to a sequence of French words.

1 point
9.In this diagram which we hand-drew in lecture, what do the horizontal axis (x-axis) and vertical axis (y-axis) represent?

1 point
10.Assuming the trends described in the previous question’s figure are accurate (and hoping you got the axis labels right), which of the following are true? (Check all that apply.)

Week 2 Neural Networks Basics

Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your models.

Learning Objectives

Build a logistic regression model, structured as a shallow neural network
Implement the main steps of an ML algorithm, including making predictions, derivative computation, and gradient descent.
Implement computationally efficient, highly vectorized, versions of models.
Understand how to compute derivatives for logistic regression, using a backpropagation mindset.
Become familiar with Python and Numpy
Work with iPython Notebooks
Be able to implement vectorization across multiple training examples

Practice Questions

Quiz: Neural Network Basics (10 questions)

To pass: 80% or higher
Deadline: September 24, 11:59 PM PDT

1 point
1.What does a neuron compute?

1 point
2.Which of these is the “Logistic Loss”?

1 point
3.Suppose img is a (32,32,3) array, representing a 32x32 image with 3 color channels red, green and blue. How do you reshape this into a column vector?
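
As a minimal sketch of the reshape this question is about (the random `img` below is a hypothetical stand-in for a real image, and the variable name `x` is my own), one way to flatten a (32, 32, 3) array into a column vector with NumPy is:

```python
import numpy as np

# Hypothetical stand-in for the 32x32 RGB image described in the question.
img = np.random.rand(32, 32, 3)

# Flatten all 32 * 32 * 3 = 3072 values into a single column.
x = img.reshape((32 * 32 * 3, 1))

print(x.shape)  # (3072, 1)
```

The reshape keeps every pixel value; only the layout changes, so `x[:, 0]` holds the same numbers as `img` read in row-major order.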

1 point
4.Consider the two following random arrays “a” and “b”:

What will be the shape of “c”?

1 point
5.Consider the two following random arrays “a” and “b”:

What will be the shape of “c”?

1 point
6.Suppose you have $$n_x$$ input features per example. Recall that $$X = [x^{(1)} x^{(2)} … x^{(m)}]$$. What is the dimension of X?

1 point
7.Recall that “np.dot(a,b)” performs a matrix multiplication on a and b, whereas “a*b” performs an element-wise multiplication.

Consider the two following random arrays “a” and “b”:

What is the shape of c?
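
The arrays from the question are not reproduced in these notes, so the shapes below are hypothetical. The sketch only illustrates the difference stated above between `np.dot` (matrix multiplication) and `*` (element-wise multiplication with broadcasting):

```python
import numpy as np

# Hypothetical shapes, chosen only for illustration.
a = np.random.randn(4, 3)
b = np.random.randn(3, 2)

# Matrix product: inner dimensions (3 and 3) must match -> result is (4, 2).
c_dot = np.dot(a, b)

# Element-wise product of the same two arrays fails, because (4, 3) and (3, 2)
# cannot be broadcast to a common shape.
try:
    a * b
    elementwise_failed = False
except ValueError:
    elementwise_failed = True

# Element-wise product works when shapes are broadcast-compatible:
# the (4, 1) column is stretched across the 3 columns of d.
d = np.random.randn(4, 3)
e = np.random.randn(4, 1)
c_elem = d * e

print(c_dot.shape, c_elem.shape, elementwise_failed)
```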

1 point
8.Consider the following code snippet:

How do you vectorize this?

1 point
9.Consider the following code:

What will be c? (If you’re not sure, feel free to run this in python to find out).

1 point
10.Consider the following computation graph.

What is the output J?

Programming Assignments

Deep Learning Honor Code (2 min)

We strongly encourage students to form study groups, and discuss the lecture videos (including in-video questions). We also encourage you to get together with friends to watch the videos together as a group. However, the answers that you submit for the review questions should be your own work. For the programming exercises, you are welcome to discuss them with other students, discuss specific algorithms, properties of algorithms, etc.; we ask only that you not look at any source code written by a different student, nor show your solution code to other students.

You are also not allowed to post your code publicly on github.

Python Basics with numpy (optional, 1h)
Welcome to your first (Optional) programming exercise of the deep learning specialization. In this assignment you will:

• Learn how to use numpy.

• Implement some basic core deep learning functions such as the softmax, sigmoid, dsigmoid, etc…

• Learn how to handle data by normalizing inputs and reshaping images.

• Recognize the importance of vectorization.

• Understand how python broadcasting works.

This assignment prepares you well for the upcoming assignment. Take your time to complete it and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify it. After you are done, submit your work and check your results. You need to score 70% to pass. Good luck :) !

Logistic Regression with a Neural Network mindset (2h)
Welcome to the first (required) programming exercise of the deep learning specialization. In this notebook you will build your first image recognition algorithm. You will build a cat classifier that recognizes cats with 70% accuracy!

As you keep learning new techniques you will increase it to 80+ % accuracy on cat vs. non-cat datasets. By completing this assignment you will:

• Work with logistic regression in a way that builds intuition relevant to neural networks.

• Learn how to minimize the cost function.

• Understand how derivatives of the cost are used to update parameters.

Take your time to complete this assignment and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify these comments. After you are done, submit your work and check your results. You need to score 70% to pass. Good luck :) !

Week 3 Shallow neural networks

Learn to build a neural network with one hidden layer, using forward propagation and backpropagation.

Learning Objectives

Understand hidden units and hidden layers
Be able to apply a variety of activation functions in a neural network.
Build your first forward and backward propagation with a hidden layer
Apply random initialization to your neural network
Become fluent with Deep Learning notations and Neural Network Representations
Build and train a neural network with one hidden layer.

Practice Questions

Quiz: Shallow Neural Networks (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 1, 11:59 PM PDT

1 point
1.Which of the following are true? (Check all that apply.)

1 point
2.The tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True/False?
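
The statement about output means can be checked numerically. The sketch below assumes zero-mean, standard-normal pre-activations, which is my own assumption for illustration and not something stated in the quiz:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)  # assumed zero-mean pre-activations

mean_tanh = np.tanh(z).mean()      # tanh is odd, so this lands close to 0
mean_sigmoid = sigmoid(z).mean()   # sigmoid is centred on 0.5

print(mean_tanh, mean_sigmoid)
```

Under this assumption the tanh outputs average near 0 while the sigmoid outputs average near 0.5, which is the centering effect the question refers to.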

1 point
3. Which of these is a correct vectorized implementation of forward propagation for layer $$l$$, where $$1 \leq l \leq L$$?

1 point
4.You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?

1 point
5.Consider the following code:

What will be B.shape? (If you’re not sure, feel free to run this in python to find out).

1 point
6.Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

1 point
7.Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”, True/False?

1 point
8. You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?
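
A rough way to explore this question is to compare the tanh derivative, 1 - tanh(z)^2, under small and large weight scales. The layer sizes, the input data and the helper name `tanh_grad_mean` below are all hypothetical choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))  # hypothetical: 3 features, 5 examples

def tanh_grad_mean(scale):
    """Average tanh derivative for one hidden layer with weights ~ N(0, scale^2)."""
    W = rng.standard_normal((4, 3)) * scale  # hypothetical: 4 hidden units
    Z = W @ X
    A = np.tanh(Z)
    return np.mean(1.0 - A ** 2)  # d/dz tanh(z) = 1 - tanh(z)^2

g_small = tanh_grad_mean(0.01)   # inputs to tanh stay near 0
g_large = tanh_grad_mean(1000)   # inputs to tanh are pushed far from 0

print(g_small, g_large)
```

With the large scale, tanh saturates near -1 or +1 and its derivative is close to zero, so gradient-based learning slows down.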

1 point
9.Consider the following 1 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

1 point
10.In the same network as the previous question, what are the dimensions of Z[1] and A[1]?

Week 4 Deep Neural Networks

Understand the key computations underlying deep learning, use them to build and train deep neural networks, and apply it to computer vision.

• See deep neural networks as successive blocks put one after each other
• Build and train a deep L-layer Neural Network
• Analyze matrix and vector dimensions to check neural network implementations.
• Understand how to use a cache to pass information from forward propagation to back propagation.
• Understand the role of hyperparameters in deep learning

Practice Questions

Quiz: Key concepts on Deep Neural Networks (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: November 5, 11:59 PM PST

1 point
1.What is the “cache” used for in our implementation of forward propagation and backward propagation?

We use it to pass variables computed during backward propagation to the corresponding forward propagation step. It contains useful values for forward propagation to compute activations.

It is used to cache the intermediate values of the cost function during training.

It is used to keep track of the hyperparameters that we are searching over, to speed up computation.

We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.
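
To make the idea of a cache concrete, here is a minimal sketch of one linear + ReLU layer. The function names and the exact contents of the cache are assumptions made for this example, not the course's reference implementation:

```python
import numpy as np

def linear_relu_forward(A_prev, W, b):
    """Forward step for one layer; returns the activation and a cache."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z)
    cache = (A_prev, W, Z)  # values the backward step will need later
    return A, cache

def linear_relu_backward(dA, cache):
    """Backward step for the same layer, reading values from the cache."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]                 # number of examples
    dZ = dA * (Z > 0)                   # ReLU derivative is 1 where Z > 0
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))    # hypothetical: 3 inputs, 5 examples
W = rng.standard_normal((4, 3)) * 0.01  # hypothetical: 4 units in this layer
b = np.zeros((4, 1))

A, cache = linear_relu_forward(A_prev, W, b)
dA_prev, dW, db = linear_relu_backward(np.ones_like(A), cache)
print(dW.shape, db.shape, dA_prev.shape)
```

Without the cache, the backward function would have to recompute `Z` and look up `A_prev` and `W` some other way.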

1 point
2.Among the following, which ones are “hyperparameters”? (Check all that apply.)

learning rate α

number of iterations

bias vectors b[l]

number of layers L in the neural network

activation values a[l]

weight matrices W[l]

size of the hidden layers n[l]

1 point
3.Which of the following statements is true?

The deeper layers of a neural network are typically computing more complex features of the input than the earlier layers.

The earlier layers of a neural network are typically computing more complex features of the input than the deeper layers.

1 point
4.Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l=1, 2, …,L. True/False?

1 point
5. Assume we store the values for $$n^{[l]}$$ in an array called layer_dims, as follows: layer_dims = [n_x, 4, 3, 2, 1]. So layer 1 has four hidden units, layer 2 has three hidden units, and so on. Which of the following for-loops will allow you to initialize the parameters for the model?
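
The answer options are not reproduced in these notes. As a hedged sketch, one loop that initializes parameters with shapes consistent with such a `layer_dims` list could look like this (the function name, the dictionary keys, the 0.01 scale and `n_x = 5` are assumptions for illustration):

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Create W1..WL and b1..bL for the layer sizes given in layer_dims."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # W[l] maps n[l-1] inputs to n[l] units; small random values break symmetry.
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # b[l] is a column vector with one entry per unit in layer l.
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

n_x = 5  # hypothetical number of input features
layer_dims = [n_x, 4, 3, 2, 1]
params = initialize_parameters(layer_dims)

for name, value in params.items():
    print(name, value.shape)
```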

1 point
6.Consider the following neural network.

How many layers does this network have?

The number of layers L is 4. The number of hidden layers is 3.

The number of layers L is 3. The number of hidden layers is 3.

The number of layers L is 4. The number of hidden layers is 4.

The number of layers L is 5. The number of hidden layers is 4.

1 point
7.During forward propagation, in the forward function for a layer l you need to know what is the activation function in a layer (Sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know what is the activation function for layer l, since the gradient depends on it. True/False?

1 point
8.There are certain functions with the following properties:

(i) To compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) To compute it using a deep network circuit, you need only an exponentially smaller network. True/False?

1 point
9.Consider the following 2 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

W[1] will have shape (4, 4)

b[1] will have shape (4, 1)

W[1] will have shape (3, 4)

b[1] will have shape (3, 1)

W[2] will have shape (3, 4)

b[2] will have shape (1, 1)

W[2] will have shape (3, 1)

b[2] will have shape (3, 1)

W[3] will have shape (3, 1)

b[3] will have shape (1, 1)

W[3] will have shape (1, 3)

b[3] will have shape (3, 1)

1 point
10. Whereas the previous question used a specific network, in the general case what is the dimension of $$W^{[l]}$$, the weight matrix associated with layer $$l$$?

$$W^{[l]}$$ has shape $$(n^{[l-1]}, n^{[l]})$$

$$W^{[l]}$$ has shape $$(n^{[l]}, n^{[l+1]})$$

$$W^{[l]}$$ has shape $$(n^{[l+1]}, n^{[l]})$$

$$W^{[l]}$$ has shape $$(n^{[l]}, n^{[l-1]})$$

Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

Course can be found here
Lecture slides can be found here

About this course: This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

After 3 weeks, you will:

• Understand industry best-practices for building deep learning applications.
• Be able to effectively use the common neural network “tricks”, including initialization, L2 and dropout regularization, Batch normalization, and gradient checking.
• Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
• Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
• Be able to implement a neural network in TensorFlow.

This is the second course of the Deep Learning Specialization.

Who is this class for: This class is for: - Learners that took the first course of the specialization: “Neural Networks and Deep Learning” - Anyone that already understands fully-connected neural networks, and wants to learn the practical aspects of making them work well.

Week 1 Practical aspects of Deep Learning

Learning Objectives

Recall that different types of initializations lead to different results
Recognize the importance of initialization in complex neural networks.
Recognize the difference between train/dev/test sets
Diagnose the bias and variance issues in your model
Learn when and how to use regularization methods such as dropout or L2 regularization.
Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them
Use gradient checking to verify the correctness of your backpropagation implementation
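
The last objective, gradient checking, can be sketched on a toy cost. The cost function, the helper names and the step size `epsilon=1e-7` below are assumptions chosen for this example. The check compares an analytic gradient against a two-sided finite difference:

```python
import numpy as np

def cost(theta):
    """Toy cost used only for this illustration."""
    return np.sum(theta ** 3)

def analytic_grad(theta):
    """Exact gradient of the toy cost."""
    return 3 * theta ** 2

def gradient_check(f, grad_f, theta, epsilon=1e-7):
    """Relative difference between grad_f(theta) and a numerical estimate."""
    grad = grad_f(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        minus = theta.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        # Two-sided (centred) difference for coordinate i.
        grad_approx[i] = (f(plus) - f(minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

theta = np.array([1.0, -2.0, 0.5])
diff_ok = gradient_check(cost, analytic_grad, theta)
diff_bad = gradient_check(cost, lambda t: 2 * t ** 2, theta)  # deliberately wrong gradient

print(diff_ok, diff_bad)
```

A very small relative difference suggests the analytic gradient is right; the deliberately wrong gradient produces a much larger value.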

Practice Questions

Quiz: Practical aspects of deep learning (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 15, 11:59 PM PDT

1 point
1.If you have 10,000,000 examples, how would you split the train/dev/test set?

60% train . 20% dev . 20% test

98% train . 1% dev . 1% test

33% train . 33% dev . 33% test

1 point
2.The dev and test set should:

Come from the same distribution

Come from different distributions

Be identical to each other (same (x,y) pairs)

Have the same number of examples

1 point
3. If your Neural Network model seems to have high variance, which of the following would be promising things to try?

Add regularization

Get more training data

Increase the number of units in each hidden layer

Get more test data

Make the Neural Network deeper

1 point
4.You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

Increase the regularization parameter lambda

Decrease the regularization parameter lambda

Get more training data

Use a bigger neural network

1 point
5.What is weight decay?

A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.

A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.

The process of gradually decreasing the learning rate during training.

Gradual corruption of the weights in the neural network if it is trained on noisy data.
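
To see why L2 regularization is also called weight decay, consider a cost with the penalty term (lambda / 2m) * ||W||^2, whose gradient adds (lambda / m) * W to the data gradient. The sketch below uses hypothetical numbers and a zero data gradient so that only the shrinking effect is visible:

```python
import numpy as np

def l2_update(W, dW_data, alpha, lambd, m):
    """One gradient step with an L2 penalty of (lambd / (2 * m)) * ||W||^2."""
    dW = dW_data + (lambd / m) * W   # penalty contributes (lambd / m) * W
    return W - alpha * dW

W = np.array([[1.0, -2.0],
              [0.5, 4.0]])
zero_grad = np.zeros_like(W)          # pretend the data gradient is zero

alpha, lambd, m = 0.1, 0.7, 10        # hypothetical hyperparameters
W_next = l2_update(W, zero_grad, alpha, lambd, m)

# The same step written as a multiplicative shrink of the weights.
decay_factor = 1 - alpha * lambd / m
print(decay_factor)
print(W_next)
```

Each step multiplies the weights by a factor slightly below 1 before applying the data gradient, which is the "decay".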

1 point
6.What happens when you increase the regularization hyperparameter lambda?

Weights are pushed toward becoming smaller (closer to 0)

Weights are pushed toward becoming bigger (further from 0)

Doubling lambda should roughly result in doubling the weights

Gradient descent taking bigger steps with each iteration (proportional to lambda)

1 point
7.With the inverted dropout technique, at test time:

You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training

You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.

You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training
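
A minimal sketch of inverted dropout, with a `training` flag to contrast training and test behaviour. The function name, the flag and the all-ones activations are assumptions made for this example:

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Inverted dropout: mask and rescale during training, pass through otherwise."""
    if not training:
        return A                               # no mask and no scaling
    mask = rng.random(A.shape) < keep_prob     # keep each unit with prob keep_prob
    return (A * mask) / keep_prob              # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
A = np.ones((50, 1000))                        # hypothetical activations

A_train = dropout_forward(A, keep_prob=0.8, rng=rng, training=True)
A_test = dropout_forward(A, keep_prob=0.8, rng=rng, training=False)

print(A_train.mean(), A_test.mean())
```

Because the division by `keep_prob` already happens during training, the expected activation stays the same and nothing extra is needed when the mask is switched off.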

1 point
8.Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)

Increasing the regularization effect

Reducing the regularization effect

Causing the neural network to end up with a higher training set error

Causing the neural network to end up with a lower training set error

1 point
9.Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)

Dropout

Data augmentation

Vanishing gradient

Xavier initialization

Gradient Checking

Exploding gradient

L2 regularization

1 point
10.Why do we normalize the inputs x?

It makes the parameter initialization faster

It makes the cost function faster to optimize

It makes it easier to visualize the data

Normalization is another word for regularization–It helps to reduce variance

Programming assignments

Programming Assignment: Regularization (1h)

https://www.coursera.org/learn/deep-neural-network/programming/SXQaI

Programming Assignment: Gradient Checking (1h)

https://www.coursera.org/learn/deep-neural-network/programming/n6NBD

Week 2 Optimization algorithms

Learning Objectives

• Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam
• Use random minibatches to accelerate the convergence and improve the optimization
• Know the benefits of learning rate decay and apply it to your optimization

Practice Questions

Quiz: Optimization algorithms (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 22, 11:59 PM PDT

1 point
1.Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

$$a^{[8]\{3\}(7)}$$

$$a^{[8]\{7\}(3)}$$

$$a^{[3]\{8\}(7)}$$

$$a^{[3]\{7\}(8)}$$

1 point
2.Which of these statements about mini-batch gradient descent do you agree with?

You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.

One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

1 point
3.Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

If the mini-batch size is 1, you end up having to process the entire training set before making any progress.

If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.

If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.

If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
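
For reference, splitting a training set into shuffled mini-batches might be sketched as follows. The function name, the column-per-example layout and the toy data are assumptions for this illustration:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the examples (columns) and cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of examples
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

X = np.arange(20).reshape(2, 10).astype(float)  # 2 features, 10 examples
Y = np.arange(10).reshape(1, 10)                # one label per example

batches = random_mini_batches(X, Y, batch_size=4)
print([bx.shape[1] for bx, _ in batches])  # [4, 4, 2]
```

Setting `batch_size` to 1 or to the full number of examples gives the two extremes the question contrasts; values in between keep vectorization within each batch while still updating the parameters several times per epoch.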

1 point
4.Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

Which of the following do you agree with?

If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.

If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.

Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.

Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.

1 point
5. Suppose the temperature in Casablanca over the first three days of January is the same:

Jan 1st: $$\theta_1 = 10^\circ C$$
Jan 2nd: $$\theta_2 = 10^\circ C$$
(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)

Say you use an exponentially weighted average with $$\beta = 0.5$$ to track the temperature: $$v_0 = 0$$, $$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$. If $$v_2$$ is the value computed after day 2 without bias correction, and $$v_2^{corrected}$$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)

$$v_2 = 10$$, $$v_2^{corrected} = 10$$

$$v_2 = 7.5$$, $$v_2^{corrected} = 7.5$$

$$v_2 = 10$$, $$v_2^{corrected} = 7.5$$

$$v_2 = 7.5$$, $$v_2^{corrected} = 10$$
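
The recurrence in this question can be stepped through directly. The sketch below uses the two temperatures and the value of beta given in the question, together with the usual bias correction of dividing by (1 - beta^t); the variable names are my own:

```python
beta = 0.5
temps = [10.0, 10.0]   # theta_1 and theta_2 from the question

v = 0.0                # v_0 = 0
for t, theta in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta   # exponentially weighted average
    v_corrected = v / (1 - beta ** t)   # bias correction for the zero start

print(v, v_corrected)  # 7.5 10.0
```

The uncorrected average is pulled toward the initial value of 0, and the correction undoes that pull.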

1 point
6.Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.

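$$\alpha = \frac{1}{\sqrt{t}} \alpha_0$$

$$\alpha = \frac{1}{1+2t} \alpha_0$$

$$\alpha = e^{t} \alpha_0$$

$$\alpha = 0.95^{t} \alpha_0$$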
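
One way to compare the four schemes in this question is to evaluate each over a few epochs and check whether the learning rate actually shrinks. The initial rate `alpha0 = 0.1` and the dictionary keys are hypothetical choices for this sketch:

```python
import numpy as np

alpha0 = 0.1             # hypothetical initial learning rate
t = np.arange(1, 6)      # epochs 1..5

schedules = {
    "inv_sqrt": alpha0 / np.sqrt(t),        # alpha = alpha0 / sqrt(t)
    "inv_linear": alpha0 / (1 + 2 * t),     # alpha = alpha0 / (1 + 2t)
    "exp_growth": np.exp(t) * alpha0,       # alpha = e^t * alpha0
    "exp_decay": 0.95 ** t * alpha0,        # alpha = 0.95^t * alpha0
}

# A decay scheme should produce a strictly decreasing sequence.
is_decreasing = {
    name: bool(np.all(np.diff(values) < 0)) for name, values in schedules.items()
}
print(is_decreasing)
```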

1 point
7. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$. The red line below was computed using $$\beta = 0.9$$. What would happen to your red curve as you vary $$\beta$$? (Check the two that apply)

Decreasing β will shift the red line slightly to the right.

Increasing β will shift the red line slightly to the right.

Decreasing β will create more oscillation within the red line.

Increasing β will create more oscillations within the red line.

1 point
8.Consider this figure:

These plots were generated with gradient descent; with gradient descent with momentum (β = 0.5) and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

(1) is gradient descent. (2) is gradient descent with momentum (large β) . (3) is gradient descent with momentum (small β)

(1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (small β), (3) is gradient descent

(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)

(1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)

1 point
9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $$J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$$. Which of the following techniques could help find parameter values that attain a small value for $$J$$? (Check all that apply)

Try better random initialization for the weights

Try tuning the learning rate α

Try mini-batch gradient descent

Try using Adam

Try initializing all the weights to zero

1 point
10.Which of the following statements about Adam is False?

Adam combines the advantages of RMSProp and momentum

Adam should be used with batch gradient computations, not with mini-batches.

We usually use “default” values for the hyperparameters $$\beta_1$$, $$\beta_2$$ and $$\varepsilon$$ in Adam ($$\beta_1 = 0.9$$, $$\beta_2 = 0.999$$, $$\varepsilon = 10^{-8}$$)

The learning rate hyperparameter α in Adam usually needs to be tuned.
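
For reference, a single Adam update can be sketched as below, using the default beta1, beta2 and epsilon values mentioned in the options. The function name `adam_step`, the toy objective f(w) = w^2 and the learning rate 0.05 are assumptions made for this illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # momentum-style first moment
    v = beta2 * v + (1 - beta2) * grad ** 2    # RMSprop-style second moment
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = w^2, whose gradient is 2w.
w = np.array([5.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.05)

print(w)
```

Nothing in the update depends on how `grad` was computed, so the same step applies whether the gradient comes from the full batch or from a mini-batch.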

Week 3 Hyperparameter tuning, Batch Normalization and Programming Frameworks

Master the process of hyperparameter tuning

Practice Questions

Quiz: Hyperparameter tuning, Batch Normalization, Programming Frameworks (10 questions)

To pass: 80% or higher
Attempts: 3 every 8 hours
Deadline: October 29, 11:59 PM PDT

1 point
1.If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?

1 point
2.Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?

1 point
3.During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:

Whether you use batch or mini-batch optimization

The presence of local minima (and saddle points) in your neural network

The amount of computational power you can access

The number of hyperparameters you have to tune

1 point
4. If you think β (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
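
The answer options are not reproduced in these notes. A common approach, sketched here as an assumption rather than as the quiz's exact wording, is to sample 1 - beta on a logarithmic scale so that values near 0.99 are explored as densely as values near 0.9:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent uniformly, so 1 - beta is log-uniform in [0.01, 0.1].
r = rng.uniform(-2, -1, size=10_000)
beta = 1 - 10 ** r

# About half of the samples fall above 1 - 10**-1.5 (roughly 0.968),
# whereas uniform sampling of beta would put only about a quarter there.
frac_high = np.mean(beta > 1 - 10 ** -1.5)
print(beta.min(), beta.max(), frac_high)
```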

1 point
5.Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?

1 point
6.In batch normalization as presented in the videos, if you apply it on the lth layer of your neural network, what are you normalizing?

z[l]

b[l]

W[l]

a[l]

1 point
7. In the normalization formula $$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$, why do we use epsilon?

To speed up convergence

To avoid division by zero

To have a more accurate normalization

In case μ is too small
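
The role of epsilon is easy to see with a hidden unit whose values are identical across the mini-batch. The sketch below assumes one row per hidden unit and one column per example; the function name and the toy values are my own:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalise each hidden unit over the mini-batch, then scale and shift."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # eps keeps the denominator nonzero
    return gamma * Z_norm + beta

Z = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 5.0, 5.0, 5.0]])   # second unit is constant: variance 0
gamma = np.ones((2, 1))
beta = np.zeros((2, 1))

Z_tilde = batch_norm_forward(Z, gamma, beta)
print(Z_tilde)
```

Without `eps`, the second row would compute 0 / 0 and fill the output with NaN values.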

1 point
8.Which of the following statements about γ and β in Batch Norm are true?

There is one global value of $$\gamma \in \mathbb{R}$$ and one global value of $$\beta \in \mathbb{R}$$ for each layer, which applies to all the hidden units in that layer.

β and γ are hyperparameters of the algorithm, which we tune via random sampling.

They set the mean and variance of the linear variable z[l] of a given layer.

They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.

The optimal values are $$\gamma = \sqrt{\sigma^2 + \varepsilon}$$, and $$\beta = \mu$$.

1 point
9.After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:

Perform the needed normalizations, using $$\mu$$ and $$\sigma^2$$ estimated using an exponentially weighted average across mini-batches seen during training.

Skip the step where you normalize using $$\mu$$ and $$\sigma^2$$ since a single test example cannot be normalized.

Use the most recent mini-batch’s value of $$\mu$$ and $$\sigma^2$$ to perform the needed normalizations.

If you implemented Batch Norm on mini-batches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a mini-batch the same size as during training.

1 point
10.Which of these statements about deep learning programming frameworks are true? (Check all that apply)

Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than become closed or modified to benefit only one company.

Deep learning programming frameworks require cloud-based machines to run.

A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.

Structuring Machine Learning Projects

About this course: You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.

Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.

After 2 weeks, you will:

• Understand how to diagnose errors in a machine learning system, and
• Be able to prioritize the most promising directions for reducing error
• Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
• Know how to apply end-to-end learning, transfer learning, and multi-task learning

I’ve seen teams waste months or years through not understanding the principles taught in this course. I hope this two week course will save you months of time.

This is a standalone course, and you can take this so long as you have basic machine learning knowledge. This is the third course in the Deep Learning Specialization.

Who is this class for:

Pre-requisites:

• This course is aimed at individuals with basic knowledge of machine learning, who want to know how to set technical direction and prioritization for their work.
• It is recommended that you take courses one and two of this specialization (Neural Networks and Deep Learning, and Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization) prior to beginning this course.

Course can be found here

Week 1 ML Strategy (1)

Learning Objectives

• Understand why Machine Learning strategy is important
• Apply satisficing and optimizing metrics to set up your goal for ML projects
• Choose a correct train/dev/test split of your dataset
• Understand how to define human-level performance
• Use human-level performance to define your key priorities in ML projects
• Make the correct ML strategy decision based on observed performance and the dataset

Machine Learning flight simulator

Quiz: Bird recognition in the city of Peacetopia (case study), 15 questions

Question 1 (1 point)
Problem Statement

This example is adapted from a real production application, but with details disguised to protect confidentiality.

You are a famous researcher in the City of Peacetopia. The people of Peacetopia have a common characteristic: they are afraid of birds. To save them, you have to build an algorithm that will detect any bird flying over Peacetopia and alert the population.

The City Council gives you a dataset of 10,000,000 images of the sky above Peacetopia, taken from the city’s security cameras. They are labelled:

• y = 0: There is no bird on the image
• y = 1: There is a bird on the image

Your goal is to build an algorithm able to classify new images taken by security cameras from Peacetopia.

There are a lot of decisions to make:

• What is the evaluation metric?
• How do you structure your data into train/dev/test sets?

Metric of success

The City Council tells you that they want an algorithm that:

1. Has high accuracy
2. Runs quickly and takes only a short time to classify a new image.
3. Can fit in a small amount of memory, so that it can run in a small processor that the city will attach to many different security cameras.

Note: Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate. True/False?

Question 2 (1 point)
After further discussions, the city narrows down its criteria to:

• “We need an algorithm that can let us know a bird is flying over Peacetopia as accurately as possible.”
• “We want the trained model to take no more than 10sec to classify a new image.”
• “We want the model to fit in 10MB of memory.”

If you had the three following models, which one would you choose?

Test Accuracy Runtime Memory size
97% 1 sec 3MB

Test Accuracy Runtime Memory size
99% 13 sec 9MB

Test Accuracy Runtime Memory size
97% 3 sec 2MB

Test Accuracy Runtime Memory size
98% 9 sec 9MB
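
One way to read the city's criteria in question 2 is as a constrained selection: treat runtime and memory as thresholds that must be met, then compare the remaining models on accuracy alone. The sketch below encodes the four rows of the table; the labels "A" to "D" and the helper `pick` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical label, not in the quiz
    accuracy: float    # test accuracy as a fraction
    runtime_sec: float
    memory_mb: float

def pick(candidates, max_runtime_sec, max_memory_mb):
    """Keep models that meet both thresholds, then maximize accuracy."""
    feasible = [
        c for c in candidates
        if c.runtime_sec <= max_runtime_sec and c.memory_mb <= max_memory_mb
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.accuracy)

candidates = [
    Candidate("A", 0.97, 1, 3),
    Candidate("B", 0.99, 13, 9),
    Candidate("C", 0.97, 3, 2),
    Candidate("D", 0.98, 9, 9),
]
best = pick(candidates, max_runtime_sec=10, max_memory_mb=10)
```

Under this reading, a model that misses a threshold is excluded no matter how accurate it is.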

Question 3 (1 point)
Based on the city’s requests, which of the following would you say is true?

Accuracy is an optimizing metric; running time and memory size are satisficing metrics.

Accuracy is a satisficing metric; running time and memory size are optimizing metrics.

Accuracy, running time and memory size are all optimizing metrics because you want to do well on all three.

Accuracy, running time and memory size are all satisficing metrics because you have to do sufficiently well on all three for your system to be acceptable.

Question 4 (1 point)
Structuring your data

Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?

Train Dev Test
6,000,000 1,000,000 3,000,000

Train Dev Test
9,500,000 250,000 250,000

Train Dev Test
3,333,334 3,333,333 3,333,333

Train Dev Test
6,000,000 3,000,000 1,000,000
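
To compare the four candidate splits at a glance, the short sketch below converts each row into percentages of the 10,000,000 images. The helper name `split_fractions` is an assumption for this example.

```python
def split_fractions(train, dev, test):
    """Return (train, dev, test) as percentages of the total, rounded to 2 places."""
    total = train + dev + test
    return tuple(round(100 * n / total, 2) for n in (train, dev, test))

# The four options from question 4, in the order listed.
options = [
    (6_000_000, 1_000_000, 3_000_000),
    (9_500_000, 250_000, 250_000),
    (3_333_334, 3_333_333, 3_333_333),
    (6_000_000, 3_000_000, 1_000_000),
]
fractions = [split_fractions(*option) for option in options]

for option, frac in zip(options, fractions):
    print(option, "->", frac)
```

Even a 2.5% dev or test set is 250,000 images at this dataset size, which is worth keeping in mind when comparing the rows.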

Question 5 (1 point)
After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the “citizens’ data”. Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.

You should not add the citizens’ data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?

Question 6 (1 point)
One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens’ data images to the test set. You object because:

This would cause the dev and test set distributions to become different. This is a bad idea because you’re not aiming where you want to hit.

A bigger test set will slow down the speed of iterating because of the computational expense of evaluating models on the test set.

The 1,000,000 citizens’ data images do not have the same x → y mapping as the rest of the data (similar to the New York City/Detroit housing prices example from lecture).

The test set no longer reflects the distribution of data (security cameras) you most care about.

Question 7 (1 point)
You train a system, and its errors are as follows (error = 100%-Accuracy):

Training set error 4.0%
Dev set error 4.5%

This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?

Yes, because having 4.0% training error shows you have high bias.

Yes, because this shows your bias is higher than your variance.

No, because this shows your variance is higher than your bias.

No, because there is insufficient information to tell.

Question 8 (1 point)
You ask a few people to label the dataset so as to find out what is human-level performance. You find the following levels of accuracy:

Bird watching expert #1 0.3% error
Bird watching expert #2 0.5% error
Normal person #1 (not a bird watching expert) 1.0% error
Normal person #2 (not a bird watching expert) 1.2% error

If your goal is to have “human-level performance” be a proxy (or estimate) for Bayes error, how would you define “human-level performance”?

0.0% (because it is impossible to do better than this)

0.3% (accuracy of expert #1)

0.4% (average of 0.3 and 0.5)

0.75% (average of all four numbers above)

Question 9 (1 point)
Which of the following statements do you agree with?

A learning algorithm’s performance can be better than human-level performance but it can never be better than Bayes error.

A learning algorithm’s performance can never be better than human-level performance but it can be better than Bayes error.

A learning algorithm’s performance can never be better than human-level performance nor better than Bayes error.

A learning algorithm’s performance can be better than human-level performance and better than Bayes error.

Question 10 (1 point)
You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following:

Human-level performance 0.1%
Training set error 2.0%
Dev set error 2.1%

Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)

Try decreasing regularization.

Train a bigger model to try to do better on the training set.

Try increasing regularization.

Get a bigger training set to reduce variance.
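
The numbers in question 10 can be turned into two gaps, using human-level performance as a proxy for Bayes error: the gap between human-level and training error (avoidable bias) and the gap between training and dev error (variance). The function name `diagnose` is an assumption for this sketch; the inputs are the percentages from the table.

```python
def diagnose(human_level_error, train_error, dev_error):
    """Return (avoidable_bias, variance) in percentage points."""
    avoidable_bias = train_error - human_level_error
    variance = dev_error - train_error
    return avoidable_bias, variance

# Values from question 10, in percent.
avoidable_bias, variance = diagnose(
    human_level_error=0.1, train_error=2.0, dev_error=2.1
)
print(f"avoidable bias: {avoidable_bias:.1f} points, variance: {variance:.1f} points")
```

Comparing the two gaps is a quick way to see which of the listed options targets the larger one.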

Question 11 (1 point)
You also evaluate your model on the test set, and find the following:

Human-level performance 0.1%
Training set error 2.0%
Dev set error 2.1%
Test set error 7.0%

What does this mean? (Check the two best options.)

You should try to get a bigger dev set.

You have underfit to the dev set.

You have overfit to the dev set.

You should get a bigger test set.

Question 12 (1 point)
After working on this project for a year, you finally achieve:

Human-level performance 0.10%
Training set error 0.05%
Dev set error 0.05%

What can you conclude? (Check all that apply.)

If the test set is big enough for the 0.05% error estimate to be accurate, this implies Bayes error is ≤ 0.05%.

This is a statistical anomaly (or must be the result of statistical noise) since it should not be possible to surpass human-level performance.

With only 0.09% further progress to make, you should quickly be able to close the remaining gap to 0%

It is now harder to measure avoidable bias, thus progress will be slower going forward.

Question 13 (1 point)
It turns out Peacetopia has hired one of your competitors to build a system as well. Your system and your competitor both deliver systems with about the same running time and memory size. However, your system has higher accuracy! However, when Peacetopia tries out your and your competitor’s systems, they conclude they actually like your competitor’s system better, because even though you have higher overall accuracy, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?

Look at all the models you’ve developed during the development process and find the one with the lowest false negative error rate.

Ask your team to take into account both accuracy and false negative rate during development.

Rethink the appropriate metric for this task, and ask your team to tune to the new metric.

Pick false negative rate as the new metric, and use this new metric to drive all further development.
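
The scenario in question 13 is easy to reproduce with made-up numbers: when birds are rare, a system can have higher overall accuracy while missing more birds. The confusion counts below are invented purely for illustration and are not from the quiz.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all images classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def false_negative_rate(tp, fn):
    """Fraction of images with a bird for which no alarm was raised."""
    return fn / (tp + fn)

# Hypothetical counts on 10,000 images, 100 of which contain a bird.
yours = dict(tp=60, fn=40, tn=9890, fp=10)
competitor = dict(tp=95, fn=5, tn=9800, fp=100)

your_acc = accuracy(**yours)
comp_acc = accuracy(**competitor)
your_fnr = false_negative_rate(yours["tp"], yours["fn"])
comp_fnr = false_negative_rate(competitor["tp"], competitor["fn"])
```

With these counts, accuracy alone ranks the two systems in the opposite order to the false negative rate, which is the kind of mismatch the question describes.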

Question 14 (1 point)
You’ve handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because it is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?

Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.

Put the 1,000 images into the training set so as to try to do better on these birds.

Try data augmentation/data synthesis to get more images of the new type of bird.

Add the 1,000 images into your dataset and reshuffle into a new train/dev/test split.

Question 15 (1 point)
The City Council thinks that having more Cats in the city would help scare off birds. They are so happy with your work on the Bird detector that they also hire you to build a Cat detector. (Wow, Cat detectors are just incredibly useful, aren’t they?) Because of years of working on Cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that apply.)

Having built a good Bird detector, you should be able to take the same model and hyperparameters and just apply it to the Cat dataset, so there is no need to iterate.

Needing two weeks to train will limit the speed at which you can iterate.

If 100,000,000 examples is enough to build a good enough Cat detector, you might be better off training with just 10,000,000 examples to gain a ≈10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it’s trained on less data.

Buying faster computers could speed up your team’s iteration speed and thus your team’s productivity.

Week 2 ML Strategy (2)

Learning Objectives

• Understand what multi-task learning and transfer learning are
• Recognize bias, variance and data-mismatch by looking at the performances of your algorithm on train/dev/test sets

Machine Learning flight simulator

Quiz: Autonomous driving (case study), 15 questions

Question 1 (1 point)
To help you practice strategies for machine learning, in this week we’ll present another scenario and ask how you would act. We think this “simulator” of working in a machine learning project will give a taste of what leading a machine learning project could be like!

You are employed by a startup building self-driving cars. You are in charge of detecting road signs (stop sign, pedestrian crossing sign, construction ahead sign) and traffic signals (red and green lights) in images. The goal is to recognize which of these objects appear in each image. As an example, the above image contains a pedestrian crossing sign and red traffic lights.

Your 100,000 labeled images are taken using the front-facing camera of your car. This is also the distribution of data you care most about doing well on. You think you might be able to get a much larger dataset off the internet, that could be helpful for training even if the distribution of internet data is not the same.

You are just getting started on this project. What is the first thing you do? Assume each of the steps below would take about an equal amount of time (a few days).

Spend a few days collecting more data using the front-facing camera of your car, to better understand how much data per unit time you can collect.

Spend a few days getting the internet data, so that you understand better what data is available.

Spend a few days checking what is human-level performance for these tasks so that you can get an accurate estimate of Bayes error.

Spend a few days training a basic model and see what mistakes it makes.

Question 2 (1 point)
Your goal is to detect road signs (stop sign, pedestrian crossing sign, construction ahead sign) and traffic signals (red and green lights) in images. The goal is to recognize which of these objects appear in each image. You plan to use a deep neural network with ReLU units in the hidden layers.

A softmax activation would be a good choice for the output layer because this is a multi-task learning problem. True/False?

Question 3 (1 point)
You are carrying out error analysis and counting up what errors the algorithm makes. Which of these datasets do you think you should manually go through and carefully examine, one image at a time?

500 images on which the algorithm made a mistake

10,000 images on which the algorithm made a mistake

500 randomly chosen images

10,000 randomly chosen images

Question 4 (1 point)
After working on the data for several weeks, your team ends up with the following data:

• 100,000 labeled images taken using the front-facing camera of your car.
• 900,000 labeled images of roads downloaded from the internet.
• Each image’s labels precisely indicate the presence of any specific road signs and traffic signals or combinations of them. For example, $y^{(i)} = [1\ 0\ 0\ 1\ 0]^T$ means the image contains a stop sign and a red traffic light.

Because this is a multi-task learning problem, you need to have all your $y^{(i)}$ vectors fully labeled. If one example is equal to $[0\ ?\ 1\ 1\ ?]^T$ then the learning algorithm will not be able to use that example. True/False?
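
For reference when thinking about partially labeled vectors, here is a minimal sketch of a multi-label logistic loss that only sums over entries that actually have a label. Representing "?" as `np.nan`, averaging over the labeled entries, and the name `masked_logistic_loss` are all assumptions made for this example.

```python
import numpy as np

def masked_logistic_loss(y_true, y_prob):
    """Mean logistic loss over labeled entries only.

    y_true: shape (n_labels, m), with np.nan marking unlabeled entries.
    y_prob: shape (n_labels, m), predicted probabilities in (0, 1).
    """
    mask = ~np.isnan(y_true)                 # True where a label exists
    y = np.where(mask, y_true, 0.0)          # placeholder for missing labels
    p = np.clip(y_prob, 1e-12, 1 - 1e-12)    # avoid log(0)
    per_entry = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return (per_entry * mask).sum() / mask.sum()

# One example labeled [0 ? 1 1 ?], with hypothetical predictions.
y_true = np.array([[0.0], [np.nan], [1.0], [1.0], [np.nan]])
y_prob = np.array([[0.1], [0.9], [0.8], [0.7], [0.2]])
loss = masked_logistic_loss(y_true, y_prob)
```

The two unlabeled entries contribute nothing to the loss, while the three labeled ones still do.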

Question 5 (1 point)
The distribution of data you care about contains images from your car’s front-facing camera, which come from a different distribution than the images you were able to find and download off the internet. How should you split the dataset into train/dev/test sets?

Choose the training set to be the 900,000 images from the internet along with 20,000 images from your car’s front-facing camera. The 80,000 remaining images will be split equally in dev and test sets.

Mix all the 100,000 images with the 900,000 images you found online. Shuffle everything. Split the 1,000,000 images dataset into 600,000 for the training set, 200,000 for the dev set and 200,000 for the test set.

Choose the training set to be the 900,000 images from the internet along with 80,000 images from your car’s front-facing camera. The 20,000 remaining images will be split equally in dev and test sets.

Mix all the 100,000 images with the 900,000 images you found online. Shuffle everything. Split the 1,000,000 images dataset into 980,000 for the training set, 10,000 for the dev set and 10,000 for the test set.
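
As an illustration of how a split can keep the dev and test sets on the distribution you care about, the sketch below builds one of the layouts listed above at 1/100 scale: all internet images plus part of the car images go to training, and the remaining car images are divided equally between dev and test. The function name `build_split`, the scaled-down counts and the use of `(source, index)` tuples as stand-ins for images are assumptions for this example.

```python
import random

def build_split(n_car, n_internet, n_car_in_train, seed=0):
    """Return (train, dev, test) lists of (source, index) placeholders."""
    rng = random.Random(seed)
    car = [("car", i) for i in range(n_car)]
    internet = [("internet", i) for i in range(n_internet)]
    rng.shuffle(car)

    # Training mixes both sources.
    train = internet + car[:n_car_in_train]
    rng.shuffle(train)

    # Dev and test come only from the car camera, split equally.
    held_out = car[n_car_in_train:]
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]

# 1/100 scale of 100,000 car images and 900,000 internet images.
train, dev, test = build_split(n_car=1_000, n_internet=9_000, n_car_in_train=800)
```

Changing `n_car_in_train` lets you compare how many target-distribution images each layout leaves for evaluation.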

Question 6 (1 point)
Assume you’ve finally chosen the following split of the data:

Dataset | Contains | Error of the algorithm
Training | 940,000 images randomly picked from (900,000 internet images + 60,000 car’s front-facing camera images) | 8.8%
Training-Dev | 20,000 images randomly picked from (900,000 internet images + 60,000 car’s front-facing camera images) | 9.1%
Dev | 20,000 images from your car’s front-facing camera | 14.3%
Test | 20,000 images from the car’s front-facing camera | 14.8%

You also know that human-level error on the road sign and traffic signals classification task is around 0.5%. Which of the following are True? (Check all that apply).

You have a large avoidable-bias problem because your training error is quite a bit higher than the human-level error.

You have a large variance problem because your model is not generalizing well to data from the same training distribution but that it has never seen before.

Your algorithm overfits the dev set because the error of the dev and test sets are very close.

You have a large data-mismatch problem because your model does a lot better on the training-dev set than on the dev set

You have a large variance problem because your training error is quite a bit higher than the human-level error.
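
The table in question 6 can be summarized as four gaps between consecutive error levels, using human-level error as a proxy for Bayes error. The sketch below computes them from the figures in the table; the function name `error_gaps` and the dictionary keys are labels chosen for this example.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Gaps between consecutive error levels, in percentage points."""
    return {
        "avoidable_bias": round(train - human, 1),     # human-level -> training
        "variance": round(train_dev - train, 1),       # training -> training-dev
        "data_mismatch": round(dev - train_dev, 1),    # training-dev -> dev
        "dev_overfitting": round(test - dev, 1),       # dev -> test
    }

# Values from question 6, in percent.
gaps = error_gaps(human=0.5, train=8.8, train_dev=9.1, dev=14.3, test=14.8)
for name, value in gaps.items():
    print(f"{name}: {value} points")
```

Reading the gaps side by side makes it easier to judge which of the statements above the numbers support.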

Question 7 (1 point)
Based on the table from the previous question, a friend thinks that the training data distribution is much easier than the dev/test distribution. What do you think?

Your friend is right. (I.e., Bayes error for the training data distribution is probably lower than for the dev/test distribution.)

Your friend is wrong. (I.e., Bayes error for the training data distribution is probably higher than for the dev/test distribution.)

There’s insufficient information to tell if your friend is right or wrong.

Question 8 (1 point)
You decide to focus on the dev set and check by hand what the errors are due to. Here is a table summarizing your discoveries:

Overall dev set error 14.3%
Errors due to incorrectly labeled data 4.1%
Errors due to foggy pictures 8.0%
Errors due to rain drops stuck on your car’s front-facing camera 2.2%
Errors due to other causes 1.0%

In this table, 4.1%, 8.0%, etc. are a fraction of the total dev set (not just examples your algorithm mislabeled). I.e. about 8.0/14.3 = 56% of your errors are due to foggy pictures.

The results from this analysis imply that the team’s highest priority should be to bring more foggy pictures into the training set so as to address the 8.0% of errors in that category. True/False?

True because it is the largest category of errors. As discussed in lecture, we should prioritize the largest category of error to avoid wasting the team’s time.

True because it is greater than the other error categories added together (8.0 > 4.1+2.2+1.0).

False because this would depend on how easy it is to add this data and how much your team thinks it’ll help.

False because data augmentation (synthesizing foggy images from clean/non-foggy images) is more efficient.

Question 9 (1 point)
You can buy a specially designed windshield wiper that helps wipe off some of the raindrops on the front-facing camera. Based on the table from the previous question, which of the following statements do you agree with?

2.2% would be a reasonable estimate of the maximum amount this windshield wiper could improve performance.

2.2% would be a reasonable estimate of the minimum amount this windshield wiper could improve performance.

2.2% would be a reasonable estimate of how much this windshield wiper will improve performance.

2.2% would be a reasonable estimate of how much this windshield wiper could worsen performance in the worst case.
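
The arithmetic behind the error analysis table in question 8 (for example 8.0/14.3 ≈ 56%) can be reproduced with a few lines. The dictionary keys are shortened labels chosen for this sketch; the percentages are the ones in the table, each expressed as a fraction of the whole dev set.

```python
overall_dev_error = 14.3  # percent of the dev set

# Percent of the dev set attributed to each category.
categories = {
    "incorrectly labeled data": 4.1,
    "foggy pictures": 8.0,
    "rain drops on camera": 2.2,
    "other causes": 1.0,
}

# Share of all errors that each category accounts for.
shares = {name: pct / overall_dev_error for name, pct in categories.items()}

for name, pct in categories.items():
    print(f"{name}: {pct}% of dev set, {shares[name]:.0%} of errors")

category_total = sum(categories.values())
```

Note that the four categories add up to more than 14.3%. That would be consistent with a single image falling into more than one category, although the quiz does not say so.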

Question 10 (1 point)
You decide to use data augmentation to address foggy images. You find 1,000 pictures of fog off the internet, and “add” them to clean images to synthesize foggy days, like this:

Which of the following statements do you agree with?

So long as the synthesized fog looks realistic to the human eye, you can be confident that the synthesized data is accurately capturing the distribution of real foggy images, since human vision is very accurate for the problem you’re solving.

There is little risk of overfitting to the 1,000 pictures of fog so long as you are combining them with a much larger (>>1,000) set of clean/non-foggy images.

Adding synthesized images that look like real foggy pictures taken from the front-facing camera of your car to the training dataset won’t help the model improve because it will introduce avoidable-bias.
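
The quiz does not say how the fog pictures are "added" to clean images. One simple possibility, shown here only as an assumption, is a per-pixel weighted blend of the two images. The function name `add_fog` and the `alpha` parameter are hypothetical.

```python
import numpy as np

def add_fog(clean, fog, alpha=0.5):
    """Blend a fog image into a clean image of the same shape.

    alpha=0 returns the clean image, alpha=1 returns the fog image.
    Both inputs are uint8 arrays of shape (height, width, channels).
    """
    if clean.shape != fog.shape:
        raise ValueError("clean and fog images must have the same shape")
    blended = (1 - alpha) * clean.astype(np.float64) + alpha * fog.astype(np.float64)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # stand-in for a road image
fog = np.full((4, 4, 3), 200, dtype=np.uint8)                 # stand-in for a fog picture
foggy = add_fog(clean, fog, alpha=0.6)
```

Whatever the blending method, every synthesized image still draws on the same small pool of fog pictures.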

Question 11 (1 point)
After working further on the problem, you’ve decided to correct the incorrectly labeled data on the dev set. Which of these statements do you agree with? (Check all that apply).

You should also correct the incorrectly labeled data in the test set, so that the dev and test sets continue to come from the same distribution

You should correct incorrectly labeled data in the training set as well so as to avoid your training set now being even more different from your dev set.

You should not correct the incorrectly labeled data in the test set, so that the dev and test sets continue to come from the same distribution

You should not correct incorrectly labeled data in the training set as well so as to avoid your training set now being even more different from your dev set.

Question 12 (1 point)
So far your algorithm only recognizes red and green traffic lights. One of your colleagues in the startup is starting to work on recognizing a yellow traffic light. (Some countries call it an orange light rather than a yellow light; we’ll use the US convention of calling it yellow.) Images containing yellow lights are quite rare, and she doesn’t have enough data to build a good model. She hopes you can help her out using transfer learning.

What do you tell your colleague?

She should try using weights pre-trained on your dataset, and fine-tuning further with the yellow-light dataset.

If she has (say) 10,000 images of yellow lights, randomly sample 10,000 images from your dataset and put your and her data together. This prevents your dataset from “swamping” the yellow lights dataset.

You cannot help her because the distribution of data you have is different from hers, and is also lacking the yellow label.

Recommend that she try multi-task learning instead of transfer learning using all the data.

Question 13 (1 point)
Another colleague wants to use microphones placed outside the car to better hear if there are other vehicles around you. For example, if there is a police vehicle behind you, you would be able to hear their siren. However, they don’t have much data to train this audio system. How can you help?

Transfer learning from your vision dataset could help your colleague get going faster. Multi-task learning seems significantly less promising.

Multi-task learning from your vision dataset could help your colleague get going faster. Transfer learning seems significantly less promising.

Either transfer learning or multi-task learning could help your colleague get going faster.

Neither transfer learning nor multi-task learning seems promising.

Question 14 (1 point)
To recognize red and green lights, you have been using this approach:

• (A) Input an image (x) to a neural network and have it directly learn a mapping to make a prediction as to whether there’s a red light and/or green light (y).

A teammate proposes a different, two-step approach:

• (B) In this two-step approach, you would first (i) detect the traffic light in the image (if any), then (ii) determine the color of the illuminated lamp in the traffic light.

Between these two, Approach B is more of an end-to-end approach because it has distinct steps for the input end and the output end. True/False?

Question 15 (1 point)
Approach A (in the question above) tends to be more promising than approach B if you have a __ (fill in the blank).

Large training set

Multi-task learning problem.

Large bias problem.

Problem with a high Bayes error.

Convolutional Neural Networks

About this course: This course will teach you how to build convolutional neural networks and apply them to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.

You will:

• Understand how to build a convolutional neural network, including recent variations such as residual networks.
• Know how to apply convolutional networks to visual detection and recognition tasks.
• Know how to use neural style transfer to generate art.
• Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.

This is the fourth course of the Deep Learning Specialization.

Who is this class for:

• Learners that took the first two courses of the specialization. The third course is recommended.
• Anyone that already has a solid understanding of densely connected neural networks, and wants to learn convolutional neural networks or work with image data.

Week 1 Foundations of Convolutional Neural Networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Learning Objectives

• Understand the convolution operation
• Understand the pooling operation
• Remember the vocabulary used in convolutional neural network (padding, stride, filter, …)
• Build a convolutional neural network for image multi-class classification
