Final Project
This course is available on Udacity as ud170.

# Data Analysis Process

Otherwise, you can find the free course here.

## Intro to CSVs

If you’d like to learn more about data wrangling, check out the Udacity course Data Wrangling with MongoDB.

## CSVs in Python

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/22sQCo6ovH0.mp4

### Python’s csv Module

This page contains documentation for Python’s csv module. Instead of csv, you’ll be using unicodecsv in this course. unicodecsv works exactly the same as csv, but it comes with Anaconda and has support for unicode. The csv documentation page is still the best way to learn how to use the unicodecsv library, since the two libraries work exactly the same way.
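
As a minimal sketch of reading a CSV into a list of dictionaries, the example below uses the standard csv module on Python 3 (the course itself uses unicodecsv on Python 2). The file contents and column names are made up for illustration.

```python
import csv
import io

# Hypothetical CSV contents standing in for a file such as enrollments.csv.
raw_text = """account_key,status,join_date
448,canceled,2014-11-10
700,current,2015-01-02
"""

# DictReader yields one dictionary per row, keyed by the header line.
reader = csv.DictReader(io.StringIO(raw_text))
enrollments = list(reader)

print(enrollments[0]["account_key"])
```

Every value comes back as a string, which is why the data types need fixing afterwards.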

### Iterators in Python

This page explains the difference between iterators and lists in Python, and how to use iterators.

### Solutions

IPND students: Look at the end of this lesson for Quiz Solutions

## Fixing Data Types

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/7NSYtdVrlRE.mp4

https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/AO8vSyAtfV4.mp4

## Investigating the Data

Now you’ve started the data wrangling process by loading the data and making sure it’s in a good format. The next step is to investigate a bit and see if there are any inconsistencies or problems in the data that you’ll need to clean up.

For each of the three files you’ve loaded, find the total number of rows in the csv and the number of unique students. To find the number of unique students in each table, you might want to try creating a set of the account keys.
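
One way to count rows and unique students, sketched with a small hypothetical table (the rows and the account_key field values are assumptions for illustration):

```python
# Hypothetical rows, shaped like the dictionaries produced by a CSV reader.
enrollments = [
    {"account_key": "448", "status": "canceled"},
    {"account_key": "448", "status": "current"},
    {"account_key": "700", "status": "current"},
]

# Total number of rows in the table.
total_rows = len(enrollments)

# A set keeps each account key only once, so its size is the unique count.
unique_students = {row["account_key"] for row in enrollments}

print(total_rows, len(unique_students))
```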

Again, in case you’re not finished with your local setup, you can complete this exercise in the Udacity code editor. You’ll need to run the next exercise locally, though, so if you haven’t finished setting up, you should do that now.

## Problems in the Data

### Removing an Element from a Dictionary

If you’re not sure how to remove an element from a dictionary, this post might be helpful.
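
A short sketch of two common ways to remove a dictionary entry, del and pop(). The record and its field names are hypothetical.

```python
# Hypothetical engagement record.
record = {"acct": "448", "utc_date": "2015-01-09", "num_courses_visited": "1"}

# Copy the value to a new key, then delete the old key with del.
record["account_key"] = record["acct"]
del record["acct"]

# pop() removes a key and returns its value; the second argument is a
# default that avoids a KeyError when the key is missing.
removed = record.pop("num_courses_visited", None)

print(record)
```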

### Solutions

IPND students: Look at the end of this lesson for Quiz Solutions

### Updated Code for Previous Exercise

After running the above code, Caroline also shows rewriting the solution from the previous exercise to the following code:

## Missing Engagement Records

### Printing a Single Row

This page describes how to use Python’s break statement, which might be helpful for printing only a single problem record.
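
A minimal sketch of using break to print only the first problem record. The records and the set of enrolled keys are hypothetical.

```python
# Hypothetical engagement records and enrolled account keys.
engagement = [{"account_key": "1"}, {"account_key": "9"}, {"account_key": "10"}]
enrolled_keys = {"1"}

first_problem = None
for record in engagement:
    if record["account_key"] not in enrolled_keys:
        first_problem = record
        print(record)
        break  # Stop after the first record that has no enrollment.
```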

IPND students: Look at the end of this lesson for Quiz Solutions

## Refining the Question

### Exploratory Data Analysis

If you’d like to learn more about the exploratory phase of the data analysis process, check out the Udacity course Data Analysis with R.

### Solutions

IPND students: Look at the end of this lesson for Quiz Solutions

## Getting Data from First Week

Note that paid students may have canceled from other courses before paying, and the suggested solution will retain records from these other enrollments.

## Quiz: Making Histograms

### Visualizing data

Even though you know the mean, standard deviation, maximum, and minimum of various metrics, there are a lot of other facts about each metric that would be nice to know. Are more values close to the minimum or the maximum? What is the median? And so on.

Instead of printing out more statistics, at this point it makes sense to visualize the data using a histogram.

### Making histograms in Python

To make a histogram in Python, you can use the matplotlib library, which comes with Anaconda. The following code will make a histogram of an example list of data points called data.

The line %matplotlib inline is specifically for IPython notebook, and causes your plots to appear in your notebook rather than a new window. If you are not using IPython notebook, you should not include this line, and instead you should add the line plt.show() at the bottom to show the plot in a new window.
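
The original code is not reproduced in these notes. A minimal sketch of such a histogram, assuming a made-up list named data, might look like this:

```python
import matplotlib.pyplot as plt

# Hypothetical example data points.
data = [1, 2, 1, 3, 3, 1, 4, 2]

# In a notebook, put "%matplotlib inline" in a cell before this code.
# hist() draws the histogram and returns the bin counts and bin edges.
counts, bin_edges, patches = plt.hist(data)

# In a plain script, add plt.show() here to open the plot window.
```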

### Making histograms of student data

Now use this method to make a histogram of each of the three metrics we looked at for both students who pass the subway project and students who don’t. That is, you should create 6 histograms. Do any of the metrics have histograms with very different shapes for students who pass the subway project vs. those who don’t?

You can also create histograms of the metrics you explored on your own if you’d like.

## Are your Results Just Noise?

### Statistics

If you’d like to learn more about statistics, which you can use to rigorously determine how likely it is that your results are due to chance, check out the Udacity courses Intro to Descriptive Statistics and Intro to Inferential Statistics.

## Correlation Does Not Imply Causation

### Cheese and Bedsheet Tangling

To see the plot shown in the video, as well as many other amusing or strange correlations, check out this website.

### A/B Testing

To learn more about using online experiments to determine whether one change causes another, take the Udacity course A/B Testing.

## Predicting Based on Many Features

### Machine Learning

To learn more about using machine learning to automatically make predictions, take the Udacity course Intro to Machine Learning.

## Quiz: Improving Plots and Sharing Findings

In matplotlib, you can add axis labels using plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For histograms, you usually only need an x-axis label, but for other plot types a y-axis label may also be needed. You can also add a title using plt.title("Title of plot").

### Making plots look nicer with seaborn

You can automatically make matplotlib plots look nicer using the seaborn library. This library is not automatically included with Anaconda, but Anaconda includes something called a package manager to make it easier to add new libraries. The package manager is called conda, and to use it, you should open the Command Prompt (on a PC) or terminal (on Mac or Linux), and type the command conda install seaborn.

If you are using a different Python installation than Anaconda, you may have a different package manager. The most common ones are pip and easy_install, and you can use them with the commands pip install seaborn or easy_install seaborn respectively.

Once you have installed seaborn, you can import it anywhere in your code using the line import seaborn as sns. Then any plot you make afterwards will automatically look better. Give it a try!

If you’re wondering why the abbreviation for seaborn is sns, it’s because seaborn was named after the character Samuel Norman Seaborn from the show The West Wing, and sns are his initials.

The seaborn package also includes some extra functions you can use to make complex plots that would be difficult in matplotlib. We won’t be covering those in this course, but if you’d like to see what functions seaborn has available, you can look through the documentation.

You’ll also frequently want to add some arguments to your plot to tune how it looks. You can see what arguments are available on the documentation page for the hist function. One common argument to pass is the bins argument, which sets the number of bins used by your histogram. For example, plt.hist(data, bins=20) would make sure your histogram has 20 bins.
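
Putting the labelling calls and the bins argument together, a hedged sketch with hypothetical data and label text could be:

```python
import matplotlib.pyplot as plt

# Hypothetical minutes spent by ten students.
data = [0, 5, 10, 15, 20, 20, 25, 30, 35, 40]

# bins=20 asks for exactly 20 equal-width bins.
counts, bin_edges, patches = plt.hist(data, bins=20)

# Label the x axis and give the plot a title.
plt.xlabel("Minutes spent in the first week")
plt.title("Distribution of minutes spent (example data)")

# In a plain script, add plt.show() here to open the plot window.
```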

### Improving one of your plots

Use these techniques to improve at least one of the plots you made earlier.

Finally, decide which of the discoveries you made this lesson you would most want to communicate to someone else, and write a forum post sharing your findings.

## Conclusion

L1_Solution_Code.ipynb

## Quiz Solutions

### Checking for more problem records

num_problem_students

### Refining the Question

Note that if you switch the order of the conditions in the second if statement, like so:

    if (enrollment_date > paid_students[account_key] or
            account_key not in paid_students):

you will most likely get an error. Why do you think that is? Check out this Stack Overflow discussion to find out more: http://stackoverflow.com/questions/13960657/does-python-evaluate-ifs-conditions-lazily
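
The behaviour can be reproduced with a small sketch. The dictionary contents and dates are hypothetical; the point is that or stops evaluating as soon as its left operand is true.

```python
# Hypothetical data: account "700" never paid, so it is not in the dictionary.
paid_students = {"448": "2014-11-10"}
account_key = "700"
enrollment_date = "2015-01-02"

# Membership test first: the lookup on the right is never evaluated.
safe = (account_key not in paid_students or
        enrollment_date > paid_students[account_key])

# Lookup first: the missing key raises KeyError before the membership test.
try:
    swapped = (enrollment_date > paid_students[account_key] or
               account_key not in paid_students)
    raised = False
except KeyError:
    raised = True

print(safe, raised)
```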

### Debugging Data Analysis Code

Here is the code Caroline shows in the solution video:

Alternatively, you can find the account key with the maximum minutes using this shorthand notation:
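
The course code is not reproduced in these notes. One possible shorthand, assuming a hypothetical dictionary that maps account keys to total minutes, uses max() with a key function:

```python
# Hypothetical totals per account key.
total_minutes_by_account = {"448": 120.5, "700": 3564.7, "163": 42.0}

# max() compares the (key, value) pairs by their value (pair[1]).
max_account, max_minutes = max(total_minutes_by_account.items(),
                               key=lambda pair: pair[1])

print(max_account, max_minutes)
```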

### Fixing Bug in within_one_week()

She also updated the code for the within_one_week function to the following:

### Lessons Completed in First Week

First, Caroline refactors the given code to analyze total minutes spent in the first week into the following:

Then she called the functions she created to analyze the lessons completed in the first week as follows:

### Number of Visits in the First Week

Here is the code Caroline shows in the solution video. First she ran this code to create the has_visited field:

Then, after recreating the engagement_by_account dictionary with the updated data, she ran the following code to analyze days visited in the first week:

### Splitting out Passing Students

Here is the code Caroline shows in the solution video:

### Comparing the Two Student Groups

Here is the code Caroline shows in the solution video:

### Making Histograms

Here is the code Caroline shows in the solution video:

#### Fixing the Number of Bins

To change how many bins are shown for each plot, try using the bins argument to the hist function. You can find documentation for the hist function and the arguments it takes here.

### Improving Plots and Sharing Findings

Here is the code Caroline shows in the solution video:

# Numpy and Pandas for 1D Data

## Quiz: Gapminder Data

### Gapminder data

The data in this lesson was obtained from the site gapminder.org. The variables included are:

- Aged 15+ Employment Rate (%)
- Life Expectancy (years)
- Primary school completion (% of boys)
- Primary school completion (% of girls)

## Quiz: NumPy Arrays

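Pandas | NumPy
---|---
Series | Array

Similarities and differences between a NumPy array and a Python list:

Similarity | Difference
---|---
Both can be traversed with a for loop | All elements of a NumPy array have the same type

argmax() returns the position of the maximum value, i.e. the position of max().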
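
A small sketch of argmax(), using made-up country and employment values:

```python
import numpy as np

# Hypothetical example values.
countries = np.array(["Afghanistan", "Albania", "Algeria", "Angola"])
employment = np.array([55.7, 51.4, 50.5, 75.7])

# argmax() gives the position of the largest value, not the value itself.
i = employment.argmax()

print(countries[i], employment[i])
```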

## Quiz: Vectorized Operations

The + operation:

Python list | NumPy array
---|---
Concatenates the two lists | Adds the arrays element by element
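
The difference in how + behaves can be checked with a short sketch (the values are arbitrary):

```python
import numpy as np

# For Python lists, + concatenates.
list_sum = [1, 2, 3] + [4, 5, 6]

# For NumPy arrays, + is vectorized and adds element by element.
array_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])

print(list_sum)
print(array_sum)
```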

## Quiz: Calculate Overall Completion Rate

### Bitwise Operations

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
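
A brief sketch of the difference between bitwise and logical and on integer arrays (the values are chosen for illustration):

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 1, 0])

# Bitwise and works on the binary digits: 1 & 2 is 0 even though both are non-zero.
bitwise = a & b

# Logical and treats every non-zero value as True.
logical = np.logical_and(a, b)

# Converting to booleans first makes & behave like a logical and.
bool_and = (a != 0) & (b != 0)

# On boolean arrays, ~ is a logical not.
negated = ~np.array([True, False])

print(bitwise, logical, bool_and, negated)
```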

For the quiz, assume that the number of males and females are equal i.e. we can take a simple average to get an overall completion rate.

In the solution, we may want to divide by 2. (a float) instead of just 2 (an integer). This is because in Python 2, dividing an integer by another integer (2) drops fractions, so if our inputs are also integers, we may end up losing information. If we divide by a float (2.) then we will definitely retain decimal values.
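
The sketch below is written for Python 3, where / always keeps the fraction, so the // operator is used to imitate Python 2's integer division. The completion values are hypothetical.

```python
import numpy as np

# Hypothetical integer completion rates.
female = np.array([97, 45])
male = np.array([95, 44])

# Floor division imitates Python 2's integer / integer: fractions are dropped.
floor_avg = (female + male) // 2

# Dividing by a float keeps the decimal part.
true_avg = (female + male) / 2.

print(floor_avg, true_avg)
```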

Erratum: The output of cell [3] in the solution video is incorrect: it appears that the male variable has not been set to the proper value set in cell [2]. All values except for the first will be different. The correct output in cell Out[3]: should instead start with:

array([ 192.83205, 205.28855, 202.82258, 186.63257, 206.91115,


## Quiz: + vs. +=

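Note that += modifies an array in place, while + creates a new array. Starting from a = np.array([1, 2, 3, 4]) and b = a, running a += np.array([1, 1, 1, 1]) leaves b as array([2, 3, 4, 5]), whereas running a = a + np.array([1, 1, 1, 1]) leaves b as array([1, 2, 3, 4]).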
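
A runnable sketch of the two cases, using arbitrary values:

```python
import numpy as np

# Case 1: += changes the existing array, so b (the same object) changes too.
a = np.array([1, 2, 3, 4])
b = a
a += np.array([1, 1, 1, 1])
after_in_place = b.copy()

# Case 2: + builds a new array and rebinds a, so b keeps the old values.
a = np.array([1, 2, 3, 4])
b = a
a = a + np.array([1, 1, 1, 1])
after_new_array = b.copy()

print(after_in_place, after_new_array)
```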

## Quiz: In-Place vs. Not In-Place

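Note that slicing a NumPy array creates a view of the original array rather than a copy. Given a = np.array([1, 2, 3, 4]), assigning 100 to the first element of a slice of a changes a itself to array([100, 2, 3, 4]).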
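
A sketch contrasting NumPy slices (views) with Python list slices (copies), using arbitrary values:

```python
import numpy as np

# A NumPy slice is a view: writing to it writes to the original array.
a = np.array([1, 2, 3, 4])
array_slice = a[:3]
array_slice[0] = 100

# A Python list slice is a copy: the original list is untouched.
lst = [1, 2, 3, 4]
list_slice = lst[:3]
list_slice[0] = 100

print(a, lst)
```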


## Quiz: Series Indexes

Useful Series operations: s.describe(), s.loc[INDEX], and s.iloc[0].

### Pandas idxmax()

Note: The argmax() function mentioned in the videos has been re-aliased to idxmax(), and returns the index of the first maximally-valued element. You can find documentation for the idxmax() function in Pandas here.
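
A short sketch of label-based and position-based access on a Series, together with idxmax(). The values are hypothetical.

```python
import pandas as pd

# Hypothetical employment rates indexed by country name.
employment = pd.Series(
    [55.7, 51.4, 50.5, 75.7],
    index=["Afghanistan", "Albania", "Algeria", "Angola"],
)

# idxmax() returns the index label of the first maximum value.
top_country = employment.idxmax()

# loc looks up by index label, iloc by integer position.
by_label = employment.loc["Angola"]
by_position = employment.iloc[0]

print(top_country, by_label, by_position)
```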

## Quiz: Filling Missing Values

Remember that Jupyter notebooks will just print out the results of the last expression run in a code cell as though a print expression was run. If you want to save the results of your operations for later, remember to assign the results to a variable or, for some Pandas functions like .dropna(), use inplace = True to modify the starting object without needing to reassign it.
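
A sketch of two ways to deal with the missing values that appear when adding Series with different indexes. The Series contents are made up.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the sum.
with_nan = s1 + s2

# Option 1: drop the missing values, assigning the result to a variable.
dropped = with_nan.dropna()

# Option 2: treat a missing value as 0 while adding.
filled = s1.add(s2, fill_value=0)

print(dropped)
print(filled)
```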

## Quiz: Pandas Series apply()

Note: The grader will execute your finished reverse_names(names) function on some test names Series when you submit your answer. Make sure that this function returns another Series with the transformed names.

### split()

You can find documentation for Python’s split() function here.
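
One possible shape for reverse_names(), assuming each name is "First Last" and the target format is "Last, First" (both formats are assumptions made for this sketch):

```python
import pandas as pd

def reverse_name(name):
    # Split "First Last" on the space and swap the two parts.
    first, last = name.split(" ")
    return last + ", " + first

def reverse_names(names):
    # apply() calls reverse_name on every element and returns a new Series.
    return names.apply(reverse_name)

names = pd.Series(["Andre Agassi", "Barry Bonds"])
reversed_names = reverse_names(names)
print(reversed_names)
```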

## Quiz: Plotting in Pandas

If the variable data is a NumPy array or a Pandas Series, then just as for a list, the code plt.hist(data) (with matplotlib.pyplot imported as plt) will create a histogram of the data.

Pandas also has built-in plotting that uses matplotlib behind the scenes, so if data is a Series, you can create a histogram using data.hist().

There’s no difference between these two in this case, but sometimes the Pandas wrapper can be more convenient. For example, you can make a line plot of a series using data.plot(). The index of the Series will be used for the x-axis and the values for the y-axis.

In the following quiz, we’ve created Series containing the various variables we’ve been looking at this lesson. Pick a country you’re interested in, and make a plot of each variable over time.

The Udacity editor will only show one plot each time you click “Test Run”, so you can look at multiple plots by clicking “Test Run” multiple times. If you’re running plotting code locally, you may need to add the line plt.show() depending on your setup.

# Numpy and Pandas for 2D Data

## Quiz: Two-Dimensional NumPy Arrays

- Python: list of lists
- NumPy: 2D array
- Pandas: DataFrame

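axis=0 performs the operation down each column (one result per column); axis=1 performs it across each row (one result per row).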
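
A small sketch of the axis argument on a 2D array with arbitrary values:

```python
import numpy as np

a = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

# axis=0 collapses the rows, giving one sum per column.
column_sums = a.sum(axis=0)

# axis=1 collapses the columns, giving one sum per row.
row_sums = a.sum(axis=1)

print(column_sums, row_sums)
```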

## Quiz: Calculating Correlation

### Understanding and Interpreting Correlations

- This page contains some scatterplots of variables with different values of correlation.
- This page lets you use a slider to change the correlation and see how the data might look.
- Pearson’s r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson’s r will be for those relationships.

### Corrected vs. Uncorrected Standard Deviation

By default, Pandas’ std() function computes the standard deviation using Bessel’s correction. Calling std(ddof=0) ensures that Bessel’s correction will not be used.
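
The effect of ddof can be seen on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with mean 5.
scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Pandas default (ddof=1): Bessel's correction, divides by n - 1.
sample_std = scores.std()

# ddof=0: no correction, divides by n.
population_std = scores.std(ddof=0)

# NumPy's std uses ddof=0 by default.
numpy_std = np.std(scores.to_numpy())

print(sample_std, population_std, numpy_std)
```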

### Previous Exercise

The exercise where you used a simple heuristic to estimate correlation was the “Pandas Series” exercise in the previous lesson, “NumPy and Pandas for 1D Data”.

### Pearson’s r in NumPy

NumPy’s corrcoef() function can be used to calculate Pearson’s r, also known as the correlation coefficient.
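
As a hedged sketch, Pearson's r can be computed by hand from standardized values and compared with corrcoef(). The correlation() helper and the data are illustrative, not the course's solution.

```python
import numpy as np

def correlation(x, y):
    # Standardize with the uncorrected standard deviation (NumPy's default),
    # then average the products of the standardized values.
    std_x = (x - x.mean()) / x.std()
    std_y = (y - y.mean()) / y.std()
    return (std_x * std_y).mean()

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                      # Perfectly linear in x.
z = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# corrcoef() returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]

print(r_xy, r_xz, correlation(x, z))
```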

## Quiz: DataFrame Vectorized Operations

### Pandas shift()

Documentation for the Pandas shift() function is here. If you’re still not sure how the function works, try it out and see!

### Alternative Solution

As an alternative to using vectorized operations, you could also use the code return entries_and_exits.diff() to calculate the answer in a single step.
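
Both approaches can be compared on a small hypothetical table of cumulative counts:

```python
import pandas as pd

# Hypothetical cumulative entries and exits.
entries_and_exits = pd.DataFrame({
    "ENTRIESn": [100, 130, 175],
    "EXITSn": [10, 25, 45],
})

# shift(1) moves every row down by one, so subtracting it gives the
# change since the previous row (the first row becomes NaN).
via_shift = entries_and_exits - entries_and_exits.shift(1)

# diff() computes the same row-to-row difference in one step.
via_diff = entries_and_exits.diff()

print(via_diff)
```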

## Quiz: DataFrame applymap()

Note: The grader will execute your finished convert_grades(grades) function on some test grades DataFrames when you submit your answer. Make sure that this function returns a DataFrame with the converted grades. Hint: You may need to define a helper function to use with .applymap().
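
A possible sketch of convert_grades() with a helper function. The grade cutoffs are assumptions, and because recent Pandas versions offer DataFrame.map as the replacement for applymap, the sketch uses whichever one is available.

```python
import pandas as pd

def convert_grade(grade):
    # Hypothetical cutoffs for letter grades.
    if grade >= 90:
        return "A"
    elif grade >= 80:
        return "B"
    elif grade >= 70:
        return "C"
    elif grade >= 60:
        return "D"
    return "F"

def convert_grades(grades):
    # Use DataFrame.map where it exists, otherwise fall back to applymap.
    elementwise = grades.map if hasattr(grades, "map") else grades.applymap
    return elementwise(convert_grade)

grades = pd.DataFrame({"exam1": [43, 81, 95], "exam2": [64, 78, 90]})
converted = convert_grades(grades)
print(converted)
```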

## Quiz: DataFrame apply()

Note: In order to get the proper computations, we should actually be setting the value of the “ddof” parameter to 0 in the .std() function.

Note that the type of standard deviation calculated by default is different between numpy’s .std() and pandas’ .std() functions. By default, numpy calculates a population standard deviation, with “ddof = 0”. On the other hand, pandas calculates a sample standard deviation, with “ddof = 1”. If we know all of the scores, then we have a population - so to standardize using pandas, we need to set “ddof = 0”.

.apply() applies a function to each column (the default) and returns a new column, or to each row (with axis=1) and returns a new row.

.applymap() applies a function to each element.
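
A sketch of standardizing each column with apply() and ddof=0. The grades and the helper name standardize_column are hypothetical.

```python
import pandas as pd

def standardize_column(column):
    # ddof=0 gives the population standard deviation.
    return (column - column.mean()) / column.std(ddof=0)

grades = pd.DataFrame({
    "exam1": [70.0, 80.0, 90.0],
    "exam2": [60.0, 75.0, 90.0],
})

# apply() passes each column to the function and reassembles the results.
standardized = grades.apply(standardize_column)
print(standardized)
```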

## Quiz: DataFrame apply() Use Case 2

.apply() can also reduce each column to a single element. For example, df.apply(np.max) is equivalent to df.max().
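
A short sketch of this second use of apply(), on a made-up DataFrame. The second-largest lambda is an illustrative reduction with no built-in one-call equivalent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 3], "b": [20, 10, 40]})

# Each column is reduced to one value, giving a Series indexed by column name.
via_apply = df.apply(np.max)
via_method = df.max()

# A custom reduction: the second-largest value in each column.
second_largest = df.apply(lambda col: col.sort_values(ascending=False).iloc[1])

print(via_apply, second_largest)
```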

## Quiz: Calculating Hourly Entries and Exits

In the quiz where you calculated hourly entries and exits, you did so for a single set of cumulative entries. However, in the original data, there was a separate set of numbers for each station.

Thus, to correctly calculate the hourly entries and exits, it was necessary to group by station and day, then calculate the hourly entries and exits within each day.

Write a function to do that. You should use the apply() function to call the function you wrote previously. You should also make sure you restrict your grouped data to just the entries and exits columns, since your function may cause an error if it is called on non-numerical data types.

If you would like to learn more about using groupby() in Pandas, this page contains more details.
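
The grouping step can be sketched as follows. For brevity this example groups by station only (not by station and day), and the data, the UNIT column values, and the helper name are hypothetical.

```python
import pandas as pd

def get_hourly_entries_and_exits(entries_and_exits):
    # Difference between each row and the previous one.
    return entries_and_exits - entries_and_exits.shift(1)

# Hypothetical cumulative counts for two stations, interleaved in time.
ridership = pd.DataFrame({
    "UNIT": ["R001", "R002", "R001", "R002", "R001", "R002"],
    "ENTRIESn": [100, 500, 130, 520, 175, 560],
    "EXITSn": [10, 50, 25, 70, 45, 75],
})

# Restrict to the numeric columns, then apply the function per station.
# group_keys=False keeps the original row labels in the result.
hourly = (
    ridership.groupby("UNIT", group_keys=False)[["ENTRIESn", "EXITSn"]]
    .apply(get_hourly_entries_and_exits)
    .sort_index()
)
print(hourly)
```

The first row of each station is NaN because there is no earlier measurement to subtract.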

Note: You will not be able to reproduce the ENTRIESn_hourly and EXITSn_hourly columns in the full dataset using this method. When creating the dataset, we did extra processing to remove erroneous values.

### Quiz

To clarify the structure of the data, the original data recorded the cumulative number of entries on each station at four-hour intervals. For the quiz, you just need to look at the differences between consecutive measurements on each station: by computing “hourly entries”, we just mean recording the number of new tallies between each recording period as a contrast to “cumulative entries”.


## Quiz: Combining Pandas DataFrames

In the merged table on the right, the join dates in the third and fourth rows should be 5/19 and 5/11, reflecting the account key mapping in the enrollments table.
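
As a rough illustration of combining two tables on an account key, the sketch below uses merge() with made-up tables; the column names and values are assumptions, not the tables from the video.

```python
import pandas as pd

# Hypothetical tables sharing an account_key column.
submissions = pd.DataFrame({
    "account_key": [2, 3, 3, 4],
    "lesson": ["L1", "L1", "L2", "L1"],
})
enrollments = pd.DataFrame({
    "account_key": [1, 2, 3],
    "join_date": ["5/12", "5/19", "5/11"],
})

# An inner join keeps only the account keys present in both tables.
merged = submissions.merge(enrollments, on="account_key", how="inner")
print(merged)
```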

## Quiz: Plotting for DataFrames

Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame. This can be a convenient way to get a quick look at your data, especially for small DataFrames, but for more complicated plots you will usually want to use matplotlib directly.

In the following quiz, create a plot of your choice showing something interesting about the New York subway data. For example, you might create:

- Histograms of subway ridership on both days with rain and days without rain
- A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
  - If you choose this option, you may wish to use the as_index=False argument to groupby(). There is example code in the following quiz.
- A scatterplot with subway ridership on one axis and precipitation or temperature on the other

If you’re not sure how to make the plot you want, try searching on Google or take a look at the matplotlib documentation. Once you’ve created a plot you’re happy with, share what you’ve found on the forums!
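
A sketch of what as_index=False does when grouping by location. The table is hypothetical; only the column naming loosely follows the subway data.

```python
import pandas as pd

# Hypothetical ridership rows for two station locations.
subway = pd.DataFrame({
    "latitude": [40.7, 40.7, 40.8],
    "longitude": [-74.0, -74.0, -73.9],
    "ENTRIESn_hourly": [10.0, 30.0, 50.0],
})

# as_index=False keeps latitude and longitude as ordinary columns
# instead of moving them into the index of the result.
by_location = subway.groupby(
    ["latitude", "longitude"], as_index=False
)["ENTRIESn_hourly"].mean()

print(by_location)
```

Because latitude and longitude stay available as columns, they can be passed directly as the x and y values of a scatterplot, with the mean ridership as the bubble size.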

## Three-Dimensional Data

### Three-Dimensional Data

Now that you’ve worked with one-dimensional and two-dimensional data, you might be wondering how to work with three or more dimensions.

### 3D data in NumPy

NumPy arrays can have arbitrarily many dimensions. Just like you can create a 1D array from a list, and a 2D array from a list of lists, you can create a 3D array from a list of lists of lists, and so on. For example, the following code would create a 3D array:
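
The original code is not reproduced in these notes. A minimal sketch with arbitrary values:

```python
import numpy as np

# A list of lists of lists becomes a 3D array.
a = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
])

print(a.shape)     # Size along each of the three dimensions.
print(a[1, 0, 1])  # Second block, first row, second column.
```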

### 3D data in Pandas

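Pandas has a data structure called a Panel, which is similar to a DataFrame or a Series, but for 3D data. If you would like, you can learn more about Panels here. Note that Panel has since been deprecated and removed from more recent versions of Pandas, so it is only available in older installations.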