Udacity ud032 Data Wrangling with MongoDB Notebook

Course can be found here, classroom.
wiki

Data Extraction Fundamentals

Intro

Quiz: Action Time

Assessing the Quality of Data Pt. 1

Assessing the Quality of Data Pt. 2

Tabular Formats

CSV Format

You can download the datafiles from the Supporting Materials link on this page or the Course Materials page.

Pick the data format supported by your spreadsheet application, download the file, open it in the spreadsheet, then export it as “csv” file.

Compare the size of the spreadsheet file with the size of the csv file!

Beatles Discography (csv)

Quiz: Parsing CSV Files

code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# Your task is to read the input DATAFILE line by line, and for the first 10 lines (not including the header)
# split each line on "," and then for each line, create a dictionary
# where the key is the header title of the field, and the value is the value of that field in the row.
# The function parse_file should return a list of dictionaries,
# each data line in the file being a single list entry.
# Field names and values should not contain extra whitespace, like spaces or newline characters.
# You can use the Python string method strip() to remove the extra whitespace.
# You have to parse only the first 10 data lines in this exercise,
# so the returned list should have 10 entries!
import os
DATADIR = ""
DATAFILE = "beatles-diskography.csv"
def parse_file(datafile):
data = []
with open(datafile, "r") as f:
for line in f:
print line
return data
def test():
# a simple test of your implemetation
datafile = os.path.join(DATADIR, DATAFILE)
d = parse_file(datafile)
firstline = {'Title': 'Please Please Me', 'UK Chart Position': '1', 'Label': 'Parlophone(UK)', 'Released': '22 March 1963', 'US Chart Position': '-', 'RIAA Certification': 'Platinum', 'BPI Certification': 'Gold'}
tenthline = {'Title': '', 'UK Chart Position': '1', 'Label': 'Parlophone(UK)', 'Released': '10 July 1964', 'US Chart Position': '-', 'RIAA Certification': '', 'BPI Certification': 'Gold'}
assert d[0] == firstline
assert d[9] == tenthline
test()

solution:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def parse_file(datafile):
data = []
with open(datafile, "r") as f:
header = f.readline().split(",")
count = 0
for line in f:
if count == 10:
break
field = line.split(",")
entry = {}
for i, value in enumerate(field):
entry[header[i].strip()] = value.strip()
data.append(entry)
count += 1
return data

You can check the data in the dropdown in the top-left corner of the quiz starter code or download the datafile beatles-diskography.csv in the Supporting Materials below.

Python string method strip() will come in handy to get rid of the extra whitespace (that includes newline character at the end of line)

Quiz: Problematic Line

Using CSV Module

csv.DictReader()
default denote the first row as the field labels
line is dict data type
code:

1
2
3
4
5
6
import csv
datafile = ".csv"
with open(datafile, 'rb') as sd:
r = csv.DictReader(sd)
for line in r:
...

You can read more about the python csv module at the link below:
http://docs.python.org/2/library/csv.html

Intro to XLRD

You might find this video a lot more exciting if you try to run this code locally along the video! You can download the datafile from Course Materials. You can also install the xlrd library locally on your computer via python pip and the following command:

pip install xlrd
The example code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import xlrd
datafile = "2013_ERCOT_Hourly_Load_Data.xls"
def parse_file(datafile):
workbook = xlrd.open_workbook(datafile)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, col)
for col in range(sheet.ncols)]
for r in range(sheet.nrows)]
print "\nList Comprehension"
print "data[3][2]:",
print data[3][2]
print "\nCells in a nested loop:"
for row in range(sheet.nrows):
for col in range(sheet.ncols):
if row == 50:
print sheet.cell_value(row, col),
### other useful methods:
print "\nROWS, COLUMNS, and CELLS:"
print "Number of rows in the sheet:",
print sheet.nrows
print "Type of data in cell (row 3, col 2):",
print sheet.cell_type(3, 2)
print "Value in cell (row 3, col 2):",
print sheet.cell_value(3, 2)
print "Get a slice of values in column 3, from rows 1-3:"
print sheet.col_values(3, start_rowx=1, end_rowx=4)
print "\nDATES:"
print "Type of data in cell (row 1, col 0):",
print sheet.cell_type(1, 0)
exceltime = sheet.cell_value(1, 0)
print "Time in Excel format:",
print exceltime
print "Convert time to a Python datetime tuple, from the Excel float:",
print xlrd.xldate_as_tuple(exceltime, 0)
return data
data = parse_file(datafile)

Quiz: Reading Excel Files

You can download the “2013_ERCOT_Hourly_Load_Data.xls” datafile from the Supporting Materials section on this page or from this Course Materials page. Note that the code expects the data to be contained in an archive named “2013_ERCOT_Hourly_Load_Data.xls.zip”, so you will need to change the name of the downloaded archive or modify the code to run the code on your local computer.

You can also install the xlrd library locally on your computer via python pip and the following command:
pip install xlrd

2013 ERCOT hourly load data

code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
#!/usr/bin/env python
"""
Your task is as follows:
- read the provided Excel file
- find and return the min, max and average values for the COAST region
- find and return the time value for the min and max entries
- the time values should be returned as Python tuples
Please see the test function for the expected return format
"""
import xlrd
from zipfile import ZipFile
datafile = "2013_ERCOT_Hourly_Load_Data.xls"
def open_zip(datafile):
with ZipFile('{0}.zip'.format(datafile), 'r') as myzip:
myzip.extractall()
def parse_file(datafile):
workbook = xlrd.open_workbook(datafile)
sheet = workbook.sheet_by_index(0)
### example on how you can get the data
#sheet_data = [[sheet.cell_value(r, col) for col in range(sheet.ncols)] for r in range(sheet.nrows)]
### other useful methods:
# print "\nROWS, COLUMNS, and CELLS:"
# print "Number of rows in the sheet:",
# print sheet.nrows
# print "Type of data in cell (row 3, col 2):",
# print sheet.cell_type(3, 2)
# print "Value in cell (row 3, col 2):",
# print sheet.cell_value(3, 2)
# print "Get a slice of values in column 3, from rows 1-3:"
# print sheet.col_values(3, start_rowx=1, end_rowx=4)
# print "\nDATES:"
# print "Type of data in cell (row 1, col 0):",
# print sheet.cell_type(1, 0)
# exceltime = sheet.cell_value(1, 0)
# print "Time in Excel format:",
# print exceltime
# print "Convert time to a Python datetime tuple, from the Excel float:",
# print xlrd.xldate_as_tuple(exceltime, 0)
data = {
'maxtime': (0, 0, 0, 0, 0, 0),
'maxvalue': 0,
'mintime': (0, 0, 0, 0, 0, 0),
'minvalue': 0,
'avgcoast': 0
}
return data
def test():
open_zip(datafile)
data = parse_file(datafile)
assert data['maxtime'] == (2013, 8, 13, 17, 0, 0)
assert round(data['maxvalue'], 10) == round(18779.02551, 10)
test()

solution:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def parse_file(datafile):
workbook = xlrd.open_workbook(datafile)
sheet = workbook.sheet_by_index(0)
cv = sheet.col_values(1, start_rowx=1, end_rowx=None)
maxvalue = max(cv)
maxposition = cv.index(maxvalue) + 1
minvalue = min(cv)
minposition = cv.index(minvalue) + 1
max_exceltime = sheet.cell_value(maxposition, 0)
maxtime = xlrd.xldate_as_tuple(max_exceltime, 0)
min_exceltime = sheet.cell_value(minposition, 0)
mintime = xlrd.xldate_as_tuple(min_exceltime, 0)
avgcoast = sum(cv)/len(cv)
data = {
'maxtime': maxtime,
'maxvalue': maxvalue,
'mintime': mintime,
'minvalue': minvalue,
'avgcoast': avgcoast
}
return data

Intro to JSON

Data Modeling in JSON

JSON Resources

Extra Info
If you’re unfamiliar with JSON, or would just like a refresher, W3Schools has a great tutorial on the subject.

JSON Tutorial

You can also check out

http://www.json.org/

You can find information about Python’s json module on this page of the Python documentation. Note that JSON arrays are interpreted as lists and JSON objects as dictionaries, so you can use the standard Python approaches to inspect JSON data. You’ll get some practice exploring some data of this type in the next quiz.

Quiz: JSON Playground

You can check the data in the dropdown in the top-left corner of the quiz starter code.

‘Run locally’ means that you have to download or copy the file contents to your local machine, modify it and run.

To be able to do that you need to have Python installed, as well as the requests module. Please see Requests installation documentation. If you have “pip, you can install Requests by running the following command:

pip install requests
To learn more about the requests module, see the documentation here.

Quiz: Exploring JSON

Check You Out