## Introduction

• A spreadsheet is a simple way to do tabular calculations
• The first one, VisiCalc, was the original “compelling application”
• Widely used in science as a simple database, and for statistics
• Most widely used spreadsheet today is Microsoft Excel
• But it's not free
• This lecture will use `Gnumeric` instead
• Principles and UI are the same

## You Can Skip This Lecture If...

• You can enter and format data in a spreadsheet
• You know how to calculate totals, averages, and other aggregate values
• You can create a chart
• You know how conditionals work
• You know how to create lookup tables

## First Steps

• Run Gnumeric

• Spreadsheet consists of cells with unique ids like `A1` or `C4`
• Enter data by typing in cells
• [enter] moves to the next row, [tab] moves to next column
• Use mouse or arrow keys to move around

## Entering Data

• Studying the long-term effects of programming on human memory
• How well do people remember names, faces, and where they live after they've been programming for several years?
• How does performance change with repetition?
• Want to use a spreadsheet to store data and do calculations
• Enter the values shown below
• Figure 21.2: Raw Scores

## Formatting Data

• I.e., easier to understand
• Select the six numbers using either the mouse or the keyboard
• Select Format...Cells
• Under the “Number” category, change the number of decimal places to 1
• While we're here:
• Select the “1” label (to select the first row)
• Select Format...Cells again
• Under the “Font” tab, change the format to bold
• Do the same for the title column
• Figure 21.3: Scores After Formatting Titles

## Formulas

• Often want to calculate new values from old
• In this case, want to calculate each subject's overall performance
• Weighting is 30% for preliminary result, 70% for final result
• Select E2 (first empty cell in Alan Turing's row)
• Click on the “=” button to enter a formula
• Enter `(0.3*C2)+(0.7*D2)`
• `C2` and `D2` are references to other cells
• 84.6 appears in cell E2
• Figure 21.4: Scores With A Formula

## Replicating Formulas

• Could type similar formulas into cells E3 and E4
• But it would be tedious to do this for a hundred test subjects
• Instead, select E2, copy, and paste into E3, then into E4
• Turns C2 into C3 or C4 as required
• Grades are 84.6, 71.9, and 86.3
• Figure 21.5: Copying Formulas

## Built-In Functions

• What about calculating averages and other aggregate statistics?
• Don't want to have to type in `(0.01*A1+0.01*A2+...+0.01*A100)`
• Especially if we might add another row or column later on
• Solution is to use built-in functions
• Select C5
• Enter the formula `AVERAGE(C2:C4)`
• C2:C4 is a range of cells
• Copy and paste into D5 to calculate the average final score
• Add `MAX` below average, along with a label
• Spreadsheets are programs, and programs should be documented
• Then insert a row above average for `MIN`
• The spreadsheet does the right thing with relative references when rows are inserted or deleted
• Figure 21.6: Minimum, Average, and Maximum Scores

## Commonly-Used Functions

FunctionPurpose
`AND(e1,e2,...)`True if all expressions are true; false otherwise
`AVERAGE(values)`Return the average of the given values (which may be a range)
`DATE(year,month,day)`Return the number of days since January 1, 1900 for the given date
`INDEX(array,row,col)`Return the section of an array indexed by row and column indices
`LOOKUP(value,lookup_vector,result_vector)`Find a value in a lookup vector, and return the corresponding entry from the result vector
`NOT(e)`True if the expression is 0; false otherwise
`OR(e1,e2,...)`True if any expression is true; false otherwise
`RAND()`Return a random value between 0 and 1
`REPLACE(old,start,num,new)`Replace part of a string
`ROUND(number)`Round off a number
`SIN(e)`Return the sine of an expression
`TODAY()`Return the number of days since January 1, 1900 for today
Table 21.1: Gnumeric Functions

## Dependencies

• Spreadsheets update dependencies between cells automatically
• Just like Make updates dependencies between files
• A declarative programming language
• Example:
• Copy the equation for overall grade from E4 into E5, E6, and E7
• Change John von Neumann's final score from 68 to 86
• Several cells' values immediately change
• Figure 21.7: Dependencies Between Cells

• This is why people use spreadsheets: the data is the program

## Conditionals

• Who remembered enough to be able to carry out simple tasks?
• Field studies show that people need at least 75% recall
• Syntax for a conditional is `IF(condition,true_value,false_value)`
• First argument to function must be a Boolean expression
• Second is the function's value if the first argument is true
• Third is its value if the first argument is false
• Put `IF(E2>75,"success","failure")` in cell F2
• Copy and paste into F3 and F4
• Figure 21.8: Conditionals

## Multi-Valued Conditionsl

• Want more specific diagnosis than just “success” or “failure”
• Could use nested conditional expressions
• `IF(E2<70,"Failure",IF(E2<80,"Marginal",IF(E2<86,"Good","Excellent")))`
• But it would be hard to read
• And what if there were a hundred options?

## Lookup Tables

• Use a lookup table instead
• Find a value in one row, and use the contents of a corresponding cell from another row
• Syntax is `LOOKUP(value,lookup_vector,result_vector)`
• `value` is a single cell
• `lookup_vector` is part of a single row or column
• If the value isn't in the lookup vector, the spreadsheet uses the nearest cell whose value is less than or equal to it
• So 84.6 falls back to the cell containing the value 80
• `lookup_vector` must be sorted in order for this to work
• The result vector must be exactly the same length as the lookup vector

## Lookup Table Example

• Put cutoffs and evaluations in rows 9 and 10
• With documentation
• Put `LOOKUP(E2,B9:E9,B10:E10)` in cell F2
• Figure 21.9: Looking Up Results

## Absolute References

• Copy and paste into F3 to look up letter grade for John von Neumann
• Displays `#N/A`, meaning “not valid”
• Formula has been adjusted to use B10:E10 as the lookup vector and B11:E11 as the result vector
• But these cells are empty
• Figure 21.10: Lookup Failure

• Use absolute references to prevent this
• `\$B\$9` means cell B9 even when the formula is copied, rows and columns are inserted or deleted, etc.
• So change the formula for F2 to `LOOKUP(E2,\$B\$9:\$E\$9,\$B\$10:\$E\$10)`
• Figure 21.11: Absolute References in Formulas

## A Larger Data Set

• Open `solarsystem.csv`
• Orbital information about planets and satellites
• Data stored as comma-separated values
• ```Name,Position,Orbits,Distance,Period,Inclination,Eccentricity
,,,x1000km,days,degrees,degrees
Sun,-,-,-,-,-,-
Mercury,1,Sun,57910,87.97,7.00,0.21
Venus,2,Sun,108200,224.70,3.39,0.01
Earth,3,Sun,149600,365.26,0.00,0.02
Mars,4,Sun,227940,686.98,1.85,0.09
Jupiter,5,Sun,778330,4332.71,1.31,0.05
Saturn,6,Sun,1429400,10759.50,2.49,0.06
Uranus,7,Sun,2870990,30685.00,0.77,0.05```
• Click on the row below the two title rows, and select View...Freeze Panes
• Scrolling will now leave the title rows in place
• Figure 21.12: Scrolling the Solar System

## Creating Charts

• How are period and distance related?
• The best (often, the only) way to understand large data sets is to view them graphically
• To create a chart:
• Select the range D4:E78 (i.e., everything except the Sun)
• Select Insert...Chart
• Choose an XY scatterplot
• Figure 21.13: Basic Chart

## Customizing The Display

• Too many points crowded against left edge of chart
• Not surprising: values for Pluto and Jupiter's innermost satellites span orders of magnitude
• Try log-log plot
• Each ten-fold increase in scale is a single unit on the axis
• Can do it by setting options when creating the chart

## Creating A Log-Log Chart

• Set I4 to `log(D4)`, then copy, select I5:I78, and paste
• Pastes into all selected cells at once
• Set J4 to `log(E4)`, copy, and paste into
• Error: J28 and others display `#NUM!`
• Figure 21.14: Error Creating Log-Log Plot

## Fixing The Error

• The data uses negative periods for retrograde motion
• Have to scrub the data in order to use it
• Change formula to `log(abs(E4))`
• Now plot columns I and J
• Add X and Y axis titles along the way
• Figure 21.15: Log-Log Plot of Distances and Periods

## Analysis

• Looks like several straight lines lying beside one another
• The rightmost has nine points…
• …and there are nine planets
• Do the other lines correspond to Jupiter, Saturn, and other orbited bodies?
• Often use spreadsheets for this kind of exploratory data analysis
• Look for things that might be patterns
• Then apply real statistical tools to see if they are
• It's not science until you do the second step
• The human eye is very good at “seeing” patterns that aren't there

## Programming

• FIXME: describe how to manipulate Gnumeric data from Python

## Exercises

Exercise 21.1:

Spreadsheets use conditional expressions, rather than conditional statements. C/C++, Java, and Python also support conditional expressions. How are they written? When should you use them? When shouldn't you?

Exercise 21.2:

`\$B\$9` is an absolute reference to the cell B9. What does the expression `\$B9` refer to? What about `B\$9`? When would you use expressions like these?