PC204 Lecture 5

Conrad Huang

conrad@cgl.ucsf.edu

Office: GH-N453A

Topics

  • Homework review
  • List indices and slices
  • Case study code
  • Implementation of Markov analysis
  • String format operator
  • File operations
  • File names and paths
  • File formats
  • Introduction to modules
  • Where are modules kept?

Homework Review

  • 4.1 - check for duplicates
  • 4.2 - longest reducible word

List Indices and Slices

  • Python uses zero-based indexing
  • Most real-life statements use one-based indexing
  • Python index is one less than real-life index (e.g., Python 30 = real-life 31)
  • 0        1         2         3         4         5         6         7         8
    12345678901234567890123456789012345678901234567890123456789012345678901234567890 (real-life)
                                  x------xy------yz------z
    ATOM      2  CA  ALA A   2      21.954  -0.569 111.726  1.00 14.00           C  
                                  x------xy------yz------z
    01234567890123456789012345678901234567890123456789012345678901234567890123456789 (Python)
    0         1         2         3         4         5         6         7         
    
  • "If we number columns starting from 1, then the x, y and z coordinates of the atom are in columns 31-38, 39-46 and 47-54, respectively."
  • The x coordinate is in columns 31 through 38, corresponding to Python indices 30 through 37
  • Python slice syntax is start_index:end_index+1, so, again using Python indexing, to get characters 30 through 37, the slice syntax is 30:38
  • Note that the slices for the y and z coordinates are 38:46 and 46:54
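
As a small illustration of the slices above, the following sketch pulls the three coordinates out of the ATOM record shown earlier. The variable names are chosen here for illustration only.

```python
# The ATOM record from the example above (fixed-width text).
line = "ATOM      2  CA  ALA A   2      21.954  -0.569 111.726  1.00 14.00           C  "

# One-based columns 31-38, 39-46 and 47-54 become these zero-based slices.
# float() ignores the leading spaces inside each fixed-width field.
x = float(line[30:38])
y = float(line[38:46])
z = float(line[46:54])

print(x, y, z)
```

Note that each slice is exactly 8 characters wide because the end index is one past the last character wanted.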

Case Study Code

  • "Data structure selection" is about trade-offs
    • How fast will it run?
    • How much memory will it use?
    • How hard is it to implement?
  • The case study code uses:
    • design patterns
      • Reading the contents of a file
      • Sorting items by associated data
    • optional arguments
  • histogram.py
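
The following is a minimal sketch of the two design patterns and the optional argument mentioned above. It is not the actual histogram.py; the function names and the sample text are assumptions made for illustration, and a string is used in place of a file so that the example runs on its own.

```python
import string

def process_text(text):
    """Build a histogram (dictionary) mapping each word to its count."""
    hist = {}
    for line in text.splitlines():      # with a file: for line in open(name)
        for word in line.split():
            word = word.strip(string.punctuation).lower()
            if word:
                hist[word] = hist.get(word, 0) + 1
    return hist

def most_common(hist, num=3):
    """Return the num most frequent (count, word) pairs.

    num is an optional argument: callers may omit it to get the default.
    """
    # Sort by associated data: put the count first so the sort uses it.
    pairs = [(count, word) for word, count in hist.items()]
    pairs.sort(reverse=True)
    return pairs[:num]

hist = process_text("the cat and the hat and the bat")
print(most_common(hist))          # uses the default num
print(most_common(hist, num=1))   # overrides the default
```

A dictionary is used here because lookups by word are fast, at the cost of needing a separate sorting step when ordered output is wanted.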

Markov Analysis

  • Case study also mentions using Markov analysis for generating random (nonsense) sentences using statistics from some text
  • Markov chains represent a series of probability-based state changes
    • When generating a sentence, each word represents a "state"
    • The initial word in the sentence is selected based on the frequency of appearance in a "training set" of texts
    • The probability of one word following another is also derived from the training set by counting how often one particular word follows another
  • markov.py
    • We use only a single text as the training set
    • This implementation does not take into account ends of sentences or multi-word phrases or …
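
The idea can be sketched in a few lines. This is not the actual markov.py; the function names and the tiny training text are assumptions for illustration. Keeping duplicates in the lists means that a random choice is automatically weighted by frequency.

```python
import random

def build_chain(text):
    """Return the word list and a map from each word to its followers."""
    words = text.split()
    followers = {}
    for current, nxt in zip(words, words[1:]):
        # Duplicates are kept on purpose: they encode the frequencies.
        followers.setdefault(current, []).append(nxt)
    return words, followers

def generate(words, followers, length=8, rng=random):
    """Generate up to length words by walking the chain."""
    word = rng.choice(words)            # frequent words are picked more often
    sentence = [word]
    while len(sentence) < length and word in followers:
        word = rng.choice(followers[word])
        sentence.append(word)
    return " ".join(sentence)

text = "the cat sat on the mat and the cat ate the hat"
words, followers = build_chain(text)
print(generate(words, followers))
```

Like the implementation described above, this sketch ignores ends of sentences and multi-word phrases.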

Markov Analysis (cont.)

  • The code is split into two parts:
    1. Processing input text to generate frequencies
    2. Applying frequencies to generate sentence
  • What if the first step is very expensive (like using War and Peace as input text)?
    • Not a problem if we want to run the program only once
    • If we want to run the program multiple times, we save time by keeping results from the first step around
    • “Persistent” data must be saved on disk, frequently as files or in databases

File Formats

  • Python provides many ways of writing data to files
  • How will your data file be used?
    • For reading by humans?
    • For reading by computers?
    • Both?

Human-readable Data Files

  • Text files, usually formatted with words
    • Output to files is usually more strictly structured than output to the screen, so the print statement may not be appropriate because it adds white space and line terminators at its discretion
    • The string formatting operator provides greater control over how strings are constructed:
      col1 = 10
      col2 = 20
      output = "%d\t%d" % (col1, col2)

Computer-readable Data Files

  • Are your data files meant for "internal" use, i.e., only by your code?
    • There are simple solutions for reading and writing data from and to files if you assume that no one else will try to read your data files: pickle, anydbm (dbm in Python 3), shelve, …
    • There are more complex solutions when accommodating others: tab-separated values (TSV), comma-separated values (CSV), extensible markup language (XML), …

shelve Module

  • Sometimes, you want to split a computation process into multiple steps:
    • Preprocess once and reuse results multiple times
    • Checkpoint data in long calculations
  • The shelve module lets you treat a file like a dictionary, except fetching and setting values actually read from and write to a file
    • Saving and restoring data looks just like assignment operations
    • Example after next little detour

Example using shelve

  • In the Markov analysis example above, we collected word frequencies and then used them to generate (nonsensical) sentences
  • Suppose we want to generate a sentence once a day from an arbitrary text, but do not want to recollect the frequencies each time (perhaps our text is War and Peace)
  • How do we do this?
    • Split Markov analysis and sentence generation into two separate programs
    • Use shelve to store the analysis results in a file: markov1_prep.py
    • Then read and use the results during sentence generation: markov1_use.py
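
The two steps might look roughly like the sketch below. It is not the actual markov1_prep.py or markov1_use.py; the key name "followers", the sample data and the temporary file location are assumptions for illustration. Both steps appear in one block here only so that the example runs on its own.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; a real script would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "markov_data")

# "Prep" step: run the expensive analysis once and save the result.
followers = {"the": ["cat", "hat"], "cat": ["sat"]}
db = shelve.open(path)
db["followers"] = followers     # looks like a dictionary assignment
db.close()

# "Use" step (normally a separate program run later): restore the result.
db = shelve.open(path)
restored = db["followers"]      # looks like a dictionary lookup
db.close()

print(restored)
```

Changes made to restored after the lookup are not written back to the file unless the value is assigned to the shelf again.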

Files for Data Interchange

  • Saving and restoring data using shelve is very convenient, but not very useful for collaborators unless they are using your code
  • Common interchange formats include tabular formats with comma- or tab-separated values (CSV/TSV), and Extensible Markup Language (XML)
    • CSV/TSV files are easy to parse and (marginally) human readable
    • XML files are more flexible but (generally) not human friendly

Example using TSV Format

  • The Markov example may be rewritten to use TSV instead of shelve
  • Note that if we were to change the representation for Markov data, we would need to modify both source code files to read and write the new representation in the TSV version
  • For the shelve version, we would not need to modify the saving/restoring part of the code because shelve hides that complexity from us
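
A TSV version could be sketched with the standard csv module, as below. The file layout (one word per row, followed by its followers) is an assumption chosen for illustration, not the layout used by the course code, and the newline argument to open assumes Python 3.

```python
import csv
import os
import tempfile

followers = {"the": ["cat", "hat"], "cat": ["sat"]}
path = os.path.join(tempfile.mkdtemp(), "markov.tsv")   # hypothetical file

# Writing: we must decide how the dictionary maps onto rows and columns.
with open(path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for word, nexts in followers.items():
        writer.writerow([word] + nexts)

# Reading: this code must agree with the layout chosen above.
restored = {}
with open(path, newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        restored[row[0]] = row[1:]

print(restored)
```

The reading and writing halves are coupled through the row layout, which is why a change in representation touches both programs.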

String Format Operator

  • % is the format operator when the operand on the left is a string, aka "format string"
    • format string may have zero or more "format sequences" which are introduced by %
    • "%d\t%d" is the format string in the previous example
    • "%d" is the format sequence that specifies what part of the format string should be replaced with an integer
    • So "%d\t%d" is a format string for two integers (the two %d format sequences) separated by a tab (the \t)
    • Other format sequences include "%s" (string), "%f" (floating point number), "%g" (compact floating point number), "%x" (base-16 integer), …
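
The format sequences listed above can be tried out as follows. The variable names and values are made up for illustration; the width and precision form in the last print call ("%8.3f") goes slightly beyond the list above and is included because it is handy for fixed-width columns.

```python
atom_count = 42
element = "C"
x = 21.954

print("%d" % atom_count)    # integer
print("%s" % element)       # string
print("%f" % x)             # floating point, six decimal places by default
print("%g" % x)             # compact floating point
print("%x" % 255)           # base-16 integer
print("%8.3f" % x)          # 8 characters wide, 3 decimal places

# Several format sequences in one format string, separated by tabs.
row = "%d\t%s\t%.3f" % (atom_count, element, x)
print(row)
```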

String Format Operator (cont.)

  • The operand to the right of the format operator is either a single value or a tuple of values
    • If there is only one format sequence in the format string, then a single value may be used in place of a tuple
    • Each value in the right operand replaces a corresponding format sequence in the format string
    • If the number of values does not match the number of format sequences, or a value type does not match the corresponding format sequence type, an exception will be raised
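
A short sketch of the rules above, using made-up values. One consequence worth knowing is that a tuple on the right is always unpacked, so a tuple that is itself the value to be formatted has to be wrapped in a one-element tuple.

```python
# One format sequence: a single value or a one-element tuple both work.
one = "%d atoms" % 5
also_one = "%d atoms" % (5,)

# Two format sequences: a tuple with two values is required.
two = "%s has %d atoms" % ("ALA", 5)

# Formatting a tuple as a single value: wrap it in another tuple.
point = (1, 2)
wrapped = "%s" % (point,)

# Without the wrapping, the two values do not match the one format sequence.
try:
    "%s" % point
    mismatch = None
except TypeError as err:
    mismatch = str(err)

print(one, also_one, two, wrapped, mismatch, sep="\n")
```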

String Format Operator (cont.)

  • Here are some examples of format exceptions:
      >>> data = "this is not an integer"
      >>> "%d" % data 
      Traceback (most recent call last): 
        File "<stdin>", line 1, in ? 
      TypeError: int argument required 
      >>> "%d %d" % (1, 2, 3)
      Traceback (most recent call last): 
        File "<stdin>", line 1, in ? 
      TypeError: not all arguments converted during string formatting
  • These exceptions are errors, but not all exceptions are errors (more on this next week)

String Format Operator (cont.)

  • Python 3 strings also have a format method that can be used for formatting
      >>> "{:d} {:s}".format(1, str(2))
      '1 2'
      >>> "{:d} {:s}".format(1, 2)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      ValueError: Unknown format code 's' for object of type 'int'
  • There are many options that may be specified for each {} parameter but we prefer the simplicity of the format operator %

Text File Operations

f = open('output.txt', 'w')
f.write('Hello world\n')
print >> f, 'Hello world'    # Python 2
print('Hello world', file=f) # Python 3
f.close()
  • Files may be opened for reading, writing or appending
    • Optional second argument to open function specifies the "mode", which may be "r" (read), "w" (write), or "a" (append). "r" is the default value.
  • For files opened for writing or appending, there are two ways to add data into the file:
    • via the write method of the file
    • using print (which is a function in Python 3 and a statement in Python 2)

Text File Operations (cont.)

f = open('output.txt', 'w')
f.write('Hello world\n')
print('Hello world', file=f) # Python 3
print >> f, 'Hello world'    # Python 2
f.close()
  • The write method takes a single string argument and sends it into the file.
    • Note that we need to explicitly include \n in the data string because write is very literal
  • print may be used to send data to a file
    • In Python 3, use the optional file argument to specify the target file
    • In Python 2, use the >> operator to specify the target file
    • Note that we do not need to include \n in the data string because print automatically terminates the output line

Text File Operations (cont.)

f = open('output.txt', 'w')
f.write('Hello world\n')
print('Hello world', file=f) # Python 3
print >> f, 'Hello world'    # Python 2
f.close()
  • The last operation on a file should be close, which tells Python that there will be no more operations on the file
    • This allows Python to release any data associated with the opened file
    • Most operating systems have a limit on the number of simultaneously open files, so it is good practice to clean up when files are no longer in use

with Statement

with open('output.txt', 'w') as f:
    f.write('Hello world\n')
    print('Hello world', file=f)   # Python 3
    print >> f, 'Hello world'      # Python 2
  • The with statement is used for managing resources such as files
  • The resource is acquired by the function call following the with keyword and is released (in this case, the file is closed) when the with statement completes (whether normally or with an exception)
  • The resource (in this case, the open file) is assigned to the variable following the as keyword, and is usable only within the with statement because it is released afterwards
  • You still need to handle errors that occur in the open call, e.g., file does not exist
  • This is the preferred way of opening and closing files because it guarantees that the file will be closed when it is no longer needed
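
The points above can be checked with a short Python 3 sketch. The temporary directory and file names are assumptions made so that the example does not touch any real files.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, "output.txt")

with open(path, "w") as f:
    f.write("Hello world\n")
    print("Hello again", file=f)

# The file was closed automatically when the with statement completed.
print(f.closed)

# Errors raised by open itself still need to be handled by the caller.
missing = os.path.join(folder, "no_such_file.txt")
try:
    with open(missing) as f2:
        data = f2.read()
    open_failed = False
except OSError as err:
    open_failed = True
    print("could not open file:", err)
```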

Binary Files

  • Binary files contain non-text data and must be opened with 'b' in the mode parameter
    fin = open('input.txt', 'rb')
    fout = open('output.txt', 'wb')
  • In Python 3, operations on binary files use bytes, not strings
    b = fin.read(20)
    s = b.decode('utf-8')
    fout.write(s.encode('utf-8'))
    fout.write(b'Hello world')
  • utf-8 stands for 8-bit Unicode Transformation Format and defines how 8-bit binary data may be converted to and from Unicode (Python 3 string and Python 2 unicode)
  • In Python 3, all network-related input/output (e.g., urllib) is binary, so data must be decoded/encoded explicitly
  • In Python 2, reading from and writing to binary files uses strings and is, therefore, similar to text file operations
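
The difference between bytes and strings shows up as soon as the text contains a non-ASCII character. The following Python 3 sketch writes and reads a binary file in a temporary directory; the file name and sample text are assumptions for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
text = "x = 21.954 Å"              # the last character is not ASCII

# Strings must be encoded to bytes before writing to a binary file.
with open(path, "wb") as fout:
    fout.write(text.encode("utf-8"))

# Reading a binary file returns bytes, which must be decoded to get a string.
with open(path, "rb") as fin:
    raw = fin.read()

decoded = raw.decode("utf-8")
print(type(raw), len(raw))          # length in bytes
print(type(decoded), len(decoded))  # length in characters
```

The byte count is larger than the character count because UTF-8 uses two bytes for the Å character.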

File Names and Paths

  • Python provides functions for manipulating file names and paths
    • Python abstraction for file system is that files are stored in folders (aka directories)
    • Folders may also be stored in folders
    • The name of a file can be uniquely specified by joining a series of folder names, followed by the file name itself, e.g., /Users/conrad/.cshrc
      • The joined sequence of names is called the "path"
      • Note that Python accepts / as the joining character, even on Windows which actually uses \

File Names and Paths (cont.)

  • The os and os.path modules provide functions for querying and manipulating file names and paths
    • os.getcwd - get current working directory
    • os.listdir - list names of items in folder
    • os.path.exists - check if file or folder exists
    • os.path.isdir - check if name is a folder
    • os.path.join - combine names into path

File Names and Paths (cont.)

  • Examples of using file and path functions:
      >>> import os
      >>> cwd = os.getcwd()
      >>> print(cwd)
      /var/tmp/conrad
      >>> os.listdir(cwd)
      ['x', 'chimera-build']
      >>> cb = os.path.join(cwd, 'chimera-build')
      >>> print(cb)
      /var/tmp/conrad/chimera-build
      >>> os.path.isdir(cb)
      True
      >>> os.listdir(cb)
      ['build', 'foreign', 'install']

Introduction to Modules

  • Modules are collections of code typically grouped by functionality
  • We have already used standard modules such as string, random, shelve and os.path
  • Modules are referenced using the import statement
    import shelve
    s = shelve.open(path)
    h1 = s["h1"]
    h2 = s["h2"]
    s.close()
  • Once imported, the functions defined in the module are referenced as module.function, like shelve.open

Writing Your Own Modules

  • (A contrived example) Suppose you want to write two scripts
    • One to count the total number of words in a document, wc.py
    • One to count the frequency of words in a document, freq.py
    • The code that reads the file and separates lines into words looks very similar
  • Suppose we change the definition of a "word" so that, in addition to being delimited by whitespace, it must also not start or end with a punctuation character
  • We must change both scripts: wc2.py, freq2.py
  • What if we have ten scripts instead of two?

Sharing Common Code

  • We can put the common code into a module and share it among the scripts: word.py
  • The scripts no longer contain any code having to do with how words are extracted: wc.py, freq.py
  • Changes to common code (like "what is a word") only need to be made once
  • As long as word.py is in the same folder as the scripts, wc.py and freq.py, the import statement will find the word module
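
The shape of the refactoring can be sketched as follows. This is not the actual word.py, wc.py or freq.py; the function names are assumptions, and everything is shown in one block so that it runs on its own. In practice the first function would live in word.py, and each script would import word and call it as word.extract_words.

```python
import string

# --- would live in word.py: the one place that defines what a "word" is ---
def extract_words(lines):
    """Return a list of words with surrounding punctuation removed."""
    words = []
    for line in lines:
        for token in line.split():
            word = token.strip(string.punctuation)
            if word:
                words.append(word.lower())
    return words

# --- would live in wc.py ---
def count_words(lines):
    return len(extract_words(lines))

# --- would live in freq.py ---
def word_frequencies(lines):
    hist = {}
    for word in extract_words(lines):
        hist[word] = hist.get(word, 0) + 1
    return hist

lines = ["To be, or not to be:", "that is the question."]
print(count_words(lines))
print(word_frequencies(lines))
```

Changing the definition of a word now means editing extract_words only.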

Generators (optional)

  • When we factored out the common code, we changed how the scripts work
  • Originally, the work (counting words or building histogram) is done as part of the loop reading the file
  • The refactored code puts all the words in a list in the loop reading the file, and the scripts loop over the list of words
  • This potentially requires building a very large list unnecessarily since we only use each word once and never refer to it again
  • Python has a generator concept that handles this problem: word.py, wc.py, freq.py
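
A generator version might look like the sketch below (again, the names are assumptions and this is not the actual course code). The only change to the shared function is that it uses yield to hand back one word at a time instead of appending to a list.

```python
import string

def iter_words(lines):
    """Yield words one at a time instead of building a full list."""
    for line in lines:
        for token in line.split():
            word = token.strip(string.punctuation)
            if word:
                yield word.lower()

lines = ["To be, or not to be:", "that is the question."]

# Counting words: each word is consumed as it is produced.
total = sum(1 for _ in iter_words(lines))

# Building a histogram: the loop looks the same as with a list.
hist = {}
for word in iter_words(lines):
    hist[word] = hist.get(word, 0) + 1

print(total)
print(hist)
```

When lines is an open file, only one line and one word need to be held in memory at any time.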

Where Are Modules Kept?

  • Modules in the same directory as the calling scripts are found automatically
  • In addition, Python loads modules from a variety of locations:
    >>> import sys, pprint
    >>> pprint.pprint(sys.path)
    ['',
     'C:\\WINDOWS\\SYSTEM32\\python27.zip',
     'C:\\Python27\\DLLs',
     'C:\\Python27\\lib',
     'C:\\Python27\\lib\\plat-win',
     'C:\\Python27\\lib\\lib-tk',
     'C:\\Python27',
     'C:\\Users\\conrad\\AppData\\Roaming\\Python\\Python27\\site-packages',
     'C:\\Python27\\lib\\site-packages']
  • A number of these directories refer to where standard packages (those included as part of the standard Python distribution) are kept
  • site-packages is typically where non-standard packages are installed

Installing Your Own Modules

  • You can put your modules in the standard site-packages directory
  • Python 2.6 and 3.0 added support for a per-user site-packages directory (PEP 370)
  • Use the following code to find out where it is on your computer:
    import site
    print(site.getusersitepackages())
  • Installing your modules in one of these site-packages directories should make them importable with no other action on your part

Setting Up Your Own Library

  • If you do not want to install in a standard directory, you still have some options:
    • Add code at the beginning of your script to explicitly include a directory to search for modules
      import sys
      sys.path.insert(0, "/Users/conrad/mylib")
      sys.path.insert(0, "/Users/conrad/swampy")
    • On Linux, set the PYTHONPATH environment variable before invoking Python
      export PYTHONPATH="/Users/conrad/swampy:/Users/conrad/mylib"
      python script_name
    • In both cases, the directories /Users/conrad/swampy and /Users/conrad/mylib will be searched when Python encounters an import statement
    • Both directories will be searched before standard directories
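
The sys.path approach can be demonstrated end to end with a throwaway directory. In this sketch the directory is temporary and the module name mywordlib is hypothetical; a real library directory would already contain your modules.

```python
import importlib
import os
import sys
import tempfile

# Create a stand-in library directory containing one small module.
libdir = tempfile.mkdtemp()
with open(os.path.join(libdir, "mywordlib.py"), "w") as f:
    f.write("def count_words(text):\n    return len(text.split())\n")

# Put the directory at the front of the search path, then import as usual.
sys.path.insert(0, libdir)
importlib.invalidate_caches()   # needed because the file was just created

import mywordlib

print(mywordlib.count_words("to be or not to be"))
```

Inserting at position 0 is what causes the directory to be searched before the standard directories.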

Debugging

  • Reading
    • Read your code critically, just as you do when editing a paper
    • It's the small things that get you
  • Running
    • Gather information about bugs by adding print statements
    • Do not debug by "random walk"
    • Do not make a dozen changes and hope that one works
  • Ruminating
    • We do not do enough of this!
    • What information do I need to identify where the error is occurring, and how can I get that information?
  • Retreating
    • The last refuge, but do not be too quick to take it
    • Try one last time to figure out what is wrong
    • After all, if this version of code is your current best effort, what makes you think that the next version will work any better?

Homework

  • 5.1 - use os.walk to count files in a directory tree
  • 5.2 - retrieve data from RCSB web service