Web Client Data Model

(XML/XHTML/CSS/DOM)

John "Scooter" Morris

April 2, 2012

Portions Copyright © 2005-06 Python Software Foundation.

Overview

  • Switching gears: it's about the user!
  • XML (eXtensible Markup Language)
  • XHTML (eXtensible HyperText Markup Language)
  • CSS (Cascading Style Sheets)
  • DOM (Document Object Model)

It's about the user!

  • Scientific software has users, but they are often overlooked
  • The focus of the database is on the data
  • The focus of the analysis is on the algorithms and the programming
  • The focus of the interface (and the system as a whole) must be on the user
  • In general, you are not the user, so how do you design with the user in mind?
    • Use cases/task definitions
    • User testing
    • Personas

Use cases/task definitions

  • Questions you need to answer:
    • What is the purpose of your system?
    • What are users going to do with the system?
  • Your UI design determines the how
  • Use case:
    • Defines the task the user is trying to achieve
    • Should include the inputs and outputs
    • Often will include subtasks before using the system and after using the system
    • Clearly specify the role of the system in the user's task
    • Often are very elaborate and complicated...
      • ...but don't need to be. It's better to keep it simple

User Testing

  • Ideal:
    • Get user feedback on:
      • use cases and tasks
      • wire-frame (white board) prototypes
      • early functional prototypes
      • final system
    • Incorporate user feedback into system...
    • ...retest
  • Real world:
    • Get user input when you can:
      • from fellow students
      • from lab mates
      • from postdocs
      • from friends and neighbors
    • Do not take critiques personally!
    • Do not explain where the user went wrong!

Personas

  • Sometimes getting a cross-section of users is not possible
    • then what?
  • Your team must substitute for the users -- how?
    • Characterize your users using personas
    • Each persona has a name, background, gender, ethnicity, specific set of desired tasks, etc.
    • Define a small set of personas that encompasses your user community
    • Run through use cases/user interfaces as each persona
      • look for issues from the viewpoint of the persona
      • essentially, you are role playing
    • Incorporate feedback from personas into your design

XML

  • XML is becoming the standard way to store everything from web pages to astronomical data
    • Bewildering variety of tools for dealing with it
    • And more appearing every day
  • This lecture describes how to process and modify XML
    • Warning: the standards are more complex than they should have been

In the Beginning

  • 1969-1986: Standard Generalized Markup Language (SGML)
    • Developed by Charles Goldfarb and others at IBM
    • A way of adding information to medical and legal documents so that computers could process them
    • Very complex specification (over 500 pages)
  • 1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
    • Much (much) simpler than SGML
    • Anyone could write it, so everyone did

The Modern Era

  • Problem: HTML had a small, fixed set of tags
    • Everyone wanted to add new ones
    • Solution: create a standard way to define a set of tags, and the relationships between them
  • First version of XML standardized in 1998
    • A set of rules for defining markup languages
    • Much more complex than HTML, but still simpler than SGML
  • New version of HTML called XHTML was also defined
    • Like HTML, but obeys all XML rules
    • Still a lot of non-XML compliant HTML out there
  • HTML 5 working its way through the W3C standards process
    • In part, a reaction to the complexity of XHTML 2 proposals
    • Extends HTML 4.01 with new APIs and Elements

Formatting Rules

  • A basic XML document contains elements and text
    • Full spec allows for external entity references, processing instructions, and other fun
  • Elements are shown using tags
    • Must be enclosed in angle brackets "<>"
    • Full form: <tagname>…</tagname>
    • Short form (if the element doesn't contain anything): <tagname/>
    • Note that tags must be closed in XML:
      • <hr> is legal in HTML
      • in XML or XHTML it must be closed: <hr/> or <hr></hr>

Document Structure

  • Elements must be properly nested
    • If Y starts inside X, Y must end before X ends
    • So <X>…<Y>…</Y></X> is legal…
    • …but <X>…<Y>…</X></Y> is not
  • Every document must have a single root element
    • I.e., a single element must enclose everything else
  • Specific XML dialects may restrict which elements can appear inside which others
    • XHTML is very liberal
    • MathML (Mathematical Markup Language) is stricter
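The nesting and single-root rules above can be checked mechanically. The following is a minimal sketch in Python 3, assuming the standard library's xml.dom.minidom parser; the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


# Hypothetical samples, one per rule discussed on the slide.
samples = {
    'nested': '<X><Y>text</Y></X>',        # Y ends before X ends
    'overlapping': '<X><Y>text</X></Y>',   # X ends while Y is still open
    'two roots': '<X/><Y/>',               # no single enclosing element
    'unclosed': '<X><hr></X>',             # <hr> is never closed
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, ok)
```

Only the first sample should be accepted. This checks well-formedness only; whether a particular dialect allows one element inside another is a separate question that this sketch does not address.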

Text

  • Text is normal printable text
  • Must use escape sequences to represent "<" and ">"
    • In XML, written &name;
    • Sequence   Character  Description
      &lt;       <          Less than
      &gt;       >          Greater than
      &quot;     "          Double quote
      &apos;     '          Apostrophe
      &amp;      &          Ampersand
      &Aring;    Å          Angstrom
      &nbsp;                Non-breaking space
      &lambda;   λ          Greek small letter lambda
      &Lambda;   Λ          Greek capital letter lambda
      Table 1: XML Character Escape Samples
    • See List of XML and HTML character entity references for the complete list
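In practice the escaping is rarely done by hand. The sketch below uses Python 3 and the standard library's xml.sax.saxutils and html modules; the sample strings are invented. Note that XML itself predefines only the first five entities in the table; names such as &Aring; and &lambda; come from the (X)HTML entity set, which is why the last line uses html.unescape instead.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import escape, unescape

# Escape text so that it can be placed between tags.
raw = 'x < y && y > z'
escaped = escape(raw)
print(escaped)                  # x &lt; y &amp;&amp; y &gt; z

# The reverse direction restores the original text.
restored = unescape(escaped)

# Inside a double-quoted attribute value, quotes need escaping as well;
# the extra mapping is passed in explicitly.
attr_value = escape('say "hi"', {'"': '&quot;'})
print(attr_value)               # say &quot;hi&quot;

# (X)HTML entity names are resolved by the html module.
symbols = html_unescape('&lambda; = 5 &Aring;')
print(symbols)                  # λ = 5 Å
```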

XHTML

  • Most common use of XML is still XHTML (the XML version of hypertext)
  • Basic tags:
    • Tag        Usage
      <html>     Root element of entire HTML document.
      <body>     Body of page (i.e., visible content).
      <h1>       Top-level heading. Use <h2>, <h3>, etc. for second- and third-level headings.
      <p>        Paragraph.
      <em>       Emphasized text; browser or editor will usually display it in italics.
      <address>  Address of document author (also usually displayed in italics).
      Table 2: Basic XHTML Tags

Sample XHTML Page

<html>
<body>
<h1>Software Carpentry</h1>

<p>This course will introduce <em>essential software 
development skills</em>,
and show where and how they should be applied.</p>

<address>Greg Wilson (gvwilson@third-bit.com)</address>
</body>
</html>
		
[Simple Page Rendered by Firefox]

Figure 1: Simple Page Rendered by Firefox

Critique of HTML/XHTML

  • HTML and XHTML mix semantics and display
    • <h1/> (level-1 heading) is semantic (meaning)
    • <i/> (italics) is display (formatting)
  • Now generally considered a bad thing
    • Modern HTML/XHTML documents contain semantic tags only
    • Control display using Cascading Style Sheets (CSS)
      • We will only cover a little of the syntax and the CSS Box Model

Attributes

  • Elements can be customized by giving them attributes
    • Enclosed in the opening tag
    • <h1 align="center">A Centered Heading</h1>
    • <p id="disclaimer" align="center">This planet provided as-is.</p>
  • An attribute name may appear at most once in any element
    • Like keys in a dictionary
    • So <p align="left" align="right">…</p> is illegal
  • Values must be quoted
    • Old-style browsers accepted <p align=center>…</p>, but XML parsers will reject it
    • Must use escape sequences for angle brackets, quotes, etc. inside values

Attributes Vs. Elements

  • Use attributes when:
    • Each value can occur at most once for any element
    • The order of the values doesn't matter
    • Those values have no internal structure
  • In all other cases, use nested elements
    • If you have to parse an attribute's value to figure out what it means, use an element instead

More XHTML Tags

  • Well-written HTML pages have a <head/> element as well as a <body/>
    • Contains metadata about the page
  • Well-written pages also use comments (just like code)
    • Introduce with <!--, and end with -->
    • <html>
      <head>
        <title>Comments Page</title>
        <meta name="author" content="aturing"/>
      </head>
      <body>
      
      <!-- House style puts all titles in italics -->
      <h1><em>Welcome to the Comments Page</em></h1>
      
      <!-- Update this paragraph to describe the forum. -->
      <p>Welcome to the Comments Forum.</p>
      
      </body>
      </html>
      
    • Unfortunately, comments cannot be nested

Lists and Tables

  • Use <ul/> for an unordered (bulleted) list, and <ol/> for an ordered (numbered) one
    • Each list item is wrapped in <li/>
  • Use <table/> for tables
    • Each row is wrapped in <tr/> (for “table row”)
    • Within each row, column items are wrapped in <td/> (for “table data”)
    • Note: tables are often used to force multi-column layout, as well as for tabular data

Example

<html>
<head>
  <title>Lists and Tables</title>
  <meta name="svn" content="$Id: xml.html,v 1.15 2010/04/23 20:41:32 scooter Exp $"/>
</head>
<body>

<table cellpadding="3" border="1">
  <tr>
    <td align="center"><em>Unordered List</em></td>
    <td align="center"><em>Ordered List</em></td>
  </tr>
  <tr>
    <td align="left" valign="top">
      <ul>
        <li>Hydrogen</li>
        <li>Lithium</li>
        <li>Sodium</li>
        <li>Potassium</li>
        <li>Rubidium</li>
        <li>Cesium</li>
        <li>Francium</li>
      </ul>
    </td>
    <td align="left" valign="top">
      <ol>
        <li>Helium</li>
        <li>Neon</li>
        <li>Argon</li>
        <li>Krypton</li>
        <li>Xenon</li>
        <li>Radon</li>
      </ol>
    </td>
  </tr>
</table>

</body>
</html>

Example

[Lists and Tables]

Figure 2: Lists and Tables

    • Note how RCS keywords have been put in <meta/> elements in document head
      • Automatically updated each time the document is committed to version control

Images

  • How to put an image in a page?
    • XML documents can only contain text, so you can't store an image or audio clip directly in a page
  • Usual solution is to store a reference to the external file using the <img/> tag
    • The src attribute specifies where to find the image file

Images

<html>
<head>
  <title>Images</title>
  <meta name="svn" content="$Id: xml.html,v 1.15 2010/04/23 20:41:32 scooter Exp $"/>
</head>
<body>

<h1>Our Logo</h1>

<img src="../../.swc/lec/img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>

</body>
</html>
[Images in Pages]

Figure 3: Images in Pages

  • Always use the alt attribute to specify alternative text
    • Screen readers for people with visual impairments use this instead of the image
    • And it's good documentation for search engines

Links

  • Links to other pages are what make it “hypertext”
  • Use the <a/> element to create a link
    • The text inside the element is displayed and (usually) underlined for clicking
    • The href attribute specifies what the link is pointing at
    • Both local filenames and URLs are supported

Links

<html>
<head>
  <title>Links</title>
  <meta name="svn" content="$Id: xml.html,v 1.15 2010/04/23 20:41:32 scooter Exp $"/>
</head>
<body>

<h1>A Few of My Favorite Places</h1>

<ul>
  <li><a href="http://www.google.com">Google</a></li>
  <li><a href="http://www.python.org">Python</a></li>
  <li><a href="http://www.nature.com/index.html">Nature Online</a></li>
  <li>Examples in this lecture:
    <ul>
      <li><a href="comments.html">Comments</a></li>
      <li><a href="image.html">Images</a></li>
      <li><a href="list_table.html">Lists and Tables</a></li>
    </ul>
  </li>
</ul>

</body>
</html>
	

Links

[Links in Pages]

Figure 4: Links in Pages

HTML5 - Differences from HTML 4.01

  • New Elements:
    • article, aside, audio, canvas, command, datalist, details, embed, figcaption, figure, footer, header, hgroup, keygen, mark, meter, nav, output, progress, rp, rt, ruby, section, source, summary, time, video
  • Inline SVG and MathML
  • New form controls:
    • dates and times, email, url, search
  • New form methods:
    • PUT and DELETE
  • Parsing rules similar to HTML (loose vs. strict)
  • New APIs

HTML5 - New APIs

  • Canvas
  • Timed media playback (SMIL)
  • Offline storage
  • Document editing
  • Drag-and-drop
  • Cross-document messaging
  • Browser history management

HTML5 - Summary

Questions on XML or HTML?

Cascading Style Sheets (CSS)

  • Style sheets provide a way to change the look (style) of a document without changing its structure
  • CSS can be used to:
    • change font style, color, size, and spacing; adjust margins or padding; do positioning of content either relative to other content or absolute; and provide a variety of different decorations for XML elements
    • turn elements on or off, or dynamically change the look of an element

Using Cascading Style Sheets (CSS)

  • CSS instructions can be specified in the style attribute
    • For example, a centered paragraph might be written <p style="text-align: center">
    • CSS attributes are separated by semi-colons: <p style="text-align:center; font-weight:bold;">
  • CSS instructions can also be specified as part of a style sheet
    • Style sheets can be in the document itself
      • Within <style/> tags
      • For example this document has in its <head/> section:
        <style type="text/css" media="all"> followed by a number of CSS instructions
    • Style sheets can be loaded from external files
      • This document also has in its <head/> section:
        <link rel="stylesheet" href="css/slides.css" type="text/css" media="projection" id="slideProj" />
      • The file "slides.css" contains a number of CSS instructions relevant for the slide layout
  • CSS instructions

    • The general syntax for a CSS instruction is:
    • selector {property1:value1; property2:value2;...}
    • The selector tells the style system which elements the instruction refers to
    • See http://www.w3schools.com/Css/ for a list of properties
    • The most common use of the selector is: element.class, where class is the value of the class attribute, and element is either an HTML element or an element you've "invented".
    • Selectors are actually much, much more complicated:
      • A selector can be a pseudo-class. For example a:hover can be used to change style when over a link
      • A selector can be a pseudo-element. For example p:first-letter can be used to change the style for the first letter of a paragraph
      • A selector can refer to an ID. For example p#paragraph1 would refer to the paragraph whose ID attribute is "paragraph1"
      • A selector can include parent-child relationships. For example "ul.inc li.active" would refer to <li/> elements with a class attribute of "active" that are descendants of <ul/> elements with a class attribute of "inc".
      • A selector can include pattern matching, attribute matching, and much, much more...

    CSS example

    Example style:

    <style type="text/css">
    body {font-family:arial;}
    p.example {font-family:courier; margin-left:5em; margin-right:5em; background-color:LightBlue;}
    .center {text-align:center;}
    myTitle {font-weight:bold; display:block; color:green; text-align:center; font-size:150%;}
    </style>
    

    Example input:

    Figure 5: Simple CSS Example Rendered by Firefox

    <body>
    <myTitle>This is our header</myTitle>
    
    <p>We will now introduce an example.  This 
    is a standard paragraph, with all of the default 
    styles set up by the browser.  Can you think of 
    a way you might be able to override at least one 
    of those defaults?  Back to our example, we now 
    want to highlight a section of text, which might 
    be a quote or some other kind of example</p>
    
    <p class="example">This is our example.  Note that 
    the margins have been adjusted and we also now have a 
    background color.  We could also have drawn a box 
    around our example, or we could have made other 
    adjustments.</p>
    
    <p>Now we're back to normal text.</p>
    </body> 

    CSS example

    • Notes:
      • Elements don't have to be HTML. Can introduce your own, if it helps clarify the semantics of the document
      • If you had a large document with 20 examples, all you would need to do to change them all is change the style sheet
      • Concept is identical to Styles in Word
      • Using JavaScript, you can switch between loaded stylesheets
        • That's how the "0" works in our slide program

    CSS Layout Model

    • Won't go over all of the CSS syntax and tips and tricks
    • Two key things to get a handle on:
      • Inline vs. block layout
      • CSS Box model

    CSS Inline vs. Block

    • display: inline vs. display: block
      • Inline layouts are things like <i/>, <span/>, and <b/> that can be laid out within a line (no line break)
        • In-line layouts can be specified with the css property display: inline
        <html>
          <body>
            This is a sentence with a 
               <myStyle style="display:inline; border: thin red solid">"myStyle" element</myStyle> 
            embedded in it.
          </body>
        </html>
        This is a sentence with a "myStyle" element embedded in it.
      • Block layouts are things like <p/>, <div/> and <li/> that cause the line of text to break
        • Block layouts can be specified with the css property display: block
        <html>
          <body>
            This is a sentence with a 
               <myStyle style="display:block; border: thin red solid">"myStyle" element</myStyle> 
            embedded in it.
          </body>
        </html>
        This is a sentence with a "myStyle" element embedded in it.
      • Important to know if you are creating a custom element

    CSS Box Model

    • CSS Box model: margins, borders, and padding
      • CSS uses three values for each side of the box when laying out an element:
        • margin-top, -bottom, -left, and -right: the transparent area around the element
        • border-top, -bottom, -left, and -right: the area for the border that will be painted around the element
        • padding-top, -bottom, -left, and -right: the area between the actual content and the border
      • Gives you detailed control of the spacing of elements relative to each other
      • Box width and height are specified by width and height, respectively
      • Units can be in % of surrounding element, ems, or px (pixels)
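The arithmetic behind the box model is simple addition. The sketch below (Python 3) assumes the default sizing, in which width sets only the content area, and assumes all values are in pixels. The function name outer_width and the numbers are made up for illustration.

```python
def outer_width(width, padding=(0, 0), border=(0, 0), margin=(0, 0)):
    """Horizontal space an element occupies, in pixels.

    Each of padding, border and margin is a (left, right) pair.
    The content width is surrounded by padding, then border, then margin.
    """
    return width + sum(padding) + sum(border) + sum(margin)


# A 200px-wide box with 10px padding, a 1px border and 20px margins
# on both the left and the right.
total = outer_width(200, padding=(10, 10), border=(1, 1), margin=(20, 20))
print(total)  # 262
```

The same sum applies vertically with height and the -top and -bottom values, although adjacent vertical margins can collapse into one another, which this sketch ignores.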

    CSS Summary

    • Best way to learn CSS:
      • Find something you like on the web
      • Figure out how they did it (use View→Page Source)
      • Set up a small example and try it!
      • Use Firefox, check out the many helpful extensions:
        • Tools→Error Console in Firefox 3.0
        • Firebug for Firefox 2.x

    Questions on CSS?

    The Document Object Model

    • The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
      • One node for each element, attribute, or text
    • Pro:
      • Much easier to manipulate trees than strings
      • Same basic model in many different languages (which lowers the learning cost)
    • Con:
      • Needs a lot of memory for large documents
      • Generic standard doesn't take advantage of the more advanced features of some languages
    • Python's standard library includes a simple implementation of DOM called minidom
      • Fast, sturdy, and well documented…
      • …if you understand all the terminology, and know more or less what you're looking for

    The Basics

    • Every DOM tree has a single root representing the document as a whole
      • Doesn't correspond to anything that's actually in the document
    • This node has a single child, which is the root element of the document
    • It, and other element nodes, may have three types of children:
      • Other elements
      • Text nodes
      • Attribute nodes (strictly, the DOM keeps these in a separate attributes map rather than in the child list)
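The structure just described can be inspected directly. This is a sketch in Python 3 using xml.dom.minidom on a small document written on one line, so that no whitespace text nodes get in the way; the variable names are arbitrary.

```python
import xml.dom.minidom

src = ('<root><first>element</first>'
       '<second attr="value">element</second>'
       '<third-element/></root>')
doc = xml.dom.minidom.parseString(src)

# The document node is not an element; its single element child
# is the root element of the document.
is_document = doc.nodeType == doc.DOCUMENT_NODE
root = doc.documentElement

# Element children of the root, in document order.
child_names = [child.tagName for child in root.childNodes]

# Text lives in a node of its own, below the element.
second = root.childNodes[1]
text = second.firstChild

# Attributes are nodes too, but they are reached through the element's
# attribute methods and are not counted in childNodes.
attr = second.getAttributeNode('attr')

print(is_document, root.tagName, child_names)
print(text.data, attr.name, attr.value, len(second.childNodes))
```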

    DOM Tree Example

    [A DOM Tree]

    Figure 6: A DOM Tree

    <root>
      <first>element</first>
      <second attr="value">element</second>
      <third-element/>
    </root>

    More On Tree Structure

    • Every node keeps track of what its parent is
      • Allows programs to search up the tree, as well as down
    • Note: it's easy to forget that text and attributes are stored in nodes of their own
      • Other Python libraries like ElementTree use dictionaries instead
      • Pro: makes simple things a little simpler
      • Note: ElementTree has shipped in the standard library as xml.etree.ElementTree since Python 2.5
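Both points can be seen side by side in a short sketch (Python 3). It walks up from a text node using parentNode, then reads the same document with the xml.etree.ElementTree module that ships with current Python. The variable names are arbitrary.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

src = '<planet name="Mercury"><period units="days">87.97</period></planet>'

# DOM: start at the text node and follow parentNode links upward,
# stopping when we reach the document node.
doc = xml.dom.minidom.parseString(src)
text_node = doc.getElementsByTagName('period')[0].firstChild
path = []
node = text_node.parentNode
while node is not None and node.nodeType == node.ELEMENT_NODE:
    path.append(node.tagName)
    node = node.parentNode
print(path)                         # ['period', 'planet']

# ElementTree: attributes are a plain dictionary and text is a plain
# string attached to the element, with no separate nodes.
planet = ET.fromstring(src)
period = planet.find('period')
print(period.attrib, period.text)   # {'units': 'days'} 87.97
```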

    Creating a Tree

    • Usual way to create a DOM tree is to parse a file
    • <?xml version="1.0" encoding="utf-8"?>
      <planet name="Mercury">
        <period units="days">87.97</period>
      </planet>
      
      import xml.dom.minidom
      doc = xml.dom.minidom.parse('mercury.xml')
      print doc.toxml('utf-8')
      
      <?xml version="1.0" encoding="utf-8"?>
      <planet name="Mercury">
        <period units="days">87.97</period>
      </planet>
      

    Converting To Text

    • The toxml method can be called on the document, or on any element node, to create text
    • DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters
    • This means that strings taken from XML documents are Unicode, not ASCII
      • import xml.dom.minidom
        
        my_xml = '''<name>Donald Knuth</name>'''
        my_doc = xml.dom.minidom.parseString(my_xml)
        name = my_doc.documentElement.firstChild.data
        print 'name is:', name
        print 'but name in full is:', repr(name)
        
        name is: Donald Knuth
        but name in full is: u'Donald Knuth'
        
      • Note the u in front of the string the second time it is printed
        • A simple print statement converts the Unicode string to ASCII for display
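The examples on these slides are written for Python 2. As a rough sketch of how the same idea looks in Python 3, where every string is Unicode and encoded output is a separate bytes type, consider the following; the sample document is invented.

```python
import xml.dom.minidom

# A document containing a non-ASCII character (the Angstrom sign).
doc = xml.dom.minidom.parseString('<unit>Å</unit>')

# Text taken from the tree is an ordinary (Unicode) string.
name = doc.documentElement.firstChild.data

# Without an encoding, toxml returns a string...
as_text = doc.documentElement.toxml()

# ...and with an encoding it returns bytes in that encoding.
as_bytes = doc.toxml('utf-8')

print(type(name).__name__, type(as_text).__name__, type(as_bytes).__name__)
print(as_text)
```

In UTF-8 the single character Å becomes two bytes, which is why the encoding has to be chosen at the moment the tree is turned back into bytes for a file or a network connection.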

    Other Ways To Create Documents

    • Can also create a tree by parsing a string
      import xml.dom.minidom
      
      src = '''<planet name="Venus">
        <period units="days">224.7</period>
      </planet>'''
      
      doc = xml.dom.minidom.parseString(src)
      print doc.toxml('utf-8')
      
      <?xml version="1.0" encoding="utf-8"?>
      <planet name="Venus">
        <period units="days">224.7</period>
      </planet>
      

    Other Ways To Create Documents

    • Or by building a tree by hand
      import xml.dom.minidom
      
      impl = xml.dom.minidom.getDOMImplementation()
      
      doc = impl.createDocument(None, 'planet', None)
      root = doc.documentElement
      root.setAttribute('name', 'Mars')
      
      period = doc.createElement('period')
      root.appendChild(period)
      
      text = doc.createTextNode('686.98')
      period.appendChild(text)
      
      print doc.toxml('utf-8')
      
      <?xml version="1.0" encoding="utf-8"?>
      <planet name="Mars"><period>686.98</period></planet>
      
    • Notice that the output of the preceding example wasn't nicely indented
      • Because we didn't create text nodes containing carriage returns and blanks
      • Most machine-generated XML doesn't
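If indented output is wanted for human readers, minidom can add the whitespace at output time. This sketch (Python 3) uses the toprettyxml method; the two-space indent is an arbitrary choice.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<planet name="Mars"><period>686.98</period></planet>')
root = doc.documentElement

# toxml writes the tree exactly as stored: no extra whitespace.
compact = root.toxml()

# toprettyxml inserts newlines and indentation between elements.
pretty = root.toprettyxml(indent='  ')

print(compact)
print(pretty)
```

The added whitespace exists only in the output string. If that output is parsed again, the newlines and spaces come back as extra text nodes in the tree, so pretty-printing is best kept for display.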

    The Details

    • xml.dom.minidom is really just a wrapper around other platform-specific XML libraries
      • Have to reach inside it and get the underlying implementation object to create the document node
      • That node then knows how to create other elements in the document
      • Middle argument to createDocument specifies the type of the document's root node
      • Documentation explains what the first and third arguments to createDocument are
    • Add new nodes to existing ones by:
      • Asking the document to create the node
      • Appending it to a node that's already part of the tree
    • Set attributes of element nodes using setAttribute(attributeName, newValue)
      • Remember, all attribute values are strings
      • If you want to store an integer or a Boolean, you have to convert it yourself
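The conversion in both directions might look like the sketch below (Python 3). The attribute names moons and inhabited, and the choice of 'true' and 'false' as the textual form of a Boolean, are assumptions made for the example.

```python
import xml.dom.minidom

impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement

# Convert to strings on the way in.
root.setAttribute('name', 'Mars')
root.setAttribute('moons', str(2))
root.setAttribute('inhabited', str(False).lower())   # stored as 'false'

# Convert back on the way out; getAttribute always returns a string.
moons = int(root.getAttribute('moons'))
inhabited = root.getAttribute('inhabited') == 'true'

print(root.toxml())
print(moons + 1, inhabited)
```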

    Finding Nodes

    • Often want to do things to all elements of a particular type
      • E.g., find all <experimenter/> nodes, extract names, and print a sorted list
    • Use the getElementsByTagName method to do this
      • Returns a list of all the descendants of a node with the specified tag
    • import xml.dom.minidom
      
      src = '''<heavenly_bodies>
        <planet name="Mercury"/>
        <planet name="Venus"/>
        <planet name="Earth"/>
        <moon name="Moon"/>
        <planet name="Mars"/>
        <moon name="Phobos"/>
        <moon name="Deimos"/>
      </heavenly_bodies>'''
      
      doc = xml.dom.minidom.parseString(src)
      for node in doc.getElementsByTagName('moon'):
          print node.getAttribute('name')
      
      Moon
      Phobos
      Deimos
      
    • Question: what happens if you add or delete nodes while looping over this list?
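One defensive way to sidestep the question is to copy the list before changing the tree. The sketch below (Python 3) removes every moon from the example document; taking the copy is a precaution that stays safe whether or not a given DOM implementation updates its node lists as the tree changes.

```python
import xml.dom.minidom

src = '''<heavenly_bodies>
  <planet name="Mercury"/>
  <planet name="Venus"/>
  <planet name="Earth"/>
  <moon name="Moon"/>
  <planet name="Mars"/>
  <moon name="Phobos"/>
  <moon name="Deimos"/>
</heavenly_bodies>'''

doc = xml.dom.minidom.parseString(src)

# Snapshot the matches first, then modify the tree.
moons = list(doc.getElementsByTagName('moon'))
for node in moons:
    node.parentNode.removeChild(node)

remaining = [n.getAttribute('name') for n in doc.getElementsByTagName('planet')]
moons_left = len(doc.getElementsByTagName('moon'))
print(remaining, moons_left)
```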

    Walking a Tree

    • Often want to visit each node in the tree
      • E.g., print an outline of the document showing element nesting
    • Node's type is stored in a member variable called nodeType
      • ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
    • If a node is an element, its children are stored in a read-only list called childNodes
    • If a node is a text node, the actual text is in the member data

    Recursive Tree Walker

    import xml.dom.minidom
    
    src = '''<solarsystem>
    <planet name="Mercury"><period units="days">87.97</period></planet>
    <planet name="Venus"><period units="days">224.7</period></planet>
    <planet name="Earth"><period units="days">365.26</period></planet>
    </solarsystem>
    '''
    
    def walkTree(currentNode, indent=0):
        spaces = ' ' * indent
        if currentNode.nodeType == currentNode.TEXT_NODE:
            print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
        else:
            print spaces + currentNode.tagName
            for child in currentNode.childNodes:
                walkTree(child, indent+1)
    
    doc = xml.dom.minidom.parseString(src)
    walkTree(doc.documentElement)
    	
    solarsystem
     TEXT (1)
     planet
      period
       TEXT (5)
     TEXT (1)
     planet
      period
       TEXT (5)
     TEXT (1)
     planet
      period
       TEXT (6)
     TEXT (1)
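The TEXT (1) lines in the output above are the newlines between elements. A variant that skips whitespace-only text, and that checks the node type before asking for tagName (other node types, such as comments, have none), could be sketched as follows in Python 3. The function name outline is made up for this example.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
</solarsystem>
'''


def outline(node, indent=0, lines=None):
    """Collect one line per element or non-blank text node."""
    if lines is None:
        lines = []
    if node.nodeType == node.TEXT_NODE:
        text = node.data.strip()
        if text:  # ignore text that is only whitespace
            lines.append(' ' * indent + 'TEXT ' + repr(text))
    elif node.nodeType == node.ELEMENT_NODE:
        lines.append(' ' * indent + node.tagName)
        for child in node.childNodes:
            outline(child, indent + 1, lines)
    # Any other node type (comment, processing instruction) is skipped.
    return lines


lines = outline(xml.dom.minidom.parseString(src).documentElement)
print('\n'.join(lines))
```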
    	

    Modifying the Tree

    [Modifying the DOM Tree]

    Figure 7: Modifying the DOM Tree

    • Modifying trees in place is a little bit tricky
      • Helps to draw lots of pictures
    • Example: want to emphasize the first word of each paragraph
      • Get the text node below the paragraph
      • Take off the first word
      • Insert a new <em/> element whose only child is a text node containing that word

    Complications

    • But what if the first child of the paragraph already has some markup around it?
      • E.g., what if the paragraph starts with a link?
    • Could just wrap the first child with <em/>
      • But if (for example) the link contains several words, this will look wrong
    • We'll ignore this problem for now

    Solution

    • Step 1: find all the paragraphs using getElementsByTagName, and iterate over them
        def emphasize(doc):
            paragraphs = doc.getElementsByTagName('p')
            for para in paragraphs:
                first = para.firstChild
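                # Skip empty paragraphs (firstChild is None) and ones that start with markup.
                if first is not None and first.nodeType == first.TEXT_NODE: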
                    emphasizeText(doc, para, first)

    Solution

    • Step 2: break the paragraph text into pieces, and handle each piece in turn
      • Create a new node for each piece
      • Push it onto the front of the paragraph's child list
      • Once they've all been handled, get rid of the original text node
      • import re
        
        def emphasizeText(doc, para, textNode):
        
            # Look for optional spaces, a word, and the rest of the paragraph.
            # (Match the Unicode text directly; str() would fail on non-ASCII characters.)
            m = re.match(r'^(\s*)(\S*)\b(.*)$', textNode.data)
            if not m:
                return
            leadingSpace, firstWord, restOfText = m.groups()
            if not firstWord:
                return
        
            # If there's text after the first word, re-save it.
            if restOfText:
                restOfText = doc.createTextNode(restOfText)
                para.insertBefore(restOfText, para.firstChild)
        
            # Emphasize the first word.
            emph = doc.createElement('em')
            emph.appendChild(doc.createTextNode(firstWord))
            para.insertBefore(emph, para.firstChild)
        
            # If there's leading space, re-save it.
            if leadingSpace:
                leadingSpace = doc.createTextNode(leadingSpace)
                para.insertBefore(leadingSpace, para.firstChild)
        
            # Get rid of the original text.
            para.removeChild(textNode)

    Not Finished Yet

    • Step 3: test it
      • Yes, it really is part of the program
      • if __name__ == '__main__':
        
            src = '''<html><body>
        <p>First paragraph.</p>
        <p>Second paragraph contains <em>emphasis</em>.</p>
        <p>Third paragraph.</p>
        </body></html>'''
        
            doc = xml.dom.minidom.parseString(src)
            emphasize(doc)
            print doc.toxml('utf-8')
        <?xml version="1.0" encoding="utf-8"?>
        <html><body>
        <p><em>First</em> paragraph.</p>
        <p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
        <p><em>Third</em> paragraph.</p>
        </body></html>
        

    Summary

    • There's a lot of hype in hypertext
      • Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
    • Pros:
      • One set of rules for people to learn
      • Never have to write a parser again
        • At least, the low-level syntactic bits—still need to figure out what all those tags mean
    • Cons:
      • Raw XML is hard to read
        • Particularly if it has been generated by a machine
      • A lot of data isn't actually trees
        • When storing a 2D matrix or a table, you have to organize data by row or by column…
        • …either of which makes the other hard to access
      • There are a lot of complications and subtleties
        • Most applications ignore most of them
        • Which means that they fail (usually badly) when confronted with something outside the subset they understand
    • Like Inglish speling, it's here to stay

    Questions?