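Version 1011 (Mon Nov 27 20:46:06 2006)

- Tags are written inside angle brackets, `<>`
- An element is written `<tagname>…</tagname>`
- An empty element may be abbreviated `<tagname/>`
- Elements must nest properly: `<X>…<Y>…</Y></X>` is legal…
- …but `<X>…<Y>…</X></Y>` is not
- Literal "<" and ">" in text must be escaped
- Escape sequences are written `&name;`

| Sequence | Character |
|---|---|
| `&lt;` | < |
| `&gt;` | > |
| `&quot;` | " |
| `&amp;` | & |

Table 19.1: XML Character Escapes
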
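The escapes in Table 19.1 can be applied programmatically. The sketch below uses `escape` and `unescape` from the standard library's `xml.sax.saxutils`; the helper names `escape_text` and `unescape_text` are hypothetical, and the code is written in Python 3 syntax, which is an assumption (the listings in this lecture use Python 2).

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the double quote needs an
# extra mapping (this dictionary is part of the sketch, not the lecture).
QUOTE = {'"': '&quot;'}

def escape_text(text):
    """Replace the four characters from Table 19.1 with their escapes."""
    return escape(text, QUOTE)

def unescape_text(text):
    """Reverse escape_text()."""
    return unescape(text, {'&quot;': '"'})

original = 'if x < 3 & y > 4: print("ok")'
escaped = escape_text(original)
print(escaped)
print(unescape_text(escaped))
```

The ampersand is replaced first when escaping and last when unescaping, so a round trip never escapes a character twice.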
| Tag | Usage |
|---|---|
| `<html>` | Root element of entire HTML document. |
| `<body>` | Body of page (i.e., visible content). |
| `<h1>` | Top-level heading. Use `<h2>`, `<h3>`, etc. for second- and third-level headings. |
| `<p>` | Paragraph. |
| `<em>` | Emphasized text; browser or editor will usually display it in italics. |
| `<address>` | Address of document author (also usually displayed in italics). |

Table 19.2: Basic XHTML Tags

<html>
<body>
<h1>Software Carpentry</h1>
<p>This course will introduce <em>essential software development skills</em>, and show where and how they should be applied.</p>
<address>Greg Wilson (gvwilson@third-bit.com)</address>
</body>
</html>
![[Simple Page Rendered by Firefox]](./img/xml/simple_page_firefox.png)
Figure 19.1: Simple Page Rendered by Firefox
- `<h1/>` (level-1 heading) is semantic (meaning)
- `<i/>` (italics) is display (formatting)
- Elements can carry attributes: `<h1 align="center">A Centered Heading</h1>`
- An element can have several attributes: `<p id="disclaimer" align="center">This planet provided as-is.</p>`
- But each attribute may appear only once: `<p align="left" align="right">…</p>` is illegal
- Attribute values must be quoted: browsers once tolerated `<p align=center>…<p>`, but modern parsers will reject it
- A well-formed page has a `<head/>` element as well as a `<body/>`
- Comments start with `<!--`, and end with `-->`

<html>
<head>
<title>Comments Page</title>
<meta name="author" content="aturing"/>
</head>
<body>
<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>
<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>
</body>
</html>
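The rules covered so far (proper nesting, quoted attribute values, no repeated attributes) can be checked mechanically by handing text to a parser. The following is a minimal sketch in Python 3 syntax (an assumption; the lecture's listings are Python 2). The helper name `is_well_formed` is hypothetical, and it assumes `xml.dom.minidom.parseString` reports malformed input by raising `xml.parsers.expat.ExpatError`.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

samples = [
    '<X><Y>text</Y></X>',                      # properly nested
    '<X><Y>text</X></Y>',                      # badly nested
    '<p align="center">text</p>',              # quoted attribute value
    '<p align=center>text</p>',                # unquoted attribute value
    '<p align="left" align="right">text</p>',  # repeated attribute
]
for sample in samples:
    print(is_well_formed(sample), sample)
```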
- `<ul/>` for an unordered (bulleted) list, and `<ol/>` for an ordered (numbered) one
- Each list item goes in `<li/>`
- `<table/>` for tables
- `<tr/>` (for “table row”)
- `<td/>` (for “table data”)

<html>
<head>
<title>Lists and Tables</title>
<meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>
<table cellpadding="3" border="1">
<tr>
<td align="center"><em>Unordered List</em></td>
<td align="center"><em>Ordered List</em></td>
</tr>
<tr>
<td align="left" valign="top">
<ul>
<li>Hydrogen</li>
<li>Lithium</li>
<li>Sodium</li>
<li>Potassium</li>
<li>Rubidium</li>
<li>Cesium</li>
<li>Francium</li>
</ul>
</td>
<td align="left" valign="top">
<ol>
<li>Helium</li>
<li>Neon</li>
<li>Argon</li>
<li>Krypton</li>
<li>Xenon</li>
<li>Radon</li>
</ol>
</td>
</tr>
</table>
</body>
</html>
![[Lists and Tables]](./img/xml/list_table_firefox.png)
Figure 19.2: Lists and Tables
- `<meta/>` elements in document head
- Images are included with the `<img/>` tag
- `src` attribute specifies where to find the image file

<html>
<head>
<title>Images</title>
<meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>
<h1>Our Logo</h1>
<img src="../../../img/sc_powered.jpg" alt="[Powered by Software Carpentry]"/>
</body>
</html>
![[Images in Pages]](./img/xml/image.png)
Figure 19.3: Images in Pages
- Use the `alt` attribute to specify alternative text
- Use the `<a/>` element to create a link
- `href` attribute specifies what the link is pointing at

<html>
<head>
<title>Links</title>
<meta name="svn" content="$Id: xml.html,v 1.1 2008/03/26 18:58:45 scooter Exp $"/>
</head>
<body>
<h1>A Few of My Favorite Places</h1>
<ul>
<li><a href="http://www.google.com">Google</a></li>
<li><a href="http://www.python.org">Python</a></li>
<li><a href="http://www.nature.com/index.html">Nature Online</a></li>
<li>Examples in this lecture:
<ul>
<li><a href="comments.html">Comments</a></li>
<li><a href="image.html">Images</a></li>
<li><a href="list_table.html">Lists and Tables</a></li>
</ul>
</li>
</ul>
</body>
</html>
![[Links in Pages]](./img/xml/links.png)
Figure 19.4: Links in Pages
- Python's standard library includes a simple DOM implementation, `xml.dom.minidom`

<root>
<first>element</first>
<second attr="value">element</second>
<third-element/>
</root>
![[A DOM Tree]](./img/xml/dom_tree.png)
Figure 19.5: A DOM Tree
- Libraries such as `ElementTree` use dictionaries instead

<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
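The remark that `ElementTree` uses dictionaries can be illustrated with the same Mercury data. This is a sketch in Python 3 syntax (an assumption; the lecture's listings are Python 2) using the standard library's `xml.etree.ElementTree`. It parses from a string rather than from `mercury.xml` so that it is self-contained.

```python
import xml.etree.ElementTree as ET

src = '''<planet name="Mercury">
<period units="days">87.97</period>
</planet>'''

root = ET.fromstring(src)

# Attributes are stored in an ordinary dictionary on each element.
print(root.tag, root.attrib)

# Child elements are found by tag name; their text is a plain string.
period = root.find('period')
print(period.get('units'), float(period.text))
```

With `minidom`, the same values would be read with `getAttribute` and by looking at the `data` of the element's text child.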
- The `toxml` method can be called on the document, or on any element node, to create text
- See *The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)* for the details

import xml.dom.minidom
my_xml = '''<name>Donald Knuth</name>'''
my_doc = xml.dom.minidom.parseString(my_xml)
name = my_doc.documentElement.firstChild.data
print 'name is:', name
print 'but name in full is:', repr(name)
name is: Donald Knuth
but name in full is: u'Donald Knuth'
- Note the `u` in front of the string the second time it is printed
- The `print` statement converts the Unicode string to ASCII for display

import xml.dom.minidom
src = '''<planet name="Venus">
<period units="days">224.7</period>
</planet>'''
doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
<period units="days">224.7</period>
</planet>
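The Unicode behaviour described above is specific to Python 2, which this lecture's listings use. As a hedged comparison, the sketch below shows what the same calls produce under Python 3 (an assumption about the reader's interpreter): text taken from the tree is an ordinary `str`, and `toxml` returns `bytes` only when an encoding is given.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<name>Donald Knuth</name>')
name = doc.documentElement.firstChild.data

# Without an encoding, toxml() returns text; with one, it returns bytes
# that start with an XML declaration naming that encoding.
as_text = doc.documentElement.toxml()
as_bytes = doc.toxml('utf-8')

print(type(name).__name__, repr(name))
print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes)
```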
import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
root.appendChild(period)
text = doc.createTextNode('686.98')
period.appendChild(text)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
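The construction steps in the Mars example can be wrapped in a function so that one record is built per call. This is a minimal sketch in Python 3 syntax (an assumption; the lecture's listings are Python 2). The name `make_planet` and the decision to add a `units` attribute, mirroring the earlier Mercury and Venus data, are choices made for this sketch rather than part of the original listing.

```python
import xml.dom.minidom

def make_planet(name, period_days):
    """Build a <planet> document holding one <period> child."""
    impl = xml.dom.minidom.getDOMImplementation()
    doc = impl.createDocument(None, 'planet', None)
    root = doc.documentElement
    root.setAttribute('name', name)

    # New nodes are created by the document, then attached to the tree.
    period = doc.createElement('period')
    period.setAttribute('units', 'days')
    period.appendChild(doc.createTextNode(str(period_days)))
    root.appendChild(period)
    return doc

doc = make_planet('Mars', 686.98)
print(doc.documentElement.toxml())
```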
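- `xml.dom.minidom` is really just a wrapper around other platform-specific XML libraries
- New nodes are created by asking the document node for them
- The second argument to `createDocument` specifies the type of the document's root node
- The other arguments to `createDocument` are the namespace and the document type
- Set attributes with `setAttribute(attributeName, newValue)`
- Find all the `<experimenter/>` nodes, extract names, and print a sorted list
- Use the `getElementsByTagName` method to do this

import xml.dom.minidom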
src = '''<heavenly_bodies>
<planet name="Mercury"/>
<planet name="Venus"/>
<planet name="Earth"/>
<moon name="Moon"/>
<planet name="Mars"/>
<moon name="Phobos"/>
<moon name="Deimos"/>
</heavenly_bodies>'''
doc = xml.dom.minidom.parseString(src)
for node in doc.getElementsByTagName('moon'):
    print node.getAttribute('name')
Moon
Phobos
Deimos
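The listing above prints names in document order. To get the sorted list mentioned earlier, the names can be collected first and then sorted. This is a sketch in Python 3 syntax (an assumption; the lecture's listings are Python 2), and the helper name `names_of` is hypothetical.

```python
import xml.dom.minidom

src = '''<heavenly_bodies>
<planet name="Mercury"/>
<planet name="Venus"/>
<planet name="Earth"/>
<moon name="Moon"/>
<planet name="Mars"/>
<moon name="Phobos"/>
<moon name="Deimos"/>
</heavenly_bodies>'''

def names_of(doc, tag):
    """Return the sorted 'name' attributes of every element called tag."""
    nodes = doc.getElementsByTagName(tag)
    return sorted(node.getAttribute('name') for node in nodes)

doc = xml.dom.minidom.parseString(src)
print(names_of(doc, 'moon'))
print(names_of(doc, 'planet'))
```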
- `nodeType`: one of `ELEMENT_NODE`, `TEXT_NODE`, `ATTRIBUTE_NODE`, `DOCUMENT_NODE`
- `childNodes`: the node's children
- `data`: the text held by a text node

import xml.dom.minidom
src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''
def walkTree(currentNode, indent=0):
    spaces = ' ' * indent
    if currentNode.nodeType == currentNode.TEXT_NODE:
        # Text nodes have no tag name, so show their length instead.
        print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
    else:
        print spaces + currentNode.tagName
        for child in currentNode.childNodes:
            walkTree(child, indent+1)

doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)
solarsystem
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (5)
 TEXT (1)
 planet
  period
   TEXT (6)
 TEXT (1)
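The `TEXT (1)` entries in that output are the newlines between elements, which the parser keeps as text nodes. Code that reads data from a tree usually has to skip them. The sketch below is one way to do that, in Python 3 syntax (an assumption; the lecture's listings are Python 2). The helper names `element_children`, `text_of` and `periods` are hypothetical.

```python
import xml.dom.minidom

src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''

def element_children(node):
    """Return only the element children, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

def text_of(node):
    """Join the data of all text nodes directly under node."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)

def periods(doc):
    """Map each planet's name to its period as a float."""
    result = {}
    for planet in element_children(doc.documentElement):
        for child in element_children(planet):
            if child.tagName == 'period':
                result[planet.getAttribute('name')] = float(text_of(child))
    return result

doc = xml.dom.minidom.parseString(src)
print(periods(doc))
```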
- Goal: replace the first word of each paragraph with an `<em/>` element whose only child is a text node containing that word

![[Modifying the DOM Tree]](./img/xml/modify_tree.png)
Figure 19.6: Modifying the DOM Tree
- … `<em/>`
- Find all the paragraphs with `getElementsByTagName`, and iterate over them

import re
import xml.dom.minidom

def emphasize(doc):
    paragraphs = doc.getElementsByTagName('p')
    for para in paragraphs:
        first = para.firstChild
        # Skip empty paragraphs and paragraphs that do not start with text.
        if first is not None and first.nodeType == first.TEXT_NODE:
            emphasizeText(doc, para, first)

def emphasizeText(doc, para, textNode):
    # Look for optional spaces, a word, and the rest of the paragraph.
    m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
    if not m:
        return
    leadingSpace, firstWord, restOfText = m.groups()
    if not firstWord:
        return
    # If there's text after the first word, re-save it.
    if restOfText:
        restOfText = doc.createTextNode(restOfText)
        para.insertBefore(restOfText, para.firstChild)
    # Emphasize the first word.
    emph = doc.createElement('em')
    emph.appendChild(doc.createTextNode(firstWord))
    para.insertBefore(emph, para.firstChild)
    # If there's leading space, re-save it.
    if leadingSpace:
        leadingSpace = doc.createTextNode(leadingSpace)
        para.insertBefore(leadingSpace, para.firstChild)
    # Get rid of the original text.
    para.removeChild(textNode)

if __name__ == '__main__':
    src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''
    doc = xml.dom.minidom.parseString(src)
    emphasize(doc)
    print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>
Copyright © 2005-06 Python Software Foundation.