TX
  1. Summary
  2. Git repository
  3. Installation
  4. Overview
    1. Generating tree with tags
    2. Generating tree from nodes
    3. Parsing HTML
    4. Debug tree
    5. Querying with XPath
    6. Building RXP-like structure
    7. Translating XQuery and XSLT examples to Python
      1. XQuery example 1
      2. XQuery example 2
      3. XSLT example 1
  5. Nodes
    1. Document
    2. Element
    3. Attribute
    4. Comment
    5. Text
  6. Tags
  7. XPath
    1. Compilation
    2. XPath class wrapper
    3. Shortcut syntax
    4. XPath prompt
  8. Pattern
  9. Parsing HTML

Summary

TX is short for Tuxee XML. It's a set of Python modules to generate, transform, parse, search XML (and HTML) document.

ModuleDescription
tx.nodesDefine classes for each type of node to build XML tree
tx.tagsSimplify tree creation
tx.htmltreeBuild XML tree using htmlparser module
tx.xpathTranslate XPath expressions to Python functions

And some modules used internally:

ModuleDescription
tx.errorExceptions used in tx
tx.parserGeneric parser inspired by PyParsing
tx.iteratorsIterators to walk a XML tree in various ways
tx.htmlparserError tolerant HTML parser
tx.xpathparserTranslate XPath expressions to "s-expression"
tx.xpathfnProvide XPath/XQuery functions and operators
tx.contextXPath context object
tx.sequenceXPath sequence object

Misc. modules:

ModuleDescription
tx.miscContains some utility functions
tx.rxpcompatTranslate RXP-like tree structure to tx tree
tx.xpath_misc...
tx.sequence_misc...
tx.nodes_misc...

Git repository

Repositorygit://git.tuxee.net/tx
GitWebWeb interface

Installation

As root, run ./setup install.

Overview

Note: The special variable _ is the result of the previous computation.

Generating tree with tags

Importing the tags object as w:

>>> from tuxeenet.tx.tags import tags as w

then generating a tree and serializing it:

>>> w.html( w.head( w.title( 'Hello, World!' ) ) , w.body( 'bla bla bla' ) )
>>> _.serialize()
'<html><head><title>Hello, World!</title></head><body>bla bla bla</body></html>'

The _doc_ and _comment_ name have special meanings. The former create a Document node, while the latter create a Comment node.

>>> w._doc_( w.foo( w._comment_( ' this is a comment ' ) ) , w.bar( 'quux&baz' ) ).serialize()
'<foo><!-- this is a comment --></foo><bar>quux&amp;baz</bar>'

>>> w.foo( w._comment_( ' a comment ' ) , 'bar' , id = 'contents' , width = '92' ).serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'

For attribute names, double _ are translated to : and single _ are translated to -. A _ starting a name is dropped (useful when name match a Python keyword.)

>>> w._return( 'Et voila !' , xml__lang = 'fr' , _class = 'rt2' ).serialize()
'<return xml:lang="fr" class="rt2">Et voila !</return>'

Generating tree from nodes

The tags object from the tags module is just a convenient way for building tree, but in reality it just construct Element, Attribute, Text,.. nodes implicitly.

Here is how to generate tree directly from these objects:

>>> from tuxeenet.tx.nodes import *
>>> a = Attribute( 'id' , 'contents' )
>>> b = Attribute( 'width' , '92' )
>>> c = Comment( ' a comment ' )
>>> d = Text( 'bar' )
>>> e = Element( 'foo' , ( a , b ) , ( c , d ) ) # name, attributes, children
>>> e.serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'

Parsing HTML

>>> from urllib import urlopen
>>> from tuxeenet.tx.htmltree import parse

Fetching and parsing the http://slashdot.org page:

>>> doc = parse( urlopen( 'http://slashdot.org/' ).read() )
>>> doc
<Document with 2 children>

Examples below will use this doc variable. Note that you will not necessary get the exact same output, since the page (the homepage of Slashdot) can change of course.

Debug tree

From the doc, we can output a verbose tree to show document structure with nodes type:

>>> print doc.asDebug( maxChildren = 2 )
DOCUMENT[0] with 2484 nodes
  TEXT[1] '\n'
  ELEMENT[2] html
    ELEMENT[3] head
      ELEMENT[4] title
        TEXT[5] 'Slashdot: News for nerds, stuff that matters'
      ELEMENT[6] link
        ATTRIBUTE[7] rel = `top`
        ATTRIBUTE[8] title = `News for nerds, stuff that matters`
        ATTRIBUTE[9] href = `//slashdot.org/`
      [.. and 13 more children ..]
    TEXT[36] '\n'
    [.. and 2 more children ..]

Querying with XPath

A large part of XPath 2.0 is available.

For example, to extract the string value of the attribute title of an element link which have also an attribute rel of value top, and where this element link is a child of element head itself child of root element html:

>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
>>> tuple(_) # Convert resulting sequence to a tuple
(u'News for nerds, stuff that matters',)

(Note that an Unicode string is returned.)

Building RXP-like structure

The pyRXP module is a wrapper for the RXP XML parser. tx provide a way to convert an existing tree to the type of structure used by pyRXP.

This is really only useful for compatibility purpose with RXP module.

>>> sequence = doc[ '//font[@face="verdana"]' ]
>>> sequence[ 0 ].asRxp()
('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)

and translating back to a tx tree:

>>> doc = ('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
>>> from tuxeenet.tx.rxpcompat import fromRxp
>>> fromRxp( doc )
<Element font with 2 attributes and 2 children>
>>> fromRxp( doc ).serialize()
'<font color="#001670" face="verdana">\xc2\xa0<b>OSTG</b></font>'

Important note: You may have noticed that the \xa0 is printed as \xc2\xa0'. It's because .serialize() produce string with utf-8` encoding by default.

Translating XQuery and XSLT examples to Python

First 2 examples show how to translate XQuery to Python, while the third example show how to translate XSLT to Python.

These examples are important to show that specialized languages are not necessary for processing XML document with same power as XQuery, XSLT,..

XQuery example 1

Python could be used to fully replace XQuery complex operations.

Taking the following example from the XQuery spec:

let $i := <tool>wrench</tool>
let $o := <order> {$i} <quantity>5</quantity> </order>
let $odoc := document ($o)
let $newi := $o/tool

Which is followed by these expected results:

Some notes:

First part translated in Python with tx modules (using tags module):

i = w.tool( 'wrench' )
o = w.order( i.clone() , w.quantity( '5' ) )
odoc = o.clone()
newi = o/'tool'  # Notice the use of the '/' operator to use XPath

Declare trees as standalone:

i.finalize()
o.finalize()
odoc.finalize()

(Note: We make a root function to simplify fnRoot usage.)

>>> from tuxeenet.tx.sequence import Sequence
>>> from tuxeenet.tx.context import Context
>>> from tuxeenet.tx.xpathfn import fnRoot
>>> root = lambda node : fnRoot( Context() , Sequence( node ) )

Then we can check expected results:

>>> assert root( i ) == Sequence( i )
>>> assert root( o/'quantity' ) == Sequence( o )
>>> assert root( odoc/'.//quantity' ) == Sequence( odoc ) # The '.' is important
>>> assert root( newi ) == Sequence( o )

XQuery example 2

An example from http://www.perfectxml.com/XQuery.html, with the books.xml document used below:

<bib>
    <book year="1994">
        <title>TCP/IP Illustrated</title>
        <author>
            <last>Stevens</last>
            <first>W.</first>
        </author>
        <publisher>Addison-Wesley</publisher>
        <price>65.95</price>
    </book>

    <book year="1992">
        <title>Advanced Programming in the UNIX Environment</title>
        <author>
            <last>Stevens</last>
            <first>W.</first>
        </author>
        <publisher>Addison-Wesley</publisher>
        <price>65.95</price>
    </book>

    <book year="2000">
        <title>Data on the Web</title>
        <author>
            <last>Abiteboul</last>
            <first>Serge</first>
        </author>
        <author>
            <last>Buneman</last>
            <first>Peter</first>
        </author>
        <author>
            <last>Suciu</last>
            <first>Dan</first>
        </author>
        <publisher>Morgan Kaufmann Publishers</publisher>
        <price>65.95</price>
    </book>

    <book year="1999">
        <title>The Economics of Technology and Content for Digital TV</title>
        <editor>
            <last>Gerbarg</last>
            <first>Darcy</first>
            <affiliation>CITI</affiliation>
        </editor>
        <publisher>Kluwer Academic Publishers</publisher>
        <price>129.95</price>
    </book>
</bib>

The XQuery source code:

<listings>
  {
     for $p in distinct-values(doc("books.xml")//publisher)
     order by $p
     return
         <result>
         { $p }
         {
              for $b in doc("books.xml")/bib/book
              where $b/publisher = $p
              order by $b/title
              return $b/title
         }
         </result>
  }
</listings>

Translation to Python using tx, supposing books.xml XML tree is in doc variable:

w.listings(
        w.result( p , sorted( b/'title'
                              for b in doc/'/bib/book'
                              if b/'publisher' == p ) )
        for p in sorted( doc/'distinct-values(//publisher)' ) )

Which construct the following document: (indentation added manually)

<listings>
  <result>
    <publisher>Addison-Wesley</publisher>
    <title>Advanced Programming in the UNIX Environment</title>
    <title>TCP/IP Illustrated</title>
  </result>
  <result>
    <publisher>Kluwer Academic Publishers</publisher>
    <title>The Economics of Technology and Content for Digital TV</title>
  </result>
  <result>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <title>Data on the Web</title>
  </result>
</listings>

XSLT example 1

Doing XSLT like transformation.

Taking example from http://www.adp-gmbh.ch/xml/xslt_examples.html, with document:

<?xml version="1.0" ?>

<famous-persons>
  <persons category="medicine">
    <person>
      <firstname> Edward   </firstname>
      <name>      Jenner   </name>
    </person>
    <person>
      <firstname> Gertrude </firstname>
      <name>      Elion    </name>
    </person>
  </persons>
  <persons category="computer science">
    <person>
      <firstname> Charles  </firstname>
      <name>      Babbage  </name>
    </person>
    <person>
      <firstname> Alan     </firstname>
      <name>      Touring  </name>
    </person>
    <person>
      <firstname> Ada      </firstname>
      <name>      Byron    </name>
    </person>
  </persons>
  <persons category="astronomy">
    <person>
      <firstname> Tycho    </firstname>
      <name>      Brahe    </name>
    </person>
    <person>
      <firstname> Johannes </firstname>
      <name>      Kepler   </name>
    </person>
    <person>
      <firstname> Galileo  </firstname>
      <name>      Galilei  </name>
    </person>
  </persons>
</famous-persons>

and stylesheet:

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="/">
     <html><head><title>Sorting example</title></head><body>
     <xsl:apply-templates select="famous-persons/persons">
       <xsl:sort select="@category" />
     </xsl:apply-templates>
     </body></html>
  </xsl:template>

  <xsl:template match="persons">
    <h2><xsl:value-of select="@category" /></h2>
    <ul>
       <xsl:apply-templates select="person">
         <xsl:sort select="name"      />
         <xsl:sort select="firstname" />
       </xsl:apply-templates>
    </ul>
  </xsl:template>

  <xsl:template match="person">
        <xsl:text disable-output-escaping="yes">
            &lt;li&gt;
        </xsl:text>
        <b><xsl:value-of select="name"      /></b>
           <xsl:value-of select="firstname" />
  </xsl:template>

</xsl:stylesheet>

could be translated to Python as follow:

import operator as op # for op.itemgetter
def transform( node ) :
    if node.match( '/' ) :
        return w.html( w.head( w.title( 'Sorting example' ) ) ,
                       w.body( map( transform ,
                                    sorted( node/'famous-persons/persons' ,
                                            key = op.itemgetter( '@category' ) ) ) ) )
    elif node.match( 'persons' ) :
        return ( w.h2( node/'@category/string()' ) ,
                 w.ul( map( transform ,
                            sorted( sorted( node/'person' ,
                                            key = op.itemgetter( 'firstname' ) ) ,
                                    key = op.itemgetter( 'name' ) ) ) ) )
    elif node.match( 'person' ) :
        return w.li( w.b( node/'name/string()' ) ,
                     node/'firstname/string()' )

result = w._doc_( transform( doc ) )

which produce:

DOCUMENT[0] with 47 nodes
  ELEMENT[1] html
    ELEMENT[2] head
      ELEMENT[3] title
        TEXT[4] 'Sorting example'
    ELEMENT[5] body
      ELEMENT[6] h2
        TEXT[7] u'astronomy'
      ELEMENT[8] ul
        ELEMENT[9] li
          ELEMENT[10] b
            TEXT[11] u'      Brahe    '
          TEXT[12] u' Tycho    '
        ELEMENT[13] li
          ELEMENT[14] b
            TEXT[15] u'      Galilei  '
          TEXT[16] u' Galileo  '
        ELEMENT[17] li
          ELEMENT[18] b
            TEXT[19] u'      Kepler   '
          TEXT[20] u' Johannes '
      ELEMENT[21] h2
        TEXT[22] u'computer science'
      ELEMENT[23] ul
        ELEMENT[24] li
          ELEMENT[25] b
            TEXT[26] u'      Babbage  '
          TEXT[27] u' Charles  '
  [...]

Note that we first sort by firstname then by name. Also note that such example are just here to show that we can translate XSLT or XQuery to Python, but that not necessary give optimized alternative.

Note also that for emulating XSLT we could have to add other node match at end of transform such as:

    [...]
    elif node.match( '@*|text()' ) :
	return node
    else :
	return map( transform , node/'node()' )

but in our example that was not necessary.

Nodes

An XML document is a tree of nodes.

With actual implementation, nodes are supposed to be constant. Once created, they're not expected to be updated. Mainly because Document node number all its descendants and inserting some children somewhere in the document would need this numbering to be redone at some point.

FIXME: Tag the root node with some flag (or the node from which we need to restart the numbering -the lowest one if several descendant are updated) to let it know that it should number again its descendant when needed ? In the meantime, numbers new children with unique number (the parent one) ?

Document

A Document node can contains any nodes except Document and Attribute.

Constructor: Document( children = () , finalize = True )

If finalize is True (default value), then Document numbers all its descendant and mark their root pointers to it, hence making the Document node the root node of the tree.

Element

An Element node can contains any nodes except Document.

Constructor: Element( name , attributes = () , children = () , finalize = False )

Attribute

An Attribute node is only allowed inside an Element node.

Constructor: Attribute( name , value )

Comment

A Comment node is only allowed inside a Document or an Element node.

Constructor: Comment( contents )

Restriction: In XML, a comment cannot contains -- nor ends with -.

Text

A Text node is only allowed inside a Document or an Element node.

Constructor: Text( contents )

Tags

The tags module provide the w object which can be used to create document tree with Python syntax.

A string is automatically translated to a Node element.

Otherwise, a node of the type Document, Element or Comment is create with the general syntax: w.name( child1 , .. , attribute1 = value1 , .. ). Attributes make sense only for Element node type however.

It's also possible to pass a function, which take no parameters and should return a correct XML tree. This function will be called at serialization time.

w._doc_(
  w.html(
    w._comment_( ' Header ' ) ,
    w.head(
      w.title( 'This is a example page' ) ,
      w.link( rel = 'stylesheet' , href = '/default-style.css' , title = 'Default style' ) ,
      w.meta( http_equiv = 'Content-type' , content = 'text/html' , charset = 'utf-8' ) ) ,
    w._comment_( ' Body ' ) ,
    w.body(
      w.h1( 'Section 1' ) ,
      w.h2( 'Section 1.1' ) ,
      'Bla bla bla.' ,
      w.h2( 'Section 1.2' ) ,
      'Bla bla bla.' ) ) )

generate a document which once serialized give:

<html>
  <!-- Header -->
  <head>
    <title>This is a example page</title>
    <link href="/default-style.css" rel="stylesheet" title="Default style"/>
    <meta content="text/html" charset="utf-8" http-equiv="Content-type"/>
  </head>
  <!-- Body -->
  <body>
    <h1>Section 1</h1>
    <h2>Section 1.1</h2>
    Bla bla bla.
    <h2>Section 1.2</h2>
    Bla bla bla.
  </body>
</html>

Note that the result here is split into several lines and indented while in reality the result is just one line of text since no \n (new line) characters are part of the document.

Same document presented with debug output:

>>> print doc.asDebug()
DOCUMENT[0] with 24 nodes
  ELEMENT[1] html
    COMMENT[2] ' Header '
    ELEMENT[3] head
      ELEMENT[4] title
        TEXT[5] 'This is a example page'
      ELEMENT[6] link
        ATTRIBUTE[7] href = `/default-style.css`
        ATTRIBUTE[8] rel = `stylesheet`
        ATTRIBUTE[9] title = `Default style`
      ELEMENT[10] meta
        ATTRIBUTE[11] content = `text/html`
        ATTRIBUTE[12] charset = `utf-8`
        ATTRIBUTE[13] http-equiv = `Content-type`
    COMMENT[14] ' Body '
    ELEMENT[15] body
      ELEMENT[16] h1
        TEXT[17] 'Section 1'
      ELEMENT[18] h2
        TEXT[19] 'Section 1.1'
      TEXT[20] 'Bla bla bla.'
      ELEMENT[21] h2
        TEXT[22] 'Section 1.2'
      TEXT[23] 'Bla bla bla.'

Example of deferred function:

count = 0
def foo() :
  global count
  count += 1
  return w.p( "I'm generated %d time(s)." % count )

doc = w.body( foo )

print doc.serialize()
print doc.serialize()
print doc.serialize()

produce:

<body><p>I'm generated 1 time(s).</p></body>
<body><p>I'm generated 2 time(s).</p></body>
<body><p>I'm generated 3 time(s).</p></body>

XPath

The module xpath provide a large subset of XPath 2.0.

Unsupported features are:

Compilation

The module xpath contains a compile function which take a XPath expression and return a function taking a context as argument and returning a sequence as result of the evaluation.

XPath class wrapper

XPath class is a convenient (small) wrapper around compile function.

An instance of the XPath class is created with a XPath expression. To evaluate the XPath expression, use eval member function with a optional context node.

>>> from tuxeenet.tx.xpath import XPath
>>> x1 = XPath( '//@href' )
>>> x1.eval( doc ) # return all href attribute in document 'doc'

Shortcut syntax

The Node base class define operator [] and / to make it easy to query a tree with XPath expression.

>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]

or

>>> doc / '/html/head/link[@rel="top"]/@title/string()'

are equivalent, while however the latter form could be written:

>>> doc/'html/head/link[@rel="top"]/@title/string()'

(without the initial / in the XPath expression) since doc is already the root node (for this example.)

This is almost the direct translation of the following XQuery code:

$doc/html/head/link[@rel="top"]/@title/string()

XPath prompt

For debugging purpose, a "XPath prompt" application is available to interactively evaluate XPath expressions.

$ tx-prompt
XPath TX 0.1 - (c)2005  Frederic Jolliton <frederic@jolliton.com>

XPath2.0>

Then any supported XPath expression can be entered.

There is some special command:

The $current variable is used as context item when evaluating XPath expression.

Producing sequence of numbers:

XPath2.0> 1+2
3
XPath2.0> 18 div 4
4.5
XPath2.0> (12+3)*5
75
XPath2.0> 1, 2, 3
Sequence(1, 2, 3)
XPath2.0> 12, 17 to 20, 22
Sequence(12, 17, 18, 19, 20, 22)

Using if ternary operator:

XPath2.0> if (1=1) then "ok" else "failed"
ok
XPath2.0> if (1!=1) then "ok" else "failed"
failed

Working with XML tree:

Displaying parse tree of some expressions (mainly useful for debugging purpose !):

XPath2.0> \e 1+2
(exprlist (+ (path (integer "1"))
             (path (integer "2"))))

XPath2.0> \e foo() + bar()
(exprlist (+ (path (call "foo"))
             (path (call "bar"))))

XPath2.0> \e ../foo | @bar
(exprlist (union (path (parent (node))
                       (child (element "foo")))
                 (path (attribute (attribute "bar")))))

XPath2.0> \e /html/child::head/element(title)/string()
(exprlist (path "/"
                (child (element "html"))
                (child (element "head"))
                (child (element "title"))
                (call "string")))

XPath2.0> \e for $att in distinct-values(//@*/name()) return ($att,count(//attribute()[name()=$att]))
(exprlist (for ((att (path (call "distinct-values"
                                 (path "/"
                                       (descendant-or-self (node))
                                       (attribute (attribute "*"))
                                       (call "name"))))))
               (path (exprlist (path (varref "att"))
                               (path (call "count"
                                           (path "/"
                                                 (descendant-or-self (node))
                                                 (predicates (attribute (attribute))
                                                             (exprlist (= (path (call "name"))
                                                                          (path (varref "att"))))))))))))

Pattern

Patterns are used for XSLT nodes matching.

node.match( 'a/b' ) # Return True if node is an element `b` with a parent element `a`

Note: Internally patterns are translated to XPath expression. Such expression return the empty sequence if pattern doesn't match the node. For example, a/b become something like self::element(b)/parent::element(a).

Parsing HTML

The htmlparser module provide a replacement for HTMLParser module provided with Python. The main difference is that the tx module never throw an error. It is able to parse the worst HTML documents.

Note: To parse regular XML document, a parser like rxp could be used instead with help of rxpcompat module, because the HTMLParser is not designed for XML and is not necessary good enough for this purpose. Your mileage may vary.

The htmltree module use the htmlparser module and produce Document.

>>> import sys
>>> from tuxeenet.tx.htmltree import parse
>>> parse( '<html><test></html>' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 3 nodes
  ELEMENT[1] html
    ELEMENT[2] test
>>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 7 nodes
  ELEMENT[1] html
    TEXT[2] 'some ~< bad '
    ELEMENT[3] p
      TEXT[4] 'document />'
    ELEMENT[5] p
      TEXT[6] 'really'
>>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).serialize()
'<html>some ~&lt; bad <p>document /></p><p>really</p></html>'
Generated by Crock on Thu Dec 28 20:11:33 2006