TX is short for Tuxee XML. It is a set of Python modules to generate, transform, parse, and search XML (and HTML) documents.
| Module | Description |
| tx.nodes | Define classes for each type of node to build XML tree |
| tx.tags | Simplify tree creation |
| tx.htmltree | Build XML tree using htmlparser module |
| tx.xpath | Translate XPath expressions to Python functions |
And some modules used internally:
| Module | Description |
| tx.error | Exceptions used in tx |
| tx.parser | Generic parser inspired by PyParsing |
| tx.iterators | Iterators to walk an XML tree in various ways |
| tx.htmlparser | Error tolerant HTML parser |
| tx.xpathparser | Translate XPath expressions to "s-expression" |
| tx.xpathfn | Provide XPath/XQuery functions and operators |
| tx.context | XPath context object |
| tx.sequence | XPath sequence object |
Misc. modules:
| Module | Description |
| tx.misc | Contains some utility functions |
| tx.rxpcompat | Translate RXP-like tree structure to tx tree |
| tx.xpath_misc | ... |
| tx.sequence_misc | ... |
| tx.nodes_misc | ... |
| Repository | git://git.tuxee.net/tx |
| GitWeb | Web interface |
As root, run ./setup install.
Note: The special variable _ is the result of the previous computation.
Importing the tags object as w:
>>> from tuxeenet.tx.tags import tags as w
then generating a tree and serializing it:
>>> w.html( w.head( w.title( 'Hello, World!' ) ) , w.body( 'bla bla bla' ) )
>>> _.serialize()
'<html><head><title>Hello, World!</title></head><body>bla bla bla</body></html>'
The names _doc_ and _comment_ have special meanings. The former creates a Document node, while the latter creates a Comment node.
>>> w._doc_( w.foo( w._comment_( ' this is a comment ' ) ) , w.bar( 'quux&baz' ) ).serialize()
'<foo><!-- this is a comment --></foo><bar>quux&baz</bar>'
>>> w.foo( w._comment_( ' a comment ' ) , 'bar' , id = 'contents' , width = '92' ).serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'
For attribute names, a double _ is translated to : and a single _ is translated to -. A _ at the start of a name is dropped (useful when the name matches a Python keyword).
>>> w._return( 'Et voila !' , xml__lang = 'fr' , _class = 'rt2' ).serialize()
'<return xml:lang="fr" class="rt2">Et voila !</return>'
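The naming rules above can be sketched as a small standalone function. This is only an illustration of the rules as stated in this document: `translate_name` is a hypothetical helper, not the code used by the tags module, and the special names _doc_ and _comment_ are deliberately not handled.

```python
def translate_name(name):
    """Map a Python identifier to an XML name (hypothetical helper)."""
    # Drop one leading underscore, so callers can write _class or _return.
    if name.startswith('_'):
        name = name[1:]
    # Replace double underscores first, otherwise they would be
    # consumed as two single underscores.
    return name.replace('__', ':').replace('_', '-')


for python_name in ('xml__lang', '_class', 'http_equiv'):
    print(python_name, '->', translate_name(python_name))
```

The order of the two replacements matters: swapping them would turn `xml__lang` into `xml--lang`.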
The tags object from the tags module is just a convenient way to build a tree; behind the scenes it simply constructs Element, Attribute, Text, etc. nodes implicitly.
Here is how to generate a tree directly from these objects:
>>> from tuxeenet.tx.nodes import *
>>> a = Attribute( 'id' , 'contents' )
>>> b = Attribute( 'width' , '92' )
>>> c = Comment( ' a comment ' )
>>> d = Text( 'bar' )
>>> e = Element( 'foo' , ( a , b ) , ( c , d ) ) # name, attributes, children
>>> e.serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'
>>> from urllib import urlopen
>>> from tuxeenet.tx.htmltree import parse
Fetching and parsing the http://slashdot.org page:
>>> doc = parse( urlopen( 'http://slashdot.org/' ).read() )
>>> doc
<Document with 2 children>
The examples below use this doc variable. Note that you will not necessarily get exactly the same output, since the page (the Slashdot homepage) can of course change.
From the doc, we can print a verbose tree that shows the document structure with node types:
>>> print doc.asDebug( maxChildren = 2 )
DOCUMENT[0] with 2484 nodes
  TEXT[1] '\n'
  ELEMENT[2] html
    ELEMENT[3] head
      ELEMENT[4] title
        TEXT[5] 'Slashdot: News for nerds, stuff that matters'
      ELEMENT[6] link
        ATTRIBUTE[7] rel = `top`
        ATTRIBUTE[8] title = `News for nerds, stuff that matters`
        ATTRIBUTE[9] href = `//slashdot.org/`
      [.. and 13 more children ..]
    TEXT[36] '\n'
    [.. and 2 more children ..]
A large part of XPath 2.0 is available.
For example, to extract the string value of the title attribute of a link element that also has a rel attribute with the value top, where this link element is a child of a head element that is itself a child of the root html element:
>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
>>> tuple(_) # Convert resulting sequence to a tuple
(u'News for nerds, stuff that matters',)
(Note that a Unicode string is returned.)
The pyRXP module is a wrapper for the RXP XML parser. tx provides a way to convert an existing tree to the type of structure used by pyRXP.
This is really only useful for compatibility with the RXP module.
>>> sequence = doc[ '//font[@face="verdana"]' ]
>>> sequence[ 0 ].asRxp()
('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
and translating back to a tx tree:
>>> doc = ('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
>>> from tuxeenet.tx.rxpcompat import fromRxp
>>> fromRxp( doc )
<Element font with 2 attributes and 2 children>
>>> fromRxp( doc ).serialize()
'<font color="#001670" face="verdana">\xc2\xa0<b>OSTG</b></font>'
Important note: You may have noticed that \xa0 is printed as \xc2\xa0. This is because .serialize() produces a string with utf-8 encoding by default.
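The relation between \xa0 and \xc2\xa0 can be checked with plain Python, independently of tx. The snippet below only uses the built-in string encoding machinery; it does not call .serialize().

```python
# U+00A0 (no-break space) is a single character ...
text = u'\xa0'

# ... but its UTF-8 encoding is the two bytes 0xC2 0xA0.
encoded = text.encode('utf-8')
print(repr(encoded))

# Decoding the bytes gives back the original one-character string.
decoded = encoded.decode('utf-8')
print(decoded == text)
```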
The first two examples show how to translate XQuery to Python, while the third shows how to translate XSLT to Python.
These examples show that specialized languages are not necessary to process XML documents with the same power as XQuery, XSLT, etc.
Python could be used to fully replace complex XQuery operations.
Taking the following example from the XQuery spec:
let $i := <tool>wrench</tool>
let $o := <order> {$i} <quantity>5</quantity> </order>
let $odoc := document ($o)
let $newi := $o/tool
Which is followed by these expected results:
fn:root($i) returns $i
fn:root($o/quantity) returns $o
fn:root($odoc//quantity) returns $odoc
fn:root($newi) returns $o
Some notes:
The XQuery version implicitly copies trees, but in Python we have to request the copy explicitly with the clone member function.
We have to call .finalize() on trees that are not Document nodes because, by default, any node that is not a Document is considered part of another tree rather than the root of a tree of its own.
We have to run fnRoot (the fn:root function) in the nullContext explicitly.
First part translated to Python with tx modules (using the tags module):
i = w.tool( 'wrench' )
o = w.order( i.clone() , w.quantity( '5' ) )
odoc = o.clone()
newi = o/'tool' # Notice the use of the '/' operator to use XPath
Declare the trees as standalone:
i.finalize()
o.finalize()
odoc.finalize()
(Note: We define a root function to simplify fnRoot usage.)
>>> from tuxeenet.tx.sequence import Sequence
>>> from tuxeenet.tx.context import Context
>>> from tuxeenet.tx.xpathfn import fnRoot
>>> root = lambda node : fnRoot( Context() , Sequence( node ) )
Then we can check the expected results:
>>> assert root( i ) == Sequence( i )
>>> assert root( o/'quantity' ) == Sequence( o )
>>> assert root( odoc/'.//quantity' ) == Sequence( odoc ) # The '.' is important
>>> assert root( newi ) == Sequence( o )
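To see why the explicit clone matters for these results, here is a minimal standalone sketch that does not use tx at all. The `Node` class, its `child` method and the `root` function are hypothetical stand-ins: nodes only carry a name, children and a parent pointer, and text nodes are left out.

```python
class Node(object):
    """Hypothetical minimal tree node: a name, children and a parent pointer."""

    def __init__(self, name, *children):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def clone(self):
        # Deep copy: the copy has fresh children and no parent.
        return Node(self.name, *[c.clone() for c in self.children])

    def child(self, name):
        # First child with the given name (stand-in for o/'tool').
        return next(c for c in self.children if c.name == name)


def root(node):
    # Follow parent pointers up to the top of the tree, as fn:root does.
    while node.parent is not None:
        node = node.parent
    return node


i = Node('tool')
o = Node('order', i.clone(), Node('quantity'))
odoc = Node('#document', o.clone())
newi = o.child('tool')

assert root(i) is i
assert root(o.child('quantity')) is o
assert root(odoc.children[0].child('quantity')) is odoc
assert root(newi) is o
```

Without the call to `clone`, `i` itself would become a child of `o`, and `root(i)` would return `o` instead of `i`.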
An example from http://www.perfectxml.com/XQuery.html, with the books.xml document used below:
<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author>
<last>Stevens</last>
<first>W.</first>
</author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>
<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author>
<last>Stevens</last>
<first>W.</first>
</author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>
<book year="2000">
<title>Data on the Web</title>
<author>
<last>Abiteboul</last>
<first>Serge</first>
</author>
<author>
<last>Buneman</last>
<first>Peter</first>
</author>
<author>
<last>Suciu</last>
<first>Dan</first>
</author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>
<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor>
<last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>
</bib>
The XQuery source code:
<listings>
{
for $p in distinct-values(doc("books.xml")//publisher)
order by $p
return
<result>
{ $p }
{
for $b in doc("books.xml")/bib/book
where $b/publisher = $p
order by $b/title
return $b/title
}
</result>
}
</listings>
Translation to Python using tx, assuming the books.xml XML tree is in the doc variable:
w.listings(
    w.result( p , sorted( b/'title'
                          for b in doc/'/bib/book'
                          if b/'publisher' == p ) )
    for p in sorted( doc/'distinct-values(//publisher)' ) )
Which constructs the following document (indentation added manually):
<listings>
  <result>
    <publisher>Addison-Wesley</publisher>
    <title>Advanced Programming in the UNIX Environment</title>
    <title>TCP/IP Illustrated</title>
  </result>
  <result>
    <publisher>Kluwer Academic Publishers</publisher>
    <title>The Economics of Technology and Content for Digital TV</title>
  </result>
  <result>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <title>Data on the Web</title>
  </result>
</listings>
Doing XSLT-like transformation.
Taking an example from http://www.adp-gmbh.ch/xml/xslt_examples.html, with the document:
<?xml version="1.0" ?>
<famous-persons>
<persons category="medicine">
<person>
<firstname> Edward </firstname>
<name> Jenner </name>
</person>
<person>
<firstname> Gertrude </firstname>
<name> Elion </name>
</person>
</persons>
<persons category="computer science">
<person>
<firstname> Charles </firstname>
<name> Babbage </name>
</person>
<person>
<firstname> Alan </firstname>
<name> Touring </name>
</person>
<person>
<firstname> Ada </firstname>
<name> Byron </name>
</person>
</persons>
<persons category="astronomy">
<person>
<firstname> Tycho </firstname>
<name> Brahe </name>
</person>
<person>
<firstname> Johannes </firstname>
<name> Kepler </name>
</person>
<person>
<firstname> Galileo </firstname>
<name> Galilei </name>
</person>
</persons>
</famous-persons>
and the stylesheet:
<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html><head><title>Sorting example</title></head><body>
<xsl:apply-templates select="famous-persons/persons">
<xsl:sort select="@category" />
</xsl:apply-templates>
</body></html>
</xsl:template>
<xsl:template match="persons">
<h2><xsl:value-of select="@category" /></h2>
<ul>
<xsl:apply-templates select="person">
<xsl:sort select="name" />
<xsl:sort select="firstname" />
</xsl:apply-templates>
</ul>
</xsl:template>
<xsl:template match="person">
<xsl:text disable-output-escaping="yes">
<li>
</xsl:text>
<b><xsl:value-of select="name" /></b>
<xsl:value-of select="firstname" />
</xsl:template>
</xsl:stylesheet>
could be translated to Python as follows:
import operator as op # for op.itemgetter

def transform( node ) :
    if node.match( '/' ) :
        return w.html( w.head( w.title( 'Sorting example' ) ) ,
                       w.body( map( transform ,
                                    sorted( node/'famous-persons/persons' ,
                                            key = op.itemgetter( '@category' ) ) ) ) )
    elif node.match( 'persons' ) :
        return ( w.h2( node/'@category/string()' ) ,
                 w.ul( map( transform ,
                            sorted( sorted( node/'person' ,
                                            key = op.itemgetter( 'firstname' ) ) ,
                                    key = op.itemgetter( 'name' ) ) ) ) )
    elif node.match( 'person' ) :
        return w.li( w.b( node/'name/string()' ) ,
                     node/'firstname/string()' )

result = w._doc_( transform( doc ) )
which produces:
DOCUMENT[0] with 47 nodes
  ELEMENT[1] html
    ELEMENT[2] head
      ELEMENT[3] title
        TEXT[4] 'Sorting example'
    ELEMENT[5] body
      ELEMENT[6] h2
        TEXT[7] u'astronomy'
      ELEMENT[8] ul
        ELEMENT[9] li
          ELEMENT[10] b
            TEXT[11] u' Brahe '
          TEXT[12] u' Tycho '
        ELEMENT[13] li
          ELEMENT[14] b
            TEXT[15] u' Galilei '
          TEXT[16] u' Galileo '
        ELEMENT[17] li
          ELEMENT[18] b
            TEXT[19] u' Kepler '
          TEXT[20] u' Johannes '
      ELEMENT[21] h2
        TEXT[22] u'computer science'
      ELEMENT[23] ul
        ELEMENT[24] li
          ELEMENT[25] b
            TEXT[26] u' Babbage '
          TEXT[27] u' Charles '
[...]
Note that we first sort by firstname, then by name. Also note that such examples are only here to show that XSLT or XQuery can be translated to Python; they do not necessarily give an optimized alternative.
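Sorting by firstname first and by name second works because Python's sort is stable: the last sort applied decides the primary key, and ties keep the order produced by the earlier sort. The sketch below shows this on plain dictionaries rather than tx nodes; the second "Babbage" entry is made up so that there is a tie on name.

```python
from operator import itemgetter

persons = [
    {'firstname': 'Charles', 'name': 'Babbage'},
    {'firstname': 'Alan', 'name': 'Touring'},
    {'firstname': 'Ada', 'name': 'Byron'},
    {'firstname': 'Ada', 'name': 'Babbage'},  # made-up entry to force a tie
]

# Two passes: secondary key first, primary key last (sorted() is stable).
two_pass = sorted(sorted(persons, key=itemgetter('firstname')),
                  key=itemgetter('name'))

# One pass with a compound key gives the same order.
one_pass = sorted(persons, key=itemgetter('name', 'firstname'))

assert two_pass == one_pass
print([(p['name'], p['firstname']) for p in two_pass])
```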
Note also that to emulate XSLT we might have to add other node matches at the end of transform, such as:
[...]
    elif node.match( '@*|text()' ) :
        return node
    else :
        return map( transform , node/'node()' )
but in our example that was not necessary.
An XML document is a tree of nodes.
With the current implementation, nodes are supposed to be constant. Once created, they are not expected to be updated, mainly because a Document node numbers all its descendants, and inserting children somewhere in the document would require this numbering to be redone at some point.
FIXME: Tag the root node with some flag (or the node from which we need to restart the numbering, the lowest one if several descendants are updated) to let it know that it should renumber its descendants when needed? In the meantime, number new children with a unique number (the parent one)?
A Document node can contain any nodes except Document and Attribute.
Constructor: Document( children = () , finalize = True )
If finalize is True (the default), the Document numbers all its descendants and sets their root pointers to itself, making the Document node the root node of the tree.
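The numbering step can be pictured with a short standalone sketch. It is an assumption based on the asDebug output shown in this document, where an element is followed first by its attributes and then by its children; the dictionary layout and the `number_nodes` function are hypothetical and are not the tx implementation.

```python
def number_nodes(node, start=0):
    """Assign consecutive numbers in document order.

    `node` is a dict with optional 'attributes' and 'children' lists
    (a hypothetical stand-in for tx node objects).
    Returns the next free number.
    """
    node['number'] = start
    next_number = start + 1
    # Attributes come right after their element, before any child.
    for attribute in node.get('attributes', ()):
        attribute['number'] = next_number
        next_number += 1
    for child in node.get('children', ()):
        next_number = number_nodes(child, next_number)
    return next_number


title = {'name': 'title', 'children': [{'text': 'Hello'}]}
link = {'name': 'link', 'attributes': [{'name': 'rel'}, {'name': 'href'}]}
head = {'name': 'head', 'children': [title, link]}
document = {'children': [head]}

total = number_nodes(document)
print(total)
print(link['number'], [a['number'] for a in link['attributes']])
```

Because the numbers are consecutive, inserting a node in the middle of the tree would shift the number of every node that follows it, which is why nodes are treated as constant once the tree is finalized.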
An Element node can contain any nodes except Document.
Constructor: Element( name , attributes = () , children = () , finalize = False )
An Attribute node is only allowed inside an Element node.
Constructor: Attribute( name , value )
A Comment node is only allowed inside a Document or an Element node.
Constructor: Comment( contents )
Restriction: In XML, a comment cannot contain -- nor end with -.
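The restriction can be expressed as a one-line check. `is_valid_comment` is a hypothetical helper written for illustration; this document does not say whether the Comment constructor performs such a check itself.

```python
def is_valid_comment(contents):
    """Return True if `contents` may appear between <!-- and --> in XML."""
    return '--' not in contents and not contents.endswith('-')


print(is_valid_comment(' this is a comment '))
print(is_valid_comment('a -- b'))
print(is_valid_comment('trailing dash-'))
```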
A Text node is only allowed inside a Document or an Element node.
Constructor: Text( contents )
The tags module provides the w object, which can be used to create a document tree with Python syntax.
A string is automatically translated to a Text node.
Otherwise, a node of type Document, Element or Comment is created with the general syntax: w.name( child1 , .. , attribute1 = value1 , .. ). Attributes only make sense for the Element node type, however.
It is also possible to pass a function, which takes no parameters and should return a correct XML tree. This function will be called at serialization time.
w._doc_(
    w.html(
        w._comment_( ' Header ' ) ,
        w.head(
            w.title( 'This is a example page' ) ,
            w.link( rel = 'stylesheet' , href = '/default-style.css' , title = 'Default style' ) ,
            w.meta( http_equiv = 'Content-type' , content = 'text/html' , charset = 'utf-8' ) ) ,
        w._comment_( ' Body ' ) ,
        w.body(
            w.h1( 'Section 1' ) ,
            w.h2( 'Section 1.1' ) ,
            'Bla bla bla.' ,
            w.h2( 'Section 1.2' ) ,
            'Bla bla bla.' ) ) )
generates a document which, once serialized, gives:
<html>
  <!-- Header -->
  <head>
    <title>This is a example page</title>
    <link href="/default-style.css" rel="stylesheet" title="Default style"/>
    <meta content="text/html" charset="utf-8" http-equiv="Content-type"/>
  </head>
  <!-- Body -->
  <body>
    <h1>Section 1</h1>
    <h2>Section 1.1</h2>
    Bla bla bla.
    <h2>Section 1.2</h2>
    Bla bla bla.
  </body>
</html>
Note that the result here is split into several lines and indented, while in reality the result is just one line of text, since no \n (newline) characters are part of the document.
Same document presented with debug output:
>>> print doc.asDebug()
DOCUMENT[0] with 24 nodes
  ELEMENT[1] html
    COMMENT[2] ' Header '
    ELEMENT[3] head
      ELEMENT[4] title
        TEXT[5] 'This is a example page'
      ELEMENT[6] link
        ATTRIBUTE[7] href = `/default-style.css`
        ATTRIBUTE[8] rel = `stylesheet`
        ATTRIBUTE[9] title = `Default style`
      ELEMENT[10] meta
        ATTRIBUTE[11] content = `text/html`
        ATTRIBUTE[12] charset = `utf-8`
        ATTRIBUTE[13] http-equiv = `Content-type`
    COMMENT[14] ' Body '
    ELEMENT[15] body
      ELEMENT[16] h1
        TEXT[17] 'Section 1'
      ELEMENT[18] h2
        TEXT[19] 'Section 1.1'
      TEXT[20] 'Bla bla bla.'
      ELEMENT[21] h2
        TEXT[22] 'Section 1.2'
      TEXT[23] 'Bla bla bla.'
Example of a deferred function:
count = 0

def foo() :
    global count
    count += 1
    return w.p( "I'm generated %d time(s)." % count )

doc = w.body( foo )
print doc.serialize()
print doc.serialize()
print doc.serialize()
produces:
<body><p>I'm generated 1 time(s).</p></body>
<body><p>I'm generated 2 time(s).</p></body>
<body><p>I'm generated 3 time(s).</p></body>
The xpath module provides a large subset of XPath 2.0.
Unsupported features are:
instance of, treat as, castable as, cast as operators,
processing-instruction(..) and namespace(..) tests,
schema-attribute(..) and schema-element(..) tests,
date support, or any type other than string, float and boolean (decimal and double are treated as float).
The xpath module contains a compile function which takes an XPath expression and returns a function that takes a context as argument and returns a sequence as the result of the evaluation.
The XPath class is a convenient (small) wrapper around the compile function.
An instance of the XPath class is created with an XPath expression. To evaluate the expression, use the eval member function with an optional context node.
>>> from tuxeenet.tx.xpath import XPath
>>> x1 = XPath( '//@href' )
>>> x1.eval( doc ) # return all href attributes in document 'doc'
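The "compile an expression into a function, then wrap it in a class" arrangement can be illustrated with a toy example. The sketch below is not the tx implementation: it handles only paths made of child element names, evaluates them over xml.etree.ElementTree from the standard library, and `compile_path` and `Path` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def compile_path(expression):
    """Compile 'a/b/c' (child steps only) into a function of a context node."""
    steps = expression.split('/')

    def evaluate(context):
        nodes = [context]
        for step in steps:
            # Each step replaces the current nodes by their matching children.
            nodes = [child for node in nodes for child in node
                     if child.tag == step]
        return nodes

    return evaluate


class Path(object):
    """Small wrapper around compile_path, mirroring the XPath/compile split."""

    def __init__(self, expression):
        self._function = compile_path(expression)

    def eval(self, context):
        return self._function(context)


html = ET.fromstring('<html><head><title>Hi</title></head><body/></html>')
titles = Path('head/title').eval(html)
print([t.text for t in titles])
```

The expression is split once, at compile time; the returned closure can then be applied to any number of context nodes without parsing the expression again.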
The Node base class defines the [] and / operators to make it easy to query a tree with an XPath expression.
>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
or
>>> doc / '/html/head/link[@rel="top"]/@title/string()'
are equivalent; however, the latter form could also be written:
>>> doc/'html/head/link[@rel="top"]/@title/string()'
(without the initial / in the XPath expression), since doc is already the root node (in this example).
This is almost the direct translation of the following XQuery code:
$doc/html/head/link[@rel="top"]/@title/string()
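How a class can offer both the [] and / query forms is sketched below with Python's operator methods. This is an illustration only: `Queryable` is a hypothetical wrapper around xml.etree.ElementTree elements, and it supports just the limited path syntax of `findall`, not XPath 2.0.

```python
import xml.etree.ElementTree as ET


class Queryable(object):
    """Hypothetical wrapper giving an ElementTree element [] and / operators."""

    def __init__(self, element):
        self.element = element

    def __getitem__(self, path):
        # node['a/b'] evaluates the path relative to this node.
        return [Queryable(e) for e in self.element.findall(path)]

    def __truediv__(self, path):
        # node / 'a/b' runs the same query.
        return self[path]

    # Python 2 looks up __div__ for the / operator.
    __div__ = __truediv__


doc = Queryable(ET.fromstring(
    '<html><head><link rel="top" title="News"/></head></html>'))

by_slash = doc / "head/link[@rel='top']"
by_index = doc["head/link[@rel='top']"]
print([link.element.get('title') for link in by_slash])
```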
For debugging purposes, an "XPath prompt" application is available to interactively evaluate XPath expressions.
$ tx-prompt
XPath TX 0.1 - (c)2005 Frederic Jolliton <frederic@jolliton.com>
XPath2.0>
Then any supported XPath expression can be entered.
There are some special commands:
\. followed by a URI (filename, URL, etc.) to load a document as the default context item,
\d switch to default display mode,
\f switch to full display mode,
\s switch to short display mode,
\i switch to inline display mode,
\l switch the display of the location of resulting node on/off,
\e followed by an XPath expression show its syntax tree,
\x toggle query duration display,
\v display names of variables currently defined,
\o toggle query optimization on/off,
\c flush query cache (useful after \o command to flush already compiled expression),
$name := expression evaluate XPath expression and store the result into variable named name.
The $current variable is used as the context item when evaluating XPath expressions.
Producing sequences of numbers:
XPath2.0> 1+2
3
XPath2.0> 18 div 4
4.5
XPath2.0> (12+3)*5
75
XPath2.0> 1, 2, 3
Sequence(1, 2, 3)
XPath2.0> 12, 17 to 20, 22
Sequence(12, 17, 18, 19, 20, 22)
Using the if ternary operator:
XPath2.0> if (1=1) then "ok" else "failed"
ok
XPath2.0> if (1!=1) then "ok" else "failed"
failed
Working with an XML tree:
Fetching a document:
XPath2.0> $current := doc('http://slashdot.org')
Extracting the title:
XPath2.0> /html/head/title
<Element title with 0 attributes and 1 children>
XPath2.0> \i
[inline]
XPath2.0> /html/head/title
<title>Slashdot: News for nerds, stuff that matters</title>
Extracting article titles (pre-September 2005):
XPath2.0> \s
[short]
XPath2.0> //td[@align='LEFT']//font[@color]/b[text()]/string()
1 STRING Ask Slashdot: GSM and Asterisk Integration?
2 STRING Hardware: Free WiFi Trend Continues
3 STRING Linux: Winemaker Drinks To Linux
4 STRING Games: World of Warcraft Card Game Coming Soon
5 STRING Your Rights Online: Is Your Boss a Psychopath?
6 STRING Linux: Australian Linux Trademark Holds Water
7 STRING Science: Nanotubes Start to Show their Promise
Extracting article titles (post-September 2005):
XPath2.0> \s
[short]
XPath2.0> //div[@class='generaltitle']/normalize-space()
1 STRING Tivo Institutes 1 Year Service Contracts
2 STRING Politics: US Senate Allows NASA To Buy Soyuz Vehicles
3 STRING IT: Reconnaissance In Virtual Space
4 STRING Your Rights Online: FBI Agents Put New Focus on Deviant Porn
5 STRING Ask Slashdot: Top 50 Science Fiction TV Shows
6 STRING Your Rights Online: Business At The Price Of Freedom
7 STRING Apple: Music Exec Fires Back At Apple CEO
8 STRING Science: Grammar Traces Language Roots
9 STRING Developers: RMS Previews GPL3 Terms
10 STRING Massachusetts Finalizes OpenDocument Standard Plan
11 STRING Developers: Palm Teams With Microsoft for Smart Phone
12 STRING Developers: Why Vista Had To Be Rebuilt From Scratch
13 STRING Hardware: Nabaztag the WiFi Bunny
14 STRING Revamping the Movie Distribution Chain
15 STRING Politics: Municipal Broadband Projects Spread Across U.S.
Extracting RSS title from an external document:
XPath2.0> \s
[short]
XPath2.0> doc('http://rss.slashdot.org/Slashdot/slashdot')//item/title/string()
1 STRING Tivo Institutes 1 Year Service Contracts
2 STRING US Senate Allows NASA To Buy Soyuz Vehicles
3 STRING Reconnaissance In Virtual Space
4 STRING FBI Agents Put New Focus on Deviant Porn
5 STRING Top 50 Science Fiction TV Shows
6 STRING Business At The Price Of Freedom
7 STRING Music Exec Fires Back At Apple CEO
8 STRING Grammar Traces Language Roots
9 STRING RMS Previews GPL3 Terms
10 STRING Massachusetts Finalizes OpenDocument Standard Plan
Extracting the 1st, 3rd and 7th comments of the document:
XPath2.0> \i
[inline]
XPath2.0> \x
Timer On
XPath2.0> (//comment())[position()=(1,3,7)]
<!-- BEGIN: AdSolution-Tag 4.2: Global-Code -->
<!-- begin OSTG navbar -->
<!-- end ad code -->
-- 0.038567s(parse) + 0.006351s(eval) --
Querying the distinct values of the value attributes in the document:
XPath2.0> \f
[full]
XPath2.0> distinct-values(//@value)
1 NODE ATTRIBUTE[1856] value = ``
2 NODE ATTRIBUTE[1866] value = `//slashdot.org/`
3 NODE ATTRIBUTE[1871] value = `userlogin`
4 NODE ATTRIBUTE[1882] value = `yes`
5 NODE ATTRIBUTE[1889] value = `Log in`
6 NODE ATTRIBUTE[1940] value = `1307`
7 NODE ATTRIBUTE[1945] value = `mainpage`
8 NODE ATTRIBUTE[1954] value = `1`
9 NODE ATTRIBUTE[1960] value = `2`
10 NODE ATTRIBUTE[1966] value = `3`
11 NODE ATTRIBUTE[1972] value = `4`
12 NODE ATTRIBUTE[1978] value = `5`
13 NODE ATTRIBUTE[1984] value = `6`
14 NODE ATTRIBUTE[1990] value = `7`
15 NODE ATTRIBUTE[1995] value = `Vote`
16 NODE ATTRIBUTE[2346] value = `freshmeat.net`
17 NODE ATTRIBUTE[2439] value = `Search`
-- 0.012589s(parse) + 0.055717s(eval) --
Computing the number of pixels covered by img elements:
XPath2.0> \d
[default]
XPath2.0> for $img in //img return $img/@width * $img/@height
Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
or alternatively:
XPath2.0> //img/(@height * @width)
Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
Looking for rel attributes, with location (the XPath expression that could be used to precisely identify each node in the resulting sequence):
XPath2.0> \f
[full]
XPath2.0> \l
Location on
XPath2.0> //@rel
1 NODE /html/head/link[1]/@rel ATTRIBUTE[7] rel = `top`
2 NODE /html/head/link[2]/@rel ATTRIBUTE[12] rel = `search`
3 NODE /html/head/link[3]/@rel ATTRIBUTE[17] rel = `alternate`
4 NODE /html/head/link[4]/@rel ATTRIBUTE[23] rel = `shortcut icon`
Displaying the parse tree of some expressions (mainly useful for debugging purposes!):
XPath2.0> \e 1+2
(exprlist (+ (path (integer "1"))
(path (integer "2"))))
XPath2.0> \e foo() + bar()
(exprlist (+ (path (call "foo"))
(path (call "bar"))))
XPath2.0> \e ../foo | @bar
(exprlist (union (path (parent (node))
(child (element "foo")))
(path (attribute (attribute "bar")))))
XPath2.0> \e /html/child::head/element(title)/string()
(exprlist (path "/"
(child (element "html"))
(child (element "head"))
(child (element "title"))
(call "string")))
XPath2.0> \e for $att in distinct-values(//@*/name()) return ($att,count(//attribute()[name()=$att]))
(exprlist (for ((att (path (call "distinct-values"
(path "/"
(descendant-or-self (node))
(attribute (attribute "*"))
(call "name"))))))
(path (exprlist (path (varref "att"))
(path (call "count"
(path "/"
(descendant-or-self (node))
(predicates (attribute (attribute))
(exprlist (= (path (call "name"))
(path (varref "att"))))))))))))
Patterns are used for XSLT node matching.
node.match( 'a/b' ) # Return True if node is an element `b` with a parent element `a`
Note: Internally, patterns are translated to XPath expressions. Such an expression returns the empty sequence if the pattern doesn't match the node. For example, a/b becomes something like self::element(b)/parent::element(a).
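The rewriting described in the note can be sketched for the simplest kind of pattern. `pattern_to_xpath` is a hypothetical function, not the tx translator: it handles only relative patterns made of plain element names separated by /, and ignores //, predicates, attributes and absolute patterns.

```python
def pattern_to_xpath(pattern):
    """Rewrite a child-only pattern such as 'a/b' as a test on the node itself."""
    names = pattern.split('/')
    # The last step must match the node itself ...
    steps = ['self::element(%s)' % names[-1]]
    # ... and each earlier step must match the next parent up.
    steps.extend('parent::element(%s)' % name
                 for name in reversed(names[:-1]))
    return '/'.join(steps)


print(pattern_to_xpath('a/b'))
print(pattern_to_xpath('a/b/c'))
```

Reading the pattern from right to left turns "is this node reachable by the path" into "does this node, then its parent, then its grandparent, have the expected names", which can be evaluated starting from the candidate node.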
The htmlparser module provides a replacement for the HTMLParser module provided with Python. The main difference is that the tx module never throws an error. It is able to parse the worst HTML documents.
Note: To parse regular XML documents, a parser like rxp could be used instead, with the help of the rxpcompat module, because the HTMLParser is not designed for XML and is not necessarily good enough for this purpose. Your mileage may vary.
The htmltree module uses the htmlparser module and produces a Document.
>>> import sys
>>> from tuxeenet.tx.htmltree import parse
>>> parse( '<html><test></html>' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 3 nodes
  ELEMENT[1] html
    ELEMENT[2] test
>>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 7 nodes
  ELEMENT[1] html
    TEXT[2] 'some ~< bad '
    ELEMENT[3] p
      TEXT[4] 'document />'
    ELEMENT[5] p
      TEXT[6] 'really'
>>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).serialize()
'<html>some ~< bad <p>document /></p><p>really</p></html>'