Using the Reader API
Note: This was written years back, and needs to be checked against git head.
This post describes the Reader API of ruby-libxml.
## Introduction
Several techniques exist to parse XML documents. You can read up on them in this Wikipedia article. `XML::Reader` provides a StAX-style API for parsing XML documents.
The Reader API provides a "cursor" that moves forward through the XML document node by node, and you process the data in a node while the cursor is on it. This paradigm is also called "pull parsing". You can initialize an XML document from a file, string, URI or IO object and then call `XML::Reader#read` to move through the document. The `read` method returns false when there are no more nodes to read. Optionally, you can pass a hash while initializing the document to control how parsing is done. Typically, you would do something like this:
doc = XML::Reader.file("trees.xml", :options =>XML::Parser::Options::NOENT)
process(doc) while doc.read
Possible parsing options are the constants of `XML::Parser::Options`. More than one option can be combined using bitwise or (`|`).
After a document is parsed, you should free the resources by calling `doc.close`.
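The read loop and the closing step can be wrapped together so that the reader is closed even when processing raises. The following is only a sketch: `each_node` is a hypothetical helper name, not part of the library, and it assumes a reader that responds to `read` (returning false at the end, as described above) and `close`.

```ruby
# Hypothetical helper (not part of the library): drive a pull reader to the
# end of the document, yielding it once per node, and always close it.
def each_node(reader)
  count = 0
  begin
    # read returns false once there are no more nodes
    while reader.read
      yield reader
      count += 1
    end
  ensure
    # free the resources even if the block raised
    reader.close
  end
  count
end
```

With the reader from the snippet above, `each_node(doc) { |node| process(node) }` would stand in for both the `while doc.read` loop and the explicit `doc.close`.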
While the cursor is at one of the nodes, you can query it for:
- Node Type: `doc.node_type` returns the type of the node, which is one of the following:
  - Start of an element : 1
  - Attributes : 2
  - Text : 3
  - CDATA : 4
  - Entity References : 5
  - Entity Declarations : 6
  - Processing Instruction : 7
  - Comments : 8
  - Document : 9
  - DTD/Doctype : 10
  - Document Fragment : 11
  - Notation : 12
  - Whitespace : 13
  - Significant Whitespace : 14
  - End of an element : 15
  - End entity : 16
  - XML Declaration : 17

  See this for a description of all the node types. Constants are defined for the node types under the `XML::Reader` class.
- Name: `doc.name` returns the qualified name of the node (prefix + local name).
  - Local Name: `doc.local_name` returns the local name of the node (the name without the associated prefix).
  - Prefix: `doc.prefix` returns the namespace prefix associated with the node.
- Namespace: `doc.namespace_uri` returns the URI of the node's namespace.
  - Namespace declarations are also considered nodes, in line with the DOM API. You can use `doc.namespace_declaration?` to find out whether an attribute node is a namespace declaration.
  - Given the prefix (see Name above), you can find the associated namespace with `doc.lookup_namespace("prefix")`; use `nil` if you want the default namespace.
- Value: `doc.value` returns the text value of the node if present, else `nil`. Alternatively, you can check whether the node has a text value with `doc.has_value?`.
- Empty: `doc.empty_element?` tells you whether the node is empty. Empty elements are those that are closed in their start tag itself.
- Depth: `doc.depth` returns the depth of the node in the tree from the base element.
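The queries listed above can be gathered into a single snapshot of the node under the cursor. This is a rough sketch rather than library code: `NODE_TYPE_NAMES` and `describe_node` are names made up for illustration, the numeric codes are the ones listed above, and the reader is assumed to respond to the query methods described in this section.

```ruby
# Hypothetical lookup table: node type codes (as listed above) to labels.
NODE_TYPE_NAMES = {
  1  => "element",
  2  => "attribute",
  3  => "text",
  4  => "CDATA",
  5  => "entity reference",
  6  => "entity declaration",
  7  => "processing instruction",
  8  => "comment",
  9  => "document",
  10 => "doctype",
  11 => "document fragment",
  12 => "notation",
  13 => "whitespace",
  14 => "significant whitespace",
  15 => "end element",
  16 => "end entity",
  17 => "XML declaration"
}.freeze

# Hypothetical helper: collect the properties of the current node in a Hash.
def describe_node(reader)
  info = {
    :type  => NODE_TYPE_NAMES.fetch(reader.node_type, "unknown"),
    :name  => reader.name,
    :depth => reader.depth
  }
  # prefix is nil for nodes without a namespace prefix
  if reader.prefix
    info[:prefix]     = reader.prefix
    info[:local_name] = reader.local_name
    info[:namespace]  = reader.namespace_uri
  end
  info[:value] = reader.value if reader.has_value?
  info[:empty] = true if reader.empty_element?
  info
end
```

Inside a read loop, `p describe_node(doc)` would print one such Hash per node. Comparing against the constants under `XML::Reader` is preferable to the bare numbers in real code; the table is only there to make the output readable.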
To find out whether a node has attributes, use `doc.has_attributes?`. You can find the attribute count of the node with `doc.attribute_count`. Even though attributes are also nodes, `doc.read` does not move the cursor to an attribute node.
- Attributes can be accessed in a hash-like manner with the `[]` method. `[]` can be called with the attribute's name or index (the first attribute is indexed 0).
- With `doc.move_to_next_attribute` you can move the cursor to the next attribute. It returns 1 if the cursor moved to the next attribute and 0 if there is no attribute to move to. While the cursor is at an attribute node you can query it like any other node (for name, value, node type, depth) as described above. You must remember to move back to the element node with `doc.move_to_element`. Alternatively, you can call the `move_to_attribute` function on the cursor with an attribute's name as the argument to move to an attribute node. I prefer the array notation.
- `read_attribute_value` is a related method whose use I have not understood fully. Refer to the documentation if you will.
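The attribute walk described above (step to each attribute, then return to the element) could be sketched as below. `each_attribute` is a hypothetical helper name, and the sketch assumes `move_to_next_attribute` returns 1 on success and 0 when no attribute is left, as stated above; that return convention is worth checking against the version you have installed.

```ruby
# Hypothetical helper: yield the name and value of every attribute of the
# element under the cursor, then put the cursor back on the element.
def each_attribute(reader)
  return 0 unless reader.has_attributes?
  count = 0
  # 1 means the cursor moved to another attribute; anything else ends the walk
  while reader.move_to_next_attribute == 1
    yield reader.name, reader.value
    count += 1
  end
  # attributes are visited in place, so return to the owning element
  reader.move_to_element
  count
end
```

Called as `each_attribute(doc) { |name, value| puts "#{name} = #{value}" }` while the cursor is on an element, this leaves the cursor on that element, so the next `doc.read` continues as usual.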
To check whether the XML document conforms to a schema definition, call the `schema_validate` method on the reader object and pass it the location of the schema file. It returns 0 if the document validates and -1 in case of an error. Note that this method should be called right after you instantiate a `Reader` object. Trying to validate an XML document after you have started reading (that is, after calling `read` on the document object) is an error.
doc.schema_validate("schema.xsd")
There are a few more API calls, which you can refer to here.
Below is a "hello world" code example using the Reader API, a sample XML file and the result of parsing it. Since I have described the technicalities above, I am not going to walk you through the code.
require "rubygems"
require "xml"
#parse the sample.xml ignoring whitespaces and
#performing entity substitution.
doc = XML::Reader.file("sample.xml", :options => XML::Parser::Options::NOBLANKS |
                                                 XML::Parser::Options::NOENT)
#display a node's name: local and prefix
def display_name( node )
  puts "\tName: #{node.name}"
  if node.prefix
    puts "\t\tPrefix: #{node.prefix}"
    puts "\t\tLocal: #{node.local_name}"
  end
end
#display attributes of a node
def display_attributes( node )
  node.attribute_count.times do | index |
    puts "Attribute # #{index + 1}"
    node.move_to_next_attribute
    display node
  end
  node.move_to_element
end
#process a node
def display( node )
  display_name node
  puts "\tDepth: #{node.depth}"
  puts "\tEmpty Element" if node.empty_element?
  puts "\tValue: #{node.value}" if node.has_value?
  #only elements own attributes; walking attributes while the cursor is on
  #an attribute would move it back to the element and restart the walk
  display_attributes node if node.node_type == XML::Reader::TYPE_ELEMENT
  print "\n"
end
#step through the document.
i = 1
while doc.read
  unless doc.node_type == XML::Reader::TYPE_END_ELEMENT
    puts "Node # #{i}"
    display doc
    i += 1
  end
end
#free the resources
doc.close
Sample: it is an NeXML file.
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.8"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.nexml.org/1.0 ../xsd/nexml.xsd"
xmlns:nex="http://www.nexml.org/1.0"
generator="mesquite"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.nexml.org/1.0">
<otus
id="taxa1"
label="My taxa block"
xml:base="http://example.org/"
xml:id="taxa1"
class="taxset1"
xml:lang="EN"
xlink:href="#taxa1">
<!--
The taxon element is analogous to a single label in
a nexus taxa block. It may have the same additional
attributes (label, xml:base, xml:lang, xml:id, xlink:href
and class) as the taxa element.
-->
<otu id="t1"/>
<otu id="t2"/>
<otu id="t3"/>
<otu id="t4"/>
<otu id="t5"/>
</otus>
</nex:nexml>
Output:
Node # 1
Name: nex:nexml
Prefix: nex
Local: nexml
Depth: 0
Attribute # 1
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 2
Name: xmlns:nex
Prefix: xmlns
Local: nex
Depth: 1
Value: http://www.nexml.org/1.0
Attribute # 3
Name: xmlns:xlink
Prefix: xmlns
Local: xlink
Depth: 1
Value: http://www.w3.org/1999/xlink
Attribute # 4
Name: xmlns
Depth: 1
Value: http://www.nexml.org/1.0
Attribute # 5
Name: version
Depth: 1
Value: 0.8
Attribute # 6
Name: xsi:schemaLocation
Prefix: xsi
Local: schemaLocation
Depth: 1
Value: http://www.nexml.org/1.0 ../xsd/nexml.xsd
Attribute # 7
Name: generator
Depth: 1
Value: mesquite
Node # 2
Name: otus
Depth: 1
Attribute # 1
Name: id
Depth: 2
Value: taxa1
Attribute # 2
Name: label
Depth: 2
Value: My taxa block
Attribute # 3
Name: xml:base
Prefix: xml
Local: base
Depth: 2
Value: http://example.org/
Attribute # 4
Name: xml:id
Prefix: xml
Local: id
Depth: 2
Value: taxa1
Attribute # 5
Name: class
Depth: 2
Value: taxset1
Attribute # 6
Name: xml:lang
Prefix: xml
Local: lang
Depth: 2
Value: EN
Attribute # 7
Name: xlink:href
Prefix: xlink
Local: href
Depth: 2
Value: #taxa1
Node # 3
Name: #comment
Depth: 2
Value:
The taxon element is analogous to a single label in
a nexus taxa block. It may have the same additional
attributes (label, xml:base, xml:lang, xml:id, xlink:href
and class) as the taxa element.
Node # 4
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t1
Node # 5
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t2
Node # 6
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t3
Node # 7
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t4
Node # 8
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t5
`XML::Reader` is primarily a streaming interface, but it also provides convenient methods to mix in the DOM API (`XML::Parser`). XPath queries can then be used. Perhaps I will write about it in some future post, after I have tried it out. You can find good info here.