Woodstox

High-performance XML processor.

Woodstox XML-parser Frequently Asked/Anticipated Questions

1. General

1.1 What is Woodstox?

Woodstox is a high-performance XML processor that implements the StAX API,

specified by JSR-173. XML processor means that it can read (parse,

unmarshall, deserialize) and write (output, marshall, serialize)

XML content such as XML documents.

Woodstox is developed as Open Source, and is available under 2 "standard"

Open Source/Free Software licenses: Free Software Foundation's LGPL 2.1

and Apache Foundation's ASL 2.0.

1.2 Where can I find Woodstox?

Woodstox project home page is at Codehaus:

http://woodstox.codehaus.org

The home page contains the most up-to-date information on the current
status of the project, as well as contact information.

1.3 Standards compliance

1.3.1 Which XML technologies does Woodstox support?

Currently supported specifications are:

  • XML 1.0 and 1.1 (latter mostly); including full support for all entity

    types (internal, external; parsed, unparsed) and full DTD validation.

    (DTD validation from Woodstox 2.0 on; entities from 1.0)
  • XML Namespaces 1.0 (Woodstox 1.0 and above)
  • RelaxNG (using external validation component) (Woodstox 3.0 and above)

Specifications planned for future versions:

  • XML Id (planned for Woodstox 4.0)
  • XML Schema (using external validation component) (planned for Woodstox 4.0)
  • XML Include (planned for Woodstox 4.0 or later)

1.3.2 Which Java API does Woodstox implement?

Currently Woodstox implements the following standard APIs:

  • Stax 1.0 API as specified by JSR-173.

Additionally, Woodstox implements:

  • Experimental "Stax2" extension set (a collection of interfaces

    and abstract classes in org.codehaus.stax2 package), which is

    not proprietary to Woodstox implementation (although at present

    Woodstox is the only implementation of this API).
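
Because Woodstox is used through the standard Stax 1.0 API, application code
normally only needs to refer to javax.xml.stream classes. The following is a
minimal sketch of that idea, not an official sample: the class name, method
name and sample document are made up for illustration. It runs against
whichever Stax implementation the standard factory lookup finds (Woodstox,
if its jar is on the classpath).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; only javax.xml.stream types are referenced.
class StaxCursorExample {

    /** Returns the local names of all elements, in document order. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        // Standard lookup: uses whichever Stax implementation is available
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Prints [faq, question, answer]
        System.out.println(elementNames("<faq><question/><answer/></faq>"));
    }
}
```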

1.4 How do I report bugs and request new features?

The Woodstox development team uses the Jira bug tracking system at Codehaus:

http://jira.codehaus.org/browse/WSTX

and it can be used for bug reports, requests for enhanced functionality

and so on.

Alternatively, you can also join the Codehaus mailing lists:

  • user@woodstox.codehaus.org is the general mailing list for Woodstox users
  • dev@woodstox.codehaus.org is for more involved technical questions,

    and discussion on implementation aspects of Woodstox.

Additionally, for problems regarding Stax (1.0) specification and API,

you may want to use Stax API bug tracking system at:

http://www.extreme.indiana.edu/bugzilla/query.cgi

(search for component Stax)

or, for some of the problems, Jira instance for Stax reference implementation:

http://jira.codehaus.org/browse/STAX

General discussion about Stax API, and various implementations (including

Woodstox) usually happens at Stax builders list:

stax_builders@yahoogroups.com

which is open for anyone (not just stax implementation developers) to join.

1.5 What are the design goals of Woodstox?

Main goals are, in rough order:

  • Write an XML processor that completely implements the StAX API, to the

    fullest extent possible based on common sense interpretation of

    the specification and associated documentation (javadocs of the

    reference implementation).
  • Make parser as efficient as possible without completely sacrificing

    its maintainability (code clarity, simplicity). Efficiency is meant

    to encompass both time AND space constraints, ie. not only should it be

    fast, but also try to use memory sparingly.
  • Implement full XML (1.1) functionality; specifically make sure all

    well-formed/valid documents are properly handled. Secondary goal is to

    gracefully handle non-wellformed documents (and to catch problems).
  • Make features that can have significant impact on performance

    configurable; use reasonable defaults for settings. It should be easy

    to just plug-in and use, but also allow "power coders" to configure it

    optimally for specific use cases.
  • Sensible default values, so that Woodstox functions adequately with

    the default settings, with no need for extensive tweaking of settings.
  • Modularity; try to implement only features that cannot be implemented

    efficiently or reliably on top of StAX interface: other features should be

    implemented as separate add-on packages, to be usable with other StAX

    implementations.
  • Good error reporting: there is nothing more frustrating than getting

    either minimal information about problem ("Invalid content"), or too

    much information ("Element 'xyz' does not match Content model

    (a|b|(c, d+)|.......) derived from (foo, bar?, ...)...")
  • Extensive, modular, pluggable validation functionality; not just for input

    but also output side; allow for writing custom validators and plugging

    them in, efficiently chaining multiple validators if necessary.

1.6 What's in the Name?

The name Woodstox is just a silly combination of various motifs; mainly a

mutation of "STaX" part (from the Java API it implements), and then

similarity to both a sidekick cartoon character and the music festival

location. There is no real reason for it – it just sounded like a good

idea at the time.

2. StAX API features

2.1. How do I use XMLStreamWriter? This API is a mess!

Yes, it is indeed a bit of a mess. Unfortunately the Stax 1.0 specification
underspecifies the writer side, leading to lots of confusion, not only for
users, but for Stax implementors as well.

A full explanation of how the Woodstox implementors understand
XMLStreamWriter functionality to work is at

[link to be added]

but here is a quick rundown on various modes and settings.

Basic Stax 1.0 specifies two different operating modes; the difference
between them is in the handling of namespace bindings (declarations,
prefix mappings). If you do not use namespaces, there is no difference
between these modes.

  • Repairing mode means that the writer takes full responsibility for

    declaring and binding namespaces. Application can request specific

    bindings, but the writer ultimately decides on which bindings to

    use, to produce well-formed namespace output that corresponds to

    the fully-qualified name (namespace URI and local name via prefix

    bindings). Writer thus will output all namespace declarations

    automatically, and application should not try outputting them.

    This mode has associated overhead with it, but it is convenient

    and useful especially when merging documents that use different

    namespaces.
  • Non-repairing mode is a simple manual mode, in which the stream writer

    does not output any namespace declarations, nor map prefixes and

    namespace URIs. Application is to call appropriate output methods

    to produce valid output. The only namespace support available is

    the possibility to add bindings between prefix and namespace URI:

    this allows for using prefix-less write methods.

    This mode has very little overhead for namespace management (and if

    prefix mapping is not used, practically none), but it can lead

    to invalid output.

2.1.1. XMLStreamWriter in non-repairing (manual namespaces) mode

In this mode, application has to output all namespace declarations similar

to the way regular attributes are added:

  • Namespace output methods (writeNamespace(), writeDefaultNamespace())

    should be called AFTER outputting element that is to contain the

    declaration. The declarations do not have to be output before attributes

    that use the binding; stream writer does not verify bindings in any

    way during output.
  • If application uses 'full' write methods for elements and attributes

    (ones that take 3 arguments; local name, prefix, namespace URI), prefix

    given is output as is with no checks done regarding binding.
  • If application wants to use write methods that do NOT take prefix

    as the argument (but just local name and namespace URI), application

    is to call 'setPrefix()' (when mapping explicit prefix to a namespace

    URI) or 'setDefaultNamespace()' (when defining mapping of the default

    namespace). These bindings are guaranteed to persist for the element

    that was output last (or for root level, for the document scope), but

    some implementations may leave bindings in effect until the end of the

    document (Stax 1.0 specification does not specify life cycle for these

    bindings).

    Note that even if prefixes are bound, the stream writer still does not
    output the namespace declarations itself. Conversely, adding prefix
    bindings is not a requirement for calling
    'writeNamespace()'/'writeDefaultNamespace()': these methods are orthogonal.
  • Methods that take neither prefix nor namespace URI are assumed to

    be output with no prefix; which means that (as per XML specs) elements

    will be in the currently bound default namespace, if any, and attributes

    will not be in any namespace.
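
The rules above can be illustrated with a short sketch. This is only an
illustration under stated assumptions: the class name, the namespace URI
"http://example.com/ns" and the element names are made up, and only standard
Stax 1.0 methods are used. Note that the namespace declaration is written
explicitly with writeNamespace(), while setPrefix() only records a binding.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; namespace URI and names are made up.
class ManualNamespaceExample {
    static final String NS = "http://example.com/ns";

    static String write() throws XMLStreamException {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        // Non-repairing is the Stax default; set explicitly for clarity
        factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.FALSE);
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = factory.createXMLStreamWriter(out);

        // Bind the prefix at document scope so that write methods without
        // a prefix argument can look it up. This outputs nothing by itself.
        writer.setPrefix("ex", NS);

        // 'Full' method: the given prefix is written as is
        writer.writeStartElement("ex", "root", NS);
        // The declaration must be written by the application, AFTER the
        // start element that is to contain it
        writer.writeNamespace("ex", NS);
        // Neither prefix nor URI: the attribute is in no namespace
        writer.writeAttribute("id", "1");

        // Prefix-less method: relies on the setPrefix() binding above
        writer.writeStartElement(NS, "child");
        writer.writeCharacters("text");
        writer.writeEndElement();

        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```

If the writeNamespace() call were left out, the writer would still produce
output, but the "ex" prefix would be undeclared in the resulting document.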

2.1.2. XMLStreamWriter in repairing (automatic namespaces) mode

In repairing mode, application does not have to do anything to manage

namespace bindings and mappings. It can, if it wants to, indicate prefix

preferences. There are 2 ways to do that:

  • If application uses 'full' write methods (ones that take prefix and

    namespace URI), prefix passed is taken as the preferred prefix (if

    empty, trying to use the default namespace for elements): if prefix

    is already bound, it is used as is; if not, writer may try to bind

    it (exact behaviour is unspecified by Stax specs – Woodstox tries

    to bind it if prefix is unbound, but not if it is already bound to

    another namespace URI).
  • Application may also indicate preferred binding of namespaces by

    calling 'setPrefix()' and 'setDefaultNamespace()' methods. These

    will indicate preference that will be used when using write methods

    that only take the namespace URI.
  • Write methods that take neither namespace URI nor prefix behave as

    in non-repairing mode, ie. they will output elements and attributes

    that have no prefixes, and bind respectively as per xml specification

    (elements to currently active default namespace, if any, attributes

    belong to no namespace).

If a namespace binding is needed and either no preference is found, or

the preference can not be used (for example, different binding for the

prefix is already output for the current element), stream writer will

generate an implementation-dependent prefix to bind (and ensure it

does not collide with other bindings).
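
A comparable sketch for repairing mode follows. Again, the class name,
namespace URI and element names are assumptions made for illustration. Since
the exact prefixes chosen by the writer are implementation dependent, the
sketch does not rely on the textual output; instead a small helper re-parses
the result and lists the fully-qualified element names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; namespace URI and names are made up.
class RepairingNamespaceExample {
    static final String NS = "http://example.com/ns";

    static String write() throws XMLStreamException {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = factory.createXMLStreamWriter(out);

        // 'Full' method: "ex" is only a preferred prefix here
        writer.writeStartElement("ex", "root", NS);
        // URI-only method: the writer chooses a prefix bound to NS
        writer.writeStartElement(NS, "child");
        writer.writeCharacters("text");
        writer.writeEndElement();
        writer.writeEndElement();
        // No writeNamespace() calls: declarations are the writer's job
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    /** Re-parses the output and lists element names as "{uri}local". */
    static List<String> qualifiedNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getName().toString());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = write();
        System.out.println(xml);
        System.out.println(qualifiedNames(xml));
    }
}
```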

2.2. Text handling: Why do I get these short partial segments?

By default StAX readers are allowed to return text and CDATA segments in

parts, ie. more than one event per physical segment. This is usually done

so that readers need not allocate big contiguous memory buffers for

long text segments. With default settings, it is possible to sometimes

get as little as 64 characters per event, even if the text/CDATA segment

itself was significantly longer.

However, you can easily change this behaviour. There are two properties

you can modify (check documentation for details):

  • IS_COALESCING is a standard StAX property; turning it to true will

    force reader to coalesce ALL adjacent text/CDATA segments into just

    one text event. This may make it easier to process document. Downside

    is that it may slightly impact performance; the effect should not be

    drastic in normal use cases, however.
  • P_MIN_TEXT_SEGMENT is a Woodstox-specific property that defines the

    smallest text/CDATA fragment that reader is allowed to return. The

    default value is 64 characters; setting it to Integer.MAX_VALUE

    effectively forces reader to always return the full segment. However,

    unlike IS_COALESCING, it does not make reader coalesce adjacent

    segments. Because of this, changing this value is unlikely to have
    a big performance impact.
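
As a rough sketch of the first option, the following shows how IS_COALESCING
might be set through the standard factory. The class name, method name and
sample document are made up for illustration. The Woodstox-specific
P_MIN_TEXT_SEGMENT property is not shown, since it is not part of the
standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class showing the effect of IS_COALESCING.
class CoalescingExample {

    /** Returns the text of each CHARACTERS/CDATA/SPACE event, in order. */
    static List<String> textEvents(String xml, boolean coalesce) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX property: merge adjacent text/CDATA into one event
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalesce);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> parts = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int type = reader.next();
                if (type == XMLStreamConstants.CHARACTERS
                        || type == XMLStreamConstants.CDATA
                        || type == XMLStreamConstants.SPACE) {
                    parts.add(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return parts;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root>abc<![CDATA[def]]>ghi</root>";
        // With coalescing: a single text event is expected
        System.out.println(textEvents(xml, true));
        // Without coalescing: the reader may return several events
        System.out.println(textEvents(xml, false));
    }
}
```

Code that does not enable coalescing should be prepared to concatenate
consecutive text events itself, as the second call above illustrates.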

3. Deployment/packaging

Basic distributable jars that one needs to use Woodstox are:

(a) Stax 1.0 API jar that contains javax.xml.stream.* classes.

This is based on JSR-173 specification.

(b) Woodstox implementation jar (under appropriate license, see below)

In addition, it is possible to use the following optional jars:

  • stax2.jar contains only classes of the experimental Stax2 API

    (interfaces and classes in 'org.codehaus.stax2' package).

    These can be used by applications that want to be able to dynamically

    use extended Woodstox capabilities, if available, but otherwise

    revert to the basic Stax 1.0 API. This can be achieved by only including

    stax2.jar by default, and allowing full Woodstox jar to be included

    as an optional component.

    Note that the full Woodstox jar does contain these API classes by

    default, for convenience.

3.1 Licensing

Currently (Woodstox 2.0 and above), you can choose to use Woodstox either

according to terms of LGPL (2.1) or ASL (2.0) licenses. The choice is made

by using one of the two distributed implementation jars; each contains the
appropriate license, which determines the licensing restrictions.

Please note that the functionality provided is identical – there are

no technical differences, or reasons to use one over the other.

The choice only affects the use of that particular jar: you may use
instances of both jars for different purposes, and in each case the
licensing restrictions are determined by the specific jar used.

In general, choice depends mostly on other (Open Source) components you

are using; some limit you so that you may have to use LGPL version; others

that you have to use ASL version. This is the main reason Woodstox is

dual-licensed: to offer the choice, while maintaining some basic

Open Source restrictions on redistribution.

3.2 Functionality subsets (alternate jars)

Although it is most common to use one of 2 full standard implementation

jars, there are situations where application only needs to use subset

of Woodstox functionality. For example, some applications may only want

to use input functionality (parsing), while others only produce xml

output. Or, in some cases validation is never used.

In these cases it may be beneficial to use a jar that only contains subset

of the full functionality. These jars are smaller, and may reduce size of

application deployment, and potentially slightly reduce memory usage.

One thing to note about these subsets: due to the way Stax 1.0 is

structured, it is not possible to transparently support subsets while

implementing other parts of the API. As a result, normal Stax 1.0

factories can NOT be used with these subsets – special factory classes

need to be used directly. This makes using these jars non-portable,

and best suited for resource limited environments like mobile phones.

By default, Woodstox Ant build scripts produce the following subset jars
(using the nifty optional 'classfileset' Ant task):

  • wstx-j2me-min-input.jar contains non-validating stream reader classes;

    and excludes Event API implementation, output classes and validation

    functionality (except for classes that non-validating reader classes

    need to support API).

    NOTE: although name implies j2me compliancy, this has not been verified,

    and is likely not the case.
  • wstx-j2me-min-output.jar is similar to the above, but only contains
    non-validating stream writer functionality.
  • wstx-j2me-min-both.jar is a combination of the two above, i.e. it
    contains a non-validating cursor API (no event API) implementation.

When using input functionality, factory to use is:

com.ctc.wstx.stax.MinimalInputFactory

and when using output functionality:

com.ctc.wstx.stax.MinimalOutputFactory

both of which have a subset of the methods from XMLInputFactory and

XMLOutputFactory, respectively.

4. Implementation details

4.1 String interning

Which Strings does Woodstox intern, and when?

  • Names (prefixes and local names of elements and attributes, names

    of processing instruction targets and entities) are always intern()ed

    (and this is also visible using

    streamReader.getProperty(XMLInputFactory2.P_INTERN_NAMES))
  • Namespace URIs MAY be interned, depending on setting of

    XMLInputFactory2.P_INTERN_URIS (accessible via

    streamReader.getProperty(XMLInputFactory2.P_INTERN_URIS)).

    By default this interning is NOT done. However, URI Strings for a single

    document are still shared, so that within a single document, namespace

    URIs CAN always be compared for String identity (nsUri1 == nsUri2 is true

    if and only if they contain same String).
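
One way an application might take advantage of shared URI Strings is to try
an identity comparison first and fall back to equals(). The following is
only a sketch of that idiom: the class name, method name and sample document
are made up, and the equals() fallback is there so that the code also works
with implementations that do not share URI Strings.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; the idiom, not the class, is the point.
class NamespaceIdentityExample {

    /** Counts elements that are in the same namespace as the root element. */
    static int countInRootNamespace(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        String rootUri = null;
        boolean seenRoot = false;
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String uri = reader.getNamespaceURI();
                if (!seenRoot) {
                    rootUri = uri;
                    seenRoot = true;
                }
                // Identity check first (cheap; succeeds when URI Strings
                // are shared within the document), equals() as fallback
                if (uri == rootUri || (uri != null && uri.equals(rootUri))) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a:root xmlns:a='urn:x' xmlns:b='urn:y'>"
                   + "<a:one/><b:two/><a:three/></a:root>";
        // Prints 3 (root, one and three are in namespace urn:x)
        System.out.println(countInRootNamespace(xml));
    }
}
```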

5. Performance

5.1. How can I make Woodstox work as fast as possible?

Although the default settings of Woodstox are chosen to allow efficient operation, there are things that the application needs to do to help.

Here are some of the more important things to do:

  • Reuse factories (XMLInputFactory, XMLOutputFactory, validation schema factories). This is important because:
    • Instantiating factories through the Stax API is costly (although actual construction of Woodstox factories is less so)
    • Most caches are per-factory: symbol (element, attribute name) caching, DTD caching.
  • Let Woodstox take care of character encoding: pass InputStreams and OutputStreams as is, without trying to help by wrapping them in Readers or Writers. Similarly, if you have a File or URL, consider using these (via Stax2 create methods), instead of constructing InputStreams.
  • Close XMLStreamReader and XMLStreamWriter instances when you are done with them: this allows Woodstox to possibly reuse underlying buffers.

So how significant are these simple rules? They are most important when dealing with small documents: in these cases the difference can be an order of magnitude. For bigger documents the effects are more limited, but still significant.
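
The three rules can be combined roughly as follows. This is a minimal sketch under stated assumptions: the class and method names are made up, only standard Stax 1.0 calls are used (the Stax2 File/URL create methods are not shown), and the shared factory is assumed to be fully configured before it is used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class combining the rules listed above.
class ReusedFactoryExample {
    // Looked up once and reused for every document
    private static final XMLInputFactory INPUT_FACTORY = XMLInputFactory.newInstance();

    /** Counts the elements in one document read from a raw byte stream. */
    static int countElements(InputStream in) throws XMLStreamException {
        // Pass the stream as is: the parser handles the character encoding
        XMLStreamReader reader = INPUT_FACTORY.createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            // Closing the reader lets the implementation recycle buffers;
            // note that it does not close the underlying InputStream
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] doc = "<?xml version='1.0' encoding='UTF-8'?><a><b/><c/></a>"
            .getBytes(StandardCharsets.UTF_8);
        // Many small documents, one factory: prints 3 three times
        for (int i = 0; i < 3; i++) {
            System.out.println(countElements(new ByteArrayInputStream(doc)));
        }
    }
}
```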