
Woodstox

High-performance XML processor.


Optimizing Woodstox Performance

Optimizing Woodstox performance consists of two main parts: configuring Woodstox objects (the StAX input, output and event factories, and the reader/writer instances they create), and using these objects in certain ways. Some of the information only applies to Woodstox; some may be applicable to other StAX implementations as well.

Specific documentation regarding Woodstox configuration settings can be found on the Configuring page.

In addition to this page, you may also be interested in the following related external articles:

General StAX Usage

Try to reuse factories (input, output, event)

The most important general usage guideline is that you should try to reuse factory instances when possible, for similar use cases. This is because much of the caching done by instances (especially readers) is done on a per-factory basis. For example, symbol tables for readers are shared between all instances created by an input factory. As such, reusing a factory instance allows the cache to be used more efficiently by reader instances.

There are two concerns related to factory reuse:

  • Woodstox factories are thread-safe after the configuration phase (calls to setProperty()), but not during it. Most importantly, once all configuration is done, calling 'createXMLxxx' methods is fully thread-safe. So as long as the configuration of the factory instance is done once, it is fine to use the same instance concurrently from multiple threads (Reader and Writer instances themselves are not thread-safe, but there is seldom a need to use them from multiple threads).
  • In the case of reader symbol tables, an entry is created for every element name. Thus, if a single global instance is used, the symbol table grows to encompass all element and attribute names (including prefixes). Generally this is NOT a problem (after all, most XML vocabularies are bounded), but for very large systems, or ones that generate dynamic element names, this might be a concern. New: as of Woodstox version 3.0, there are sanity checks to reset "too big" symbol tables, so that the size does not grow without bounds.
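
To illustrate the reuse pattern, here is a rough sketch (class name and the particular property shown are just examples; the types are from the standard javax.xml.stream API): the factory is configured once, shared across threads, and a new reader is created per parse.

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    import java.io.InputStream;

    public class ParseHelper
    {
        // Configure once; after configuration the factory (and its
        // symbol table cache) can safely be shared by multiple threads.
        private static final XMLInputFactory INPUT_FACTORY = XMLInputFactory.newInstance();
        static {
            INPUT_FACTORY.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        }

        public static void parse(InputStream in) throws Exception
        {
            // Reader instances themselves are not thread-safe: create one per parse
            XMLStreamReader sr = INPUT_FACTORY.createXMLStreamReader(in);
            try {
                while (sr.hasNext()) {
                    sr.next();
                    // ... process events ...
                }
            } finally {
                sr.close();
            }
        }
    }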

Let Woodstox handle input conversions

If you need to read XML content from a stream, or write content to a stream, it is practically always preferable to let Woodstox deal with character conversion (and buffering), instead of constructing your own java.io.Reader/Writer instances. Woodstox may be able to use its internal optimized converters; even if it cannot, it will just fall back on the JDK defaults that you would probably be using in the first place.

Specifically, when dealing with UTF-8 streams, Woodstox's I/O performance is better when it can do its own handling. The difference is less significant for single-byte encodings like ISO-8859-1 (== ISO-Latin1).
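
For example (a fragment only; 'xmlInputFactory' stands for any previously configured XMLInputFactory, and the file name is made up), pass the raw stream instead of wrapping it yourself:

    // Preferred: Woodstox detects the encoding and can use its own
    // optimized UTF-8 decoding and buffering
    XMLStreamReader sr = xmlInputFactory.createXMLStreamReader(
            new FileInputStream("data.xml"));

    // Less efficient: wrapping the stream yourself forces use of a
    // general-purpose JDK decoder
    XMLStreamReader sr2 = xmlInputFactory.createXMLStreamReader(
            new InputStreamReader(new FileInputStream("data.xml"), "UTF-8"));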

Clean up after you are done

Although it is always a good idea to clean up after you are done – like closing InputStreams, OutputStreams, Readers and Writers – there are also some non-obvious performance benefits to closing XMLStreamReader and XMLStreamWriter instances. Woodstox may try to reuse some of the more costly objects (like underlying buffers and symbol tables), and it can only do so when it knows for sure that a reader or writer no longer needs these objects. Thus, closing stream readers and writers as soon as you are done with them can improve performance.
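
A typical pattern (a fragment, assuming an existing factory 'xmlInputFactory' and InputStream 'in') is to close the reader in a finally block:

    XMLStreamReader sr = xmlInputFactory.createXMLStreamReader(in);
    try {
        while (sr.hasNext()) {
            sr.next();
            // ... process events ...
        }
    } finally {
        // Allows Woodstox to recycle buffers and symbol tables; note that,
        // per the StAX specification, close() does not close the underlying stream
        sr.close();
    }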

Only use Event API if you need it

The StAX Event API achieves one goal: it allows for persistence of parsing events. That is, the event objects returned are not transient; one can store them during parsing, pipeline them to processors, and so on. It does not add any other functionality. What it does add, however, is the overhead of constructing an event object for practically every event, whether you need it or not.

So, if you do not need to persist event data (i.e. you can do a single pass), or if you have your own object model that does that, it is preferable to just access data using the "raw" cursor API.

Of course, if you do need to persist these events for a while (or permanently), the Event API does not add unreasonable overhead: it is usually good enough for most use cases.
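
As a point of comparison, a single-pass cursor-API loop (a fragment, again assuming a configured factory and an input stream) allocates no per-event objects:

    XMLStreamReader sr = xmlInputFactory.createXMLStreamReader(in);
    while (sr.hasNext()) {
        switch (sr.next()) {
        case XMLStreamConstants.START_ELEMENT:
            // element name, attributes etc. are read directly from the reader
            break;
        case XMLStreamConstants.CHARACTERS:
            // text content is read directly from the reader
            break;
        }
    }
    sr.close();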

Use non-String text accessors

There is significant overhead in constructing String objects, compared to accessing "raw" character arrays via the cursor API. This is mainly because constructing a String creates a copy of the underlying character array that Woodstox has parsed into its internal buffers.

If you do not specifically need String objects for further processing (for example, if you will immediately output text content to another stream), it will be more efficient to use char[] accessors and avoid String construction overhead.
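
For example (a sketch; 'sr' is an XMLStreamReader and 'out' a java.io.Writer), the text content of a CHARACTERS event can be copied straight to the output:

    if (sr.getEventType() == XMLStreamConstants.CHARACTERS) {
        // No intermediate String is constructed; content is copied directly
        // from the reader's internal buffer
        out.write(sr.getTextCharacters(), sr.getTextStart(), sr.getTextLength());
    }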

Generic StAX configuration settings

Reader settings

Enabling the following basic StAX properties will generally have a negative performance impact:

  • IS_COALESCING will generally add something like 10% overhead, since it requires all adjacent text/CDATA nodes to be combined into a single event. Even for documents that have no such adjacent events, checking for their possible existence adds overhead, due to the lookahead required.
  • IS_VALIDATING will add some overhead (anywhere from 10% to 40%), depending on the complexity of the DTD (and possibly more if DTD caching is disabled). So enable it only if you need it. Note: if you only need to be able to resolve entities, it is enough to enable SUPPORT_DTD (and IS_REPLACING_ENTITY_REFERENCES, but that is on by default); validation is not needed.
  • SUPPORT_DTD in and of itself (without IS_VALIDATING) does not have a major effect, as long as DTDs are cached. It does allow for entity expansion, as well as getting type information for attributes (to know which attributes are declared as IDs, for example), and defaulting of attribute values.
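
For reference, these properties are set on the input factory before any readers are created. A minimal sketch (the values shown are simply the performance-friendly settings discussed above, which happen to match the StAX defaults):

    XMLInputFactory f = XMLInputFactory.newInstance();
    // Leave coalescing and validation disabled unless actually needed
    f.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
    f.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
    // Entity expansion only requires DTD support, not validation
    f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
    f.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);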

Writer settings

For writers, the only defined generic property (IS_REPAIRING_NAMESPACES) adds overhead, so you may want to enable it only if you need it (that is, if you need to merge possibly conflicting namespaces, or do not want to track whether a namespace has been declared). However, the overhead is not considered very high.

Note, too, that you seldom need to use 'setPrefix()' at all, since it is only used by Woodstox as either a suggestion (in repairing mode), or as a mapping to use when the prefix is omitted in a call (in non-repairing mode). setPrefix() has minor overhead, in that the writer then has to keep track of these suggestions over its life-cycle: but this is seldom if ever of real concern.
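
A minimal sketch of the corresponding output-factory configuration (the value shown is the StAX default):

    XMLOutputFactory of = XMLOutputFactory.newInstance();
    // Namespace repairing adds some overhead; enable it only when you need
    // the writer to resolve or generate namespace declarations for you
    of.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.FALSE);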

Woodstox-specific configuration settings (2.0.x and up)

Reader settings

If you don't care about type of linefeeds, disable normalization

Prior to Woodstox 4.0, it was possible to suppress some input normalization processing. This is no longer possible; nor did it have noticeable performance impact even when it was possible.

This feature was removed since it added complexity to the implementation, and it is in general considered a bad idea as it reduces XML conformance.

Writer settings

If you do not need namespaces, disable namespace support

Although the overhead of enabling namespaces is quite limited, it is nonetheless true that if you do not need namespace support (do not use namespaces), there is little point in enabling it. So, you can just set XMLInputFactory2.P_NAMESPACE_AWARE to Boolean.FALSE, and avoid that overhead.
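
A sketch using the standard StAX property name for this setting (the exact Woodstox-specific constant varies by version, so treat the mapping as an assumption to verify against your release):

    XMLInputFactory f = XMLInputFactory.newInstance();
    // Turn off namespace processing when documents do not use namespaces
    f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.FALSE);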

(to be continued)

XML Content Structuring and Usage Patterns

DTDs

DTD processing is one area where the features used do have a clear effect on performance. Although performance should not be (and often is not) the main factor in deciding how to define DTDs, it is good to know which features have associated overhead (at least with Woodstox DTD validation).

Avoid using DTD internal subsets

If possible, do not use internal DTD subsets (the internal DTD subset is the part of the DTD embedded in the document's !DOCTYPE declaration, as opposed to residing in an external file). Although processing this subset is no slower than processing an external subset, this part is not cacheable, and it may even prevent using a cached external subset (when entity definitions are overridden).

Avoid attribute default values

If possible, it is a good idea to avoid defining default values for attributes, especially from a performance standpoint. While the overhead may not be huge (5-10% for parsing), it does exist.