Showing posts with label xml. Show all posts

Tuesday, December 2, 2008

Completely ignoring DTD with SAXBuilder/JDOM

I recently worked on a project where we were sent a large number of XML documents, all bearing a SYSTEM doctype. The source had provided a fairly large zip file of XML DTDs for those documents, all organized into subfolders.

Not horrible, BUT they were obviously a Windows shop, because the DTD paths in the doctype declarations were inconsistent and didn't necessarily match the case of the provided DTD files.

As an example:
<!DOCTYPE MLB_SCORING_UPDATE SYSTEM "xmldtds/Major League Baseball/MLB_SCORING_UPDATE.dtd">
To make matters more annoying, they weren't even consistent with the main directory "xmldtds" here, "Xmldtds" there... fine if you're on Windows, not so great for a case-sensitive filesystem.

Sure, I could have written a script to down-case all the files in the directory structure AND captured the xml string before I transformed it into a document and downcased the dtd path through a regex/replace... but that's a lot of kludging for something I don't really care about... I'm not validating the document, I just care that it's well-formed.

First thought, this should be easy, I'll just tell it not to validate:

SAXBuilder builder = new SAXBuilder(false);
builder.setValidation(false);
Oops, actually that doesn't work... it still tries to find the DTD (and throws an exception since that filepath doesn't exist on my system).

Ok, second thought - there's an EntityResolver interface, that should be helpful... I can just grab the systemId when it hits my overridden resolveEntity() method and handle it there...
package org.xml.sax;

public interface EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException;
}
Only problem, that doesn't seem to work. It tries to resolve the SYSTEM dtd (and blows up with an exception) before it hits my overridden method.

After a bunch of searching and wading through many posts where other people also wanted to just ignore the DTDs, I finally chanced upon a solution. You need to explicitly set the features on the SAXBuilder. Note that the two apache.org features are specific to Xerces-based parsers.

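// Needs java.io.IOException, java.io.StringReader, org.jdom.Document,
// org.jdom.JDOMException and org.jdom.input.SAXBuilder (JDOM 1.x).
private Document parseXmlDocumentFromString(String input) throws JDOMException, IOException {
    SAXBuilder builder = new SAXBuilder(false);
    builder.setValidation(false);
    // Standard SAX feature: turn validation off.
    builder.setFeature("http://xml.org/sax/features/validation", false);
    // Xerces features: don't load the DTD grammar or fetch the external DTD at all.
    builder.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    return builder.build(new StringReader(input));
}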
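
If you aren't using JDOM, the same idea can probably be sketched against the JDK's own DocumentBuilderFactory. This is only a minimal sketch and not something from the project above: the class name IgnoreDtdSketch and the parse helper are made up for illustration, and it assumes the default parser is the Xerces-derived one bundled with the JDK, since the load-external-dtd feature is Xerces-specific.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named here only for illustration.
class IgnoreDtdSketch {
    // Xerces-specific feature; assumed to be recognized by the JDK's bundled parser.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the XML string without validating. When loadExternalDtd is false,
    // the parser is asked not to fetch the DTD named in the doctype at all.
    static Document parse(String xml, boolean loadExternalDtd)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, loadExternalDtd);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Same doctype as the example above; the DTD path does not need to exist.
        String xml = "<!DOCTYPE MLB_SCORING_UPDATE SYSTEM "
                + "\"xmldtds/Major League Baseball/MLB_SCORING_UPDATE.dtd\">"
                + "<MLB_SCORING_UPDATE/>";
        Document doc = parse(xml, false);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Passing true for loadExternalDtd should reproduce the original problem: the parser goes looking for the DTD even though validation is off, and fails when the file isn't there.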

Hope this saves someone else some time.