Tuesday, December 2, 2008

Completely ignoring DTD with SAXBuilder/JDOM

I recently worked on a project where we were sent a large number of XML documents, all bearing a SYSTEM doctype. The source had provided a somewhat large zip file of XML DTDs for those documents, all organized into subfolders.

Not horrible, BUT they were obviously a Windows shop: the DTD paths in the doctype declarations weren't consistent, and they didn't necessarily match the case of the provided DTD files.

As an example:
<!DOCTYPE MLB_SCORING_UPDATE SYSTEM "xmldtds/Major League Baseball/MLB_SCORING_UPDATE.dtd">
To make matters more annoying, they weren't even consistent with the main directory: "xmldtds" here, "Xmldtds" there... fine if you're on Windows, not so great for a case-sensitive filesystem.

Sure, I could have written a script to down-case all the files in the directory structure AND down-cased the DTD path in each XML string with a regex/replace before parsing it into a document... but that's a lot of kludging for something I don't really care about. I'm not validating the documents, I just care that they're well-formed.

First thought: this should be easy, I'll just tell it not to validate:

SAXBuilder builder = new SAXBuilder(false);
builder.setValidation(false);
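Oops, actually that doesn't work... it still tries to find the DTD (and throws an exception since that filepath doesn't exist on my system). Turning validation off only stops the parser from checking the document against the DTD; it will still load the DTD to pick up things like entity declarations and attribute defaults.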

Ok, second thought - there's an EntityResolver interface, that should be helpful... I can just grab the systemId when it hits my overridden resolveEntity() method and handle it there...
package org.xml.sax;

import java.io.IOException;

public interface EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException;
}
Only problem: that didn't seem to work for me either. It tried to resolve the SYSTEM DTD (and blew up with an exception) before it ever hit my overridden method.
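For reference, this is roughly how the resolver approach is usually wired up. It's only a sketch, written against the JDK's built-in DocumentBuilder rather than my JDOM setup so that it stands on its own, and the class name EmptyDtdResolverSketch is made up for illustration. The idea is to answer every external entity lookup, the doctype's DTD included, with empty content. Whether the resolver is consulted early enough may depend on the parser and on how the resolver gets registered.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JDOM or the JDK.
final class EmptyDtdResolverSketch {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // EntityResolver has a single method, so a lambda works here.
        // Every external entity (the DTD included) is replaced by an empty stream.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

The resolver has to be registered on the builder before parse() is called. JDOM's SAXBuilder has a setEntityResolver() method for the same purpose.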

After a bunch of searching and wading through many posts where other people also wanted to just ignore the DTDs, I finally chanced upon a solution. You need to explicitly set the features on the SAXBuilder.

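import java.io.IOException;
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

private Document parseXmlDocumentFromString(String input) throws JDOMException, IOException {
    SAXBuilder builder = new SAXBuilder(false); // false = non-validating
    builder.setValidation(false);
    builder.setFeature("http://xml.org/sax/features/validation", false);
    // Xerces-specific features: don't use the DTD grammar, and don't load the external DTD at all
    builder.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    return builder.build(new StringReader(input));
}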
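If you aren't using JDOM, the load-external-dtd feature can also be set on the JDK's own DocumentBuilderFactory. Here's a minimal sketch, assuming the default Xerces-derived parser that ships with the JDK is the one in use; a different parser on the classpath may reject the feature with a ParserConfigurationException. The class name IgnoreDtdSketch is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JDOM or the JDK.
final class IgnoreDtdSketch {

    // Xerces-specific feature URI, the same one used with SAXBuilder.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Assumption: the underlying parser recognizes this feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Calling IgnoreDtdSketch.parse() on a document that carries the doctype from the example at the top should return a Document without ever looking for the DTD file. The document still has to be well-formed, since nothing here relaxes that check.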

Hope this saves someone else some time.

20 comments:

Anonymous said...

Hi
Thanks for this solution .. I had the same problem

Cya

Ben Leadholm said...

Had the same problem. I'll reference your solution for others. Again, THANKS!!

Anonymous said...

Thanks for your solution!

Anonymous said...

you are the man. thanks.

henning said...

thanks a lot - this solution worked for me!

Al said...

It works! thx.

Timbutu said...

Saved me time, so perfect! Thanks.

Anonymous said...

Thanks a lot!

Anonymous said...

Hello,

I tried this but I got:

Error 500: org/jdom/input/SAXBuilder.setFeature(Ljava/lang/String;Z)V

any clue? what jar are you using for SAX builder?

I am using achexxml-2.0.jar and achexmapi.jar

Anonymous said...

Thanks! You solved my problem, too!

Anonymous said...

Thank you very much!!

Anonymous said...

Thanks a lot man !

David K Newton said...

Adding to the pile of thanks! Oddly this also works on a SAXParserFactory where you _do_ want to do validation against an XSD, by setting saxFactory.setValidating(true); and then setting those three features to false...

Perhaps best not to think too much about it ;)

Anonymous said...

Muchas gracias!!!!

I was suffering with hibernate-cfg.xml

Thank you very much!!!!

Anonymous said...

You saved my life man !

Anonymous said...

Thanks for saving my time.

Anonymous said...

Thanks for the solution!

Anonymous said...

Thank you so much for the solution!

I actually have XML files without a DOCTYPE declaration or a DTD. I have to figure out what to do to ignore DOCTYPE as well.

Anonymous said...

Thank you <3

Anonymous said...

thx for the solution, good work.