java - Ignore org.xml.sax.SAXParseExceptions when converting an XML string to org.w3c.dom.Document?
I have a lot of HTML pages (I mean their source code) represented as a java.util.List of Strings in Java. I need to convert them to Document objects (from the package org.w3c.dom).
I do it this way, with DocumentBuilderFactory and Document:
public static org.w3c.dom.Document inputStream2Document(InputStream inputStream)
        throws IOException, SAXException, ParserConfigurationException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    org.w3c.dom.Document parse = dbf.newDocumentBuilder().parse(inputStream);
    return parse;
}
Some of the pages are converted correctly, but there is a problem with other pages that have, for example, badly written attributes and are not valid (attributes without ="..."). It looks like this:
<a href="somepage.html" someattr>
Here the badly written attribute is called "someattr". In these cases I get exceptions, for example:
nested exception: org.xml.sax.SAXParseException; lineNumber: 7558; columnNumber: 71; Element type "a" must be followed by either attribute specifications, ">" or "/>".
or
nested exception: org.xml.sax.SAXParseException; lineNumber: 109; columnNumber: 32; The string "--" is not permitted within comments.
Is there a way to make DocumentBuilderFactory ignore these exceptions? I want to convert these pages to Document objects, and I do not mind that they are not valid.
<a href="somepage.html" someattr>
is not XML, so an XML parser will never be able to parse it. It is reasonable HTML, though, so try an HTML parser such as NekoHTML instead of an XML parser. There are examples on NekoHTML's usage page showing how to parse both complete documents and fragments of HTML into DOM nodes.
import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;

DOMParser parser = new DOMParser();
InputSource in = new InputSource(new StringReader(theHtmlString));
parser.parse(in);
Document doc = parser.getDocument();
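To see why the XML parser from the question cannot simply be told to skip these errors, here is a minimal sketch using only the JDK. The class name WellFormednessDemo and the helper tryParse are hypothetical, made up for this illustration. It assumes the JDK's default parser, for which setValidating(false) only switches off DTD validation. A missing attribute value is a well-formedness error, which the parser reports as fatal even when an ErrorHandler is installed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the original question or answer.
class WellFormednessDemo {

    /** Returns null if the text parses as XML, otherwise the parser's error message. */
    static String tryParse(String text) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false); // disables DTD validation only
            DocumentBuilder builder = dbf.newDocumentBuilder();
            // Swallow warnings and recoverable errors; fatal errors are rethrown,
            // because the parser cannot continue after a well-formedness error.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: every attribute has a quoted value.
        System.out.println(tryParse("<a href=\"somepage.html\">ok</a>"));
        // Not well-formed: "someattr" has no value, as in the question.
        System.out.println(tryParse("<a href=\"somepage.html\" someattr>broken</a>"));
    }
}
```

The second call returns an error message rather than a Document, so with this parser the only options are to fix the input first or to switch to an HTML parser as the answer suggests.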