java - Ignore org.xml.sax.SAXParseExceptions when converting an XML string to org.w3c.dom.Document?

- September 15, 2012

I have a lot of HTML pages (I mean their source code) represented as a java.util.List of strings in Java. I need to convert them to Document objects in Java (from the package org.w3c.dom).

I do it this way, using DocumentBuilderFactory and Document:

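import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public static org.w3c.dom.Document inputStream2Document(InputStream inputStream)
        throws IOException, SAXException, ParserConfigurationException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false); // no DTD validation
    org.w3c.dom.Document parse = dbf.newDocumentBuilder().parse(inputStream);
    return parse;
}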

Some of the pages are converted correctly, but others have, for example, incorrectly written attributes and are not valid (attributes without ="..."). It looks like this:

<a href="somepage.html" someattr>

where the incorrectly written attribute is "someattr". In those cases I get exceptions, for example:

nested exception: org.xml.sax.SAXParseException; lineNumber: 7558; columnNumber: 71; Element type "a" must be followed by either attribute specifications, ">" or "/>".

nested exception: org.xml.sax.SAXParseException; lineNumber: 109; columnNumber: 32; The string "--" is not permitted within comments.

Is there a way to make DocumentBuilderFactory ignore these exceptions? I want to convert these pages to Document objects, and I do not mind that they are not valid.

<a href="somepage.html" someattr> is not XML, so an XML parser will never be able to parse it. It is reasonable HTML, though, so try an HTML parser such as NekoHTML instead of an XML parser. There are examples on NekoHTML's usage page showing how to parse both complete documents and fragments of HTML into DOM nodes.

import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;

// NekoHTML's DOM parser accepts HTML that is not well-formed XML
DOMParser parser = new DOMParser();
InputSource in = new InputSource(new StringReader(theHtmlString));
parser.parse(in);
Document doc = parser.getDocument();
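To see why the XML parser cannot simply be told to skip these errors, here is a minimal sketch using only the JDK's built-in parser. The class name WellFormednessDemo and the helper isWellFormed are made up for illustration and are not part of the question or of any library. The sketch assumes the markup is passed in as a String. It installs a DefaultHandler, which silently ignores warnings and recoverable errors, yet the malformed samples from the question still fail, because they are fatal well-formedness errors rather than validation errors.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessDemo {

    // Hypothetical helper: true if the JDK XML parser accepts the markup.
    static boolean isWellFormed(String markup)
            throws ParserConfigurationException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false); // only switches off DTD validation
        DocumentBuilder builder = dbf.newDocumentBuilder();
        // DefaultHandler ignores warnings and recoverable errors,
        // but rethrows fatal (well-formedness) errors.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser gave up; no Document is produced
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed: every attribute has a quoted value.
        System.out.println(isWellFormed("<a href=\"somepage.html\">x</a>"));
        // Attribute without a value, as in the question.
        System.out.println(isWellFormed("<a href=\"somepage.html\" someattr>x</a>"));
        // "--" inside a comment, as in the second exception.
        System.out.println(isWellFormed("<p><!-- a -- b --></p>"));
    }
}
```

Only the first call succeeds. For the other two the parser stops at the first fatal error and never builds a Document, which is why switching to an HTML parser is suggested rather than adjusting the factory settings.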
