java - Ignore org.xml.sax.SAXParseException when converting an XML string to org.w3c.dom.Document?


I have a lot of HTML pages (I mean their source code) represented as a java.util.List of strings in Java. I need to convert them to Document objects in Java (from the package org.w3c.dom).

I do it this way, using DocumentBuilderFactory and Document:

public static org.w3c.dom.Document inputStream2Document(InputStream inputStream)
        throws IOException, SAXException, ParserConfigurationException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    org.w3c.dom.Document parse = dbf.newDocumentBuilder().parse(inputStream);
    return parse;
}

Some of the pages are converted correctly, but there is a problem with other pages that have, for example, badly written attributes that are not valid (attributes without ="..."; it looks like

<a href="somepage.html" someattr> 

for the badly written attribute called "someattr"). In these cases I get exceptions, for example

nested exception: org.xml.sax.SAXParseException; lineNumber: 7558; columnNumber: 71; Element type "a" must be followed by either attribute specifications, ">" or "/>".

or

nested exception: org.xml.sax.SAXParseException; lineNumber: 109; columnNumber: 32; The string "--" is not permitted within comments.

Is there a way to make DocumentBuilderFactory ignore these exceptions? I want to convert these pages to Document, and I do not mind that they are not valid.

<a href="somepage.html" someattr> is not XML, so an XML parser will never be able to parse it. It is reasonable HTML, though, so try an HTML parser such as NekoHTML instead of an XML parser. There are examples on NekoHTML's usage page showing how to parse both complete documents and fragments of HTML into DOM nodes.

import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
import java.io.StringReader;

DOMParser parser = new DOMParser();
InputSource in = new InputSource(new StringReader(theHtmlString));
parser.parse(in);
Document doc = parser.getDocument();
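To see why the XML parser cannot simply be told to ignore these errors, here is a minimal sketch using only the JDK's built-in JAXP parser. The class name LenientXmlAttempt and the method parseOrNull are hypothetical names made up for this illustration. It installs an ErrorHandler that silently swallows every callback, which is the closest thing to "ignore exceptions" that DocumentBuilder offers. Well-formedness violations such as the bare someattr attribute are fatal errors, and with the JDK's default parser parse() should still throw even though the handler does nothing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, not part of the original question or answer.
public class LenientXmlAttempt {

    /** Returns the parsed Document, or null if the XML parser rejected the input. */
    public static Document parseOrNull(String markup) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false); // only disables DTD validation, not well-formedness checks
            DocumentBuilder builder = dbf.newDocumentBuilder();

            // An ErrorHandler that swallows every callback, including fatal errors.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { }
                @Override public void fatalError(SAXParseException e) { }
            });

            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            // Covers ParserConfigurationException, SAXException and IOException.
            return null;
        }
    }

    public static void main(String[] args) {
        // Well-formed markup parses fine.
        System.out.println(parseOrNull("<a href=\"somepage.html\">ok</a>") != null);
        // The attribute without a value is a fatal well-formedness error.
        System.out.println(parseOrNull("<a href=\"somepage.html\" someattr>broken</a>") != null);
    }
}
```

Note that setValidating(false) does not help here either: it switches off validation against a DTD, while both exceptions in the question are well-formedness errors, which is why an HTML parser is the practical route.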
