C# - PDFBox 0.7.3: convert PDF to text


I want to convert PDF files to text files, but some of my PDF files do not work with the PDFBox DLL because they were created with a version of Acrobat newer than 5.x.

Please tell me what to do.

output.WriteLine("begin parsing.....");
output.WriteLine(DateTime.Now.ToString());

PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();

output.Write(stripper.getText(doc));

Your first attempt should be to try a current version of PDFBox. Version 0.7.3 dates from 2006! PDFBox has meanwhile become an Apache project and is located here: http://pdfbox.apache.org/. The current version (as of May 2013) is 1.8.1, and I'm sure PDFBox nowadays supports PDF object streams and cross-reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
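To get a quick hint whether a given file is one of those newer PDFs, you can read the version from its header comment (`%PDF-x.y` at the start of the file). The following is only a rough sketch in plain Java (the language PDFBox itself is written in), using nothing but the standard library; the class and method names are made up for illustration and are not part of PDFBox or iText. Treat the result as a hint only: a `/Version` entry in the document catalog can override the header, and some producers write a header that does not match the features they actually use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any PDF library.
final class PdfHeaderVersion {

    // The header comment "%PDF-x.y" normally sits at the very start of the file.
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /** Returns the header version, e.g. "1.5", or null if no header is found. */
    public static String parse(byte[] firstBytes) {
        String head = new String(firstBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    /**
     * True if the version is 1.5 or later, i.e. the file is allowed to use
     * object streams and cross-reference streams.
     */
    public static boolean mayUseObjectStreams(String version) {
        if (version == null) {
            return false;
        }
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major > 1 || (major == 1 && minor >= 5);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderVersion <file.pdf>");
            return;
        }
        // Only the first kilobyte is inspected; the header is expected there.
        byte[] buf = new byte[1024];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(buf);
        }
        String version = parse(Arrays.copyOf(buf, Math.max(n, 0)));
        System.out.println("header version: " + version);
        System.out.println("may use object/xref streams: " + mayUseObjectStreams(version));
    }
}
```

If this reports 1.5 or later for exactly the files that fail, the old PDFBox version is the likely culprit. The same few lines translate directly to C#.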

If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x, if the AGPL (or alternatively buying a license) is no problem for you.

Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition. The samples from that chapter can be found online for both Java and .NET.

For a first test, the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java is a good place to start. The central code:

PdfReader reader = new PdfReader(pdf_file);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}

If neither the current PDFBox version nor the current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop information required for text parsing from a PDF...

