Split PDF into separate files based on text
I have a large single PDF document which consists of multiple records. Each record usually takes one page, but some use two pages. A record starts with a defined text, which is always the same.
My goal is to split this PDF into separate PDFs, and the split should always happen before the "header text" is found.
Note: I am looking for a tool or library using Java or Python. It must be free and available on Win 7.
Any ideas? AFAIK ImageMagick won't work for this. Could iText do this? I have never used it and it's pretty complex, so I would need some hints.
Edit:
The marked answer led me to the solution. For completeness, here is my exact implementation:
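    // Imports assume the iText 5.x package names (com.itextpdf.*).
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.regex.Pattern;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfDictionary;
    import com.itextpdf.text.pdf.PdfName;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    // Both methods belong to a class that also holds a `logger` field (not shown in the post).

    public void splitByRegex(String filepath, String regex, String destinationDirectory,
            boolean removeBlankPages) throws IOException, DocumentException {
        logger.entry(filepath, regex, destinationDirectory);
        destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
        PdfReader reader = null;
        Document document = null;
        PdfCopy copy = null;
        Pattern pattern = Pattern.compile(regex);

        try {
            reader = new PdfReader(filepath);
            final String result = destinationDirectory + "/record%d.pdf";
            // Loop over all pages in the original PDF. Page numbers are 1-based,
            // so the bound must be inclusive or the last page is skipped.
            int n = reader.getNumberOfPages();
            for (int i = 1; i <= n; i++) {
                final String text = PdfTextExtractor.getTextFromPage(reader, i);
                if (pattern.matcher(text).find()) {
                    // A header was found: finish the previous record, start a new one.
                    if (document != null && document.isOpen()) {
                        logger.debug("Match found. Closing previous document..");
                        document.close();
                    }
                    String filename = String.format(result, i);
                    logger.debug("Match found. Creating new document " + filename + "...");
                    document = new Document();
                    copy = new PdfCopy(document, new FileOutputStream(filename));
                    document.open();
                    logger.debug("Adding page to document...");
                    copy.addPage(copy.getImportedPage(reader, i));
                } else if (document != null && document.isOpen()) {
                    logger.debug("Found open document. Adding additional page to document...");
                    // Skip the page only if blank-page removal is requested AND the page
                    // is blank. (Testing "removeBlankPages && !isBlankPage" would drop
                    // every continuation page when removeBlankPages is false.)
                    if (!removeBlankPages || !isBlankPage(reader, i)) {
                        copy.addPage(copy.getImportedPage(reader, i));
                    }
                }
            }
            logger.exit();
        } finally {
            if (document != null && document.isOpen()) {
                document.close();
            }
            if (reader != null) {
                reader.close();
            }
        }
    }

    private boolean isBlankPage(PdfReader reader, int pageNumber) throws IOException {
        // see http://itext-general.2136553.n4.nabble.com/detecting-blank-pages-td2144877.html
        PdfDictionary pageDict = reader.getPageN(pageNumber);
        // We need to examine the resource dictionary for /Font or
        // /XObject keys. If either are present, they're almost certainly
        // used on the page -> not blank.
        // getAsDict resolves indirect references, which a plain cast of get() would not.
        PdfDictionary resDict = pageDict.getAsDict(PdfName.RESOURCES);
        if (resDict != null) {
            return resDict.get(PdfName.FONT) == null && resDict.get(PdfName.XOBJECT) == null;
        } else {
            return true;
        }
    }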
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult the iText in Action, 2nd Edition code samples, which are available online and searchable by keyword.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF into multiple smaller PDFs. The central code:
    PdfReader reader = new PdfReader("allrecords.pdf");
    final String result = "record%d.pdf";
    // We'll create as many new PDFs as there are pages
    Document document;
    PdfCopy copy;
    // loop over all the pages in the original PDF
    int n = reader.getNumberOfPages();
    for (int i = 0; i < n; ) {
        // step 1
        document = new Document();
        // step 2
        copy = new PdfCopy(document, new FileOutputStream(String.format(result, ++i)));
        // step 3
        document.open();
        // step 4
        copy.addPage(copy.getImportedPage(reader, i));
        // step 5
        document.close();
    }
    reader.close();

This sample splits a PDF into single-page PDFs. In your case you need to split by different criteria, but that only means that in the loop you sometimes have to add more than one imported page (and thus decouple the loop index and the page numbers to import).
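To illustrate what decoupling the loop index from the imported page numbers could look like, here is a minimal sketch in plain Java, without any iText calls. It assumes you already know, for each page, whether it starts a new record; the class name RecordRanges and the method toRanges are hypothetical names chosen for this example, not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordRanges {

    /**
     * Groups 1-based page numbers into records.
     * startsRecord[i] tells whether page (i + 1) begins a new record.
     * Pages that come before the first record start are skipped.
     * Each returned array holds {firstPage, lastPage}, both inclusive.
     */
    public static List<int[]> toRanges(boolean[] startsRecord) {
        List<int[]> ranges = new ArrayList<>();
        int first = -1; // first page of the record currently being collected
        for (int page = 1; page <= startsRecord.length; page++) {
            if (startsRecord[page - 1]) {
                if (first != -1) {
                    // the previous record ends on the page before this one
                    ranges.add(new int[] {first, page - 1});
                }
                first = page;
            }
        }
        if (first != -1) {
            // the last record runs to the end of the document
            ranges.add(new int[] {first, startsRecord.length});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical five-page document: records start on pages 1, 3 and 4.
        boolean[] starts = {true, false, true, true, false};
        for (int[] range : toRanges(starts)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

For the hypothetical input in main this prints 1-2, 3-3 and 4-5, one range per line. Each range would then correspond to one output file, with one addPage call per page in the range.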
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page into a string. The central code:
    PdfReader reader = new PdfReader("allrecords.pdf");
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        System.out.println("\nPage " + i);
        System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
    }
    reader.close();

Simply search for the record start text: if the text from a page contains it, a new record starts there.
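The search itself needs nothing beyond the standard library. The following is a rough sketch, assuming the page texts have already been extracted into a list of strings; the class RecordStartFinder, the method findStartPages, and the header "RECORD HEADER" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RecordStartFinder {

    /** Returns the 1-based numbers of all pages whose text contains the header pattern. */
    public static List<Integer> findStartPages(List<String> pageTexts, Pattern header) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            // find() matches anywhere in the page text, not only at its beginning
            if (header.matcher(pageTexts.get(i)).find()) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Hypothetical page texts; in practice these would come from the text extractor.
        List<String> pages = Arrays.asList(
                "RECORD HEADER\nAlice",
                "continued from previous page",
                "RECORD HEADER\nBob");
        // Pattern.quote treats the header as literal text rather than as a regex.
        Pattern header = Pattern.compile(Pattern.quote("RECORD HEADER"));
        System.out.println(findStartPages(pages, header));
    }
}
```

With the sample pages above, this prints [1, 3]. If the header text can vary, pass a real regular expression instead of a quoted literal.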