Split PDF into separate files based on text


I have a large single PDF document which consists of multiple records. Each record usually takes one page, but some use two pages. A record starts with a defined text, which is always the same.

My goal is to split this PDF into separate PDFs, and the split should always happen before the "header text" is found.

Note: I am looking for a tool or library using Java or Python. It must be free and available on Win 7.

Any ideas? AFAIK ImageMagick won't work for this. Maybe iText can do this? I have never used it and it's pretty complex, so I would need some hints.

Edit:

The marked answer led me to my solution. For completeness, here is my exact implementation:

// iText 5 package names
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.regex.Pattern;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public void splitByRegex(String filePath, String regex,
        String destinationDirectory, boolean removeBlankPages) throws IOException,
        DocumentException {

    logger.entry(filePath, regex, destinationDirectory);
    // Fall back to the working directory; an empty string would resolve to "/record%d.pdf"
    destinationDirectory = destinationDirectory == null ? "." : destinationDirectory;
    PdfReader reader = null;
    Document document = null;
    PdfCopy copy = null;
    Pattern pattern = Pattern.compile(regex);

    try {
        reader = new PdfReader(filePath);
        final String result = destinationDirectory + "/record%d.pdf";
        // loop over all the pages in the original PDF
        int n = reader.getNumberOfPages();
        // page numbers are 1-based, so the last page is n (i < n would skip it)
        for (int i = 1; i <= n; i++) {

            final String text = PdfTextExtractor.getTextFromPage(reader, i);
            if (pattern.matcher(text).find()) {
                // a new record starts on this page
                if (document != null && document.isOpen()) {
                    logger.debug("Match found. Closing previous document...");
                    document.close();
                }
                String fileName = String.format(result, i);
                logger.debug("Match found. Creating new document " + fileName + "...");
                document = new Document();
                copy = new PdfCopy(document,
                        new FileOutputStream(fileName));
                document.open();
                logger.debug("Adding page to document...");
                copy.addPage(copy.getImportedPage(reader, i));

            } else if (document != null && document.isOpen()) {
                logger.debug("Found open document. Adding additional page to document...");
                // keep the page unless blank pages are to be removed and this one is blank
                if (!removeBlankPages || !isBlankPage(reader, i)) {
                    copy.addPage(copy.getImportedPage(reader, i));
                }
            }
        }
        logger.exit();
    } finally {
        if (document != null && document.isOpen()) {
            document.close();
        }
        if (reader != null) {
            reader.close();
        }
    }
}

private boolean isBlankPage(PdfReader reader, int pageNumber)
        throws IOException {

    // see http://itext-general.2136553.n4.nabble.com/detecting-blank-pages-td2144877.html
    PdfDictionary pageDict = reader.getPageN(pageNumber);
    // We need to examine the resource dictionary for /Font or
    // /XObject keys. If either is present, it is most likely
    // used on the page -> not blank.
    PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
    if (resDict != null) {
        return resDict.get(PdfName.FONT) == null
                && resDict.get(PdfName.XOBJECT) == null;
    } else {
        return true;
    }
}

You can create a tool for your requirements using iText.

Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult the iText in Action, 2nd Edition code samples, which are available online and searchable by keyword.

In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.

Burst.java shows how to split one PDF into multiple smaller PDFs. The central code:

PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";

// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
    // step 1
    document = new Document();
    // step 2
    copy = new PdfCopy(document,
            new FileOutputStream(String.format(RESULT, ++i)));
    // step 3
    document.open();
    // step 4
    copy.addPage(copy.getImportedPage(reader, i));
    // step 5
    document.close();
}
reader.close();

This sample splits a PDF into single-page PDFs. In your case you need to split by different criteria. That only means that in the loop you sometimes have to add more than one imported page (and thus decouple the loop index from the page numbers to import).
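To illustrate what decoupling the loop from the output files could look like, here is a minimal sketch that only works out which pages belong to which record. It deliberately leaves iText out so that it runs on its own. The class name RecordRanges, the method group and the boolean array are hypothetical names for this illustration, not part of iText or of the samples. In a real tool each resulting range would become one PdfCopy document.

```java
import java.util.ArrayList;
import java.util.List;

class RecordRanges {

    /**
     * Groups 1-based page numbers into records.
     * startsRecord[p - 1] is true when page p begins a new record.
     * Pages before the first record start are skipped.
     * Returns {firstPage, lastPage} pairs, both inclusive.
     */
    static List<int[]> group(boolean[] startsRecord) {
        List<int[]> ranges = new ArrayList<>();
        int first = -1; // first page of the record currently being collected
        for (int page = 1; page <= startsRecord.length; page++) {
            if (startsRecord[page - 1]) {
                if (first != -1) {
                    // a new record begins, so the previous one ended on the page before
                    ranges.add(new int[] {first, page - 1});
                }
                first = page;
            }
        }
        if (first != -1) {
            // the last record runs to the end of the document
            ranges.add(new int[] {first, startsRecord.length});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Pages 1, 3 and 4 start a record; pages 2 and 5 are continuation pages.
        boolean[] starts = {true, false, true, true, false};
        for (int[] r : group(starts)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

For the five pages in main this yields the ranges 1-2, 3-3 and 4-5. Computing the ranges first is just one possible design; opening and closing documents on the fly inside a single loop works as well.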

To recognize on which pages a new dataset starts, you can take inspiration from ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:

PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    System.out.println("\nPage " + i);
    System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();

Simply search for the record start text: if the text of a page contains it, a new record starts there.
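As a rough sketch of that search, the following works on page texts that have already been extracted, so it runs without iText. The class name RecordStartFinder, the method findStartPages and the sample strings are made up for this illustration. Collapsing whitespace before comparing is an assumption on my part: extracted text may contain line breaks or extra spaces where the visible header has none, so check what your own document yields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecordStartFinder {

    /** Collapses every run of whitespace (including line breaks) into one space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the 1-based numbers of all pages whose text contains the header text. */
    static List<Integer> findStartPages(List<String> pageTexts, String headerText) {
        String header = normalize(headerText);
        List<Integer> startPages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (normalize(pageTexts.get(i)).contains(header)) {
                startPages.add(i + 1); // PDF page numbers are 1-based
            }
        }
        return startPages;
    }

    public static void main(String[] args) {
        // Stand-ins for what PdfTextExtractor.getTextFromPage would return per page.
        List<String> pages = Arrays.asList(
                "RECORD HEADER\nfirst record",
                "second page of the first record",
                "RECORD  HEADER\nsecond record");
        System.out.println(findStartPages(pages, "RECORD HEADER")); // prints [1, 3]
    }
}
```

If the start text varies between records, a regular expression (java.util.regex.Pattern) can take the place of the plain contains check.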

