...

Generating a preview image for a PDF is easy with PDFBox (an Apache project). With the PDFBox API, you can convert any PDF page to an image. We tested generating previews for PDFs in TikaInputTransformer using PDFBox and the resulting images were accurate.

...

There are a number of options for generating Word document previews: some are cross-platform and some are not, and some require an additional program to be installed to convert the document. 

There is an API called documents4j that is capable of converting Word documents to PDFs or to HTML by delegating the conversion to Microsoft Office. It requires Office to be installed, and it only works on Windows. We did not test this option.

There is another API called JODConverter that uses OpenOffice. It’s similar to documents4j in that it requires OpenOffice to be installed and can convert Word documents to PDFs or to HTML. It’s not being actively maintained, but there are a number of active forks. We did not test this option.

docx4j is a popular Java library for working with Office files. It provides the ability to convert Word documents to either HTML or PDF. We tested this API and it provides reasonably accurate conversions at a decent speed. For a simple two-page Word document, converting it to a PDF took around 6 seconds. The end result was a good representation of the original Word document. Converting the same two-page Word document to HTML took just over 3 seconds. The result wasn’t as good; it didn’t really look like a Word document. If we had some reasonable default CSS to apply to the generated HTML, it might be good enough to use as a thumbnail.

Since we need to end up with an image, we have to convert the intermediate PDF or HTML. For a PDF, we could just use PDFBox and convert the first page. However, rendering a PDF for the entire Word document when we only care about the first page is wasteful. We can’t easily convert a Word document directly to an image of the first page because the Word document format is not page-based. Internally, docx4j converts the Word document to XSL-FO and then uses Apache FOP to render the XSL-FO as a PDF. We could use docx4j to generate just the XSL-FO. With the correct calls to FOP (or another FO rendering library), it may be possible to render only the first page of the PDF rather than every page.

...

For HTML, generating an acceptable thumbnail would be more difficult. There are Java APIs capable of rendering HTML + CSS to an image, such as Flying Saucer. We tested this API and the generated images were reasonably accurate. A Word document preview should be the first page of the document, but how would we know what portion of the HTML page to use as the preview? We could assume a reasonable default area of the HTML page to clip and then scale that down to use as the thumbnail.
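The clip-and-scale step could look roughly like the sketch below. It uses only the JDK and assumes the HTML has already been rendered to a BufferedImage by whatever renderer is chosen. The class name ThumbnailClipper, the method clipAndScale, and the default clip area of 816 x 1056 pixels (a US Letter page at 96 DPI) are illustrative assumptions, not something we tested.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: clips a "first page" region from a rendered HTML image
// and scales it down to thumbnail size.
final class ThumbnailClipper {

    // Assumed default clip area: US Letter (8.5in x 11in) at 96 DPI.
    static final int DEFAULT_CLIP_WIDTH = 816;
    static final int DEFAULT_CLIP_HEIGHT = 1056;

    /**
     * Clips the top-left clipWidth x clipHeight region of the rendered image
     * (or less, if the image is smaller) and scales it to thumbWidth pixels
     * wide, preserving the aspect ratio of the clipped region.
     */
    static BufferedImage clipAndScale(BufferedImage rendered,
                                      int clipWidth, int clipHeight, int thumbWidth) {
        // Never clip beyond the bounds of the rendered image.
        int w = Math.min(clipWidth, rendered.getWidth());
        int h = Math.min(clipHeight, rendered.getHeight());
        BufferedImage clipped = rendered.getSubimage(0, 0, w, h);

        // Derive the thumbnail height from the clipped region's aspect ratio.
        int thumbHeight = Math.max(1, Math.round((float) h * thumbWidth / w));

        BufferedImage thumb =
                new BufferedImage(thumbWidth, thumbHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = thumb.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(clipped, 0, 0, thumbWidth, thumbHeight, null);
        } finally {
            g.dispose();
        }
        return thumb;
    }
}
```

A call such as ThumbnailClipper.clipAndScale(rendered, ThumbnailClipper.DEFAULT_CLIP_WIDTH, ThumbnailClipper.DEFAULT_CLIP_HEIGHT, 200) would produce a 200-pixel-wide thumbnail of the assumed first-page area. Whether that area actually matches the first page of the Word document depends on the CSS applied to the generated HTML.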

Another Java library that converts Word documents to PDFs or to HTML is xdocreport. Internally, it does conversions using docx4j, but it also allows for conversions via a library called iText. The conversion via iText was slower than the conversion via docx4j, but the resulting PDFs were comparable. A shortcoming of this library is that it isn’t as configurable as using either of the underlying libraries directly.

...

Converting a PowerPoint slide to an image is easy to do with Apache POI. We tested it and it generates reasonably accurate (good enough for thumbnails) image versions of both .ppt and .pptx slides. DDF is currently using the POI ServiceMix bundle version 3.9_2, and we encountered some errors with certain PowerPoint files we tried with the corresponding version of POI. These files worked with the latest version of POI (version 3.13).

...