Document Previews

We’re researching ways to create thumbnail images for Office documents (Word, PowerPoint, Excel, and Visio) and PDFs on ingest. These would essentially be previews of the documents, similar to the preview you get when you select a document in the file explorer on Windows or Mac. Ideally, the solution would be cross-platform (i.e., pure Java).

PDFs

Generating a preview image for a PDF is easy with PDFBox (an Apache project). With the PDFBox API, you can convert any PDF page to an image. We tested generating previews for PDFs in TikaInputTransformer using PDFBox and the resulting images were accurate.
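
The page-to-image conversion described above can be sketched as follows. This is a minimal illustration, not the code used in TikaInputTransformer: it assumes the PDFBox 2.x API (PDDocument.load and PDFRenderer), and the class and method names (PdfPreview, renderFirstPage) are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical helper class; assumes PDFBox 2.x is on the classpath.
public class PdfPreview {

    /** Renders the first page of a PDF; at 72 DPI one PDF point maps to one pixel. */
    public static BufferedImage renderFirstPage(File pdf, float dpi) throws IOException {
        // PDFBox 2.x loading call (PDFBox 3.x uses Loader.loadPDF(File) instead).
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFRenderer renderer = new PDFRenderer(document);
            // Page indexes are zero-based, so 0 is the first page.
            return renderer.renderImageWithDPI(0, dpi, ImageType.RGB);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: input PDF, args[1]: output PNG
        BufferedImage preview = renderFirstPage(new File(args[0]), 72f);
        ImageIO.write(preview, "png", new File(args[1]));
    }
}
```

Only the requested page is rendered, so the cost should not grow with the length of the PDF. A lower DPI gives a smaller image that is cheaper to scale down to thumbnail size.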

Word Documents

There are a number of options for generating Word document previews: some are cross-platform and some are not, and some require an additional program to be installed to convert the document. 

There is an API called documents4j that is capable of converting Word documents to PDFs or to HTML by delegating the conversion to Microsoft Office. It requires Office to be installed, and it only works on Windows. We did not test this option. 

Another option is JODConverter, which delegates the conversion to OpenOffice. Like documents4j, it depends on an installed office suite (OpenOffice in this case) and can convert Word documents to PDFs or to HTML. The original project is not being actively maintained, but there are a number of active forks. We did not test this option.

docx4j is a popular Java library for working with Office files. It can convert Word documents to either HTML or PDF. We tested this library and it produced reasonably accurate conversions. Converting a simple two-page Word document to a PDF took around 6 seconds, and the result was a good representation of the original document. Converting the same document to HTML took just over 3 seconds, but the result was worse: it didn’t really look like a Word document. With some reasonable default CSS applied to the generated HTML, it might be good enough to use as a thumbnail.

Since we need to end up with an image, we have to convert the intermediate PDF or HTML. For a PDF, we could just use PDFBox and convert the first page. However, rendering a PDF for the entire Word document when we only care about the first page is wasteful. We can’t easily convert a Word document directly to an image of the first page because the Word document format is not page-based: page breaks are only known once the document has been laid out. Internally, docx4j converts the Word document to XSL-FO and then uses Apache FOP to render the XSL-FO as a PDF. We could use docx4j to generate just the XSL-FO. With the correct calls to FOP (or another FO rendering library), it may then be possible to render only the first page rather than every page, but we have not verified this.

docx4j only handles .docx files directly. If we need to handle .doc files as well, docx4j provides a utility that converts .doc files into .docx files, but we didn’t test it. 

For HTML, generating an acceptable thumbnail would be more difficult. There are Java APIs capable of rendering HTML + CSS to an image, such as Flying Saucer. We tested this API and the generated images were reasonably accurate. A Word document preview should be the first page of the document, but how would we know what portion of the HTML page to use as the preview? We could assume a reasonable default area of the HTML page to clip and then scale that down to use as the thumbnail.
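
The clip-and-scale idea above can be sketched with the JDK’s own imaging classes. This is a rough illustration under stated assumptions: the input is an image that has already been rendered from the HTML (for example by Flying Saucer), and the class name, method name, and clip dimensions are all hypothetical choices rather than anything we settled on.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper; uses only java.awt, no third-party libraries.
public class ThumbnailClipper {

    /**
     * Clips the upper-left clipWidth x clipHeight region of a rendered page
     * and scales it to thumbWidth pixels wide, preserving the clip's aspect ratio.
     */
    public static BufferedImage clipAndScale(BufferedImage rendered,
                                             int clipWidth, int clipHeight,
                                             int thumbWidth) {
        // Never clip beyond the bounds of the rendered image.
        int w = Math.min(clipWidth, rendered.getWidth());
        int h = Math.min(clipHeight, rendered.getHeight());
        BufferedImage clip = rendered.getSubimage(0, 0, w, h);

        // Derive the thumbnail height from the clip's aspect ratio.
        int thumbHeight = Math.max(1, Math.round((float) h * thumbWidth / w));
        BufferedImage thumb =
                new BufferedImage(thumbWidth, thumbHeight, BufferedImage.TYPE_INT_RGB);

        Graphics2D g = thumb.createGraphics();
        try {
            // Bilinear interpolation gives smoother results than the default when shrinking.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(clip, 0, 0, thumbWidth, thumbHeight, null);
        } finally {
            g.dispose();
        }
        return thumb;
    }
}
```

For example, clipAndScale(rendered, 800, 1000, 200) would take a roughly page-shaped region from the top of the rendered HTML and reduce it to a 200-pixel-wide thumbnail. The right clip size depends on the width at which the HTML is rendered, which is something we would have to tune.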

Another Java library that converts Word documents to PDFs or to HTML is xdocreport. Internally, it does conversions using docx4j, but it also allows for conversions via a library called iText. The conversion via iText was slower than the conversion via docx4j, but the resulting PDFs were comparable. A shortcoming of this library is that it isn’t as configurable as using either of the underlying libraries directly.

To summarize: it doesn’t appear that there are any options that would allow us to convert a Word document directly to an image. We would have to convert the document to either a PDF or to HTML and then convert that to an image. It seems that converting a document to HTML is faster, but generating a useful thumbnail from the HTML would be more difficult.

PowerPoint Presentations

Converting a PowerPoint slide to an image is easy to do with Apache POI. We tested it and it generates reasonably accurate (good enough for thumbnails) image versions of both .ppt and .pptx slides. DDF is currently using the POI ServiceMix bundle version 3.9_2, and the corresponding version of POI (3.9) produced errors on some of the PowerPoint files we tried. The same files worked with the latest version of POI at the time of testing (version 3.13), so using POI for previews would likely mean upgrading the bundle.
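
A slide-to-image conversion along these lines might look like the sketch below. It is an illustration only and covers just the .pptx case: it assumes a POI version (3.13 or later) in which XMLSlideShow.getSlides() returns a List, and the class and method names (SlidePreview, renderFirstSlide) are hypothetical. A .ppt file would go through the separate HSLF classes instead.

```java
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFSlide;

// Hypothetical helper class; assumes POI (poi-ooxml) 3.13 or later on the classpath.
public class SlidePreview {

    /** Draws the first slide of a .pptx file onto an image the size of the slide. */
    public static BufferedImage renderFirstSlide(InputStream pptx) throws IOException {
        try (XMLSlideShow show = new XMLSlideShow(pptx)) {
            // The page size is reported in points; one point is drawn as one pixel here.
            Dimension size = show.getPageSize();
            BufferedImage image =
                    new BufferedImage(size.width, size.height, BufferedImage.TYPE_INT_RGB);

            Graphics2D graphics = image.createGraphics();
            try {
                // Start from a white canvas so uncovered areas are not black.
                graphics.setPaint(Color.WHITE);
                graphics.fillRect(0, 0, size.width, size.height);

                XSLFSlide first = show.getSlides().get(0);
                first.draw(graphics);
            } finally {
                graphics.dispose();
            }
            return image;
        }
    }
}
```

The resulting image is full slide size, so it would still need to be scaled down before being stored as a thumbnail.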

Excel Spreadsheets

There appear to be few options for generating previews of Excel spreadsheets. JODConverter and documents4j both claim to be able to convert Excel files to PDFs or to HTML, but they require OpenOffice or Office, respectively, to be installed. We didn’t test either option.

POI has good support for Excel files (both .xls and .xlsx) and can convert them to HTML, although a number of features are only partially supported. We tested a simple Excel file and the generated HTML looked reasonably accurate. Converting the cell contents doesn’t seem too difficult, but POI appears to have limited support for charts and other graphics.

A feasible solution would be to convert the first sheet of an Excel file to HTML and then use an API such as Flying Saucer to generate an image from the HTML. There’s no obvious segment of a sheet to use as a preview since sheets aren’t bounded, so it seems reasonable to just clip some area of the upper-left portion of the generated image and scale it down to use as a thumbnail.

Visio

The only option here seems to be POI, but its Visio support is very limited. We couldn’t find any examples of converting a Visio file to another format with POI.