Efficient extraction of text document metadata

Introduction

The TikaInputTransformer is currently the fallback transformer that the content framework uses to ingest any file that is not associated with a specific input transformer. This transformer depends on the Apache Tika ToXMLContentHandler class to parse the file into XHTML, but it bloats the output by placing the entire original document in the body of the XML document.

The purpose of this research spike is threefold:

  • Continue to leverage Tika for its MIME type detection and basic metadata generation.
  • Eliminate metadata bloat.
  • Extract more detailed metadata from the original document.

Research was done in the following areas:

  • Determining the general mechanism.
  • Producing a prototype algorithm.

General Mechanism

The Tika Parser interface encapsulates and hides a lot of complexity, but this design also makes it hard to configure Tika at the fine level of granularity needed for our purpose. However, the following general mechanism has been found:

  • Extend the ToXMLContentHandler class and override its methods so that the content of the XHTML body is removed from the normal output but remains available for additional parsing.
  • Associate specific subclasses of ToXMLContentHandler with particular MIME types in order to apply different kinds of parsing to different file types.
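
The second point can be sketched as a small lookup table from MIME type to handler factory. This is only an illustration under stated assumptions: HandlerRegistry, PdfBodyHandler, PlainTextBodyHandler and FallbackBodyHandler are hypothetical names, the handlers extend the JDK's DefaultHandler as stand-ins for ToXMLContentHandler subclasses, and the MIME type string is assumed to come from Tika's detection step.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-ins; in the real transformer these would be
// subclasses of Tika's ToXMLContentHandler.
class PdfBodyHandler extends DefaultHandler {}
class PlainTextBodyHandler extends DefaultHandler {}
class FallbackBodyHandler extends DefaultHandler {}

// Hypothetical registry that picks a handler for a detected MIME type.
class HandlerRegistry {
    private final Map<String, Supplier<ContentHandler>> byMimeType = new HashMap<>();

    // Associate a handler factory with a base MIME type such as "application/pdf".
    void register(String mimeType, Supplier<ContentHandler> factory) {
        byMimeType.put(normalize(mimeType), factory);
    }

    // Return a fresh handler for the MIME type, or the fallback if none is registered.
    // A new instance is created per call because handlers hold per-document state.
    ContentHandler handlerFor(String mimeType) {
        Supplier<ContentHandler> factory = byMimeType.get(normalize(mimeType));
        return factory != null ? factory.get() : new FallbackBodyHandler();
    }

    // Drop parameters such as "; charset=UTF-8" and ignore case.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

A caller would register one factory per supported type, for example register("application/pdf", PdfBodyHandler::new), and ask the registry for a handler once the type of the incoming file is known. Unregistered types fall through to the generic handler, which mirrors the fallback role of the transformer itself.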

Algorithm

EmptyBodyXMLHandler extends ToXMLContentHandler {

    @Override characters(...) {
        if not inside body tag {
            output characters as usual.
        }
        else {
            parse body content into metadata.
        }
    }

    @Override startElement(...) {
        if not inside body tag {
            output the start tag as usual.
        }
    }

    @Override endElement(...) {
        if not inside body tag {
            output the end tag as usual.
        }
    }

}
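
The pseudocode above leaves open how "inside body tag" is tracked. A minimal runnable sketch of one way to do it follows, using a depth counter. To keep it runnable on a plain JDK, it extends org.xml.sax.helpers.DefaultHandler and serializes tags itself rather than extending Tika's ToXMLContentHandler, so the class name BodyStrippingHandler and its helper methods are assumptions for illustration only. The same three overrides would apply in a ToXMLContentHandler subclass, delegating to super where this sketch appends to its own buffer. The body text is only collected here; parsing it into metadata is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Stand-in sketch: extends the JDK's DefaultHandler, not Tika's ToXMLContentHandler.
class BodyStrippingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();      // XHTML with an emptied body
    private final StringBuilder bodyText = new StringBuilder(); // text captured from the body
    private int bodyDepth = 0; // 0 = outside <body>, 1 = directly inside, >1 = nested deeper

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0) {
            bodyDepth++; // nested element inside the body: suppress its start tag
            return;
        }
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
        if ("body".equals(qName)) {
            bodyDepth = 1; // the <body> tag itself is still written
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
            if (bodyDepth > 0) {
                return; // still inside the body: suppress the end tag
            }
        }
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            bodyText.append(ch, start, length); // keep for additional parsing
        } else {
            out.append(escape(new String(ch, start, length)));
        }
    }

    String xml() { return out.toString(); }

    String bodyText() { return bodyText.toString(); }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Convenience helper for the sketch: parse a well-formed XHTML string.
    static BodyStrippingHandler parse(String xhtml) {
        BodyStrippingHandler handler = new BodyStrippingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler;
    }
}
```

For the input <html><head><title>T</title></head><body><p>Hello <b>world</b></p></body></html>, xml() yields the same document with an empty body element, and bodyText() yields "Hello world". Counting depth rather than using a boolean flag keeps the handler correct when elements are nested inside the body.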