Introduction

The TikaInputTransformer is currently the fallback transformer that the content framework uses to ingest any file that is not associated with a specific input transformer. This transformer depends on the Apache Tika ToXMLContentHandler class to parse the file into XHTML, but it bloats the output by putting the entire original document into the body of the XML document.

The purpose of this research spike is twofold:

Research was done in the following areas:

General Mechanism

The Tika Parser interface encapsulates and hides a lot of complexity, but this design also makes it hard to configure Tika at the low level of granularity needed to achieve our purpose. However, the following general mechanism has been found:

Algorithm

EmptyBodyXMLHandler extends ToXMLContentHandler {

    @Override characters(...) {
        if not inside body tag {
            output characters as usual.
        }
        else {
            parse body content into metadata.
        }
    }

    @Override startElement(...) {
        if not inside body tag {
            output the start tag as usual.
        }
    }

    @Override endElement(...) {
        if not inside body tag {
            output the end tag as usual.
        }
    }
}
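
The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the spike's actual implementation: to stay runnable without Tika on the classpath it extends the JDK's DefaultHandler as a stand-in for ToXMLContentHandler and serializes tags by hand. The class name EmptyBodyXmlHandler, the parse helper, and getBodyText are hypothetical. The sketch also assumes that the body tags themselves are still emitted (leaving an empty body) and that "parse body content into metadata" can be approximated by collecting the body text in a buffer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a ToXMLContentHandler subclass; uses JDK SAX classes only.
class EmptyBodyXmlHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();      // serialized XML, body content removed
    private final StringBuilder bodyText = new StringBuilder(); // text diverted from the body
    private int bodyDepth = 0; // 0 = outside body, 1 = directly inside body, >1 = nested inside body

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0) { // inside body: suppress the tag, only track nesting
            bodyDepth++;
            return;
        }
        if ("body".equals(qName)) { // assumption: the body tag itself is still written
            bodyDepth = 1;
        }
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 1) { // closing an element nested inside body: suppress it
            bodyDepth--;
            return;
        }
        bodyDepth = 0; // closing body itself, or already outside body
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            bodyText.append(ch, start, length); // divert body text instead of writing it
        } else {
            out.append(escape(new String(ch, start, length)));
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    String getBodyText() { return bodyText.toString(); }

    @Override
    public String toString() { return out.toString(); }

    // Hypothetical convenience helper: run the handler over an XML string.
    static EmptyBodyXmlHandler parse(String xml) {
        try {
            EmptyBodyXmlHandler handler = new EmptyBodyXmlHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        EmptyBodyXmlHandler h = parse(
            "<html><head><title>T</title></head><body><p>Hello <b>world</b></p></body></html>");
        System.out.println(h);               // XML with an empty body
        System.out.println(h.getBodyText()); // text that was inside the body
    }
}
```

The depth counter matters because elements nested inside the body also fire startElement and endElement; a plain boolean flag would switch output back on at the first nested closing tag. In a real Tika-based handler the same checks would presumably wrap calls to the superclass methods rather than hand-written serialization, and element names would likely need to be matched on localName, since Tika emits namespaced XHTML.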