Introduction

The TikaInputTransformer is currently the fallback transformer that the content framework uses to ingest any file that is not associated with a specific input transformer. This transformer depends on the Apache Tika ToXMLContentHandler class to parse the file into XHTML, but it bloats the output by putting the entire original document into the body of the XML document.

...

  • Determine the general mechanism.
  • Produce a prototype algorithm.

General Mechanism

The Tika parser interface encapsulates and hides a lot of complexity, but this design also makes it hard to configure Tika at the fine level of granularity our purpose requires. However, the following general mechanism has been found:

  • Extend the ToXMLContentHandler class and override its methods so that the content of the XHTML body is removed from the normal output but remains available for additional parsing.
  • Associate specific subclasses of ToXMLContentHandler with certain MIME types in order to apply different kinds of parsing to different file types.
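
The two steps above can be illustrated with a minimal sketch. To keep it runnable without Tika on the classpath, it extends the JDK's org.xml.sax.helpers.DefaultHandler as a stand-in for ToXMLContentHandler; a real implementation would presumably override the corresponding SAX callbacks on ToXMLContentHandler instead. The names EmptyBodySketchHandler, HandlerRegistry, and handlerFor, as well as the MIME type in the registry, are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Stand-in for a ToXMLContentHandler subclass: serializes SAX events but diverts body content. */
class EmptyBodySketchHandler extends DefaultHandler {
    private final StringBuilder xml = new StringBuilder();  // normal output, body left empty
    private final StringBuilder body = new StringBuilder(); // body text kept for further parsing
    private int bodyDepth = 0;                              // > 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0) { bodyDepth++; return; }   // markup nested in the body is dropped
        xml.append('<').append(qName).append('>');    // attributes omitted for brevity
        if ("body".equals(qName)) bodyDepth = 1;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 1) { bodyDepth--; return; }   // closing a nested body element
        if (bodyDepth == 1) bodyDepth = 0;            // closing <body> itself
        xml.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Body text goes to the side buffer; everything else goes to the output (unescaped here).
        (bodyDepth > 0 ? body : xml).append(ch, start, length);
    }

    String xml() { return xml.toString(); }
    String bodyText() { return body.toString(); }

    /** Convenience helper for trying the handler on an XHTML string. */
    static EmptyBodySketchHandler parse(String xhtml) {
        try {
            EmptyBodySketchHandler handler = new EmptyBodySketchHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}

/** Hypothetical registry that picks a handler per MIME type. */
class HandlerRegistry {
    private static final Map<String, Supplier<DefaultHandler>> BY_MIME = new HashMap<>();
    static {
        // Illustrative entry only; real mappings would depend on the file types of interest.
        BY_MIME.put("application/xhtml+xml", EmptyBodySketchHandler::new);
    }

    /** Returns a fresh handler for the MIME type, or a no-op DefaultHandler if none is registered. */
    static DefaultHandler handlerFor(String mimeType) {
        return BY_MIME.getOrDefault(mimeType, DefaultHandler::new).get();
    }
}
```

Calling EmptyBodySketchHandler.parse(...) on a small XHTML string yields output whose body element is empty, while bodyText() still returns the diverted text for any additional, type-specific parsing.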

Algorithm

EmptyBodyXMLHandler extends ToXMLContentHandler {

...