Randall Hauch
on 18 Sep 12
MODE-1639, MODE-1640, MODE-1634 Replaced the Aperture-based MIME type detector with a Tika-based one

This required quite a bit of dependency gymnastics, since Tika has quite a few more transitive
dependencies than the Aperture library (which we had successfully pared down several years ago).
Tika references about 25 dependencies (including transitive dependencies), but this was reduced
in 'modeshape-jcr' to about 8 for basic MIME type detection. Note that Tika usually includes
two BouncyCastle libraries in its dependencies (used for encrypted PDFs, among other things),
but ModeShape intentionally excludes these (as we don't want to ship or depend on any
security-related JARs).

Not only do we get Tika's substantial MIME type database, but we've also made it possible for users
to edit the 'org/modeshape/custom-mimetypes.xml' file and provide the updated one on the application
classpath. The contents of that file override all of the other sources (namely Tika's built-in
file and its customization file, both of which are found on the classpath), so the easiest approach
is to provide an updated version of this file at 'org/modeshape/custom-mimetypes.xml'.
Be sure not to remove any of the (few) customizations that ModeShape includes; those are important.

As we upgrade Tika, we'll get updated versions of the media type data. This is far preferable
to maintaining a ModeShape-specific version.

The MIME type related interfaces in ModeShape's public API (e.g., 'modeshape-jcr-api') have been removed.
These were added sometime in one of the 3.0 releases, so removing them will not introduce compatibility
issues for users.

Instead, we've decided to get out of the MIME type detection framework business and to switch
to Tika for all MIME type detection. You can still write your own MIME type detector,
but you do that by implementing Tika's interface and referencing the implementation class(es) in the
corresponding service loader file in your JAR. (See the Tika documentation for details.)
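
As a rough illustration of the custom-detector route described above, here is a minimal sketch that
assumes Tika's 'org.apache.tika.detect.Detector' interface is the one being implemented. The class
name 'AcmeDetector', the 'ACME' magic bytes and the 'application/x-acme' media type are all
hypothetical, and Tika must be on the classpath for this to compile.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

/**
 * Hypothetical detector: recognizes streams that start with the bytes "ACME".
 * Must be public with a no-arg constructor so the service loader can create it.
 */
public class AcmeDetector implements Detector {
    private static final long serialVersionUID = 1L;

    // Hypothetical magic number and media type, for illustration only.
    private static final byte[] MAGIC = {'A', 'C', 'M', 'E'};
    private static final MediaType ACME = MediaType.application("x-acme");

    @Override
    public MediaType detect(InputStream input, Metadata metadata) throws IOException {
        if (input == null) {
            // No content to inspect: report the generic fallback type.
            return MediaType.OCTET_STREAM;
        }
        // Remember the current position so the stream can be rewound afterwards.
        input.mark(MAGIC.length);
        try {
            byte[] header = new byte[MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = input.read(header, read, header.length - read);
                if (n < 0) break; // stream ended before the full header was read
                read += n;
            }
            boolean matches = read == MAGIC.length && Arrays.equals(header, MAGIC);
            return matches ? ACME : MediaType.OCTET_STREAM;
        } finally {
            // Leave the stream where we found it for the next consumer.
            input.reset();
        }
    }
}
```

To register it, the JAR would then carry a file named 'META-INF/services/org.apache.tika.detect.Detector'
containing the fully qualified class name of the detector on a single line.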

However, internally we still have an abstraction. This is because it is possible to remove Tika
(and its transitive dependencies) from a ModeShape installation, as long as your applications do not
expect any kind of automatic MIME type detection. This is a perfectly valid use case: for example,
using a repository to store data but not files (and not using sequencers).
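
The idea of keeping an internal abstraction so that the detection library stays optional can be
sketched roughly as follows. This is not ModeShape's actual code: every type and method name here
is hypothetical, and the extension-based detector merely stands in for a Tika-backed one so the
sketch needs only the JDK.

```java
import java.util.Locale;

/** Hypothetical internal abstraction; not ModeShape's real interface. */
interface MimeTypeDetector {
    /** Returns the detected MIME type, or null if none could be determined. */
    String mimeTypeOf(String name);
}

/** Used when the detection library is absent: never detects anything. */
final class NullMimeTypeDetector implements MimeTypeDetector {
    @Override
    public String mimeTypeOf(String name) {
        return null;
    }
}

/** Stand-in for a library-backed detector; a real one would delegate to Tika. */
final class ExtensionMimeTypeDetector implements MimeTypeDetector {
    @Override
    public String mimeTypeOf(String name) {
        if (name == null) return null;
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) return "text/plain";
        if (lower.endsWith(".xml")) return "application/xml";
        return null;
    }
}

final class MimeTypeDetectors {
    private MimeTypeDetectors() {
    }

    /**
     * Picks an implementation depending on whether a marker class from the
     * optional library can be found on the classpath (for Tika, this might be
     * "org.apache.tika.Tika").
     */
    static MimeTypeDetector create(String markerClassName) {
        try {
            // Look the class up without initializing it; only its presence matters.
            Class.forName(markerClassName, false, MimeTypeDetectors.class.getClassLoader());
            return new ExtensionMimeTypeDetector();
        } catch (ClassNotFoundException | LinkageError e) {
            // Library missing or unusable: fall back to "no detection".
            return new NullMimeTypeDetector();
        }
    }
}
```

With this shape, callers only ever see 'MimeTypeDetector', so removing the library's JARs changes
which implementation is chosen at startup rather than breaking the code that asks for a MIME type.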

The AS7 kits required a bit more modification. There is now a new AS7 module for 'org.apache.tika'
that contains all of the JARs, and this is used by the ModeShape module and by the Tika text extractor.


All unit and integration tests pass with these changes. Several new tests were added.

