MODE-2684 Removes the compile-time dependency of modeshape-core on Apache Tika. The mime type extraction functionality still works as-is if Tika is present, but there is now also an independent extension-based default, which is used if Tika is not present on the classpath at runtime.
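
The runtime fallback described in this message could look roughly like the sketch below. It is an illustration only, not the code from the changeset: the class name, the method names, and the small extension table are all hypothetical. The only real identifier is `org.apache.tika.Tika`, which is probed by name so that nothing is needed at compile time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from ModeShape.
final class MimeTypeFallbackSketch {

    // A tiny illustrative extension table; a real default would carry far more entries.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("xml", "application/xml");
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("png", "image/png");
    }

    // True only if the Tika facade class can be loaded at runtime.
    // Loading by name avoids any compile-time reference to Tika.
    static boolean tikaOnClasspath() {
        try {
            Class.forName("org.apache.tika.Tika", false,
                    MimeTypeFallbackSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Extension-based detection; returns null when the extension is missing or unknown.
    static String detectByExtension(String fileName) {
        if (fileName == null) return null;
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return null;
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(ext);
    }
}
```

A caller would check `tikaOnClasspath()` once and use `detectByExtension` only when it returns false.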

  1. … 23 more files in changeset.
MODE-2528 Integrates the new relational provider with the modeshape codebase. This is a huge commit which makes the necessary changes to remove all Infinispan configuration and dependencies, replacing them with the new mechanism. It also contains several changes to the relational provider design because of various failing tests. These include, among other things, the need for ModeShape to notify the provider once exclusive locks have been obtained as part of each transaction.

  1. … 305 more files in changeset.
MODE-2489 Refactored mime-type handling and added the possibility of configuring the repository to use either "content", "name" or no mime-type detection at all.

  1. … 35 more files in changeset.
Added tika XML text extraction test.

    • -0
    • +7
    ./TikaTextExtractorRepositoryTest.java
  1. … 1 more file in changeset.
MODE-2246 Implemented default FTS via strict regex matching. No stemming or punctuation processing is done by default (unlike what Lucene did in 3.x), which means that some of the tests had to be adapted. Also re-enabled the tests that had previously been disabled for full text search.

    • -3
    • +0
    ./TikaTextExtractorRepositoryTest.java
  1. … 10 more files in changeset.
MODE-2097, MODE-2169, MODE-2197 Integrated the latest version of the jboss-integration BOM. This commit includes changes for multiple issues that snowballed: packaging Javadocs in a zip and updating Apache POI. In addition, after integrating the BOM, a number of unit tests had to be updated to reflect changes in dependencies, both from a functionality perspective and from a deprecation perspective. The most significant change was the rewriting of the ConnectorTestCase (modeshape-jca), because the new versions of Arquillian + IronJacamar hold file locks on Windows: https://issues.jboss.org/browse/JBJCA-1027

    • -3
    • +2
    ./TikaTextExtractorRepositoryTest.java
  1. … 92 more files in changeset.
MODE-2018 Implemented new query engine.

Refactored the query functionality to now use several new service provider interfaces (SPI),

and implemented a new query engine that can take advantage of administrator-defined indexes.

When no such indexes are defined, the query engine is able to still answer the queries

by "scanning" all nodes in the repository. This is like a regular relational database:

all query functionality works (albeit slowly) even when no indexes are defined, though

to improve performance simply define an appropriate index based upon the query or queries

that are being used.

All of ModeShape's query parsing, planning, and optimization steps are basically unchanged

from the previous query system. There is one addition to the rule-based optimizer: a new

rule looks at query plans and adds the potential indexes that might be of use in each

access query portion of a query plan. Then, the query execution process (see below)

chooses one of the identified indexes based upon the selectivity and cardinality. If no index

is available for that portion of the query plan, then the query engine simply iterates

over all queryable nodes in the repository.
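
As a rough illustration of that choice, the sketch below picks the candidate index expected to return the fewest nodes and falls back to a full scan when there are no candidates. The class, the `Candidate` type, and the single "estimated matches" figure are assumptions made for the example; the actual engine weighs selectivity and cardinality in its own way.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: these types are not ModeShape's.
final class IndexChoiceSketch {

    // A candidate index with an estimate of how many nodes it would return
    // for the access query (lower means more selective).
    static final class Candidate {
        final String name;
        final long estimatedMatches;

        Candidate(String name, long estimatedMatches) {
            this.name = name;
            this.estimatedMatches = estimatedMatches;
        }
    }

    // Picks the candidate expected to return the fewest nodes.
    // An empty result means no index applies.
    static Optional<Candidate> choose(List<Candidate> candidates) {
        return candidates.stream()
                .min(Comparator.comparingLong((Candidate c) -> c.estimatedMatches));
    }

    // Describes the resulting plan: use the chosen index, or scan everything.
    static String describe(List<Candidate> candidates) {
        return choose(candidates)
                .map(c -> "use index " + c.name)
                .orElse("scan all queryable nodes");
    }
}
```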

A new kind of component, called a "query index provider", allows the query engine to delegate

various responsibilities around indexes to these providers. For example, a provider must

provide an index planner that can examine the constraints that apply to an access query

and determine if any of the provider's indexes can be used. When they are, ModeShape

adds those indexes to the query plan. If the query engine uses one of those indexes,

then the provider must be able to return all of those nodes that satisfy the criteria

as described earlier by its index planner. Finally, as ModeShape content changes, ModeShape

will notify the index providers of the changes so that they can ensure their indexes

are kept up-to-date with the content.

This means that a provider can implement the functionality using any kind of technology,

and consequently, that ModeShape can begin to leverage multiple kinds of search and index

technology within its query system. The ModeShape community anticipates having providers

that use Lucene, Solr, and ElasticSearch. ModeShape will also likely come with a provider

that maintains file-system based indexes. Additionally, providers can optionally support

indexes on one or more properties. Thus, it will be possible to mix and match

these providers, selecting the best technology for the specific kind of index.

The new query engine does the execution in a very different way than the previous engine,

which used Lucene to determine the tuples (that is, the values in each row) for each access

query; these were then further processed and combined to form the tuples that were returned

in the result set. The new engine instead uses a new concept of a "stream of node keys"

for each access query: what actually implements that stream depends on many factors.

A node sequence is an abstraction of a stream of "rows" containing one or more node keys.

The interfaces are designed to make it possible to lazily implement a stream in a very

efficient manner. Specifically, a node stream is actually comprised of multiple "batches"

of rows, and batches can be of any size.
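
The batching idea can be sketched as follows. This is a simplified stand-in, not the NodeSequence interface itself: the `Sequence` type, its `nextBatch` method, the use of plain strings as node keys, and the convention that null marks the end are all assumptions for the example. The point is that only one batch is materialized at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a simplified stand-in for a batched sequence of node keys.
final class BatchedSequenceSketch {

    // Hands out batches of keys until it returns null (assumed end-of-sequence marker).
    interface Sequence {
        List<String> nextBatch();
    }

    // Wraps an iterator of node keys; each call pulls at most batchSize keys,
    // so the underlying source is only consumed as far as the caller reads.
    static Sequence batched(Iterator<String> keys, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        return () -> {
            if (!keys.hasNext()) return null;
            List<String> batch = new ArrayList<>(batchSize);
            while (keys.hasNext() && batch.size() < batchSize) {
                batch.add(keys.next());
            }
            return batch;
        };
    }

    // Drains a sequence and reports the size of each batch, for demonstration.
    static List<Integer> batchSizes(Sequence sequence) {
        List<Integer> sizes = new ArrayList<>();
        for (List<String> b = sequence.nextBatch(); b != null; b = sequence.nextBatch()) {
            sizes.add(b.size());
        }
        return sizes;
    }
}
```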

Consider when the engine finds no indexes are available for a certain access query. The

engine simply uses a "node sequence" (or NodeSequence) implementation that returns in batches

a row for each node in the repository.

But if an access query involves a criterion on the path of a node, such as

"... WHERE ISSAMENODE('/foo/bar') ...", then ModeShape knows that this query (or portion of

a query) will have only one result, namely the node at "/foo/bar". ModeShape doesn't need

an index to quickly find this node; it merely has to navigate to that path to find the one

node that satisfies this query. ModeShape has several other optimizations, too: it knows

when a query involves all children or descendants of a node at a given path, and can take

this into account when optimizing and executing the query. All of these are handled with

special NodeSequence implementations optimized for each case.

For many access queries (i.e., part of a larger query), the engine will use one of the

indexes identified by one of the providers. When this happens, ModeShape uses other

NodeSequence implementations that utilize the underlying indexes to find the nodes that satisfy

some of the criteria.

The above describes how the engine uses a single NodeSequence instance for each access

query in a larger query. But how does the engine combine these to determine the ultimate

query results? Basically, the engine constructs a series of functions that process one or more

NodeSequence instances to filter and combine into other NodeSequences.

For example, a custom index might be used to find all nodes that have a 'jcr:lastModified'

timestamp within some range. Presumably this index is used because it has a higher selectivity,

meaning that it will filter out more nodes and return fewer nodes than other indexes.

Other criteria that are also applied to this access query might then be applied by a filter

that processes the actual nodes' property values.
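
A filter stage of that kind might be sketched like this: keys coming out of an index are passed lazily through a predicate that checks a remaining criterion against the node's property values. Everything here is hypothetical, including the use of string keys and a map standing in for property lookup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch: a lazy filter applied on top of keys returned by an index.
final class FilterStageSketch {

    // Wraps a source of node keys; nothing is evaluated until the result is consumed.
    static Iterator<String> filter(Iterator<String> source, Predicate<String> keep) {
        return new Iterator<String>() {
            private String next;
            private boolean ready;

            @Override
            public boolean hasNext() {
                // Advance until a key passes the predicate or the source runs out.
                while (!ready && source.hasNext()) {
                    String candidate = source.next();
                    if (keep.test(candidate)) {
                        next = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return next;
            }
        };
    }

    // The index has already narrowed the keys (say, by timestamp range);
    // the remaining title criterion is checked against stored property values.
    static List<String> run(List<String> keysFromIndex, Map<String, String> titleByKey,
                            String requiredTitle) {
        Iterator<String> it = filter(keysFromIndex.iterator(),
                key -> requiredTitle.equals(titleByKey.get(key)));
        List<String> result = new ArrayList<>();
        while (it.hasNext()) result.add(it.next());
        return result;
    }
}
```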

While the result of this commit is a functioning query engine that is shown to work in most

of the query-related unit and integration tests, there still are a few areas that are not complete.

Specifically:

* The new engine does not support full-text search, and currently throws an exception.

* No index providers are implemented. Therefore, all queries involve "scanning" the repository.

This can be time consuming, especially for federated repositories. Consequently, all such

tests that query federated content have been disabled/ignored.

    • -1
    • +5
    ./TikaTextExtractorRepositoryTest.java
  1. … 231 more files in changeset.
MODE-2154 Changed the Tika text extractor behavior so that even in the case of an exception, if text has been partially extracted it will be returned.
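
The behavior described here can be illustrated without Tika at all. In the sketch below, a plain `Reader` stands in for the parser output, and a simulated `IOException` stands in for a parser failure; the class and method names are hypothetical and this is not the extractor's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: keep whatever text was read before a failure.
final class PartialExtractionSketch {

    // Reads as much text as possible; on an exception, returns what was gathered so far.
    static String extract(Reader source) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4];
        try {
            int n;
            while ((n = source.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Deliberately not rethrown: the partial text is still useful.
        }
        return text.toString();
    }

    // Test helper: a reader that fails after delivering goodChars characters.
    static Reader failingAfter(String content, int goodChars) {
        StringReader delegate = new StringReader(content);
        return new Reader() {
            private int delivered;

            @Override
            public int read(char[] cbuf, int off, int len) throws IOException {
                if (delivered >= goodChars) throw new IOException("simulated parser failure");
                int n = delegate.read(cbuf, off, Math.min(len, goodChars - delivered));
                if (n > 0) delivered += n;
                return n;
            }

            @Override
            public void close() {
                delegate.close();
            }
        };
    }
}
```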

    • -2
    • +2
    ./TikaTextExtractorRepositoryTest.java
  1. … 1 more file in changeset.
MODE-2148 Added checkstyle to our build, and corrected numerous potential problems or issues in the code. Also removed lots of meaningless JavaDoc.

    • -1
    • +1
    ./TikaTextExtractorRepositoryTest.java
  1. … 365 more files in changeset.
MODE-2081 Changed the license for ModeShape code to ASL 2.0.

    • -18
    • +9
    ./TikaTextExtractorRepositoryTest.java
  1. … 557 more files in changeset.
MODE-2107 Fixed the setting of excluded & included mime-types on the Tika text extractor.

    • -0
    • +18
    ./TikaTextExtractorRepositoryTest.java
  1. … 3 more files in changeset.
MODE-2030 Fixed the update of indexes for child nodes when the parent node is moved/renamed. Also, replaced uses of JcrTools#printQuery in tests with simple query execution & asserts, because of the performance penalty.

    • -2
    • +12
    ./TikaTextExtractorRepositoryTest.java
  1. … 3 more files in changeset.
MODE-2022 - Updated the Tika text extractor to ignore audio/video/image files by default and log a specific error in case of a NoClassDefFound.

    • -0
    • +6
    ./TikaTextExtractorRepositoryTest.java
  1. … 5 more files in changeset.
MODE-1920 Corrected compiler warnings and JavaDoc errors

    • -5
    • +4
    ./TikaTextExtractorRepositoryTest.java
  1. … 26 more files in changeset.
MODE-1960 Updated the POI dependency to 3.10-beta1 and added back the MSOffice Sequencer and Tika Extractor, which were disabled as a result of https://issues.jboss.org/browse/MODE-1934.

    • -1
    • +0
    ./TikaTextExtractorRepositoryTest.java
  1. … 14 more files in changeset.
MODE-1934 - "De-activated" all Apache POI dependencies. No code was removed, so that if the underlying issue is fixed in a future version of POI, we should be able to easily bring it back.

    • -0
    • +2
    ./TikaTextExtractorRepositoryTest.java
  1. … 14 more files in changeset.
MODE-1810 - Added test cases which show that the problem cannot be reproduced

  1. … 4 more files in changeset.
MODE-1791 - Updated the read-only option for connectors by adding a WritableConnector base class which now contains the flag and by moving the read-only checks to the FederatedDocumentStore.

  1. … 12 more files in changeset.
MODE-1561 - Added the writeLimit parameter to the TikaTextExtractor.

    • -5
    • +30
    ./TikaTextExtractorRepositoryTest.java
  1. … 5 more files in changeset.
MODE-1639, MODE-1640, MODE-1634 Replaced the Aperture-based MIME type detector with a Tika-based one

This required quite a bit of dependency gymnastics, since Tika has quite a few more transitive

dependencies than the Aperture library (which we had successfully pared down several years ago).

Tika references about 25 dependencies (including transitive dependencies), but this was reduced

in 'modeshape-jcr' to about 8 for basic MIME type detection. Note that Tika usually includes

two BouncyCastle libraries in its dependencies (used for encrypted PDFs, among other things),

but ModeShape intentionally excludes these (as we don't want to ship or depend on any

security-related JARs).

Not only do we get Tika's substantial MIME type database, we've made it possible for users

to edit the 'org/modeshape/custom-mimetypes.xml' file and provide the updated one on the application

classpath. What goes in that file will overwrite all of the other sources (namely Tika's built-in

file and its customization file, both of which are to be found on the classpath), which means

it's easiest to simply provide an updated version of this file at 'org/modeshape/custom-mimetypes.xml'.

Be sure to not remove any of the (few) customizations that ModeShape includes - those are important.

As we upgrade Tika, we'll get updated versions of the media type data. This is far preferable

to having a ModeShape-specific version.

The MIME type related interfaces in ModeShape's public API (e.g., 'modeshape-jcr-api') have been removed.

These were added sometime in one of the 3.0 releases, so removing them will not introduce compatibility

issues for users.

Instead, we've decided to get out of the MIME type detection framework business, and have decided

to switch to Tika for all MIME type detection. In fact, you can still write your own MIME type detector,

but you do that by implementing Tika's interface and referencing the implementation class(es) in the

corresponding service loader file in your JAR. (See the Tika documentation for details.)

However, internally we still have an abstraction. This is because it is possible to remove Tika

(and its transitive dependencies) from a ModeShape installation, as long as your applications will not

expect any kind of automatic MIME type detection. This is a perfectly valid use case: for example,

using a repository to store data but not files (and without using sequencers).

The AS7 kits required a bit more modification. There is now a new AS7 module for 'org.apache.tika'

that contains all of the JARs, and this is used by the ModeShape module and by the Tika text extractor

module.

All unit and integration tests pass with these changes. Several new tests were added.

  1. … 70 more files in changeset.
MODE-1419 Enabled full text searching

Added full-text search back in. Note that it can be explicitly disabled/enabled within a repository configuration,

and it is currently *enabled* by default. Quite a few test cases were added back in, and some of them highlighted

some issues with search/query scores (were floats, but expected to be doubles).

Also, one quirk of the AS7 subsystem startup is that the text extractors are added to a running repository,

and that means that there's a time during startup when there are no text extractors (and any added binary

values don't get extracted during storage), followed quickly by the enabling of text extractors (by the time

the property with the binary value is indexed). This manifested itself as a blocked sequencing thread within

the ZIP Sequencer integration test (using AS7).

All unit and integration tests pass with these changes, including those that were enabled by these changes.

    • -17
    • +12
    ./TikaTextExtractorRepositoryTest.java
  1. … 24 more files in changeset.
MODE-1547 - Fixed full text search queries which involve stop-words. The solution was to not add empty PhraseQuery instances to the BooleanQuery.

    • -12
    • +20
    ./TikaTextExtractorRepositoryTest.java
  1. … 4 more files in changeset.
MODE-1545 - Implemented a mechanism for the binary store to persist the mime-types of binary values, in order to avoid detection each time.

While working on this, another issue was exposed and fixed: when persisting data to disk (e.g. the AS7 kit), the defaultPrimaryType of node type definitions was not properly re-initialized on a restart.

  1. … 13 more files in changeset.
MODE-1544 - Extracted Tika based mime-type detector and updated the way mime type detectors are loaded and initialized.

Because of the AS7 support, the detectors need to be loaded via the Environment class loader. Also, because text extraction (and implicitly mime-type detection) can be triggered preemptively, some of Tika's excluded dependencies needed to be added back (e.g. for .java and .class files).

    • -0
    • +105
    ./TikaMimeTypeDetectorTest.java
    • -1
    • +1
    ./TikaTextExtractorRepositoryTest.java
  1. … 16 more files in changeset.
MODE-1527 - Added AS7 support for configuring and working with text extractors. To validate the configuration, an Arquillian integration test was added as well.

    • -1
    • +1
    ./TikaTextExtractorRepositoryTest.java
  1. … 35 more files in changeset.
MODE-1527 - Updated the text extraction process to be triggered preemptively by the binary storage, when a binary value is created.

For this to be possible, the context of the extractor cannot contain any node-specific information. Also, this exposed an issue with the SharedLockingInputStream: if the stream is closed in the "read" methods, Tika's parsers will keep reading it over and over (effectively reopening it each time), causing either OOM errors or duplicate text. For this reason, the "close" call has been removed from the read methods.

    • -4
    • +20
    ./TikaTextExtractorIntegrationTest.java
  1. … 12 more files in changeset.