trangvh <trangvh@exoplatform.com> in eXo-JCR-core

COR-354: Upgrade the versions of pdfbox, poi, tika

Fix description:

* Update the versions in the main pom

* Remove InvalidPasswordException in PDDocument.decrypt(). This change has existed since pdfbox 1.8.6 (PDFBOX-1474).

* TIKA-1400 (tika 1.10) extracts the header and footer of Excel files (.xls). This information is then put into the class "outside".

The expected output of TestMSExcelOnTikaDocumentReader must therefore be updated.

Merge branch 'fix/2.5.12-GA/COR-338' into stable/2.5.x

PLF-6122: Update dependencies to next snapshot

Update testGetContentAsString2

COR-331: Implement MSPPTXStreamDocumentReader using SAXParser

Problem analysis:

* Apache's POI provides only an in-memory model for MS PPTX files.

In this model, SAXParser is invoked too many times (three times the number of slides), even just to get some metadata.

It is therefore unsuitable for parsing files with a very large number of slides.

Fix description:

* Implement a new document reader for PPTX files by reading the stream.

* Get metadata directly from the corresponding file (core.xml) if this file exists.

* Parse and index the text of only a limited number of leading slides.
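
The approach described above can be sketched roughly as follows. This is not the code of MSPPTXStreamDocumentReader; it is a minimal illustration using only the JDK. The class name PptxStreamSketch, the helper TextCollector and the maxSlides parameter are hypothetical. The sketch also assumes the conventional entry names docProps/core.xml and ppt/slides/slideN.xml and the usual dc: and a: prefixes, which a production reader should resolve through namespaces instead.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: stream a PPTX once, with one SAX pass per relevant entry. */
final class PptxStreamSketch {

    // Assumed naming convention for slide parts inside the package.
    private static final Pattern SLIDE = Pattern.compile("ppt/slides/slide(\\d+)\\.xml");

    /** Appends the character data of every element with the given qualified name. */
    static final class TextCollector extends DefaultHandler {
        private final String wanted;
        private final StringBuilder out;
        private boolean inside;

        TextCollector(String wanted, StringBuilder out) {
            this.wanted = wanted;
            this.out = out;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (wanted.equals(qName)) {
                inside = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inside) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (wanted.equals(qName)) {
                inside = false;
                out.append(' '); // separate consecutive text runs
            }
        }
    }

    /** Returns {title, text}; only slides numbered 1..maxSlides contribute text. */
    static String[] read(InputStream pptx, int maxSlides) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();

        StringBuilder title = new StringBuilder();
        StringBuilder text = new StringBuilder();

        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            // The parser may close its input; shield the zip stream from that.
            InputStream entryView = new FilterInputStream(zip) {
                @Override
                public void close() {
                    // intentionally empty
                }
            };
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                Matcher m = SLIDE.matcher(e.getName());
                if ("docProps/core.xml".equals(e.getName())) {
                    parser.parse(entryView, new TextCollector("dc:title", title));
                } else if (m.matches() && Integer.parseInt(m.group(1)) <= maxSlides) {
                    parser.parse(entryView, new TextCollector("a:t", text));
                }
                // Every other entry is skipped without being parsed.
            }
        }
        return new String[] { title.toString().trim(), text.toString().trim() };
    }
}
```

Calling PptxStreamSketch.read(stream, 2) returns the title from core.xml (an empty string if that entry is absent) and the text of slides 1 and 2, in the order the entries appear in the archive. Each relevant entry is parsed exactly once, so the number of SAX passes is bounded by maxSlides plus one regardless of how many slides the file contains.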

    • binary
    /exo.core.component.document/src/test/resources/eXo-JCR-1.15.pptx