Fix randomly failing test

  1. … 1 more file in changeset.
Fix randomly failing test

  1. … 1 more file in changeset.
Fix test randomly failing depending on the OS end-of-line character

Fix test randomly failing depending on the OS end-of-line character

COR-356 : fix all javadoc errors raised by doclint (JDK 8)

  1. … 54 more files in changeset.
COR-356 : fix all javadoc errors raised by doclint (JDK 8)

  1. … 54 more files in changeset.
COR-354: Upgrade the versions of pdfbox, poi, tika

Fix description:

* Update the versions in the main pom

* Remove InvalidPasswordException in PDDocument.decrypt(). This change has been in place since pdfbox 1.8.6 (PDFBOX-1474).

* TIKA-1400 (tika 1.10) extracts the header and footer of Excel files (.xls). The information is then put into class "outside".

The expected output of TestMSExcelOnTikaDocumentReader must therefore be updated.

  1. … 2 more files in changeset.
COR-337: Add new XXE test - external entity points to non-existing resource

COR-338: Add XXE unit test - external entity points to non-existing resource

COR-338: Add XXE unit test - External Entity points to non-existing resource

COR-334: Fix the test TestPropertiesExtractionOnTika.testPPTDocumentReaderService()

COR-333: TikaDocumentReader causes 'Unparseable date'

  1. … 4 more files in changeset.
COR-333: TikaDocumentReader causes 'Unparseable date'

Fix description:

* Don't convert the date value extracted from the document's properties to the String form of Java's Date object. That format doesn't conform to the ISO 8601 standard used in JCR.
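
To illustrate the point above: Date.toString() yields a locale-style string such as "Thu Jan 01 00:00:00 UTC 1970", which an ISO 8601 parser rejects. The sketch below formats the same value in an ISO 8601 form instead. The class name, the method name and the exact pattern (millisecond precision, UTC offset) are assumptions made for illustration; they are not taken from the changeset.

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper, not part of the eXo Core code base.
class Iso8601DateSketch {
    // Assumed target layout: date, 'T', time with milliseconds, then the
    // zone designator ("Z" for UTC, otherwise +hh:mm).
    private static final DateTimeFormatter ISO_8601 =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    // Formats a java.util.Date as an ISO 8601 timestamp in UTC rather than
    // relying on Date.toString().
    static String toIso8601(Date date) {
        return ISO_8601.format(date.toInstant().atOffset(ZoneOffset.UTC));
    }
}
```

For example, Iso8601DateSketch.toIso8601(new Date(0L)) returns "1970-01-01T00:00:00.000Z".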

  1. … 4 more files in changeset.
COR-333: TikaDocumentReader causes 'Unparseable date'

  1. … 4 more files in changeset.
COR-337: Fix vulnerabilities related to XML parsing

  1. … 7 more files in changeset.
COR-338: Fix vulnerabilities related to XML parsing

  1. … 7 more files in changeset.
COR-338: Fix vulnerabilities related to XML parsing

  1. … 7 more files in changeset.
COR-338: Fix vulnerabilities relating to XML parsing

Fix description:

* Use Apache poi-ooxml 3.8-eXo01, which:

** Switches from dom4j to JAXP (SAX)

** Adds a new helper class: SAXHelper

* Use SAXHelper instead of SAXParser in eXo Core's XML Document parsers

* Upgrade xmlbeans from 2.3 to 2.6 for MSXWordDocumentReader.

Both Apache poi-ooxml 3.8-eXo01 and XMLBeans 2.6 add an XMLReader class to read the XML document before parsing.

The XMLReader created by SAXHelper is configured to prevent XEE/XXE attacks by capping entity expansion and disabling external entities.
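
The kind of hardening described above can be sketched with standard JAXP features only. This is not the SAXHelper class from poi-ooxml 3.8-eXo01, whose source is not shown here; the class and method names below are hypothetical, and the feature URIs are the ones understood by the JDK's built-in parser.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; the real SAXHelper may be configured differently.
class HardenedXmlReaderSketch {
    static XMLReader newHardenedReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Secure processing makes the parser enforce limits such as a
        // maximum number of entity expansions (guards against XEE).
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Do not load external general or parameter entities (guards against XXE).
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory.newSAXParser().getXMLReader();
    }

    // Collects the character data of a document parsed with the hardened reader.
    static String extractText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = newHardenedReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

With this configuration, a document that declares an external entity pointing to a non-existing resource is parsed without the parser trying to fetch that resource; the entity reference is skipped.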

  1. … 7 more files in changeset.
COR-332: Improved the test testGetContentAsString2

Update testGetContentAsString2

COR-332: Add a unit test to test the limit

  1. … 1 more file in changeset.
COR-329: Add a unit test to test the limit

  1. … 1 more file in changeset.
COR-332: Fixed the issue with the slide order

  1. … 2 more files in changeset.
COR-329: Fixed the issue with the slide order

  1. … 2 more files in changeset.
COR-332: getContentAsText and getProperties of MSXPPTDocumentReader now parse the content with SAX

  1. … 3 more files in changeset.
COR-329: getContentAsText and getProperties of MSXPPTDocumentReader now parse the content with SAX

  1. … 3 more files in changeset.
COR-331: Implement MSPPTXStreamDocumentReader using SAXParser

Problem analysis:

* Apache POI provides only an in-memory model for MS PPTX files.

In this model, SAXParser is invoked too many times (three times the number of slides), even just to get some metadata.

It is therefore unsuitable for parsing very large files (in terms of slide count).

Fix description:

* Implement a new document reader for PPTX files by reading the stream.

* Get meta data information directly from the corresponding file (core.xml) if this file exists.

* Parse and index the text of only a limited number of leading slides.
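
The streaming approach listed above can be sketched as follows. This is not the MSPPTXStreamDocumentReader implementation; the class name, the maxSlides parameter and the use of the number in the slide file name as the slide order are all assumptions made for illustration (the real order is defined by the presentation part, and metadata extraction from core.xml is left out).

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of reading a PPTX package as a stream of ZIP entries.
class PptxStreamSketch {
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";
    private static final Pattern SLIDE = Pattern.compile("ppt/slides/slide(\\d+)\\.xml");

    // Collects the content of DrawingML text runs (<a:t>) in one slide.
    private static final class SlideTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inTextRun;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            inTextRun = DRAWINGML_NS.equals(uri) && "t".equals(localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTextRun) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (inTextRun) { text.append(' '); inTextRun = false; }
        }
    }

    // Returns the text of slides 1..maxSlides, one line per slide.
    static String extractText(InputStream pptx, int maxSlides) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // ZIP entries may arrive in any order, so sort by slide number.
        Map<Integer, String> slides = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                Matcher m = SLIDE.matcher(entry.getName());
                if (!m.matches()) continue;
                int number = Integer.parseInt(m.group(1));
                if (number > maxSlides) continue; // skip slides beyond the limit
                SlideTextHandler handler = new SlideTextHandler();
                // The parser closes its input when done; shield the ZIP stream.
                InputStream slide = new FilterInputStream(zip) {
                    @Override
                    public void close() { /* keep the ZIP stream open */ }
                };
                factory.newSAXParser().parse(slide, handler);
                slides.put(number, handler.text.toString().trim());
            }
        }
        return String.join("\n", slides.values());
    }
}
```

Only one slide is parsed at a time and no object model of the presentation is built, so memory use depends on the slide limit rather than on the total number of slides.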

  1. … 4 more files in changeset.
COR-329: Streaming parser for MSXPPTDocumentReader

Fix description:

* Implement a streaming model to get the properties and content of Microsoft PowerPoint files (OOXML).

* Index the content of the first 500 slides.

  1. … 4 more files in changeset.
COR-333: Re-add the todos related to dates

COR-334: Re-add the todos related to dates