...

By default, Apache Tika uses basic file scanning to detect the content type of a file. It reads a few kilobytes from the start of the file and searches for evidence, a so-called magic number, that identifies what kind of file was uploaded. This mechanism is relatively fast, but it is not foolproof: an attacker can craft a file that contains a valid magic number but is actually a different file type.

One consequence is that magic number scanning identifies the newer Office documents (docx, xlsx, pptx, etc.) only as a generic "office document", without distinguishing between them. This means that you can upload a Word document with an xlsx extension, and the Runtime will accept it.

You can add an additional Tika module to your custom Runtime so that the contents of Office documents and other documents are inspected more thoroughly. The Runtime needs extra time for this inspection, but the result is more complete. Weigh this trade-off for your solution.

...

pom.xml
<dependencyManagement>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <version>${apache.tika.version}</version>
    </dependency>
    ...
  </dependencies>
</dependencyManagement>

...

pom.xml
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <scope>runtime</scope>
  </dependency>
  ...
</dependencies>

Note that the core module of Apache Tika is already included in the Runtime. Depending on how you set up the project, you may have to redefine the apache.tika.version property to the same value as the version that is shipped with the Runtime.
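
Redefining the property could look like the sketch below. It assumes the property is set in the properties section of your own pom.xml. The value is a placeholder, not a real version number: replace it with the Tika version that your Runtime ships.

```xml
<properties>
  <!-- Placeholder value (assumption): use the exact Tika version bundled with your Runtime -->
  <apache.tika.version>REPLACE_WITH_RUNTIME_TIKA_VERSION</apache.tika.version>
</properties>
```

Keeping this value identical to the bundled version avoids mixing the Tika core module from the Runtime with parser modules from a different release.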

...