This guide explains how your Quarkus application can use Apache Tika to parse the documents.

Apache Tika is a content analysis toolkit which is used to parse the documents in PDF, Open Document, Excel and many other well known binary and text formats using a simple uniform API. Both the document text and properties (metadata) are available once the document has been parsed.

Creating the Maven Project

First, we need a new project. Create a new project with the following command:

mvn io.quarkus:quarkus-maven-plugin:0.23.2:create \
    -DprojectGroupId=org.acme \
    -DprojectArtifactId=using-tika \
    -Dextensions="tika"

This command generates a Maven project, importing the tika extension.

If you already have your Quarkus project configured you can add the tika extension to your project by running the following command in your project base directory.

mvn quarkus:add-extension -Dextensions="tika"

This will add the following to your pom.xml:

    <dependency>
        <groupId>io.quarkus</groupId>
        <artifactId>quarkus-tika</artifactId>
    </dependency>

Configuration

Property Name Default Description

quarkus.tika.tika-config-path

The resource path within the application artifact to the Tika configuration resource. Typically a file named tika-config.xml is added to the root of the application resources path. The default configuration will be used if no configuration resource is available.

quarkus.tika.parsers

Comma-separated list of the abbreviated or full parser class names which have to be loaded by the extension. A pdf abbreviation can be used to refer to the PDF parser and custom abbreviations must be used for all other parsers. Note that this property is mutually exclusive with the tika-config-path property.

quarkus.tika.append-embedded-content

true

The document may have other embedded documents, for example, an Excel document may include a PDF content. If such an embedded content is available then, by default, it will be appended to the content of the master document, thus, in this example, the text extracted from PDF file will be appended to the text extracted from the Excel file. This property has to be set to false if one needs to access the content of the master and each of the embedded documents individually.

Simple text extraction example

package org.acme.tika;

import java.io.InputStream;
import javax.inject.Inject;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;


import io.quarkus.tika.TikaParser;

@Path("/parse")
public class TikaParserResource {

    @Inject
    TikaParser parser;

    @POST
    @Path("/text")
    @Consumes({ "application/pdf", "application/vnd.oasis.opendocument.text" })
    @Produces(MediaType.TEXT_PLAIN)
    public String extractText(InputStream stream) {
        return parser.getText(stream);
    }
}

Embedded document text extraction example

package org.acme.tika;

import java.io.InputStream;
import java.util.List;
import java.util.stream.Collectors;

import javax.inject.Inject;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import io.quarkus.tika.TikaContent;
import io.quarkus.tika.TikaParser;

@Path("/parse")
public class TikaParserResource {

    @Inject
    TikaParser parser;

    @POST
    @Path("/text")
    @Consumes({ "application/pdf", "application/vnd.oasis.opendocument.text" })
    @Produces(MediaType.TEXT_PLAIN)
    public List<String> extractText(InputStream stream) {
        TikaContent masterContent = parser.parse(stream);
        return masterContent.getEmbeddedContent().stream()
            .map(tc -> tc.getText())
            .collect(Collectors.toList());
    }
}