This guide explains how your Quarkus application can use Apache Tika to parse documents.
Apache Tika is a content analysis toolkit that parses documents in PDF, Open Document, Excel, and many other well-known binary and text formats through a simple, uniform API. Both the document text and its properties (metadata) are available once the document has been parsed.
Configuration
Property Name | Default | Description |
---|---|---|
| | | Path to the Tika configuration resource. Typically a file named |
| | | A document may contain other embedded documents; for example, an Excel document may include PDF content. If such embedded content is available then, by default, it is appended to the content of the master document, so in this example the text extracted from the PDF file is appended to the text extracted from the Excel file. This property has to be set to |
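The append behaviour described in the table can be pictured with a small, self-contained model. This is only a sketch: the `Doc` class, the `masterText` method, and the newline separator are hypothetical names and choices made for illustration, and they are not part of Tika or of the Quarkus extension.

```java
import java.util.List;

// Hypothetical model of "append embedded content"; not the extension's implementation.
final class EmbeddedContentSketch {

    /** A parsed document: its own text plus any documents embedded in it. */
    static final class Doc {
        final String text;
        final List<Doc> embedded;

        Doc(String text, List<Doc> embedded) {
            this.text = text;
            this.embedded = embedded;
        }
    }

    /**
     * Returns the text a caller would see for the master document.
     * With appendEmbedded == true, the text of every embedded document is
     * appended after the master text (the separator is an assumption).
     * With appendEmbedded == false, only the master text is returned and the
     * embedded documents have to be read separately.
     */
    static String masterText(Doc master, boolean appendEmbedded) {
        if (!appendEmbedded) {
            return master.text;
        }
        StringBuilder sb = new StringBuilder(master.text);
        for (Doc child : master.embedded) {
            sb.append('\n').append(masterText(child, true));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Doc pdf = new Doc("text from the PDF", List.of());
        Doc excel = new Doc("text from the Excel sheet", List.of(pdf));
        System.out.println(masterText(excel, true));
        System.out.println(masterText(excel, false));
    }
}
```

In the model, switching the flag off leaves the master text untouched, which is the situation in which reading the embedded documents one by one becomes useful.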
Simple text extraction example
```java
package org.acme.tika;

import java.io.InputStream;

import javax.inject.Inject;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import io.quarkus.tika.TikaParser;

@Path("/parse")
public class TikaParserResource {

    @Inject
    TikaParser parser;

    // Accepts a PDF or OpenDocument text file and returns its text content.
    @POST
    @Path("/text")
    @Consumes({ "application/pdf", "application/vnd.oasis.opendocument.text" })
    @Produces(MediaType.TEXT_PLAIN)
    public String extractText(InputStream stream) {
        return parser.getText(stream);
    }
}
```
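To try the endpoint above, a client has to POST the raw document bytes with a `Content-Type` that matches one of the `@Consumes` media types. The following is a minimal client sketch using the JDK's `java.net.http` API. The class name `ParseTextClientSketch`, the base address `http://localhost:8080`, and the file path passed on the command line are assumptions for illustration, not something this guide prescribes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client for the /parse/text endpoint; names and address are assumptions.
final class ParseTextClientSketch {

    /** Builds a POST request that sends raw PDF bytes to /parse/text. */
    static HttpRequest buildRequest(URI baseUri, byte[] pdfBytes) {
        return HttpRequest.newBuilder(baseUri.resolve("/parse/text"))
                // Must match one of the media types listed in @Consumes.
                .header("Content-Type", "application/pdf")
                .header("Accept", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: ParseTextClientSketch <path-to-pdf>");
            return;
        }
        // Assumed address of a locally running application.
        URI baseUri = URI.create("http://localhost:8080");
        byte[] pdfBytes = Files.readAllBytes(Path.of(args[0]));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(baseUri, pdfBytes), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Sending a media type that is not listed in `@Consumes` would normally be rejected by the JAX-RS runtime before the resource method is called, so the header is worth checking first when a request fails.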
Embedded document text extraction example
```java
package org.acme.tika;

import java.io.InputStream;
import java.util.List;
import java.util.stream.Collectors;

import javax.inject.Inject;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import io.quarkus.tika.TikaContent;
import io.quarkus.tika.TikaParser;

@Path("/parse")
public class TikaParserResource {

    @Inject
    TikaParser parser;

    // Parses the master document and returns the text of each embedded document.
    @POST
    @Path("/text")
    @Consumes({ "application/pdf", "application/vnd.oasis.opendocument.text" })
    @Produces(MediaType.TEXT_PLAIN)
    public List<String> extractText(InputStream stream) {
        TikaContent masterContent = parser.parse(stream);
        return masterContent.getEmbeddedContent().stream()
                .map(TikaContent::getText)
                .collect(Collectors.toList());
    }
}
```