Configuring Document Extraction

Atolio can run document extraction on content from a variety of sources. Under the hood, this uses the battle-tested Apache Tika software. Document extraction (referred to as DocEx throughout the rest of this document) is currently disabled by default. This document covers how to enable and configure it.

Enable DocEx & Tika

By default, Tika and DocEx are disabled. To get document extraction running, first enable the Tika Helm deployment by setting disable_tika_helm to false, which ensures that Terraform manages the tika-pipes project:

# config.hcl
disable_tika_helm = false

Azure deployments must override the image repository for Tika. Create or modify values-tika.yaml:

image:
  # Override for Azure with the relevant atolioimages.azurecr.io reference
  repository: atolioimages.azurecr.io/tika
  pullPolicy: IfNotPresent
  tag: 4.14.0 # Adjust to desired version

Then, enable the DocEx service and deployment. In values-lumen.yaml:

enableDocex: true

With all of this in place, update the infrastructure:

./scripts/create-infra.sh --name={deployment name}

Tika will be deployed under the tika namespace and DocEx under the atolio-svc namespace. In future versions of Atolio, Tika and DocEx will be on by default.
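
Once the infrastructure update completes, it can be useful to confirm that pods are running in both namespaces. The following is a minimal sketch rather than part of the Atolio tooling: the function name check_docex_pods and the KUBECTL override variable are hypothetical, and it assumes kubectl is installed and configured for the target cluster.

```shell
# Hypothetical helper (not part of Atolio's scripts): list the pods in the
# two namespaces mentioned above so you can check that they are running.
# Set KUBECTL to use a different kubectl binary; defaults to "kubectl".
check_docex_pods() {
  local kubectl_bin="${KUBECTL:-kubectl}"
  local ns
  for ns in tika atolio-svc; do
    echo "== pods in namespace: ${ns} =="
    # Stop at the first namespace that cannot be queried.
    "${kubectl_bin}" get pods --namespace "${ns}" || return 1
  done
}
```

Call check_docex_pods after create-infra.sh finishes. Pod names and readiness timing depend on the deployment, so treat the output as a quick sanity check only.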