Configuring Document Extraction
Atolio can run document extraction on a variety of sources. Under the hood, it uses Apache Tika. Document extraction (referred to as DocEx throughout the remainder of this document) is currently disabled by default. This document covers how to enable and configure it.
Enable DocEx & Tika
Both Tika and DocEx are disabled by default. To get document extraction running, first enable Tika Helm so that Terraform manages the tika-pipes project:
# config.hcl
disable_tika_helm = false
Azure deployments must override the image repository for Tika. Create or modify values-tika.yaml:
image:
  # Override for Azure with the relevant atolioimages.azurecr.io reference
  repository: atolioimages.azurecr.io/tika
  pullPolicy: IfNotPresent
  tag: 4.14.0 # Adjust to desired version
Then, enable the DocEx service and deployment. In values-lumen.yaml:
enableDocex: true
With these changes in place, update the infrastructure:
./scripts/create-infra.sh --name={deployment name}
Tika will be deployed under the tika namespace and DocEx under the atolio-svc namespace. In future versions of Atolio, Tika and DocEx will be on by default.
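As a quick check after the infrastructure update, the pods in both namespaces can be listed. The following is a minimal sketch, assuming kubectl is installed and configured against the deployment's cluster. The list_pods helper and the KUBECTL override variable are hypothetical names introduced here for illustration and are not part of Atolio's tooling.

```shell
# Hypothetical helper (not part of Atolio's scripts): list the pods in one namespace.
# KUBECTL is an assumed override for the kubectl binary; it defaults to "kubectl".
list_pods() {
  "${KUBECTL:-kubectl}" get pods --namespace "$1"
}

# Usage, with the namespaces named in this document:
#   list_pods tika
#   list_pods atolio-svc
```

If the deployment succeeded, each call should list the pods for the corresponding service. The exact pod names depend on the deployment and are not specified here.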