Configuring Document Extraction

    Atolio can run document extraction on a variety of sources. Under the hood, this uses Apache Tika. Document extraction (referred to as DocEx in the remainder of this document) is currently disabled by default. This document covers how to enable and configure it.

    Enable DocEx & Tika

    By default, Tika and DocEx are disabled. To get document extraction running, first enable the Tika Helm chart, which ensures that Terraform manages the tika-pipes project:

    # config.hcl
    disable_tika_helm = false
    

    Azure deployments must override the image repository for Tika. Create or modify values-tika.yaml:

    image:
      # Override for Azure with the relevant atolioimages.azurecr.io reference
      repository: atolioimages.azurecr.io/tika
      pullPolicy: IfNotPresent
      tag: 4.14.0 # Adjust to desired version
    

    Then, enable the DocEx service and deployment. In values-lumen.yaml:

    enableDocex: true
    

    With these settings in place, update the infrastructure:

    ./scripts/create-infra.sh --name={deployment name}
    

    Tika is deployed in the tika namespace and DocEx in the atolio-svc namespace. In future versions of Atolio, Tika and DocEx will be enabled by default.
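
    One way to confirm the rollout is to list the pods in both namespaces once the infrastructure update finishes. The helper below is a minimal sketch, not part of the Atolio tooling: the function name check_docex_pods is hypothetical, and it assumes kubectl is installed and already pointed at the deployment's cluster.

```shell
# Hypothetical helper (not shipped with Atolio): list pods in the two
# namespaces named above. Assumes kubectl targets the right cluster.
check_docex_pods() {
  for ns in tika atolio-svc; do
    echo "== namespace: ${ns} =="
    # Stop at the first namespace that cannot be queried.
    kubectl get pods --namespace "${ns}" || return 1
  done
}
```

    Calling check_docex_pods after create-infra.sh completes should list the Tika pods and the atolio-svc pods, DocEx among them. The exact pod names depend on the deployment.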