Process Files

This page shows you how to process files you have uploaded to your Catalog into a unified AI- and RAG-ready format.

Processing uses three preset 💧 Instill VDP pipelines: Transformation, Splitting, and Embedding.

  1. Transformation: Converts different file types into a single source of truth, primarily Markdown for non-text files such as PDFs. Process:

    • PDF Files: Converted to Markdown for a unified textual representation.
    • Plain Text Files (.txt, .md): Text is directly extracted.

    For more details, please refer to the VDP Pipeline: Indexing Convert PDF.

  2. Splitting: Breaks the single source of truth into smaller chunks, which makes search more efficient and keeps each chunk within the embedding model's context window. Process:

    • Markdown Text: Uses headings to determine optimal splitting points.
    • Plain Text Files: Employs a recursive strategy for segmentation without explicit headings.

    For more details, please refer to the Indexing Split Markdown and Indexing Split Text VDP Pipelines.

  3. Embedding: Converts the chunks into vector representations using an embedding model and stores those vectors as part of the Instill Catalog. Process:

    • The chunks obtained from the Splitting step are transformed into vector representations using an embedding model.
    • These vectors are efficiently stored in Instill Catalog for low-latency retrieval.

    For more details, refer to the VDP Pipeline: Indexing Embed.
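As a rough illustration of the Splitting step described above, the following Python sketch shows one way heading-based and recursive splitting can work. It is not the implementation used by the Indexing Split Markdown or Indexing Split Text pipelines: the function names, the `max_chars` limit, and the separator order are assumptions made for this example, and the sketch does not handle fenced code blocks or merge small pieces back together.

```python
import re


def split_markdown_by_headings(markdown: str) -> list[str]:
    """Start a new chunk at every ATX heading line (hypothetical helper)."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A heading is 1 to 6 '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def split_text_recursive(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split on coarse separators first, then finer ones, until pieces fit."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                # Each part no longer contains `sep`, so recursion moves on
                # to the next, finer separator.
                chunks.extend(split_text_recursive(part, max_chars, separators[i:]))
            return chunks
    # No separator left: fall back to a hard cut every max_chars characters.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]
```

Calling `split_markdown_by_headings` on a converted PDF yields one chunk per heading section, while `split_text_recursive` covers plain text that has no headings to split on.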

INFO

If you are using your own deployment of 🔮 Instill Core, you must set up the environment with a valid OpenAI API key for the Embedding stage to work. Follow the instructions on the Configuration page to set this up.

Process Files via API

cURL

export INSTILL_API_TOKEN=********
curl -X POST 'https://api.instill.tech/v1alpha/catalogs/files/processAsync' \
--header "Authorization: Bearer $INSTILL_API_TOKEN" \
--header "Content-Type: application/json" \
--data-raw '{
"fileUids": ["fileUid1", "fileUid2"]
}'

Note that if the endpoint you call includes a {namespaceId} path parameter, it must be replaced by the Catalog owner's ID (namespace).

The fileUids field in the request body contains an array of strings, representing the unique identifiers (UIDs) of the files to be processed.
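For readers who prefer Python, the same request can be sketched with the standard library alone. This is a minimal translation of the cURL example above, not an official client: the helper names `build_process_request` and `process_files` are hypothetical, and the URL, headers, and body are copied from the cURL command.

```python
import json
import urllib.request

# Endpoint taken verbatim from the cURL example.
API_URL = "https://api.instill.tech/v1alpha/catalogs/files/processAsync"


def build_process_request(file_uids: list[str], api_token: str) -> urllib.request.Request:
    """Build (without sending) the POST request that starts file processing."""
    body = json.dumps({"fileUids": file_uids}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def process_files(file_uids: list[str], api_token: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_process_request(file_uids, api_token)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To run it, read the token from the `INSTILL_API_TOKEN` environment variable (for example with `os.environ`) and call `process_files(["fileUid1", "fileUid2"], token)`, replacing the placeholder UIDs with those of your uploaded files.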

Process Files via 📺 Instill Console

To process files from 📺 Instill Console, follow these steps:

  1. Launch 📺 Instill Console on ☁️ Instill Cloud or via a local 🔮 Instill Core deployment at http://localhost:3000.
  2. Navigate to the Artifacts page using the navigation bar.
  3. Ensure that you have followed the steps in the Upload Files page.
  4. Click the Process Files button.

The processing status of your files appears in the Files tab. When the status is Completed, you can view your Files and Chunks, and also use the Chunk Search API.