Process Files

This page shows you how to process files you have uploaded to your Catalog into a unified AI- and RAG-ready format.

Processing uses three preset 💧 Instill VDP pipelines, applied in order: Transformation, Splitting, and Embedding.

  1. Transformation: Converts different file types into a single source of truth. Files such as PDF, DOCX, PPTX, XLSX, and CSV are converted primarily to Markdown. Process:

    • Plain Text Files (.txt, .md): Text is directly extracted.

    • Other File Types: Converted to Markdown for a unified textual representation.

    For more details, please refer to the VDP Pipeline: Indexing Convert PDF.

  2. Splitting: Breaks the single source of truth into smaller chunks, which makes search more efficient and keeps each chunk within the embedding model's context window. Process:

    • Markdown Text: Uses headings to determine optimal splitting points.
    • Plain Text Files: Employs a recursive strategy for segmentation without explicit headings.

    For more details, please refer to the Indexing Split Markdown and Indexing Split Text VDP Pipelines.

  3. Embedding: Converts chunks into vector representations using an embedding model and stores them as part of the Instill Catalog. Process:

    • Each chunk produced by the Splitting step is passed through the embedding model to produce a vector.
    • The vectors are stored in Instill Catalog for low-latency retrieval.

    For more details, refer to the VDP Pipeline: Indexing Embed.
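
The two Splitting strategies described in step 2 can be illustrated with a minimal Python sketch. This is not the code used by the Indexing Split Markdown or Indexing Split Text pipelines. The function names, the separator order, and the `max_chars` limit are assumptions chosen for illustration, and the sketch ignores details such as headings inside fenced code blocks or merging small pieces back together.

```python
import re

# A Markdown ATX heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_markdown_by_headings(markdown: str) -> list[str]:
    """Start a new chunk at every heading line (illustrative only)."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def split_text_recursively(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split on the coarsest separator present, then recurse with finer ones."""
    if len(text) <= max_chars:
        return [text] if text else []
    for index, separator in enumerate(separators):
        if separator in text:
            chunks: list[str] = []
            for part in text.split(separator):
                chunks.extend(
                    split_text_recursively(part, max_chars, separators[index + 1:])
                )
            return chunks
    # No separator left: fall back to fixed-size slices.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

In this sketch the Markdown splitter keeps each heading together with the text that follows it, while the plain-text splitter only falls back to finer separators when a piece is still longer than `max_chars`.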

INFO
  • Instill Cloud Users: When using ☁️ Instill Cloud, processing files will consume Instill Credits due to the usage of the preset pipelines in Catalog construction. The amount of credits consumed depends on the number of tokens and chunks generated during processing; larger or longer files will consume more credits.
  • Instill Core Users: If you are using your own deployment of 🔮 Instill Core, you must set up the environment with a valid OpenAI API key for the Embedding stage to work. Please follow the instructions on the Configuration page to correctly set this up. Processing files may incur costs from OpenAI API usage based on the number of tokens processed.

#Process Files via API

You can process files in your Catalog by making a POST request to the processAsync endpoint.

cURL

export INSTILL_API_TOKEN=********
curl -X POST 'https://api.instill.tech/v1alpha/catalogs/files/processAsync' \
--header "Authorization: Bearer $INSTILL_API_TOKEN" \
--header "Content-Type: application/json" \
--data-raw '{
"fileUids": ["fileUid1", "fileUid2"]
}'
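
The same request can be sketched in Python. This version uses only the standard library and mirrors the cURL call above: same URL, same headers, same body. The helper names `build_process_request` and `process_files` are illustrative, not part of any Instill SDK.

```python
import json
import urllib.request

API_URL = "https://api.instill.tech/v1alpha/catalogs/files/processAsync"


def build_process_request(file_uids, api_token):
    """Build the POST request without sending it."""
    body = json.dumps({"fileUids": list(file_uids)}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def process_files(file_uids, api_token):
    """Send the request and return the decoded JSON response."""
    request = build_process_request(file_uids, api_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (requires a valid token and real file UIDs):
#   import os
#   result = process_files(["fileUid1", "fileUid2"], os.environ["INSTILL_API_TOKEN"])
```

Separating request construction from sending makes it possible to inspect the request before any network call is made.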

#Body Parameters

  • fileUids (array of strings, required): An array of file UIDs that you want to process.

Notes:

  • You obtain each file's fileUid when you upload the file to the Catalog.
  • Processing is asynchronous: the request returns before processing finishes. To check progress, retrieve the file information from the Catalog and inspect its processStatus.
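
Because processing is asynchronous, a client typically polls until the file reaches a final status. The following is a minimal sketch of such a loop. The `fetch_status` callable is hypothetical: it stands in for whatever call you use to retrieve the file information from the Catalog, which this page does not specify. Treating COMPLETED and FAILED as the only final statuses is also an assumption based on the status names listed on this page.

```python
import time
from typing import Callable

# Assumed to be the statuses after which no further change occurs.
TERMINAL_STATUSES = {
    "FILE_PROCESS_STATUS_COMPLETED",
    "FILE_PROCESS_STATUS_FAILED",
}


def wait_for_processing(
    fetch_status: Callable[[str], str],
    file_uid: str,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll fetch_status(file_uid) until a terminal status or a timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(file_uid)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file_uid} still in {status} after {timeout_seconds}s")
        time.sleep(interval_seconds)
```

The polling interval and timeout defaults are arbitrary; choose values that suit the size of your files.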

#Example Response

A successful response will return a JSON object containing the list of files that are being processed.


{
  "files": [
    {
      "fileUid": "fileUid1",
      "name": "example.pdf",
      "type": "FILE_TYPE_PDF",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "102400"
    },
    {
      "fileUid": "fileUid2",
      "name": "document.txt",
      "type": "FILE_TYPE_TEXT",
      "processStatus": "FILE_PROCESS_STATUS_WAITING",
      "size": "20480"
    }
  ]
}

#Output Description

  • files: An array of file objects that are being processed.
    • fileUid (string): The unique identifier of the file.
    • name (string): The name of the file.
    • type (string): The type of the file (e.g., FILE_TYPE_PDF, FILE_TYPE_TEXT).
    • processStatus (string): The current processing status of the file. Possible values include:
      • FILE_PROCESS_STATUS_NOTSTARTED
      • FILE_PROCESS_STATUS_WAITING
      • FILE_PROCESS_STATUS_CONVERTING
      • FILE_PROCESS_STATUS_CHUNKING
      • FILE_PROCESS_STATUS_EMBEDDING
      • FILE_PROCESS_STATUS_COMPLETED
      • FILE_PROCESS_STATUS_FAILED
    • size (string): The size of the file in bytes.
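
As a sketch of how a client might consume this output, the snippet below parses the response into typed records. The `ProcessedFile` class and `parse_process_response` function are illustrative names, not part of the API. Note that `size` arrives as a string, so the sketch converts it to an integer.

```python
import json
from dataclasses import dataclass


@dataclass
class ProcessedFile:
    file_uid: str
    name: str
    type: str
    process_status: str
    size_bytes: int


def parse_process_response(payload: str) -> list[ProcessedFile]:
    """Convert the JSON response text into a list of ProcessedFile records."""
    data = json.loads(payload)
    return [
        ProcessedFile(
            file_uid=item["fileUid"],
            name=item["name"],
            type=item["type"],
            process_status=item["processStatus"],
            size_bytes=int(item["size"]),  # the API returns size as a string
        )
        for item in data.get("files", [])
    ]
```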

#Process Files via 📺 Instill Console

To process files from 📺 Instill Console, follow these steps:

  1. Launch 📺 Instill Console on ☁️ Instill Cloud or via a local 🔮 Instill Core deployment at http://localhost:3000.
  2. Navigate to the Artifacts page using the navigation bar.
  3. Make sure you have uploaded your files by following the steps on the Upload Files page.
  4. Click the Process Files button.

The processing status of your files appears in the Files tab. When the status is Completed, you can view your Files and Chunks, and also use the Retrieve Chunks API.

Note for Instill Cloud Users: Processing files will consume Instill Credits based on the amount of data processed. Larger files with more content will consume more credits.

Note for Instill Core Users: Ensure that you have set up a valid OpenAI API key in your environment configuration to enable the Embedding stage of the processing.