Get File Catalog

This page guides you through retrieving detailed information about a specific file stored in a Catalog using the Get File Catalog API.

The API returns the file's metadata (type, size, and processing status), the content produced by the pipelines that transformed the file, and the chunks associated with the file.

#Get File Catalog via API

#Example of Using fileUid

cURL

export INSTILL_API_TOKEN=********
curl --location 'https://api.instill.tech/v1alpha/namespaces/{namespaceId}/catalogs/{catalogId}?fileUid=9f1c8f09-52d6-4aca-8f61-58909d3adcde' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"

#Example of Using fileId

cURL

export INSTILL_API_TOKEN=********
curl --location 'https://api.instill.tech/v1alpha/namespaces/{namespaceId}/catalogs/{catalogId}?fileId=test.pdf' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $INSTILL_API_TOKEN"

Replace the {namespaceId} and {catalogId} path parameters with the Catalog owner's ID (namespace) and the ID of the Catalog you are querying. The fileUid and fileId query parameters identify the file within the Catalog that you want to retrieve information about; each example above passes one of them.
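The same request can be assembled programmatically. The following is a minimal Python sketch using only the standard library. The endpoint, headers, and query parameter names are taken from the cURL examples above; the helper names build_file_catalog_url and get_file_catalog are hypothetical, and requiring exactly one of fileUid or fileId is a choice made for this sketch, not a documented API rule.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.instill.tech/v1alpha"


def build_file_catalog_url(namespace_id, catalog_id, file_uid=None, file_id=None):
    """Build the request URL (hypothetical helper mirroring the cURL examples)."""
    # Assumption of this sketch: send exactly one identifier per request.
    if (file_uid is None) == (file_id is None):
        raise ValueError("pass exactly one of file_uid or file_id")
    path = "/namespaces/{}/catalogs/{}".format(
        urllib.parse.quote(namespace_id, safe=""),
        urllib.parse.quote(catalog_id, safe=""),
    )
    params = {"fileUid": file_uid} if file_uid is not None else {"fileId": file_id}
    return BASE_URL + path + "?" + urllib.parse.urlencode(params)


def get_file_catalog(namespace_id, catalog_id, token, **file_ref):
    """Send the GET request and return the decoded JSON body."""
    request = urllib.request.Request(
        build_file_catalog_url(namespace_id, catalog_id, **file_ref),
        headers={
            "Accept": "application/json",
            "Authorization": "Bearer " + token,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as get_file_catalog("my-namespace", "my-catalog", token, file_id="test.pdf") would then correspond to the second cURL example, where "my-namespace" and "my-catalog" are placeholder values and token holds the value of INSTILL_API_TOKEN.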

#Example Response

A successful response returns the file's metadata, the transformed content, and the chunks associated with the file. In the example below, the embedding arrays are truncated with ... for brevity:


{
"metadata": {
"fileUid": "example-file-uid",
"fileId": "example-file-id.pdf",
"fileType": "FILE_TYPE_PDF",
"fileSize": "12345",
"fileUploadTime": "2024-07-23T14:35:00Z",
"fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED"
},
"text": {
"pipelineIds": ["pipeline1", "pipeline2"],
"transformedContent": "Transformed content here...",
"transformedContentChunkNum": 10,
"transformedContentTokenNum": 1500,
"transformedContentUpdateTime": "2024-07-23T15:00:00Z"
},
"chunks": [
{
"uid": "chunk1-uid",
"type": "CHUNK_TYPE_TEXT",
"startPos": 0,
"endPos": 100,
"content": "This is a chunk of text.",
"tokensNum": 20,
"embedding": [0.1, 0.2, 0.3, ...],
"createTime": "2024-08-13T15:01:00Z",
"retrievable": true
},
{
"uid": "chunk2-uid",
"type": "CHUNK_TYPE_TEXT",
"startPos": 101,
"endPos": 200,
"content": "Another chunk of text.",
"tokensNum": 25,
"embedding": [0.4, 0.5, 0.6, ...],
"createTime": "2024-08-13T15:02:00Z",
"retrievable": true
}
]
}
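
As a rough illustration of reading this response in client code, the Python sketch below loads a trimmed copy of the example (chunks omitted) and pulls out a few fields. The helper name summarize_file and the keys of the summary it returns are hypothetical; only the JSON field names come from the example.

```python
import json

# Trimmed copy of the example response; the chunks array is left empty here.
SAMPLE = json.loads("""
{
  "metadata": {
    "fileUid": "example-file-uid",
    "fileId": "example-file-id.pdf",
    "fileType": "FILE_TYPE_PDF",
    "fileSize": "12345",
    "fileUploadTime": "2024-07-23T14:35:00Z",
    "fileProcessStatus": "FILE_PROCESS_STATUS_COMPLETED"
  },
  "text": {
    "pipelineIds": ["pipeline1", "pipeline2"],
    "transformedContent": "Transformed content here...",
    "transformedContentChunkNum": 10,
    "transformedContentTokenNum": 1500,
    "transformedContentUpdateTime": "2024-07-23T15:00:00Z"
  },
  "chunks": []
}
""")


def summarize_file(response):
    """Return a small summary dict (hypothetical helper for illustration)."""
    meta = response["metadata"]
    text = response.get("text", {})
    return {
        "file_id": meta["fileId"],
        # fileSize is a string in the response, so convert it before doing math.
        "size_bytes": int(meta["fileSize"]),
        "completed": meta["fileProcessStatus"] == "FILE_PROCESS_STATUS_COMPLETED",
        "chunk_count": text.get("transformedContentChunkNum", 0),
        "token_count": text.get("transformedContentTokenNum", 0),
    }


print(summarize_file(SAMPLE))
```

Checking fileProcessStatus first is a defensive choice in this sketch: the transformed content and chunk counts are only meaningful once processing has completed.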

#Output Description

  • originalData: The original file data encoded in base64. (coming soon)
  • metadata: An object containing metadata about the file.
    • fileUid (string): The unique identifier of the file.
    • fileId (string): The file's ID.
    • fileType (string): The type of the file, e.g., FILE_TYPE_TEXT, FILE_TYPE_PDF, FILE_TYPE_HTML, FILE_TYPE_PPTX, FILE_TYPE_DOCX.
    • fileSize (string): The size of the file in bytes.
    • fileUploadTime (string): The time when the file was uploaded.
    • fileProcessStatus (string): The processing status of the file, e.g., FILE_PROCESS_STATUS_COMPLETED, FILE_PROCESS_STATUS_FAILED.
  • text: An object containing the transformed text content.
    • pipelineIds (array): The IDs of the pipelines that processed the file.
    • transformedContent (string): The content transformed through the pipelines.
    • transformedContentChunkNum (integer): The number of chunks in the transformed content.
    • transformedContentTokenNum (integer): The number of tokens in the transformed content.
    • transformedContentUpdateTime (string): The last update time of the transformed content.
  • chunks: An array of objects, each representing a chunk of the file's content.
    • uid (string): The unique identifier of the chunk.
    • type (string): The type of the chunk, e.g., CHUNK_TYPE_TEXT.
    • startPos (integer): The start position of the chunk in the file.
    • endPos (integer): The end position of the chunk in the file.
    • content (string): The content of the chunk.
    • tokensNum (integer): The number of tokens in the chunk.
    • embedding (array): The embedding vector of the chunk.
    • createTime (string): The time when the chunk was created.
    • retrievable (boolean): Whether the chunk is retrievable.
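
The chunk fields described above can be combined in client code. The Python sketch below keeps only retrievable chunks, orders them by startPos, and totals their tokensNum values. The helper names are hypothetical, and the third chunk in the sample data is invented here to show the filter; it does not appear in the example response.

```python
def retrievable_chunks(response):
    """Return the retrievable chunks ordered by start position (hypothetical helper)."""
    chunks = [c for c in response.get("chunks", []) if c.get("retrievable")]
    return sorted(chunks, key=lambda c: c["startPos"])


def total_tokens(chunks):
    """Sum the tokensNum field over the given chunks."""
    return sum(c["tokensNum"] for c in chunks)


# Sample data: the first two chunks follow the example response (embeddings
# omitted); "chunk3-uid" is an invented, non-retrievable chunk for illustration.
sample = {
    "chunks": [
        {"uid": "chunk2-uid", "type": "CHUNK_TYPE_TEXT", "startPos": 101,
         "endPos": 200, "content": "Another chunk of text.", "tokensNum": 25,
         "retrievable": True},
        {"uid": "chunk1-uid", "type": "CHUNK_TYPE_TEXT", "startPos": 0,
         "endPos": 100, "content": "This is a chunk of text.", "tokensNum": 20,
         "retrievable": True},
        {"uid": "chunk3-uid", "type": "CHUNK_TYPE_TEXT", "startPos": 201,
         "endPos": 300, "content": "A hidden chunk.", "tokensNum": 5,
         "retrievable": False},
    ]
}

selected = retrievable_chunks(sample)
print([c["uid"] for c in selected])  # chunk uids in document order
print(total_tokens(selected))        # token total for retrievable chunks only
```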

Use this API to inspect a file stored in a Catalog at a granular level, from its metadata and transformed content down to individual chunks.