Document Extraction API

Updated March 25, 2026 17:48

The Document Extraction API extracts text and optional metadata from uploaded files.

Endpoint

Method: POST
Endpoint pattern: https://<extractor-host>/api/v1/<app_id>/extract
Content type: multipart/form-data

UI Navigation

Find the extractor endpoint in Site Search > App Settings > All APIs > Document Extractor.

Figure: Document Extractor endpoint panel in Site Search app settings.

Authentication

Authenticate extractor calls with the Discovery API key, sent in the Authorization header using the Token scheme.

Authorization: Token <DISCOVERY_API_KEY>

Note: Raw key, Basic, and x-api-key header variants are not accepted by this endpoint.

Find the Discovery key in Site Search > App Settings > All APIs > Discovery.

Figure: Discovery API key panel in Site Search app settings.

Request Parameters

Name	Required	Type	Description
Authorization	Yes	header	Use `Authorization: Token <DISCOVERY_API_KEY>` for extractor endpoint calls.
file	Yes	form field	Uploaded document content for extraction (`multipart/form-data`).
pages	No	form field	Page range selector (for example, `1,3,5-8`) that limits extraction to specific pages of paginated formats.
include_metadata	No	form field	Boolean; set to `true` to include document metadata in the response.
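
A malformed `pages` value can be caught client-side before uploading. The sketch below is illustrative only: `parse_pages` is a hypothetical helper, and it assumes the selector grammar is limited to comma-separated page numbers and inclusive `start-end` ranges, as suggested by the `1,3,5-8` example. The grammar the server actually accepts is not documented here.

```python
import re

# Assumed grammar: "N" or "N-M" entries separated by commas (ASCII digits only).
_ENTRY = re.compile(r"(\d+)(?:-(\d+))?", re.ASCII)


def parse_pages(spec: str) -> list[int]:
    """Expand a selector such as "1,3,5-8" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        match = _ENTRY.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"invalid page entry: {part!r}")
        start = int(match.group(1))
        # A single page is treated as a one-page range.
        end = int(match.group(2)) if match.group(2) else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Validating locally avoids spending an upload on a request that the selector alone would invalidate.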

Request Example

curl -sS -X POST \
  -H "Authorization: Token <DISCOVERY_API_KEY>" \
  -F "file=@./sample.pdf" \
  -F "include_metadata=true" \
  "https://<extractor-host>/api/v1/<app_id>/extract"

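For callers that are not using curl, the same request can be assembled with the Python standard library. This is a minimal sketch, not an official client: `build_extract_request` is a hypothetical helper, the host, app ID, and key are placeholders supplied by the caller, and the multipart body is built by hand in a simplified way (it assumes the filename contains no double quotes or line breaks).

```python
import urllib.request
import uuid


def build_extract_request(host: str, app_id: str, api_key: str,
                          filename: str, content: bytes,
                          include_metadata: bool = False) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for the extract endpoint."""
    boundary = uuid.uuid4().hex
    chunks = [
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        b"\r\n",
    ]
    if include_metadata:
        chunks += [
            f"--{boundary}\r\n".encode(),
            b'Content-Disposition: form-data; name="include_metadata"\r\n\r\n',
            b"true\r\n",
        ]
    # Closing delimiter marks the end of the multipart body.
    chunks.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url=f"https://{host}/api/v1/{app_id}/extract",
        data=b"".join(chunks),
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the header and body easy to inspect in tests.
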
Response Example

{
  "text": "Extracted document text...",
  "metadata": {
    "filename": "sample.pdf"
  }
}
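
Because `metadata` appears only when requested, client code should treat it as optional. A minimal sketch of reading the response, assuming only the two fields shown in the example above (`ExtractionResult` and `parse_extraction` are hypothetical names, and a real response may carry additional fields):

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    text: str
    metadata: dict = field(default_factory=dict)


def parse_extraction(body: str) -> ExtractionResult:
    """Parse a 200 response body into text plus optional metadata."""
    payload = json.loads(body)
    return ExtractionResult(
        text=payload["text"],
        # Fall back to an empty dict when metadata was not requested.
        metadata=payload.get("metadata") or {},
    )
```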

HTTP Status Codes and Error Handling

HTTP Code	When It Happens	Typical Response Body	What to Do
`200`	Extraction succeeds	JSON with `text` and optional `metadata`	Consume extracted output
`401`	Invalid token or wrong auth header format	`Unauthorized`	Use `Authorization: Token <key>`
`422`	Missing required multipart fields (for example `file`)	Validation payload (`VALIDATION_ERROR`)	Send required fields and retry
`429`	Shared app-level rate limit or plan limit reached	Too-many-requests response	Retry with backoff
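
The table implies that only `429` is worth retrying, while `401` and `422` need a corrected request. One way to express that is sketched below. `call_with_backoff` and `ExtractionError` are hypothetical names, `send` stands for any caller-supplied function returning a `(status, body)` pair, and the attempt count and delays are illustrative assumptions rather than documented values.

```python
import time
from typing import Callable, Tuple


class ExtractionError(Exception):
    """Raised for responses that should not (or can no longer) be retried."""

    def __init__(self, status: int, body: str):
        super().__init__(f"extractor returned HTTP {status}: {body}")
        self.status = status
        self.body = body


def call_with_backoff(send: Callable[[], Tuple[int, str]],
                      max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call send() until it returns 200, retrying only on 429."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status == 429 and attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
            continue
        # 401 and 422 need a fixed request, so retrying would not help.
        raise ExtractionError(status, body)
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the helper testable without real delays.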

Pagination

Not applicable to extraction responses.

Note: `pages` narrows the extraction scope within a document. It does not paginate the response.

Rate Limits and Payload Constraints

Documented extractor constraints include:

Maximum file size: 1 GB
Maximum extracted text per request: 100 KiB
Requests can still return `429` when shared app-level rate or plan limits are reached.
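
These limits can be checked on the client before and after a call. The sketch below is illustrative and rests on two assumptions that this page does not confirm: that "1 GB" means 10^9 bytes, and that the 100 KiB text limit is measured in UTF-8 bytes. The function names are hypothetical.

```python
MAX_FILE_BYTES = 10**9        # assumption: "1 GB" read as a decimal gigabyte
MAX_TEXT_BYTES = 100 * 1024   # 100 KiB


def check_file_size(size_bytes: int) -> None:
    """Raise ValueError if a file exceeds the documented upload limit."""
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(
            f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES} bytes"
        )


def text_at_limit(text: str) -> bool:
    """Return True when extracted text has reached the per-request limit.

    This is a hint that the document may hold more text than was returned,
    in which case narrowing the request with `pages` may help.
    """
    return len(text.encode("utf-8")) >= MAX_TEXT_BYTES
```

A caller might pass `os.path.getsize(path)` to `check_file_size` before uploading, then test the returned `text` with `text_at_limit`.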

Shared API Foundations

Use these shared references for authentication, request and response structure, and pagination behavior used across Site Search APIs:
