Document Extraction API

The Document Extraction API extracts text and optional metadata from uploaded files.

Endpoint

  • Method: POST
  • Endpoint pattern: https://<extractor-host>/api/v1/<app_id>/extract
  • Content type: multipart/form-data
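
The endpoint pattern above can be filled in programmatically. The following is a minimal Python sketch: build_extract_url is a hypothetical helper name, and the host and app ID in the usage line are placeholders, not real values.

```python
from urllib.parse import quote


def build_extract_url(extractor_host: str, app_id: str) -> str:
    """Fill in the documented pattern https://<extractor-host>/api/v1/<app_id>/extract."""
    if not extractor_host or not app_id:
        raise ValueError("extractor_host and app_id are both required")
    # Percent-encode the app ID so unexpected characters cannot alter the path.
    return f"https://{extractor_host}/api/v1/{quote(app_id, safe='')}/extract"


# Usage with placeholder values:
url = build_extract_url("extractor.example.com", "my_app")
```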

UI Navigation

Find the extractor endpoint in Site Search > App Settings > All APIs > Document Extractor.

[Image: Document Extractor endpoint panel in Site Search app settings]

Authentication

Authenticate extractor calls with the Discovery API key, sent in the Authorization header using the Token scheme.

Authorization: Token <DISCOVERY_API_KEY>

Note: Raw key, Basic, and x-api-key header variants are not accepted by this endpoint.
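
As a small illustration of the header format, the sketch below builds the Authorization header in Python. The function name extractor_auth_headers is hypothetical, and the key in the usage line is a placeholder.

```python
def extractor_auth_headers(discovery_api_key: str) -> dict[str, str]:
    """Return the Authorization header in the documented Token scheme."""
    key = discovery_api_key.strip()
    if not key:
        raise ValueError("Discovery API key is empty")
    # Token scheme only: not the raw key, not Basic, not an x-api-key header.
    return {"Authorization": f"Token {key}"}


# Usage with a placeholder key:
headers = extractor_auth_headers("abc123")
```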

Find the Discovery key in Site Search > App Settings > All APIs > Discovery.

[Image: Discovery API key panel in Site Search app settings]

Request Parameters

Name              Required  Type        Description
Authorization     Yes       header      Use Authorization: Token <DISCOVERY_API_KEY> for extractor endpoint calls.
file              Yes       form field  Uploaded document content for extraction (multipart/form-data).
pages             No        form field  Optional page range selector (for example, 1,3,5-8) for paginated formats.
include_metadata  No        form field  Optional boolean to include document metadata in response.
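
The optional form fields can be assembled and checked client-side before a request is sent. This is a sketch under stated assumptions: build_form_fields is a hypothetical helper, the pages grammar is inferred only from the documented example 1,3,5-8, and sending "false" for include_metadata mirrors the "true" value in the request example rather than anything the source states.

```python
import re
from typing import Optional

# Assumed grammar, inferred from the documented example "1,3,5-8":
# comma-separated page numbers or N-M ranges.
_PAGES_RE = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def build_form_fields(pages: Optional[str] = None,
                      include_metadata: Optional[bool] = None) -> dict[str, str]:
    """Build the optional text fields; the required file part is sent separately."""
    fields: dict[str, str] = {}
    if pages is not None:
        if not _PAGES_RE.fullmatch(pages):
            raise ValueError(f"unexpected pages selector: {pages!r}")
        fields["pages"] = pages
    if include_metadata is not None:
        # Lowercase "true" matches the request example; "false" is an assumption.
        fields["include_metadata"] = "true" if include_metadata else "false"
    return fields
```

Leaving both arguments unset returns an empty dictionary, so only the file part is sent.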

Request Example

curl -sS -X POST \
  -H "Authorization: Token <DISCOVERY_API_KEY>" \
  -F "file=@./sample.pdf" \
  -F "include_metadata=true" \
  "https://<extractor-host>/api/v1/<app_id>/extract"

Response Example

{
  "text": "Extracted document text...",
  "metadata": {
    "filename": "sample.pdf"
  }
}
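
A response of this shape can be parsed into a small typed structure. The sketch below assumes only the two fields shown in the example; ExtractionResult and parse_extraction are hypothetical names, and metadata is treated as possibly absent because it is described as optional.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractionResult:
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_extraction(body: str) -> ExtractionResult:
    """Parse a 200 response body into an ExtractionResult."""
    payload = json.loads(body)
    # "metadata" is optional, so fall back to an empty dict when it is missing.
    return ExtractionResult(text=payload["text"],
                            metadata=payload.get("metadata") or {})


# Usage with the documented example body:
result = parse_extraction(
    '{"text": "Extracted document text...", "metadata": {"filename": "sample.pdf"}}'
)
```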

HTTP Status Codes and Error Handling

HTTP Code  When It Happens                                       Typical Response Body                  What to Do
200        Extraction succeeds                                   JSON with text and optional metadata   Consume extracted output
401        Invalid token or wrong auth header format             Unauthorized                           Use Authorization: Token <key>
422        Missing required multipart fields (for example file)  Validation payload (VALIDATION_ERROR)  Send required fields and retry
429        Shared rate or plan limit condition                   Too-many-requests response             Retry with backoff
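
The table above can be turned into a simple client-side policy: retry only on 429, and hand every other status back to the caller. The sketch below is illustrative. The function names, the base delay, the cap, and the attempt count are all assumptions, and the send argument stands in for whatever function performs the request and returns a (status, body) pair.

```python
import random
import time
from typing import Callable


def next_action(status: int) -> str:
    """Map a documented status code to the action listed in the table."""
    actions = {200: "consume", 401: "fix_auth", 422: "fix_request", 429: "retry"}
    return actions.get(status, "unexpected")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(send: Callable[[], tuple[int, str]],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out."""
    status, body = send()
    for attempt in range(max_attempts - 1):
        if next_action(status) != "retry":
            break
        sleep(backoff_delay(attempt))
        status, body = send()
    return status, body
```

Statuses 401 and 422 are returned immediately, since repeating the same request would not fix a bad header or a missing field.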

Pagination

Not applicable to extraction responses.

Note: pages narrows extraction scope within a document. It is not response pagination.

Rate Limits and Payload Constraints

Documented extractor constraints include:

  • Maximum file size: 1 GB
  • Maximum extracted text per request: 100 KiB
  • Shared app-level rate limits and plan limits still apply and can return 429.
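
A pre-flight size check avoids uploading a file the service would reject. This is a sketch under one loud assumption: the source says "1 GB" without specifying decimal or binary units, so the limit below is read as 10**9 bytes. check_upload_size is a hypothetical helper name.

```python
import os

# Assumption: "1 GB" is taken as 10**9 bytes; the source does not specify GB vs GiB.
MAX_FILE_BYTES = 10**9


def check_upload_size(size_bytes: int, limit: int = MAX_FILE_BYTES) -> None:
    """Raise ValueError if a file is larger than the documented maximum."""
    if size_bytes > limit:
        raise ValueError(
            f"file is {size_bytes} bytes; the documented maximum is {limit} bytes")


def check_upload_path(path: str) -> None:
    """Check the on-disk size of the file at path before uploading it."""
    check_upload_size(os.path.getsize(path))
```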

Shared API Foundations

See the shared Site Search API references for authentication, request and response structure, and pagination behavior used across Site Search APIs.
