The Document Extraction API extracts text and optional metadata from uploaded files.
Endpoint
- Method: POST
- Endpoint pattern: https://<extractor-host>/api/v1/<app_id>/extract
- Content type: multipart/form-data
UI Navigation
Find the extractor endpoint in Site Search > App Settings > All APIs > Document Extractor.
Authentication
Extractor calls authenticate with the Discovery API key, sent in a Token-scheme Authorization header:
Authorization: Token <DISCOVERY_API_KEY>
Note: Raw key, Basic, and x-api-key header variants are not accepted by this endpoint.
Find the Discovery key in Site Search > App Settings > All APIs > Discovery.
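As a minimal sketch of the header format described above, the helper below builds the Authorization header in the Token scheme. The function name `extractor_auth_headers` is hypothetical and only illustrates the string layout; it is not part of any official client.

```python
from __future__ import annotations


def extractor_auth_headers(discovery_api_key: str) -> dict[str, str]:
    """Return the Authorization header in the Token scheme (hypothetical helper)."""
    key = discovery_api_key.strip()
    if not key:
        raise ValueError("Discovery API key is empty")
    # Raw key, Basic, and x-api-key variants are not accepted by the extractor,
    # so the value is always prefixed with the literal scheme name "Token".
    return {"Authorization": f"Token {key}"}
```

Centralizing the header in one place makes it harder to send the key in one of the rejected formats by accident.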
Request Parameters
| Name | Required | Type | Description |
|---|---|---|---|
| Authorization | Yes | header | Use Authorization: Token <DISCOVERY_API_KEY> for extractor endpoint calls. |
| file | Yes | form field | Uploaded document content for extraction (multipart/form-data). |
| pages | No | form field | Optional page range selector (for example, 1,3,5-8) for paginated formats. |
| include_metadata | No | form field | Optional boolean to include document metadata in response. |
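The parameters above can be assembled into a multipart/form-data request using only the Python standard library. This is a sketch under stated assumptions: `build_extract_request` and its arguments are hypothetical names, the file part is labelled `application/octet-stream` because the source does not say which part content types the service expects, and the boolean is sent as the string `true` as in the curl sample. The function only constructs the request; it does not send it.

```python
from __future__ import annotations

import urllib.request
import uuid


def _field(boundary: str, name: str, value: str) -> bytes:
    """Encode one plain form field as a multipart part."""
    return (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        f"{value}\r\n"
    ).encode("utf-8")


def build_extract_request(
    host: str,
    app_id: str,
    api_key: str,
    filename: str,
    file_bytes: bytes,
    include_metadata: bool = False,
    pages: str | None = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the extract endpoint."""
    boundary = uuid.uuid4().hex
    parts = []
    if include_metadata:
        parts.append(_field(boundary, "include_metadata", "true"))
    if pages:
        parts.append(_field(boundary, "pages", pages))
    # Required file part; the part content type here is an assumption.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    parts.append(head + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        f"https://{host}/api/v1/{app_id}/extract",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a third-party HTTP client that builds the multipart body for you is an equally reasonable choice.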
Request Example
curl -sS -X POST \
  -H "Authorization: Token <DISCOVERY_API_KEY>" \
  -F "file=@./sample.pdf" \
  -F "include_metadata=true" \
  "https://<extractor-host>/api/v1/<app_id>/extract"
Response Example
{
  "text": "Extracted document text...",
  "metadata": {
    "filename": "sample.pdf"
  }
}
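A response of this shape can be read defensively, since `metadata` is optional. The sketch below assumes only the two top-level keys shown in the sample; `parse_extraction` is a hypothetical helper name, and any additional fields the service may return are ignored.

```python
from __future__ import annotations

import json


def parse_extraction(body: str) -> tuple[str, dict]:
    """Return (text, metadata) from a JSON response body."""
    payload = json.loads(body)
    text = payload.get("text", "")
    # metadata is present only when requested, so fall back to an empty dict.
    metadata = payload.get("metadata") or {}
    return text, metadata
```

For the sample body above, this would yield the extracted text and a metadata dict containing `filename`.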
HTTP Status Codes and Error Handling
| HTTP Code | When It Happens | Typical Response Body | What to Do |
|---|---|---|---|
| 200 | Extraction succeeds | JSON with text and optional metadata | Consume extracted output |
| 401 | Invalid token or wrong auth header format | Unauthorized | Use Authorization: Token <key> |
| 422 | Missing required multipart fields (for example file) | Validation payload (VALIDATION_ERROR) | Send required fields and retry |
| 429 | Shared rate or plan limit condition | Too-many-requests response | Retry with backoff |
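The status table can be turned into a small decision function. The sketch below is one possible reading: the action labels, the attempt cap of 5, and the exponential schedule starting at 1 second are all assumptions, since the source says only "retry with backoff" and gives no recommended delays.

```python
from __future__ import annotations


def next_action(
    status: int,
    attempt: int,
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> tuple[str, float]:
    """Map an HTTP status to (action, delay_seconds). attempt is 0-based."""
    if status == 200:
        return ("consume", 0.0)
    if status == 401:
        # Wrong key or header format: retrying unchanged will not help.
        return ("fix_auth", 0.0)
    if status == 422:
        # Missing required multipart fields: correct the request first.
        return ("fix_request", 0.0)
    if status == 429 and attempt < max_attempts:
        # Exponential backoff: 1s, 2s, 4s, ... (assumed schedule).
        return ("retry", base_delay * 2 ** attempt)
    return ("fail", 0.0)
```

Only 429 is retried automatically here because the other documented errors require a change to the request before a retry can succeed.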
Pagination
Not applicable to extraction responses.
Note: pages narrows extraction scope within a document. It is not response pagination.
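To illustrate the selector syntax, the sketch below collapses a list of page numbers into a string such as 1,3,5-8. It assumes pages are 1-based and that ranges are inclusive, which the example suggests but the source does not state; `format_pages` is a hypothetical helper name.

```python
from __future__ import annotations


def format_pages(pages: list[int]) -> str:
    """Collapse page numbers into a selector like '1,3,5-8'."""
    nums = sorted(set(pages))
    if any(n < 1 for n in nums):
        raise ValueError("page numbers are assumed to start at 1")
    out = []
    i = 0
    while i < len(nums):
        start = end = nums[i]
        # Extend the current run while the next number is consecutive.
        while i + 1 < len(nums) and nums[i + 1] == end + 1:
            i += 1
            end = nums[i]
        out.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(out)
```

The result is what would be sent as the `pages` form field; it selects which pages are extracted and has no effect on how the response is delivered.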
Rate Limits and Payload Constraints
Documented extractor constraints include:
- Maximum file size: 1 GB
- Maximum extracted text per request: 100 KiB
- Shared app-level rate and plan handling can still surface 429 behavior.
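A client can check these limits locally before and after a call. The sketch below rests on two assumptions that the source does not confirm: "1 GB" is read as 10**9 bytes (the stricter interpretation), and the 100 KiB text cap is measured in UTF-8 bytes. The source also does not say what the service does when text exceeds the cap, so the second helper only reports that the cap was reached.

```python
MAX_FILE_BYTES = 1_000_000_000  # assumption: "1 GB" as 10**9 bytes
MAX_TEXT_BYTES = 100 * 1024     # 100 KiB


def file_size_ok(size_bytes: int) -> bool:
    """True if a non-empty file fits under the documented upload limit."""
    return 0 < size_bytes <= MAX_FILE_BYTES


def text_reached_cap(text: str) -> bool:
    """True if extracted text is at or above the documented 100 KiB cap."""
    # Measuring in UTF-8 bytes is an assumption about how the cap is counted.
    return len(text.encode("utf-8")) >= MAX_TEXT_BYTES
```

When `text_reached_cap` returns true, narrowing the request with the `pages` field is one way to bring a large document under the per-request text limit.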
Shared API Foundations
Use these shared references for authentication, request and response structure, and pagination behavior used across Site Search APIs: