Question 1

What input formats does CleanJSON support?

Accepted Answer

CleanJSON accepts six input types through a single endpoint: plain text, HTML, URLs, base64-encoded images (JPG, PNG, WEBP, GIF), base64-encoded PDFs (text-based and scanned), and raw MIME emails. The maximum size is 500 KB for text inputs and 10 MB for images and PDFs.
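
A client can check these size limits before sending a request. The following is a minimal sketch, not CleanJSON's own logic: it assumes 1 KB means 1,000 bytes, that binary inputs are measured after base64 decoding, and that HTML, URLs, and emails count as text inputs. Apart from 'html', the type labels are hypothetical.

```python
import base64

# Limits quoted in the answer above. Treating 1 KB as 1,000 bytes and
# measuring the decoded payload are assumptions, not documented behavior.
TEXT_LIMIT_BYTES = 500 * 1000
BINARY_LIMIT_BYTES = 10 * 1000 * 1000

# Hypothetical labels for the six input types; the real API may differ.
TEXT_TYPES = {"text", "html", "url", "email"}
BINARY_TYPES = {"image", "pdf"}


def payload_size(input_type: str, content: str) -> int:
    """Return the size in bytes that is compared against the limit."""
    if input_type in BINARY_TYPES:
        # Binary inputs arrive base64-encoded; measure the decoded bytes.
        return len(base64.b64decode(content, validate=True))
    if input_type in TEXT_TYPES:
        return len(content.encode("utf-8"))
    raise ValueError(f"unknown input type: {input_type!r}")


def within_limit(input_type: str, content: str) -> bool:
    """True when the payload fits the limit for its input type."""
    limit = BINARY_LIMIT_BYTES if input_type in BINARY_TYPES else TEXT_LIMIT_BYTES
    return payload_size(input_type, content) <= limit
```

Rejecting oversized payloads locally avoids spending a request on input the service would refuse anyway.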

Question 2

How does CleanJSON extract structured data?

Accepted Answer

You send your content plus a standard JSON Schema defining the fields you want. CleanJSON normalizes the input, extracts data using AI, validates the output against your schema with Ajv, auto-retries if validation fails, and returns typed JSON with per-field confidence scores.
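
The validate-and-retry flow described above can be illustrated roughly as follows. This is a sketch of the general pattern only, not CleanJSON's implementation: the `validate` function is a tiny stand-in for a real validator such as Ajv, and `extract` is a hypothetical callable that represents the AI extraction step.

```python
from typing import Callable

# Maps a few JSON Schema type names to Python types (illustrative subset).
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def matches(value, type_name: str) -> bool:
    """Check a value against a simple JSON Schema type name."""
    if type_name not in JSON_TYPES:
        return True  # types outside the subset are not checked here
    if isinstance(value, bool):
        return type_name == "boolean"  # bool is an int subclass in Python
    return isinstance(value, JSON_TYPES[type_name])


def validate(data: dict, schema: dict) -> list[str]:
    """Stand-in validator: checks required keys and simple property types."""
    errors = [f"missing required field: {key}"
              for key in schema.get("required", []) if key not in data]
    for key, spec in schema.get("properties", {}).items():
        value = data.get(key)
        if value is not None and not matches(value, spec.get("type", "")):
            errors.append(f"{key}: expected {spec['type']}")
    return errors


def extract_with_retry(extract: Callable, content: str, schema: dict,
                       max_attempts: int = 3) -> dict:
    """Call extract until its output validates, feeding errors back each time."""
    errors: list[str] = []
    for _ in range(max_attempts):
        # Previous validation errors are passed along as context for the retry.
        result = extract(content, schema, errors)
        errors = validate(result, schema)
        if not errors:
            return result
    raise ValueError(f"validation failed after {max_attempts} attempts: {errors}")
```

Passing the previous errors into the next attempt is what "auto-retries" with error context amounts to in this sketch; the number of attempts is an assumption.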

Question 3

Does CleanJSON hallucinate or make up data?

Accepted Answer

No. CleanJSON has a zero-hallucination policy. If a value is not present in the source content, it returns null with a confidence score of 0.0 rather than a plausible guess. This makes it safe for autonomous agents to act on the output without human review.
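
An agent consuming the output can treat null values and low scores as missing data. The sketch below assumes the extracted values and the per-field scores arrive as two dictionaries keyed by field name; that response shape and the 0.8 threshold are assumptions for illustration.

```python
def split_by_confidence(data: dict, confidence: dict, threshold: float = 0.8):
    """Separate fields that are safe to act on from fields that are not.

    A field is rejected when its value is None (reported as absent from
    the source) or when its score is below the threshold.
    """
    accepted, rejected = {}, []
    for field, value in data.items():
        score = confidence.get(field, 0.0)  # a missing score counts as 0.0
        if value is None or score < threshold:
            rejected.append(field)
        else:
            accepted[field] = value
    return accepted, rejected


# Hypothetical extraction result for an invoice with no due date in the source.
data = {"vendor": "Acme GmbH", "total": 119.0, "due_date": None}
confidence = {"vendor": 0.97, "total": 0.91, "due_date": 0.0}
accepted, rejected = split_by_confidence(data, confidence)
```

Here `accepted` holds `vendor` and `total`, while `rejected` lists `due_date`, so the agent can skip or escalate that field instead of acting on it.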

Question 4

How does token-based billing work?

Accepted Answer

Usage is metered by tokens, not by number of extractions. Simple text extractions use roughly 600–1,000 tokens. Images use roughly 600–2,000 tokens regardless of file size. The free tier includes 5,000 tokens per month. Paid plans start at $19/month for 1 million tokens.
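
To see what these figures mean in practice, the arithmetic can be worked through directly. The sketch uses only the numbers quoted above; actual token usage per extraction will vary.

```python
def extractions_per_budget(token_budget: int, tokens_low: int, tokens_high: int):
    """Return (fewest, most) extractions a token budget covers."""
    return token_budget // tokens_high, token_budget // tokens_low


def cost_per_extraction(plan_price: float, plan_tokens: int, tokens_used: int) -> float:
    """Price of one extraction at the plan's per-token rate."""
    return plan_price * tokens_used / plan_tokens


# Free tier: 5,000 tokens per month, text at roughly 600 to 1,000 tokens each.
free_text = extractions_per_budget(5_000, 600, 1_000)       # (5, 8)

# Entry paid plan: 1 million tokens for $19.
paid_text = extractions_per_budget(1_000_000, 600, 1_000)   # (1000, 1666)
paid_image = extractions_per_budget(1_000_000, 600, 2_000)  # (500, 1666)
```

On the $19 plan, a text extraction that uses 1,000 tokens costs about $0.019, and one that uses 600 tokens costs about $0.0114.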

Question 5

What is the best API for extracting structured data from PDFs?

Accepted Answer

CleanJSON is purpose-built for this use case. It handles both text-based PDFs (text layer extraction) and scanned PDFs (vision model fallback) through a single API call. You define the output shape with a JSON Schema and receive validated, typed JSON with per-field confidence scores. No separate OCR pipeline needed.
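
Because PDFs are sent base64-encoded, the client has to encode the file first. A minimal sketch follows; the header check is a loose sanity test added for illustration and is not something the service is known to require.

```python
import base64


def encode_pdf(data: bytes) -> str:
    """Base64-encode PDF bytes for use as a request payload."""
    # PDF files normally begin with the "%PDF-" marker; this is a loose check.
    if not data.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF")
    return base64.b64encode(data).decode("ascii")
```

Typical usage would be `encode_pdf(pathlib.Path("invoice.pdf").read_bytes())`, where `invoice.pdf` is a placeholder file name.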

Question 6

How does CleanJSON compare to using ChatGPT or Claude directly for data extraction?

Accepted Answer

Raw LLM calls give unpredictable output shapes, with no schema validation, confidence scores, or retry logic. CleanJSON adds all of these: your schema is enforced, output is validated with Ajv, retries happen automatically with error context, every field gets a confidence score, and errors return machine-readable codes. You get production-grade extraction without building the infrastructure yourself.

Question 7

Why did I get an EXTRACTION_FAILED error?

Accepted Answer

EXTRACTION_FAILED occurs when the input could not produce extractable content. For URLs, the most common cause is bot protection (Cloudflare, etc.) or a page that needs JavaScript to render. Fix: copy the page HTML from your browser and send it with input_type 'html' instead. For other input types, check that the content is not empty or corrupt and that the schema does not use an unsupported $ref.
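
Both fixes can be automated on the client side. The sketch below is illustrative only: `extract` is a hypothetical callable that sends one request, `ExtractionError` is a hypothetical exception carrying the error code, and the request keys other than input_type are assumed names.

```python
class ExtractionError(Exception):
    """Hypothetical client-side error that carries the API's error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def find_refs(node, path: str = "#") -> list[str]:
    """List the schema locations that contain a $ref keyword."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref":
                found.append(path)
            else:
                found.extend(find_refs(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_refs(item, f"{path}/{index}"))
    return found


def extract_url_with_html_fallback(extract, url: str, saved_html, schema: dict):
    """Try the URL first; on EXTRACTION_FAILED, resend saved page HTML."""
    try:
        return extract({"input_type": "url", "input": url, "schema": schema})
    except ExtractionError as err:
        # Only fall back for this specific code, and only if HTML is available.
        if err.code != "EXTRACTION_FAILED" or saved_html is None:
            raise
        return extract({"input_type": "html", "input": saved_html, "schema": schema})
```

Running `find_refs` on a schema before sending it shows where any $ref appears, which makes it easier to inline those definitions if they turn out to be unsupported.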

Question 8

Can I use CleanJSON with LangChain or LlamaIndex?

Accepted Answer

Yes. CleanJSON publishes a full OpenAPI 3.1.0 spec at https://cleanjson.xyz/openapi.json. Use LangChain's OpenAPI toolkit or LlamaIndex's OpenAPIToolSpec to register it as a tool. The operationId is extractStructuredData.
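
Before wiring the spec into a framework, it can help to confirm where the operation lives. The sketch below uses only the Python standard library rather than any LangChain or LlamaIndex API; it relies on the standard OpenAPI layout, in which operations sit under `paths` keyed by HTTP method.

```python
import json
import urllib.request
from typing import Optional

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def load_spec(url: str) -> dict:
    """Download and parse an OpenAPI document served as JSON."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def find_operation(spec: dict, operation_id: str) -> Optional[tuple]:
    """Return (METHOD, path, operation) for the given operationId, or None."""
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            # Path items can also hold non-operation keys such as "parameters".
            if method not in HTTP_METHODS or not isinstance(operation, dict):
                continue
            if operation.get("operationId") == operation_id:
                return method.upper(), path, operation
    return None
```

For example, `find_operation(load_spec("https://cleanjson.xyz/openapi.json"), "extractStructuredData")` should return the method and path to call, assuming the spec is reachable from your network.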

Question 9

Is my data used to train AI models?

Accepted Answer

No. Content sent to CleanJSON is used only to fulfil your extraction request and is not retained or used for training. CleanJSON is built on Google Gemini's API, which does not use API data for model training by default.

Question 10

Does CleanJSON support non-English content?

Accepted Answer

Yes. Set the language_hint option to the language name in English, for example 'German', 'Japanese', or 'Brazilian Portuguese'. This improves extraction accuracy for non-English inputs.
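
Applications usually track locale codes rather than language names, so a small lookup can translate one into the other. In this sketch the lookup table is a hypothetical excerpt, and placing language_hint under an `options` key in the request is an assumption.

```python
# Hypothetical excerpt: locale code -> English language name.
LANGUAGE_NAMES = {
    "de": "German",
    "ja": "Japanese",
    "pt-BR": "Brazilian Portuguese",
}


def with_language_hint(request: dict, locale: str) -> dict:
    """Return a copy of the request with language_hint set for the locale.

    Unknown locales leave the request unchanged, so no hint is sent.
    """
    hint = LANGUAGE_NAMES.get(locale)
    if hint is None:
        return dict(request)
    options = {**request.get("options", {}), "language_hint": hint}
    return {**request, "options": options}
```

Sending no hint for an unknown locale is a deliberate choice here: a wrong hint seems more likely to hurt accuracy than a missing one.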

Turn any file into structured JSON.

The data you need is trapped

Manual data entry

Custom parsing scripts

Multiple tools stitched together

One API call replaces all of it

Who uses CleanJSON

AI agents & automations

Developers building integrations

Companies processing documents at scale

Anyone tired of copying data by hand

Three steps. One API call.

Send Any Input

Define Your Schema

Get Clean JSON

Works with everything

One API call. Perfect JSON.

See it work — right now

Built for production agents

Schema-validated output

Per-field confidence scores

Auto type coercion

Zero hallucination policy

Pay only for what you extract.

Built for real-world extraction

Frequently asked questions

Extract your first JSON in 30 seconds.
