Best File Format for AI: PDF vs TXT vs Markdown vs CSV (2026)

Format impact on AI accuracy & cost
30–50% more tokens: PDF vs TXT
7 formats ranked A+ → D
A+: TXT & Markdown
3 platforms compared

You upload a PDF to ChatGPT, ask it to summarize the key points, and get back a confident-sounding response that misses half the document. The problem usually isn't the AI — it's the format you gave it. AI language models don't read files. They read the text that gets extracted from files, and that extraction process is far messier than most people realize. Choosing the right format before you upload — or knowing when to convert — is one of the highest-leverage things you can do to get better AI output.

1. TL;DR — Quick Reference

  • A+ Plain TXT: fewest tokens, zero parsing errors
  • A+ Markdown: plain text + structure signals
  • A CSV: best for any tabular / spreadsheet data
  • D Scanned PDF: OCR required, accuracy varies

The short answer: For prose documents, plain text (.txt) or Markdown (.md) is the best format to send to any AI tool — fewest tokens, no parsing errors, cleanest extraction. For spreadsheet or tabular data, CSV is the right choice. If you're stuck with a PDF or DOCX, convert it first. The quality of the AI's response is almost entirely a function of the quality of the text it receives.

2. How AI Tools Read Documents

None of the major AI tools actually read your file the way you do. Here's what happens behind the scenes: when you upload a document, a parser extracts the text and converts it to a token stream (a long sequence of word and sub-word pieces), which gets passed to the language model. The model never sees your carefully formatted PDF with its columns and tables. It only sees whatever came out the other end of that extraction step. That's why two documents with identical content can produce radically different AI responses depending on format:

  • Plain text: Extracted 1:1 — what you write is exactly what the model sees
  • Markdown: Same as plain text, plus structural signals (# headers, lists, code blocks) that the model understands natively
  • PDF (searchable): Text extracted, but reading order breaks in multi-column layouts and tables often come out scrambled
  • PDF (scanned): Requires OCR first — if there's no text layer, the model sees nothing at all without it
  • DOCX: Text pulled from XML — generally accurate but formatting tags leave residual token overhead
  • XLSX: Cell values extracted to text; formulas aren't evaluated and row-column relationships often don't survive
  • PPTX: Slide text extracted sequentially; visual layout, spatial context, and animations are all invisible to the model
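
To make the DOCX bullet concrete, here is a minimal sketch of what "text pulled from XML" can look like, using only Python's standard library. The function name `docx_to_text` is hypothetical, and this is not how any particular AI platform implements extraction; real parsers also handle tables, footnotes, headers, and text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Return the paragraph text of a DOCX file, one paragraph per line.

    A DOCX is a ZIP archive; the body text lives in word/document.xml.
    Formatting tags (bold, fonts, styles) are simply ignored here.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each run holds text in <w:t>.
        runs = (node.text or "" for node in paragraph.iter(W_NS + "t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Called as `docx_to_text(open("report.docx", "rb").read())` (the filename is illustrative), it returns only the words. Everything else in the file is either dropped or, in less careful parsers, leaks through as extra tokens.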

3. Format Rankings for AI Consumption

If you had to rank every common document format by how well AI tools handle them, it looks like this. The grades reflect a combination of extraction accuracy, token overhead, and how often the output is actually usable:

  • A+ Plain Text (.txt): Maximum clarity, minimum tokens, zero parsing errors
  • A+ Markdown (.md): Plain text + structure. Best for documents with headings and lists
  • A CSV: Best format for tabular/spreadsheet data. Clean, parseable rows
  • B Searchable PDF: Acceptable for single-column text docs. Tables and multi-column layouts degrade
  • B DOCX: Good text accuracy; XML overhead adds token cost. Fine for standard docs
  • C XLSX / PPTX: Structure often lost. Export to CSV or copy text before uploading
  • D Scanned PDF: OCR required. Accuracy varies. Convert to searchable PDF first

4. Why PDFs Are Problematic for AI

PDF is the world's most common document sharing format, which makes it the most common format people upload to AI tools — and the format that causes the most confusion. It's not that PDF is a bad format. It's that PDF was designed for printing and visual presentation, not for text extraction. There are four specific ways this mismatch causes problems:

Problem 1: Multi-Column Layouts

A two-column academic paper that reads left-column-top → left-column-bottom → right-column-top → right-column-bottom in print is often extracted as: left-line-1, right-line-1, left-line-2, right-line-2 — interleaving text from both columns. The AI receives scrambled text and produces confused summaries.
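
The interleaving described above is easy to reproduce with a toy model. This sketch assumes each column is just a list of lines; the two function names are hypothetical and stand in for a naive extractor versus a layout-aware one.

```python
from itertools import zip_longest


def extract_row_by_row(left_column, right_column):
    """Naive extraction: read each visual line straight across the page."""
    lines = []
    for left_line, right_line in zip_longest(left_column, right_column):
        for line in (left_line, right_line):
            if line is not None:  # columns may have different lengths
                lines.append(line)
    return lines


def extract_column_by_column(left_column, right_column):
    """Layout-aware extraction: finish the left column, then the right."""
    return list(left_column) + list(right_column)


left = ["Intro, line 1", "Intro, line 2"]
right = ["Methods, line 1", "Methods, line 2"]

print(extract_row_by_row(left, right))
print(extract_column_by_column(left, right))
```

The first call mixes the introduction and methods text line by line, which is the scrambled input the model ends up summarizing.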

Problem 2: Tables

PDF tables are often stored as positioned text without cell structure. Extraction produces rows of numbers with no indication of which column header they belong to. A table of financial data becomes a stream of numbers the AI cannot reliably interpret.

Problem 3: Headers and Footers

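A 50-page PDF with "Page X of 50 | Confidential | Company Name" in the footer repeats that text 50 times in the extraction. A short footer like this costs only a dozen or so tokens per page, but a running header, footer, and legal notice together can reach ~100 tokens per page, which adds up to about 5,000 extra tokens in a 50-page document.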
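
If you already have the extracted text split by page, repeated headers and footers can be filtered out before uploading. The following is a rough heuristic sketch, not a standard algorithm: the function name and the `min_fraction` threshold are assumptions, and it can misfire on body lines that differ only by a number (for example "Table 1" and "Table 2" appearing on most pages).

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that recur on most pages once digits are normalized.

    pages: list of strings, one per page of extracted text.
    """
    def normalize(line):
        # "Page 7 of 50" and "Page 8 of 50" should count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct normalized line once per page.
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})

    # Never treat a line as boilerplate unless it shows up on 2+ pages.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {key for key, n in counts.items() if n >= threshold}

    return [
        "\n".join(l for l in page.splitlines() if normalize(l) not in repeated)
        for page in pages
    ]
```

Joining the cleaned pages with blank lines gives you a single text file without the per-page noise.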

Problem 4: Scanned Documents

Scanned PDFs contain images of text, not text. The AI sees nothing unless the tool runs OCR. OCR accuracy on clean scans is ~99%, but on poor-quality scans, numbers and letters can be misread in ways that corrupt financial figures, legal terms, or technical specifications.

5. Token Efficiency by Format

For API-based AI tools (ChatGPT API, Claude API, Gemini API), token count directly determines cost. The same 10,000-word document can result in very different token counts depending on format:

| Format | Token overhead | Est. tokens (10,000 words) | Relative cost | AI accuracy |
|---|---|---|---|---|
| Plain text (.txt) | None | ~13,000 | 1× baseline | Excellent |
| Markdown (.md) | Minimal (# * -) | ~13,500 | 1.04× | Excellent |
| CSV (tabular data) | Commas, quotes | ~14,000 | 1.08× | Excellent for tables |
| Searchable PDF | Headers, footers, whitespace | ~17,000–20,000 | 1.3–1.5× | Good (single-column) |
| DOCX | XML tag artifacts | ~15,000–18,000 | 1.15–1.4× | Good |
| Scanned PDF (OCR) | OCR errors + formatting | ~18,000–25,000 | 1.4–2× | Variable |
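
To turn the table into a budget estimate, the arithmetic can be sketched as below. The 1.3 tokens-per-word ratio and the multipliers come straight from the table above; the price per million tokens is a hypothetical parameter you would replace with your provider's current rate, and the function names are illustrative.

```python
TOKENS_PER_WORD = 1.3  # implied by the table: ~13,000 tokens per 10,000 words

# (low, high) multipliers from the "Relative cost" column above.
FORMAT_MULTIPLIER = {
    "txt": (1.0, 1.0),
    "md": (1.04, 1.04),
    "csv": (1.08, 1.08),
    "pdf": (1.3, 1.5),
    "docx": (1.15, 1.4),
    "scanned_pdf": (1.4, 2.0),
}


def estimate_tokens(words, fmt):
    """Return a (low, high) token estimate for a document of `words` words."""
    base = words * TOKENS_PER_WORD
    low, high = FORMAT_MULTIPLIER[fmt]
    return round(base * low), round(base * high)


def estimate_cost(words, fmt, usd_per_million_tokens):
    """Return a (low, high) input cost in USD; the price is an assumption."""
    low, high = estimate_tokens(words, fmt)
    return (
        low * usd_per_million_tokens / 1_000_000,
        high * usd_per_million_tokens / 1_000_000,
    )
```

For a single document the difference is cents. It becomes a real line item when the same conversion decision is applied to thousands of documents.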

6. Format Recommendations by Document Type

  • 📄 Legal contracts / reports: Convert to plain text or Markdown. Run the PDF through a text extractor and clean up headers/footers before uploading.
  • 📊 Spreadsheet / financial data: Export to CSV. Copy the relevant sheet, not the entire workbook. Name files descriptively: sales-q1.csv.
  • 📝 Meeting notes / docs: Write in Markdown from the start. ## sections and bullet lists help AI navigate long documents.
  • 🔬 Research papers: Request the HTML version when available (arXiv offers HTML for many recent papers, though not all). HTML extracts far cleaner than multi-column PDF.
  • 💻 Code: Plain text or Markdown with fenced code blocks. Never convert code to PDF before uploading.
  • 🖨️ Scanned documents: Run OCR first (Google Drive auto-OCR, Adobe Acrobat, or Tesseract CLI) to create a searchable PDF before uploading.

7. Image Formats for AI Vision Tasks

ChatGPT (GPT-4o), Claude 3, and Gemini all accept images directly, but the format you send affects how accurately the model reads what's in them.

JPG works fine for photos. When you're asking an AI to identify objects, describe a scene, or analyze a natural image, compression artifacts at normal quality settings don't cause problems. The AI isn't reading individual pixels; it's processing visual features that survive JPEG compression without meaningful loss.

PNG is the right choice the moment your image contains text: screenshots, diagrams, annotated charts, slides, anything where letters need to be read accurately. JPEG's lossy compression discards high-frequency detail, which is exactly where sharp text edges live, so it leaves ringing artifacts around small characters. Those artifacts blur the subtle differences that distinguish a lowercase "l" from a "1", or a semicolon from a colon. At the small font sizes that appear in screenshots, those distinctions can disappear in a heavily compressed JPG. PNG's lossless compression preserves every pixel, so the AI receives the same sharp characters you see.

WebP is supported in Claude and Gemini, but older ChatGPT interface versions occasionally have issues with it. If you're distributing a workflow that other people will use, stick to JPG or PNG — they work everywhere, every time.

| Image type | Best format | Why |
|---|---|---|
| Photos (no text) | JPG | Compression artifacts don't matter; keeps file small |
| Screenshots / UI | PNG | Lossless: text stays sharp, letters stay distinct |
| Diagrams / charts with labels | PNG | Preserves fine text at small font sizes |
| Slides / annotated images | PNG | Any text in the image needs pixel-perfect clarity |
| General web images | WebP | Smaller than PNG/JPG; check platform support first |

8. What Each Platform Actually Does With Your Files

The three major AI tools handle file uploads quite differently under the hood. Knowing the specifics saves you time when something isn't working.

ChatGPT (GPT-4o)

File upload works through the paperclip icon in the chat interface. Supported formats include PDF, DOCX, TXT, CSV, and images. When you upload a PDF, GPT-4o extracts the text and can quote passages directly from it — but before you ask your real question, ask it to list the section headings first. This takes five seconds and tells you whether extraction worked. If the headings look right, the text order is probably intact. If they're garbled, you'll know to convert the file before proceeding.

XLSX is supported, but with an important caveat: formulas aren't recalculated. The model sees whatever was stored in each cell when the file was last saved, not freshly computed results, so formula outputs can be stale or missing. If formula outputs matter, paste as values and export to CSV first. The context window is 128K tokens, but very long documents still get truncated in practice: GPT-4o prioritizes the beginning of a document, so key information in the final third may not be reached reliably.

Claude (Sonnet / Opus)

Claude handles PDFs particularly well — better than most tools at maintaining text order in complex layouts. Academic papers with multi-column text, financial reports with side-by-side tables, legal documents with footnotes — Claude's PDF extraction tends to stay coherent where ChatGPT's scrambles. This isn't universal, but it's a reliable enough pattern that starting with Claude for complex PDFs is a sensible default.

What's distinctive about Claude is how well it responds to Markdown-structured instructions — not just in uploaded files but in the prompt itself. Writing your prompt with clear ## sections and bullet lists produces noticeably more organized responses. The Files API (for API users) lets you upload a document once and reference it across multiple conversations without re-sending the file each time.

Gemini

Gemini was built as a multimodal model from the ground up. It processes text, images, PDF, audio, and video natively without workarounds. The Google Drive integration is genuinely useful: you can share a Drive link directly rather than downloading and re-uploading files. The 1M token context window in Gemini 1.5 is impressive — you could theoretically process a 1,500-page PDF in a single request. In practice, extraction quality still degrades on complex layouts. What the large context window helps most is long-but-simple documents: technical manuals in single-column text, long contracts without complex formatting, transcripts.

| Platform | File size limit | PDF handling | Native audio/video | Context window |
|---|---|---|---|---|
| ChatGPT (GPT-4o) | 512MB / 10 files | Good (single-column) | No (transcribe first) | 128K tokens |
| Claude (Sonnet/Opus) | 30MB / file | Best (multi-column) | No (transcribe first) | 200K tokens |
| Gemini 1.5 | 100MB / Drive | Good | Yes (MP3, MP4) | 1M tokens |
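
For a scripted workflow, the limits in this comparison can be encoded as a small lookup. The numbers below are copied from the table and will go out of date, so treat them as placeholders to verify against each platform's documentation. The dictionary and function names are hypothetical.

```python
# Limits as listed in the comparison table above; verify before relying on them.
PLATFORM_LIMITS = {
    "chatgpt": {"max_file_mb": 512, "context_tokens": 128_000},
    "claude": {"max_file_mb": 30, "context_tokens": 200_000},
    "gemini": {"max_file_mb": 100, "context_tokens": 1_000_000},
}


def platforms_that_fit(file_mb, estimated_tokens):
    """Return the platforms whose size and context limits the file fits."""
    return [
        name
        for name, limits in PLATFORM_LIMITS.items()
        if file_mb <= limits["max_file_mb"]
        and estimated_tokens <= limits["context_tokens"]
    ]


# A 40 MB file estimated at 150,000 tokens:
print(platforms_that_fit(40, 150_000))
```

Fitting inside a limit says nothing about extraction quality. It only tells you which tools will accept the file in one piece.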

9. The Convert-First Workflow

Converting a file before uploading takes a few minutes. Debugging confused AI output because you skipped that step can take an hour. Here are the three conversions that solve most problems:

Scanned PDF → TXT

  1. Upload to Google Drive: Upload the scanned PDF to Google Drive. Drive runs OCR automatically when it indexes the file.
  2. Open with Google Docs: Right-click the file → Open with Google Docs. Google converts the image-based PDF to an editable text document via OCR. Takes 10–30 seconds.
  3. Export as plain text: File → Download → Plain text (.txt). Upload this file to your AI tool. The difference in output quality is substantial.

Excel Workbook → CSV

  1. File → Save As → CSV: In Excel, save each relevant sheet as a separate CSV with a descriptive filename: sales-q1.csv, budget-2026.csv.
  2. Upload the files and name them in your prompt: "Based on sales-q1.csv and sales-q2.csv, what changed between quarters?" The AI reads filenames as context, and this works better than one combined XLSX.
  3. Or convert directly in the browser: Use the XLSX to CSV converter (no software needed, nothing uploaded to a server). See also the Excel to CSV guide.

PowerPoint → Markdown

  1. Don't upload the PPTX: The visual layout of a presentation doesn't survive file upload anyway. Transitions, animations, and spatial arrangements are all gone. The AI only needs the text.
  2. Copy-paste slide text into Markdown: Use ## Slide: [title] as the header for each slide. This makes explicit the slide boundaries that PPTX extraction often loses.
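
If you have many slides, the same structure can be generated rather than typed. This is a minimal sketch that assumes you have already copied each slide's title and bullet text into plain Python data; `slides_to_markdown` is a hypothetical helper, and it does not read PPTX files itself.

```python
def slides_to_markdown(slides):
    """Build a Markdown document from (title, [bullet, ...]) pairs."""
    parts = []
    for title, bullets in slides:
        parts.append(f"## Slide: {title}")  # explicit slide boundary
        parts.extend(f"- {bullet}" for bullet in bullets)
        parts.append("")  # blank line between slides
    return "\n".join(parts).rstrip() + "\n"


deck = [
    ("Q1 Results", ["Revenue up", "Costs flat"]),
    ("Next Steps", ["Hire two engineers"]),
]
print(slides_to_markdown(deck))
```

Save the output as a .md file and upload that instead of the presentation.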

10. Formats Most People Don't Think to Use

HTML (for research papers)

If a research paper on arXiv has an HTML version (arxiv.org/html/XXXX.XXXXX), it is almost always better than the PDF. arXiv generates HTML for many recent submissions, though not for every paper. The HTML version has no columns, tables are semantic HTML elements, and equations come through as readable text rather than rendered images. This applies to any web-native content: W3C specifications, MDN documentation, standards documents. When an HTML version exists, use it.

JSON (for structured data)

Nested or hierarchical data is usually better as JSON than as a flattened table. AI models have seen enormous amounts of JSON during training (API responses, configuration files, data exports) and they parse it naturally. A product catalog, a configuration file, a database export, an API response: all of these are better uploaded as JSON than reformatted into a PDF table or spreadsheet. The model reads the key-value structure directly, and nested relationships (which flatten badly in CSV) remain intact. For flat tabular data, CSV is still the leaner choice because JSON repeats every key on every record.
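
The difference shows up as soon as a record contains a list. The sketch below uses a made-up product record (all field names are illustrative) and writes it both ways with Python's standard `json` and `csv` modules.

```python
import csv
import io
import json

# Hypothetical catalog entry with one level of nesting.
product = {
    "sku": "A-100",
    "name": "Desk lamp",
    "variants": [
        {"color": "black", "price": 29.0},
        {"color": "white", "price": 31.0},
    ],
}


def to_json(record):
    """Nested structure is kept exactly as it is."""
    return json.dumps(record, indent=2)


def to_flat_csv(record):
    """One row per variant; parent fields must be repeated on every row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sku", "name", "color", "price"])
    for variant in record["variants"]:
        writer.writerow(
            [record["sku"], record["name"], variant["color"], variant["price"]]
        )
    return buffer.getvalue()


print(to_json(product))
print(to_flat_csv(product))
```

In the CSV, the fact that both rows belong to one product is only implied by the repeated sku. In the JSON it is explicit.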

Plain Text With Custom Structure

You don't need a formal format to add structure. Consistent section markers in plain text work well: write === SECTION: Financial Summary === before each section. AI models are pattern matchers — consistent structural signals in plain text are picked up reliably even without Markdown syntax.
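
When a document is assembled from several sources, the markers can be added programmatically. A minimal sketch, with `with_section_markers` as a hypothetical helper name:

```python
def with_section_markers(sections):
    """Join (name, body) pairs into one plain-text document with markers."""
    blocks = [
        f"=== SECTION: {name} ===\n{body.strip()}" for name, body in sections
    ]
    return "\n\n".join(blocks) + "\n"


report = with_section_markers(
    [
        ("Financial Summary", "Revenue grew in Q1."),
        ("Risks", "No material risks noted."),
    ]
)
print(report)
```

You can then refer to sections by name in the prompt, for example by asking for a summary of the Financial Summary section only.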

11. Best Format for AI Training Data

Everything above applies to people using AI tools. There's a completely different audience who fine-tune or train models themselves. The formats that matter are not the same.

JSONL (JSON Lines) is the de facto standard for fine-tuning datasets. OpenAI's fine-tuning API, Hugging Face, Axolotl, and most other fine-tuning frameworks accept it. Each line is a complete JSON object representing one training example:

{"messages": [{"role": "user", "content": "Summarize this contract."}, {"role": "assistant", "content": "The contract establishes..."}]}
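
Writing and checking a file in this shape takes only the standard library. The sketch below assumes the `messages` / `role` / `content` layout shown in the example line; other frameworks use different field names, so the validation rules here are an assumption, not a universal schema, and both function names are hypothetical.

```python
import json


def write_jsonl(examples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def count_valid_examples(path):
    """Return the number of examples; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"line {lineno}: invalid JSON ({err})") from err
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {lineno}: missing or empty 'messages'")
            for message in messages:
                if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                    raise ValueError(f"line {lineno}: message needs 'role' and 'content'")
            count += 1
    return count
```

Running a check like this before uploading a dataset catches the most common failure: a single malformed line in an otherwise valid file.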

For large-scale pretraining, the standard shifts to large raw text files with document separator tokens between entries. Parquet is the columnar storage format used on the Hugging Face Datasets Hub for large shared datasets — compressed, typed, and designed for analytical workloads. If you're creating your own fine-tuning data from scratch, JSONL is simpler and more than adequate.

If you're just uploading documents to ChatGPT, Claude, or Gemini for analysis — not training a model — the training formats (JSONL, Parquet) are completely irrelevant. TXT, Markdown, and CSV are what matter for you.

12. Frequently Asked Questions

What is the best file format to upload to ChatGPT or Claude?
Plain text (.txt) or Markdown (.md) for text documents: fewest tokens, no parsing overhead, and structure is preserved clearly. CSV for spreadsheet or tabular data. Searchable PDF and DOCX are acceptable for standard single-column documents, but PDF degrades significantly with multi-column layouts and tables. Avoid scanned PDFs, XLSX, and PPTX when you have a choice.
Can AI read PDF files accurately?
Searchable PDFs are read reasonably well for single-column text. Problems arise with multi-column layouts (text order scrambles between columns), tables (extracted as unstructured text), repeated headers/footers (waste tokens — a 50-page doc with a footer wastes ~5,000 tokens), and scanned PDFs (require OCR with potential recognition errors). For precision work, convert to plain text first.
Does file format affect AI token costs?
Yes, significantly. The same content typically costs 30–50% more tokens as a PDF than as plain text, and roughly 15–40% more as a DOCX. Token count determines cost when using the API. A 50-page PDF might contain 25,000 words of actual content but extract to 40,000+ tokens due to formatting overhead, repeated headers, and whitespace artifacts. For high-volume document processing, format choice directly impacts your API bill.
What format is best for feeding spreadsheet data to AI?
CSV is best for tabular data. It's plain text, every AI tool parses it directly, and it uses minimal tokens. Export Excel sheets as individual CSVs before uploading. Name files descriptively — AI tools read filenames as context. Avoid XLSX: formulas aren't evaluated, and Excel's formatting metadata adds noise that doesn't help the model.
Should I use Markdown when writing documents for AI?
Yes. Markdown provides structure (headers, lists, code blocks) with minimal token overhead. Headers help AI navigate long documents; code blocks signal code vs prose; bullet lists communicate hierarchy. Markdown is the ideal format for documents destined for AI consumption. Claude in particular responds noticeably better to Markdown-structured prompts.
What image format works best for AI image recognition tasks?
PNG is best when the image contains text, labels, charts, or diagrams — lossy JPG compression blurs fine detail at small font sizes. For photos without text (identifying objects, scenes, faces), JPG works perfectly and keeps file size manageable. WebP works in Claude and Gemini but has inconsistent support in older ChatGPT versions — convert to PNG if in doubt.
Does Claude handle PDFs better than ChatGPT?
In practice, yes — Claude tends to maintain text order more accurately in complex PDF layouts like academic papers, financial reports, and legal documents. ChatGPT's PDF extraction is good for standard single-column documents but struggles more with multi-column layouts. Both tools degrade on scanned PDFs — run OCR first if accuracy matters.
Can I upload audio or video files to AI for analysis?
Gemini 1.5 can process audio files (MP3, WAV) and videos (MP4) directly without any conversion. ChatGPT and Claude do not process audio or video natively, so transcribe audio first using Whisper (OpenAI's open-source speech-to-text model, free to run locally), save the result as a .txt file, then upload the transcript. Transcripts often produce better AI analysis than raw audio anyway because you can review and clean them before submitting.
What file format does ChatGPT accept for uploads?
ChatGPT (GPT-4o) accepts PDF, DOCX, TXT, CSV, and image files (JPG, PNG, WebP, GIF). Per-file size limit is 512MB. You can upload up to 10 files per message. For text documents, TXT and CSV give cleaner extraction than PDF. For images with text, use PNG rather than JPG to preserve sharpness.
Is PDF or Word better for uploading to AI tools?
Neither is ideal — both are worse than plain text. If you have a choice, convert to TXT first. Between the two: Word (DOCX) generally gives better extraction than PDF because text is stored directly in structured XML, while PDF extraction quality depends heavily on how the PDF was created. Scanned PDFs are worst of all. If you must use one unchanged, DOCX is the safer bet.
What are the file upload limits for ChatGPT, Claude, and Gemini?
As of 2026 — ChatGPT: 512MB per file, up to 10 files per message. Claude: 30MB per file. Gemini: 100MB per file, plus Google Drive links (bypasses size limits entirely). These limits rarely affect typical documents but matter when processing high-resolution images, long videos, or large datasets.
What is the best file format for AI training data?
JSONL (JSON Lines) is the standard for fine-tuning datasets — used by OpenAI's fine-tuning API, Hugging Face, and most frameworks. Each line is a complete JSON conversation object. For large pretraining datasets, Parquet is standard. If you're using AI tools (not training them), these formats don't apply — use TXT, Markdown, and CSV for document analysis.

Format choice is one of those things that sounds like a minor detail until you've spent 20 minutes wondering why a perfectly capable AI gave you a useless answer about a document it clearly misread. Getting this right takes about two minutes. The converters are right here if you need them.

✍️
Convertlo Editorial Team
We test file conversion tools, formats, and workflows so you don't have to. All guides are written from hands-on testing with real documents and real AI tools.