Best File Format for AI: PDF vs TXT vs Markdown vs CSV (2026)
Table of Contents
- TL;DR — Quick Reference
- How AI Tools Read Documents
- Format Rankings for AI Consumption
- Why PDFs Are Problematic for AI
- Token Efficiency by Format
- Format Recommendations by Document Type
- Image Formats for AI Vision Tasks
- What Each Platform Does With Your Files
- The Convert-First Workflow
- Formats Most People Overlook
- Best Format for AI Training Data
- Frequently Asked Questions
You upload a PDF to ChatGPT, ask it to summarize the key points, and get back a confident-sounding response that misses half the document. The problem usually isn't the AI — it's the format you gave it. AI language models don't read files. They read the text that gets extracted from files, and that extraction process is far messier than most people realize. Choosing the right format before you upload — or knowing when to convert — is one of the highest-leverage things you can do to get better AI output.
1. TL;DR — Quick Reference
The short answer: For prose documents, plain text (.txt) or Markdown (.md) is the best format to send to any AI tool — fewest tokens, no parsing errors, cleanest extraction. For spreadsheet or tabular data, CSV is the right choice. If you're stuck with a PDF or DOCX, convert it first. The quality of the AI's response is almost entirely a function of the quality of the text it receives.
2. How AI Tools Read Documents
None of the major AI tools actually read your file the way you do. What happens behind the scenes: when you upload a document, a parser extracts the text, a tokenizer splits that text into tokens (words and word fragments), and the resulting token sequence is passed to the language model. The model never sees your carefully formatted PDF with its columns and tables. It only sees whatever came out the other end of that extraction step. That's why two documents with identical content can produce radically different AI responses depending on format:
- Plain text: Extracted 1:1 — what you write is exactly what the model sees
- Markdown: Same as plain text, plus structural signals (# headers, lists, code blocks) that the model understands natively
- PDF (searchable): Text extracted, but reading order breaks in multi-column layouts and tables often come out scrambled
- PDF (scanned): Requires OCR first — if there's no text layer, the model sees nothing at all without it
- DOCX: Text pulled from XML — generally accurate but formatting tags leave residual token overhead
- XLSX: Cell values extracted to text; formulas aren't evaluated and row-column relationships often don't survive
- PPTX: Slide text extracted sequentially; visual layout, spatial context, and animations are all invisible to the model
3. Format Rankings for AI Consumption
If you had to rank every common document format by how well AI tools handle them, it looks like this, from best to worst. The ranking reflects a combination of extraction accuracy, token overhead, and how often the output is actually usable:
- Plain Text (.txt): Maximum clarity, minimum tokens, zero parsing errors
- Markdown (.md): Plain text + structure. Best for documents with headings and lists
- CSV: Best format for tabular/spreadsheet data. Clean, parseable rows
- Searchable PDF: Acceptable for single-column text docs. Tables and multi-column layouts degrade
- DOCX: Good text accuracy; XML overhead adds token cost. Fine for standard docs
- XLSX / PPTX: Structure often lost. Export to CSV or copy text before uploading
- Scanned PDF: OCR required. Accuracy varies. Convert to searchable PDF first
4. Why PDFs Are Problematic for AI
PDF is the world's most common document sharing format, which makes it the most common format people upload to AI tools — and the format that causes the most confusion. It's not that PDF is a bad format. It's that PDF was designed for printing and visual presentation, not for text extraction. There are four specific ways this mismatch causes problems:
Problem 1: Multi-Column Layouts
A two-column academic paper that reads left-column-top → left-column-bottom → right-column-top → right-column-bottom in print is often extracted as: left-line-1, right-line-1, left-line-2, right-line-2 — interleaving text from both columns. The AI receives scrambled text and produces confused summaries.
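The interleaving described above can be reproduced in a few lines. This is a toy sketch, not how any particular PDF parser works: the two extraction functions and the sample column text are hypothetical, and real extractors work from glyph coordinates rather than tidy lists of lines.

```python
from itertools import zip_longest


def extract_by_visual_line(left_column, right_column):
    """Naive extraction: read each printed line straight across the page."""
    lines = []
    for left, right in zip_longest(left_column, right_column, fillvalue=""):
        lines.append(f"{left} {right}".strip())
    return lines


def extract_by_column(left_column, right_column):
    """Layout-aware extraction: finish the left column, then read the right."""
    return list(left_column) + list(right_column)


# Hypothetical two-column page, one list entry per printed line.
left = ["Transformers process", "text as tokens."]
right = ["Results improved", "on every benchmark."]

scrambled = extract_by_visual_line(left, right)
ordered = extract_by_column(left, right)

print(scrambled)
print(ordered)
```

The first list mixes two unrelated sentences on each line, which is the kind of input that leads to confused summaries; the second preserves the order a human reader follows.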
Problem 2: Tables
PDF tables are often stored as positioned text without cell structure. Extraction produces rows of numbers with no indication of which column header they belong to. A table of financial data becomes a stream of numbers the AI cannot reliably interpret.
Problem 3: Headers and Footers
A 50-page PDF with "Page X of 50 — Confidential — Company Name" in the footer repeats this text 50 times in the extraction. At roughly 10 to 15 tokens per occurrence, that is about 500 to 750 tokens of pure noise, and longer running headers or legal disclaimers multiply the waste.
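One way to remove this kind of repetition before uploading is to drop any line that recurs on most pages. The sketch below assumes you already have the extracted text split into one string per page; the function names, the `min_share` threshold, and the sample pages are all illustrative choices, not part of any library.

```python
import re
from collections import Counter


def normalize(line):
    """Replace digit runs so 'Page 3 of 50' and 'Page 4 of 50' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines whose normalized form appears on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct normalized line once per page.
        counts.update({normalize(line) for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {key for key, seen in counts.items() if seen >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if normalize(line) not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


# Hypothetical three-page extraction with a running footer.
bodies = [
    "Revenue grew in the first quarter.",
    "Costs were flat year over year.",
    "The outlook remains unchanged.",
]
pages = [
    f"{body}\nPage {number} of 3 - Confidential - Company Name"
    for number, body in enumerate(bodies, start=1)
]

cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Because digits are normalized away, body lines that differ only in their numbers (for example rows of a repeated form) would also be treated as repeats, so it is worth skimming the result before uploading.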
Problem 4: Scanned Documents
Scanned PDFs contain images of text, not text. The AI sees nothing unless the tool runs OCR. OCR accuracy on clean scans is ~99%, but on poor-quality scans, numbers and letters can be misread in ways that corrupt financial figures, legal terms, or technical specifications.
5. Token Efficiency by Format
For API-based AI tools (OpenAI API, Claude API, Gemini API), token count directly determines cost. The same 10,000-word document can result in very different token counts depending on format:
| Format | Token overhead | Est. tokens (10,000 words) | Relative cost | AI accuracy |
|---|---|---|---|---|
| Plain text (.txt) | None | ~13,000 | 1× baseline | Excellent |
| Markdown (.md) | Minimal (# * -) | ~13,500 | 1.04× | Excellent |
| CSV (tabular data) | Commas, quotes | ~14,000 | 1.08× | Excellent for tables |
| Searchable PDF | Headers, footers, whitespace | ~17,000–20,000 | 1.3–1.5× | Good (single-column) |
| DOCX | XML tag artifacts | ~15,000–18,000 | 1.15–1.4× | Good |
| Scanned PDF (OCR) | OCR errors + formatting | ~18,000–25,000 | 1.4–2× | Variable |
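To turn the table into a budget estimate, the figures can be scaled by word count and multiplied by a per-token price. The sketch below copies the token ranges from the table above; the format keys, the function names, and the example price of 2.00 USD per million input tokens are assumptions for illustration only, so substitute your provider's current rate.

```python
# (low, high) estimated tokens per 10,000 words, taken from the table above.
TOKENS_PER_10K_WORDS = {
    "txt": (13_000, 13_000),
    "md": (13_500, 13_500),
    "csv": (14_000, 14_000),
    "pdf": (17_000, 20_000),
    "docx": (15_000, 18_000),
    "scanned_pdf": (18_000, 25_000),
}


def estimate_tokens(word_count, fmt):
    """Scale the table's 10,000-word estimate to an arbitrary word count."""
    low, high = TOKENS_PER_10K_WORDS[fmt]
    scale = word_count / 10_000
    return round(low * scale), round(high * scale)


def estimate_cost(word_count, fmt, usd_per_million_tokens):
    """Return a (low, high) input cost range in USD."""
    low, high = estimate_tokens(word_count, fmt)
    return (
        low * usd_per_million_tokens / 1_000_000,
        high * usd_per_million_tokens / 1_000_000,
    )


HYPOTHETICAL_PRICE = 2.00  # USD per million input tokens, an assumed figure

for fmt in TOKENS_PER_10K_WORDS:
    low, high = estimate_cost(10_000, fmt, HYPOTHETICAL_PRICE)
    print(f"{fmt:12s} ${low:.4f} to ${high:.4f}")
```

For a single document the absolute difference is small; it matters when the same pipeline processes thousands of files.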
6. Format Recommendations by Document Type
- Legal contracts / reports: Convert to plain text or Markdown. Run the PDF through a text extractor and clean up headers/footers before uploading.
- Spreadsheet / financial data: Export to CSV. Copy the relevant sheet, not the entire workbook. Name files descriptively: sales-q1.csv.
- Meeting notes / docs: Write in Markdown from the start. ## sections and bullet lists help AI navigate long documents.
- Research papers: Use the HTML version when available (arXiv offers one for many recent papers submitted as LaTeX, but not for every paper). HTML extracts far more cleanly than a multi-column PDF.
- Code: Plain text or Markdown with fenced code blocks. Never convert code to PDF before uploading.
- Scanned documents: Run OCR first (Google Drive auto-OCR, Adobe Acrobat, or Tesseract CLI) to create a searchable PDF before uploading.
7. Image Formats for AI Vision Tasks
ChatGPT (GPT-4o), Claude, and Gemini all accept images directly, but the format you send affects how accurately the model reads what's in them.
JPG works fine for photos. When you're asking an AI to identify objects, describe a scene, or analyze a natural image, compression artifacts at normal quality settings don't cause problems. The AI isn't reading individual pixels; it's processing visual features that survive JPEG compression without meaningful loss.
PNG is the right choice the moment your image contains text. Screenshots, diagrams, annotated charts, slides — anything where letters need to be read accurately. JPEG's lossy compression targets exactly the kind of subtle edge contrast that distinguishes a lowercase "l" from a "1", or a semicolon from a colon. At the small font sizes that appear in screenshots, those distinctions disappear in a JPG. PNG's lossless compression preserves every pixel, so the AI sees the same sharp characters you do.
WebP is supported in Claude and Gemini, but older ChatGPT interface versions occasionally have issues with it. If you're distributing a workflow that other people will use, stick to JPG or PNG — they work everywhere, every time.
| Image type | Best format | Why |
|---|---|---|
| Photos (no text) | JPG | Compression artifacts don't matter; keeps file small |
| Screenshots / UI | PNG | Lossless — text stays sharp, letters stay distinct |
| Diagrams / charts with labels | PNG | Preserves fine text at small font sizes |
| Slides / annotated images | PNG | Any text in the image needs pixel-perfect clarity |
| General web images | WebP | Smaller than PNG/JPG — check platform support first |
8. What Each Platform Actually Does With Your Files
The three major AI tools handle file uploads quite differently under the hood. Knowing the specifics saves you time when something isn't working.
ChatGPT (GPT-4o)
File upload works through the paperclip icon in the chat interface. Supported formats include PDF, DOCX, TXT, CSV, and images. When you upload a PDF, GPT-4o extracts the text and can quote passages directly from it — but before you ask your real question, ask it to list the section headings first. This takes five seconds and tells you whether extraction worked. If the headings look right, the text order is probably intact. If they're garbled, you'll know to convert the file before proceeding.
XLSX is supported, but with an important caveat: formulas aren't recalculated. The model sees whatever was stored in the file when it was last saved, which may be formula text or a cached result that is out of date, not a freshly computed value. Export to CSV with paste-as-values first if formula outputs matter. The context window is 128K tokens, but very long documents can still get truncated in practice. GPT-4o tends to prioritize the beginning of a document, so key information in the final third may not be reached reliably.
Claude (Sonnet / Opus)
Claude handles PDFs particularly well — better than most tools at maintaining text order in complex layouts. Academic papers with multi-column text, financial reports with side-by-side tables, legal documents with footnotes — Claude's PDF extraction tends to stay coherent where ChatGPT's scrambles. This isn't universal, but it's a reliable enough pattern that starting with Claude for complex PDFs is a sensible default.
What's distinctive about Claude is how well it responds to Markdown-structured instructions — not just in uploaded files but in the prompt itself. Writing your prompt with clear ## sections and bullet lists produces noticeably more organized responses. The Files API (for API users) lets you upload a document once and reference it across multiple conversations without re-sending the file each time.
Gemini
Gemini was built as a multimodal model from the ground up. It processes text, images, PDF, audio, and video natively without workarounds. The Google Drive integration is genuinely useful: you can share a Drive link directly rather than downloading and re-uploading files. The 1M token context window in Gemini 1.5 is impressive — you could theoretically process a 1,500-page PDF in a single request. In practice, extraction quality still degrades on complex layouts. What the large context window helps most is long-but-simple documents: technical manuals in single-column text, long contracts without complex formatting, transcripts.
| Platform | File size limit | PDF handling | Native audio/video | Context window |
|---|---|---|---|---|
| ChatGPT (GPT-4o) | 512MB / 10 files | Good (single-column) | No (transcribe first) | 128K tokens |
| Claude (Sonnet/Opus) | 30MB / file | Best (multi-column) | No (transcribe first) | 200K tokens |
| Gemini 1.5 | 100MB / Drive | Good | Yes (MP3, MP4) | 1M tokens |
9. The Convert-First Workflow
Converting a file before uploading takes a few minutes. Debugging confused AI output because you skipped that step can take an hour. Here are the three conversions that solve most problems:
Scanned PDF → TXT
Upload the scanned PDF to Google Drive — Drive runs OCR automatically when it indexes the file.
Right-click the file → Open with Google Docs. Google converts the image-based PDF to an editable text document via OCR. Takes 10–30 seconds.
File → Download → Plain text (.txt). Upload this file to your AI tool. The difference in output quality is substantial.
Excel Workbook → CSV
In Excel, save each relevant sheet as a separate CSV with a descriptive filename: sales-q1.csv, budget-2026.csv.
Upload the CSVs and refer to them by filename in your prompt, for example: "Based on sales-q1.csv and sales-q2.csv, what changed between quarters?" The AI reads filenames as context, and this works better than one combined XLSX.
Use the XLSX to CSV converter — no software needed, nothing uploaded to a server. See also the Excel to CSV guide.
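If the sheet data is already available in a script, the one-file-per-sheet convention is easy to automate with Python's standard `csv` module. In this sketch the `sheets` dictionary is a hypothetical stand-in for workbook contents that have already been read and pasted as values; reading the XLSX itself is out of scope here and would need a third-party library.

```python
import csv
import tempfile
from pathlib import Path


def export_sheets(sheets, out_dir):
    """Write each sheet (name -> list of rows) to its own descriptively named CSV."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in sheets.items():
        path = out_dir / f"{name}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(path)
    return written


# Hypothetical workbook contents, already reduced to plain values.
sheets = {
    "sales-q1": [["region", "revenue"], ["north", 1200], ["south", 950]],
    "sales-q2": [["region", "revenue"], ["north", 1350], ["south", 900]],
}

with tempfile.TemporaryDirectory() as tmp:
    paths = export_sheets(sheets, tmp)
    names = [path.name for path in paths]
    first_file_lines = paths[0].read_text(encoding="utf-8").splitlines()

print(names)
print(first_file_lines)
```

Keeping the sheet name as the filename means the prompt can refer to each file directly, as in the example question above.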
PowerPoint → Markdown
The visual layout of a presentation doesn't survive file upload anyway. Transitions, animations, spatial arrangements — all gone. The AI only needs the text.
Copy each slide's text into a Markdown file and use ## Slide: [title] as the header for each slide. This makes explicit the slide boundaries that PPTX extraction often loses.
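A minimal sketch of that layout, assuming the slide text has already been copied out of the deck into (title, bullets) pairs. The `slides_to_markdown` function and the sample slides are hypothetical.

```python
def slides_to_markdown(slides):
    """Render (title, bullets) pairs as Markdown with one '## Slide:' header each."""
    blocks = []
    for title, bullets in slides:
        lines = [f"## Slide: {title}"]
        lines.extend(f"- {bullet}" for bullet in bullets)
        blocks.append("\n".join(lines))
    # A blank line between blocks keeps slide boundaries unambiguous.
    return "\n\n".join(blocks) + "\n"


# Hypothetical deck content.
slides = [
    ("Q1 Results", ["Revenue up", "Churn flat"]),
    ("Next Steps", ["Hire two engineers"]),
]

markdown = slides_to_markdown(slides)
print(markdown)
```

Speaker notes, if they matter, can be appended under each header as ordinary paragraphs.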
10. Formats Most People Don't Think to Use
HTML (for research papers)
If a research paper on arXiv has an HTML version (arxiv.org/html/XXXX.XXXXX), it is almost always a better upload than the PDF. The HTML version has no columns, tables are semantic HTML elements, and equations come through as structured markup instead of loosely positioned symbols. This applies to any web-native content: W3C specifications, MDN documentation, standards documents. When an HTML version exists, use it.
JSON (for structured data)
Nested or hierarchical data is usually better as JSON than as a table. AI models have seen very large amounts of JSON during training (API responses, configuration files, data exports) and they parse it naturally. A product catalog, a configuration file, a database export, an API response: all of these are better uploaded as JSON than reformatted into a PDF table or spreadsheet. The model understands the key-value structure without any extraction overhead, and nested relationships (which flatten badly in CSV) remain intact.
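The difference shows up as soon as a record contains a list. The sketch below serializes one hypothetical product both ways using only the standard library; the record and its field names are invented for illustration.

```python
import csv
import io
import json

# Hypothetical catalog entry with a nested list of variants.
product = {
    "sku": "A-100",
    "name": "Desk lamp",
    "variants": [
        {"color": "black", "stock": 4},
        {"color": "white", "stock": 0},
    ],
}

# JSON keeps the parent/child relationship explicit.
as_json = json.dumps(product, indent=2)

# Flattening to CSV repeats the parent fields on every variant row.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["sku", "name", "color", "stock"])
for variant in product["variants"]:
    writer.writerow([product["sku"], product["name"], variant["color"], variant["stock"]])
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

In the CSV form, nothing marks the two rows as belonging to one product except the repeated `sku`, and a second level of nesting would need yet another set of repeated columns.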
Plain Text With Custom Structure
You don't need a formal format to add structure. Consistent section markers in plain text work well: write === SECTION: Financial Summary === before each section. AI models are pattern matchers — consistent structural signals in plain text are picked up reliably even without Markdown syntax.
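As a rough illustration, the marker convention can be applied and checked with a few lines of code. The helper names and sample sections below are hypothetical; only the `=== SECTION: ... ===` pattern comes from the text above.

```python
import re

# Matches a full marker line such as "=== SECTION: Financial Summary ===".
MARKER = re.compile(r"^=== SECTION: (.+?) ===$", re.MULTILINE)


def add_markers(sections):
    """Join (title, body) pairs into one plain-text document with section markers."""
    parts = [f"=== SECTION: {title} ===\n{body.strip()}" for title, body in sections]
    return "\n\n".join(parts) + "\n"


def section_titles(text):
    """List the section titles found in a marked-up document, in order."""
    return MARKER.findall(text)


# Hypothetical document content.
sections = [
    ("Financial Summary", "Revenue rose while costs held steady."),
    ("Risks", "Supplier concentration remains high."),
]

document = add_markers(sections)
print(document)
print(section_titles(document))
```

Listing the titles back out is a quick check that every marker follows exactly the same pattern, which is what makes the signal reliable.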
11. Best Format for AI Training Data
Everything above applies to people using AI tools. There's a completely different audience who fine-tune or train models themselves. The formats that matter are not the same.
JSONL (JSON Lines) is the de facto standard for fine-tuning datasets. OpenAI's fine-tuning API, Hugging Face tooling, Axolotl, and most other fine-tuning frameworks accept it. Each line is a complete JSON object representing one training example:
{"messages": [{"role": "user", "content": "Summarize this contract."}, {"role": "assistant", "content": "The contract establishes..."}]}
For large-scale pretraining, the standard shifts to large raw text files with document separator tokens between entries. Parquet is the columnar storage format used on the Hugging Face Datasets Hub for large shared datasets — compressed, typed, and designed for analytical workloads. If you're creating your own fine-tuning data from scratch, JSONL is simpler and more than adequate.
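For anyone assembling a JSONL file by hand, a small writer and validator helps catch malformed lines before a training job rejects them. This sketch assumes the chat-style `messages` layout shown in the example above; the function names are hypothetical, and individual frameworks may require additional fields or checks.

```python
import json


def to_jsonl(examples):
    """Serialize one JSON object per line; json.dumps escapes embedded newlines."""
    return "".join(json.dumps(example) + "\n" for example in examples)


def validate_jsonl(text):
    """Return the number of valid examples; raise ValueError on the first bad line."""
    count = 0
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines, including the trailing one
        record = json.loads(line)  # JSONDecodeError is a ValueError subclass
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"line {number}: expected a non-empty 'messages' list")
        for message in messages:
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                raise ValueError(f"line {number}: each message needs 'role' and 'content'")
        count += 1
    return count


# Hypothetical training examples.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize this contract."},
        {"role": "assistant", "content": "The contract establishes..."},
    ]},
    {"messages": [
        {"role": "user", "content": "List the parties.\nUse bullets."},
        {"role": "assistant", "content": "- Party A\n- Party B"},
    ]},
]

jsonl_text = to_jsonl(examples)
valid_count = validate_jsonl(jsonl_text)
print(valid_count)
```

The second example contains newlines inside its strings; they are escaped during serialization, so each example still occupies exactly one line of the file.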
If you're just uploading documents to ChatGPT, Claude, or Gemini for analysis — not training a model — the training formats (JSONL, Parquet) are completely irrelevant. TXT, Markdown, and CSV are what matter for you.
12. Frequently Asked Questions
What is the best file format to upload to ChatGPT or Claude?
Can AI read PDF files accurately?
Does file format affect AI token costs?
What format is best for feeding spreadsheet data to AI?
Should I use Markdown when writing documents for AI?
What image format works best for AI image recognition tasks?
Does Claude handle PDFs better than ChatGPT?
Can I upload audio or video files to AI for analysis?
What file format does ChatGPT accept for uploads?
Is PDF or Word better for uploading to AI tools?
What are the file upload limits for ChatGPT, Claude, and Gemini?
What is the best file format for AI training data?
Format choice is one of those things that sounds like a minor detail until you've spent 20 minutes wondering why a perfectly capable AI gave you a useless answer about a document it clearly misread. Getting this right takes about two minutes. The converters are right here if you need them.