🐦 Twitter Post Details

@jerryjliu0

Parsing PDFs is insanely hard

This is completely unintuitive at first glance, considering PDFs are the most commonly used container of unstructured data in the world. I wrote a blog post digging into the PDF representation itself, why it’s impossible to “simply” read the page into plaintext, and what the modern parsing techniques are 👇

The crux of the issue is that PDFs are designed to display text on a screen, and not to represent what a word means.

1️⃣ PDF text is represented as glyph shapes positioned at absolute x,y coordinates. Sometimes there’s no mapping from character codes back to a Unicode representation.
2️⃣ Most PDFs have no concept of a table. Tables are described as grid lines drawn with coordinates. A traditional parser would have to find intersections between lines to infer cell boundaries, then associate text with the cells algorithmically.
3️⃣ The order of operators has no relationship with reading order. You would need clustering techniques to piece the text together into a coherent logical format.

That’s why everyone today is excited about using VLMs to parse text. To be clear, this has a ton of benefits, but still has limitations in terms of accuracy and cost.

At @llama_index we’re building hybrid pipelines that interleave both text and VLMs to give extremely accurate parsing at the cheapest price points.

Blog: https://t.co/iLJpIr7cbH
LlamaParse: https://t.co/TqP6OT5U5O
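
To make points 2️⃣ and 3️⃣ concrete, here is a minimal Python sketch of the kind of geometry a traditional parser has to do. It is not taken from the blog post or from LlamaParse. The `Glyph` type, the `group_into_lines` and `cells_from_grid` helpers, and the `y_tolerance` parameter are all hypothetical names chosen for illustration. It also assumes the glyph positions and grid-line coordinates have already been extracted from the page, that y is measured from the top of the page, and that the table is a full grid in which every vertical line crosses every horizontal line. Real documents break all of these assumptions regularly.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A positioned character, as a hypothetical extractor might report it."""
    char: str
    x: float  # left edge, in page units
    y: float  # baseline, measured from the top of the page (assumption)


def group_into_lines(glyphs, y_tolerance=2.0):
    """Cluster glyphs with nearby baselines into lines, then order each line by x.

    The input order is ignored on purpose: operator order in a content
    stream says nothing about reading order.
    """
    lines = []  # each entry is [reference_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])
    return [
        "".join(g.char for g in sorted(members, key=lambda g: g.x))
        for _, members in lines
    ]


def cells_from_grid(vertical_xs, horizontal_ys):
    """Infer cell boxes (x0, y0, x1, y1) from grid-line coordinates, row by row.

    Simplifying assumption: a full grid, so every pair of adjacent vertical
    lines and adjacent horizontal lines bounds exactly one cell.
    """
    xs = sorted(set(vertical_xs))
    ys = sorted(set(horizontal_ys))
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


# Glyphs arrive in drawing order, which here differs from reading order.
glyphs = [Glyph("b", 20, 10.5), Glyph("c", 10, 30), Glyph("a", 10, 10), Glyph("d", 20, 30.2)]
print(group_into_lines(glyphs))                 # ['ab', 'cd']
print(len(cells_from_grid([0, 50, 100], [0, 20, 40])))  # 4
```

Even this toy version needs a hand-tuned tolerance, and it cannot cope with multi-column layouts, merged cells, or rotated text, which is part of why vision models are attractive for the harder cases.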

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029998812216127763/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029998812216127763/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029998812216127763/media_1.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029998812216127763/media_1.jpg?",
      "type": "photo",
      "filename": "media_1.jpg"
    }
  ],
  "processed_at": "2026-03-07T14:16:44.383768",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2029998812216127763",
  "url": "https://x.com/jerryjliu0/status/2029998812216127763",
  "twitterUrl": "https://twitter.com/jerryjliu0/status/2029998812216127763",
  "text": "Parsing PDFs is insanely hard\n\nThis is completely unintuitive at first glance, considering PDFs are the most commonly used container of unstructured data in the world. I wrote a blog post digging into the PDF representation itself, why its impossible to “simply” read the page into plaintext, and what the modern parsing techniques are 👇\n\nThe crux of the issue is that PDFs are designed to display text on a screen, and not to represent what a word means.\n\n1️⃣ PDF text is represented as glyph shapes positioned at absolute x,y coordinates. Sometimes there’s no mapping from character codes back to a unicode representation\n2️⃣ Most PDFs have no concept of a table. Tables are described as grid lines drawn with coordinates. Traditional parser would have to find intersections between lines to infer cell boundaries and associate with text within cells through algorithms\n3️⃣ The order of operators has no relationship with reading order. You would need clustering techniques to be able to piece together text into a coherent logical format.\n\nThat’s why everyone today is excited about using VLMs to parse text. Which to be clear has a ton of benefits, but still limitations in terms of accuracy and cost.\n\nAt @llama_index we’re building hybrid pipelines that interleave both text and VLMs to give both extremely accurate parsing at the cheapest price points.\n\nBlog: https://t.co/iLJpIr7cbH\nLlamaParse: https://t.co/TqP6OT5U5O",
  "source": "Twitter for iPhone",
  "retweetCount": 51,
  "replyCount": 17,
  "likeCount": 648,
  "quoteCount": 8,
  "viewCount": 68056,
  "createdAt": "Fri Mar 06 19:13:27 +0000 2026",
  "lang": "en",
  "bookmarkCount": 642,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2029998812216127763",
  "displayTextRange": [
    0,
    276
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "jerryjliu0",
    "url": "https://x.com/jerryjliu0",
    "twitterUrl": "https://twitter.com/jerryjliu0",
    "id": "369777416",
    "name": "Jerry Liu",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1283610285031460864/1Q4zYhtb_normal.jpg",
    "coverPicture": "",
    "description": "",
    "location": "",
    "followers": 70551,
    "following": 1462,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Wed Sep 07 22:54:31 +0000 2011",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 8474,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 1435,
    "statusesCount": 6679,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2029725308866498850"
    ],
    "profile_bio": {
      "description": "document OCR + workflows @llama_index. cofounder/CEO\n\nCareers: https://t.co/EUnMNmb4DZ\nEnterprise: https://t.co/Ht5jwxRU13",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [
            {
              "display_url": "llamaindex.ai/careers",
              "expanded_url": "https://www.llamaindex.ai/careers",
              "indices": [
                63,
                86
              ],
              "url": "https://t.co/EUnMNmb4DZ"
            },
            {
              "display_url": "llamaindex.ai/contact",
              "expanded_url": "https://www.llamaindex.ai/contact",
              "indices": [
                99,
                122
              ],
              "url": "https://t.co/Ht5jwxRU13"
            }
          ],
          "user_mentions": [
            {
              "id_str": "0",
              "indices": [
                25,
                37
              ],
              "name": "",
              "screen_name": "llama_index"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "llamaindex.ai",
              "expanded_url": "https://www.llamaindex.ai/",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/YiIfjVl1ly"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/OATQaCz7Xf",
        "expanded_url": "https://twitter.com/jerryjliu0/status/2029998812216127763/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": [
              {
                "h": 127,
                "w": 127,
                "x": 1275,
                "y": 501
              }
            ]
          },
          "orig": {
            "faces": [
              {
                "h": 128,
                "w": 128,
                "x": 1279,
                "y": 503
              }
            ]
          }
        },
        "id_str": "2029998772659609600",
        "indices": [
          277,
          300
        ],
        "media_key": "3_2029998772659609600",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARwsASDa2uAACgACHCwBKhCbcRMAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABHCwBINra4AAKAAIcLAEqEJtxEwAA",
            "media_key": "3_2029998772659609600"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HCwBINra4AAkUEm.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 1150,
              "w": 2053,
              "x": 0,
              "y": 0
            },
            {
              "h": 1159,
              "w": 1159,
              "x": 894,
              "y": 0
            },
            {
              "h": 1159,
              "w": 1017,
              "x": 977,
              "y": 0
            },
            {
              "h": 1159,
              "w": 580,
              "x": 1195,
              "y": 0
            },
            {
              "h": 1159,
              "w": 2053,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1159,
          "width": 2053
        },
        "sizes": {
          "large": {
            "h": 1156,
            "w": 2048
          }
        },
        "type": "photo",
        "url": "https://t.co/OATQaCz7Xf"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "llamaindex.ai/blog/why-readi…",
        "expanded_url": "https://www.llamaindex.ai/blog/why-reading-pdfs-is-hard?utm_source=xjl&utm_medium=social",
        "indices": [
          1367,
          1390
        ],
        "url": "https://t.co/iLJpIr7cbH"
      },
      {
        "display_url": "cloud.llamaindex.ai/?utm_source=xj…",
        "expanded_url": "https://cloud.llamaindex.ai/?utm_source=xjl&utm_medium=social",
        "indices": [
          1403,
          1426
        ],
        "url": "https://t.co/TqP6OT5U5O"
      }
    ],
    "user_mentions": [
      {
        "id_str": "1604278358296055808",
        "indices": [
          1210,
          1222
        ],
        "name": "LlamaIndex 🦙",
        "screen_name": "llama_index"
      }
    ]
  },
  "quoted_tweet": {
    "type": "tweet",
    "id": "2029995922529386760",
    "url": "https://x.com/llama_index/status/2029995922529386760",
    "twitterUrl": "https://twitter.com/llama_index/status/2029995922529386760",
    "text": "PDFs are the bane of every AI agent's existence: here's why parsing them is so much harder than you think 📄\n\nEvery developer building document agents eventually hits the same wall: PDFs weren't designed to be machine-readable. They're drawing instructions from 1982, not structured data.\n\n📝 PDF text isn't stored as characters: it's glyph shapes positioned at coordinates with no semantic meaning\n📊 Tables don't exist as objects: they're just lines and text that happen to look tabular when rendered\n🔄 Reading order is pure guesswork — content streams have zero relationship to visual flow\n🤖 Seventy years of OCR evolution led us to combine text extraction with vision models for optimal results\n\nWe built LlamaParse using this hybrid approach: fast text extraction for standard content, vision models for complex layouts. It's how we're solving document processing at scale.\n\nRead the full breakdown of why PDFs are so challenging and how we're tackling it: https://t.co/K8bQmgq7xN",
    "source": "Twitter for iPhone",
    "retweetCount": 12,
    "replyCount": 5,
    "likeCount": 96,
    "quoteCount": 3,
    "viewCount": 71796,
    "createdAt": "Fri Mar 06 19:01:58 +0000 2026",
    "lang": "en",
    "bookmarkCount": 120,
    "isReply": false,
    "inReplyToId": null,
    "conversationId": "2029995922529386760",
    "displayTextRange": [
      0,
      270
    ],
    "inReplyToUserId": null,
    "inReplyToUsername": null,
    "author": {
      "type": "user",
      "userName": "llama_index",
      "url": "https://x.com/llama_index",
      "twitterUrl": "https://twitter.com/llama_index",
      "id": "1604278358296055808",
      "name": "LlamaIndex 🦙",
      "isVerified": false,
      "isBlueVerified": true,
      "verifiedType": "Business",
      "profilePicture": "https://pbs.twimg.com/profile_images/1967920417760251904/0ytfduMQ_normal.png",
      "coverPicture": "https://pbs.twimg.com/profile_banners/1604278358296055808/1770092126",
      "description": "",
      "location": "",
      "followers": 109426,
      "following": 29,
      "status": "",
      "canDm": false,
      "canMediaTag": true,
      "createdAt": "Sun Dec 18 00:52:44 +0000 2022",
      "entities": {
        "description": {
          "urls": []
        },
        "url": {}
      },
      "fastFollowersCount": 0,
      "favouritesCount": 1499,
      "hasCustomTimelines": true,
      "isTranslator": false,
      "mediaCount": 1832,
      "statusesCount": 3754,
      "withheldInCountries": [],
      "affiliatesHighlightedLabel": {},
      "possiblySensitive": false,
      "pinnedTweetIds": [
        "2029767312195117278"
      ],
      "profile_bio": {
        "description": "AI Agents for document OCR + workflows\n\nLlamaParse: https://t.co/yQGTiRSfFL\nDocs: https://t.co/us6GCS14vD",
        "entities": {
          "description": {
            "hashtags": [],
            "symbols": [],
            "urls": [
              {
                "display_url": "cloud.llamaindex.ai",
                "expanded_url": "https://cloud.llamaindex.ai/",
                "indices": [
                  52,
                  75
                ],
                "url": "https://t.co/yQGTiRSfFL"
              },
              {
                "display_url": "developers.llamaindex.ai/python/cloud/",
                "expanded_url": "https://developers.llamaindex.ai/python/cloud/",
                "indices": [
                  82,
                  105
                ],
                "url": "https://t.co/us6GCS14vD"
              }
            ],
            "user_mentions": []
          },
          "url": {
            "urls": [
              {
                "display_url": "llamaindex.ai",
                "expanded_url": "https://www.llamaindex.ai/",
                "indices": [
                  0,
                  23
                ],
                "url": "https://t.co/epzefqPT9Z"
              }
            ]
          }
        }
      },
      "isAutomated": false,
      "automatedBy": null
    },
    "extendedEntities": {
      "media": [
        {
          "display_url": "pic.twitter.com/4O4C8hQ7Ml",
          "expanded_url": "https://twitter.com/llama_index/status/2029995922529386760/photo/1",
          "ext_media_availability": {
            "status": "Available"
          },
          "features": {
            "large": {
              "faces": [
                {
                  "h": 127,
                  "w": 127,
                  "x": 1275,
                  "y": 501
                }
              ]
            },
            "orig": {
              "faces": [
                {
                  "h": 128,
                  "w": 128,
                  "x": 1279,
                  "y": 503
                }
              ]
            }
          },
          "id_str": "2029995918804811776",
          "indices": [
            271,
            294
          ],
          "media_key": "3_2029995918804811776",
          "media_results": {
            "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARwr/ohj2oAACgACHCv+iUHbAQgAAA==",
            "result": {
              "__typename": "ApiMedia",
              "id": "QXBpTWVkaWE6DAABCgABHCv+iGPagAAKAAIcK/6JQdsBCAAA",
              "media_key": "3_2029995918804811776"
            }
          },
          "media_url_https": "https://pbs.twimg.com/media/HCv-iGPagAAJ-qL.jpg",
          "original_info": {
            "focus_rects": [
              {
                "h": 1150,
                "w": 2053,
                "x": 0,
                "y": 0
              },
              {
                "h": 1159,
                "w": 1159,
                "x": 894,
                "y": 0
              },
              {
                "h": 1159,
                "w": 1017,
                "x": 977,
                "y": 0
              },
              {
                "h": 1159,
                "w": 580,
                "x": 1195,
                "y": 0
              },
              {
                "h": 1159,
                "w": 2053,
                "x": 0,
                "y": 0
              }
            ],
            "height": 1159,
            "width": 2053
          },
          "sizes": {
            "large": {
              "h": 1156,
              "w": 2048
            }
          },
          "type": "photo",
          "url": "https://t.co/4O4C8hQ7Ml"
        }
      ]
    },
    "card": null,
    "place": {},
    "entities": {
      "hashtags": [],
      "symbols": [],
      "urls": [
        {
          "display_url": "llamaindex.ai/blog/why-readi…",
          "expanded_url": "https://www.llamaindex.ai/blog/why-reading-pdfs-is-hard?utm_source=socials&utm_medium=li_social",
          "indices": [
            959,
            982
          ],
          "url": "https://t.co/K8bQmgq7xN"
        }
      ],
      "user_mentions": []
    },
    "quoted_tweet": null,
    "retweeted_tweet": null,
    "isLimitedReply": false,
    "article": null
  },
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}