@jerryjliu0
The fundamental issue with PDF parsing is that PDFs are designed for display. Internally, a PDF is a set of instructions for drawing shapes at specific coordinates on the page (e.g. "render this string at coordinate (84, 720) with this font"). Displayed characters may not be stored contiguously at all, and there may be no font mapping back to Unicode, so you have no idea what a character actually is. Any PDF parser has to magically reconstruct this seemingly random sequence of display coordinates into semantically meaningful text, tables, and more. VLMs do help (screenshot the page and read it), but besides collapsing the metadata they still struggle with accuracy and cost. Note: Word/Pptx files are text representations, so they're typically a bit easier to read. Our entire company at @llama_index is laser-focused on PDF parsing, so we've been really trying to understand all the nuances of doc formats, especially PDFs 🙂 More notes on this coming soon.
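
To make the "shapes at coordinates" point concrete, here's a toy sketch of the reconstruction step in plain Python. Everything in it is an assumption for illustration: the `Glyph` record, the `reconstruct_lines` helper, and the tolerance values are all made up, and this is not how any particular parser (including ours) works. It just groups positioned text fragments into lines by y-coordinate and orders them by x.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned text fragment on a page."""
    text: str
    x: float      # left edge, in page units
    y: float      # baseline; PDF y grows upward from the bottom of the page
    width: float  # horizontal advance of the fragment


def reconstruct_lines(glyphs, y_tolerance=2.0, space_gap=1.5):
    """Group fragments into lines by baseline, then read each line left to right.

    y_tolerance and space_gap are illustrative guesses, not tuned values.
    """
    # Top of the page first (largest y), then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))

    lines = []
    for g in ordered:
        # Same line if the baseline is close to the line's first fragment.
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            # Insert a space only when there is a visible horizontal gap.
            if cur.x - (prev.x + prev.width) > space_gap:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


# Fragments in "drawing order", which need not match reading order.
page = [
    Glyph("lo", 96, 720, 10),
    Glyph("Next", 84, 700, 20),
    Glyph("World", 112, 720.5, 25),
    Glyph("Hel", 84, 720, 12),
]
lines = reconstruct_lines(page)
print(lines)
```

Even this toy version needs two hand-picked thresholds, and it ignores columns, tables, rotated text, and missing Unicode mappings, which is where the real difficulty starts.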