SwiftText | Cocoanetics

December 24, 2025


Over the course of the last 12 months, I’ve had quite a few side projects that required some way to get text from a variety of sources, with the code and frameworks spread across various private repos. Some time ago, I felt an inkling to start pulling these together into an open source project. So this will be my Christmas present for you this year.

SwiftText collects various ways of getting text (or, if possible, Markdown) from a variety of sources and places.

Update: … now Images, PDFs, Word DOCX and also HTML pages or URLs.

One such use case was to get plain text from bank statements for my investment portfolio, so that I could parse the text and assemble a CSV file to upload my holdings to Yahoo Finance.

Reading PDFs

For the most part these statements were normal PDFs that had been created programmatically. The advantage of these is that you can get the exact text from selection ranges, just like when you select the text and then copy it to the pasteboard. This is one type of PDF you can find: the kind with vector data. Basically these files are just a record of drawing instructions into a vector context.

But there was a problem, because some of these statements were scanned from paper. This is the other, less useful, type of PDF: these are essentially collections of bitmap images, one per page. Luckily we do have quite capable OCR on Mac and iOS in the form of the Vision framework.

With both PDF selection ranges and text fragments from Vision you get rectangles with text. So I made it such that you only have to ask a PDFPage for its textLines(). It will first attempt to get the text from the selection ranges, and if that fails it will render the page into a 300 DPI bitmap and then OCR it, to still give you more or less the same result. These text lines are made up of those fragments that likely form a line, even though there might be tabs or whitespace between them.
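To make the line-forming step concrete, here is a minimal sketch of how fragments with rectangles could be grouped into lines. It is not the code in SwiftText: the `TextFragment` type and the `groupIntoLines` function are hypothetical names, the rule "vertical centers within half a line height" is an assumption, and the coordinates assume a top-left origin (PDFKit itself uses a bottom-left origin).

```swift
import Foundation

/// A piece of recognized text with its bounding box (hypothetical type for this sketch).
struct TextFragment {
    let text: String
    let x: Double       // left edge
    let y: Double       // top edge, assuming a top-left origin
    let width: Double
    let height: Double

    var midY: Double { y + height / 2 }
}

/// Groups fragments into lines. A fragment joins the current line when its
/// vertical center is within half the height of that line's first fragment.
func groupIntoLines(_ fragments: [TextFragment]) -> [String] {
    // Sort top to bottom so fragments of one line end up next to each other.
    let sorted = fragments.sorted { $0.midY < $1.midY }
    var lines: [[TextFragment]] = []

    for fragment in sorted {
        if let last = lines.last, let anchor = last.first,
           abs(fragment.midY - anchor.midY) <= anchor.height / 2 {
            lines[lines.count - 1].append(fragment)
        } else {
            lines.append([fragment])
        }
    }

    // Within a line, order fragments left to right and join them with a space.
    return lines.map { line in
        line.sorted { $0.x < $1.x }.map(\.text).joined(separator: " ")
    }
}

let fragments = [
    TextFragment(text: "Total", x: 10, y: 100, width: 40, height: 12),
    TextFragment(text: "ACME Corp", x: 10, y: 50, width: 80, height: 12),
    TextFragment(text: "1,234.56", x: 200, y: 101, width: 60, height: 12),
    TextFragment(text: "ISIN", x: 200, y: 49, width: 30, height: 12),
]
let lines = groupIntoLines(fragments)
print(lines)   // ["ACME Corp ISIN", "Total 1,234.56"]
```

Because the grouping only looks at rectangles, the same function works whether the fragments came from selection ranges or from OCR.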

This was the state of this private framework for the longest time. It saw a lot more use in a receipt scanner I’m building for myself. And when a friend asked me to translate several PDFs, I was extremely lucky to have a quick way to get the raw text from those PDFs to feed into ChatGPT. This opened my mind to the possibility that this might be quite useful in agentic scenarios where agents need to get to the text of things.

So the idea for SwiftText was born: it should be an open source project that collects various ways of getting text (or, if possible, Markdown) from a variety of sources and places.

Reading DOCX

For PDFs I had already covered both kinds of PDF files; extracting text from bitmap images via OCR was a simple exercise. Then there was a case where I wanted to get the plain text from a Word document (DOCX) instead of a PDF. Granted, I could just copy the text out of that, but my goal is to have this in the form of a tool that I can use to automate such work in the future.

I had a look at how DOCX files are built: they’re just a ZIP archive of a couple of XML files. At the heart there is a document.xml which contains the actual document text. So I gave this task to Codex, and with almost no further input from me it was able to create a utility that could output the plain text from such a Word document. Behind the scenes it uses XMLParser, so the only external dependency for that is ZIPFoundation, because to my knowledge there is no first-party ZIP reading capability that fits this use case across Apple’s platforms.
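As an illustration of the XMLParser part, here is a minimal sketch that pulls plain text out of the XML found in a DOCX archive. It is not the code Codex produced: `DocxTextCollector` and `plainText(fromDocumentXML:)` are hypothetical names, the unzip step (for example with ZIPFoundation) is left out, and the sketch assumes the usual `w:` prefix and ignores tabs, line breaks, tables, and nested paragraphs.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML   // XMLParser lives in this module on Linux
#endif

/// Collects plain text from WordprocessingML: text runs live in <w:t>
/// elements, and each <w:p> element is one paragraph.
final class DocxTextCollector: NSObject, XMLParserDelegate {
    private(set) var paragraphs: [String] = []
    private var current = ""
    private var insideText = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        if elementName == "w:p" { current = "" }
        if elementName == "w:t" { insideText = true }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Only characters inside <w:t> are document text.
        if insideText { current += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "w:t" { insideText = false }
        if elementName == "w:p" { paragraphs.append(current) }
    }
}

/// Parses the contents of word/document.xml and returns one line per paragraph.
func plainText(fromDocumentXML xml: String) -> String {
    let parser = XMLParser(data: Data(xml.utf8))
    let collector = DocxTextCollector()
    parser.delegate = collector
    parser.parse()
    return collector.paragraphs.joined(separator: "\n")
}

let sample = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>
"""
let text = plainText(fromDocumentXML: sample)
print(text)
```

In a real DOCX the main part is stored at word/document.xml inside the archive, so the string passed in would come from that entry after unzipping.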

Markdown has a slight edge over plain text because it marks emphasis on specific words, tells us about headings of different levels, and also clearly structures lists, numbered or bulleted. My Codex agent had no problem pulling this style information out of the DOCX contents as well.

SwiftText comes with a demo CLI app that lets you perform OCR. This gives you Markdown for a Word file:

swift run swifttext docx file.docx --markdown

For PDF or bitmaps you do:

swift run swifttext ocr file

For the latter I do have experimental Markdown support, but it’s been very tricky to get semantic information from these kinds of sources. I have the beginnings of a semantic parser, again built on Vision, which promises proper paragraphs, tables, and lists. Unfortunately, so far I couldn’t get it to work reliably. The problem with tables is that Vision seems to be very easily thrown off by some layouts, detecting superfluous columns and whatnot. The best approach here would probably be to look for lines that always have text at the same x positions and then infer the table structure from that. This is clear future work.
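The x-position idea could be sketched roughly like this. This is only an illustration of the heuristic, not anything that exists in SwiftText: `EdgeCluster`, `inferColumnStarts`, the tolerance of 3 points, and the 80% threshold are all assumptions made for the example.

```swift
import Foundation

/// A group of left edges that are close enough to count as one x position.
struct EdgeCluster {
    let x: Double          // representative position: the first edge seen
    var lines: Set<Int>    // indices of the lines with a fragment starting here
}

/// Returns the x positions at which at least `minShare` of the lines start a
/// fragment. Edges within `tolerance` of a cluster's first edge are merged.
func inferColumnStarts(lines: [[Double]],
                       tolerance: Double = 3,
                       minShare: Double = 0.8) -> [Double] {
    var clusters: [EdgeCluster] = []

    for (lineIndex, edges) in lines.enumerated() {
        for edge in edges {
            if let i = clusters.firstIndex(where: { abs($0.x - edge) <= tolerance }) {
                clusters[i].lines.insert(lineIndex)
            } else {
                clusters.append(EdgeCluster(x: edge, lines: [lineIndex]))
            }
        }
    }

    // A position counts as a column only if enough lines share it.
    let needed = Int((Double(lines.count) * minShare).rounded(.up))
    return clusters
        .filter { $0.lines.count >= needed }
        .map { $0.x }
        .sorted()
}

// Left edges of the fragments on four lines; the stray 60 appears only once.
let edgesPerLine: [[Double]] = [
    [10, 120, 250],
    [11, 119, 251],
    [10, 60, 121, 250],
    [9, 120, 249],
]
let columns = inferColumnStarts(lines: edgesPerLine)
print(columns)   // [10.0, 120.0, 250.0]
```

Right-aligned numeric columns would need the same treatment on right edges, which this sketch does not attempt.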

Of course the easiest approach would be to just hand your files to ChatGPT, or some local vision-enabled LLM, and ask it to give you the text. But with that decision you leave the realm of perfect determinism and structure, and you also start paying for those tokens. There is still something to be said for a purely local solution that leverages functionality available natively on Apple platforms. The dependency on the Vision framework in particular will make it impossible for this to ever be available on other platforms. But alas, I can live with only being able to support iOS and Mac with SwiftText.

Warning: Traits

This package has another first for me: package traits.

With these, if you use Swift tools 6.1 or higher, you can import SwiftText as an umbrella module which itself contains SwiftTextOCR, SwiftTextPDF, and SwiftTextDOCX.

If I understand that correctly, at some point in the future SwiftPM will be able to omit external dependencies if they are not needed. Right now they are still being resolved and downloaded, although not compiled if not referenced by code. The one immediate nicety is that you can simply import SwiftText in your code, and the enabled traits determine what gets packaged into that for you.

This is an improvement over the previous method of having separate imports for all targets/products you want: import SwiftTextPDF and import SwiftTextDOCX (and perhaps future traits like, dare I say, HTML).
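For readers who have not seen traits yet, here is a rough sketch of what a manifest with traits can look like under Swift tools 6.1. It is not SwiftText’s actual Package.swift: the package name `TextKitDemo`, the trait names `PDF` and `DOCX`, and the ZIPFoundation version requirement are all made up for illustration.

```swift
// swift-tools-version: 6.1
import PackageDescription

// Hypothetical manifest: package, trait, and target names are illustrative
// and not copied from the SwiftText repository.
let package = Package(
    name: "TextKitDemo",
    products: [
        .library(name: "TextKitDemo", targets: ["TextKitDemo"]),
    ],
    traits: [
        .trait(name: "PDF", description: "Include PDF text extraction"),
        .trait(name: "DOCX", description: "Include Word document extraction"),
        // Traits that are enabled when a client does not pick any explicitly.
        .default(enabledTraits: ["PDF", "DOCX"]),
    ],
    dependencies: [
        .package(url: "https://github.com/weichsel/ZIPFoundation.git", from: "0.9.0"),
    ],
    targets: [
        .target(
            name: "TextKitDemo",
            dependencies: [
                // Only part of the target when the DOCX trait is enabled.
                .product(
                    name: "ZIPFoundation",
                    package: "ZIPFoundation",
                    condition: .when(traits: ["DOCX"])
                ),
            ]
        ),
    ]
)
```

A client would then choose traits where it declares the dependency, for example by passing `traits: ["PDF"]` to its `.package(url:from:traits:)` entry, and the package’s own sources can wrap trait-specific code in `#if DOCX … #endif`.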

Quo Vadis?

I have a few more private things that I would like to see move into SwiftText. I do have a functioning tool that gets Markdown from HTML, which requires libXML. This is useful for getting an LLM-friendly version of web pages.

Some web pages build their content with JavaScript, for example the OpenAI API documentation. I’ve got a solution for that as well: it loads the web page with WebKit and waits for the DOM to be complete. Then it extracts the DOM’s HTML and parses that.

So these will be some of the next additions to this project. Then there’s of course more document semantics. It would be great to get proper Markdown tables from anywhere. We’ll see about that. That will come more quickly from Word than from PDFs because XML is orders of magnitude more structured than PDFs.

Conclusion

I’m excited to share SwiftText with the OSS community because it has proven its worth to me on many occasions. I could have waited until it was even more polished, but I was eager to make my work here public. I have some ideas for the future direction of SwiftText, and I invite you to get in touch with specific use cases where improvements might fit with the spirit of SwiftText.

Update, later the same day….

Because Codex is really good at copying code between projects while integrating it, I was able to add my libXML-based HTMLParser as well as the code to convert HTML to Markdown. Enjoy!
