Choose OCR Software

One way or another, you need software to turn your raw scans into searchable PDFs. OCR software of some sort most likely came with your scanner, but if it didn’t—or if you’re not happy with its features or accuracy—you have oodles of other choices. This chapter provides an overview of major factors to consider when choosing Mac-compatible OCR software, along with a few specific suggestions for software to try (or avoid).

Determine Your Needs

I haven’t tried every scanner and every OCR app out there, but I’m going to go out on a limb and suggest that almost any combination of scanner and software can be made to yield acceptable results for most users. If you don’t want to agonize over the decision, the path of least resistance is to use whatever software came with your scanner.

However, you should look more deeply into the capabilities of OCR tools before jumping in if any of the following statements applies to you:

  • You need to scan in multiple languages. All the OCR apps I discuss here support English text, and most support at least a few other languages too. If you have documents in more than one language (and especially if your documents mix more than one language on a page), you’ll want OCR software that supports those languages, as I discuss in the next topic.
  • You want capabilities your existing OCR software lacks. Perhaps you’ve tried the software that came with your scanner and found it to be too slow, too cumbersome to use, or missing features you wish it had. If so, by all means look for a replacement!
  • You need more-advanced PDF processing features. Some of the OCR programs here do nothing but spit out a searchable PDF file, whereas others let you manipulate PDFs to your heart’s content. If you want fine-grained control over your searchable PDFs, look for such a program.

If you have none of those needs, feel free to skip to the next chapter, Configure Your Software. Otherwise, continue reading to learn about features to consider when evaluating OCR software.

Consider Important OCR Features

Comparing OCR apps for macOS is less of a science than an art, and a messy one at that. The information available on developers’ Web sites varies tremendously in scope and detail. Some have elaborate user manuals, while others have only a brief how-to guide. Many offer downloadable demo versions, but some don’t. Developers use different terms to describe the same features, and have wildly divergent ideas about what constitutes a nicely usable interface. A feature that one developer considers too obvious to mention may be a main selling point for another. And although most of these apps claim to have outstanding OCR accuracy, objective measurements are notoriously difficult to come by.

In short, it’s harder than one might imagine to evaluate OCR software without trying it out (and even then, results may be ambiguous). However, a few factors are worth looking for:

  • Accuracy: No OCR software is 100 percent accurate, but it’s been a long time since I used OCR software that didn’t come close enough to meet my basic searching and archiving needs. (Remember, if all you need to do with your PDFs is search them, occasional OCR mistakes won’t affect your results much.) Nevertheless, because so many factors influence OCR accuracy (not the least of which is the quality of the raw scans that your scanner produces), it’s possible for two people to have dramatically different results with the same app and even the same document. So, my advice is to take developers’ claims of accuracy with a teaspoon of salt.

    The best way to determine whether results are good enough for your needs is to try an OCR tool on freshly scanned documents from your scanner of choice. Then, select all the text in the PDF, paste it into an empty document in your favorite word processor, and run a spelling check (or skim for errors manually). If the terms you’re likely to search for are often incorrect, you might want to look for a different program.

  • Languages: If not all the documents that you’ll scan are entirely in English, pay attention to the software’s multilingual support. The first task of OCR software is to recognize individual characters. If a document contains characters that don’t appear in English (such as Ω or ø), your OCR software must know about those other character sets in order to interpret them properly; otherwise, they’ll be represented by the nearest equivalents in the English alphabet. Beyond that, nearly all modern OCR software uses language-specific dictionaries and algorithms to improve its accuracy dramatically. If a certain group of pixels might be either “torn” or “tcm,” the software can check both strings against its dictionary and conclude that, for English, “torn” is the much more likely interpretation.

    The same goes for other languages, including those that use the same alphabet as English: you’ll get much better results if the program knows about the rules of that particular language.

    Even with support for multiple languages, though, OCR software may need help to narrow things down. If you specify up front which language a document is in (something usually done in the software’s preferences), you’ll get better results than if the software has to guess.

    Often, you can specify both a primary language and one or more secondary languages to improve the odds that the software will use the right rules as needed. Even so, OCR software that does perfectly well with pages that are entirely in a single language may have trouble determining when languages change within a page. Once again, this is best judged by selecting some representative documents from your collection, scanning them using the settings I cover in the next chapter, and then checking to see what results the OCR software provides.

    Be sure to verify that the resulting PDFs are, in fact, searchable, too, because OCR software sometimes handles non-Latin text in ways that make searching (and even copying text) problematic.

  • Automation support: Nearly all the OCR apps I’ve tested support at least a bare minimum level of automation; that is, there’s some way to configure them, perhaps in conjunction with your scanner’s included driver, such that newly scanned images are converted automatically into searchable PDFs without any need for manual intervention. (Sometimes this requires a bit of fiddling to set up, but I discuss that in the next chapter, Configure Your Software.) However, in many cases it’s better still to be able to automate PDF processing in more elaborate ways using AppleScript, a topic that I discuss later, in Automate OCR.
  • Handwriting support: Recognizing text that was produced by a (good-quality) printer is relatively easy for a computer; recognizing handwritten text is considerably more challenging. A minority of OCR programs for the Mac claim to be able to recognize (neat, printed) handwriting to one degree or another, so if you need to scan lots of handwritten notes, be sure to look for that feature.
  • Business card support: Any OCR program can recognize the raw text on a business card, but some have additional intelligence that enables them to infer (with mixed results) which string of text is a name, which is a title, which is a phone number, and so on, and then to put all those pieces into their appropriate fields in a database record that you can then export to, or sync with, Contacts or another contact manager. If you scan lots of business cards, this capability can save you lots of manual effort.
  • Receipt processing: In much the same way as some OCR software can recognize the contents of a business card, other programs are designed to make sense of receipts: they look specifically for information such as date, merchant name, sales tax, and total, and store that information in a database you can use for tracking expenses, preparing tax returns, and performing similar tasks.
  • PDF editing: A few apps are designed mainly for creating, editing, annotating, optimizing, and otherwise transforming PDF files—with OCR merely being one of their many tricks. If you need such advanced features, then choosing one of these multifunctional apps may make your life easier.
  • Layout retention: Although the focus of this book is on creating searchable PDFs, sometimes you may need to convert a scanned image into an editable document you can alter in, say, Word or Excel (see the sidebar Converting Scans to Microsoft Word Format). Several OCR programs can do just that—create editable documents whose layouts closely resemble those of the originals, including graphics, tables, and even similar fonts. Although the end result won’t be as faithful to the look and feel of the scanned original as a searchable PDF, it will be much easier to work with if you need to do anything other than search, read, and copy the text.
  • Document management: Once you have a searchable PDF, how do you search it? The answer may be macOS’s system-wide Spotlight feature, in which case you can put the file in any convenient folder in the Finder, or use tags (see the sidebar just ahead). However, you may prefer an app that lets you catalog, cross-reference, and search files with far greater flexibility and precision. If so, look for an app that includes not only OCR but also document management capabilities, or opt for a stand-alone document manager such as Yojimbo.
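
The spot check described under Accuracy above (copy the text out of a PDF and look for misspellings) can also be roughly automated. The following Python sketch is only an illustration and is not part of any OCR app: it assumes you have already copied the recognized text into a string, and KNOWN_WORDS is a tiny hypothetical stand-in for a real dictionary file. It reports the share of words the word list doesn’t recognize, which gives a rough sense of how often errors such as “tcm” for “torn” turn up.

```python
import re

# Hypothetical, deliberately tiny word list; a real check would load
# a full dictionary file for the document's language instead.
KNOWN_WORDS = {"the", "page", "was", "torn", "in", "half"}


def unknown_word_share(ocr_text, known_words):
    """Return the fraction of words in ocr_text that are not in known_words."""
    # Lowercase the text and pull out runs of letters (and apostrophes).
    words = re.findall(r"[a-z']+", ocr_text.lower())
    if not words:
        return 0.0
    unknown = [word for word in words if word not in known_words]
    return len(unknown) / len(words)


# Example: text pasted from a searchable PDF, with one OCR mistake ("tcm").
sample = "The page was tcm in half"
share = unknown_word_share(sample, KNOWN_WORDS)
print(f"{share:.0%} of the words are not in the word list")
```

A high share on documents you scanned yourself suggests trying different scan settings or a different OCR program. Keep in mind that names and jargon missing from the word list also count as unknown, so the number is a rough guide rather than a true accuracy measurement.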

With those thoughts in mind, let’s look at the range of OCR programs that you can choose from. (I offer my recommendations after the list of apps, in Joe’s OCR Software Recommendations.)

Pick a Mac OCR Package

The number and variety of Mac apps that can produce a searchable PDF are growing constantly. As I said earlier, if whatever software was bundled with your scanner yields results you find acceptable, there’s no need to look further. This is especially true if you buy a Fujitsu ScanSnap scanner. Although the software bundles vary slightly by model, they all include ABBYY FineReader for ScanSnap, ScanSnap Receipt (an excellent OCR app for receipts—easily on par with, if not better than, Neat and Paperless, which I discuss ahead), and CardMinder (an excellent OCR app for business cards).

But if for any reason you’re unsatisfied with the software that comes with your scanner, you should find no shortage of choices.

In the first edition of this book, I described 21 OCR apps. That list went out of date almost immediately. And frankly, I consider only a fraction of those apps to be noteworthy. So, just as I did with scanners, I’ve relocated the information on OCR software to the online appendixes, where you can peruse features and prices at your leisure, and where I can more easily keep them up to date.

Here, I want to call your attention to just a few of those choices that I find particularly interesting for one reason or another. If an OCR tool you’re wondering about isn’t listed here, check the online appendixes.

Notable OCR tools for Mac include:

  • ABBYY FineReader Pro for Mac: Customized versions of this software are often bundled with scanners or embedded in other apps, but it’s available as a stand-alone product too. (FineReader Pro replaces an earlier version of the product, FineReader Express, which is still bundled with some scanners.) Although FineReader (in whichever form) is the most accurate OCR app I’ve tested, the stand-alone version offers little in the way of configurability, whereas some bundled versions (for example, the version included with DEVONthink Pro Office) give you more control over things like OCR accuracy and compression.
  • Acrobat Pro DC: I mentioned this before, but it bears repeating: even though Acrobat XI Pro was a nonstarter for anyone who wanted to automate OCR on newly scanned documents, Acrobat Pro DC makes the process easier than it has ever been before. Even manual OCR requires only one click. Acrobat Pro DC also includes a feature called ClearScan, which creates a custom font for the recognized text that closely approximates the fonts, styles, and layout of the original (though not in a form that can be edited in other apps) and then replaces the bitmap image with one that has a much lower resolution. This can save a tremendous amount of disk space, and for many documents, the fidelity is quite good, but if you plan to print the document again, it may or may not be close enough to the original for your needs. In addition, Acrobat Pro DC is available only by subscription (either alone or as part of Adobe Creative Cloud), and the price may seem unreasonable compared to other options unless you plan to use it for purposes beyond OCR.
  • DEVONthink Pro Office: Primarily a document manager—and a fantastic one, at that—DEVONthink Pro Office lets you categorize, tag, sort, link, and search documents of all kinds with ease. And, not only does it come with an integrated version of ABBYY FineReader (see the description earlier in this list), it also has special hooks that let it receive scans directly and seamlessly from any Fujitsu ScanSnap scanner, with no special configuration required.
  • Evernote Premium: Evernote is the name of an app (available for macOS, Windows, and iOS, among others) and an accompanying cloud-based service that lets you save, search, and share documents of many kinds (from notes to photos to PDF files) on nearly any device. The apps and the basic service are free, but a paid service called Evernote Premium ($69.99 per year) adds several options, including support for searching in PDFs created from scanned images or digital photos. (A lower-priced service called Evernote Plus does not include searching in PDFs.) The OCR conversion itself happens in the cloud, and you can search PDFs only within Evernote; you can’t save searchable PDFs to be used elsewhere.

    Evernote has lots of fans, including many who swear by it for their paperless office use—they upload their scanned files, let Evernote Premium make them searchable, and then automatically have access to them on all their devices. That’s handy, but every time I’ve tried Evernote, I’ve been disappointed. In particular, Evernote’s document management features are largely limited to tagging—you can have at most two levels of hierarchy (that is, “stacks” at the top level containing notebooks, which in turn contain notes), but you can’t create an arbitrary hierarchy with folders inside folders—and searching is far more limited than in, say, DEVONthink. If you prefer tags to folders (see the sidebar Tags vs. Folders), don’t need complex searches, and don’t mind the fact that you can search your PDFs only in Evernote, it might work better for you than for me.

  • Neat: There’s a bit of a story here, which is in fact slightly messy. Until early 2016, Neat bundled standalone software with its scanners. (The software was originally called NeatWorks, and was later renamed to Neat for Mac.) This app, like others of its kind, created searchable PDFs from scanned documents, whether they were scanned using Neat’s own scanners or those of another manufacturer. It also had features designed specifically for receipts and business cards as well as document management capabilities, making it quite versatile all around, with the qualification that it supported only English for OCR.

    However, like so many other companies, Neat has moved away from standalone software to a cloud-based app (also called Neat) that’s available only by subscription. Prices range from $79.99 per year to $249.99 per year. For that price, you get online storage, document management, and searching. But the catch is that the cloud app doesn’t actually create searchable PDFs. It lets you search the documents you upload, and you can export non-searchable PDFs, but you can no longer save searchable PDFs locally. In other words, it’s designed to create reliance on the cloud and thus recurring revenue for Neat. Because the price is so high and there’s no option for searchable PDFs, I would not recommend Neat.

  • Paperless: Somewhat along the lines of Neat but packaged as a conventional, standalone app, Mariner Software’s Paperless is a document management app that performs OCR on scanned documents, with special treatment for receipts and other financial records, which can be decoded and entered into a database, and then exported for use with Quicken or spreadsheet apps if you wish. Like DEVONthink Pro Office, it integrates directly with Fujitsu’s ScanSnap scanners. Unfortunately, Paperless, like Neat, supports OCR only in English.
  • PDFpen and PDFpenPro: When I need to make a quick edit to a PDF (such as superimposing a copy of my handwritten signature; see Sign Documents without Paper), I immediately turn to PDFpen, which makes this sort of activity simple. Among many other features, it offers OCR in 15 languages, solid AppleScript support, and the option to save files in Word format. The Pro version of PDFpen has extra features: it can convert Web pages into multi-page PDFs, create PDF forms, add a table of contents for easier navigation, export in additional formats (such as Excel, PowerPoint, and PDF/A), and even edit the OCR results.
  • Readiris for Mac: Readiris is one of the more advanced OCR programs. It recognizes text in over 130 languages, and it supports mixed character sets in a single sentence. The documentation emphasizes that although Readiris recognizes handprinting, it doesn’t recognize handwriting. Readiris can also save documents in an editable form (such as Microsoft Word format) that preserves the original layout. And it has a proprietary compression system that, according to the developer, produces the smallest possible PDF files.
  • VueScan: VueScan is best known as software that provides support for using old scanners (ones for which the manufacturers no longer release updated drivers). However, it supports many modern scanners too (as I write this, the tally is 3180 models from 35 manufacturers), including most flatbed and document scanner models by Brother, Canon, Epson, Fujitsu, and Visioneer. (Scanners from Doxie and Neat, interestingly, do not appear on the list.) VueScan comes in two editions, Standard and Pro. Both offer a long list of scanning features, but only the Pro version includes OCR. (English OCR support is built in, but you can download free files to add support for 32 additional languages.)

Joe’s OCR Software Recommendations

Of the apps listed in the online appendixes, I have experience with about half. All the Mac OCR tools I’ve tried have had adequate (if not always great) accuracy, but some are easier to use than others.

My preference is for a tool that works more or less invisibly behind the scenes. I like to configure things so that images from my scanner get the OCR treatment without interrupting my work or taking over my screen. A few OCR apps—Presto PageManager, OmniPage, Readiris, and VueScan—are what I think of as “old school,” in that their design assumes you’ll open the app, initiate a single-page scan from within it (typically, on a flatbed scanner), watch the OCR as it progresses, edit the final document, and then save it. There’s nothing wrong with any of that—and all those apps can be used in a much more automated, hands-off way—but I tend to gravitate toward apps with a more modern, minimalist approach.

I’ve been pleased with the results, interface, and flexibility of ABBYY FineReader—especially the versions of it included with Fujitsu’s ScanSnap scanners and DEVONthink Pro Office. It’s reasonably fast, recognizes text in multiple languages without any fuss (for several years I scanned about equal amounts of English and French text), and requires no interaction under normal circumstances.

If you need an uncommon feature, you should go for one of the tools that offers it. Otherwise, if you’re looking for strictly OCR, I’d lean toward ABBYY FineReader, either an embedded version or the stand-alone FineReader Pro. For PDF editing, I’d choose PDFpen over the much-more-expensive Acrobat Pro DC unless you specifically need some of the latter’s unique features. If document management is your focus, I’d go with DEVONthink Pro Office; and if you particularly need to deal with receipts and you don’t already have ScanSnap Receipt, I’d recommend Paperless over Neat.
