In today’s post we’ll turn a scan into a searchable PDF. We will start off with ordinary document scans and turn them into a sandwich PDF: we will optimize the image files, combine them and write them to a single PDF file that allows text search. We will make use of some advanced Google technology, so let’s get started…
Today you can choose between a video tutorial and a written article. It’s up to you to choose! :D
Make scans searchable step by step:
Install Tesseract OCR and ImageMagick
ImageMagick is a command-line tool for image processing. It comes pre-installed on most Linux systems. To check whether, and which version of, ImageMagick is installed on your system, type the following:
convert -version
If necessary, install ImageMagick from your distribution’s repositories. You can use programs like Synaptic to do a quick search for the needed packages.
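On Debian-based distributions like Ubuntu, for example, it boils down to a single command (the package name may differ on other distributions):

sudo apt-get install imagemagick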
Tesseract OCR is a command-line tool for optical character recognition. Google maintains the project, and over time it has become the standard among open-source OCR tools. The easiest way to install Tesseract is through a package manager. If you want to compile Tesseract yourself, check out this link.
It’s as easy as this.
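On Debian-based systems, for instance, it could look like the following; note that the language data (here tesseract-ocr-deu for German) comes in separate packages:

sudo apt-get install tesseract-ocr tesseract-ocr-deu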
Create some scans
First you need some scans. In most cases you’ll create them on your own. Applications like Simple Scan make scanning really easy, but you can use any other application or tool you like. Save the output to an image format like .jpg. You could also try writing your individual scans to a single .tiff file.
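If you prefer the command line, SANE’s scanimage tool is an alternative. This is just a sketch; the available resolutions and formats depend on your scanner backend:

scanimage --format=tiff --resolution 300 > page1.tiff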
Improve scans for OCR
When creating scans you often run into issues like white backgrounds that tend to look grey, or black letters that have fuzzy, colored edges. Furthermore, you often have to deal with huge files that take a long time to load and are not very comfortable to view. Viewing the files themselves can also be a problem, as letters can “fade out” depending on the colors used.
Useful commands to improve scans
In the following table I’ve compiled a number of commands that are useful for improving your scans. You can find some of them used in my video.
| Option | Effect |
| --- | --- |
| -quiet | Suppresses all warnings. |
| -normalize | Stretches the image to span the full range of colors. This means the final image will contain pure white and pure black. |
| -gamma 0.8,0.8,0.8 | Turns down the mid-tones of letters (when set to a value smaller than one). Otherwise letters could become too thin. |
| -monochrome | Creates a 1-bit image. Reduces the color depth to black and white only and, as a result, the file size. |
| -contrast | Enhances the image contrast (+contrast reduces it). |
| -background color | Sets the background color. Useful with images that have transparent areas. |
| +dither -posterize 3 | Reduces the number of levels in every color channel to 3, without dithering. I recommend values between 2 and 4. The background becomes uniform. |
| -level value | Adjusts the black and white points, i.e. the image contrast. |
| -sharpen | Sharpens the image. |
| -depth value | Sets the bit depth of the image. A depth of 8 means each color channel can hold 256 (2^8) values. |
Here’s an example:
I used level, normalize and monochrome to change the picture below into a clear document with a small file size.
You can find the example source image here. The command used is pretty simple:
convert -quiet -level 0%,77% -normalize -monochrome *.jpg combo.tiff
“convert” will run ImageMagick and “-quiet” will suppress all warnings. I then tell ImageMagick to change the level settings, normalize the image and return a black-and-white-only image (see the table above). By using the *.jpg wildcard I tell ImageMagick to use all files of type .jpg in my current directory and write them to a multi-page .tiff file. Of course you could specify names for your input images instead of using a wildcard, as shown below. TIFF files can grow to a size of 4 GB and can be viewed with default applications like Evince.
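With explicit file names (these are just placeholders), the same command could look like this:

convert -quiet -level 0%,77% -normalize -monochrome page1.jpg page2.jpg combo.tiff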
Advanced commands to improve scans
What if you wanted to get off, I mean really get off, the ground with your character-recognized PDFs? What you need is textcleaner. It’s a pretty advanced script for ImageMagick that lets you set different options and is optimized for handling scanned documents. Please also check out Fred’s other scripts. Personally I like his textcleaner and tiltshift scripts best.
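Just to give you an idea, a textcleaner call might look like the following. The options are documented on Fred’s page and the values here are merely a starting point (-g converts to grayscale, -e normalize enhances the contrast, -f and -o control the background thresholding):

./textcleaner -g -e normalize -f 25 -o 10 input.jpg output.jpg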
In some cases a curves-like approach can help as well. This Stack Overflow thread pretty much describes what it is about. You can basically use curves to tune the colors of your image.
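ImageMagick has no curves dialog, but -sigmoidal-contrast applies a comparable S-shaped curve. As a rough sketch (the values are only a starting point):

convert scan.jpg -sigmoidal-contrast 5x50% scan-curved.jpg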
Running tesseract ocr
Running tesseract couldn’t be easier. All you do is navigate to your directory, call tesseract, set the language flag, set the input and output files and wait. Depending on the size of your documents, you could clean your house in the meantime or take a sip of cold ginger beer. A simple command would look like this:
tesseract -l deu combo.tiff output pdf
The language flag is set to German. If you want to learn about other supported languages, go to this really helpful webpage. There are more advanced options, but I decided to drop them for simplicity’s sake. If you want to go crazy with tesseract, visit their GitHub repository.
Admire your result
You should give admiring your result top priority. I am using Evince again, which is a standard GNOME application. This step is really important and you should definitely do it before processing a multi-page document. Check whether you are happy with the recognition quality. It might help to write the extracted text to a single text file instead of a sandwich PDF.
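Simply leave out the pdf argument at the end and tesseract will write plain text (output.txt) instead:

tesseract -l deu combo.tiff output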
If you realize that you are unhappy with what your display shows, try playing around with the image quality settings. You might have to change the resolution or the quality, or rethink the order of the options, as a different order can lead to totally different results. You could also consider training tesseract. Cleaning the pizza-finger-covered scanning bed can help as well, trust me ;-)
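If the resolution is the culprit, you could, for example, let ImageMagick record a higher density in the image metadata so tesseract knows what it is dealing with. 300 DPI is a common recommendation for OCR, but treat it as a starting point:

convert -quiet -units PixelsPerInch -density 300 *.jpg combo.tiff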
Et voilà.
Do you have any tips or questions? Then just leave a comment!
