Optical Character Recognition (OCR) is the process of converting printed text into a digital representation. It has a variety of practical applications, from digitizing printed books and creating electronic records of receipts to license plate recognition and even cracking image-based CAPTCHAs.
Tesseract is an open source OCR engine. It runs on *nix systems, Mac OS X, and Windows, and with the help of a wrapper library we can use it from PHP projects. This tutorial shows you how.
Install
Preparation
To keep things simple and consistent, we will run the application in a virtual machine (this article uses Vagrant) with PHP and Nginx installed. We will install Tesseract separately to demonstrate the process. If you want to install Tesseract yourself on an existing Debian-based system, you can skip the next section, or check out the README for installation instructions on other *nix systems, Mac OS X, or Windows.
Configuring Vagrant
To configure Vagrant for this tutorial, complete the following steps. Alternatively, you can simply get the code from GitHub.
Enter the following command to download the Homestead Improved Vagrant configuration into a folder named ocr:
git clone https://github.com/Swader/homestead_improved ocr
In the configuration file Homestead.yml, change the Nginx site mapping from:
<ol class="dp-c"><li class="alt"><span><span>sites: </span></span></li><li><span> - map: homestead.app </span></li><li class="alt"><span> to: /home/vagrant/Code/Project/<span class="keyword">public</span><span> </span></span></li></ol>
to:
<ol class="dp-c"><li class="alt"><span><span>sites: </span></span></li><li><span> - map: homestead.app </span></li><li class="alt"><span> to: /home/vagrant/Code/<span class="keyword">public</span><span> </span></span></li></ol>
Also add the following line to your hosts file:
<ol class="dp-c"><li class="alt"><span><span>192.168.10.10 homestead.app </span></span></li></ol>
Install Tesseract
The next step is to install Tesseract.
Because Homestead Improved is Debian-based, we can install it with apt-get after logging into the virtual machine using vagrant ssh. Simply run the following command:
<ol class="dp-c"><li class="alt"><span><span>sudo apt-get install tesseract-ocr </span></span></li></ol>
As mentioned above, there are other operating system-specific tutorials in the README.
Test and customize the installation
We will use the PHP wrapper, but before that we can test Tesseract from the command line.
First, save this image as sign.png.
In the virtual machine, run the following command to read the text from the image:
<ol class="dp-c"><li class="alt"><span><span>tesseract sign.png out </span></span></li></ol>
This will create a file named out.txt in the current folder, which should contain the word CAUTION.
Now try it with sign2.jpg:
<ol class="dp-c"><li class="alt"><span><span>tesseract sign2.jpg out </span></span></li></ol>
This time it produces the word Einbahnstral’ie. Close, but not correct: although the text in the image is quite clear, Tesseract fails to recognize the character ß.
For Tesseract to read the string properly, we need to install additional language files, in this case German.
There is a comprehensive list of available language files here, but let's just download the one we need:
<ol class="dp-j"><li class="alt"><span><span>wget https:</span><span class="comment">//tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.deu.tar.gz</span><span> </span></span></li></ol>
Unzip:
<ol class="dp-c"><li class="alt"><span><span>tar zxvf tesseract-ocr-3.02.deu.tar.gz </span></span></li></ol>
Then copy the extracted language files to the following directory:
<ol class="dp-c"><li class="alt"><span><span>/usr/share/tesseract-ocr/tessdata </span></span></li></ol>
For example (sudo is needed because the directory is owned by root):
<ol class="dp-c"><li class="alt"><span><span>sudo cp deu-frak.traineddata /usr/share/tesseract-ocr/tessdata </span></span></li><li><span>sudo cp deu.traineddata /usr/share/tesseract-ocr/tessdata </span></li></ol>
Now run the original command again, this time with the -l flag:
<ol class="dp-j"><li class="alt"><span><span>tesseract sign2.jpg out -l deu </span></span></li></ol>
Here, "deu" is the ISO 639-3 code for German.
This time, the text should be correct: Einbahnstraße.
Any language can be used by repeating the above process.
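Before bringing in a wrapper library, it can help to see how the same command line might be assembled from PHP. The following is only a sketch under stated assumptions: `buildTesseractCommand` is a hypothetical helper name (it is not part of Tesseract or of any library), the code assumes PHP 7.1 or later, and it only builds the command string without running it.

```php
<?php
// Hypothetical helper: builds the "tesseract <image> <output> [-l <lang>]"
// command shown above. It does not execute anything.
function buildTesseractCommand(string $image, string $outputBase, ?string $lang = null): string
{
    // escapeshellarg() quotes each argument so unusual filenames cannot break the command
    $command = sprintf('tesseract %s %s', escapeshellarg($image), escapeshellarg($outputBase));

    // Only add the language flag when a language code (e.g. "deu") was given
    if ($lang !== null) {
        $command .= ' -l ' . escapeshellarg($lang);
    }

    return $command;
}

echo buildTesseractCommand('sign2.jpg', 'out', 'deu'), PHP_EOL;
```

The exact quoting in the printed command depends on the platform, which is why no expected output is shown here.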
Setting Up the Application
We will use this wrapper library to call Tesseract from PHP.
We will build a minimalist web application in which users upload an image and view the OCR results. We will use the Silex microframework for this. Don't worry if you're not familiar with it; the app itself is simple.
Remember that all the code for this tutorial is available on GitHub.
The first step is to install the dependencies with Composer:
<ol class="dp-c"><li class="alt"><span><span>composer </span><span class="keyword">require</span><span> silex/silex twig/twig thiagoalessio/tesseract_ocr:dev-master </span></span></li></ol>
Then create three folders:
<ol class="dp-c"><li class="alt"><span><span>- </span><span class="keyword">public</span><span> </span></span></li><li><span>- uploads </span></li><li class="alt"><span>- views </span></li></ol>
We need an upload form (views/index.twig):
<ol class="dp-c"><li class="alt"><span><span>&lt;html&gt; </span></span></li><li><span> &lt;head&gt; </span></li><li class="alt"><span> &lt;title&gt;OCR&lt;/title&gt; </span></li><li><span> &lt;/head&gt; </span></li><li class="alt"><span> &lt;body&gt; </span></li><li><span> </span></li><li class="alt"><span> &lt;form action=<span class="string">""</span><span> method=</span><span class="string">"post"</span><span> enctype=</span><span class="string">"multipart/form-data"</span><span>&gt; </span></span></li><li><span> &lt;input type=<span class="string">"file"</span><span> name=</span><span class="string">"upload"</span><span>&gt; </span></span></li><li class="alt"><span> &lt;input type=<span class="string">"submit"</span><span>&gt; </span></span></li><li><span> &lt;/form&gt; </span></li><li class="alt"><span> </span></li><li><span> &lt;/body&gt; </span></li><li class="alt"><span>&lt;/html&gt; </span></li></ol>
We also need a page to display the results (views/results.twig):
<ol class="dp-c"><li class="alt"><span><span>&lt;html&gt; </span></span></li><li><span> &lt;head&gt; </span></li><li class="alt"><span> &lt;title&gt;OCR&lt;/title&gt; </span></li><li><span> &lt;/head&gt; </span></li><li class="alt"><span> &lt;body&gt; </span></li><li><span> </span></li><li class="alt"><span> &lt;h2&gt;Results&lt;/h2&gt; </span></li><li><span> </span></li><li class="alt"><span> &lt;textarea cols=<span class="string">"50"</span><span> rows=</span><span class="string">"10"</span><span>&gt;{{ text }}&lt;/textarea&gt; </span></span></li><li><span> </span></li><li class="alt"><span> &lt;hr&gt; </span></li><li><span> </span></li><li class="alt"><span> &lt;a href=<span class="string">"/"</span><span>&gt;← Go back&lt;/a&gt; </span></span></li><li><span> </span></li><li class="alt"><span> &lt;/body&gt; </span></li><li><span>&lt;/html&gt; </span></li></ol>
Now create the skeleton Silex app (public/index.php):
<ol class="dp-c"><li class="alt"><span><span>&lt;?php </span></span></li><li><span> </span></li><li class="alt"><span><span class="keyword">require</span><span> __DIR__.</span><span class="string">'/../vendor/autoload.php'</span><span>; </span></span></li><li><span> </span></li><li class="alt"><span><span class="keyword">use</span><span> Symfony\Component\HttpFoundation\Request; </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span> = </span><span class="keyword">new</span><span> Silex\Application(); </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->register(</span><span class="keyword">new</span><span> Silex\Provider\TwigServiceProvider(), [ </span></span></li><li><span> <span class="string">'twig.path'</span><span> => __DIR__.</span><span class="string">'/../views'</span><span>, </span></span></li><li class="alt"><span>]); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>[</span><span class="string">'debug'</span><span>] = true; </span></span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->get(</span><span class="string">'/'</span><span>, </span><span class="keyword">function</span><span>() </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> </span><span class="vars">$app</span><span>[</span><span class="string">'twig'</span><span>]->render(</span><span class="string">'index.twig'</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span>}); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->post(</span><span class="string">'/'</span><span>, </span><span class="keyword">function</span><span>(Request </span><span class="vars">$request</span><span>) </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// TODO</span><span> </span></span></li><li><span> </span></li><li class="alt"><span>}); </span></li><li><span> </span></li><li class="alt"><span><span class="vars">$app</span><span>->run(); </span></span></li></ol>
If you access the app in a browser, you should see a file upload form. If you are using Homestead Improved Vagrant, you can access the application through the link below.
<ol class="dp-c"><li class="alt"><span><span>http:</span><span class="comment">//homestead.app/</span><span> </span></span></li></ol>
The next step is to implement the file upload. Silex makes this very simple: $request contains a files component through which we can obtain any uploaded file:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Grab the uploaded file</span><span> </span></span></li><li><span><span class="vars">$file</span><span> = </span><span class="vars">$request</span><span>->files->get(</span><span class="string">'upload'</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span><span class="comment">// Extract some information about the uploaded file</span><span> </span></span></li><li class="alt"><span><span class="vars">$info</span><span> = </span><span class="keyword">new</span><span> SplFileInfo(</span><span class="vars">$file</span><span>->getClientOriginalName()); </span></span></li><li><span> </span></li><li class="alt"><span><span class="comment">// Create a quasi-random filename</span><span> </span></span></li><li><span><span class="vars">$filename</span><span> = sprintf(</span><span class="string">'%d.%s'</span><span>, time(), </span><span class="vars">$info</span><span>->getExtension()); </span></span></li><li class="alt"><span> </span></li><li><span><span class="comment">// Copy the file</span><span> </span></span></li><li class="alt"><span><span class="vars">$file</span><span>->move(__DIR__.</span><span class="string">'/../uploads'</span><span>, </span><span class="vars">$filename</span><span>); </span></span></li></ol>
As you can see, we generate a quasi-random, timestamp-based filename to reduce the chance of filename conflicts, although in this application it doesn't really matter what we call the file. Once we have a local copy of the file, we can create an instance of the Tesseract library and analyze it:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Instantiate the Tesseract library</span><span> </span></span></li><li><span><span class="vars">$tesseract</span><span> = </span><span class="keyword">new</span><span> TesseractOCR(__DIR__ . </span><span class="string">'/../uploads/'</span><span> . </span><span class="vars">$filename</span><span>); </span></span></li></ol>
Performing OCR on the image is quite simple; we only need to call the recognize() method:
<ol class="dp-c"><li class="alt"><span><span class="comment">// Perform OCR on the uploaded image</span><span> </span></span></li><li><span><span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li></ol>
Finally we display the results on the results page:
<ol class="dp-c"><li class="alt"><span><span class="keyword">return</span><span> </span><span class="vars">$app</span><span>[</span><span class="string">'twig'</span><span>]->render( </span></span></li><li><span> <span class="string">'results.twig'</span><span>, </span></span></li><li class="alt"><span> [ </span></li><li><span> <span class="string">'text'</span><span> => </span><span class="vars">$text</span><span>, </span></span></li><li class="alt"><span> ] </span></li><li><span>); </span></li></ol>
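One caveat about the upload handler: time() has a resolution of one second, so two uploads with the same extension arriving in the same second would get the same filename. If that matters in your setting, a random name is one possible alternative. The sketch below assumes PHP 7.1 or later (for random_bytes() and typed signatures), and `makeUploadFilename` is a hypothetical helper name, not part of Silex or Symfony.

```php
<?php
// Hypothetical helper: returns a hard-to-guess filename that keeps the
// lower-cased extension of the original upload, if it had one.
function makeUploadFilename(string $originalName): string
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));

    // 8 random bytes become 16 hexadecimal characters
    $basename = bin2hex(random_bytes(8));

    return $extension === '' ? $basename : $basename . '.' . $extension;
}

// Prints something like "3f9c0a1b2d4e5f60.png"; the hex part differs on every call
echo makeUploadFilename('Sign.PNG'), PHP_EOL;
```

In the handler, the result could be passed to $file->move() in place of the sprintf() call.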
Try it on some images and see how it works. If you have difficulties, you can refer to this
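A Practical Example
Let's look at a more practical use of OCR. In this example, we will try to find a formatted telephone number in an image.
Take a look at the following image and upload it to your application:
The result should look like this: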
<ol class="dp-c"><li class="alt"><span><span>:ii‘i </span></span></li><li><span>Customer Service Helplines </span></li><li class="alt"><span> </span></li><li><span>British Airways Helpline </span></li><li class="alt"><span> </span></li><li><span>09040 490 541 </span></li></ol>
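It has not picked out the body text, which is to be expected because the image quality is so poor. It has recognized the number, but there is also some "noise".
There are several things we can do to extract the relevant information.
You can tell Tesseract to restrict its results to a certain set of characters, so we could tell it to return only digits, like this: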
<ol class="dp-c"><li class="alt"><span><span class="vars">$tesseract</span><span>->setWhitelist(range(0,9)); </span></span></li></ol>
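But there is a problem with this. Rather than ignoring non-numeric characters, Tesseract often interprets them as digits. For example, "Bob" might be interpreted as the number "808".
So we will use a two-step process instead.
First, we can use a basic regular expression to pick out strings of digits. Then we can use Google's phone number library, libphonenumber, to determine whether a string of digits is a valid telephone number.
Note: I have written about libphonenumber on SitePoint before.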
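To make the first step concrete, here is a small sketch of the candidate extraction on its own, without any validation. It uses the same regular expression as the full function in this article. `findCandidateNumbers` is a hypothetical helper name, and the sample string is a cleaned-up version of the OCR output shown earlier.

```php
<?php
// Hypothetical helper: returns the trimmed substrings that look like they
// could be telephone numbers. It does NOT check that they are valid.
function findCandidateNumbers(string $text): array
{
    // Optional "+digits" prefix, optional "(digits)" group, then one or more
    // runs of digits separated by whitespace or hyphens
    preg_match_all('/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/', $text, $matches);

    // $matches[0] holds the full-pattern matches, which may carry leading whitespace
    return array_map('trim', $matches[0]);
}

$ocrOutput = "Customer Service Helplines\n\nBritish Airways Helpline\n\n09040 490 541\n";

print_r(findCandidateNumbers($ocrOutput)); // a single candidate: "09040 490 541"
```

Because the pattern allows leading whitespace, each match is trimmed before use. The second step then decides which candidates are real telephone numbers.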
Let's add a PHP port of libphonenumber. Edit composer.json and add:
<ol class="dp-c"><li class="alt"><span><span class="string">"giggsey/libphonenumber-for-php"</span><span>: </span><span class="string">"~7.0"</span><span> </span></span></li></ol>
Don't forget to update:
<ol class="dp-c"><li class="alt"><span><span>composer update </span></span></li></ol>
Now we can write a function that takes a string and tries to extract a valid telephone number from it:
<ol class="dp-c"><li class="alt"><span><span class="comment">/**</span> </span></li><li><span><span class="comment">* Parse a string, trying to find a valid telephone number. As soon as it finds a</span> </span></li><li class="alt"><span><span class="comment">* valid number, it'll return it in E.164 format. If it can't find any, it'll</span> </span></li><li><span><span class="comment">* simply return NULL.</span> </span></li><li class="alt"><span><span class="comment">*</span> </span></li><li><span><span class="comment">* @param string $text The string to parse</span> </span></li><li class="alt"><span><span class="comment">* @param string $country_code The two-letter country code to use as a "hint"</span> </span></li><li><span><span class="comment">* @return string | NULL</span> </span></li><li class="alt"><span><span class="comment">*/</span><span> </span></span></li><li><span><span class="keyword">function</span><span> findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="vars">$country_code</span><span> = </span><span class="string">'GB'</span><span>) { </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Get an instance of Google's libphonenumber</span><span> </span></span></li><li class="alt"><span> <span class="vars">$phoneUtil</span><span> = \libphonenumber\PhoneNumberUtil::getInstance(); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Use a simple regular expression to try and find candidate phone numbers</span><span> </span></span></li><li><span> preg_match_all(<span class="string">'/(\+\d+)?\s*(\(\d+\))?([\s-]?\d+)+/'</span><span>, </span><span class="vars">$text</span><span>, </span><span class="vars">$matches</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Iterate through the matches</span><span> </span></span></li><li class="alt"><span> <span class="keyword">foreach</span><span> (</span><span class="vars">$matches</span><span> </span><span class="keyword">as</span><span> </span><span class="vars">$match</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">foreach</span><span> (</span><span class="vars">$match</span><span> </span><span class="keyword">as</span><span> </span><span class="vars">$value</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> try { </span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Attempt to parse the number</span><span> </span></span></li><li><span> <span class="vars">$number</span><span> = </span><span class="vars">$phoneUtil</span><span>->parse(trim(</span><span class="vars">$value</span><span>), </span><span class="vars">$country_code</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Just because we parsed it successfully, doesn't make it valid - so check it</span><span> </span></span></li><li class="alt"><span> <span class="keyword">if</span><span> (</span><span class="vars">$phoneUtil</span><span>->isValidNumber(</span><span class="vars">$number</span><span>)) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// We've found a telephone number. Format using E.164, and exit</span><span> </span></span></li><li><span> <span class="keyword">return</span><span> </span><span class="vars">$phoneUtil</span><span>->format(</span><span class="vars">$number</span><span>, \libphonenumber\PhoneNumberFormat::E164); </span></span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> </span></li><li><span> } catch (\libphonenumber\NumberParseException <span class="vars">$e</span><span>) { </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Ignore silently; getting here simply means we found something that isn't a phone number</span><span> </span></span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> </span></li><li><span> } </span></li><li class="alt"><span> } </span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> null; </span></span></li><li><span> </span></li><li class="alt"><span>} </span></li></ol>
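Hopefully the comments explain what the function is doing. Note that if the library fails to parse a valid telephone number from the string, it throws an exception. This is not a problem; we simply ignore it and move on to the next candidate.
If we find a telephone number, we return it in E.164 format. This gives us an internationalized number that we can use to place a call or send an SMS.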
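For readers unfamiliar with E.164: the format is a plus sign followed by the country code and the national number, with no spaces or punctuation. The sketch below shows that transformation by hand for a UK number, purely as an illustration. `ukNationalToE164` is a hypothetical and deliberately naive helper that performs no validation; in the application itself, libphonenumber's format() does this work.

```php
<?php
// Hypothetical, deliberately naive helper: converts a UK national number to
// E.164 by hand. It assumes its input really is a UK number.
function ukNationalToE164(string $national): string
{
    // Keep digits only: "09040 490 541" becomes "09040490541"
    $digits = preg_replace('/\D+/', '', $national);

    // Drop the single leading trunk prefix "0", then prepend "+" and country code 44
    return '+44' . preg_replace('/^0/', '', $digits);
}

echo ukNationalToE164('09040 490 541'), PHP_EOL; // +449040490541
```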
Now we can use it as follows:
<ol class="dp-c"><li class="alt"><span><span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li><li><span><span class="vars">$number</span><span> = findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="string">'GB'</span><span>); </span></span></li></ol>
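We need to give libphonenumber a hint about which country the number belongs to. You can change this to your own country.
Let's wrap all of this up in a new route: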
<ol class="dp-c"><li class="alt"><span><span class="vars">$app</span><span>->post(</span><span class="string">'/identify-telephone-number'</span><span>, </span><span class="keyword">function</span><span>(Request </span><span class="vars">$request</span><span>) </span><span class="keyword">use</span><span> (</span><span class="vars">$app</span><span>) { </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Grab the uploaded file</span><span> </span></span></li><li><span> <span class="vars">$file</span><span> = </span><span class="vars">$request</span><span>->files->get(</span><span class="string">'upload'</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Extract some information about the uploaded file</span><span> </span></span></li><li class="alt"><span> <span class="vars">$info</span><span> = </span><span class="keyword">new</span><span> SplFileInfo(</span><span class="vars">$file</span><span>->getClientOriginalName()); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Create a quasi-random filename</span><span> </span></span></li><li><span> <span class="vars">$filename</span><span> = sprintf(</span><span class="string">'%d.%s'</span><span>, time(), </span><span class="vars">$info</span><span>->getExtension()); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Copy the file</span><span> </span></span></li><li class="alt"><span> <span class="vars">$file</span><span>->move(__DIR__.</span><span class="string">'/../uploads'</span><span>, </span><span class="vars">$filename</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="comment">// Instantiate the Tesseract library</span><span> </span></span></li><li><span> <span class="vars">$tesseract</span><span> = </span><span class="keyword">new</span><span> TesseractOCR(__DIR__ . </span><span class="string">'/../uploads/'</span><span> . </span><span class="vars">$filename</span><span>); </span></span></li><li class="alt"><span> </span></li><li><span> <span class="comment">// Perform OCR on the uploaded image</span><span> </span></span></li><li class="alt"><span> <span class="vars">$text</span><span> = </span><span class="vars">$tesseract</span><span>->recognize(); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="vars">$number</span><span> = findPhoneNumber(</span><span class="vars">$text</span><span>, </span><span class="string">'GB'</span><span>); </span></span></li><li><span> </span></li><li class="alt"><span> <span class="keyword">return</span><span> </span><span class="vars">$app</span><span>->json( </span></span></li><li><span> [ </span></li><li class="alt"><span> <span class="string">'number'</span><span> => </span><span class="vars">$number</span><span>, </span></span></li><li><span> ] </span></li><li class="alt"><span> ); </span></li><li><span> </span></li><li class="alt"><span>}); </span></li></ol>
We now have the basis of a simple API, hence the JSON response, which we could use as the back end of a simple mobile app that adds contacts or places calls from a photo.
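A client of this route has to handle two cases: a number was found, or findPhoneNumber() returned null. The sketch below approximates the response body in each case using plain json_encode(). `buildNumberPayload` is a hypothetical helper name, PHP 7.1 or later is assumed, and the real response is produced by $app->json(), whose encoding options may differ slightly.

```php
<?php
// Hypothetical helper mirroring the array passed to $app->json() in the route above.
function buildNumberPayload(?string $number): string
{
    return json_encode(['number' => $number]);
}

echo buildNumberPayload('+449040490541'), PHP_EOL; // {"number":"+449040490541"}
echo buildNumberPayload(null), PHP_EOL;            // {"number":null}
```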
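Summary
OCR has many applications, and it is easier to integrate into your own applications than you might expect. In this article, we installed an open source OCR package and, using a wrapper library, integrated it into a very simple PHP application. We have only scratched the surface of what is possible, but hopefully this has given you some ideas for using OCR in your own applications.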
Translation link: http://www.codeceo.com/article/php-ocr-tesseract-get-text.html
Original English article: OCR in PHP: Read Text from Images with Tesseract