Optical Character Recognition (OCR) technology for the Urdu language holds great significance for preserving and promoting the history and culture of its speakers, just as it does for any other language.
By Tamseel Ahmad
Urdu has been widely used in the Indo-Pak subcontinent since 1854, and the history and culture of this region are deeply embedded in the language. Urdu OCR is one of the latest language-processing technologies, and it can improve the accessibility, searchability, and distribution of Urdu literature and other writings of historical significance. OCR uses computer vision to recognize text in images and then converts it into machine-readable Unicode text. Using this technology, the whole of Urdu literature in the public domain can be converted into Unicode text and uploaded to an online database or open library. This will make it possible for a researcher or any literature geek to find the source of a given sentence or passage.
Having read this, you must be wondering: if all this is possible, why hasn’t it been done before? Here’s the catch: Urdu is a very difficult language for character recognition because of its joined-letter system and its cursive writing style, known as Nastalique. These characteristics lower the accuracy of text recognition for Urdu content. Nonetheless, this should not disappoint you, as there are still some Urdu OCR programs that can recognize Urdu text with good accuracy despite the complexities involved. Moreover, with progress in the fields of Artificial Intelligence and Machine Learning, this accuracy is likely to improve.
Table of Contents
- Google Cloud Vision
- Google Docs / Google Keep / Google Lens
- Akhar 2016
- CLE Nastalique OCR
- Tesseract
- ABBYY FineReader
- Texify Bot
- Bonus: Naseem InPage PDF Text Extractor
- Cloud Vision Tools
Here we review all seven Urdu OCR programs available in 2022, with the most accurate one at the top of the list.
1. Google Cloud Vision
Cloud Vision holds the top place in OCR technology for the Urdu language without serious competition, thanks to its high accuracy. It can recognize and extract handwritten Urdu text to a good extent, and with text composed in Nastalique its accuracy is very high. Cloud Vision is part of Google Cloud Platform (GCP), Google’s cloud computing platform, and its success lies in Google’s high-end AI models and access to large datasets.
Cloud Vision is not available as downloadable software; instead, it can be accessed from the web-based Google Cloud Console or called via its API. Using it can therefore be a little difficult for users who are not tech-savvy, and a bit tedious even for those who are. To solve this problem, some Pakistani developers have built front-end tools for Cloud Vision that can be easily integrated with your Google Account, so that you can handle the whole process from within these tools. A list of these tools is given at the end of this article.
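For readers curious about what calling Cloud Vision via its API involves, here is a minimal sketch of the JSON body that the REST endpoint `images:annotate` expects for a text-recognition request. The function names `build_ocr_request` and `extract_text` are made up for this illustration, and the choice of the `DOCUMENT_TEXT_DETECTION` feature with the language hint `ur` is an assumption about what suits Urdu pages, not something tested for this article. Authentication and the actual HTTP call are left out.

```python
import base64

# Public REST endpoint of the Cloud Vision API for image annotation.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes, language_hint: str = "ur") -> dict:
    """Build the JSON body for an OCR request on a single image.

    Assumption: dense-text detection plus an Urdu language hint is a
    sensible default for scanned Urdu pages.
    """
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded bytes.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "imageContext": {"languageHints": [language_hint]},
            }
        ]
    }


def extract_text(response_body: dict) -> str:
    """Return the recognized text from a parsed API response, or "" if none."""
    responses = response_body.get("responses", [])
    if not responses:
        return ""
    return responses[0].get("fullTextAnnotation", {}).get("text", "")
```

The body returned by `build_ocr_request` would be sent as a POST request to `ANNOTATE_URL` with valid Google Cloud credentials. The front-end tools mentioned above presumably take care of a step like this for you.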
Pricing: Cloud Vision offers free text recognition for 1,000 images/pages every month. Beyond the free tier, each unit of 1,000 images/pages costs $1.50, which is quite reasonable given its accuracy.
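To get a feel for these numbers, the following rough sketch estimates a monthly bill from the two figures quoted above. It assumes that pages beyond the free tier are charged pro rata (a partial block of 1,000 is billed proportionally); actual billing rules may differ, so treat the result as an estimate only. The function name `monthly_cost_usd` is made up for this illustration.

```python
FREE_UNITS_PER_MONTH = 1000       # free tier quoted in this article
PRICE_PER_1000_UNITS_USD = 1.50   # paid rate quoted in this article


def monthly_cost_usd(pages: int) -> float:
    """Estimate the monthly cost, assuming pro-rata charging past the free tier."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    billable = max(0, pages - FREE_UNITS_PER_MONTH)
    return billable * PRICE_PER_1000_UNITS_USD / 1000
```

Under these assumptions, digitizing 5,000 pages in one month would cost about $6.00, since only 4,000 of them are billable.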
2. Google Docs / Google Keep / Google Lens
If your OCR needs are simple and Cloud Vision seems overwhelming, Google has another solution for you. Although they are not primarily OCR software, three Google products can perform text recognition and can be used as Urdu OCR: Google Docs, Google Keep, and Google Lens. Keep and Lens work with images only, while Docs can also recognize text in PDF files, although its accuracy is higher when images are used instead.
It is not known which OCR engine powers these products; most probably it is Cloud Vision. Surprisingly, however, Docs, Keep, and Lens give slightly different results for the same Urdu image. In a small test, Google Keep was the most accurate, followed by Google Docs and then Google Lens, although this may vary with the images used. Overall, results are best when the image is uploaded in PNG format. Here’s how you can use them to extract text from an image:
Google Keep: Upload an image to Google Keep, or capture an image containing text with the Google Keep mobile app. A new note containing the image will be created. Click the three vertical dots (⋮) at the bottom of the note to see its options, and click “Grab Image Text”. The text will be added to the note in Unicode format, from where it can be copied.
Google Docs: Upload the desired image to Google Drive. Once it is uploaded, right-click the image and select Open with > Google Docs. Google Docs will open in a new tab, and the recognized text will appear automatically below the image.
Google Lens: In the Google app on your phone, tap the small camera icon (📷) at the right end of the search bar. Take a photo or select one from the gallery. After Lens processes the image, select the “Text” option from the bottom bar. The text in the image will appear highlighted and can then be selected and copied. On a PC, you can upload the image to Google Photos, where you will have the option to search the image with Google Lens.
Pricing: Google Docs, Google Keep, and Google Lens are available for free.
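If you want to repeat the comparison between Keep, Docs, and Lens on your own images, one common way to put a number on accuracy is the character error rate (CER): the edit distance between the recognized text and a hand-typed correct version, divided by the length of the correct version. The sketch below is only an illustration of that idea. It is not the method used for the small test described above, and the function names are made up.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text (lower is better)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that this counts Unicode code points, so a missing diacritic counts as one error just like a missing letter. Running `character_error_rate(correct_text, ocr_text)` on the output of each tool for the same image gives a simple basis for ranking them.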
3. Akhar 2016
India has an Urdu-speaking population of more than 50 million, which is why work on digitizing the Urdu language is also underway there. In 2016, a research center at Punjabi University, Patiala, developed a word processor named “Akhar”, which supported Gurmukhi, Hindi, Shahmukhi (Urdu script), and English. Notably, it also had an OCR feature for Gurmukhi, English, and Urdu. The OCR feature was later discontinued in Akhar 2021.
Akhar 2016 has very good accuracy with computer-composed text, but it is not very useful for handwritten text, where its accuracy drops considerably. Moreover, for Urdu OCR to work, the image must have a resolution of at least 300 DPI, and the text must not be laid out in multiple columns.
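A quick way to check whether a scan meets the 300 DPI requirement is to divide its pixel dimensions by the physical size of the page in inches. The sketch below does just that. The function names are made up for this illustration, and it assumes you know the size of the original page; it does not read the DPI value stored inside the image file.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical scan resolutions."""
    if page_width_in <= 0 or page_height_in <= 0:
        raise ValueError("page dimensions must be positive")
    return min(width_px / page_width_in, height_px / page_height_in)


def meets_minimum_dpi(width_px: int, height_px: int,
                      page_width_in: float, page_height_in: float,
                      minimum_dpi: float = 300.0) -> bool:
    """Check a scan against a minimum resolution (300 DPI by default)."""
    return effective_dpi(width_px, height_px,
                         page_width_in, page_height_in) >= minimum_dpi
```

For example, a page measuring 8.5 by 11 inches needs to be at least 2550 by 3300 pixels to reach 300 DPI, so `meets_minimum_dpi(2550, 3300, 8.5, 11.0)` is true while a 1275 by 1650 pixel scan of the same page is not enough.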
Pricing: Akhar 2016 can be downloaded for free from its website.
4. CLE Nastalique OCR
The need for an Urdu OCR was recognized in Pakistan as early as 2006. Many computer scientists researched the problem, and many theses were written on it. Practical work started in 2012, when the Centre for Language Engineering (CLE) at KICS, University of Engineering and Technology, Lahore, started a project to develop a proper Urdu OCR program. The project was completed in 2014 with the creation of CLE Nastalique OCR.
CLE Nastalique OCR was not built from scratch; instead, it was developed by modifying the open-source OCR engine Tesseract. CLE OCR comes with many limitations: only text composed in the Noori Nastalique font can be recognized, and the image must be deskewed and have a resolution of at least 300 DPI. Accuracy is fairly good if images meet these high standards. Apparently the software has received no further improvements or updates, and for this reason it is not as useful as it could be.
Pricing: CLE Nastalique OCR can be accessed in different ways. Using CLE NLP web services, text can be recognized from individual images for free. For images in bulk, one can use the CLE Urdu OCR API, which costs PKR 0.5 per image smaller than 2 MB and PKR 0.25 per MB for an image larger than 2 MB. There is also a packaged version of the software, which can be purchased from the CLE website for PKR 15,000 or USD 250.
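The API pricing rule can be turned into a small calculator for planning a bulk job. The sketch below is based only on the two rates quoted above. It assumes that larger images are billed on their exact size in MB without rounding, which may not match how CLE actually bills, and the function names are made up for this illustration.

```python
from typing import Iterable


def cle_api_cost_pkr(size_mb: float) -> float:
    """Estimated API cost in PKR for one image of the given size in MB."""
    if size_mb <= 0:
        raise ValueError("image size must be positive")
    if size_mb <= 2.0:
        return 0.5           # flat rate for images up to 2 MB
    return 0.25 * size_mb    # per-MB rate for larger images (no rounding assumed)


def batch_cost_pkr(sizes_mb: Iterable[float]) -> float:
    """Estimated total cost in PKR for a batch of images."""
    return sum(cle_api_cost_pkr(size) for size in sizes_mb)
```

The source does not say which rate applies at exactly 2 MB, but both rates give PKR 0.5 there, so the choice makes no difference. As an example, a 1 MB image and a 4 MB image together would cost about PKR 1.5.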
5. Tesseract
Tesseract is an open-source OCR engine popular for its accuracy. It was developed by HP in the 1980s and was made open source in 2005. Since 2006, Google has sponsored its further development. Tesseract supports more than 100 languages, including Urdu, and can recognize text in complex layouts.
Out of the box, Tesseract does not recognize Urdu text well enough to be usable. Its strength, however, lies in being open source, which means it can be trained or modified for better text recognition. After proper training, Tesseract has been able to recognize Urdu text. For developers interested in Urdu OCR technology, it can provide a good base for building an Urdu OCR, as CLE Nastalique OCR also uses the Tesseract engine.
Tesseract’s source code is available on GitHub, and its package installer can be downloaded from here. Since it does not have a graphical user interface (GUI), using it can be technically difficult for people other than developers. VietOCR is a popular front-end for Tesseract that can make it easier to use.
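Since Tesseract is driven from the command line, a short script can stand in for the missing GUI. The sketch below builds and runs the command `tesseract <image> stdout -l urd`, where `urd` is the code Tesseract uses for its Urdu language data. It assumes that Tesseract and the Urdu language data are already installed, and the function names are made up for this illustration.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, language: str = "urd") -> List[str]:
    """Build a Tesseract call that prints the recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", language]


def recognize(image_path: str, language: str = "urd") -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract executable was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    return result.stdout.decode("utf-8")
```

Calling `recognize("page.png")` would return the text Tesseract finds in that image. As noted above, the quality of the result for Urdu depends heavily on the trained data being used.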
Pricing: Since Tesseract is open-source, it is available for free.
6. ABBYY FineReader
ABBYY FineReader is PDF editing software that also has an OCR feature. Currently, its OCR does not support the Urdu language. However, some experts have experimented with training it for Urdu text recognition by selecting Arabic as the language. The process was somewhat complicated and the results were not completely accurate, but they were able to recognize handwritten Urdu text fairly well. Its biggest advantage is that the OCR tool can be manually trained by experts to improve its accuracy.
The experiments were conducted on version 8 by Mr. Alvi Amjad (detailed here) and on version 12 by Mr. Zaheer Abbas (detailed here). The latest available version is 15, which also supports the Persian language. It is therefore hoped that ABBYY FineReader may become a solution for Urdu text recognition in the future.
Pricing: ABBYY FineReader v15 can be purchased for the considerable price of €199 (approx. PKR 40,000).
7. Texify Bot / Matnyaar
Texify Bot is a Telegram bot that extracts text from an image sent to it. Its OCR feature supports four languages: Persian, English, Russian, and French. Owing to the similarity between Urdu and Persian, it also gives good results for images containing Urdu text. The software behind this Telegram bot is Matnyaar, a Persian OCR that is also available as an Android app and a web application.
Pricing: Texify Bot can be used for free for a few images, but for a larger number a Matnyaar package must be bought; packages are available at various prices.
Bonus: Naseem InPage PDF Text Extractor
Despite not being Urdu OCR software, Naseem InPage PDF Text Extractor earns its place in this article because of its unique functionality. As the name suggests, it can extract Urdu text directly from a PDF, provided the PDF was made using InPage. Nowadays MS Word is mostly used for composing Urdu documents, and this software won’t work for those. However, InPage was very popular in the past and is still used for many purposes, especially for composing books. If you have a PDF exported from InPage, this software can extract all the text within minutes and save you a lot of headache.
Naseem PDF Text Extractor v1 had a serious limitation: it could not convert multi-page PDF documents. This limitation was removed in version 2, which is named Nadeem PDF Texter v2 (it looks like a typo, but that is the actual name).
Pricing: This software is available for free.
Cloud Vision Tools
Having read about all the available options, you must be thinking about giving Cloud Vision a try because of its top ranking. As mentioned above, using Cloud Vision from the GCP console can be complicated and time-consuming. Some front-end tools can make this task easier. I have not personally tested all of them, but I mention their sources and download links here so that interested readers can test them on their own.
- Image2Text by Jasim Muhammad
- Urdu Kaatib by Muhammad Umar
- Reekhta Downloader by Saroosh
- Reekhta Downloader by Falsafi
Currently, the options for Urdu OCR technology are limited, and there is a dire need to expand the range of such services as well as to improve the technologies currently available. Public and private institutes working on the promotion of the Urdu language can play their part in this field, which will prove fruitful for generations to come.