I am extremely happy to be volunteering my time for a noble cause with @RepublishB. They are republishing old scriptures that would otherwise be lost to time. The idea is to crowd-source the many old scriptures lying around in people's homes and republish them.
I am on a team of great engineers who are working to automate the entire digitisation process, from parsing image files to getting a Unicode representation of the text in those images. We will be using a combination of computer vision and machine learning to do this.
Here is an example: this image is from the Valmiki Ramayana (VR). The block of text at the top (center) contains the actual verses from VR. The text in the columns is the टीकाs (commentary) by a scholar. The goal is to identify these two kinds of region separately before OCRing each of them on its own.
This is the end result. Now that we've identified these sections distinctly, we will proceed to OCR them by cropping out the regions inside the boxes.
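To make the cropping step concrete, here is a minimal sketch in Python. It is not the team's actual code: the `Box` type, the "verse" and "commentary" labels, and the helper names `crop` and `crop_by_label` are all hypothetical, and the page is modelled as a plain grid of grayscale values rather than a real image object. It only illustrates cutting each labelled box out of the page so that each group can be sent to OCR separately.

```python
from typing import Dict, List, NamedTuple

# A page is modelled as rows of grayscale pixel values (an assumption for
# this sketch; a real pipeline would likely use an image library).
Image = List[List[int]]


class Box(NamedTuple):
    """A detected region. The labels are hypothetical, e.g. "verse"."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def crop(page: Image, box: Box) -> Image:
    """Return the pixels inside `box`, clamped to the page bounds."""
    page_height = len(page)
    page_width = len(page[0]) if page else 0
    top = max(0, box.y)
    left = max(0, box.x)
    bottom = min(page_height, box.y + box.height)
    right = min(page_width, box.x + box.width)
    return [row[left:right] for row in page[top:bottom]]


def crop_by_label(page: Image, boxes: List[Box]) -> Dict[str, List[Image]]:
    """Group the cropped regions by label so each group can be OCRed apart."""
    regions: Dict[str, List[Image]] = {}
    for box in boxes:
        regions.setdefault(box.label, []).append(crop(page, box))
    return regions


# Tiny synthetic page: 4 rows by 6 columns, pixel value = row * 10 + column.
page = [[r * 10 + c for c in range(6)] for r in range(4)]
boxes = [
    Box("verse", x=2, y=0, width=2, height=1),
    Box("commentary", x=0, y=1, width=2, height=3),
    Box("commentary", x=4, y=1, width=5, height=3),  # overflows; gets clamped
]
regions = crop_by_label(page, boxes)
```

Keeping the label attached to every box means the verse crops and the commentary crops never get mixed, which is the point of identifying them distinctly before OCR.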

The algorithm is still under development, and it should scale to pages with other arrangements of text.

Thank you @shri_v sir for thinking of me.
You can follow @Paimaamu.