Abstract: With the recent rapid growth in the number of digitised historical documents, efficient methods for information extraction and retrieval are vital for making the data accessible. Such methods rely on optical character recognition (OCR), which converts document images into textual representations. Current OCR techniques are frequently not adapted to the historical domain. Additionally, they typically need a substantial volume of annotated documents. This paper therefore presents several methods that enable OCR on historical documents.
These methods rely on only a limited amount of real, manually annotated training data. The complete OCR system has two primary functions: page structure analysis, which comprises text block and line segmentation, and OCR itself. While the OCR approach is based on a convolutional neural network, our segmentation method uses a recurrent neural network. Both approaches are state of the art in their respective fields. We also developed a novel dataset of real data for OCR for the Protonium Portal.
This corpus is openly available, and all proposed methods are evaluated on it. We show that both the segmentation and OCR tasks can be carried out with only a few real annotated data samples. The experiments aim to determine how best to achieve satisfactory performance when the available data are limited. We also demonstrate that the scores we obtained are on par with or better than those of several state-of-the-art systems. The study's findings show how to build an effective OCR engine for historical documents even without substantial training data.

PDF | DOI: 10.17148/IJARCCE.2022.11709
