Abstract: Pattern identification in text refers to the identification of repeating text within a set of sentences. Pattern discovery is the automatic detection of regularities in data through the use of computer algorithms. Research on identifying such patterns remains limited. The input to the system is first gathered and then cleaned to remove noisy elements. After cleaning, the similarity between the elements in the data is computed. Similar elements are grouped into segments, and these segments are analyzed to check whether they contain repeating elements. The repeating elements extracted from these segments form the resulting pattern. Detecting patterns for any real-world entity, whether in text or in any other source, is a difficult task for humans as well as for machines. Manual detection is time-consuming, and human supervision cannot cope with large quantities of data, in which the number of patterns may be arbitrarily large. Automatic identification of such repeating text has therefore become an urgent need. The context of the text accompanying repeating sentences is very useful for identifying patterns. In this work, pattern identification in text is addressed at the semantic level by using an ontology. After similar sentences are identified, a sequence-to-sequence (Seq2seq) model is developed to identify the patterns present in the set of sentences given as input to the system.
Keywords: Pattern, Seq2seq, Ontology, Domain-Specific Words.
DOI: 10.17148/IJARCCE.2020.9642