Volume 3, Issue 7, July 2014
This work is licensed under a Creative Commons Attribution 4.0 International License.
Automatic Template Extraction from Heterogeneous Web Pages
KUNAL KUMAR KUNDAN, SONALI RANGDALE
Department of Information Technology, Siddhant College of Engineering, Pune, India
Professor, Department of Information Technology, Siddhant College of Engineering, Pune, India
Abstract: This paper describes the process of extracting templates from heterogeneous web pages. The World Wide Web is a major resource for information extraction, and automatically extracting structured information from semi-structured, machine-readable web pages plays an important role today. To improve publishing productivity, some websites populate common templates with content. For machines, however, these templates are considered harmful: the irrelevant terms they contain degrade the performance of web applications and, as a result, of the entire system. Template detection techniques can be used to improve search engine performance as well as the classification of web documents. We present algorithms for extracting templates from a very large number of web pages generated from heterogeneous templates. Using the similarity of template structures in the documents, we cluster the web documents so that the template for each cluster is extracted simultaneously.
Keywords: Template extraction, clustering, minimum description length principle.
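Illustrative sketch: the abstract's idea of clustering pages by the similarity of their template structure, with one template extracted per cluster, might look roughly like the Python below. It is a simplified stand-in, not the paper's algorithm: it assumes each page can be summarised as a set of tag paths, uses Jaccard similarity with an arbitrary threshold in place of the minimum description length criterion, and takes the paths shared by all members of a cluster as that cluster's template. All names (`tag_paths`, `cluster_pages`, `threshold`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are recorded but not nested into.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the page structure as a frozenset of tag paths (assumed representation)."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two path sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy one-pass clustering; each cluster keeps the paths common to its members."""
    clusters = []
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                # The template shrinks to the paths shared by every member.
                cluster["template"] = cluster["template"] & paths
                break
        else:
            clusters.append({"template": paths, "members": [index]})
    return clusters


pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
]
clusters = cluster_pages(pages)
```

Here the first two pages share the same tag paths and fall into one cluster, while the table-based page starts a cluster of its own. A greedy pass with a fixed threshold is order-dependent; it is used only to keep the sketch short.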
How to Cite:
[1] KUNAL KUMAR KUNDAN, SONALI RANGDALE, "Automatic Template Extraction from Heterogeneous Web Pages," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE)
