International Journal of Advanced Research in Computer and Communication Engineering
A monthly peer-reviewed and refereed journal
ISSN Online 2278-1021 | ISSN Print 2319-5940 | Since 2012
Volume 3, Issue 7, July 2014

Automatic Template Extraction from Heterogeneous Web Pages

KUNAL KUMAR KUNDAN, Department of Information Technology, Siddhant College of Engineering, Pune, India
SONALI RANGDALE, Professor, Department of Information Technology, Siddhant College of Engineering, Pune, India

Abstract: This paper describes a process for extracting templates from heterogeneous web pages. Automatically extracting structured information from semi-structured, machine-readable web pages has become increasingly important because the World Wide Web is a major source of information, and many websites populate common templates with content to improve publishing productivity. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the performance of web applications. Template detection techniques can therefore improve the performance of search engines as well as the classification of web documents. We present algorithms for extracting templates from a large number of web pages generated from heterogeneous templates. The web documents are clustered by the similarity of their template structures, so that the template for each cluster is extracted simultaneously.

Keywords: Template extraction, clustering, minimum description length principle.
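
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the keywords indicate a minimum description length criterion, whereas this sketch swaps in a simple greedy grouping with a Jaccard similarity threshold. It assumes that each page is represented as the set of its root-to-node paths and that a cluster's template is the set of paths shared by all of its pages. The names `PathCollector`, `page_paths`, `cluster_pages`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node paths (tags and text) in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.add("/".join(self.stack) + "/#" + text)


def page_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.5):
    """Greedy grouping (an assumption, not MDL): a page joins the first
    cluster whose current template is similar enough to its path set."""
    clusters = []
    for index, html in enumerate(pages):
        paths = page_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                # The template keeps only paths shared by every member.
                cluster["template"] &= paths
                break
        else:
            clusters.append({"template": set(paths), "members": [index]})
    return clusters


# Made-up sample pages from two different hypothetical templates.
PAGES = [
    "<html><body><div><h1>Shop</h1><p>Red shoes</p></div></body></html>",
    "<html><body><table><tr><td>News</td><td>Rain today</td></tr></table></body></html>",
    "<html><body><div><h1>Shop</h1><p>Blue hat</p></div></body></html>",
    "<html><body><table><tr><td>News</td><td>Sun tomorrow</td></tr></table></body></html>",
]

clusters = cluster_pages(PAGES)
for cluster in clusters:
    print(cluster["members"], sorted(cluster["template"]))
```

In this sketch, text that recurs on every page of a cluster (such as a shared heading) stays in the template, while page-specific content drops out of the intersection. A fixed threshold is a crude stand-in for a principled model-selection criterion and would need tuning on real data.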

How to Cite:

[1] KUNAL KUMAR KUNDAN, SONALI RANGDALE, "Automatic Template Extraction from Heterogeneous Web Pages," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE)

This work is licensed under a Creative Commons Attribution 4.0 International License.