Volume 15, Issue 5, May 2026
This work is licensed under a Creative Commons Attribution 4.0 International License.
Retrieval-Augmented Generation for Smarter Web Scraping and Synthesis: A Cache-Aware Multimodal Framework for Web Intelligence
Rishith Poojary, Aryan Rana, Aditya Magar, Dev Raval, Pravin Shinde
Abstract: Most web pages today do not deliver data in straightforward ways. Content often appears only after scripts run, so early snapshots are incomplete, and each visit may show a slightly different layout. The same details, such as prices or bylines, sit inside different tag arrangements from site to site. Tools that rely on fixed rules, such as CSS selectors or XPath expressions, work reliably only until the design changes; a minor update can silently break them. Our system, WebRAG, takes a different approach. It treats web scraping as an information retrieval problem and represents each page through three modalities: its raw text, its structural markup, and its visual rendering in the browser. Together these ground the generation process in up-to-date page content rather than in outdated datasets. On repeat visits, a caching mechanism checks whether a page's underlying structure has changed and skips reprocessing when little or nothing differs from the earlier version. Each result carries provenance: the source URL, the specific document fragments used, and a confidence estimate tied to retrieval quality. We evaluate on WebRAGBench, a collection of more than five thousand hand-labeled examples drawn from news outlets, shopping platforms, and knowledge-focused websites. WebRAG outperforms rule-driven methods and standard text-only pipelines on retrieval accuracy, extraction fidelity, and overall response latency. Caching alone yields a 2.8× improvement in throughput. We also examine where the system still falls short, which points to promising directions for future work.
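To make the cache-aware idea concrete, the following is a minimal sketch of one way a structural change check could work. It is not the authors' implementation. The names `structural_fingerprint` and `StructureCache`, and the choice of hashing only the sequence of opening tags, are assumptions made for illustration; the abstract does not say how WebRAG detects structural change.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def structural_fingerprint(html: str) -> str:
    """Hash only the tag sequence (an assumed notion of 'structure')."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "/".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class StructureCache:
    """Hypothetical cache: reuses a stored result while the structure is unchanged."""

    def __init__(self):
        self._entries = {}  # url -> (fingerprint, result)

    def get_or_process(self, url, html, process):
        """Return (result, cache_hit). `process` is the expensive step to skip."""
        fingerprint = structural_fingerprint(html)
        cached = self._entries.get(url)
        if cached is not None and cached[0] == fingerprint:
            return cached[1], True
        result = process(html)
        self._entries[url] = (fingerprint, result)
        return result, False
```

Because this fingerprint ignores text, a page whose price changes but whose markup does not would count as a cache hit. The sketch therefore only suits caching structure-dependent work, such as deciding where fields sit in the markup. A real system would still need to re-read the current text from the cached locations.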
Keywords: Retrieval-Augmented Generation, Web Scraping, Large Language Models, Multimodal Embeddings, Information Extraction
How to Cite:
[1] Rishith Poojary, Aryan Rana, Aditya Magar, Dev Raval, Pravin Shinde, “Retrieval-Augmented Generation for Smarter Web Scraping and Synthesis: A Cache-Aware Multimodal Framework for Web Intelligence,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.155120
