Abstract: Statistical modelling has been fundamental to Natural Language Processing (NLP), providing scalable, data-driven solutions beyond traditional rule-based methods. This paper surveys key statistical models, including n-gram models, Hidden Markov Models, and Conditional Random Fields, as well as advanced methods like Bayesian models and Latent Dirichlet Allocation, which reveal hidden structures in text. We explore their applications across tasks such as part-of-speech tagging, named entity recognition, machine translation, and text classification. The paper also reviews evaluation metrics like perplexity, BLEU, and F1-score, and discusses challenges such as data sparsity and the difficulty of capturing long-range dependencies. A comparison with neural approaches highlights scenarios where statistical models remain preferable, particularly for interpretability and low-resource settings. We conclude by recommending hybrid statistical-neural models to achieve effective, interpretable, and efficient NLP solutions.

Keywords: Natural Language Processing, Statistical Modelling, N-gram, Hidden Markov Model (HMM), Conditional Random Field (CRF), Latent Dirichlet Allocation (LDA), Sequence Labelling, Machine Translation, Language Modelling, Probabilistic Methods.


DOI: 10.17148/IJARCCE.2025.14433
