This repository contains notebooks for collecting and preprocessing a corpus of diary entries, and for experiments with machine learning (ML) and deep learning (DL) models that predict the gender and age group of authors and the time period in which a text was written. The work was carried out as part of a master's thesis titled "Automatic identification of sociolinguistic data based on the texts of diaries of the 'Prozhito' project".
The study builds a sociolinguistic profile of text authors by predicting hidden attributes (gender, age, and the time period in which a text was written) with machine learning and deep learning methods. The research material consists of diary entries from the "Prozhito" project, a digital archive of personal documents. The goal is to identify the most accurate algorithms for predicting gender, age group, and time of writing from diary entries.
The study combined modeling for algorithm development, experiments to test model performance, comparison of different approaches, statistical and sociolinguistic analysis, and scientific description. It identified significant correlations between linguistic features and the demographic attributes of authors, and the models reached high accuracy, especially logistic regression and recurrent neural networks combined with a CNN1D architecture. The practical value of the work lies in models for predicting demographic attributes that can be applied in fields ranging from sociology to marketing and forensic examination. The results are also relevant to programs aimed at preserving historical and cultural texts, and they contribute to a deeper understanding of linguistic variation and social differences.
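To make the logistic regression approach mentioned above more concrete, here is a minimal sketch of a bag-of-words text classifier for the time-period task. It is an illustration under stated assumptions, not the code from the notebooks: the training sentences are invented English examples rather than "Prozhito" entries, the two period labels are hypothetical, and the helper names (`build_vocab`, `vectorize`, `train`, `predict_proba`) are made up for this example. It uses only the Python standard library so it runs as-is; for real experiments a library implementation would be the more natural choice.

```python
import math
from collections import Counter

# Invented toy data (NOT from the Prozhito corpus).
# Hypothetical labels: 0 = earlier period, 1 = later period.
TRAIN = [
    ("sent a telegram and rode the carriage home", 0),
    ("the carriage was late so i wrote a telegram", 0),
    ("received a telegram about the harvest", 0),
    ("checked email on the computer after work", 1),
    ("the computer broke so i used the phone for email", 1),
    ("watched television and then wrote an email", 1),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map every token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Turn a text into a bag-of-words count vector; unknown words are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, lr=0.5, epochs=200):
    """Fit weights and bias with plain stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(text, vocab, w, b):
    """Return the estimated probability that the text belongs to class 1."""
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


texts = [t for t, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = build_vocab(texts)
w, b = train([vectorize(t, vocab) for t in texts], labels)

p_early = predict_proba("a telegram arrived by carriage", vocab, w, b)
p_late = predict_proba("new email on the computer", vocab, w, b)
print(f"P(later period) for telegram text: {p_early:.3f}")
print(f"P(later period) for email text:    {p_late:.3f}")
```

The same structure carries over to the gender and age-group tasks by swapping the labels; age groups with more than two classes would need a multiclass extension such as one-vs-rest.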