Collection of Brazilian Blogspot Posts
Authors: Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira
Abstract: Diary-like content expressing authors' personal experiences and sentiments on a variety of topics is generated every day and made available on the Internet. This rich content can be used in several ways for psychological analysis and knowledge discovery regarding human-related issues. This paper presents the creation of a Brazilian Portuguese corpus, built from blog posts, for personal story analysis and detection. We present an analysis of psycholinguistic categories across personal story and non-story posts, discussing their similarities and differences. We also study the use of these psycholinguistic categories as classification features. We then describe the evaluation of several machine learning approaches and the process of applying them to identify personal stories in our dataset. Finally, we investigate the main topic-related polarity of personal narrative posts.
Keywords: Corpus, Natural Language Processing, Personal Story, Psycholinguistic, Social Media.
Complete Reference: Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira. 2017. Portuguese Personal Story Analysis and Detection in Blogs. In Proceedings of WI ’17, Leipzig, Germany, August 23-26, 2017, 7 pages. DOI: 10.1145/3106426.3106517
https://github.com/heukirne/brazilian-blog-dataset/blob/master/blogs_stats.ipynb
https://github.com/heukirne/brazilian-blog-dataset/blob/master/countries.json
http://www.inf.pucrs.br/linatural/blogset-br (4.7 GB, 7.4M posts)
https://github.com/heukirne/brazilian-blog-dataset/raw/master/corpus.csv.gz (1K posts)
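A minimal sketch of how the 1K-post sample could be inspected once corpus.csv.gz has been downloaded locally. The function name read_rows and the limit parameter are hypothetical, and UTF-8 encoding is an assumption. The column layout is not documented here, so the sketch returns raw rows and does not assume a header.

```python
import csv
import gzip


def read_rows(source, limit=5):
    """Return up to `limit` rows from a gzip-compressed CSV.

    `source` may be a filesystem path or a binary file-like object.
    UTF-8 encoding is assumed; adjust it if the file decodes incorrectly.
    """
    rows = []
    # Text mode ("rt") lets csv.reader consume decoded lines directly.
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(rows) >= limit:
                break
            rows.append(row)
    return rows


# Example (assumes the sample was saved next to this script):
# for row in read_rows("corpus.csv.gz", limit=3):
#     print(row)
```

Blog posts can be long, so if the reader raises a field-size error, the limit can be raised with csv.field_size_limit before calling read_rows.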
This project belongs to the NLP Group at PUCRS, Brazil.