diff --git a/paper.txt b/paper.txt new file mode 100644 index 0000000..8abddc7 --- /dev/null +++ b/paper.txt @@ -0,0 +1,38 @@ +Title: Unsupervised Text Segmentation via Deep Sentence Encoders: a first step towards a common framework for text-based segmentation, summarization and indexing of media content. + +Abstract: +In this paper we present a new algorithm for text segmentation based on deep sentence encoders and the TextTiling algorithm. We describe how text segmentation is an essential first step in the re-purposing of media content like TV newscasts and how the proposed methodology can add value to subsequent tasks involving such media products thanks to the features extracted for segmentation. We present experiments on Wikipedia and on transcripts from the CNN 10 news show, and compare the results of the proposed algorithm with other approaches. Our method improves over other unsupervised methods and gives results that are competitive with supervised approaches without the need for any training data. Finally, we give examples of how to re-purpose the encoded sentences, so as to highlight the re-usability of the extracted sentence embeddings for tasks like automatic summarization, while showing how these tasks depend on the segmentation process. + +Introduction: The possibility of re-using media products from different sources such as television, radio, etc. is very important for modern broadcasters, as users move more and more towards Internet-based, interactive platforms [1, 2]. These platforms facilitate the consumption of media content in forms that differ from the original product: a portion of a news broadcast corresponding to a single story, for example, could be returned to a user in response to the user's query or detected interests. To do so, the programme would need to be divided into smaller units based on their topical content [3].
Given a single, long text document like the transcript of a news broadcast, linear text segmentation, also referred to as topic segmentation, is the task of dividing it into smaller, topically coherent segments [4]. As mentioned, the task is a first and essential step for the retrieval of relevant information such as a single news story inside a newscast [5]. Similarly, the identification of these segments is crucial for other applications like automatic summarization and discourse analysis [6]. +Various techniques have been proposed over the years for segmenting both multimedia content such as news broadcasts¹ [7, 8] and other content such as business meetings [9] and newspaper articles [10]. +Popular approaches include the use of lexical similarity [11], Hidden Markov Models [12, 13], Latent Dirichlet Allocation based topic models [14, 15] and Latent Semantic Analysis [16, 17]. More recent works have focused on supervised approaches with discriminative models like Support Vector Machines [18], Neural Networks [19, 20], conditional random fields [21] or some combination thereof [22]. +While leading to better results, supervised approaches depend on the training data supplied, which often limits how well the learned knowledge transfers across domains: supervised models might severely underperform when no training data is available for a specific domain [23]. +In addition, the segmentation step is just the first of a larger pipeline that might include summarization, semantic search or segment labelling. Solutions based on topic modelling, for example, have the advantage over other task-specific approaches of providing additional, useful information at no additional cost for related, subsequent tasks like the tagging of story units [14].
+Given these considerations, this work proposes a simple, unsupervised approach that takes advantage of recent developments in transfer learning for NLP to obtain features for segmentation that can easily be readapted for later uses. The next section introduces some relevant works in topic segmentation from which this research originated. We then present experimental results on two different datasets and, finally, we give an example of using our framework for segmentation and extractive summarization. + +Algorithm: The proposed methodology closely follows the original TextTiling algorithm by [6]. Given the flexible nature of what can be included in the blocks to be compared inside the algorithm, various alternatives have been proposed starting from this very same algorithm but using tf-idf weighting [24] or LDA [15]. This last approach demonstrated how the use of information more directly related to topic could dramatically improve the TextTiling approach. Recent work on neural sentence embeddings has shown how pre-training deep neural networks on multiple tasks can generate embeddings capturing lexical, discourse and topical structure [46]. This gives us a valid reason to experiment with some popular neural sentence encoders to obtain the sentence representations to be compared in the TextTiling algorithm. +The general form of the algorithm is the same as TextTiling and its variants, except for the extraction of sentence +embeddings and their use in computing the similarity scores between adjacent blocks. It consists of the following +steps: +1. Extract sentences via a sentence tokenizer. In our case, we used the widely used and publicly +available Punkt tokenizer from the NLTK Python library [47]. +2. For each sentence $s_i$, extract the corresponding embedding $\mathbf{e}_i \in \mathbb{R}^n$ via the chosen sentence encoder, +where $n$ is the dimensionality of the numeric vector representing the sentence (i.e. the sentence +embedding).
3. According to the chosen window parameter $w$, compute $\mathrm{score}(i)$ relative to sentence $s_i$ as the cosine +similarity between the averages³ of the embeddings in the two adjacent blocks of sentences, with $s_i$ being the rightmost sentence of the left block. Formally, for each position $i$ we compute +$\mathrm{score}(i) = \frac{\mathbf{b}_l(i) \cdot \mathbf{b}_r(i)}{\lVert \mathbf{b}_l(i) \rVert \, \lVert \mathbf{b}_r(i) \rVert}$, where $\mathbf{b}_l(i) = \sum_{j=i-w+1}^{i} \mathbf{e}_j$ and $\mathbf{b}_r(i) = \sum_{j=i+1}^{i+w} \mathbf{e}_j$ (sums can stand in for averages here because cosine similarity is invariant to positive scaling). +4. For each position $i$, compute a depth score as follows: $\mathrm{ds}(i) = \frac{1}{2} \left( \mathrm{score}(l) + \mathrm{score}(r) - 2\,\mathrm{score}(i) \right)$. In +this context, $\mathrm{score}(l)$ is found iteratively by comparing the scores to the left of $\mathrm{score}(i)$ until a score at index $l$ is found such that $\mathrm{score}(l-1) < \mathrm{score}(l) > \mathrm{score}(l+1)$. The same is done to find $\mathrm{score}(r)$, except that this time the peak is sought to the right of $\mathrm{score}(i)$. +5. If the number $k$ of required segments is known, return the $k$ boundaries having the highest depth scores. Otherwise, return all boundaries whose depth score falls above a pre-defined threshold $p$. +The algorithm per se is agnostic of which sentence encoder is used and, apart from the sentence encoder, it relies on just two hyperparameters, namely the window value $w$ and (only if the number of segments is unknown) the threshold parameter $p$. Here, these two parameters, along with the choice of sentence encoder, are selected by optimising an objective metric on held-out data, but they can also be pre-set on the basis of other considerations (e.g. the runtime of the algorithm). + +3.2 Sentence Encoders +The choice of the sentence encoder to be used is likely to have a strong effect on the performance of the +proposed system.
For this reason, we experimented with three different popular encoders, all of which have their +reported strengths and weaknesses. These encoders are: +• Universal Sentence Encoder (USE): in 2018, Google Research released two task-agnostic +sentence encoders under the name Universal Sentence Encoder [40]. Specifically, here we use just one of the two released encoders, namely the deep averaging network (DAN), further described in [41]. Despite its simplicity, this encoder has proved to be quite effective, while not relying on the transformer architecture⁴. +• STSb-BERT base (SBERT): This sentence encoder is based on the base version of BERT [39] and improves over it by using additional training strategies so that sentences that are supposed to be semantically similar have vectors closer to each other [43]. The resulting sentence embeddings outperformed previous sentence encoders (including the Universal Sentence Encoder) on the standard SentEval framework [48]. STSb-BERT is publicly available via the sentence_transformers Python library released by UKP lab⁵. The same library has also been used for the third sentence encoder described below. +• Paraphrase-xlm-r-multilingual-v1 (Para-xlm): This model derives from RoBERTa [49], a version of BERT with a different pre-training strategy that has been shown to make the model more robust and better than plain BERT on many tasks. The RoBERTa model is pre-trained on a dataset of paraphrases, then the number of its parameters is reduced by using knowledge distillation. This encoder also has the advantage of being able to produce embeddings for more than 50 languages thanks to the additional knowledge distillation process applied to it and described in [44].
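As an illustration of steps 3 to 5 of the algorithm, the following is a minimal sketch in plain Python, not the implementation used in the experiments. It takes already computed sentence embeddings as input, so any of the encoders above could supply them. The function names (similarity_scores, depth_scores, segment) are hypothetical, positions are 0-based rather than 1-based, blocks are simply truncated at the document edges, and the peaks score(l) and score(r) are located with the usual TextTiling hill-climbing from position i; the last two choices are assumptions not stated in the text.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def sum_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise sum of a non-empty block of embeddings."""
    return [sum(components) for components in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_scores(embeddings: Sequence[Vector], w: int) -> List[float]:
    """Step 3: score(i) for the gap between sentence i and sentence i + 1."""
    scores = []
    for i in range(len(embeddings) - 1):
        left = embeddings[max(0, i - w + 1): i + 1]   # sentence i is rightmost
        right = embeddings[i + 1: i + 1 + w]
        # Sums replace averages: cosine similarity ignores positive scaling.
        scores.append(cosine(sum_vectors(left), sum_vectors(right)))
    return scores


def depth_scores(scores: Sequence[float]) -> List[float]:
    """Step 4: ds(i) = 0.5 * (score(l) + score(r) - 2 * score(i))."""
    depths = []
    for i, s in enumerate(scores):
        l = i
        while l > 0 and scores[l - 1] >= scores[l]:  # climb to the left peak
            l -= 1
        r = i
        while r < len(scores) - 1 and scores[r + 1] >= scores[r]:  # right peak
            r += 1
        depths.append(0.5 * (scores[l] + scores[r] - 2.0 * s))
    return depths


def segment(embeddings: Sequence[Vector], w: int,
            k: Optional[int] = None, p: Optional[float] = None) -> List[int]:
    """Step 5: indices i such that a boundary falls after sentence i."""
    depths = depth_scores(similarity_scores(embeddings, w))
    if k is not None:
        top = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)[:k]
        return sorted(top)
    if p is None:
        raise ValueError("either k or p must be given")
    return [i for i, d in enumerate(depths) if d > p]


# Toy example: three sentences on one topic followed by three on another.
toy = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
print(segment(toy, w=2, k=1))  # [2]
```

In this toy case the similarity score drops to zero at the gap after the third sentence, which therefore receives the highest depth score. With real encoders the embeddings would have several hundred dimensions, but the procedure is unchanged.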