Desenvolvimento de técnicas baseadas em redes complexas para sumarização extrativa de textos

Antiqueira, Lucas

doi:10.11606/D.55.2007.tde-26042007-145428

Master's Dissertation

DOI

https://doi.org/10.11606/D.55.2007.tde-26042007-145428

Document

Master's Dissertation

Author

Antiqueira, Lucas (Catálogo USP)

Full name

Lucas Antiqueira

Institute/School/College

Instituto de Ciências Matemáticas e de Computação

Knowledge Area

Computer Science and Computational Mathematics

Date of Defense

2007-02-27

Published

São Carlos, 2007

Supervisor

Nunes, Maria das Graças Volpe (Catálogo USP)

Committee

Nunes, Maria das Graças Volpe (President)
Oliveira Junior, Osvaldo Novais de
Rino, Lucia Helena Machado

Title in Portuguese

Desenvolvimento de técnicas baseadas em redes complexas para sumarização extrativa de textos

Keywords in Portuguese

Inteligência Artificial
Processamento de Línguas Naturais
Redes Complexas
Sumarização Automática

Abstract in Portuguese

A Sumarização Automática de Textos tem considerável importância nas tarefas de localização e utilização de conteúdo relevante em meio à quantidade enorme de informação disponível atualmente em meio digital. Nessa área, procura-se desenvolver técnicas que possibilitem obter o conteúdo mais relevante de documentos, de maneira condensada, sem alterar seu significado original, e com mínima intervenção humana. O objetivo deste trabalho de mestrado foi investigar de que maneira conceitos desenvolvidos na área de Redes Complexas podem ser aplicados à Sumarização Automática de Textos, mais especificamente à sumarização extrativa. Embora grande parte das pesquisas em sumarização tenha se voltado para a utilização de técnicas extrativas, ainda é possível melhorar o nível de informatividade dos extratos gerados automaticamente. Neste trabalho, textos foram representados como redes, das quais foram extraídas medidas tradicionalmente utilizadas na caracterização de redes complexas (por exemplo, coeficiente de aglomeração, grau hierárquico e índice de localidade), com o intuito de fornecer subsídios à seleção das sentenças mais significativas de um texto. Essas redes são formadas pelas sentenças (representadas pelos vértices) de um determinado texto, juntamente com as repetições (representadas pelas arestas) de substantivos entre sentenças após lematização. Cada método de sumarização proposto foi aplicado no córpus TeMário, de textos jornalísticos em português, e em córpus das conferências DUC, de textos jornalísticos em inglês. A avaliação desse estudo foi feita por meio da realização de quatro experimentos, fazendo-se uso de métodos de avaliação automática (Rouge-1 e Precisão/Cobertura de sentenças) e comparando-se os resultados com os de outros sistemas de sumarização extrativa. Os melhores sumarizadores propostos referem-se aos seguintes conceitos: d-anel, grau, k-núcleo e caminho mínimo. 
Foram obtidos resultados comparáveis aos dos melhores métodos de sumarização já propostos para o português, enquanto que, para o inglês, os resultados são menos expressivos.

Title in English

Development of techniques based on complex networks for extractive text summarization

Keywords in English

Artificial Intelligence
Automatic Summarization
Complex Networks
Natural Language Processing

Abstract in English

Automatic Text Summarization is of considerable importance in tasks such as finding and using relevant content in the enormous amount of information available nowadays in digital media. The focus in this field is on the development of techniques that allow someone to obtain the most relevant content of documents, in a condensed way, preserving the original meaning and with little (or even no) human help. The purpose of this MSc project was to investigate how concepts borrowed from the study of Complex Networks can be applied to the Automatic Text Summarization field, specifically to the task of extractive summarization. Although the majority of work in summarization has focused on extractive techniques, it is still possible to obtain better levels of informativeness in automatically generated extracts. In this work, texts were represented as networks, from which the most significant sentences were selected through the use of ranking algorithms. Such networks are obtained from a text in the following manner: the sentences are represented as nodes, and an edge between two nodes is created if at least one noun occurs in both sentences after the lemmatization step. Measurements typically employed in the characterization of complex networks, such as clustering coefficient, hierarchical degree and locality index, were used as the basis for the process of node (sentence) selection in order to build an extract. Each proposed summarization technique was applied to the TeMário corpus, which comprises newspaper articles in Portuguese, and to the DUC corpora, which comprise newspaper articles in English. Four evaluation experiments were carried out, by means of automatic evaluation measurements (Rouge-1 and sentence Precision/Recall) and comparison with the results obtained by other extractive summarization systems. The best summarizers are the ones based on the following concepts: d-ring, degree, k-core and shortest path. 
Performance comparable to that of the best summarization systems for Portuguese was achieved, whereas the results for English were less impressive.
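
The network construction and the degree-based selection described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `build_sentence_network` and `degree_extract`, the toy input, and the tie-breaking rule (earlier sentence first) are assumptions made for illustration, and the lemmatized nouns of each sentence are assumed to have been extracted beforehand by a tagger and lemmatizer that are not shown.

```python
from itertools import combinations


def build_sentence_network(noun_sets):
    """Build an undirected network over sentences.

    Node i stands for sentence i. Two nodes are linked when their
    sentences share at least one lemmatized noun. The result maps
    each node to the set of its neighbours.
    """
    adjacency = {i: set() for i in range(len(noun_sets))}
    for i, j in combinations(range(len(noun_sets)), 2):
        if noun_sets[i] & noun_sets[j]:  # at least one shared noun
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def degree_extract(noun_sets, size):
    """Pick the `size` sentences with the highest degree.

    Ties are broken by position in the text (an assumption of this
    sketch). Indices are returned in original text order.
    """
    adjacency = build_sentence_network(noun_sets)
    ranked = sorted(adjacency, key=lambda i: (-len(adjacency[i]), i))
    return sorted(ranked[:size])


# Hypothetical input: one set of lemmatized nouns per sentence.
example = [
    {"network", "text"},   # sentence 0
    {"network"},           # sentence 1
    {"weather"},           # sentence 2
    {"text", "summary"},   # sentence 3
]

print(build_sentence_network(example))
print(degree_extract(example, 2))
```

Degree is only one of the measurements named in the abstract; under the same assumptions, the ranking key in `degree_extract` could be swapped for another node measurement without changing how the network is built.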

WARNING - Viewing this document is conditioned on your acceptance of the following terms of use:
This document is only for private use for research and teaching activities. Reproduction for commercial use is forbidden. These rights cover all data about this document as well as its contents. Any uses or copies of this document in whole or in part must include the author's name.

AntiqueiraDissertacaoRevisada.pdf (2.82 Mbytes)

Publishing Date

2007-05-08

Derived works

WARNING: The material described below relates to works resulting from this thesis or dissertation. The contents of these works are the author's responsibility.

ANTIQUEIRA, L, et al. A complex network approach to text summarization [doi:10.1016/j.ins.2008.10.032]. Information Sciences [online], 2009, vol. 179, n. 5, p. 584-599.
ANTIQUEIRA, L, and NUNES, M G V. Complex Networks and Extractive Summarization. In MSc Contest - 9th International Conference on Computational Processing of the Portuguese Language (PROPOR) [CD-ROM], 9, Porto Alegre, RS, 2010. Best MSc Dissertation Award.