Núcleo Interinstitucional de Lingüística Computacional
An Interinstitutional Center for Research and Development in Computational Linguistics

Ongoing projects

Text summarization

sucinto - summarization for clever information access - investigation and exploration of multi-document summarization strategies to provide more feasible and intelligent access to on-line information from news agencies

NILC-WISE - Web Interface for Summary Evaluation - an easy-to-use online interface for running ROUGE (Lin, 2004) to evaluate summaries

Models for summary coherence evaluation - a set of implemented models for summary coherence evaluation, following several approaches, from traditional entity grids to discourse grids. See the PhD thesis of Marcio de Souza Dias for more information.

Summarization extension for Google Chrome - extension for on-line news summarization, based on the RSumm system

RC-4 multi-document summarizer - based on the best RST & CST-based summarization strategy proposed by Cardoso (2014)

RCT-4 multi-document summarizer - based on the best RST & CST & subtopics-based summarization strategy proposed by Cardoso (2014). It differs from the RC-4 summarizer above in that it adds subtopic segmentation and treatment.

Text-summary alignment - tool that includes a set of methods for aligning texts and their multi-document summaries, as developed by Agostini et al. (2014)

TextTiling for Portuguese - topical segmentation tool adapted to news texts in Brazilian Portuguese, based on the work of Hearst (1997)

ViSum - a visualization system for multi-document summarization (described by Lima, 2013)

CSTSumm - a multi-document summarizer based on CST information (see README.txt in the rar file)

RSumm - a multi-document summarizer based on the relationship maps proposed by Salton et al. (1997)

Sentence ordering program - program for ordering sentences in a multi-document summary (given the source texts)

Corpus of automatic multi-document summaries with linguistic errors - a corpus of automatic multi-document summaries (for the texts of the CSTNews corpus) produced by 4 different summarizers with varied performance, manually annotated with linguistic errors. See the readme file for more details.

OpiSums-PT - a corpus of (extractive and abstractive) opinion summaries (170, in total) for reviews of books (13 reviews) and electronic products (4 reviews), written in Brazilian Portuguese

CSTNews - a corpus with 50 clusters of news texts - in Portuguese - with their multi-document summaries, as well as several discourse and semantic annotations

TeMário 2006 - 150 news texts and the corresponding human summaries, which complement the original TeMário corpus, resulting in a corpus of 250 texts for summarization purposes

GEI - Ideal Extracts Generator for Brazilian Portuguese - given the source text and its corresponding manual (human) summary, GEI generates the ideal extract (which is the juxtaposition of sentences from the source text that best correlate with the sentences of the manual summary) using Salton's cosine measure
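The ideal-extract idea described for GEI can be illustrated with a minimal sketch. This is not GEI's actual code: the tokenization, the raw term-frequency weighting and the function names (bag_of_words, cosine, ideal_extract) are assumptions made for illustration only. For each sentence of the manual summary, the sketch keeps the source sentence with the highest cosine similarity and then juxtaposes the selected sentences in source order.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens (assumed tokenization)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ideal_extract(source_sentences, summary_sentences):
    """For each summary sentence, keep the most similar source sentence."""
    source_vectors = [bag_of_words(s) for s in source_sentences]
    chosen = set()
    for summary_sentence in summary_sentences:
        target = bag_of_words(summary_sentence)
        scores = [cosine(target, v) for v in source_vectors]
        # Ignore summary sentences that share no word with the source.
        if scores and max(scores) > 0:
            chosen.add(scores.index(max(scores)))
    # Juxtapose the selected sentences in their original source order.
    return [source_sentences[i] for i in sorted(chosen)]
```

Calling ideal_extract with a list of source sentences and a list of manual summary sentences returns the selected source sentences as a list; sentence splitting is assumed to have been done beforehand.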

DMSumm - Discourse Modeling SUMMarizer

NeuralSumm - NEURAL network for SUMMarization (for scientific texts) - with tools for training the system with new data, if necessary

GistSumm - GIST SUMMarizer

Text and discourse analysis

CSTNews interface - access to 50 clusters of news texts and their multidocument summaries, with texts annotated according to the Cross-document Structure Theory

CSTTool - a semi-automatic editing tool for annotating texts according to the Cross-document Structure Theory

CSTParser - a state-of-the-art CST discourse parser for Portuguese, using both symbolic and machine learning techniques (see Maziero, 2012)
--> Its stand-alone (offline) version (with some adaptations in relation to the online version) is also freely available for use

LIWC - Linguistic Inquiry and Word Count - a text analysis program that calculates the degree to which people use different categories of words across a wide array of texts. The available resource is a version of its dictionary for Brazilian Portuguese. See the original project here and the Brazilian version here. The corresponding publication for Portuguese may be found here.

Newshead - an on-line tool for searching and clustering related news

DiZer - DIscourse analyZER for Brazilian Portuguese (mainly for the Computer Science domain)

DiZer 2.0 - an on-line version of DiZer, which is easily adaptable and portable to different text types/genres and languages

RSTeval - tool for discourse parsing evaluation, following the evaluation method of Marcu (2000) - the tool compares RST trees (automatically or manually produced) and reports precision and recall figures
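As a rough illustration of how precision and recall can be computed when comparing two RST trees, the sketch below treats each tree as a set of text spans. The (start, end) tuple representation and the function name precision_recall are assumptions for this example, not RSTeval's input format or code; a fuller comparison would also score nuclearity and relation labels.

```python
def precision_recall(predicted, reference):
    """Compare two collections of items, e.g. (start, end) spans of RST trees."""
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    # Precision: share of predicted items that also occur in the reference.
    precision = matched / len(predicted) if predicted else 0.0
    # Recall: share of reference items that were also predicted.
    recall = matched / len(reference) if reference else 0.0
    return precision, recall
```

The same function can be reused with richer items, for instance (start, end, relation) triples, to score labelled spans.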

Syntax-based text segmentation tool aiming at producing elementary discourse units for discourse parsing - it uses the parser PALAVRAS (Bick, 2000) for analyzing the input text and, then, applies syntactical segmentation rules

CorpusTCC - corpus of 100 Brazilian Portuguese scientific texts (from the Computer Science domain - introduction sections of theses), annotated with Marcu's RSTTool (using this relation set), used for developing DiZer

RST Toolkit - utility programs for processing RST files, offering several computational facilities for both computational and linguistic purposes

RhetDB - Rhetorical Database - an editing environment for handling the rhetorical analyses produced by Daniel Marcu's RSTTool; it offers several computational facilities for both computational and linguistic purposes
(this is an old version of the software; for better and more advanced features, use RST Toolkit above)

RHETALHO - corpus annotated with Daniel Marcu's RSTTool, with its annotation protocol and relation set; the corpus consists of 40 texts - 20 from the Computer Science domain and 20 from the on-line newspaper Folha de São Paulo (7 from the Cotidiano section, 7 from the Mundo section and 6 from the Science section) - annotated by 2 human experts in RST


Finished projects

Text simplification

Tools and resources available at PorSimples webpage

Text mining and information extraction

Tools and resources available at Sickle Cell Anemia Project webpage

Machine translation

VisualTCA - an on-line tool for sentence alignment visualization

Trapezio - Translation Post-Editor

Neologism detection

Neologism detection tool - a tool for detecting possible neologisms in Portuguese

An older version of the program is also available: a filtering program that looks for words in a text that are not contained in dictionaries. Some pre-processed dictionaries you can try: dictionary for Brazilian Portuguese, REPENTINO and Unitex-PB.
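The dictionary-filtering idea can be sketched in a few lines. This is an illustrative assumption, not the code of the NILC tool: the function name candidate_neologisms is hypothetical, the dictionary is assumed to be a set of lowercase word forms, and tokenization simply keeps runs of letters.

```python
import re


def candidate_neologisms(text, dictionary):
    """Return words of the text that are absent from the dictionary.

    `dictionary` is assumed to be a set of lowercase word forms.
    """
    seen = set()
    candidates = []
    # Keep runs of letters only (digits, underscores and punctuation split words).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in dictionary and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates
```

Words found this way are only candidates: proper names, typos and inflected forms missing from the dictionary will also be returned.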

Grammatical formalisms

Redutor - software tool for reduction between DCG and LFG

Redutor 2 - software tool for reduction between DCG, LFG and GPSG


Useful software

Lemmatizer for Portuguese - based on the MXPOST part-of-speech tagger and the UNITEX dictionaries for Portuguese, this tool produces the lemmas of the words of a text stored in a plain text file. The source code is also provided. For more details, see the readme.pdf file or contact Erick G. Maziero (the developer of the system).

TeP 2.0 - on-line version of a thesaurus for Brazilian Portuguese

NCLEANER trained model for Portuguese - a trained model to be used with NCleaner (Evert, 2008) for cleaning web pages in Portuguese. The model was trained with 184 texts from several online sources, such as Terra, UOL, BBC, Exame, Estadão, IG, R7, Zero Hora, G1, JB Online, and O Globo, among others.

SENTER for Portuguese and for English

On the command line, run the following: senter.exe myfile.txt
The segmented text will be stored in a file with the same name + ".seg" (for instance, myfile.txt.seg) with one sentence per line. The input file must be a plain text file.

Naive-Bayes classifier for Windows (Delphi source code included)

Pre-processing program: replaces numbers, web sites and e-mail addresses in texts with generic concepts
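A minimal sketch of this kind of pre-processing is shown below. The regular expressions, the placeholder tokens (EMAIL, SITE, NUMBER) and the function name generalize are assumptions for illustration; the actual program may use different patterns and labels.

```python
import re

# Illustrative patterns; they are assumptions, not the program's own rules.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Stop before an optional trailing punctuation mark followed by space or end.
SITE = re.compile(r"\b(?:https?://|www\.)\S+?(?=[.,;:!?]?(?:\s|$))", re.IGNORECASE)
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def generalize(text):
    """Replace e-mail addresses, web sites and numbers with generic tokens."""
    # E-mails go first so that their domain part is not treated as a site.
    text = EMAIL.sub("EMAIL", text)
    text = SITE.sub("SITE", text)
    text = NUMBER.sub("NUMBER", text)
    return text
```

The order of the substitutions matters: numbers are replaced last so that digits inside addresses are not rewritten before the address itself is matched.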

NASP (see NASP++ below) - a tool for aiding word sense annotation of nouns in Portuguese, using Princeton WordNet as the sense repository

NASP++ - an improved version of NASP (see above), with more facilities (e.g., the underlying generation of ontologies for the annotated words) and adapted to other parts of speech

MulSEN - a multilingual version of NASP (see above)