Núcleo Interinstitucional de Lingüística Computacional
An Interinstitutional Center for Research and Development in Computational Linguistics

Ongoing projects

Text summarization

sucinto - summarization for clever information access - investigation and exploration of multi-document summarization strategies for providing more practical and intelligent access to on-line information from news agencies

TeMário 2006 - 150 news texts and the corresponding human summaries, which complement the original TeMário corpus, resulting in a corpus of 250 texts for summarization purposes

GEI - Ideal Extracts Generator for Brazilian Portuguese - given the source text and its corresponding manual (human) summary, GEI generates the ideal extract (which is the juxtaposition of sentences from the source text that best correlate with the sentences of the manual summary) using Salton's cosine measure
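The selection step described for GEI can be illustrated with a minimal sketch. It is not GEI's actual code: the bag-of-words tokenization, the rule of keeping the single best-matching source sentence per summary sentence, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    # Assumption: lowercase word tokens with raw term frequencies.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    # Cosine of the angle between two term-frequency vectors.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ideal_extract(source_sentences, summary_sentences):
    # For each summary sentence, keep the source sentence with the
    # highest cosine score (hypothetical selection rule).
    source_vectors = [bag_of_words(s) for s in source_sentences]
    chosen = set()
    for summary_sentence in summary_sentences:
        summary_vector = bag_of_words(summary_sentence)
        scores = [cosine(summary_vector, v) for v in source_vectors]
        if scores and max(scores) > 0:
            chosen.add(scores.index(max(scores)))
    # Juxtapose the selected sentences in their original order.
    return [source_sentences[i] for i in sorted(chosen)]
```

Returning the sentences in source order keeps the extract readable as a juxtaposition of the original text, which is how the entry above describes the ideal extract.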

DMSumm - Discourse Modeling SUMMarizer

NeuralSumm - NEURAL network for SUMMarization (for scientific texts) - with tools for training the system with new data, if necessary

GistSumm - GIST SUMMarizer

Text and discourse analysis

CSTNews interface - access to 50 clusters of news texts and their multidocument summaries, with texts annotated according to the Cross-document Structure Theory

CSTTool - a semi-automatic editing tool for annotating texts according to the Cross-document Structure Theory

Newshead - an on-line tool for searching and clustering related news

DiZer - DIscourse analyZER for Brazilian Portuguese (mainly for the Computer Science domain)

DiZer 2.0 - an on-line version of DiZer, which is easily adaptable and portable to different text types/genres and languages

RSTeval - a tool for discourse parsing evaluation that follows Marcu's (2000) evaluation method - it compares RST trees (automatically or manually produced) and reports precision and recall figures
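As a rough illustration of how precision and recall can be computed when comparing two trees, the sketch below treats each tree as a set of text spans. This representation and the function name are assumptions made for illustration; it is not RSTeval's code, and it does not cover every element an RST evaluation may compare.

```python
def precision_recall(predicted, reference):
    # Each tree is given as a collection of hashable items, for example
    # (first_unit, last_unit) spans (hypothetical representation).
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Spans of an automatically produced tree and of a manual one.
automatic = {(1, 1), (2, 2), (1, 2), (1, 3)}
manual = {(1, 1), (2, 2), (3, 3), (2, 3), (1, 3)}
print(precision_recall(automatic, manual))
```

The same function works for labeled items, such as spans paired with a relation name, as long as both trees use the same representation.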

Syntax-based text segmentation tool that produces elementary discourse units for discourse parsing - it uses the PALAVRAS parser (Bick, 2000) to analyze the input text and then applies syntactic segmentation rules

CorpusTCC - corpus of 100 Brazilian Portuguese scientific texts (introduction sections of theses from the Computer Science domain), annotated with Marcu's RSTTool (using this relation set) and used for developing DiZer

RST Toolkit - utility programs for processing RST files, offering several facilities for both computational and linguistic purposes

RhetDB - Rhetorical Database - an editing environment for handling the rhetorical analyses produced by Daniel Marcu's RSTTool; it offers several facilities for both computational and linguistic purposes
(this is an old version of the software; for better and more advanced features, use RST Toolkit above)

RHETALHO - corpus annotated with Daniel Marcu's RSTTool, its annotation protocol and the relation set; this corpus consists of 40 texts - 20 from the Computer Science domain and 20 from the on-line newspaper Folha de São Paulo (7 from the Cotidiano section, 7 from the Mundo section and 6 from the Science section) - annotated by 2 human experts in RST

Text mining and information extraction

Tools and resources available at the Sickle Cell Anemia Project webpage

 

Finished projects

Text simplification

Tools and resources available at the PorSimples webpage

Machine translation

VisualTCA - an on-line tool for sentence alignment visualization

Trapezio - Translation Post-Editor

Neologism detection

Neologism detection tool - a tool for detecting possible neologisms in Portuguese

There is also an older version of the program: a filtering program that looks for words in a text that are not contained in dictionaries. Some pre-processed dictionaries you can try with it: a dictionary for Brazilian Portuguese, REPENTINO and Unitex-PB.
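The dictionary-filtering idea can be sketched as follows. This is not the tool's implementation: the tokenization rule, the lowercasing, and the function name are assumptions, and the known words are passed in as a plain Python set rather than read from the dictionary files mentioned above, whose format is not described here.

```python
import re


def candidate_neologisms(text, known_words):
    # Assumption: a token is a run of letters, optionally hyphenated.
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    candidates = []
    for token in tokens:
        # Keep each unknown word once, in order of first appearance.
        if token not in known_words and token not in candidates:
            candidates.append(token)
    return candidates


known = {"o", "postou", "no"}
print(candidate_neologisms("O blogueiro postou no blog.", known))
```

Words flagged this way are only candidates: proper names, typos and inflected forms missing from the dictionary would also be returned.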

Grammatical formalisms

Redutor - software tool for reduction between DCG and LFG

Redutor 2 - software tool for reduction between DCG, LFG and GPSG

Other useful software

Naive-Bayes classifier for Windows (Delphi source code included)

Pre-processing program: replaces numbers, website addresses and e-mail addresses in texts with generic concepts
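A minimal sketch of this kind of substitution is shown below. The regular expressions and the placeholder tokens (`<EMAIL>`, `<SITE>`, `<NUMBER>`) are assumptions chosen for illustration; the program's actual patterns and generic concepts are not documented here.

```python
import re

# Hypothetical patterns; real texts may need stricter or broader ones.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SITE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def normalize(text):
    # E-mails go first so that their domain part is not mistaken for a
    # site, and numbers go last so that digits inside addresses are
    # already gone.
    text = EMAIL.sub("<EMAIL>", text)
    # Note: this simple pattern also swallows punctuation glued to a URL.
    text = SITE.sub("<SITE>", text)
    text = NUMBER.sub("<NUMBER>", text)
    return text


print(normalize("Write to ana@usp.br or visit http://www.nilc.icmc.usp.br in 2006."))
```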

TeP 2.0 - on-line version of a thesaurus for Brazilian Portuguese

SENTER - sentence segmenter, available for Portuguese and for English

  • how to use it:

On the command line, run: senter.exe myfile.txt
The segmented text will be stored in a file with the same name + ".seg" (for instance, myfile.txt.seg) with one sentence per line. The input file must be a plain text file.
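Once SENTER has run, the ".seg" file can be loaded for further processing. The helper below is a sketch under assumptions: the function name is hypothetical, and the file encoding is a guess that may need changing (for example to "latin-1") depending on the input text.

```python
from pathlib import Path


def read_segmented(input_path, encoding="utf-8"):
    # SENTER writes its output next to the input file, adding ".seg".
    seg_path = Path(str(input_path) + ".seg")
    with open(seg_path, encoding=encoding) as handle:
        # One sentence per line; blank lines are skipped.
        return [line.strip() for line in handle if line.strip()]
```

For example, after running `senter.exe myfile.txt`, calling `read_segmented("myfile.txt")` returns the sentences stored in myfile.txt.seg as a list of strings.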