Ongoing projects
Text summarization
sucinto - summarization for clever information access - investigation and exploration of multi-document summarization strategies to provide more practical and intelligent access to on-line information from news agencies
TeMário 2006 - 150 news texts and the corresponding human summaries, which complement the original TeMário corpus, resulting in a corpus of 250 texts for summarization purposes
GEI - Ideal Extracts Generator for Brazilian Portuguese - given the source text and its corresponding manual (human) summary, GEI generates the ideal extract (which is the juxtaposition of sentences from the source text that best correlate with the sentences of the manual summary) using Salton's cosine measure
DMSumm - Discourse Modeling SUMMarizer
NeuralSumm - NEURAL network for SUMMarization (for scientific texts) - with tools for training the system with new data, if necessary
GistSumm - GIST SUMMarizer
graphical interface, for Portuguese and English
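As a rough illustration of the idea behind GEI listed above, the sketch below picks, for each sentence of a manual summary, the source sentence with the highest cosine similarity over raw word counts. It is not GEI's code: the function names (tokenize, cosine, ideal_extract), the plain bag-of-words weighting and the policy of one source sentence per summary sentence are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ideal_extract(source_sentences, summary_sentences):
    """Pick, for each summary sentence, the most similar source sentence.

    Returns the selected source sentences in source order, without repeats.
    Hypothetical helper: the selection policy is an assumption, not GEI's.
    """
    source_vectors = [Counter(tokenize(s)) for s in source_sentences]
    chosen = set()
    for summary_sentence in summary_sentences:
        target = Counter(tokenize(summary_sentence))
        scores = [cosine(target, vec) for vec in source_vectors]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > 0:  # skip summary sentences with no word overlap
            chosen.add(best)
    return [source_sentences[i] for i in sorted(chosen)]
```

Calling ideal_extract with the sentences of a source text and the sentences of its manual summary returns the juxtaposed source sentences. Sentence splitting is assumed to have been done beforehand.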
Text and discourse analysis
CSTNews interface - access to 50 clusters of news texts and their multidocument summaries, with texts annotated according to the Cross-document Structure Theory
CSTTool - a semi-automatic editing tool for annotating texts according to the Cross-document Structure Theory
Newshead - an on-line tool for searching and clustering related news
DiZer - DIscourse analyZER for Brazilian Portuguese (mainly for the Computer Science domain)
DiZer 2.0 - an on-line version of DiZer, which is easily adaptable and portable to different text types/genres and languages
RSTeval - tool for discourse parsing evaluation, following Marcu's (2000) evaluation method - the tool compares RST trees (automatically or manually produced) and reports precision and recall figures
Syntax-based text segmentation tool that produces elementary discourse units for discourse parsing - it uses the parser PALAVRAS (Bick, 2000) to analyze the input text and then applies syntactic segmentation rules
CorpusTCC - corpus of 100 Brazilian Portuguese scientific texts (from the Computer Science domain - introduction sections of theses), annotated with Marcu's RSTTool (using this relation set), used for developing DiZer
RST Toolkit - utility programs for processing RST files, offering several computational facilities for both computational and linguistic purposes
RhetDB - Rhetorical Database - an editing environment for handling the rhetorical analyses produced by Daniel Marcu's RSTTool; it offers several computational facilities for both computational and linguistic purposes
(this is an old version of the software; for better and more advanced features, use RST Toolkit above)
RHETALHO - corpus annotated with Daniel Marcu's RSTTool, with its annotation protocol and relation set; the corpus consists of 40 texts - 20 from the Computer Science domain and 20 from the on-line newspaper Folha de São Paulo (7 from the Cotidiano section, 7 from the Mundo section and 6 from the Science section) - annotated by 2 human experts in RST
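The precision and recall figures mentioned in the RSTeval entry above can be illustrated with a small sketch. It assumes each RST tree has already been flattened into a set of (start, end, label) tuples, one per labeled span; that representation and the function name precision_recall are hypothetical and are not taken from RSTeval itself.

```python
def precision_recall(predicted, gold):
    """Compare two collections of labeled spans.

    Each span is a (start, end, label) tuple (an assumed representation).
    Precision is the share of predicted spans found in the gold tree;
    recall is the share of gold spans found in the predicted tree.
    """
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall
```

The same function can be applied separately to unlabeled spans, nuclearity labels or relation labels by changing what is stored in the third element of each tuple.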
Text mining and information extraction
Tools and resources available at Sickle Cell Anemia Project webpage
Finished projects
Text simplification
Tools and resources available at PorSimples webpage
Machine translation
VisualTCA - an on-line tool for sentence alignment visualization
Trapezio - Translation Post-Editor
Neologism detection
Neologism detection tool - a tool for detecting possible neologisms in Portuguese
There is also an older version of the program: a filtering program that looks for words in a text that are not contained in dictionaries. Some pre-processed dictionaries you can try: a dictionary for Brazilian Portuguese, REPENTINO and Unitex-PB.
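The dictionary-filtering idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name candidate_neologisms, the tokenization rule (letters, with hyphenated compounds kept together) and the case-insensitive lookup are choices made for this example, not a description of the tool.

```python
import re


def candidate_neologisms(text, dictionary_words):
    """Return words of `text` absent from `dictionary_words`.

    Words are lowercased, reported once each, in order of first appearance.
    """
    known = {word.lower() for word in dictionary_words}
    seen = set()
    candidates = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for token in re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text):
        lowered = token.lower()
        if lowered not in known and lowered not in seen:
            seen.add(lowered)
            candidates.append(lowered)
    return candidates
```

In practice the word list would be loaded from one of the pre-processed dictionaries, and the output would still need manual inspection, since proper names and misspellings are also absent from dictionaries.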
Grammatical formalisms
Redutor - software tool for reduction between DCG and LFG
Redutor 2 - software tool for reduction between DCG, LFG and GPSG
Other useful software
Naive-Bayes classifier for Windows (Delphi source code included)
Pre-processing program: replaces numbers, website addresses and e-mail addresses in texts with generic concepts
TeP 2.0 - on-line version of a thesaurus for Brazilian Portuguese
SENTER for Portuguese and for English
How to use it:
On the command line, run the following: senter.exe myfile.txt
The segmented text will be stored in a file with the same name + ".seg" (for instance, myfile.txt.seg) with one sentence per line. The input file must be a plain text file.
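Once SENTER has produced its output, the segmented file can be read back with a short script. The sketch below relies only on what is stated above (the ".seg" suffix and one sentence per line); the helper names seg_path_for and read_segmented are hypothetical, and the UTF-8 default encoding is an assumption that may need changing (for example to "latin-1") depending on the input file.

```python
from pathlib import Path


def seg_path_for(input_path):
    """Return the output path SENTER is described as using: input name + '.seg'."""
    return Path(str(input_path) + ".seg")


def read_segmented(input_path, encoding="utf-8"):
    """Read the .seg file for `input_path` and return its sentences.

    The encoding is an assumption; blank lines are skipped.
    """
    seg_file = seg_path_for(input_path)
    with seg_file.open(encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For instance, after running senter.exe on myfile.txt, read_segmented("myfile.txt") would return the contents of myfile.txt.seg as a list of sentences.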