Over the years, the CSTNews corpus has been annotated by groups of computational linguists from NILC. The corpus is continually undergoing verification procedures and receiving further annotations. It is available below for download and use for research purposes. For the publications describing each annotation, see the "Publications" page.


Differences from the previous version (version 4.0)
- inclusion of 5 new (manually produced) abstracts and 5 new (manually produced) extracts for each cluster
- inclusion of the senses (according to Princeton Wordnet) for verbs
- inclusion of verb and (10% most frequent) noun ontologies (according to Princeton Wordnet) for each cluster, for the clusters of each category, and for the whole corpus
- inclusion of a folder named "For all the clusters", with some available annotations for all the clusters in the corpus
- correction of an annotation bug in the RST tree of document 1 in cluster 34
- correction of bugs in the subtopic annotation in clusters 3 and 37

For each cluster inside the corpus, the following information is available:

  • a folder named "Textos-fonte", with the original source texts (in .txt format) and their titles (in _titulos.txt format) - each file name identifies the numbers of the document and the cluster, the source agency, as well as day, month, year and local time information for the news, whenever these data were available during corpus compilation
  • a folder named "Textos-fonte segmentados", with the original source texts with sentence boundaries delimited by new line characters
  • a folder named "Sumarios", with the following: the manual summary of each document in the cluster (in _sumario_humano.txt format for each document), with some information provided by the human summarizer (in _datos.txt format), namely the gist of the text, the size of the text in number of words, the intended size of the summary (corresponding to 30% of the source text), the summary, and the actual size of this summary; the original manual multi-document summary for the cluster (in _sumario_humano.txt format for each cluster) and its corresponding manual extractive summary (in _extrato_humano.txt format); an automatic multi-document summary produced by the CSTSumm system (in _sumario_automatico_CSTSumm.txt format for each cluster) and a version of it with manually ordered sentences (in _sumario_automatico_CSTSumm_ordenado_manuamente.txt format for each cluster); and 5 new (manually produced) multi-document abstracts and 5 new multi-document extracts in the "Novos sumários" folder (divided into the subfolders "Abstracts" and "Extratos")
  • a folder named "Expressoes temporais", with the temporal expressions manually identified and normalized (with XML tags) for each document according to the proposal of Baptista et al. (2008)
  • a folder named "RST", with the RST annotation of each document, made with RSTTool (developed by Michael O'Donnell) - the documents that were used for computing annotation agreement have their corresponding cluster folder name followed by the string "-concordanciaRST" (and there is a folder named "concordancia" inside the RST folder with the evaluated files)
  • a folder named "CST", with the CST annotation of each cluster (for every possible pair of documents in each cluster), made with CSTTool - the clusters that were used for computing annotation agreement have their corresponding folder name followed by the string "-concordanciaCST" (and there is a folder named "concordancia" inside the CST folder with the evaluated files)
  • a folder named "dls", with subfolders "noun" and "verb", containing the source texts with their (10% most frequent) nouns and (all) verbs accompanied by their corresponding Princeton Wordnet synset identification numbers (in the .dls files), and general XML files for all the source texts in the cluster, showing the details of the word sense annotation (such as the possible translations of the Portuguese words to English, whether they were manually or automatically translated, the available synsets and the selected one); this annotation was completely manual; for each cluster, there is also an XML file with the corresponding verb and noun ontologies composed of the selected synsets in the Princeton Wordnet
  • a folder named "CX_Tópicos", with one file for each source text, containing its manual subtopic segmentation (in the 't' XML-like tag), the keywords (in the "label" attribute) that represent the corresponding subtopic (right above the XML-like tag), and a unique identifier for each subtopic (in the "top" attribute), so that it is possible to look for other occurrences of the same subtopic in the other texts in the cluster (since they are referenced by the same unique identifier); each folder also comes with a "notasCX.txt" file, which stores the list of passages belonging to each subtopic, the number of sentences and words of each subtopic, and the presence of each subtopic in the sentences of the corresponding manual (abstractive) multi-document summary; finally, there is a "_agrupamento_manual.txt" file in each cluster, which summarizes the distribution of subtopics in the texts (in each line, the first column indicates the id of the subtopic, the second column indicates the id of the document, and the third column indicates the id of the sentence that belongs to the indicated subtopic)
  • a folder named "Analise_sintatica", with XML files for each source text and its title containing the corresponding syntactic analyses, which were automatically produced by the PALAVRAS parser (Bick, 2000)
  • a folder named "Alignment", with a txt file containing an XML-like annotation that indicates the source text sentences aligned to each (manually created) multi-document summary sentence, as well as the relationship type of each alignment and the human judges that indicated it
  • a folder named "Aspectos", with a txt file containing the multi-document manual summary with its sentences annotated according to the aspects they present; aspects, in this sense, are related to the information that the sentences convey, e.g., WHAT, WHERE and WHEN information about some event (based on the TAC proposal for the guided summarization task)
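
As an illustration of the three-column layout described for the "_agrupamento_manual.txt" files (subtopic id, document id, sentence id), the following is a minimal Python sketch of how such a file might be read. It assumes that the columns are separated by whitespace, which this page does not state; the function name and the sample lines are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def parse_subtopic_grouping(text):
    """Map each subtopic id to a list of (document id, sentence id) pairs.

    Assumption: one entry per line, three whitespace-separated columns
    (subtopic id, document id, sentence id). Adjust the split if the
    actual files use a different delimiter.
    """
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        subtopic_id, document_id, sentence_id = fields[:3]
        groups[subtopic_id].append((document_id, sentence_id))
    return dict(groups)


# Hypothetical sample content, made up for this example only.
sample = "1 1 1\n1 2 3\n2 1 2\n"
grouping = parse_subtopic_grouping(sample)

for subtopic_id, members in grouping.items():
    print(subtopic_id, members)
```

The ids are kept as strings so that no assumption is made about their exact form. To use the sketch on a real file, its content would first be read into a string with the encoding the corpus files actually use.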


NILC - Interinstitutional Center for Computational Linguistics