[Humanist] 24.221 new on WWW: the Manually Annotated Sub-Corpus

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sun Jul 25 21:54:43 CEST 2010


                 Humanist Discussion Group, Vol. 24, No. 221.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Sat, 24 Jul 2010 18:29:14 +0100
        From: Nancy Ide <ide at cs.vassar.edu>
        Subject: MASC data and annotations available for download



MANUALLY ANNOTATED SUB-CORPUS (MASC)
-------------------------------------------------------------
http://www.anc.org/MASC

Version 1.02 (July 2010) available for download
-----------------------------------------------------------

MASC 1.02 contains 82K words of contemporary written and spoken American
English across a broad range of genres. The entire corpus is annotated for
logical structure, tokens (3 versions), part of speech (2 versions),
sentence boundaries, noun chunks, verb chunks, Penn Treebank syntax, and
named entities. Other annotations include FrameNet frames and frame elements
and Opinion; annotations for TimeBank, PropBank, HPSG, co-reference, event,
and Discourse are in progress. All MASC annotations are manually produced or
hand-validated.

MASC 1.02 also includes a separate "sentence corpus" of 1000 sentences
for each of 50 words, manually annotated for WordNet 3.1* senses by
several taggers, together with inter-annotator agreement statistics.
One hundred of the 1000 sentences for each word are currently being
annotated for FrameNet frames and frame elements. WordNet and FrameNet
annotations for an additional 50 words are forthcoming.
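
For readers unfamiliar with inter-annotator agreement, the sketch below
shows one common way to compute it for two taggers. The announcement does
not say which statistic MASC reports, so Cohen's kappa is an assumption
chosen for illustration, and the sense labels ("s1", "s2") are
hypothetical rather than real WordNet sense keys.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently according to their own distribution.
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sense tags from two taggers for four sentences of one word.
tagger_1 = ["s1", "s1", "s2", "s2"]
tagger_2 = ["s1", "s1", "s2", "s1"]

print(observed_agreement(tagger_1, tagger_2))
print(cohen_kappa(tagger_1, tagger_2))
```

With more than two taggers, a pairwise average or a multi-rater statistic
would be needed; the MASC website and publications describe what was
actually used.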

All MASC annotations are distributed in the ISO TC37 SC4 GrAF standoff
format. The ANC2Go web application can be used to obtain the annotations in
a number of other formats, including in-line XML (XCES), token/pos, simple
NLTK, and CoNLL formats. Tools to import and export GrAF annotations into
and out of GATE and UIMA are also available for download from the MASC and
ANC websites.
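
To illustrate the idea behind standoff annotation, here is a minimal
sketch in which annotations live apart from the primary text and point
into it by character offsets. This is not the GrAF XML schema and not the
output of ANC2Go; the Span class, the function names, the sample sentence,
and the word/TAG layout are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: a label attached to text[start:end]."""
    start: int
    end: int
    label: str


def resolve(text, spans):
    """Pair each span's label with the substring it points at, in text order."""
    ordered = sorted(spans, key=lambda s: s.start)
    return [(text[s.start:s.end], s.label) for s in ordered]


def to_token_pos(text, spans):
    """Render a layer inline as word/TAG pairs (a hypothetical layout)."""
    return " ".join(f"{tok}/{tag}" for tok, tag in resolve(text, spans))


# The primary text is never modified; each layer is a separate list of spans.
PRIMARY_TEXT = "Corpora are useful."
POS_LAYER = [
    Span(0, 7, "NNS"),
    Span(8, 11, "VBP"),
    Span(12, 18, "JJ"),
    Span(18, 19, "."),
]

print(to_token_pos(PRIMARY_TEXT, POS_LAYER))
```

Because each layer only references offsets, several tokenizations or
part-of-speech layers can coexist over the same unchanged text, which is
what makes multiple annotation versions practical.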

ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED FOR RESEARCH AND COMMERCIAL USE.

The full MASC, to be released in fall 2011, will contain 500K words of data
with annotations. MASC 2, available in December 2010, will contain an
additional 140K words with annotations.

MASC 1 and 2 texts are available for separate download to enable others to
annotate the data and contribute the annotations to this community-developed
resource. MASC 3 texts will be available this fall.

==========================================================
We invite contributions of linguistic annotations of any portion of MASC data,
in any format. We also invite contributions of unencumbered texts for
inclusion in MASC and/or the Open American National Corpus.
==========================================================

Please consult the MASC website (http://www.anc.org/MASC) or contact
anc at anc.org for additional information. See also:

Ide, Nancy; Baker, Collin; Fellbaum, Christiane; and Passonneau, Rebecca
(2010). MASC: A Community Resource For and By the People. Proceedings of the
48th Annual Conference of the Association for Computational Linguistics,
Uppsala, Sweden. http://aclweb.org/anthology-new/P/P10/P10-2013.pdf

References to additional MASC publications, including inter-annotator
agreement studies, are available on the MASC website at
http://www.anc.org/MASC/Publications.html.


-------------------------------------------------------------
* Yes, we mean 3.1. Please see the MASC website or the ACL 2010 paper cited above.





