[Humanist] 24.398 new Hebrew corpus

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Sun Oct 10 21:53:36 CEST 2010


                 Humanist Discussion Group, Vol. 24, No. 398.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Mon, 11 Oct 2010 06:36:10 +1100
        From: Willard McCarty <willard.mccarty at mccarty.org.uk>
        Subject: new Hebrew corpus


> Subject: 	Correction about the new Hebrew corpus
> Date: 	Fri, 8 Oct 2010 21:07:50 +0100
> From: 	Justin Parry <ootkaman at yahoo.com>

[Corrected announcement --WM]

We are pleased to announce /hebrewCorpus/, a new corpus that is
available for free online. This corpus presents a variety of searchable
texts in Hebrew. Sources include the Tanach, the Mishnah, nine Israeli
newspapers, some early and modern fiction, subtitles from movies,
spontaneous, everyday conversations from the Corpus of Spoken
Israeli Hebrew, academic journals, sessions of the Knesset, Wikipedia, 
and a few others.

All of these texts add up to over 150 million words.

These texts are not tagged, since the morphological ambiguity of Hebrew
makes doing so problematic, but the program does use part-of-speech
filters that try to predict the part of speech based on structure and
affixes. The program also uses regular expressions, which greatly
enhance the searchability of the texts. Detailed instructions and a
tutorial for the corpus are provided on the site.

We invite all Hebrew teachers, students, and scholars interested in
using a search tool to study Hebrew to explore this resource. If you
know of anyone not on this mailing list who may be interested, we
invite you to forward this message to them.

To begin using the corpus, go to http://hebrewcorpus.nmelrc.org . Click
on "register for free" and add your name and e-mail address to begin
searching. You can also log in as a guest, but this is problematic since
many users may log in as guest at the same time. You can enter your
query in Hebrew or in transliteration. Remember to choose the subcorpus
in which you are interested and, in most cases, "string" in the
part-of-speech column.

There is also a mailing list that provides updates and tips on the
corpus; contact Justin Parry at ootkaman at yahoo.com to be added to it.

This corpus was developed with funding from the National Middle East
Language Resource Center (NMELRC). More information about this center
can be found at http://www.nmelrc.org/.




