[Humanist] 24.398 new Hebrew corpus
Humanist Discussion Group
willard.mccarty at mccarty.org.uk
Sun Oct 10 21:53:36 CEST 2010
Humanist Discussion Group, Vol. 24, No. 398.
Centre for Computing in the Humanities, King's College London
Submit to: humanist at lists.digitalhumanities.org
Date: Mon, 11 Oct 2010 06:36:10 +1100
From: Willard McCarty <willard.mccarty at mccarty.org.uk>
Subject: new Hebrew corpus
> Subject: Correction about the new Hebrew corpus
> Date: Fri, 8 Oct 2010 21:07:50 +0100
> From: Justin Parry <ootkaman at yahoo.com>
[Corrected announcement --WM]
We are pleased to announce /hebrewCorpus/, a new corpus that is
available for free online. This corpus presents a variety of searchable
texts in Hebrew. Sources include the Tanach, the Mishnah, nine Israeli
newspapers, some early and modern fiction, subtitles from movies,
spontaneous, everyday conversations from the Corpus of Spoken
Israeli Hebrew, academic journals, sessions of the Knesset, Wikipedia,
and a few others.
All of these texts add up to over 150 million words.
These texts are not tagged, since the morphological ambiguity of Hebrew
makes doing so problematic, but the program does use part-of-speech
filters that try to predict the part of speech based on structure and
affixes. The program also uses regular expressions, which greatly
enhance the searchability of the texts. Detailed instructions and a
tutorial for the corpus are provided on the site.
We invite all Hebrew teachers, students, and scholars interested in
using a search tool to study Hebrew to explore this resource. If you
know of anyone not on this mailing list who may be interested, we
invite you to forward this message to them.
To begin using the corpus, go to http://hebrewcorpus.nmelrc.org . Click
on "register for free" and add your name and e-mail address to begin
searching. You can also log in as a guest, but this is problematic since
many users may log in as a guest at the same time. You can input your
query in Hebrew or in transliteration, and remember to choose the
subcorpus in which you are interested, and in most cases "string" in the
There is also a mailing list that provides updates and tips on the
corpus; contact Justin Parry at ootkaman at yahoo.com to be added to it.
This corpus was developed with funding from the National Middle East
Language Resource Center (NMELRC). More information about this center
can be found at http://www.nmelrc.org/.