New Tools for Jewish Linguistics
Fall 2008
Introduction
Specialists in Jewish linguistics,
and general researchers who are
fascinated by Jewish languages,
benefit greatly from online access to
the growing network of basic
resources that represent a particular
language or group of languages as
fully as possible. These resources can
range from unanalyzed sound
recordings to fully transcribed and
annotated text corpora; from
dictionaries to the various
manifestations of web-based “social
media.” Even though many of these
tools and projects are not yet fully
accessible on the Web or remain in
various stages of development
because of staffing, funding, and
technological issues, in the
following pages I would like to call
attention to their existence and
potential benefits. One of the best
places to start is the Jewish
Language Research Website, which serves as a
resource for those studying Jewish
linguistics from either an individual
or a comparative perspective.
Annotated Corpora
Computer corpora are bodies of
computer-readable texts, or extracts
of written or spoken text, that are
used for linguistic research.
Annotated corpora are especially
useful: added to the raw text are
annotations that describe linguistic
features such as morphology,
syntax, and tone.
In their article, “Designing
CoSIH: The Corpus of Spoken
Israeli Hebrew” (International
Journal of Corpus Linguistics: 6:2
(2002): 171-197), Benjamin Hary
and others describe how Modern
Hebrew is underrepresented in
corpus linguistics. Since the start of
the new millennium, work has been
under way to fill the gaps. The
Mila Knowledge Center for
Processing Hebrew at the Technion
maintains a collection of Modern
Hebrew annotated texts at its
website. These texts
have been structured and annotated
using Extensible Markup Language
(XML), a commonly used
technology for turning raw or free
text into analyzable data.
Similarly, Tsvi Sadan
[also known as Tsuguya Sasaki] of
Bar-Ilan University and Jan H.
Kroeze of the University of Pretoria
have demonstrated and validated
the use of XML to transform raw
Hebrew linguistic data into a usable
databank. Sadan describes the use of XML in building a Hebrew corpus in his article, “Building an Annotated Corpus and a Lexical Database of Modern Hebrew in XML” (Kyoto University Linguistic Research: 23 (2004): 17-45). See also Kroeze’s June 2006 paper, “Building and Displaying a Biblical Hebrew Linguistics Data Cube Using XML” (presented at the Israeli Seminar on Computational Linguistics (ISCOL) Conference, Haifa, Israel).
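To make the idea of XML annotation concrete, the following minimal Python sketch parses a small annotated sample and retrieves tokens by part of speech. The element and attribute names (corpus, sentence, token, surface, lemma, pos) and the sample sentence are invented for illustration; they are assumptions and do not reproduce the schema of any project mentioned here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element and attribute names are invented for
# illustration and are not taken from any real corpus schema.
SAMPLE = """
<corpus lang="he">
  <sentence id="s1">
    <token surface="הילדים" lemma="ילד" pos="noun" number="plural"/>
    <token surface="קראו" lemma="קרא" pos="verb" tense="past"/>
    <token surface="ספר" lemma="ספר" pos="noun" number="singular"/>
  </sentence>
</corpus>
"""


def tokens_with_pos(xml_text, pos):
    """Return (surface, lemma) pairs for every token tagged with `pos`."""
    root = ET.fromstring(xml_text.strip())
    return [
        (token.get("surface"), token.get("lemma"))
        for token in root.iter("token")
        if token.get("pos") == pos
    ]


# Collect the nouns of the sample sentence in document order.
nouns = tokens_with_pos(SAMPLE, "noun")
```

Once the text is marked up in this way, a query that would require careful manual reading of free text becomes a short attribute lookup, and the same pattern extends to any feature recorded in the annotation.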
In 1994, Beatrice Santorini of the
University of Pennsylvania built a
machine-readable parsed and
annotated corpus of Yiddish texts.
Treebanks are language resources
that provide annotations of natural
languages at various levels of
syntactic structure: at the word
level, the phrase level, and the
sentence level. The Mila Center has
recently released Hebrew Treebank
Version 2.0.
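As a rough illustration of what annotation at the word, phrase, and sentence levels makes possible, the Python sketch below reads a parse written in bracket notation and extracts every phrase with a given label. The bracket format, the labels (S, NP, VP, and so on), and the English sample sentence are assumptions chosen for readability; this article does not specify the formats used by the treebanks named above.

```python
def parse_brackets(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested phrase or word node
            else:
                children.append(tokens[pos])  # a word form
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, in sentence order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(tree, wanted):
    """Return the text of every phrase labeled `wanted`, outermost first."""
    label, children = tree
    found = [" ".join(leaves(tree))] if label == wanted else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


# Invented sample parse: sentence level (S), phrase level (NP, VP),
# word level (DT, NN, VBD).
SAMPLE = "(S (NP (DT the) (NN child)) (VP (VBD read) (NP (DT a) (NN book))))"
tree = parse_brackets(SAMPLE)
noun_phrases = phrases(tree, "NP")
```

For the sample sentence, `noun_phrases` contains the two noun phrases, "the child" and "a book", which is the kind of structural query that raw, unannotated text cannot answer directly.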
Unannotated Corpora
Unfortunately, carefully annotated
corpora are available for only a small
number of Jewish languages.
Because of copyright issues affecting
corpus building, scholars are
sometimes forced to turn to
machine-readable text collections
that are free and open content. Several
online text corpora currently are
available for Hebrew language
research and are still being
expanded, such as the Hebrew
Wikisource and Project
Ben-Yehuda. Wikisource is a sister
project to Wikipedia that aims to
create a free library of primary
source texts, and translations of
source texts, in any language.
Hebrew Wikisource was the first
non-English Wikisource language
domain. Project Ben-Yehuda’s goal
is to make freely accessible on the
Web the classics of Hebrew
literature.
At the recent “2008 Czernowitz
Yiddish Language International
Centenary Conference,” held
August 18-22, 2008, in Chernivtsi,
Ukraine, Dr. Cyril Aslanov explored
how Wikipedia might be able to
provide a window “of visibility” on
Yiddish and other such languages.
Yiddish Wikipedia
contains more than five thousand
articles, providing access to the
use of the Yiddish language in the
twentieth century.
Dictionaries
Several Hebrew dictionaries exist on
the Web. Maagarim, the Historical
Dictionary Project (HDP), is the
research arm of the Academy of the
Hebrew Language. It aims to
“encompass the entire Hebrew
lexicon throughout its history”; that
is, to present every Hebrew word in
its morphological, semantic, and
contextual development. This fee-based
resource requires registration.
Rav-Milim has been issued by the
Melingo Company on the Web in a
subscription-based edition.
The online version offers a variety
of features that are not possible in
the print version.
The company has also issued Morfix
Dictionary, a freely available, online
Hebrew-English and English-
Hebrew dictionary.
Morfix is more than just a
dictionary or translating tool. It is
also an important and effective tool
for searching the web. The Morfix
Dictionary sits within the Morfix
Search Engine, enabling efficient,
cross-language morphological
searching of websites in Hebrew
and English.
Hebrew Wiktionary is part of a
multilingual, free dictionary and
thesaurus, being written
collaboratively by people from
around the world. Entries may be
edited by anyone.
Yiddish Dictionary Online is a Yiddish-English, English-
Yiddish dictionary with English
words and phrases and their Yiddish
equivalents, with both Hebrew
script and romanized spelling, the
approximate pronunciation in
northern and southern Yiddish, part
of speech, and plural versions. It
offers word search and alphabetical
browsing, rhyming tables, and a few
grammatical tables. The authorship
of this site could not be
determined.
The Comprehensive Aramaic
Lexicon, hosted by the Hebrew
Union College in Cincinnati, aims
to create a lexicon of all Aramaic
words from 900 BCE up until the
early Middle Ages.
The resource consists of a database
section with facilities allowing for
concordance, dictionary, dialect,
and lexicon searches, and a
searchable, updated bibliography.
Audio and Sound Collections
The aim of linguistic sound archives
is to provide a comprehensive
record of the linguistic practices
characteristic of a given speech
community. Much has been written
about the problems of providing
long-term preservation and access
to the analog and digital materials
that make up these archives. As a
first step toward making these
materials more visible to the
scholarly and outside communities,
libraries and institutions that house
these research collections are
publishing their holdings on the
Internet and bringing varying
amounts of the collections online.
(Note: This article does not include
sound archives or repositories that
focus on historic recordings of
ethnomusicological or liturgical
interest.)
The website Eydes: Evidence of
Yiddish Documented in European
Societies is devoted to archiving
the dialects, folklore, customs, and
life experiences of east and central
European Jewry. This project is a
spinoff of the Language and
Cultural Atlas of Ashkenazi Jewry
(a decades-long project that was
launched at Columbia University by
Uriel Weinreich). The project
encompasses more than six
thousand hours of tape recordings
made in 603 separate locales.
Also available is an interactive map
with audio clips of regional
differences in dialect.
Dr. Isabelle Barrière at the Yeled
V’Yalda Multilingual Development
and Education Research Institute has been
researching how children develop in
different cultural and linguistic
settings. Over the past three years
she and her team have been
recording the interactions of a
Yiddish-speaking Hasidic boy with
his mother, and hope to publish this
corpus soon.
In the 1980s, Dr. Gertrud
Reershemius of Aston University
collected a corpus of spoken
Yiddish in Israel. These recordings
are now housed at the
Phonogrammarchiv, which is part of
the Oesterreichische Akademie der
Wissenschaften in Vienna, where
they are gradually being
digitized and made available.
SemArch, a project located in the
department of Semitic linguistics at
the University of Heidelberg, is
establishing a digital archive of
audio documents. Its aim is to archive in
digitized form all existing
recordings of Semitic dialects and
languages and to make them
accessible in an Internet database.
Professor Geoffrey Khan of
Cambridge University is directing a
project that aims to produce a
dialect atlas of the surviving North
Eastern Neo-Aramaic dialects. It
will be a Web-based, free-access
catalogue of North Eastern
Neo-Aramaic dialects (Jewish and
Christian), searchable by linguistic
and grammatical criteria. For the
moment, however, researchers can
only access an information page.
Members of the staff at the School
of Oriental and African Studies,
University of London (SOAS) are
working with Eli Timan, a native
speaker of Iraqi Judeo-Arabic, to
document the modern spoken
language in the form of audio and
video recordings made with
speakers in London, Toronto, and
Israel. Using ELAN annotation
software, Timan has put together a
sizeable corpus of partially
transcribed recordings, some with
time-aligned transcriptions and
English translations. Later this year
or next, a website will be launched
that will have illustrative materials,
texts, sound files, images, and
possibly some video.
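ELAN saves its annotations as XML files in which each annotation points to a pair of time slots, and this is what makes a transcription "time-aligned." The Python sketch below reads a heavily simplified sample of that layout and returns each utterance with its start and end time in milliseconds. The tier names and placeholder utterances are invented for illustration. Real files carry additional header and type information, and translation tiers are often linked to the transcription by reference rather than aligned directly, so this is a sketch under those assumptions rather than a full reader.

```python
import xml.etree.ElementTree as ET

# Simplified sample in the style of an ELAN annotation file.
# Tier IDs and annotation text are hypothetical placeholders.
EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a3"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first translation</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tier(eaf_text, tier_id):
    """Return (start_ms, end_ms, text) for each annotation on one tier."""
    root = ET.fromstring(eaf_text.strip())
    # Map each time slot ID to its offset in milliseconds.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times[ann.get("TIME_SLOT_REF1")]
            end = times[ann.get("TIME_SLOT_REF2")]
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
    return sorted(rows)


transcription = read_tier(EAF, "transcription")
```

Because every line of the transcription carries its own time span, a researcher can jump from any transcribed utterance to the matching stretch of the audio or video recording.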
LibriVox provides free, public-domain audiobooks
in sixteen languages. The
number of Hebrew recordings is
still small but growing.
Of the Jewish languages and dialects
that have been described and
documented, many are now extinct
in their spoken form. The UNESCO
Red Book on Endangered Languages:
Europe and a website
produced by Beth Hatefutsoth, the
Nahum Goldmann Museum of the
Jewish Diaspora, have identified
those Jewish languages for which a
few speakers remain. It is
incumbent on scholars to make
every effort to record and
document the last speakers before
these languages become fully
extinct.
Tools for the Twenty-first
Century
Professor Joshua Fishman has noted, in his article “Language Planning for the ‘Other Jewish Languages’ in Israel: An Agenda for the Beginning of the 21st Century,” the dearth of contemporary written texts in Jewish languages such as Judeo-Arabic, Judeo-Persian, and others (Language Problems and Language Planning: 24 (2000): 215-231).
Although
historic and older texts in these
languages exist in libraries and
archives around the world, scholars
researching them will find little in
the way of Web-based or born-digital
texts except for those that
exist within digitized publications
such as dissertations, monographs,
and serials. These last resources,
which really exist as extensions of
print media, have historically been
well described, analyzed, and
documented by scholars of Jewish
languages. To take fullest advantage
of the analytical possibilities offered
by the computer, an electronic text
must first be encoded accurately and
consistently and, even better,
include some kind of textual markup.
Many of the above-mentioned
materials cannot be used effectively
for computerized linguistic analysis
because of problems with
transcription, transliteration, and
production quality. As the
capabilities and quality of optical
character recognition (OCR) improve
and render these texts machine-readable,
scholars of Jewish languages may be
able to adapt new methods of
linguistic analysis to these bodies of
texts.
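One small example of what consistent encoding means in practice: the same pointed Hebrew letter can be stored with its vowel point and its shin dot in either order, and the two versions look identical on screen yet fail to match in a search. The Python sketch below, which uses only the standard unicodedata module, normalizes such text to a single order and, as a separate step, strips the points to obtain consonantal text. The function names and sample strings are my own illustrative choices, not part of any project described here.

```python
import unicodedata


def normalize(text):
    """Put combining marks into canonical order (Unicode form NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_points(text):
    """Remove nonspacing marks (vowel points, cantillation) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


# The same pointed letter typed with its marks in two different orders.
SHIN_A = "\u05e9\u05b8\u05c1"  # shin + qamats + shin dot
SHIN_B = "\u05e9\u05c1\u05b8"  # shin + shin dot + qamats

# A fully pointed word and its unpointed counterpart.
SHALOM_POINTED = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
SHALOM_PLAIN = "\u05e9\u05dc\u05d5\u05dd"

same_after_normalizing = normalize(SHIN_A) == normalize(SHIN_B)
unpointed = strip_points(SHALOM_POINTED)
```

The two raw strings differ, but they compare as equal once normalized, and `unpointed` matches the plain spelling. Applying a step like this uniformly before analysis helps keep word counts and searches from being skewed by how a text happened to be typed.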
A project is underway at Université
Michel de Montaigne Bordeaux 3
under the direction of Soufiane
Rouissi and Ana Stulic to create an
electronic edition of a historic
Judeo-Spanish text that will serve as
a paradigm for corpus building in
the context of a collaborative
computer-based environment.
Some linguists are exploring the use
of blogs, discussion groups, and
other manifestations of Web-based
social media as a source of language
data. There has been a rapid
increase in the number of Yiddish
blogs in the past decade. A
directory of Yiddish blogs is found
at the Tapuz portal.
Ladino is very much
alive among members of the online
discussion group “Ladinomunita,”
which has members from all over
the world. Members of the group
also have access to a Ladino voice
chat room hosted on Paltalk, which
participants call the “Salon de
Mohabet.”
Researchers are looking at today’s
use and infusion of Hebrew and
Yiddish words into European and
Latin American languages. Sarah
Benor describes how she has used
data from Anglo-Jewish websites
such as www.hashkafah.com and
www.heebmagazine.com in
examining what she refers to as
“Jewish American English” in her
forthcoming article, “Do American
Jews Speak a ‘Jewish Language’? A
Model of Jewish Linguistic
Distinctiveness” (Jewish Quarterly
Review). She has mounted Jewish
English: Distinctive Lexicon (beta
version) on the Jewish Language
Research Wiki.
Conclusion
Computerization is playing an
increasing role in the study and
development of tools and resources
for Hebrew and other Jewish
languages. Collaborative research
and cooperation between
individuals, institutions, and
government bodies will, in large
part, determine how successful and
indeed indispensable digital
technologies will become for Jewish
linguistics. One hopes that these
efforts will succeed so that a new
generation of tools and applications
will soon be readily accessible to all.
Heidi Lerner is the Hebraica/Judaica cataloguer at Stanford University Libraries.