[Humanist] 27.848 predominance of English in NLP?

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Thu Mar 6 09:09:53 CET 2014


                 Humanist Discussion Group, Vol. 27, No. 848.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org



        Date: Wed, 5 Mar 2014 10:36:08 +0100
        From: Simon Hengchen <shengche at ulb.ac.be>
        Subject: Predominance of English in entity extraction and disambiguation


Dear colleagues,

Our research group is currently working on the evaluation of Named Entity
Recognition (NER) within a multilingual historical corpus. Whilst English
is the main focus of most research, a lot has been done for other languages
(Spanish, Dutch, German, etc.) and language-independent systems have been
the focus of initiatives (for example, the CoNLL-2002 shared task) within
the Natural Language Processing (NLP) domain. Nonetheless, most
freely available and easily usable NER services (through APIs, for example
via OpenRefine <http://openrefine.org/> and the NER extension
<https://github.com/RubenVerborgh/Refine-NER-Extension>)
focus on English and, even when they advertise support for other languages,
often fail to process them. For example, when a French corpus is submitted
to the Zemanta service, the term "avant" (*before*) is disambiguated to the
URL http://www.avantmusic.net, which refers to a US musician. Other URLs are
available (Wikipedia, DBpedia,
the Twitter or Last.fm pages of the artist) but refer to the same wrongly
disambiguated entity. Even if the services do extract and disambiguate the
entities correctly, the URIs used for the disambiguation are mostly in
English, and often do not have an equivalent in the source language.

In this context we are currently developing a historical overview of how
the predominance of English has impacted NLP and particularly entity
extraction and disambiguation for non-English corpora. The current use of
knowledge bases such as Freebase for disambiguation within NER services
makes this issue especially visible, and some of you may know of relevant
literature on this topic.

We are wondering whether you are aware of initiatives aiming to avoid such
problems, and would love to have your input.

Kind regards,

Simon Hengchen
PhD Student
Département des Sciences de l'Information et de la Communication - ULB CP123
Université Libre de Bruxelles
Avenue F.D. Roosevelt, 50 | B-1050 Brussels
http://homepages.ulb.ac.be/~shengche




