[Humanist] 29.206 parsing bibliographical reference lists

Humanist Discussion Group willard.mccarty at mccarty.org.uk
Tue Aug 11 08:25:12 CEST 2015


                 Humanist Discussion Group, Vol. 29, No. 206.
            Department of Digital Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist at lists.digitalhumanities.org

  [1]   From:    Amir Simantov <wawina at gmail.com>                          (49)
        Subject: Re:  29.204 parsing bibliographical reference lists

  [2]   From:    Marten Düring <m.duering at zoho.com>                       (48)
        Subject: Re:  29.204 parsing bibliographical reference lists

  [3]   From:    Anil Srivastava <anil.srivastava at sriban.com>              (37)
        Subject: Re:  29.204 parsing bibliographical reference lists

  [4]   From:    "Allen B. Riddell" <abr at ariddell.org>                     (37)
        Subject: Re: [Humanist] 29.204 parsing bibliographical reference
                lists



--[1]------------------------------------------------------------------------
        Date: Mon, 10 Aug 2015 08:42:52 -0500
        From: Amir Simantov <wawina at gmail.com>
        Subject: Re:  29.204 parsing bibliographical reference lists
        In-Reply-To: <20150810065734.68AD7693C at digitalhumanities.org>


Thank you very much, Desmond, for your take on it and for the time you have
taken to write it down in detail.

Unfortunately, my original email was cut, and an important part was omitted
somehow (just after the link to the example page, maybe because it was
long?). I am pasting it here. As you can see from this second part of the
email, the problem is not with the overall structure of the data items
but rather with the individual line of a specific source... Actually, my plan
for dealing with the overall structure is very much like the one
you have proposed ("great minds think alike"... Just kidding). Anyway, I
really want to know what tools or libraries are generally used if the
source line IS structured.

*>> HERE STARTS THE OMITTED PART OF THE ORIGINAL EMAIL >>*

WHAT I NEED

I have been investigating this subject, and I have gathered a bunch of
tools and code libraries. I will test them, of course. However, I usually
import structured data (old relational databases or structured files, such
as JSON or XML); this is my first time dealing with linear bibliographic
reference lists.

WHAT I ASK

Here is where I ask for help: I would like to hear from the community (you
guys!) about your experience with a task like mine:

   - Could anyone who has used such a tool/library share his or her
     experience?
   - Is there any tool/library that you would recommend?
   - Do you know of any specific DH projects that have involved such parsing?
   - Any other tips ...?

NOTES

   1. Apart from parsing the reference line into its metadata (title, author,
      pages, etc.), it would be great if the parsing results included a value
      that says what kind of reference it is (article, book, in book, booklet,
      proceedings, in proceedings, PhD thesis, Master's thesis, conference,
      etc.).
   2. The coding language and running platform are not important.
   3. Locating the reference lists is not an issue at all, nor is separating
      each list into individual reference lines. Only the parsing of a single
      reference line is the issue.
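
To make note 1 concrete, here is a minimal sketch in Python of what one
parsed reference line might look like, including the "kind" value. All
names here (RefKind, ParsedReference, the field names) are hypothetical
and chosen only for illustration; the UNKNOWN kind is an added assumption
for lines a parser cannot classify. The sample line is one quoted
elsewhere in this thread, and its fields are filled in by hand.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RefKind(Enum):
    """Reference kinds listed in note 1, plus a fallback (assumption)."""
    ARTICLE = "article"
    BOOK = "book"
    IN_BOOK = "in book"
    BOOKLET = "booklet"
    PROCEEDINGS = "proceedings"
    IN_PROCEEDINGS = "in proceedings"
    PHD_THESIS = "PhD thesis"
    MASTERS_THESIS = "Master's thesis"
    CONFERENCE = "conference"
    UNKNOWN = "unknown"  # not in the list above; added for unparsed lines


@dataclass
class ParsedReference:
    """Hypothetical result record for a single reference line."""
    raw: str                              # the original, unmodified line
    kind: RefKind = RefKind.UNKNOWN       # what sort of reference it is
    authors: List[str] = field(default_factory=list)
    title: Optional[str] = None
    pages: Optional[str] = None


# Fields filled in by hand, purely to show the shape of the record.
example = ParsedReference(
    raw='Harunaga Isaacson, "Citations from the Ratnavali and '
        'Bodhicittavivarana in the Abhayapaddhati", SII 21, 1997, 55-58',
    kind=RefKind.ARTICLE,
    authors=["Harunaga Isaacson"],
    title="Citations from the Ratnavali and Bodhicittavivarana "
          "in the Abhayapaddhati",
    pages="55-58",
)
```

Keeping the raw line in the record means an entry can always be
re-parsed or corrected by hand later.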

Thanks,

Amir Simantov

TopDownUp.com

*<< HERE **ENDS** THE OMITTED PART OF THE ORIGINAL EMAIL <<*

----

Thanks again,
Amir


--[2]------------------------------------------------------------------------
        Date: Mon, 10 Aug 2015 09:52:09 +0200
        From: Marten Düring <m.duering at zoho.com>
        Subject: Re:  29.204 parsing bibliographical reference lists
        In-Reply-To: <20150810065734.68AD7693C at digitalhumanities.org>


Hi Amir,

I agree with Desmond, but you may want to try out http://anystyle.io/, a parser that uses machine learning and works reasonably well.

Best,

Marten

--
Dr. Marten Düring 
http://martenduering.com
http://historicalnetworkresearch.org

---- On Mon, 10 Aug 2015 08:57:34 +0200 Humanist Discussion Group <willard.mccarty at mccarty.org.uk> wrote ---- 

 Humanist Discussion Group, Vol. 29, No. 204. 
 Department of Digital Humanities, King's College London 
 www.digitalhumanities.org/humanist 
 Submit to: humanist at lists.digitalhumanities.org 
 
 
 
 Date: Mon, 10 Aug 2015 05:19:02 +1000 
 From: Desmond Schmidt <desmond.allan.schmidt at gmail.com> 
 Subject: Re: 29.198 end of digital humanities? parsing bibliographical reference lists? 
 In-Reply-To: <20150809063134.E5CDC6921 at digitalhumanities.org> 
 
 
Hi Amir, 
 
looking at this data there is no structure except for some rudimentary 
formatting. No standard library routine is going to be able to parse it 
correctly. You'll have to write your own parser, but it will be hard 
because it is written to be read by humans. For example, what will you do 
with: 
 
Sections translated in Pfad; Nyanaponika 
Edited with Ajitamitra's commentary. Sarnath 1991 
 
There seem to be references to other works embedded in them. Maybe a lookup 
table would work. But it ain't going to be easy. I'd start with something 
that would parse the more complete entries, like 
 
Harunaga Isaacson, "Citations from the Ratnavali and Bodhicittavivarana in 
the Abhayapaddhati", SII 21, 1997, 55-58; 22, 1999, 55-58 
 
Write something to split it into a hierarchy of sections and lines, and 
then match each line against a particular pattern. If it matches, then add 
it to a table of "finished" entries. Then gradually add more patterns until 
you've got most of it. Then add the hardest ones by hand. 
 
Desmond Schmidt 
University of Queensland 
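
The match-against-patterns approach described above could be sketched
roughly as follows in Python. This is only an illustration under
assumptions: the single regular expression, the field names and the
function names are hypothetical, and the pattern covers just the
'Author, "Title", Journal Volume, Year, Pages' shape of the Isaacson
example. Lines that match no pattern are set aside for further patterns
or for entry by hand.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical pattern for: Author, "Title", Journal Volume, Year, Pages
# Anything after the first page range is kept in "rest", not discarded.
ARTICLE = re.compile(
    r'^(?P<author>[^,]+),\s*'
    r'"(?P<title>[^"]+)",\s*'
    r'(?P<journal>.+?)\s+(?P<volume>\d+),\s*'
    r'(?P<year>\d{4}),\s*'
    r'(?P<pages>\d+-\d+)'
    r'(?P<rest>.*)$'
)

# Add more (kind, pattern) pairs here as more entry shapes are covered.
PATTERNS = [("article", ARTICLE)]


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of the first matching pattern, or None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(line.strip())
        if match:
            record = match.groupdict()
            record["kind"] = kind
            return record
    return None


def parse_all(lines: List[str]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Split lines into a table of finished entries and a leftover list."""
    finished, leftover = [], []
    for line in lines:
        record = parse_line(line)
        if record is None:
            leftover.append(line)
        else:
            finished.append(record)
    return finished, leftover


lines = [
    'Harunaga Isaacson, "Citations from the Ratnavali and '
    'Bodhicittavivarana in the Abhayapaddhati", SII 21, 1997, 55-58; '
    '22, 1999, 55-58',
    "Sections translated in Pfad; Nyanaponika",
    "Edited with Ajitamitra's commentary. Sarnath 1991",
]
finished, leftover = parse_all(lines)
```

With these three lines, only the Isaacson entry lands in the finished
table; the other two stay in the leftover list until a pattern is written
for them or they are entered by hand.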
 


--[3]------------------------------------------------------------------------
        Date: Mon, 10 Aug 2015 08:11:42 -0400
        From: Anil Srivastava <anil.srivastava at sriban.com>
        Subject: Re:  29.204 parsing bibliographical reference lists
        In-Reply-To: <20150810065734.68AD7693C at digitalhumanities.org>


Dear Amir,

I am very interested in the solution you come up with because we are also working with Drupal and trying to turn bibliographic information from unstructured text into a structured bibliography.

Incidentally, we are working with IBM Watson on a similar approach: getting Watson, as a cognitive assistant, to produce a structured, rule-based bibliography from unstructured text.

Sincerely, Anil

Anil Srivastava
+1 240-463-3686
anil.srivastava at sriban.com

>>                 Humanist Discussion Group, Vol. 29, No. 198.
>>            Department of Digital Humanities, King's College London
>>                       www.digitalhumanities.org/humanist
>>                Submit to: humanist at lists.digitalhumanities.org
>> 
>>  [1]   From:    Amir Simantov <wawina at gmail.com>
>>  (45)
>>        Subject: Parsing Bibliographic Reference Lists
> ...
>> 
>> --[1]------------------------------------------------------------------------
>>        Date: Tue, 28 Jul 2015 07:42:04 -0500
>>        From: Amir Simantov <wawina at gmail.com>
>>        Subject: Parsing Bibliographic Reference Lists
>> 
>> 
>> Dear scholars and information technologists,
>> 
>> I am a software developer, and I am currently looking for a tool or library
>> to parse bibliographic reference lists for a client of mine.
>> 
>> MY TASK
>> 
>> I need to import data from a website with static HTML pages into Drupal,
>> the content management system I most often use. Part of the data are
>> references lists. I need to parse each reference into its metadata parts,
>> that is, author, book title, journal, pages, etc., according to its type
>> (article, book, etc). An example of a page containing reference lists to be
>> parsed can be found here




--[4]------------------------------------------------------------------------
        Date: Mon, 10 Aug 2015 08:47:51 -0400
        From: "Allen B. Riddell" <abr at ariddell.org>
        Subject: Re: [Humanist] 29.204 parsing bibliographical reference lists
        In-Reply-To: <20150810065734.68AD7693C at digitalhumanities.org>


It might be a bit involved for the present case, but there is a
wonderful NYTimes labs blog post on structured parsing (of cooking
recipes):

Extracting Structured Data From Recipes Using Conditional Random Fields
http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/

Best wishes,

Allen
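
For readers who have not met this kind of sequence labelling before, the
following Python sketch shows only the first step such an approach
usually needs: turning a reference line into tokens and describing each
token with simple features. It is not a conditional random field and is
not taken from the linked post; the function names and the particular
features are assumptions for illustration. A trained sequence model
would combine feature dictionaries like these with hand-labelled
examples to assign a label (author, title, year, pages, ...) to each
token.

```python
import re
from typing import Dict, List


def tokenize(reference: str) -> List[str]:
    """Split a reference line into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", reference)


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Describe token i with a few hand-picked features (illustrative only)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        # Four digits starting 15-19 or 20: a rough guess at a year.
        "looks_like_year": re.fullmatch(r"(1[5-9]|20)\d{2}", tok) is not None,
        "is_punct": re.fullmatch(r"[^\w\s]", tok) is not None,
        # Neighbouring tokens give the model some context.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = tokenize("SII 21, 1997, 55-58")
features = [token_features(tokens, i) for i in range(len(tokens))]
```

The appeal of this route over hand-written patterns is that the model
can generalise from labelled examples to reference lines whose layout
varies, at the cost of having to label a training set first.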