Simple introduction to the data


Detailed documentation on the APIs will be released when access to the APIs is opened to event registrants in the week starting the 7th of January.


Overview


For the hackathon, we’re providing direct access to the APIs that the website is built on, plus two helper services that simplify getting to the images of the newspapers. For some contributors, it will be enough to use the beta website to discover the material and then the image server to grab copies of the images. Others will want to build interactive applications on the other APIs.

Newspaper data model

The newspapers are organised into separate titles. Each title has many issues, and each issue has a number of pages. Issues also have articles. Each level has some basic metadata assigned to it (e.g. the name of a title, the date of an issue, the number of a page, the category of an article).
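
As a rough illustration of the hierarchy described above, the model could be sketched in Python as follows. The class and field names are assumptions made for this example and are not taken from the APIs themselves.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Article:
    category: str  # hypothetical field name, e.g. "advertisement"


@dataclass
class Page:
    number: int  # page number within the issue


@dataclass
class Issue:
    published: date  # date of the issue
    pages: List[Page] = field(default_factory=list)
    articles: List[Article] = field(default_factory=list)


@dataclass
class Title:
    name: str  # name of the newspaper title
    issues: List[Issue] = field(default_factory=list)


# Build a tiny example: one made-up title with a single issue.
title = Title(name="Example Gazette")
issue = Issue(published=date(1850, 1, 7))
issue.pages.append(Page(number=1))
issue.articles.append(Article(category="advertisement"))
title.issues.append(issue)
```

Note that articles hang off the issue rather than the page, following the description above.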

The APIs

The APIs exposed during the hackathon give access to the library’s digitised historical newspapers and associated metadata. The data is provided in three strands, each through its own API.

Article Index (SOLR)

This gives access to the full text of every article. SOLR allows you to perform full-text searches over an article’s textual content and its metadata: information about illustrations; the page, issue and date on which the article was published; word counts; and author and subject information. This will be the most useful source of data for most contributors.
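
A minimal sketch of building a search request might look like the following. The host and core name are placeholders, since the real endpoint has not been published here. The `select` handler and the `q`, `rows`, `start` and `wt` parameters are standard Solr query parameters; which field a bare query searches depends on how the server is configured.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the address given to registrants.
SOLR_BASE = "http://solr.example.org/solr/articles"


def build_search_url(text, rows=10, start=0):
    """Return a Solr select URL asking for JSON results for a text query."""
    params = {
        "q": text,       # the query string, passed to Solr unchanged
        "rows": rows,    # number of results per request
        "start": start,  # offset of the first result, for paging
        "wt": "json",    # ask for a JSON response
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


# Quoting the phrase asks Solr for an exact phrase match.
url = build_search_url('"cough mixture"', rows=5)
```

The resulting URL can be fetched with any HTTP client; the shape of the returned documents depends on the index schema.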

Images (IIP Image Server)

We are providing access to the scanned images of the original newspapers. These are available at a range of resolutions, and the API lets you request specific subsections of a digitised page, so you can extract high-resolution images of individual articles, advertisements or notices. Many of the jewels of this collection come in the form of historical advertisements and images, which provide a window on to the problems, products and marketing approaches of the day. You’ll find adverts for all kinds of things, including clothes, technological developments and medicines.
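
As a sketch of how a region request could be assembled, the function below builds a URL using the Internet Imaging Protocol parameters `FIF` (image path), `WID` (output width in pixels), `RGN` (region as x, y, width and height, each a fraction of the full image) and `CVT` (output format). The server address and the image path are invented for the example.

```python
from urllib.parse import urlencode

# Hypothetical server address: replace with the real image server.
IIP_BASE = "http://images.example.org/fcgi-bin/iipsrv.fcgi"


def build_region_url(image_path, region, width):
    """Return an IIP URL for a JPEG of one region of a page image.

    region is (x, y, w, h), each expressed as a fraction between 0 and 1
    of the full image size.
    """
    for value in region:
        if not 0 <= value <= 1:
            raise ValueError("region values must be between 0 and 1")
    x, y, w, h = region
    params = [
        ("FIF", image_path),
        ("WID", width),
        ("RGN", f"{x},{y},{w},{h}"),
        ("CVT", "jpeg"),  # CVT goes last, after the options it applies to
    ]
    # Keep slashes and commas readable instead of percent-encoding them.
    return f"{IIP_BASE}?{urlencode(params, safe='/,')}"


# Made-up image path; real page codes come from the metadata.
url = build_region_url("papers/page0001.jp2", (0.25, 0.5, 0.5, 0.25), 800)
```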

Images can also be requested via two helper services. These make it simpler to get an image of a page or an article, because they remove the intermediate step of translating the IDs given in the metadata into the page codes needed to fetch the image files from the image server.

Structure (CouchDB)

For the more technically minded, we are also providing access to a CouchDB instance. This database provides structural metadata about all of the articles, describing the relationships between Projects (newspaper titles), Issues (individual scanned newspapers), Pages and Articles. This is the API to use for implementing a browse view of the content.

The database also provides a way to extract the layout XML for a digitised page. This gives structural and layout information for each page, using a coordinate system to define bounding boxes. It contains all of the text captured through OCR and provides bounding coordinates at a fine level of granularity, covering the page, the articles, and individual sentences and words. Using this, you can automate the extraction of relevant portions of a page image and retrieve only the desired image segments from the image server.
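
To illustrate the last step, a bounding box from the layout data can be turned into the fractional region that an image server request expects. The sketch below assumes the box is given as left, top, right and bottom pixel coordinates and that the full page size in pixels is known; the actual layout XML may express its coordinates differently.

```python
def bbox_to_region(left, top, right, bottom, page_width, page_height):
    """Convert a pixel bounding box into fractional (x, y, w, h).

    Assumes the box and the page size use the same pixel coordinate
    system, with the origin at the top-left corner of the page.
    """
    if right <= left or bottom <= top:
        raise ValueError("bounding box must have positive width and height")
    x = left / page_width
    y = top / page_height
    w = (right - left) / page_width
    h = (bottom - top) / page_height
    return (x, y, w, h)


# A 1000 x 500 pixel box on a 2000 x 4000 pixel page.
region = bbox_to_region(500, 1000, 1500, 1500, 2000, 4000)
# region == (0.25, 0.25, 0.5, 0.125)
```

The resulting tuple can be used as the region of an image request, so that only the article or advertisement of interest is downloaded.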
