Exercise: Collect ANP OCR data
To try out the Frame generator, collect OCR files from the ANP collection.
- Define a basic jSRU search query that defines a matching subset for your research.
- Perform this query for various periods in time.
- Use the resolver to get the OCR (about 3 files per selected era).
- Use ‘Save as…’ in your browser to store the result.
- Note: Please give the saved file the .xml extension!
About the ANP typescript collection
The ANP dataset consists of about 1.5 million digitized typescripts of radio news broadcasts from between 1937 and 1984. It is available through the Delpher website as the Radiobulletins collection.
KB offers the data under (semi) open licenses:
- CC0-license for the metadata
- CC-BY-NC-ND-licenses for images and full-text objects
Cheatsheet
Using jSRU
Basic jSRU example queries
The base URL for the search API is http://jsru.kb.nl/sru/sru.
- With the operation parameter set to explain, some general information about the service can be obtained:
http://jsru.kb.nl/sru/sru?operation=explain
- A simple request to query the ANP collection (x-collection) with a single keyword:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs
- By default, the first 20 search results are returned. This number can be adjusted with the maximumRecords parameter:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs&maximumRecords=100
- If you want to start viewing the result set from a particular record onwards, you can use the startRecord parameter:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs&startRecord=100
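The parameters above can also be assembled in a script instead of typed by hand. The following is a minimal Python sketch that only builds the URL. It assumes nothing beyond the base URL and parameter names listed in this cheatsheet, and the helper name build_sru_url is hypothetical (it is not part of any KB library).

```python
from urllib.parse import urlencode

BASE_URL = "http://jsru.kb.nl/sru/sru"  # base URL as given in this cheatsheet


def build_sru_url(query, collection="ANP", maximum_records=None, start_record=None):
    """Build a searchRetrieve URL (hypothetical helper, for illustration only)."""
    params = {
        "operation": "searchRetrieve",
        "x-collection": collection,
        "query": query,
    }
    # The paging parameters are only added when explicitly requested,
    # so the service defaults (first 20 results) apply otherwise.
    if maximum_records is not None:
        params["maximumRecords"] = maximum_records
    if start_record is not None:
        params["startRecord"] = start_record
    # urlencode also takes care of encoding special characters in the query.
    return BASE_URL + "?" + urlencode(params)


print(build_sru_url("nobelprijs", maximum_records=100))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should give the same response as the hand-written example with maximumRecords=100.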
CQL query syntax
SRU uses CQL (Contextual Query Language) as its query language. Some examples:
- To search a string consisting of multiple words, enter the string between double quotes as the query parameter value:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=%22nobelprijs+literatuur%22
- For an OR- or AND-query consisting of two terms, simply use the words OR or AND between the search terms, surrounded by spaces:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs+AND+literatuur
- Wildcards in the form of an asterisk * are supported when appearing in the middle or at the end of a keyword:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelpr*
- To restrict the search to a particular field, add the field name before the search term. This option is only available for specific fields for the ANP collection, such as date and volgnummer:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=date=01-01-1960
- To restrict the search to a specific period in time, use the within syntax in combination with the date field:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=date within "01-01-1960 01-01-1961"
Please note: Queries have to use properly encoded special characters. For example, a space should be replaced by %20 or + and a double quotation mark by %22. Most browsers will automatically take care of this encoding, but if you run into problems you can get your query encoded at the URL Encoding Reference. When entering double quotes, use straight quotes ("), not curly quotes (“ ”).
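If you build queries in a script rather than in the browser, the encoding described in the note can be done with Python's standard library. The sketch below is an illustration only: the helper names phrase and date_within are hypothetical, and the date format is simply the dd-mm-yyyy form used in the examples above.

```python
from urllib.parse import quote_plus


def phrase(words):
    """Wrap a multi-word string in straight double quotes for CQL (hypothetical helper)."""
    return '"' + words + '"'


def date_within(start, end):
    """CQL clause for a period, using the dd-mm-yyyy format from the examples above."""
    return 'date within "{} {}"'.format(start, end)


# quote_plus replaces spaces with '+' and straight double quotes with %22.
print(quote_plus(phrase("nobelprijs literatuur")))
# → %22nobelprijs+literatuur%22
print(quote_plus(date_within("01-01-1960", "01-01-1961")))
# → date+within+%2201-01-1960+01-01-1961%22
```

The encoded strings can be appended after query= in the search URL. Calling date_within with different start and end dates gives one query per period.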
ANP resolver links
- Persistent identifier in the form of a resolver link:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21
- Image can be retrieved by adding the :image suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:image
- OCR can be retrieved by adding the :ocr suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:ocr
- ALTO file can be retrieved using the :alto suffix:
http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:alto
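As an alternative to 'Save as…' in the browser, the resolver links above can be built and fetched from a script. This is a minimal sketch, not an official client: the function names and the file naming scheme are assumptions made for illustration, and save_ocr needs network access, so it is defined here but not called.

```python
import urllib.request

RESOLVER = "http://resolver.kb.nl/resolve?urn="  # resolver prefix from the links above


def resolver_url(identifier, suffix=None):
    """Return the resolver link, optionally with an image, ocr or alto suffix."""
    url = RESOLVER + identifier
    if suffix:
        url += ":" + suffix
    return url


def ocr_filename(identifier):
    """Derive a local file name (assumed scheme) with the .xml extension."""
    return identifier.replace(":", "_") + "_ocr.xml"


def save_ocr(identifier):
    """Download the OCR for one identifier and store it as an .xml file."""
    with urllib.request.urlopen(resolver_url(identifier, "ocr")) as response:
        data = response.read()
    filename = ocr_filename(identifier)
    with open(filename, "wb") as handle:
        handle.write(data)
    return filename


print(resolver_url("anp:1973:10:18:44:mpeg21", "ocr"))
# → http://resolver.kb.nl/resolve?urn=anp:1973:10:18:44:mpeg21:ocr
```

Calling save_ocr("anp:1973:10:18:44:mpeg21") would, under these assumptions, write the OCR to anp_1973_10_18_44_mpeg21_ocr.xml in the current directory.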
Advanced techniques
Using jSRU faceting
- Faceting search results by year:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs&maximumRecords=0&x-facetname=periode&x-facetprefix=1&x-facets=indexes:ANPfacets:periode
The maximumRecords parameter has been set to 0 here, so that only the faceted results are present in the response. The x-facetprefix parameter can take values from 0 to 3, resulting in different temporal resolutions of the facet (where 0=decade, 1=year, 2=month, and 3=day). The x-facetname and x-facets parameters are needed to indicate the particular facet requested.
- Facets per month, filtered to decade 1950-1959:
http://jsru.kb.nl/sru/sru?operation=searchRetrieve&x-collection=ANP&query=nobelprijs&maximumRecords=0&x-facetprefix=2&x-facetname=periode&x-facets=indexes:ANPfacets:periode&x-filter=periode+exact+%220%2F1950-1959%2F%22
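The facet requests above differ only in the resolution and the optional decade filter, so they lend themselves to a small helper. The sketch below reproduces the two example URLs. The function name build_facet_url and the RESOLUTIONS mapping are hypothetical, and the filter format is copied from the decade example above rather than taken from any further documentation.

```python
from urllib.parse import urlencode

BASE_URL = "http://jsru.kb.nl/sru/sru"

# x-facetprefix values as described above.
RESOLUTIONS = {"decade": 0, "year": 1, "month": 2, "day": 3}


def build_facet_url(query, resolution, decade=None):
    """Build a facet-only request (hypothetical helper, for illustration only)."""
    params = {
        "operation": "searchRetrieve",
        "x-collection": "ANP",
        "query": query,
        "maximumRecords": 0,  # return facets only, no records
        "x-facetprefix": RESOLUTIONS[resolution],
        "x-facetname": "periode",
        "x-facets": "indexes:ANPfacets:periode",
    }
    if decade is not None:
        # Filter in the form used in the decade example above, e.g. "1950-1959".
        params["x-filter"] = 'periode exact "0/{}/"'.format(decade)
    # Colons are left unencoded so the result matches the example URLs.
    return BASE_URL + "?" + urlencode(params, safe=":")


print(build_facet_url("nobelprijs", "month", decade="1950-1959"))
```

With these arguments the printed URL is identical to the second example above (facets per month, filtered to 1950-1959).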
Retrieving metadata from KB-MDO
- Provides an overview of all 'sets':
http://services.kb.nl/mdo/oai?verb=ListSets
- Return the first 'batch' of identifiers of MPEG21 DIDL records of the ANP set:
http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl
- Using a from date to collect newer records only:
http://services.kb.nl/mdo/oai?verb=ListIdentifiers&set=anp&metadataPrefix=didl&from=2008-10-01
- Use the resumptionToken to collect the next batch of identifiers:
http://services.kb.nl/mdo/oai?verb=ListIdentifiers&resumptionToken=anp!2008-10-03T11:11:50.639Z!!didl!3932926
- Retrieve one specific MPEG21 DIDL record:
http://services.kb.nl/mdo/oai?verb=GetRecord&identifier=anp:anp:1937:07:08:8:mpeg21&metadataPrefix=didl
- Return the first 'batch' of MPEG21 DIDL records of the ANP set:
http://services.kb.nl/mdo/oai?verb=ListRecords&set=anp&metadataPrefix=didl
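The ListIdentifiers and resumptionToken requests above form a loop: fetch a batch, read the token, and ask for the next batch. The following is a rough Python sketch of that loop using only the standard library. The function names are hypothetical, the element names follow the general OAI-PMH response format, and max_batches is an arbitrary safety limit added for this illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_BASE = "http://services.kb.nl/mdo/oai"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # standard OAI-PMH namespace


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    # A missing or empty resumptionToken element means there is no next batch.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def fetch(params):
    """Fetch one response from KB-MDO (requires network access)."""
    with urllib.request.urlopen(OAI_BASE + "?" + urlencode(params)) as response:
        return response.read()


def harvest_identifiers(fetch_page=fetch, max_batches=3):
    """Collect identifiers of the ANP set, following resumption tokens."""
    params = {"verb": "ListIdentifiers", "set": "anp", "metadataPrefix": "didl"}
    collected = []
    for _ in range(max_batches):
        identifiers, token = parse_identifiers(fetch_page(params))
        collected.extend(identifiers)
        if token is None:
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    return collected
```

Calling harvest_identifiers() would contact the live service. Passing a different fetch_page function makes it possible to try the loop on saved responses first.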
Some tips for further exploring:
oai2linerec is a simple tool, running on both Mac and Linux, for harvesting metadata from MDO or any other OAI-PMH repository. Try, for example: ./oai2linerec.sh -s anp -p didl -b http://services.kb.nl/mdo/oai -o output.txt
A convenient Python-based wrapper for accessing both jSRU and KB-MDO can be found at http://lab.kb.nl/tool/python-api