Introduction

In 2022 and 2023, for the first time in the history of our web collection, two researchers started working with the KB web collection. This created a need to analyse the web data inside the WARC container files we hold. The result: datasets of hyperlinks. On this page you can find an example of the kind of dataset we prepared for their research, based on a harvest of kb.nl.

Analysing hyperlinks in the KB web collection

I first extracted hyperlinks in preparation for the arrival of Researcher in Residence Karin de Wild in 2022. You can read how I extracted the hyperlinks and worked with them in Gephi in the blog series ‘Analysing hyperlinks in the KB web collection’.

I learned a lot from this first experiment. The main thing I discovered was that I wanted the HTML tags in my dataset as well as the hyperlinks. The tags tell you a lot about the type of hyperlink you extract and are needed for a thorough analysis. I also learned that the way you crop a hyperlink (for example, to its domain) strongly affects the outcome of the analysis. I applied these lessons when Jesper Verhoef became our Researcher in Residence in 2023: changes to our original extraction script improved the dataset.

Fast forward to the end of 2022 and </em><a href="https://www.kb.nl/en/news/kbs-xs4all-web-collection-unesco-world-heritage-list"><em>this collection was recognized as UNESCO heritage</em></a><em>. But… Was Kees right?

Example of how a hyperlink is embedded in the HTML of a website (and therefore in a WARC file). Source: No Longer XS4ALL.
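To make the two lessons above concrete (keep the tag, and crop the hyperlink consistently), here is a minimal sketch in Python. It is not the KB extraction script: the class `LinkCollector`, the function `crop_to_host`, the choice to crop to the host name without ‘www.’, and the `<img>` element in the sample are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect (tag, absolute URL) pairs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))


def crop_to_host(url):
    """Crop a URL to its host name, dropping a leading 'www.' (one possible cropping rule)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# The anchor is taken from the example above; the <img> is a hypothetical addition.
html = (
    '<p><em>Fast forward to the end of 2022 and </em>'
    '<a href="https://www.kb.nl/en/news/kbs-xs4all-web-collection-unesco-world-heritage-list">'
    '<em>this collection was recognized as UNESCO heritage</em></a></p>'
    '<img src="/images/logo.png">'
)

collector = LinkCollector("https://www.kb.nl/")
collector.feed(html)
rows = [(tag, crop_to_host(url)) for tag, url in collector.links]
print(rows)
```

A different cropping rule (for example, keeping the full path, or cropping subdomains as well) would produce a different set of targets, which is why this choice matters for the analysis.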

The dataset

For those interested I made a sample set based on a crawl from the website of kb.nl:

Target Instance ID	38712298
Target Name	Koninklijke Bibliotheek
Schedule start	06/07/2024 18:57:00
URLs downloaded	175,115	limit: 700,000
Data downloaded	21.03 GB	limit: 50 GB
Elapsed time	12:14:54:33
Seeds	https://inschrijven.kb.nl/ https://galerij.kb.nl/ https://www.kb.nl/ https://collecties.kb.nl/	Primary seed: https://www.kb.nl/
Excluded from harvest	.field_categories. http://acroeng.adobe.com/.* .f%5B0%5D=. .\.rss. ^[^/]+://[^/](youtube). ^[^/]+://[^/](facebook). ^[^/]+://[^/](google).
Important to note	Harvest finished by operator

Table 1. Metadata from the kb.nl harvest.

This target has four seeds, but the harvest was cut short by a quality assurance officer, so only part of the selected seeds was harvested. Because kb.nl is the primary seed, the harvester started with that seed and did not get around to the others before it was manually stopped.

Result

You can find the whole dataset as a CSV file under the ‘Examples’ tab, but here (Table 2) is a little sneak peek. Each row starts with the ‘source’: the website on which the link was found. Next comes the ‘target’: the website to which the link refers. The third column is the weight: how many times the source website refers to the target website. The last column is the type of URL. In the beginning this was only ‘anchor’ or ‘embedded’; after updating the extraction script, it now holds the specific tag that surrounds the link. In the KB dataset this is mostly ‘a’ for an anchor link, but you can also find some images and frames in there.

Source	Target	Weight	Type_URL_v2
kb.nl	webggc.oclc.org	167154	a
kb.nl	collecties.kb.nl	166452	a
kb.nl	youtube.com	166288	a
kb.nl	delpher.nl	155586	a
kb.nl	webwinkel.kb.nl	152806	a

Table 2. Example of the data found inside the kb.nl_2024_links dataset.
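The weight column can be thought of as a count of identical link records. The sketch below shows one way to compute it, assuming the weight is counted per combination of source, target and tag; the records are made up for illustration and the actual script may aggregate differently.

```python
from collections import Counter

# Hypothetical extracted link records: (source, target, tag).
records = [
    ("kb.nl", "delpher.nl", "a"),
    ("kb.nl", "delpher.nl", "a"),
    ("kb.nl", "youtube.com", "a"),
    ("kb.nl", "youtube.com", "iframe"),
    ("kb.nl", "delpher.nl", "a"),
]

# Count how often each (source, target, tag) combination occurs.
weights = Counter(records)

# One row per combination, heaviest first, using the column names from Table 2.
rows = [
    {"Source": source, "Target": target, "Weight": weight, "Type_URL_v2": tag}
    for (source, target, tag), weight in weights.most_common()
]
for row in rows:
    print(row)
```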

It is important to note that this dataset does not reflect the kb.nl website as a whole. As mentioned above, the harvest was stopped manually. But even if the harvest had been completed, the dataset would only show what we managed to archive with Heritrix and WCT and what we managed to extract with our WARC-link-extraction script.

Link web with kb.nl referring to a number of other websites. Some lines are thicker because kb.nl refers to them more often

Figure 1: Example of what you can do with the dataset. In this case it shows the websites kb.nl most often refers to.
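A ranking like the one behind Figure 1 can be sketched in a few lines of Python. The sample below reuses the rows from Table 2; the function name `top_targets` is hypothetical, and the tab delimiter is an assumption, so check the delimiter and encoding of the downloaded CSV before reusing this.

```python
import csv
import io

# Rows copied from Table 2; tab-separated here, the real file may use another delimiter.
sample = (
    "Source\tTarget\tWeight\tType_URL_v2\n"
    "kb.nl\twebggc.oclc.org\t167154\ta\n"
    "kb.nl\tcollecties.kb.nl\t166452\ta\n"
    "kb.nl\tyoutube.com\t166288\ta\n"
    "kb.nl\tdelpher.nl\t155586\ta\n"
    "kb.nl\twebwinkel.kb.nl\t152806\ta\n"
)


def top_targets(handle, source, n=3, delimiter="\t"):
    """Return the n heaviest (target, weight) pairs for one source website."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    edges = [
        (row["Target"], int(row["Weight"]))
        for row in reader
        if row["Source"] == source
    ]
    return sorted(edges, key=lambda edge: edge[1], reverse=True)[:n]


top = top_targets(io.StringIO(sample), "kb.nl")
print(top)
```

To run this on the downloaded file, pass an opened file instead of the in-memory sample, for example `open("2024_KB.csv", newline="", encoding="utf-8")`, where the encoding is again an assumption.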

When using this dataset, we ask you to cite it as follows:

I. Geldermans, Link analysis sample set (version 1, 28-08-2024), KB Lab, The Hague. https://lab.kb.nl/dataset/link-analysis-sample-set.

Examples

The preprocessed link analysis dataset, based on the kb.nl harvest described in the ‘Introduction’ tab.

Link analysis sample dataset kb.nl

2024_KB.csv (17.64 KB)