Link analysis sample set

    Introduction

    In 2022 and 2023, two researchers worked with the KB web collection, the first time in its history that it was used for research. This created the need to analyse the web data inside the WARC container files we hold. The result: datasets of hyperlinks. On this page you can find an example of the kind of dataset used in their research, based on a harvest of kb.nl.

    Analysing hyperlinks in the KB web collection

    I first extracted hyperlinks myself in 2022, in preparation for the arrival of Researcher in Residence Karin de Wild. You can read how I extracted the hyperlinks and worked with them in Gephi in the blog series ‘Analysing hyperlinks in the KB web collection’.

    I learned a lot from this first experiment. The main thing I discovered was that I wanted the HTML tags in my dataset as well as the hyperlinks. The tag tells you a lot about the type of hyperlink you have extracted, which you need for a thorough analysis. I also learned that the way you crop a hyperlink strongly affects the outcome of the analysis. I applied these lessons when Jesper Verhoef became our Researcher in Residence in 2023: changes to the original extraction script improved the dataset.
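    To make the idea of keeping the tag and cropping the hyperlink concrete, here is a minimal sketch using only the Python standard library. It is not the KB extraction script: the class name LinkCollector, the helper crop_to_host, the short list of URL attributes and the choice to crop to the host name without a leading "www." are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Attributes that can carry a URL (a deliberately short, illustrative list).
URL_ATTRIBUTES = ("href", "src")


def crop_to_host(url):
    """Reduce a URL to its host name, dropping a leading 'www.' (an assumed cropping rule)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


class LinkCollector(HTMLParser):
    """Collect (source host, target host, tag) for every http(s) link in a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                # Resolve relative links against the page they were found on.
                target = urljoin(self.page_url, value)
                if urlsplit(target).scheme in ("http", "https"):
                    self.links.append(
                        (crop_to_host(self.page_url), crop_to_host(target), tag)
                    )


html = '<p><a href="https://www.delpher.nl/">Delpher</a> <img src="/logo.png"></p>'
collector = LinkCollector("https://www.kb.nl/en")
collector.feed(html)
print(collector.links)
# [('kb.nl', 'delpher.nl', 'a'), ('kb.nl', 'kb.nl', 'img')]
```

    Cropping to the full URL, the host name or the registered domain would each give a different network, which is why the cropping rule matters so much for the analysis.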

    Fast forward to the end of 2022 and this collection was recognized as UNESCO heritage (https://www.kb.nl/en/news/kbs-xs4all-web-collection-unesco-world-heritage-list). But… Was Kees right?

    Example of how a hyperlink is embedded in a website (and therefore in a WARC file). Source: No Longer XS4ALL

     

    The dataset

    For those interested, I made a sample set based on a crawl of the kb.nl website:

    Target Instance ID: 38712298
    Target Name: Koninklijke Bibliotheek
    Schedule start: 06/07/2024 18:57:00
    URLs downloaded: 175,115 (limit: 700,000)
    Data downloaded: 21.03 GB (limit: 50 GB)
    Elapsed time: 12:14:54:33
    Seeds:
        https://inschrijven.kb.nl/
        https://galerij.kb.nl/
        https://www.kb.nl/ (primary seed)
        https://collecties.kb.nl/
    Excluded from harvest:
        .*field_categories.*
        http://acroeng.adobe.com/.*
        .*f%5B0%5D=.*
        .*\.rss.*
        ^[^/]+://[^/]*(youtube).*
        ^[^/]+://[^/]*(facebook).*
        ^[^/]+://[^/]*(google).*
    Important to note: Harvest finished by operator

    Table 1. Metadata from the kb.nl harvest.

    This target has four seeds, but the harvest was cut short by a quality assurance officer, so only part of the selected seeds was harvested. Because kb.nl is the primary seed, the harvester started there and did not get to the other seeds before it was stopped manually.
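    The exclusion rules in Table 1 are regular expressions. The sketch below shows how they could be tested against a URL in Python. It assumes that a pattern must match the whole URL, which is how I understand the harvester to apply such rules; the function name is_excluded and the example URLs are made up for illustration.

```python
import re

# Exclusion patterns copied from Table 1.
EXCLUDE_PATTERNS = [
    r".*field_categories.*",
    r"http://acroeng.adobe.com/.*",
    r".*f%5B0%5D=.*",
    r".*\.rss.*",
    r"^[^/]+://[^/]*(youtube).*",
    r"^[^/]+://[^/]*(facebook).*",
    r"^[^/]+://[^/]*(google).*",
]
COMPILED = [re.compile(pattern) for pattern in EXCLUDE_PATTERNS]


def is_excluded(url):
    """Return True if any pattern matches the complete URL (assumed semantics)."""
    return any(pattern.fullmatch(url) for pattern in COMPILED)


# Hypothetical URLs, only to show the effect of the rules.
for url in (
    "https://www.kb.nl/",
    "https://www.kb.nl/nieuws.rss",
    "https://www.youtube.com/watch?v=abc",
):
    print(url, is_excluded(url))
```

    Note that these rules only decide what is harvested. A link on a kb.nl page that points to an excluded host is still part of the harvested HTML, so it can still appear as a target in the link dataset.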

    Result

    You can find the whole dataset as a CSV file under the ‘examples’ tab, but here is a sneak peek (Table 2). In the datasets I used, the first column is the ‘source’: the website on which the link was found. The second is the ‘target’: the website to which the link refers. The third column is the weight: how many times the source website refers to the target website. The last column is the type of URL. In the first version of the script, this was only ‘anchor’ or ‘embedded’. After updating the extraction script, it now holds the specific tag that surrounds the link. In the KB dataset this is mostly ‘a’ for an anchor link, but you will also find some images and frames.

    Source  Target            Weight  Type_URL_v2
    kb.nl   webggc.oclc.org   167154  a
    kb.nl   collecties.kb.nl  166452  a
    kb.nl   youtube.com       166288  a
    kb.nl   delpher.nl        155586  a
    kb.nl   webwinkel.kb.nl   152806  a

    Table 2. Example of the data found inside the kb.nl_2024_links dataset.
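    The weight column can be thought of as a count over individual link occurrences. The following sketch shows one way such a table could be built. The records are invented, and counting per combination of source, target and tag is an assumption about how the rows are grouped, not a description of the actual script.

```python
from collections import Counter

# Hypothetical link occurrences: (source, target, tag), one per link found.
records = [
    ("kb.nl", "delpher.nl", "a"),
    ("kb.nl", "delpher.nl", "a"),
    ("kb.nl", "youtube.com", "a"),
    ("kb.nl", "delpher.nl", "img"),
]

# Count identical (source, target, tag) combinations; the count is the weight.
weights = Counter(records)

# Reorder into the column layout of Table 2: Source, Target, Weight, Type_URL_v2.
rows = [
    (source, target, weight, tag)
    for (source, target, tag), weight in weights.most_common()
]

for row in rows:
    print(row)
```

    With this grouping, the same source and target can appear on more than one row when the link occurs inside different tags.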

    It is important to note that this dataset does not reflect the kb.nl website as a whole. As mentioned above, the harvest was stopped manually. But even if the harvest had been completed, the dataset would only show what we managed to archive with Heritrix and WCT and what we managed to extract with our WARC-link-extraction script.

    Image: link network with kb.nl referring to a number of other websites. Some lines are thicker because kb.nl refers to those websites more often.

    Figure 1: Example of what you can do with the dataset. In this case it shows the websites kb.nl most often refers to. 
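    A ranking like the one behind Figure 1 can be approximated in a few lines. This sketch uses the five rows of Table 2 as inline sample data. It assumes the CSV is comma-separated with the column names shown in Table 2, and the function name top_targets is hypothetical.

```python
import csv
import io

# The rows of Table 2, written as comma-separated text (assumed delimiter).
SAMPLE_CSV = """Source,Target,Weight,Type_URL_v2
kb.nl,webggc.oclc.org,167154,a
kb.nl,collecties.kb.nl,166452,a
kb.nl,youtube.com,166288,a
kb.nl,delpher.nl,155586,a
kb.nl,webwinkel.kb.nl,152806,a
"""


def top_targets(csv_text, source, n=3):
    """Sum the weights per target for one source and return the n largest."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Source"] == source:
            target = row["Target"]
            totals[target] = totals.get(target, 0) + int(row["Weight"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


for target, weight in top_targets(SAMPLE_CSV, "kb.nl"):
    print(target, weight)
```

    To run this on the downloaded file instead, the io.StringIO object could be replaced by an opened file. The same rows can also be loaded into a network tool such as Gephi, where the weight decides how thick each line is drawn.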

    Citation

    When using this dataset, we ask you to cite it as follows:

    I. Geldermans, Link analysis sample set (version 1, 28-08-2024) KB Lab, The Hague. https://lab.kb.nl/dataset/link-analysis-sample-set


    The preprocessed link analysis dataset, based on the kb.nl harvest described in the introduction tab.

     

    Link analysis sample dataset kb.nl