Historical growth of the KB web archive

    Introduction

    Welcome to the dataset page of the KB web archive. The dataset page and its analyses will be updated every summer. 

    Very short introduction about the KB web archive & web archives in the Netherlands

    The KB started selecting a small collection of websites in 2006 and began archiving (or, in web archive terminology, harvesting) them in early 2007. The web archive represents a ‘reasoned selection’ of the Dutch domain: each website has been selected manually, with a focus on Dutch language, culture and history. The collection includes websites with innovative content that represent current trends on the Dutch domain. The KB actively searches for popular websites and for websites that are currently relevant to Dutch society. 

    The KB is not the only heritage institution in the Netherlands to archive selections of the Dutch web. According to our information, web archiving in the Netherlands was started in 2001 by the Dutch Documentation Centre of Political Parties (DNPP). This organisation harvests websites of Dutch political parties, politicians and political movements. There are also some specialised archives that run web archiving projects, such as the Groningen Archives (which collects websites about the Groningen region) and the Netherlands Institute for Sound and Vision (which collects a selection of websites of the Dutch broadcasting organisations). 

    For more information about the different web archives in the Netherlands please visit the Register Webarchieven website [Dutch]. 

    Web archiving process

    The KB primarily uses (and co-develops) the Web Curator Tool in combination with the web crawler Heritrix, developed by the Internet Archive. The Web Curator Tool (WCT) is a tool that lets non-technical users manage selective web harvesting. The underlying database is a PostgreSQL database.

    In the WCT the curator creates a Target Record for each website selected for harvesting. A Target Record may consist of one or multiple domains (seeds). An example is the KB Target Record which contains the seeds:

    The KB Lab website has its own Target Record. In most cases a Target Record contains one seed (as shown in the pie chart ‘Seeds per Target Record’).

    [Figure: Seeds per Target Record 2021-08]

    In the Target Record (TR) specific settings are determined, such as the schedule frequency (when and how often a website is harvested); the Target Record may also be enriched with basic metadata such as a subject code or a collection label. With the help of these data I will try to give a global overview of how the KB web archive has grown over time. All analyses are currently based on data from 2009 until 2020. The only exception is the analysis of the schedule frequencies: because that data is not available for 2020, that analysis uses data from April 2021 instead and therefore ends in 2021. 

    [Figure: Workflow]

    Based on the Target Record parameters the Web Curator Tool communicates with the web crawler Heritrix. WCT supplies the relevant information to the crawler: when to harvest the website and how. In the case of the KB website it will (among many other things) tell Heritrix to harvest two seeds:

    According to the Target Record parameters WCT will also tell Heritrix to set the data limit at 10 gigabytes and to ensure the website is harvested once every six months. Heritrix will then harvest the website on the set date(s) and time, putting every resource it finds within scope in a container file called a WARC file. This WARC file is stored in the archive; it is used to reconstruct the archived website in a Wayback Machine so that users can view it. The WARC file also contains a lot of metadata, such as when each resource (webpage, stylesheet, PDF, etc.) was harvested and how large it is. In the Web Curator Tool a Target Instance is created each time a Target Record instructs Heritrix to harvest a website. In the Target Instance record you can also find information about the result of the harvest and the log files. 

    However, as mentioned earlier, for the moment I am mostly interested in the Target Record information that tells Heritrix what to do.

    Source

    The Web Curator Tool, primarily a workflow management tool, does not retain old versions of the Target Record information. So when a curator changes the schedule frequency (for example), the former schedule frequency can no longer be found in that data field. To map the history of the web archive I therefore had to look for other sources that contained old Target Record information. The information I needed was found in the selection overviews. These overviews are made every couple of months. The most recent one is available on the website of the KB, where users can find an overview of the current web archive collection.

    Although not all selection overviews survived over time, I found enough original lists (print-outs from the WCT database) to continue this project. However, the available information required some pre-processing before it could be analysed. For example, Target Records appeared two or three times in the original overviews, so redundant records needed to be removed while keeping all relevant information. If a Target Record had two schedules it would appear twice in the original list. I made a second column (named schedule 2), transferred the second schedule to one of the records and removed the other record. This way I made sure that no information was lost. This process of deduplicating and cleaning (removing double or triple Target Records while retaining all relevant information) was done for nearly all original lists found. 
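
    The merging step described above could be sketched roughly as follows. This is a minimal illustration, not the procedure that was actually used: the field names (`name`, `state`, `schedule`), the helper `merge_duplicate_records` and the sample rows are assumptions made for the example, and it assumes that a Target Record can be identified by its name.

```python
def merge_duplicate_records(rows, max_schedules=3):
    """Collapse repeated Target Records into one row each (hypothetical helper).

    `rows` is a list of dicts with the assumed keys 'name' and 'schedule'.
    Additional schedules are moved into 'schedule 2', 'schedule 3', ...
    """
    merged = {}
    for row in rows:
        name = row["name"]
        if name not in merged:
            # First occurrence: keep all fields and start a list of schedules.
            record = {k: v for k, v in row.items() if k != "schedule"}
            record["_schedules"] = []
            merged[name] = record
        schedule = row.get("schedule")
        if schedule and schedule not in merged[name]["_schedules"]:
            merged[name]["_schedules"].append(schedule)

    result = []
    for record in merged.values():
        schedules = record.pop("_schedules")
        if len(schedules) > max_schedules:
            raise ValueError(f"more than {max_schedules} schedules for {record['name']}")
        # Write the schedules into fixed columns; unused columns stay empty.
        for i in range(max_schedules):
            column = "schedule" if i == 0 else f"schedule {i + 1}"
            record[column] = schedules[i] if i < len(schedules) else ""
        result.append(record)
    return result


# Made-up rows: site A appears twice because it has two schedules.
rows = [
    {"name": "Example site A", "state": "Approved", "schedule": "57 15 6 1/6 ? *"},
    {"name": "Example site A", "state": "Approved", "schedule": "0 12 1 1 ? *"},
    {"name": "Example site B", "state": "Approved", "schedule": "0 9 3 2/3 ? *"},
]
cleaned = merge_duplicate_records(rows)
```

    In this sketch the non-schedule fields are taken from the first occurrence of a record, so it only fits lists in which duplicates differ in their schedule alone.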

     

    Selection overview - data available

    X = data is available
    V = data is not there, but for a valid reason

    Creation date (year-month)   State (TR)   Schedule (TR) **                        Collection (TR) ***   Subject (TR)   Date selected (TR)
    2009-10                      X            X                                       V                     X              X
    2010-10                      X            X                                       V                     X              X
    2011-05 *                    X            X                                       V                     X              X
    2012-10                      X            X                                       V                     X              X
    2013-10                      X            X                                       V Started             X              X
    2014-12                      X            Not available                           X                     X              X
    2015-03 *                    X            X                                       X                     X              X
    2016-09                      X            X                                       X                     X              X
    2017-10                      X            X                                       X                     X              X
    2018-10                      X            X                                       X                     X              X
    2019-04 *                    X            X                                       X                     X              X
    2020-10                      X            Only 1 schedule available for each TR   X                     X              X
    2021-04 *                    X            X                                       X                     X              X

     

    * Because not all original lists survived, the months that were analysed differ per year. For the sake of consistency I tried to find original lists generated in or close to the month of October. But as you can see in the table, this was not always possible. Please keep this in mind when you view the diagrams below.

    ** Another notable gap is in the schedule information, which is missing (2014) or incomplete (2020). Consequently both 2014 and 2020 are missing from the analysis of schedule frequency. To compensate for the missing 2020 data I used data from April 2021.  

    *** In the case of the collection information, the KB started to create special web collections in 2013. So it makes sense that the first time this information appears is in 2014.

     

    I will explain each type of data in detail in their corresponding analysis, but here is an example of the Target Record data found in the original lists (after pre-processing). 

    Subject         01
    Name            Koninklijke Bibliotheek
    Primary seed    https://www.kb.nl/
    Date selected   2007-10-31 11:24
    State           Approved
    Schedule        57 15 6 1/6 ? *
    Schedule 2 *    [ empty ]
    Schedule 3 *    [ empty ]
    Subject         NL-blogosfeer

    * Column added during pre-processing.

     

    Analyses

    Before discussing the analyses, it is important to stress that I have mostly chosen bar charts. This is because of the irregularities in the months analysed and because schedule data is missing for some years. I found that bar charts visualised these deficiencies best.

    Total number of Target Records

    [Figure: Total Target Records]

    Goal: To give a picture of the growth in the number of Target Records. As of 2020 the web archive holds a little over 20,000 Target Records that are being harvested or have been harvested. 

    Data used: I used the column ‘date selected’ from an original list from April 2021. This column contains the date and time a Target Record was created. From this information I took the year and counted how many Target Records were created in each year. The graph is cumulative: for each year it shows all Target Records that existed up to and including that year, through 2020. Below you can see the number of new Target Records created each year.

    [Figure: Target Record created]
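
    The counting behind both graphs can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the ‘date selected’ values are assumed to be strings in the form shown in the example record (`YYYY-MM-DD hh:mm`), and apart from the first value the sample dates are made up.

```python
from collections import Counter
from itertools import accumulate


def records_per_year(date_selected_values):
    """Count new Target Records per year from 'YYYY-MM-DD hh:mm' strings."""
    years = Counter(int(value[:4]) for value in date_selected_values)
    return dict(sorted(years.items()))


def cumulative_totals(per_year):
    """Turn per-year counts into running totals (the 'total' graph)."""
    years = list(per_year)
    totals = accumulate(per_year[year] for year in years)
    return dict(zip(years, totals))


# Made-up sample in the format of the 'date selected' column.
sample = ["2007-10-31 11:24", "2007-11-02 09:15", "2009-03-14 16:40"]
new_per_year = records_per_year(sample)
running_total = cumulative_totals(new_per_year)
```

    Years without any new Target Record do not appear in the result; for a chart they would have to be filled in with a count of zero.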

    State

    Goal: Now that we know how many Target Records there are in total, it is interesting to know which ones were being actively harvested. Not all Target Records are active. Some have ended. 

    Data used: For this I used all original lists mentioned above. The column used was ‘State’. 

    [Figure: State]

    In the bar chart ‘State’ you see three types of Target Records:

    1. Active
    2. Ended
    3. Completed

    With active Target Records, for the most part, Heritrix still attempts to harvest the seeds. For more information about how ‘active’ an active Target Record is, see the analysis of the schedule frequency below. 

    Completed Target Records have finished their schedules. No new harvest has been scheduled (yet). A Target Record can be completed in one year and be active in the next if it is given a new schedule by a curator.

    Ended Target Records have stopped and are no longer harvested (even if they have a schedule). A curator has to end a Target Record manually and can do so for several reasons. The most common reason is that the website is no longer online. A website can be completely gone, but it is also possible that it has migrated to a new domain. In the latter case, if the domain has changed significantly, a new Target Record is made; otherwise the hyperlink is simply updated within the existing Target Record (which remains active). A new Target Record is also created if several websites merge into one new website. This sometimes happens with project websites: the project first has its own website, but once it is completed the site becomes part of an existing website of the organisation behind it. It also happens when different businesses merge. One or more Target Records are then stopped and continued as a new Target Record. A final reason for ending a Target Record is that a website (or part of a website) was archived for a special collection. Normally the KB archives whole websites, but for special collections (the Coronavirus COVID-19 collection, for example) we also archive specific parts of websites for a shorter period of time. These seeds get their own Target Records, which can be stopped when the special collection or event has ended. 

    Schedule frequency

    Goal: Let’s take a closer look at the Target Records which are still active. Within this group I wanted to give an overview of how often they are harvested and if this has changed over time. A Target Record can have multiple schedules.

    Data used: I used the original lists from 2009 until 2013, 2015 until 2019 and 2021. As mentioned above, in 2014 this data was missing and in 2020 it was incomplete. For 2021 I am again using the list from April because 2020 was incomplete. I only used schedules from Target Records that had an active state. 

    Pre-processing: Earlier on this page there was an example of a schedule: ‘57 15 6 1/6 ? *’. Schedules in Heritrix are written as cron expressions. Because type information was missing, I manually gave each schedule a type by looking for strings that indicate its frequency (such as ‘/6’ for twice a year and ‘/3’ for quarterly). 
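
    The typing was done by hand, but the idea can be sketched in code. The function below is a hypothetical illustration, not the method used for this page. It assumes that the six fields of a schedule are minute, hour, day of month, month, day of week and year (inferred from the example above). Only the ‘/6’ and ‘/3’ rules come from the text; the rules for ‘/2’ and for yearly schedules are my own assumptions.

```python
def classify_schedule(cron):
    """Give a schedule string a frequency label (hypothetical helper)."""
    if cron is None or not cron.strip():
        return "None"  # active Target Record without any schedule
    fields = cron.split()
    if len(fields) != 6:
        return "Custom"  # unexpected layout: leave for manual inspection
    day_of_month, month, year = fields[2], fields[3], fields[5]
    if month.endswith("/6"):
        return "Every six months"  # rule mentioned in the text
    if month.endswith("/3"):
        return "Quarterly"  # rule mentioned in the text
    if month.endswith("/2"):
        return "Bi-Monthly"  # assumption: every other month
    if month.isdigit() and day_of_month.isdigit() and year == "*":
        return "Yearly"  # assumption: one fixed day in one fixed month, every year
    return "Custom"  # daily, weekly, monthly, single-year schedules, etc.


labels = [classify_schedule(s) for s in ("57 15 6 1/6 ? *", "0 9 3 2/3 ? *", "")]
```

    Anything the rules do not recognise ends up in ‘Custom’, which matches the heterogeneous character of that group but means the output still needs a manual check.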

    Analyses

    1. Combination of schedules a Target Record has (one, two or three schedules).
    2. Overview of the number of Target Records with multiple schedules (two or three).
    3. Schedule frequency of all schedules total. Analysis of all available schedules. Keep in mind that a Target Record may have multiple schedules as shown in analyses one and two.
    4. Custom schedules. Analysis of all types of custom schedules from April 2021.
    5. ‘To Date’. Analysis of schedules with an end date. To what collection do they belong and have the Target Records they belong to really ended?
    6. Schedule frequency of all schedules without ‘Yearly’. Analysis of all available schedules except the group ‘Yearly’. Keep in mind that a Target Record can have multiple schedules as shown in analyses one and two. 
    [Figure: Number of schedules for each Target Record]
    [Figure: Target Records with multiple schedules]
    [Figure: Schedule frequency total]

    In the bar chart ‘Schedule frequency total’ you’ll see several types of schedules:

    • Yearly
    • Every six months
    • Quarterly
    • Bi-Monthly (every other month)
    • None
    • Custom

    Custom schedules are a heterogeneous group. Within this group there are daily, weekly and monthly schedules. There are also schedules that run for only one particular year. This means that after that year, although the state of the Target Record is still active, the schedule is inactive and the Target is no longer harvested. 

    Another group that has the state ‘active’, but is not being harvested, is the group ‘None’ with no schedule at all.

    [Figure: Custom schedules 2021-04]

    The last challenge is that some schedules have a ‘To Date’, an end date that can be specified manually. Because this information was not in the original lists, I could only analyse how many there were in 2021. In July 2021 there were a little over 300 schedules with a ‘To Date’ in an active Target Record (according to database analysis). Most of these are ‘Quarterly’ schedules from the Society critical websites collection. Target Records belonging to this collection received two schedules: a quarterly schedule with a ‘To Date’ (ending in Q4 2020) and a yearly schedule with a ‘Start date’ (starting in 2021) but without a ‘To Date’. These Target Records are therefore still active. 

    It is important to keep the ‘To Date’ in mind: a schedule whose end date has passed still appears as active in this analysis, although it no longer triggers any harvests. 

    [Figure: To Date analysis collection 2021-07]

    As you can see, harvest frequency has changed significantly over time. At the start of the web archive Target Records were mostly harvested ‘Quarterly’ or ‘Every six months’. In 2015-2016 ‘Yearly’ became the most used schedule frequency. You can see the decline in ‘Quarterly’ and ‘Every six months’ schedules more clearly if you leave the yearly schedules out of the analysis.

    [Figure: Schedule frequency without yearly]

    Special Collections

    The KB started assembling special collections for the web archive in 2013, in a similar fashion to the UK Web Archive. A Target Record may belong to multiple special collections. It can be given a collection label when it is created, but it can also receive a collection label retrospectively. The Wadden collection, for example, started in 2017. The curator searched for existing Target Records that belonged to the collection and added the collection tag. This does not happen for all collections though. 

     

    The special collections are (with English translation and explanation if necessary):

    100e Vierdaagse
        English translation: The Four Days Marches, 100 year anniversary
        Closed *: Y
        Explanation: In 2016 it was the 100 year anniversary of the yearly walking event at Nijmegen.

    200 jaar Koninkrijk
        English translation: Bicentennial of the Kingdom of the Netherlands
        Closed *: Y
        Explanation: From 2013 until 2015 the Netherlands celebrated their independence and founding of the constitution in 1815.

    Ambassades en consulaten
        English translation: Embassies and consulates
        Explanation: Web collection of Dutch embassies and consulates in other countries. Cause for this collection was the possibility that some embassies (and their websites) would disappear (2014). This didn’t happen. What did happen in 2017, was that all websites merged into one big website.

    Bedrijf- en productschappen
        English translation: No English equivalent **
        Closed *: Y
        Explanation: 1 January 2015, all ‘bedrijfs- en productschappen’ were dissolved.

    Caribisch Nederland
        English translation: Caribbean Netherlands
        Explanation: Websites related to or from the Caribbean Netherlands.

    Carnaval
        English translation: Carnival
        Explanation: Websites about Dutch Carnival celebrations.

    Chinees Nederland
        English translation: Chinese Netherlands
        Explanation: Web collection about the Chinese immigrant community in the Netherlands.
        Further reading: KB Lab page - web collection Chinese Netherlands

    Coronavirus COVID-19
        Explanation: Web collection about the (ongoing) corona pandemic. It is also a Dutch contribution of Dutch websites to the international IIPC collection. ***
        Further reading: IIPC collection – Novel Coronavirus (COVID-19) – Dutch contribution

    Dienst Landelijk Gebied
        English translation: Service Rural Area
        Closed *: Y
        Explanation: Was a part of the ministry of economic affairs. Its purpose was to develop rural areas. Organisations and websites were dissolved 1 March 2015.

    Internetarcheologie
        English translation: Internet archaeology
        Explanation: Collection of early internet websites (older than the year 2001).

    Kloosters
        English translation: Monasteries
        Explanation: Also websites from orders, congregations and abbeys in the Netherlands.

    LGBT
        Explanation: Collection of websites about and by the LGBT community in the Netherlands. Made in collaboration with IHLIA, a heritage organisation involved in collecting information about the LGBTI community and making it accessible.
        Further reading: IHLIA about collaboration between IHLIA and the KB [Dutch]

    Maatschappijkritische websites
        English translation: Society critical websites
        Explanation: Socially critical websites from or about the Netherlands in the period 1993 – 2017.
        Further reading: Collection description Society critical websites on Zenodo [Dutch]

    Nederland in de Eerste Wereldoorlog
        English translation: The Netherlands during World War One
        Closed *: Y
        Explanation: Dutch contribution of Dutch websites as a member library of the International Internet Preservation Consortium (IIPC), to provide a transnational perspective on the war's centennial commemoration. ***
        Further reading: IIPC collection – World War I Commemoration – Dutch contribution

    NL-blogosfeer
        English translation: Dutch blogosphere
        Explanation: Collection of Dutch weblogs.
        Further reading: KB Lab page – Web collection NL-blogosfeer

    Profvoetbal 2015/16
        English translation: Professional soccer
        Closed *: Y
        Explanation: Collected in 2015-2016.

    Refo500
        Closed *: Y
        Explanation: Commemorating 500 years (Protestant) Reformation in 2017.

    Sinterklaas
        English translation: Sinterklaas, Saint Nicholas
        Explanation: Local feast that takes place each year at end of November until 5 December.

    Tocht naar Chatham (1667)
        English translation: Raid on the Medway
        Closed *: Y
        Explanation: Attack by the Dutch Navy on English warships in 1667.

    Wadden
        English translation: Wadden Sea
        Explanation: Websites concerning the Wadden; an island and coast area in the north of the Netherlands. Made in collaboration with Tresoar, the repository of the history of Fryslân (a Dutch province).

    XS4ALL
        English translation: Dutch website provider named XS4ALL
        Explanation: Collection of websites from the provider XS4ALL.
        Further reading: Collection description XS4ALL-homepages on Zenodo [Dutch]

     

    * Collections that have ended (2021). No new Target Records will be added. 

    ** A ‘bedrijfschap’ was a sectoral organisation under public law in the Netherlands of companies that worked in the same industry. A ‘productschap’ was a sectoral organisation under public law in the Netherlands of companies that worked with (or processed) the same raw material in successive stages. Both mostly occurred in the agricultural sector. 

    *** All collections mentioned here are collections in the KB archive, but the KB also contributes (by selecting websites) to transnational web collections such as the Olympic Summer 2012. These collections are defined and archived by the IIPC. The Netherlands during World War One and the COVID-19 collections are exceptions, as they also became collections in the KB repository. For more information about the IIPC collections the KB contributes to, visit the web archive selection page on the KB website.

     

    Data used: I used all collection tags available and simply counted them. All original lists were used because the data was complete in all lists. All states were used as well. Please keep in mind that there is some inconsistency in month as explained at the start of this article. 

    For the analysis I have divided the collections into three groups: collections that have fewer than one hundred Target Records, collections that have between one hundred and one thousand Target Records, and collections that have more than one thousand Target Records (as of 2020). 

    The group larger than one thousand Target Records only contains the XS4ALL and Internetarcheologie collections. This is partly because lots of XS4ALL websites were older than the year 2001 and therefore also received the Internetarcheologie tag. This also explains the growth in the Internetarcheologie collection. 
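
    The division into three groups can be expressed as a small sketch. The function names and the example counts are hypothetical (the collection names are placeholders, not real figures from the archive), and it assumes that the middle group includes exactly 100 and exactly 1000.

```python
def size_group(count):
    """Return the size group for a collection with `count` Target Records."""
    if count < 100:
        return "smaller than 100"
    if count <= 1000:
        return "100 - 1000"
    return "larger than 1000"


def group_collections(counts):
    """Sort collection names into the three size groups."""
    groups = {"smaller than 100": [], "100 - 1000": [], "larger than 1000": []}
    for name, count in sorted(counts.items()):
        groups[size_group(count)].append(name)
    return groups


# Placeholder counts for illustration only.
example_counts = {"Collection A": 1500, "Collection B": 250, "Collection C": 12}
groups = group_collections(example_counts)
```

    Because a Target Record may carry several collection tags, the counts passed in here are tag counts, so one Target Record can contribute to more than one collection.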

    [Figure: Collections larger than 1000]
    [Figure: Collections 100 - 1000]
    [Figure: Collections smaller than 100]

    Subject

    When creating a Target Record, the curator also adds a subject code. A Target Record can have a maximum of 2 subject codes. 

    Data used: I used all subject codes available and simply counted them. All original lists were used because the data was complete in all lists. All states were used as well. 

    [Figure: Number of subject codes]

    Top 10 most used subject codes (2020)

    Keep in mind that a Target Record may have 1 or 2 subject codes, as shown in the analysis above. There are 33 subject codes in total. These are the 10 most used subject codes in October 2020.

    [Figure: Line graph showing the top 10 most used subject codes]
    Top 10   Subject code   Subject text
    1        23             Urban planning, architecture, art, photography, film, radio, television
    2        31             History, biography
    3        24             Spare time, sports and games
    4        1              General
    5        19             Technology, industry *, crafts
    6        6              Law, government administration
    7        4              Sociology, statistics
    8        9              Commerce
    9        3              Religion, theology
    10       19             Medicine

     

    * The Dutch term used for industry is ‘nijverheid’. It indicates small scale workshops and industry at home as well as large industry. 
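
    Counting the most used subject codes when a Target Record can carry one or two codes can be sketched like this. The helper name and the sample records are made up for the illustration; the sketch assumes that each Target Record is represented by a tuple of its subject codes.

```python
from collections import Counter


def top_subject_codes(records, n=10):
    """Return the n most used subject codes as (code, count) pairs."""
    counts = Counter()
    for codes in records:
        # A set guards against the same code being listed twice on one record.
        counts.update(set(codes))
    return counts.most_common(n)


# Made-up records with one or two subject codes each.
sample_records = [("23",), ("23", "31"), ("31", "24"), ("23",)]
top_two = top_subject_codes(sample_records, n=2)
```

    Since every code on a record is counted, the counts add up to more than the number of Target Records whenever records have two codes.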

     

    Target Instances

    At the start of this article I mentioned that I was mostly interested in Target Records. However, I have also made a first attempt at counting the number of harvests for each year. For this I selected the Target Instances that harvested more than one page, excluding the Target Records that are crawled daily. This resulted in a nice first impression, but it is not yet as clean as I’d like. It still contains some failed harvests. 

    Goal: Give a rough impression of how many successful harvests there have been for each year. If you combine them you have the total number of harvests available for research. 

    Data used: In this case I ran my own query on the Web Curator Tool PostgreSQL database and received the information directly from there. 

    [Figure: Total number of Target Instances per year]

    Technical Description

    This is a short overview of the techniques the KB web archive used to archive the websites.

     

    Date                           Web Curator Tool     Heritrix
    September 2007 - August 2019   WCT 1.14.4           Heritrix 1
    August 2019 - August 2020      WCT 2.0.2            Heritrix 3.3.0-SNAPSHOT
    August 2020 - May 2021         WCT 3.0.0-SNAPSHOT   Heritrix 3.3.0-LBS-2016-02
    May 2021 - present day         WCT 3.0.0            Heritrix 3.3.0-LBS-2016-02

     

    Researching the web archive

    Because there is no legal deposit in the Netherlands, and for copyright reasons, the collection can only be studied in the reading rooms of the library. Visit the KB website for more information about visiting the web archive.

     

    Further reading

    For more information about which websites are in the current collection, visit the web archive selection page on the KB website.

    For more general information about the KB web archive and the web archiving team visit the web archive page on the KB website.

     

    Citation

    When using this dataset we ask you to cite it as follows:

    I. Geldermans, Historical growth of the KB web archive (version 1, 01-10-2021) KB Lab, the Hague. https://lab.kb.nl/dataset/historical-growth-kb-web-archive.

    This article is governed by the Creative Commons Attribution 4.0 International Licence. Please use this attribution: ©2021 National Library of the Netherlands/Iris Geldermans, CC BY 4.0.