In early 2019, Kees Teszelszky, then curator of Digital Born Collections, raised the alarm: XS4ALL websites were at significant risk of disappearing from the live web. The cause? KPN had announced the end of its subsidiary XS4ALL. Multiple colleagues within the KB (myself included) rose to the challenge and started archiving these websites under the supervision of a collection specialist. Fast forward to the end of 2022, and this collection was recognized as UNESCO heritage. But… was Kees right? What happened to the XS4ALL websites on the live web? Join me as I investigate how the collection quietly started disappearing from the live web.
I started looking into the XS4ALL websites in June 2023, after a colleague pointed out that KPN had announced it would stop hosting ‘home.kpn.nl’ pages. This reminded me of the XS4ALL domain that KPN also hosts. As I had been involved in building the collection, I was curious what had happened to it on the live web, so I started investigating.
Do we still archive it?
First I checked how the collection was built and whether we were still archiving it. As of June 2023 the collection comprised 3,261 websites. Because the KB has a selective collection, each website has its own database entry, called a Target Record. This makes counting which ones are still active an easy process. The collection was built during 2019/2020, with most websites archived in 2021 (graph 1). After that year the active portion of the collection started to diminish a little. This is because of our QA process: when a website is no longer online, its Target Record is closed and the website is no longer harvested. In 2021 and 2022 together, 84 websites were closed. Not that many, considering the thousands we had archived. So far so good.
What is actually archived?
Next I had a look at the harvest results. Because we have a selective archive, where we archive one website at a time, I can also easily check the metadata of each harvest result. For instance, I can check the number of bytes harvested. This is one of the criteria we use for the quality assurance (QA) of our web collections. Four years of performing QA has taught our team that when a harvest is smaller than 1 MB, we should check whether the website is still online. In these cases it is likely that the website is either offline or has been migrated to another domain. However, this threshold applies to modern websites. XS4ALL websites are much older, mostly dating from 1993 to 2005, and therefore smaller. Sometimes a website consists of only one or two pages! This is why I used a limit of 1 KB for the QA of the XS4ALL websites. Below that limit, experience teaches us, a website is 99% likely to be offline. I plotted the results in a graph, and started getting worried…
I found the difference between 2019/2020 and 2021/2022 already pretty notable: the percentage of unsuccessful harvests increased significantly. But in the first six months of 2023, more than half of the harvested websites failed to harvest successfully. In absolute numbers this is about a third of the whole collection. I verified this on the live web and found that 95% of the websites with less than 1 KB harvested were indeed gone.
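The size-based check described above can be sketched in a few lines of Python. This is only an illustration, not the KB's actual tooling: the function names and the sample data are made up, and treating 1 KB as 1024 bytes is an assumption (the text only says "1 KB").

```python
from collections import defaultdict

# Assumption: "1 KB" is interpreted as 1024 bytes.
XS4ALL_MIN_BYTES = 1024


def is_suspect(bytes_harvested, threshold=XS4ALL_MIN_BYTES):
    """Return True when a harvest is so small that the site is probably offline."""
    return bytes_harvested < threshold


def suspect_share_per_year(harvests, threshold=XS4ALL_MIN_BYTES):
    """Map each year to the fraction of its harvests that fall below the threshold.

    `harvests` is an iterable of (year, bytes_harvested) pairs.
    """
    totals = defaultdict(int)
    suspects = defaultdict(int)
    for year, size in harvests:
        totals[year] += 1
        if is_suspect(size, threshold):
            suspects[year] += 1
    return {year: suspects[year] / totals[year] for year in sorted(totals)}


# Made-up sample data, for illustration only.
sample = [
    (2019, 250_000), (2019, 48_000),
    (2023, 512), (2023, 90_000), (2023, 0), (2023, 300),
]
shares = suspect_share_per_year(sample)
print(shares)  # {2019: 0.0, 2023: 0.75}
```

A harvest flagged this way is only a candidate for a manual check: a tiny result can also mean the site has moved to another domain.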
Log files: HTTP(S) response codes
In late 2023/early 2024 we received a research proposal to study our XS4ALL collection. To assist the researcher I returned to this project after a short break. Up to this point I had only looked at metadata from our web archiving software, the Web Curator Tool (WCT), but I felt this was not enough to help the researcher understand the quality of our XS4ALL collection. As I had determined, a lot of websites were still being harvested while they were no longer online. I wondered whether it was possible to get a better picture of the quality of each harvest.
A colleague of mine had previously, at my request, examined a group of almost 19,000 XS4ALL homepages on the live web. Almost 55% of this group returned the well-known 404 (Page not found) response, matching my conclusions based on the 2023 metadata. This inspired me: what if we could get the response status from the WARC files (where the actual archived data is stored)? This would give us a better view of the state of a website at the time of harvesting!
Once again I turned to a teammate and told him of my ideas. For practical reasons we used the log files (the crawl.log) of the Heritrix crawler instead of the WARC files. My colleague created a list of all harvested URLs from the XS4ALL collection based on WCT data. Next he extracted the first 20 lines of each crawl.log, where the response code was found, and matched them against the URLs from the WCT data. Now we had the response code of every harvested URL for each archived version! With these results I could once again analyse the quality of the collection.
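To give an idea of what such an extraction might look like, here is a minimal Python sketch. It is not my colleague's actual script. It assumes the usual whitespace-separated crawl.log layout, in which the second field is the fetch status code and the fourth field is the URI. The function names, the host x.home.xs4all.nl and the log lines themselves are invented for illustration.

```python
from itertools import islice


def parse_crawl_log_line(line):
    """Return (status_code, url) for one crawl.log line, or None if it is malformed.

    Assumption: fields are whitespace-separated, with the fetch status code
    in the second field and the URI in the fourth.
    """
    fields = line.split()
    if len(fields) < 4:
        return None
    try:
        status = int(fields[1])
    except ValueError:
        return None
    return status, fields[3]


def seed_status(log_lines, seed_urls, max_lines=20):
    """Find the response code of each seed URL in the first `max_lines` log lines."""
    wanted = set(seed_urls)
    found = {}
    for line in islice(log_lines, max_lines):
        parsed = parse_crawl_log_line(line)
        if parsed is None:
            continue
        status, url = parsed
        # Keep only the first hit per seed URL.
        if url in wanted and url not in found:
            found[url] = status
    return found


# Invented log lines in the assumed layout, for illustration only.
sample_log = [
    "2023-03-01T10:00:00.001Z 1 52 dns:x.home.xs4all.nl P http://x.home.xs4all.nl/ text/dns #002 20230301100000000+1 sha1:EXAMPLE - -",
    "2023-03-01T10:00:00.050Z 404 512 http://x.home.xs4all.nl/robots.txt P http://x.home.xs4all.nl/ text/html #001 20230301100000040+9 sha1:EXAMPLE - -",
    "2023-03-01T10:00:00.123Z 404 512 http://x.home.xs4all.nl/ - - text/html #001 20230301100000100+20 sha1:EXAMPLE - -",
]
result = seed_status(sample_log, ["http://x.home.xs4all.nl/"])
print(result)
```

Reading only the first lines of each log keeps the job light. It works here because the seed URL of a small, single-site harvest is fetched almost immediately, right after the DNS lookup and robots.txt.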
Page not found…
So again: in 2019/2020 everything was fine. Most of the websites were harvested successfully. This is important, as it was the period in which most websites were harvested for the first time while we were building the collection. In 2021 serious problems began to emerge: about 10% of the URLs returned a 403 (Forbidden) response, meaning the website was inaccessible. In 2022 the problems became more severe and more diverse. Almost 40% of the websites were not harvested, because of a range of HTTP response codes. Besides a 403 response, websites started vanishing due to a 404 (Not Found) or a 302 (temporarily moved) response. The 302 response was an interesting one. Close reading of this group revealed that when the domain was an x.home.xs4all.nl/ URL, it usually redirected to a https://www.xs4all.nl/unknownuser/xs4all/x URL. There you landed on the ‘Oops, page not found’ 404 page of the main XS4ALL website.
By 2023, the collection's state on the live web had deteriorated significantly. The websites that were unreachable in 2022 (because of the 302 and 403 responses) now returned the general 404 (Not Found) response code, marking the end of these websites. More than half of the websites we tried to harvest had disappeared from the live web.
But… why?
What could be the reason for the disappearance of these websites? Is KPN responsible? Since 2022 KPN has been integrating XS4ALL into its own infrastructure. From 2023 onwards XS4ALL began using KPN technology, and from then on creating a new XS4ALL homepage became impossible. This aligns with KPN's decision to stop hosting its own homepages (http://home.kpn.nl/ websites). Or is it because of the new competitor, “Freedom Internet”? It was founded on November 11, 2019, in response to KPN's takeover of XS4ALL, and customers might have moved there, ending their XS4ALL accounts.
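For readers who want to reproduce this kind of overview, the per-year breakdown of response codes can be computed roughly as follows. This is a hedged sketch rather than the analysis I actually ran: the function name is hypothetical and the records are invented, chosen only so that the arithmetic is easy to follow.

```python
from collections import Counter, defaultdict


def status_shares_per_year(records):
    """Return {year: {status_code: percentage}} for (year, status_code) records."""
    counts = defaultdict(Counter)
    for year, status in records:
        counts[year][status] += 1

    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {
            status: round(100 * n / total, 1) for status, n in counter.items()
        }
    return shares


# Invented records (year, response code), for illustration only.
records = [(2021, 200)] * 9 + [(2021, 403)] + [(2023, 404), (2023, 404), (2023, 200)]
overview = status_shares_per_year(records)
for year in sorted(overview):
    print(year, overview[year])
```

Working with percentages per year rather than absolute counts matters here, because the number of websites harvested differs from year to year.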
Whatever the reason: I am very happy that Kees sounded the alarm back in 2019. Because even though many websites are now disappearing from the live web, the KB still has an extensive collection of XS4ALL websites safely stored in its archive.
Related research
When Online Content Disappears
38% of webpages that existed in 2013 are no longer accessible a decade later & Methodology
By Athena Chapekis, Samuel Bestvater, Emma Remy and Gonzalo Rivero.
https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-di…
https://www.pewresearch.org/data-labs/2024/05/17/methodology-link-rot/