I recently attended a collaborative session on documentation in the research software domain. The session was part of a meetup of research software engineers in the Netherlands (NL-RSE, https://nl-rse.org/) that took place online on March 13, 2020. The session was facilitated by Emmy Tsang and Hakim Achterberg and had approximately 20 participants.
We discussed several topics regarding documentation. What makes documentation good or bad? What are challenges in writing good documentation? The objective was to learn from each other, and crowdsource a list of experiences and resources.
It turns out there is a lot to be said about this topic. In this blog, I try to summarise our collaborative findings. At the end, I provide the list of resources that were suggested, which can serve as starting points for further reading.
In the research domain, many software projects are open source. In theory, this allows for reviewing, replication and expansion of the research, which is crucial in this domain. In practice, however, the software is quite often poorly documented. The software is usually tailored to the specific project or data format, and reusing the code requires a lot of perseverance, at the very least.
In the domain of Open Data, a distinction is made between 'open by default' and 'open by design' (see e.g. https://data.govt.nz/blog/open-by-design/). The same distinction can be made in Open Source. Simply putting the source code online as is, with some minor in-line comments to sort of document it, is 'open by default' and is often done in (smaller) research projects. There are many reasons why this is customary, but it sits poorly with the demand for replicability. Making source code 'open by design' requires us to think about future usage of the code: what contextual information is needed in order to replicate experiments? How can the same code be used for different datasets and in different set-ups? This means that we need to put effort not only into (re)structuring code, but even more so into documentation.
In some sense, documentation writing is much like any other writing. The basic requirements for good documentation that we agreed on during the meetup are the same ones you learn in any (academic) writing course:
● Having a good introduction is crucial. One needs to be able to understand what the software is about and in what context it operates, to determine whether it is relevant for their use case.
● Documentation should have a clear and useful structure. The (sub)sections should be well-defined and have self-explanatory names.
Of course, documentation is also different from many other pieces of writing. For one, people typically do not read all documentation from top to bottom. Rather, they search for a specific piece of information, for instance on installation or certain functionality.
● Browsing the document to find information on a function or version should be easy, so a table of contents, index and/or search function is essential.
Another thing that distinguishes documentation from other writing is that there is no single target audience: documentation aims to inform end users as well as developers, both developers who use the package as a building block and those who maintain it or develop it further. Technical documentation is typically aimed at developers, but will probably not cater to end users. Documentation aimed at first-time users should not be forgotten. It seems wise to make it really easily accessible, for example through CLI --help, a README.md and/or a GitHub Wiki.
Documentation is not one single thing; it comes in different types. The following list was put forward in the discussion (original description at https://www.divio.com/blog/documentation):
● Tutorials: aimed at completely new users in order to familiarize themselves with a package
● How-to guides: aimed at users who want to start implementing their own thing
● Reference: the documentation you use to find the details once you know what functions to use
● Discussion: for background information and more for power users/interested users
Another distinction that can be made is with regard to the audience:
● Beginners (or browsers): e.g. should I install this software?
Side remark: installation documentation is often recursive: it calls upon the installation guides of other libraries and tools, which may not be maintained (both the tools and their documentation).
● Intermediate: e.g. how do I use this function or achieve this particular task?
● Advanced: e.g. how do I mount this on Docker?
Because of all the different use cases, not all documentation may be put in the same place: technical documentation may be provided mostly in Python docstrings, a quick instruction offered in CLI --help, and an installation guide might be best positioned in a README.md. This does make it challenging to maintain all the documentation, though. Of course, the size, scope and nature of the project will also determine the amount and type of documentation that is needed.
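To make the split between docstrings and CLI --help a bit more concrete, here is a minimal sketch in Python. The tool name `c2f` and the function `celsius_to_fahrenheit` are made up for illustration and were not discussed in the session. The sketch reuses the first line of the docstring as the description shown by --help, which is one possible way to keep two of these places in sync.

```python
import argparse
import inspect


def celsius_to_fahrenheit(celsius: float) -> float:
    """Return the temperature in degrees Fahrenheit.

    This docstring is the technical documentation for developers
    who import the function (hypothetical example).
    """
    return celsius * 9 / 5 + 32


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser; its help text is the quick instruction."""
    # Reuse the first docstring line so --help and the docstring share one source.
    summary = inspect.getdoc(celsius_to_fahrenheit).splitlines()[0]
    parser = argparse.ArgumentParser(prog="c2f", description=summary)
    parser.add_argument("celsius", type=float, help="temperature in degrees Celsius")
    return parser


def main(argv=None) -> None:
    """Parse the arguments and print the converted temperature."""
    args = build_parser().parse_args(argv)
    print(celsius_to_fahrenheit(args.celsius))


if __name__ == "__main__":
    # Demo call with a fixed argument; a real tool would call main()
    # so that the arguments are read from the command line.
    main(["100"])
```

Calling `build_parser().print_help()` shows the text an end user would see, while `help(celsius_to_fahrenheit)` shows what a developer would read. The installation guide in the README.md would still have to be maintained by hand.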
Many developers I've encountered are reluctant to write documentation. Not everyone likes to write, and it can be quite time-consuming, which not all managers acknowledge. And of course, being good at writing software does not automatically make one a good writer too. Documentation aimed at end users in particular also requires a certain 'theory of mind': what background knowledge can you assume? This can be hard to judge for a developer who has been intensively involved in a project.
This raises the question: should writing documentation be left to developers, or is it a profession in its own right? One solution could be to have a dedicated 'Documentation engineer' or technical writer. Using such a title makes the effort visible to both management and developers, and could urge hesitant developers to at least cooperate.
Another interesting approach is to create automated tools (beyond docstrings) that help build and maintain documentation. Check out the resources listed at the end of this blog. A question that arose was whether continuous integration could be applied to documentation. Also, it was suggested that tests can be an interesting starting point for technical documentation.
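One existing way to let tests and documentation overlap is Python's standard `doctest` module, which runs the examples written in a docstring and checks their output. This was not discussed in detail during the session; the sketch below is only an illustration, and the function `normalise` is a made-up example.

```python
import doctest


def normalise(values):
    """Scale a list of numbers so that they sum to 1.

    The examples below are documentation and tests at once:

    >>> normalise([1, 1, 2])
    [0.25, 0.25, 0.5]
    >>> normalise([])
    []
    """
    total = sum(values)
    if total == 0:
        # Nothing to scale; return a copy of the input.
        return list(values)
    return [v / total for v in values]


def check_examples():
    """Run the docstring examples of normalise; return the doctest results."""
    parser = doctest.DocTestParser()
    # Turn the docstring into a test object that knows the function name.
    test = parser.get_doctest(
        normalise.__doc__, {"normalise": normalise}, "normalise", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


if __name__ == "__main__":
    result = check_examples()
    print(f"{result.attempted} examples run, {result.failed} failed")
```

If a check like this runs on every commit (for a whole file, `python -m doctest` does the same job), the examples in the documentation cannot silently drift away from the code. That is one modest form of continuous integration for documentation.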
One may want to user test or 'friction log' (see e.g. https://lkloh.github.io/2018/07/12/friction-logs.html) the interaction with the documentation to find out what works and what doesn't. It might even be possible to automatically assess what parts of documentation users struggle with, although this is probably feasible only for bigger projects with many users.
Another approach that was mentioned was literate programming. Writing documentation while writing code is recommended, but involves a lot of task switching, which can be challenging. The idea of literate programming is to write your code and your documentation at the same time. See https://en.wikipedia.org/wiki/Literate_programming and an implementation at https://entangled.github.io/entangled/.
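To give a feel for the idea, here is a toy sketch of the 'tangle' step, in which the code is extracted from a document that mixes prose and code. It assumes a made-up convention, loosely borrowed from the 'bird' style of literate Haskell, in which code lines start with `> `. It is not how Entangled works, and the names `tangle` and `LITERATE_DOC` are hypothetical.

```python
def tangle(literate_source: str) -> str:
    """Extract the code lines (those starting with '> ') from a literate document."""
    code_lines = []
    for line in literate_source.splitlines():
        if line.startswith("> "):
            # Drop the two-character marker and keep the rest, indentation included.
            code_lines.append(line[2:])
    return "\n".join(code_lines) + "\n"


# A tiny literate document: prose explains the code that follows it.
LITERATE_DOC = """\
We want to greet the reader, so we first define the greeting.

> def greet(name):
>     return f"Hello, {name}!"

Then we store the result of calling it, so that it can be inspected.

> message = greet("reader")
"""

# Run the extracted program in its own namespace.
namespace = {}
exec(tangle(LITERATE_DOC), namespace)
print(namespace["message"])
```

The prose and the code live in one file, so explaining a change and making it happen in the same place. Real tools add much more, such as named code fragments and keeping the extracted files in sync with the document.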
You've probably encountered devices for which the manufacturer provided a single manual that applied to e.g. ten slightly different types of dishwashers, where you had to remember the serial number in order to find the relevant pages. Luckily, the digital world (specifically, the world wide web) allows dynamic ways to serve just the information that one needs. User documentation could become more interactive, containing different levels of detail that can be explored at will, starting from a specific user question.
This does pose a burden on those who create documentation though: for many research software packages, this may not be a feasible direction, as long as standard tools are lacking.
Sometimes you find video demos of software as well. These may serve as an extra source, but cannot replace written documentation, which can be searched.
Should documentation be citable, separate from the software package? It was proposed that documentation might have its own DOI. There are pros and cons when it comes to publishing documentation separately from source code (e.g. as a software paper in JOSS), related to work on software vs work on docs and referencing software in general. It probably shouldn't be necessary to have separate docs just for referencing purposes.
Sara Veldhoen, Research Software Engineer. I hold a BSc degree in Artificial Intelligence from Utrecht University (UU) and an MSc degree in the same field from the University of Amsterdam (UvA).
● https://www.divio.com/blog/documentation/ on functions of documentation
● http://readthedocs.io for automating building, versioning, and hosting of your docs
● https://github.com/milooy/art-of-readme on the art of README
● https://www.writethedocs.org/ is a community of people who care about documentation
● https://keepachangelog.com/en/1.0.0/ on changelogs: curated, chronologically ordered list of notable changes for each version of a project
● https://devrel.net/developer-experience/an-introduction-to-friction-logging on friction logging: what makes tools hard to use
● https://mkdocs.org a generator for HTML documentation from Markdown
● https://www.sphinx-doc.org/en/master/ generates documentation in a range of formats from reStructuredText (e.g. docstrings in Python)
● https://entangled.github.io/entangled/ as an example of literate programming (Haskell-based)