Writing Useful “Readme” Files

By Casey Burleyson

One of the simplest things you can do to start making your data FAIR is to write readme files that describe your data. I do this each time I download a new dataset from an outside source or create a unique dataset that I intend to eventually contribute to a repository. Readme files are uploaded alongside the raw datasets and summarize the key features of the dataset in a short, digestible way. While all data repositories include mechanisms to enter basic metadata about new data packages, those metadata fields are often generic and will not fully describe your unique dataset. Readmes are crucial to promoting data reuse and can save you a lot of time answering questions from downstream users of your dataset.

This post provides a list of suggested content and best practices for creating readme files. The suggestions below follow the guidelines from Cornell’s research data management group: https://data.research.cornell.edu/content/readme.

Best Practices

  • Create a readme file for each unique type of data file in your dataset. For example, if you had a dataset consisting of climate forcing files and model output files you would create two readme documents – one for each unique type of data. Name each readme file so that it is obvious which data type it describes (e.g., climate_forcing_files_readme.txt). A short script, like the sketch after this list, can stub these files out automatically.
  • Create the readme as a plain text (i.e., .txt) file so that it can be opened on any system without licensed software.
  • Create a new readme immediately after downloading new data from an outside source. This will keep the information fresh in your mind and prevent you from having to remember and track down data sources at the end of your project.
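
If your dataset contains several data types, stubbing out the readmes with a short script helps ensure none are forgotten. The sketch below is one possible approach in Python; the data type names are placeholders to replace with your own file types.

    # A minimal sketch: create one plain-text readme stub per data type.
    # The data type names below are placeholders, not part of any standard.
    from datetime import date
    from pathlib import Path

    data_types = ["climate_forcing_files", "model_output_files"]  # placeholders

    for data_type in data_types:
        readme = Path(f"{data_type}_readme.txt")  # name ties the readme to its data type
        if readme.exists():
            continue  # never overwrite a readme you have already written
        stub = (
            f"Readme for {data_type}\n"
            f"Created: {date.today().isoformat()}\n"
            "TODO: fill in the fields from the Recommended Content list.\n"
        )
        readme.write_text(stub)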

Recommended Content

  • Descriptive title of the dataset
  • Date on which the data was generated, downloaded, or committed to the repository
  • Contact information for the data creator or curator:
    • Name
    • Institution
    • Email
  • A short description, for each file type, of what the data files contain. For example, “The climate forcing files include an hourly time series of temperature and wind at a given location and year. The file name of each climate forcing file includes the location of the weather station and the year of the time series.”
  • Description of how the dataset was generated or processed, including links to the appropriate code repositories that host the processing code when possible.
  • A description of the basic data structure within each file (e.g., the columns are hourly time steps and the rows are different meteorological variables).
  • A list of all of the variables included in each data file and their associated units (e.g., Wind [m/s], Temperature [C], Relative Humidity [%], etc.).
  • References to the papers in which the dataset was introduced or used (e.g., This dataset is associated with Seinfeld et al. 1995 in the Journal of Practical Marine Biology).
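
Pulled together, the recommended content above can double as the skeleton of the readme itself. The outline below is only an illustration; the labels and bracketed placeholders should be adapted to your dataset.

    Dataset title: [descriptive title]
    Date: [date generated, downloaded, or committed to the repository]
    Contact: [name], [institution], [email]

    Description of files:
      [what each type of data file contains and how the file names are constructed]

    Processing:
      [how the dataset was generated or processed, with a link to the code repository]

    Data structure:
      [e.g., columns are time in hours, rows are meteorological variables]

    Variables and units:
      [e.g., Wind [m/s], Temperature [C], Relative Humidity [%]]

    Associated publications:
      [papers in which the dataset was introduced or used]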
