ALA’s new book – Cleaning Biodiversity Data in R

Why did we publish this book?

Tidying messy data can be difficult, but it is a task that anybody who wants to use data to tackle scientific questions must face!

The Atlas of Living Australia’s (ALA) Science & Decision Team recently authored a book titled Cleaning Biodiversity Data in R. This free resource provides up-to-date information on data cleaning processes for anybody who handles their data using code in R. As Australia’s national biodiversity database, the ALA is frequently asked about the best way to clean data using code. To provide advice on this topic, our first step was to investigate what the currently accepted “best” way might be. We took a deep dive into recent peer-reviewed research to see whether experts cleaned their biodiversity data in the same way.

cover of the data cleaning book, with dark blue background and an orange crustacean

Through this process we discovered that the steps to clean ecological data can be really messy! Even among experts, data cleaning can look completely different depending on the type of investigation and study species. No wonder people ask how to do it!

This book is our response to those questions. It provides guidance on many of the processes for cleaning biodiversity data in R.

flow diagram with multiple grey lines intersecting, to illustrate the high number of possible workflows
A flow diagram of possible ecological data cleaning workflows (made using data from ALA’s review). Credit: Martin Westgate

What’s covered in the book and who is it for?

In Cleaning Biodiversity Data in R, we provide an overview of a typical data cleaning workflow for open-access geo-referenced biodiversity data, from acquisition, to error identification, to correction. These processes are broken down into sections on exploring data, general data cleaning, and data cleaning processes that require expertise in your study species. The book has information suited to everyone from those just starting out working with biodiversity data in R to those looking for more advanced techniques.
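To give a flavour of what exploring, flagging and correcting records can look like in code, here is a minimal sketch in base R. It is not taken from the book: the records are invented for illustration, the column names are assumed to follow Darwin Core terms, and the checks (missing names, impossible coordinates, exact duplicates) are just a few simple examples of the kinds of issues a real workflow might handle.

```r
# Toy occurrence records, invented for illustration (not real ALA data).
# Column names are assumed to follow Darwin Core terms.
occ <- data.frame(
  scientificName   = c("Dacelo novaeguineae", "Dacelo novaeguineae",
                       "Todiramphus sanctus", "Todiramphus sanctus", NA),
  decimalLatitude  = c(-33.87, -33.87, -33.70, 133.70, -34.01),
  decimalLongitude = c(151.21, 151.21, 151.10, 151.10, 150.95),
  stringsAsFactors = FALSE
)

# Explore: count missing values in each column
colSums(is.na(occ))

# Identify errors: flag records with a missing name, coordinates outside
# the valid range, or that exactly duplicate an earlier row
bad_name  <- is.na(occ$scientificName)
bad_coord <- abs(occ$decimalLatitude) > 90 | abs(occ$decimalLongitude) > 180
is_dup    <- duplicated(occ)

# Correct: in this sketch we simply drop the flagged records
clean <- occ[!(bad_name | bad_coord | is_dup), ]
clean
```

Dropping flagged rows is the bluntest possible correction. Depending on the investigation and study species, it may be more appropriate to repair a value or keep the record with a flag.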

green and yellow dots on a map of NSW showing kingfisher records, most are focussed near the Sydney region
An informative, artistic representation of Kingfisher records near Sydney, New South Wales, Australia. Credit: Dax Kellie

We hope that this resource can help support your data cleaning tasks in the future, and act as a useful learning or reference tool for teaching and sharing knowledge around data cleaning.

For more information on working with data in R and Python, check out the galah package and ALA Labs.