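# Original prepData(): reads, cleans, and recodes in a single function
library(dplyr)

prepData <- function(csvName, migCol){
  csv <- read.csv(paste0(csvName, ".csv")) |> janitor::clean_names()
  # migCol is assumed to be the cleaned column name, passed as a string
  csv <- csv |> mutate(NewMig = case_when(
    .data[[migCol]] == 1 ~ "Yes",
    .data[[migCol]] == 0 ~ "No",
    TRUE ~ NA_character_
  ))
  return(csv)
}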
Readable, Reliable, Reusable: A Guide to Clean R Code
Intro
Clean code is code that can be easily understood—by everyone on the team, and by future you. It’s readable, maintainable, easier to debug, and scalable. But writing clean code isn’t automatic—it’s a skill that improves with practice, reflection, and shared standards.
I’ll share practical tips and examples for writing clean R code. Many of these ideas are inspired by Robert C. Martin (aka “Uncle Bob”) and his book Clean Code, adapted to fit the patterns and challenges of R programming. I’ll focus on three key areas: naming practices, function practices, and commenting practices. These aren’t just stylistic preferences; they’re tools for building code that lasts.
Naming Practices
Use Meaningful Names
Names should reveal the intent of the object. A good name tells you why the object exists, what it does, and how it is used. If a name needs a comment to explain it, the name does not reveal its intent. Say we name a variable date: which date? It could be today’s date or someone’s birthday. Be specific. In this example, projectStartDate is clearer than just date.
date <- "2024-01-01" # date project started
projectStartDate <- "2024-01-01"
Naming Conventions
Choose a naming convention for objects and functions. This can be camelCase, snake_case, or something else. The key is to be consistent. {tidyverse} mainly uses snake_case while {shiny} uses camelCase.
Choose Descriptive and Unambiguous Names
Be sure names actually represent what the object is. patientList should really be a list, because list means something specific to programmers. Data frames and values should be named with a noun or noun phrase, like totalOhioPopulation or currentShiftList. Functions should be named with a verb or verb phrase and be descriptive. Avoid general verbs. Here are some suggested alternatives:
| Word | Alternatives |
|---|---|
| send | deliver, dispatch, announce, distribute, route |
| find | search, extract, locate, recover |
| start | launch, create, begin, open |
| make | create, set up, build, generate, compose, add new |
Make Meaningful Distinctions
Names should be used appropriately and consistently. For example, using patients and person interchangeably is inconsistent and confusing. Also be as descriptive as possible. Don’t use patientsA and patientsB; instead, describe the difference between the two, such as patientsWithDiabetes and patientsWithHypertension. But avoid a name like A1CPatients, which breaks the pattern because it doesn’t start with patients.
Variables should communicate what they represent and how they relate to the broader logic. Let’s look at the variables firstName, lastName, street, houseNumber, city, state, and zipcode. Taken together, it’s pretty clear that they form an address, but seeing state alone would leave the reader guessing. With an addr prefix, as in addrFirstName, addrLastName, addrState, and so on, the reader will understand that these variables are part of a larger structure. An added benefit: in RStudio, these will be listed together in the Environment pane.
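As a small sketch of the prefix idea (the address values below are made up for illustration), a shared addr prefix also lets you pick out the related variables by name:

```r
# Hypothetical address fields that share the addr prefix
addrStreet <- "Main St"
addrCity <- "Columbus"
addrState <- "OH"
addrZipcode <- "43004"

# ls() can list every object whose name starts with the prefix
relatedNames <- ls(pattern = "^addr")
print(relatedNames)
```

If the fields always travel together, a named list or a one-row data frame may be an even clearer container, but that is a design choice beyond this sketch.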
Use Pronounceable Names
Use words, like actual words. If I see flwRate, I’ll say “flow rate” in my head, but it really means follow-up rate. This interrupts my reading, because I’m translating the name instead of skimming the code. A better name would be followup_rate.
Now that we have code autocompletion and aren’t limited by character lengths (e.g., 8 characters), it’s easier to be more descriptive. Clarity is more important than brevity!
Use Searchable Names
Clean code favors clarity over brevity. Single-letter variables and hard-coded numeric constants may save keystrokes, but they cost time when debugging or refactoring. If a variable like s or d appears across multiple lines, it becomes difficult to trace its meaning or locate it reliably in a large codebase.
This code works, but it’s opaque. What does s represent? What does 5 mean? Without context, it’s hard to tell.
s <- 0
for (i in 1:5) {
  d <- i * 4
  e <- d / 5
  s <- s + e
}
Here, every variable name communicates intent. Even though sum isn’t perfect, it’s at least searchable. More importantly, workDaysPerWeek is far easier to locate and understand than the raw constant 5.
realDaysPerIdealDay <- 4
workDaysPerWeek <- 5
sum <- 0
for (i in 1:workDaysPerWeek) {
  realTaskDays <- i * realDaysPerIdealDay
  realTaskWeeks <- realTaskDays / workDaysPerWeek
  sum <- realTaskWeeks + sum
}
Even better, instead of generic counters like i, use descriptive names that reflect what the loop is iterating over. In the example, workDay makes the loop’s purpose immediately clear.
for (workDay in 1:workDaysPerWeek) {
  realTaskDays <- workDay * realDaysPerIdealDay
  realTaskWeeks <- realTaskDays / workDaysPerWeek
  sum <- realTaskWeeks + sum
}
Replace Magic Numbers With Named Constants
Avoid hard-coded numbers that lack context. These so-called magic numbers make equations harder to interpret and maintain. Instead, use named constants that describe what the number represents.
In the example below, AADT (Annual Average Daily Traffic) is a well-known acronym in transportation analytics. While abbreviations are generally discouraged, domain-specific terms like AADT are acceptable when they’re widely recognized by your audience.
This example hides meaning behind raw numbers. What is 12081500? Why divide by 365?
MainStAADT <- 12081500 / 365
Now the equation reads like a sentence. Each variable name adds context, and daysPerYear is searchable, which makes it far easier to locate than every instance of 365.
daysPerYear <- 365
MainStAnnualCount <- 12081500
MainStAADT <- MainStAnnualCount / daysPerYear
Function Practices
Do One Thing
Functions should do one thing, and do it well. When a function tries to handle multiple tasks—like reading a file, cleaning data, and recoding values—it becomes harder to understand, test, and reuse. Even if each step is short, in real-world applications, these tasks often expand into multiple lines of logic. That’s why it’s important to break them down into smaller, focused functions.
Take the prepData() function shown at the top of this article. It reads a CSV, cleans column names, and recodes a migration column, all in one place. While it works, it mixes three distinct responsibilities, making the function harder to maintain, test, and read.
A cleaner approach is to separate each task into its own function. This way, the top-level function reads like a summary of the process, and each child function handles a single responsibility.
library(dplyr)

# Parent function: reads like a summary of the process
prepData <- function(csvName, migCol){
  readCSV(csvName) |>
    cleanDataCols() |>
    recodeData(migCol)
}

readCSV <- function(csvName){
  read.csv(paste0(csvName, ".csv"))
}

cleanDataCols <- function(df){
  df |> janitor::clean_names()
}

# migCol is assumed to be the cleaned column name, passed as a string
recodeData <- function(df, migCol){
  df |> mutate(
    NewMig = case_when(
      .data[[migCol]] == 1 ~ "Yes",
      .data[[migCol]] == 0 ~ "No",
      TRUE ~ NA_character_
    )
  )
}
This structure makes the code easier to follow from top to bottom. The parent function operates at a higher level of abstraction, while the child functions handle the details. It also makes each piece easier to test and reuse in other contexts. Clean code isn’t just about making things shorter—it’s about making your intent unmistakably clear.
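One payoff of the split is that each helper can be checked on its own. The sketch below uses a base-R stand-in for the recoding step, tested on a tiny in-memory data frame without touching a CSV file. The names recodeMigration() and toyData are hypothetical and are not the article’s recodeData().

```r
# Hypothetical base-R stand-in for the recoding step.
# migCol is assumed to be a column name passed as a string.
recodeMigration <- function(df, migCol) {
  df$NewMig <- ifelse(df[[migCol]] == 1, "Yes",
                      ifelse(df[[migCol]] == 0, "No", NA_character_))
  df
}

# A tiny in-memory data frame is enough to exercise the helper
toyData <- data.frame(migrated = c(1, 0, 2))
recoded <- recodeMigration(toyData, "migrated")

# Values other than 0 or 1 should become NA
stopifnot(identical(recoded$NewMig, c("Yes", "No", NA)))
```

Because the helper does only one thing, the check needs no file setup and no column cleaning.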
Single Level of Abstraction
Single Level of Abstraction (SLA) is a principle that helps keep functions readable and focused. Within any given function, all lines of code should operate at the same conceptual level. When high-level logic—such as business rules—is mixed with low-level implementation details like iteration or data parsing, it creates cognitive friction. This makes the function harder to read, harder to test, and more difficult to maintain over time.
In the previous example, the prepData() function is composed of three distinct steps: reading a CSV file, cleaning column names, and recoding a migration column. Each of these steps is handled by its own helper function (readCSV(), cleanDataCols(), and recodeData()), and all three operate at the same level of abstraction. Together, they form a clear and cohesive sequence that supports the higher-level goal of preparing the data. This structure allows the parent function to read like a summary, while the child functions handle the details, keeping each layer of logic clean and consistent.
Nested loops and nested if-else statements often violate SLA because they introduce multiple layers of logic within a single function. This layering makes it difficult to follow the function’s intent, especially when conditions are deeply nested or intertwined. In practice, this can obscure the purpose of the code and increase the cognitive load for anyone trying to read or maintain it.
Consider the following example. The first version uses nested if-else blocks to determine whether a user should be granted access. While technically correct, the logic is buried under multiple layers of conditions, making it harder to trace the decision path.
if (user$isActive) {
  if (user$role == "admin") {
    if (!is.null(user$permissions)) {
      grantAccess(user)
    } else {
      denyAccess("Missing permissions")
    }
  } else {
    denyAccess("Not an admin")
  }
} else {
  denyAccess("Inactive user")
}
The second version uses early returns and reads like a checklist: it states the conditions under which access should be denied and lets the main action, grantAccess(user), stand out.
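# Wrapped in a function (the name is illustrative) because return() only works inside one
authorizeUser <- function(user) {
  if (!user$isActive) return(denyAccess("Inactive user"))
  if (user$role != "admin") return(denyAccess("Not an admin"))
  if (is.null(user$permissions)) return(denyAccess("Missing permissions"))
  grantAccess(user)
}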
Sometimes it’s easier to write clean logic by thinking in reverse. Instead of asking “when should we grant access?”, ask “when should we deny it?” This shift in perspective often leads to simpler, flatter code that respects abstraction levels and improves readability.
Use Descriptive Names
Use descriptive names that clearly communicate what a function or variable is meant to do. Ambiguous or overly terse names may save a few keystrokes, but they cost clarity. A long, descriptive name is almost always better than a short, enigmatic one—especially when revisiting code months later or sharing it with teammates.
If a function calculates the average number of tasks completed per week, naming it calcAvgTasksPerWeek() is far more helpful than simply calling it avg() or doCalc(). Descriptive names reduce the need for comments, making your logic flow more easily. Don’t be afraid of length if it adds meaning; clarity is always worth the extra characters.
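A minimal sketch of that naming advice, using made-up weekly task counts (the data and variable names are assumptions for illustration):

```r
# Hypothetical example: tasks completed in each of four weeks
tasksCompletedByWeek <- c(12, 15, 9, 14)

# The name states what is averaged and over what unit
calcAvgTasksPerWeek <- function(tasksCompletedByWeek) {
  mean(tasksCompletedByWeek)
}

avgTasksPerWeek <- calcAvgTasksPerWeek(tasksCompletedByWeek)
print(avgTasksPerWeek)
```

At the call site, calcAvgTasksPerWeek(tasksCompletedByWeek) explains itself, whereas avg(x) would need a comment.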
Have No Side Effects
A clean function should do exactly what it claims to do. When a function has side effects, such as modifying global variables, altering external states, or changing data outside its scope, it becomes unpredictable and harder to test. If the function’s intent is clear and it uses only its inputs to produce its output, then it has no side effects. This makes the function safer to reuse and easier to reason about.
Functions should never rely on or alter global state. Instead, they should accept all necessary inputs explicitly and return outputs without leaving behind hidden changes. This discipline ensures that the function behaves consistently, regardless of where or how it’s called.
There are exceptions where changing external state is the whole point, such as writing files (e.g., write.csv()) or showing warning messages. These side effects are acceptable because they are intentional, transparent, and explicit.
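To make the contrast concrete, here is a small sketch with two hypothetical functions: one that quietly changes a global variable through the <<- operator, and one that works only from its inputs.

```r
# Side effect: the function silently changes a variable outside its own scope
runningTotal <- 0
addToRunningTotal <- function(amount) {
  runningTotal <<- runningTotal + amount  # modifies state outside the function
}

# No side effect: everything the function needs comes in as an argument
addToTotal <- function(currentTotal, amount) {
  currentTotal + amount
}

addToRunningTotal(5)          # runningTotal is now 5, changed behind the scenes
newTotal <- addToTotal(0, 5)  # returns 5 and leaves everything else untouched
```

Calling addToRunningTotal(5) twice gives a different state each time, while addToTotal(0, 5) always returns the same value for the same inputs, which is what makes it easy to test.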
Write Vectorized Code
Vectorized code operates on entire vectors, matrices, or arrays in a single operation, rather than processing elements one by one through explicit loops. This approach is significantly faster and more efficient, especially when working with large data sets.
Built-in functions like sum(), mean(), and operators such as * are inherently vectorized in R. They reduce code complexity, improve readability, and align with R’s design philosophy of concise, expressive syntax. Vectorized code is also easier to maintain and less prone to errors, since it avoids manual iteration and conditional logic scattered across multiple lines.
Beyond performance, vectorization supports the creation of pure functions: functions that rely only on their inputs and produce predictable outputs. This makes code more reusable and testable, and integrates well with R’s functional programming tools and packages like {dplyr}, {purrr}, and {data.table}.
Here’s a simple example that compares a non-vectorized loop to a vectorized alternative:
countOfCups <- 1:5

# Non-vectorized: grow the result one element at a time
cupsOverThree <- numeric(0)
for (i in seq_along(countOfCups)) {
  if (countOfCups[i] > 3) {
    cupsOverThree <- c(cupsOverThree, countOfCups[i])
  }
}

# Vectorized: one logical subset replaces the whole loop
cupsOverThree <- countOfCups[countOfCups > 3]
Output is Consistent
A clean function should always return an output in a consistent format, regardless of the inputs. Whether it’s a specific class, structure, or set of column names in a dataframe, consistency ensures that downstream code can rely on predictable behavior. If a function sometimes returns an integer and other times a string, it forces the developer to write additional logic just to handle the variability. This adds unnecessary complexity and increases the risk of bugs.
By keeping outputs consistent, we make our functions easier to test, easier to integrate, and easier to reuse. It also allows other developers—or future you—to use the function without needing to inspect its internals or guess what kind of result it might produce. In data workflows, this is especially important when working with pipelines or chaining operations, where each step depends on the structure of the previous one.
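A short sketch of the difference, using two hypothetical functions that compute a mean age. The first switches return type on empty input; the second always returns a single numeric value and uses NA_real_ to signal missing data.

```r
# Inconsistent: returns a number for normal input but a string for empty input
calcMeanAgeInconsistent <- function(ages) {
  if (length(ages) == 0) return("no data")
  mean(ages)
}

# Consistent: always returns a single numeric value, NA_real_ when there is no data
calcMeanAge <- function(ages) {
  if (length(ages) == 0) return(NA_real_)
  mean(ages)
}

class(calcMeanAgeInconsistent(numeric(0)))  # "character"
class(calcMeanAge(numeric(0)))              # "numeric"
class(calcMeanAge(c(30, 40)))               # "numeric"
```

Downstream code can then treat the result of calcMeanAge() as numeric everywhere, for example by dropping missing values with na.rm = TRUE.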
Flexible Functions
Function design should support reuse across different contexts. Consider the two examples below. Is one better than the other?
# Version 1: works on a data frame and depends on {dplyr}
multiplyTwoValues <- function(df, newColName, one_value, second_value){
  df |>
    dplyr::mutate({{newColName}} := {{one_value}} * {{second_value}})
}

# Version 2: works on any numeric vectors
multiplyTwoValues <- function(one_value, second_value){
  one_value * second_value
}
The first version assumes the input will always be a dataframe and relies on {dplyr} for its operation. While this may be appropriate in a pipeline-heavy workflow, it limits the function’s flexibility. It can’t be easily reused in loops, {purrr} mappings, or inside {ggplot2} calculations without additional scaffolding.
The second version is simpler and more versatile. It works with any vectorized inputs and doesn’t depend on a specific data structure or package. This makes it easier to reuse across different parts of a project, regardless of whether you’re working inside a tidyverse pipeline or writing base R logic.
This isn’t to say one approach is always better than the other. Instead, it’s a reminder to consider the trade-offs. A flexible function can reduce coupling, improve testability, and support a wider range of use cases. When designing functions, think about how and where they’ll be used—and whether their structure invites reuse or restricts it.
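As a rough illustration of that flexibility, the sketch below reuses the vector-based version of multiplyTwoValues() in three contexts using only base R. The price and quantity data are made up for the example.

```r
# The vector-based version of the function
multiplyTwoValues <- function(one_value, second_value) {
  one_value * second_value
}

# Hypothetical inputs
unitPrices <- c(2.5, 4, 10)
quantities <- c(4, 3, 1)

# 1. Plain vectors
vectorTotals <- multiplyTwoValues(unitPrices, quantities)

# 2. Data frame columns, without the function knowing about data frames
orders <- data.frame(unitPrice = unitPrices, quantity = quantities)
orders$lineTotal <- multiplyTwoValues(orders$unitPrice, orders$quantity)

# 3. Element by element with a functional such as mapply()
mappedTotals <- mapply(multiplyTwoValues, unitPrices, quantities)
```

All three calls give the same totals, and the function itself never had to change.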
Commenting Practices
I’ll just go through these practices briefly.
Focus on the intent behind the code rather than describing the mechanics of what it does. Why did you choose this function over another? What was your thinking when writing the steps?
Avoid redundant comments that restate what the code already makes clear.
Don’t leave unused code hanging around. We use git, so old code can be recovered if anyone ever needs it. Remove this clutter.
For functions, use comments or {roxygen2} to describe parameters, return values, and purpose. If you decide to turn your functions into a package, you’re already ahead of the game.
Use # TODO comments when work still needs to be done, but use them sparingly.
Use comments to add a section. In RStudio, you can insert a section with Ctrl+Shift+R.
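Several of these practices can be seen together in one small sketch. The function calcFollowupRate() and its parameters are hypothetical; the #' lines use standard {roxygen2} tags, and the inline comment explains why a choice was made rather than what the code does.

```r
#' Calculate the follow-up rate
#'
#' @param followedUpCount Number of patients who completed a follow-up visit.
#' @param enrolledCount Number of patients enrolled.
#' @return A single numeric proportion, or NA_real_ if nobody was enrolled.
calcFollowupRate <- function(followedUpCount, enrolledCount) {
  # Return NA instead of dividing by zero so downstream summaries can
  # drop the value with na.rm = TRUE rather than carry NaN or Inf
  if (enrolledCount == 0) return(NA_real_)

  # TODO: decide how to handle negative counts
  followedUpCount / enrolledCount
}

followupRate <- calcFollowupRate(45, 60)
print(followupRate)
```

Note there is no comment saying "divide the two counts"; the code already makes that clear.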
Conclusion
Clean code isn’t just about elegance—it’s about resilience. The true value of maintainable code is that when bugs happen (and they will), you can find them, fix them, and write tests to prevent them from happening again. Simplicity is your ally in this process. Breaking up functions into smaller, focused pieces doesn’t just abstract complexity—it makes your logic easier to trace, easier to test, and easier to trust.
Perhaps you work with production code someone else wrote. Borrowing from Robert Martin, follow the Boy Scout Rule: leave the campground cleaner than you found it. Rename one unclear variable. Split one oversized function. Remove one bit of duplication. These small improvements compound over time, and more importantly, they build the habit of writing code that’s not just functional but thoughtful.
Clean code is a mindset. It’s how we make our work easier to maintain, easier to share, and easier to build on. And it’s how we turn good code into great systems.