class: top, left, title-slide

# The Importance of
Good Coding Practices
For Data Scientists

### Randall Pruim, Calvin University
### Maria-Cristiana Gîrjău, and Nicholas J. Horton, Amherst College
### SDSS 2021

---

## Before we get started

* **Slides** at <https://tinyurl.com/sdss-2021>
  * This will appear again at the end of the talk.
* Feel free to post questions/comments in the chat.

---

## Why bother?

--

1. Good coding practices lead to **more reliable** code.

--

2. Good coding practices **save** more **time** than they cost.

--

3. Good coding practices are important, **even for beginners**.

--

4. Good coding practices **focus attention**.

--

5. Good coding practices need to be a **consistent habit**.

> Make no mistake about it. Bad habits are called 'bad' for a reason. They kill our productivity and creativity. They slow us down. They hold us back from achieving our goals. And they're detrimental to our health.

.right[• John Rampton]

???

1. but perhaps you object to the time it takes in the moment
2. summer research experience
   * save time redoing
   * save time maintaining, updating
   * save time by collaborating
3. music lessons
4. for both coding and for concepts
5. we don't just flip a switch and have better coding practices -- we develop them (like music lessons again)

1. Good coding practices lead to more reliable code.
   * It is easier to notice and fix errors in well written code.
   * It is less likely that the errors occur in the first place if the authors are using good practices.

2. Good coding practices save more time than they cost.

   Maybe you are convinced that good coding practices are valuable, but just now you are in a hurry and need to get the job done.
   * requires some time and attention -- establishing good habits
   * in the end, good practices save far more time than they consume
     * more likely to be **correct**, saving the time of redoing things
     * easier to **maintain**, saving time when it comes time to modify or adapt the code in the future
     * easier to **collaborate**, saving time when you need to work with others

3. 
   Good coding practices are important, even for beginners.
   * Easier and more efficient to learn good coding practices early than to unlearn bad habits later.
   * Especially important that beginners see code that meets the highest standards, even if they are not forced to meet those standards themselves (at first).
   * We need to provide excellent models for our students to emulate -- if not, we are impeding their progress unnecessarily.
   * Second violin teacher -- "My other teacher never played like that."

4. Good coding practices focus attention.
   * makes it easier for students (and others) to learn both **code** and the **concepts**

---

## Goals

--

1. **Illustrate** some key aspects of coding practices (both good and bad).

--

2. Suggest ways to **include these practices** in the **statistics and data science curriculum**.

--

3. **Encourage everyone** to "level up" their coding practices.

---

## Examples will be in R

--

Why R?

???

```r
args(read.table)
```

```
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
    numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
    col.names, as.is = !stringsAsFactors, na.strings = "NA",
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown",
    text, skipNul = FALSE)
NULL
```

These features of R, its evolution, and its user base make it particularly important for data scientists who use or teach with R to emphasize good coding practices.

--

1. **R** is a **popular** language used **for data science**, both at academic institutions and in industry.

--

2. **R users** are less likely to be familiar with good coding practices than are users of other languages.

--

3. Developing good coding practices is **especially challenging in R**.
   * Competing styles in the community
   * Inconsistencies in base R code
   * R package ecosystem is large, decentralized, and rapidly evolving

--

But the main points apply to other languages as well.

---

## A motivating example

.small[<https://twitter.com/austingmeyer/status/1380942918593183744>]

<img src="../images/twitter-code-train-test.png" width="60%" />

---

## A motivating example

```r
train <- data[0:(length(data[,1])-8),]
train.ts <- ts(train[,c(2,3,4,5)],,start=c(2014,1),freq=52)
test <- data[(length(data[,1])-7):length(data[,1]),]
```

### Some responses

???

> It is really painful when taking a graduate level data science course and the instructor's code is considerably below any acceptable standard in the real world. Here is some real life code from a demo offered for the current homework... dear lord.

--

* Any R course should **emphasize good practices** such as generality, DRY (don't repeat yourself), etc., and many do. **I'd subtract quite a few points** for a submission with code such as in your screenshot...

--

* Sorry, doesn't look terrible to me. A bit cluttered, but **I've seen far worse** by people recognized as leaders in the R world. BTW, I think you are also **overestimating the quality of code in the real world**. :-)

---

## Improvements

**Better**

```r
train <- data[1:(nrow(data) - 8), ]
train.ts <- ts(train[, 2:5], start = c(2014, 1), freq = 52)
test <- data[(nrow(data) - 7):nrow(data), ]
```

--

**Much better**

```r
library(dplyr)          # provides select()
train <- head(data, -8)
test <- tail(data, 8)
train_ts <- ts(
  select(train, 2:5),   # selecting by name would be better
  start = c(2014, 1), freq = 52
)
```

???

better:

* 0-indexing removed
* use of `nrow()`
* use `:` for consecutive integers
* more space
  * some disagreement about how much space to use and where -- but even the "doesn't look terrible" respondent complained that the original was too compressed.
  * spacebar is the largest key on the keyboard for a reason

much better:

* use `head()` and `tail()`
* use `select()` -- longer here, but develops better general data wrangling skills
* avoid using `.` for things other than S3 methods

---

## Improvements

**Better**

```r
train <- data[1:(nrow(data) - 8), ]
train.ts <- ts(train[, 2:5], start = c(2014, 1), freq = 52)
test <- data[(nrow(data) - 7):nrow(data), ]
```

**Much better (pipe version)**

```r
library(dplyr)          # provides select()
test_size <- 8
train <- data |> head(- test_size)
test  <- data |> tail(  test_size)
train_ts <-
  train |>
  select(2:5) |>        # better: select by name
  ts(start = c(2014, 1), freq = 52)
```

???

pipe version makes data flow clearer: train -> select some cols -> convert to time series

---

class: inverse, center

# Low Hanging Fruit

### 1. Style
### 2. Naming things
### 3. Working on your accent
### 4. Taking advantage of authoring tools

---

# The time is now

> If you don't have time to do it right, when will you have the time to do it over?

.right[• John Wooden (Head Coach, UCLA men's basketball, 1948 – 1975)]

--

**Now** is always the right answer to the question

*When should I improve my coding practices?*

**Reason**: You are always developing habits: good or bad is your only choice.

--

> The difference between an amateur and a professional is in their habits. An amateur has amateur habits. A professional has professional habits. We can never free ourselves from habit. But we can replace bad habits with good ones.

.right[• Steven Pressfield]

???

---

## 1. You Gotta Have Style

--

> A foolish consistency is the hobgoblin of little minds...

.right[• Ralph Waldo Emerson, *Self-Reliance*]

--

> A foolish **inconsistency** is the bane of coding projects.

.right[• RJP]

--

> Not conforming to our style guide is a fireable offence.

.right[• JSM 2016 industry participant]

--

Adopting and following a style guide is more important than which style guide you choose.

* .small[But it is best to mimic one of the popular style guides]

???
It's all about **consistency**.

---

class: normal-size

## {formatR}, {lintr}, {styler}, etc.

Some parts of "style" can (and should) be automated.

### Original

```r
train <- my_data[0:(length(my_data[,1])-8),]
train.ts <- ts(train[,c(2,3,4,5)],,start=c(2014,1),freq=52)
test <- my_data[(length(my_data[,1])-7):length(my_data[,1]),]
```

### `formatR::tidy_source()`

```r
train <- my_data[0:(length(my_data[, 1]) - 8), ]
train.ts <- ts(train[, c(2, 3, 4, 5)], , start = c(2014, 1), freq = 52)
test <- my_data[(length(my_data[, 1]) - 7):length(my_data[, 1]), ]
```

---

class: center

## 2. Naming Things

> There are only two hard things in Computer Science:
> cache invalidation and **naming things**.
>
> -- Phil Karlton

--

.left[
See also

* [naming things slides](http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf) -- Jenny Bryan, Reproducible Science Workshop
* [Project structure slides](https://slides.djnavarro.net/project-structure/#1) by Danielle Navarro
  * mostly about naming files (and folders) in a larger project
]

---

## ~~Three~~ Four principles for naming files

1. Good for machines
2. Good for people
3. Good for sorting and searching

--

4. Be consistent

???

Consistency follows from the previous 3, but it is worth mentioning

---

## Naming files

1. Use (non-whitespace) delimiters: `_` (higher level) and `-` (lower level)
   * easier for both humans and machines to parse
2. Use consistent **slugs**
3. Use ISO 8601 for dates (YYYY-MM-DD)
4. Pad numbers with 0's for easy sorting
5. 
   Prepend numbers to force a particular order
   * leave gaps for future insertions

```bash
010_first-file_2021-06-01.txt       010_first-file_2021-06-01.txt
020_second-file_2021-06-02.txt      015_inserted-file_2021-06-03.txt
015_inserted-file_2021-06-03.txt    020_second-file_2021-06-02.txt
```

---

class: small

## Naming data

**Bad**

```r
faithful |> round(1) |> head(2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> eruptions </th>
   <th style="text-align:right;"> waiting </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 3.6 </td>
   <td style="text-align:right;"> 79 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.8 </td>
   <td style="text-align:right;"> 54 </td>
  </tr>
</tbody>
</table>

--

**Better**

```r
MASS::geyser |> round(1) |> head(2)
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> waiting </th>
   <th style="text-align:right;"> duration </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 80 </td>
   <td style="text-align:right;"> 4.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 71 </td>
   <td style="text-align:right;"> 2.1 </td>
  </tr>
</tbody>
</table>

--

* `time_til_next` or `time_since_prev` would be even better than `waiting`

---

## Data naming principles

1. Use **row-centric names** for columns
   * `eruptions` vs `eruption` vs `duration`

--

2. Decide how to handle **units**
   * in code-book -- common, but easily separated from the data
   * in variable names

   ```r
   palmerpenguins::penguins |> names()
   ```

   ```
   [1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"
   [5] "flipper_length_mm" "body_mass_g"       "sex"               "year"
   ```

   * in attributes -- **{labelled}** package facilitates, for example

--

3. Personal preference: Capitalize Data Frames, lower case for variables
   * Helpful for students who often struggle to maintain the distinction

???

What do we want to say about putting units into the variable names?
---

### Labeling example

```r
library(ggformula)
penguins_labelled <-
  palmerpenguins::penguins |>
* set_variable_labels(
    bill_length_mm = "bill length (mm)",
    body_mass_g = "body mass (g)")
penguins_labelled |> gf_point(body_mass_g ~ bill_length_mm, color = ~species)
penguins_labelled |> gf_histogram(~ body_mass_g | species ~ ., fill = ~species)
```

![](Coding-Practices-SDSS-2021_files/figure-html/unnamed-chunk-16-1.png)![](Coding-Practices-SDSS-2021_files/figure-html/unnamed-chunk-16-2.png)

---

## Teaching tips

1. Use good names for the files you distribute to students.

--

2. Explicitly tell students what **naming scheme** to use for things they submit to you.

--

3. Names can give insight into how students are thinking.
   * Pay attention to the names they use.
   * Comment on poor choices.

---

## 3. Don't speak R with a foreign accent

--

* Harder on instructors than on (many) students
  * Non-CS students (often) don't have other paradigms to impose
  * Can be a bigger problem for students with more CS background

--

* Exhibit 1: For loops
  * Staple in most other programming languages.
  * Usually more efficient/more elegant/simpler ways to do things in R.
  * For loops only appear to be a "basic construct" after you have learned them.
  * Actually an implementation detail for higher level concepts.

???

This is primarily a concern for new users (and instructors) who are already familiar with other programming languages and may be tempted to bring with them coding practices not well suited for work in R. The most salient of these is the misuse of for loops, a staple of programming in many languages that should mostly be avoided in R, which provides other ways to iterate that are both more efficient and more elegant.
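
One way to make the contrast concrete is the sketch below. It is illustrative only: the vector `waits` and the helper `describe()` are made-up names, not part of the talk's examples. It computes the same result with an explicit loop and with vectorized arithmetic, then uses `vapply()` for a step that is not vectorized.

```r
# Hypothetical data for illustration: waiting times in minutes.
waits <- c(79, 54, 74, 62, 85)

# "Accented" R: an explicit loop that fills a preallocated result vector.
hours_loop <- numeric(length(waits))
for (i in seq_along(waits)) {
  hours_loop[i] <- waits[i] / 60
}

# Idiomatic R: arithmetic is vectorized, so no loop is needed.
hours_vec <- waits / 60

# When the operation is not vectorized, iterate over the object with vapply().
# describe() is a made-up helper that handles one value at a time.
describe <- function(w) if (w > 70) "long" else "short"
labels <- vapply(waits, describe, character(1))

# Both approaches give the same numbers.
identical(hours_loop, hours_vec)
```

`vapply()` is used here rather than `sapply()` because it declares the expected type of each result, which doubles as a small sanity check.
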
This is not an issue for students who are writing their first code in R, but it can be a challenge for students who are already familiar with languages like Python, C++, or Java who have been developing coding practices that are suited to those languages but do not transfer over to R very well. The most salient of these is the use of for loops. Typically considered one of the essential learning outcomes of a first course in most programming languages, for loops are needed only infrequently in R code. The key concept of iterating over an object (a list, a vector, etc.) *is* very important. But R provides other ways to do this that are both more efficient and more elegant.

---

## 4. Take advantage of R Markdown (etc.)

.pull-left[
* documents
* web pages and dashboards
* slides (including these)
* template docs
  * model good behavior
  * scaffold student work
* script replacement
]

.pull-right[
<img src="../images/Rmarkdown-screenshot.png" width="100%" style="display: block; margin: auto;" />
]

???

Mention COVID dashboard with rsconnect server

Many instructors who use R emphasize the use of R Markdown from very early in their courses [@RMarkdown], but some practitioners still rely on R scripts that produce auxiliary files that are then included elsewhere for reporting, perhaps via unautomated copy-and-paste steps. Parameterized R Markdown and the ability to schedule compilation on a server further add to the usefulness of R Markdown.

---

class: inverse, center

# Next Level Up

### 5. Choose Wisely: Less Volume, More Creativity
### 6. Avoid Copy-and-paste: Functions, Packages, Authoring tools
### 7. Prepare for success by preparing for failure.

---

## 5. Choose Wisely -- Sometimes less is more

There are often many ways to perform some common tasks in R.

Goal: A toolkit that is

* small,
* powerful,
* coherent

A toolkit of functions that work well together improves readability and reduces errors.
* [Less Volume More Creativity in R](https://teachdatascience.com/mosaic/) blog post has links to longer discussions.

???

Why not in low hanging fruit section?

* Very important for instructors of beginning students
* But challenging for instructors new to R

Reduces cognitive overload

There are many ways to perform some common tasks in R. Even when several are equally good on their own merits, selecting a toolkit consisting of functions that work well together improves readability and reduces errors that arise from failing to switch from one standard to another.

---

## 6. Copy and paste is not a workflow

**WET** vs **DRY:**

* WET = Write every time
* DRY = Don't repeat yourself
  * much less error prone
  * easier to update/maintain

**Useful principles**

* Encapsulate reusable code in functions.
* Encapsulate reusable functions in R packages.

**Question for instructors**: When do we introduce function/package creation?

???

Don't be intimidated by package authoring

Package levels

* project
* cross-project
* organization
* share with the world (GitHub/CRAN)

---

## 6. Copy and paste is not a workflow

Another plug for authoring tools like R Markdown (and RSconnect).

<img src="../images/Covid-dashboard.png" width="100%" style="display: block; margin: auto;" />

???

Chemistry grant report example

---

## 7. Preparing for Failure: Sanity checks & unit tests

### How do you know it works?

1. Start small
   * Try out code on small examples with known answers.

--

2. Encourage sanity checks/audits after data transformations.
   * Good place to practice visualization techniques.

--

3. Write code that checks assumptions and fails safely.

--

4. We need to get students doing more unit testing.
   * Lay the groundwork for this early.

--

**Important concept:** Test, not trust (hope?)

???

Teaching tip: Get students to show their sanity checks

* include example assignment question?
* **{testthat}** helps implement this in packages (for data and functions).
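
A possible starting point for such an assignment, sketched in base R only. The function `split_last()` and the data frame `toy` are hypothetical names invented for this illustration. The sketch checks assumptions up front, tries the code on a small example with a known answer, and confirms that bad input fails loudly.

```r
# Hypothetical helper for illustration: split a data frame into train and
# test sets, holding out the last `test_size` rows.
split_last <- function(data, test_size) {
  # Check assumptions up front and fail with an error if they do not hold.
  stopifnot(
    is.data.frame(data),
    is.numeric(test_size), length(test_size) == 1,
    test_size >= 1, test_size < nrow(data)
  )
  list(
    train = head(data, -test_size),
    test  = tail(data, test_size)
  )
}

# Start small: a toy data frame where the right answer is obvious.
toy <- data.frame(week = 1:10, cases = (1:10)^2)
parts <- split_last(toy, test_size = 3)

# Sanity checks after the transformation: no rows lost, none duplicated.
stopifnot(
  nrow(parts$train) + nrow(parts$test) == nrow(toy),
  identical(parts$train$week, 1:7),
  identical(parts$test$week, 8:10)
)

# Bad input should fail rather than return a silently wrong answer.
result <- try(split_last(toy, test_size = 20), silent = TRUE)
inherits(result, "try-error")
```

Checks like these translate fairly directly into **{testthat}** expectations (for example `expect_equal()` and `expect_error()` inside `test_that()`) once students start writing packages.
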
While full-blown unit testing of the sort supported by the `testthat` package may not be needed from the start, encouraging students to perform sanity checks is vital and a first step toward later unit testing. Visualizations or tables can help convince us that a data transformation was performed correctly. Trying code on small examples, or examples where the result is known, can reassure us that the code will work on other examples.

---

class: inverse, center

# If only we had more time...

### Great ways to improve your coding practice<br> that deserve a lengthier discussion

### 8. Debuggers
### 9. Version Control
### 10. Other languages

---

## 8. When it doesn't work: Learn to use a debugger

Want a quick way to get started with debugging? Try this:

```r
debugonce(my_function)
```

???

Sometimes we know the code we have written is not working. What then? Developing some rudimentary debugging skills, including the use of a debugger, can make finding and fixing errors much less frustrating and time consuming.

---

## 9. Get (version) control of the situation

**Version control** (Git/GitHub, etc.) is a key part of a workflow that fosters many good code practices, including

* sane **collaboration**
* organized **code review**
* safe **experimentation** (e.g., development and production environments)

---

### Git/GitHub References

* *Happy Git and GitHub for the useR*, Jenny Bryan, Jim Hester (and TAs), [happygitwithr.com](https://happygitwithr.com/)
* *Implementing version control with Git and GitHub as a learning objective in statistics and data science courses*, Beckman et al, JSDSE 2021.
* *Learn Git Branching*, [learngitbranching.js.org](https://learngitbranching.js.org/)

???

Where does this go? [@beckman].

---

## 10. R may not always be the best choice

* Willingness to use other languages when they are better suited to a task is important.
* **{reticulate}** makes it relatively easy to mix R and Python in the same project.
  * Easy to move data between R and Python.
  * Greatly expands the data science toolkit without breaking workflow.
* RStudio provides support for **{reticulate}**.

???

While proficiency in a language that supports data analysis well is important for any statistician or data scientist, a willingness to use other languages when they are better suited for a given task is also important. R Markdown support for multiple languages (especially Python) provides an easy way for R users to incorporate other languages in their workflow as appropriate.

---

## Summary

1. Improving your coding practices pays off in more reliable, more readable, more maintainable code.
   * Instructors: more readable = easier to grade
2. Take a developmental approach over time.
3. Demand more of yourself than of others (especially students).
4. Instructors: Choose your spots for what you will enforce when.
   * Assess what you value: Include some points for code quality issues.
   * But don't overwhelm newbies.

???

When looking back at old code, I have never regretted using good coding practices. I have often regretted using poor coding practices.

---

class: normal-size

### From my class notes (300-level Bayesian Statistics course)

Computer code simultaneously **communicates both to humans and to the computer**. Not all code that works is equally good. Here are 4 C's to consider as you write your code.

* **Correctness:** It is important that the code be correct so that the computer does what you intend.
* **Clarity:** But it is also important that the code be clear, so that humans reading and writing the code can tell what it is intended to do, and easily make modifications as necessary.
* **Containment:** It is also helpful if the code is appropriately contained, to keep separate things that should be separate and together things that should be together.
* **Consistency:** Finally, it is useful if your code exhibits internal consistency of style and naming conventions.
We will adopt a style guide patterned after that used by the authors of the tidyverse suite of packages.

---

class: center

## Thanks!

**Slides** at <https://tinyurl.com/sdss-2021>

#### Contact Us

<table>
<tr>
<td>
<img src = "../images/rpruim-square500.jpg" height = 200> <br>
Randall Pruim<br>
<a href="mailto:rpruim@gmail.com">rpruim@gmail.com</a>
</td>
<td>
<img src = "../images/nick_horton.jpg" height = 200><br>
Nicholas J Horton <br>
<a href="mailto:nhorton@amherst.edu">nhorton@amherst.edu</a>
</td>
<td>
<img src = "../images/kitty.jpeg" height = 200><br>
Maria-Cristiana Gîrjău <br>
<a href="mailto:mg4345@columbia.edu">mg4345@columbia.edu</a>
</td>
</tr>
</table>

.center[We acknowledge and appreciate support from NSF IIS grant 1923388.]