MSDS 431 Example Assignment
Go for Statistics
Management Problem
Managers of a technology startup are keen on limiting the number of computer languages supported by the company. They would like software engineers and data scientists to work together using the same language for backend research and product development. In particular, they want to see employees using Go as their primary programming language.
The managers know that Go will serve the company’s needs for backend web and database servers. They know that Go is the right language for distributed service offerings on the cloud. But they are concerned that it may be difficult to convince data scientists to use Go rather than Python or R.
Searches at https://go.dev/ point to numerous statistics, machine learning, and neural network packages that may serve the needs of data scientists. One statistics package they are looking at, in particular, is listed at https://pkg.go.dev/github.com/montanaflynn/stats with a GitHub repository at https://github.com/montanaflynn/stats. As of June 19, 2023, this pure Go statistics package had 28 contributors and more than 15 thousand users.
The company’s data scientists are concerned about the prospect of having to use Go for their work. At the very least, the data scientists want to ensure that the proposed Go statistics package will provide correct answers. Tests could examine Go linear regression results against results from Python and R. It is suggested that an initial test be run on four small data sets: The Anscombe Quartet as described by Anscombe (1973) and Miller (2015).
Assignment Requirements
Take on the role of the company’s data scientists. Using data from The Anscombe Quartet and the Go testing package, ensure that the Go statistical package yields results comparable to those obtained from Python and R. In particular, ensure that similar results are obtained for estimated linear regression coefficients. Also, use the Go testing package to obtain program execution times and compare these with execution times observed from running Python and R programs on The Anscombe Quartet.
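The coefficient check described above can be sketched with the standard library alone. This is a minimal sketch, not the assignment solution: `fitLine`, `anscombe`, and `withinTolerance` are hypothetical names, `fitLine` is a hand-written least squares stand-in for the statistics package under test, and the 0.01 tolerance is an assumption. The target values (slope 0.5, intercept 3.0) are the rounded regression line reported by Anscombe (1973).

```go
package main

import (
	"fmt"
	"math"
)

// dataset holds one of the four Anscombe (1973) data sets.
type dataset struct {
	name string
	x, y []float64
}

// fitLine returns the ordinary least squares slope and intercept for a
// simple linear regression of y on x. It is a hypothetical stand-in for
// the statistics package under test.
func fitLine(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	var sxy, sxx float64
	for i := range x {
		sxy += (x[i] - meanX) * (y[i] - meanY)
		sxx += (x[i] - meanX) * (x[i] - meanX)
	}
	slope = sxy / sxx
	intercept = meanY - slope*meanX
	return slope, intercept
}

// anscombe returns the quartet as published in Anscombe (1973).
func anscombe() []dataset {
	x123 := []float64{10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5}
	x4 := []float64{8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8}
	return []dataset{
		{"I", x123, []float64{8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68}},
		{"II", x123, []float64{9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74}},
		{"III", x123, []float64{7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73}},
		{"IV", x4, []float64{6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89}},
	}
}

// withinTolerance reports whether got is within tol of want.
func withinTolerance(got, want, tol float64) bool {
	return math.Abs(got-want) <= tol
}

func main() {
	// Rounded reference coefficients; the tolerance is an assumption.
	const wantSlope, wantIntercept, tol = 0.5, 3.0, 0.01
	for _, d := range anscombe() {
		slope, intercept := fitLine(d.x, d.y)
		ok := withinTolerance(slope, wantSlope, tol) &&
			withinTolerance(intercept, wantIntercept, tol)
		fmt.Printf("set %-3s slope=%.4f intercept=%.4f match=%t\n",
			d.name, slope, intercept, ok)
	}
}
```

In the assignment itself, the fit would come from the Go statistics package, the reference values would come from the Python and R runs, and the comparison would live in a `_test.go` file run with `go test`.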
The Anscombe Quartet, developed by Anscombe (1973), is a set of four data sets with one independent variable x and one dependent variable y. Simple linear regression of y on x yields nearly identical estimates of regression coefficients despite the fact that these are very different data sets. The Anscombe Quartet provides a telling demonstration of the importance of data visualization. Miller (2015) includes an R program that plots the four data sets.
As part of the program documentation (in a README.md file), include a recommendation to management. Note any concerns that data scientists might have about using the Go statistics package instead of Python or R statistical packages.
Utilize test-driven development. The testing package in the Go standard library provides support for testing and benchmarking, and the go test tool is bundled with the Go toolchain. Bates and LaNou (2023) and Bodner (2021) provide Go programming examples of testing and benchmarking as needed for this assignment. Want to learn test-driven development while you learn Go? Check out Chris James’s GitBook Learn Go with Tests. Chloé Powell provides a brief introduction to unit testing in Go, highlighting the testing and testify packages: Unit Testing in Golang. Adelina Simion (2023) provides a comprehensive review of testing in Go. Want to take your unit testing to the next level? Check out the GitHub repository for testify.
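As a rough illustration of benchmarking with the testing package, the sketch below times a least squares fit on the first Anscombe data set. It assumes a hypothetical `fitLine` function standing in for the package under test, and it calls `testing.Benchmark` from `main` so that it runs with `go run`. In a `_test.go` file, the same loop would sit inside a function such as `func BenchmarkFitLine(b *testing.B)` and run under `go test -bench=.`.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks keep the compiler from discarding the fit
// inside the benchmark loop.
var sinkSlope, sinkIntercept float64

// fitLine returns the ordinary least squares slope and intercept for a
// simple linear regression of y on x. Hypothetical stand-in for the
// statistics package under test.
func fitLine(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	var sxy, sxx float64
	for i := range x {
		sxy += (x[i] - meanX) * (y[i] - meanY)
		sxx += (x[i] - meanX) * (x[i] - meanX)
	}
	slope = sxy / sxx
	intercept = meanY - slope*meanX
	return slope, intercept
}

// benchmarkFit times fitLine on Anscombe data set I using the testing
// package's benchmark runner outside of `go test`.
func benchmarkFit() testing.BenchmarkResult {
	x := []float64{10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5}
	y := []float64{8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68}
	return testing.Benchmark(func(b *testing.B) {
		// b.N is chosen by the testing package so the loop runs long
		// enough to give a stable per-operation time.
		for i := 0; i < b.N; i++ {
			sinkSlope, sinkIntercept = fitLine(x, y)
		}
	})
}

func main() {
	result := benchmarkFit()
	fmt.Printf("iterations=%d ns/op=%d\n", result.N, result.NsPerOp())
}
```

Timings from a run like this depend on the machine and Go version, so they are only comparable with Python and R timings collected on the same hardware.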
Grading Guidelines (100 Total Points)
- Coding rules, organization, and aesthetics (20 points). Effective use of Go modules and idiomatic Go. Code should be readable, easy to understand. Variable and function names should be meaningful, specific rather than abstract. They should not be too long or too short. Avoid useless temporary variables and intermediate results. Code blocks and line breaks should be clean and consistent. Break large code blocks into smaller blocks that accomplish one task at a time. Utilize readable and easy-to-follow control flow (if/else blocks and for loops). Simplify complex Boolean expressions with De Morgan’s laws: distribute the not and switch and/or. Programs should be self-documenting, with comments explaining the logic behind the code (McConnell 2004, 777–817).
- Testing and software metrics (20 points). Employ unit tests of critical components, generating synthetic test data when appropriate. Generate program logs and profiles when appropriate. Monitor memory and processing requirements of code components and the entire program. If noted in the requirements definition, conduct a Monte Carlo performance benchmark.
- Design and development (20 points). Employ a clean, efficient, and easy-to-understand design that meets all aspects of the requirements definition and serves the use case. When possible, develop general-purpose code modules that can be reused in other programming projects.
- Documentation (20 points). Effective use of Git/GitHub, including a README.md Markdown file for each repository, noting the roles of programs and data and explaining how to test and use the application.
- Application (20 points). Provide instructions for creating an executable load module or application. The application should run to completion without issues. If user input is required, the application should check for valid/usable input and should provide appropriate explanation to users who provide incorrect input. The application should employ clean design for the user experience and user interface (UX/UI).
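For the testing and software metrics guideline above, synthetic test data with known coefficients gives a unit test a ground truth to check against. The sketch below is one possible approach under stated assumptions: `syntheticLine`, `fitLine`, and `recovers` are hypothetical names, the true line (slope 0.5, intercept 3.0), noise level, sample size, seed, and 0.05 tolerance are all arbitrary choices, and `fitLine` stands in for the statistics package under test.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// fitLine returns the ordinary least squares slope and intercept for a
// simple linear regression of y on x. Hypothetical stand-in for the
// statistics package under test.
func fitLine(x, y []float64) (slope, intercept float64) {
	n := float64(len(x))
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/n, sumY/n

	var sxy, sxx float64
	for i := range x {
		sxy += (x[i] - meanX) * (y[i] - meanY)
		sxx += (x[i] - meanX) * (x[i] - meanX)
	}
	slope = sxy / sxx
	intercept = meanY - slope*meanX
	return slope, intercept
}

// syntheticLine generates n points from y = intercept + slope*x + noise,
// with x uniform on [0, 10) and Gaussian noise of standard deviation sd.
// A fixed seed keeps the test data reproducible.
func syntheticLine(n int, slope, intercept, sd float64, seed int64) (x, y []float64) {
	rng := rand.New(rand.NewSource(seed))
	x = make([]float64, n)
	y = make([]float64, n)
	for i := range x {
		x[i] = 10 * rng.Float64()
		y[i] = intercept + slope*x[i] + sd*rng.NormFloat64()
	}
	return x, y
}

// recovers reports whether an estimate is within tol of the true value.
func recovers(estimate, truth, tol float64) bool {
	return math.Abs(estimate-truth) <= tol
}

func main() {
	// All parameters here are illustrative assumptions.
	const trueSlope, trueIntercept, tol = 0.5, 3.0, 0.05
	x, y := syntheticLine(1000, trueSlope, trueIntercept, 0.1, 1)
	slope, intercept := fitLine(x, y)
	fmt.Printf("slope ok=%t intercept ok=%t\n",
		recovers(slope, trueSlope, tol), recovers(intercept, trueIntercept, tol))
}
```

Repeating the generate-and-fit step over many seeds and sample sizes is one way to build toward the Monte Carlo performance benchmark mentioned in the guideline.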
Assignment Deliverable
- Text showing the link (URL) for the public GitHub repository for the assignment
References
- Anscombe, F. J. 1973, February. “Graphs in Statistical Analysis.” The American Statistician 27(1): 17–21. Available online at https://www.sjsu.edu/faculty/gerstman/StatPrimer/anscombe1973.pdf
- Bates, Mark, and Cory LaNou. 2023. Go Fundamentals: Gopher Guides. Boston: Addison-Wesley. [ISBN-13: 978-0-13-791830-0] Chapter 7, Testing, pages 195–229.
- Bodner, Jon. 2021. Learning Go: An Idiomatic Approach to Real-World Go Programming. Sebastopol, CA: O’Reilly. [ISBN-13: 978-1-492-07721-3] Book website at https://learning-go-book.dev/. GitHub code repository at https://github.com/learning-go-book. Chapter 13, Writing Tests, pages 271–297.
- James, Chris. 2023. Learn Go with Tests. GitBook available online at https://quii.gitbook.io/learn-go-with-tests/
- McConnell, Steve. 2004. Code Complete: A Practical Handbook of Software Construction (second edition). Redmond, WA: Microsoft Press [ISBN-13: 978-0-7356-1967-8] Chapter 32, Self-Documenting Code, pages 777–817.
- Miller, Thomas W. 2015. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. Upper Saddle River, NJ: Pearson Education. [ISBN-13: 978-0-13-389206-2] Chapter 1, Analytics and Data Science, pages 1–32, includes discussion and analysis of The Anscombe Quartet. To see this sample chapter and the book’s bibliography, download the PDF file. Python and R programs available at https://github.com/mtpa/mtpa/tree/master/MTPA_Chapter_1.
- Simion, Adelina. 2023. Test-Driven Development in Go: A Practical Guide to Writing Idiomatic and Efficient Go Tests Through Real-World Examples. Birmingham, UK: Packt. [ISBN-13: 978-1803247878]