Scalable, High-Performance Analytics

Most contemporary developments in applied statistics and machine learning, as reviewed by Gelman and Vehtari (2022), depend on computer-intensive algorithms and large data sets. With built-in concurrency, an expansive library of tools, and contributions from a large and active open-source community, Go is well-suited to serve the needs of applied statisticians and data scientists.

Go provides a robust environment for building scalable, high-performance systems. It also works with Apache Arrow, a language-agnostic toolbox for accelerated data interchange, in-memory processing, and vectorized calculations. These capabilities are especially important when working with large data sets.
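To make the idea of columnar, in-memory processing concrete, here is a minimal sketch in plain Go using only the standard library. It does not use the Arrow package or its API. The Float64Column type and its methods are hypothetical names invented for illustration, assuming a simplified layout of contiguous values plus a packed validity bitmap that marks nulls.

```go
package main

import "fmt"

// Float64Column is a hypothetical, simplified columnar array: values are
// stored contiguously, and a packed bitmap records which slots are valid
// (non-null). It illustrates the idea behind a columnar layout, not the
// Arrow API.
type Float64Column struct {
	values []float64
	valid  []byte // bit i set means values[i] is non-null
}

// NewFloat64Column builds a column from values and a per-slot validity mask.
// The mask must be nil (every slot valid) or the same length as values.
func NewFloat64Column(values []float64, mask []bool) Float64Column {
	col := Float64Column{
		values: append([]float64(nil), values...),
		valid:  make([]byte, (len(values)+7)/8),
	}
	for i := range values {
		if mask == nil || mask[i] {
			col.valid[i/8] |= 1 << (i % 8)
		}
	}
	return col
}

// IsValid reports whether slot i holds a non-null value.
func (c Float64Column) IsValid(i int) bool {
	return c.valid[i/8]&(1<<(i%8)) != 0
}

// NullCount returns the number of null slots.
func (c Float64Column) NullCount() int {
	n := 0
	for i := range c.values {
		if !c.IsValid(i) {
			n++
		}
	}
	return n
}

// Sum adds all non-null values in one pass over contiguous memory.
func (c Float64Column) Sum() float64 {
	var total float64
	for i, v := range c.values {
		if c.IsValid(i) {
			total += v
		}
	}
	return total
}

func main() {
	// The second slot is null, so its stored value is ignored.
	col := NewFloat64Column(
		[]float64{1.5, 0, 2.5, 4},
		[]bool{true, false, true, true},
	)
	fmt.Println(col.Sum(), col.NullCount()) // prints: 8 1
}
```

Keeping each column in one contiguous slice means an aggregate such as Sum touches only the data it needs, which is the property that makes tight, vectorizable loops possible.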

Gophers can access Apache Arrow with the Go arrow package.

The objectives of the Apache Arrow project were described in an ACM talk from July 2020, Apache Arrow and the Future of Data Frames, featuring Wes McKinney, creator of Python Pandas, co-founder of Voltron Data, and principal architect of Posit. With over 900 contributors to the open-source project, Apache Arrow has seen significant development in recent years.

Polars, a Rust-based, high-performance alternative to Python Pandas, PySpark, and Apache Spark, was built on Apache Arrow.

In December 2018, Sebastien Binet posted the article Go and Apache Arrow: Building Blocks for Data Science on the Gopher Academy Blog. Go and Apache Arrow have improved substantially in the years since this posting, much to the advantage of data science.

Matt Topol, a key contributor to the Apache Arrow project and author of the leading book about Arrow, presented Apache Arrow and Go: A Match Made in Data at ApacheCon in January 2023. Then at the Data Council conference in May 2023, he discussed Arrow Database Connectivity (ADBC), a columnar data interchange alternative to row-based Open Database Connectivity (ODBC). Go has been essential to the development of ADBC.
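The difference between row-based and columnar interchange can be sketched in plain Go. The example below is an illustration only: it does not use ADBC, ODBC, or the Arrow package, and the Trade and TradeBatch types and the ToColumns function are hypothetical names. It assumes a client that receives records row by row and must pivot them into columns before doing analytics.

```go
package main

import "fmt"

// Trade is a hypothetical row type, standing in for one record as a
// row-based interface might hand it back.
type Trade struct {
	Symbol string
	Price  float64
	Volume int64
}

// TradeBatch is a hypothetical columnar batch: one slice per field,
// all of equal length.
type TradeBatch struct {
	Symbols []string
	Prices  []float64
	Volumes []int64
}

// ToColumns pivots row records into a columnar batch. This per-row copy
// is the kind of conversion work that a columnar interface can skip when
// both ends already hold columnar data.
func ToColumns(rows []Trade) TradeBatch {
	b := TradeBatch{
		Symbols: make([]string, 0, len(rows)),
		Prices:  make([]float64, 0, len(rows)),
		Volumes: make([]int64, 0, len(rows)),
	}
	for _, r := range rows {
		b.Symbols = append(b.Symbols, r.Symbol)
		b.Prices = append(b.Prices, r.Price)
		b.Volumes = append(b.Volumes, r.Volume)
	}
	return b
}

// Len returns the number of rows represented by the batch.
func (b TradeBatch) Len() int { return len(b.Symbols) }

// TotalVolume scans only the Volumes column, leaving the others untouched.
func (b TradeBatch) TotalVolume() int64 {
	var total int64
	for _, v := range b.Volumes {
		total += v
	}
	return total
}

func main() {
	rows := []Trade{
		{"AAA", 10.5, 100},
		{"BBB", 20.25, 250},
		{"AAA", 10.75, 50},
	}
	batch := ToColumns(rows)
	fmt.Println(batch.Len(), batch.TotalVolume()) // prints: 3 400
}
```

When the database and the client both work with columns, the pivot step in ToColumns becomes unnecessary, which is the motivation for a columnar interchange interface.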

Apache Arrow specifications and tools are widely supported across the languages of data science, including Go, R, Python, Julia, Rust, and MATLAB, as well as numerous database systems. Arrow provides a platform for accelerated data interchange and in-memory processing, as needed for developing cross-language and multi-language systems.

Apache Arrow is central to the offerings of numerous data systems providers, including Voltron Data, Dremio, Snowflake, and Databricks.

The Data Thread Conference from June 2022 put a spotlight on many projects across the Apache Arrow ecosystem.

References

Gelman, Andrew, and Aki Vehtari. 2022. “What Are the Most Important Statistical Ideas of the Past 50 Years?” Journal of the American Statistical Association, 116: 2087-2097. Available in the online archive. See a video review of key ideas from this article: The most important ideas in modern statistics.

Topol, Matthew. 2022. In-Memory Analytics with Apache Arrow: Perform Fast and Efficient Data Analysis on Both Flat and Hierarchical Structured Data. Birmingham, UK: Packt. [ISBN-13: 978-1-80107-103-1] Code examples are provided in Go, Python, and C++ in the book’s GitHub repository.
