Computational Statistics in Data Science. Group of authors

ISBN 9781119561088




(Java matrix package, https://math.nist.gov/javanumerics/jama/) or EJML (efficient Java matrix library, http://ejml.org/wiki/). Such packages allow for routine computation – for example, matrix decomposition and dense/sparse matrix calculation. JFreeChart enables data visualization by generating scatter plots, histograms, bar plots, and so on. Recently, these Java libraries have been giving way to more popular JavaScript libraries such as Plot.ly (https://plot.ly/), Bokeh (bokeh.pydata.org), D3 [26], or Highcharts (www.highcharts.com).

      As outlined above, Java could serve as a useful statistical software solution, especially for developers who are familiar with it or interested in cross‐platform development. We therefore recommend it to seasoned programmers looking to add some statistical punch to their desktop, web, and mobile apps. For the analysis of big data, Java offers some of the best ML tools available.

      3.6 JavaScript, TypeScript

      3.7 Maple

      Maple is a “math software that combines the world's most powerful math engine with an interface that makes it extremely easy to analyze, explore, visualize, and solve mathematical problems.” (https://www.maplesoft.com/products/Maple/). While not specifically a statistical software package, Maple's computer algebra system is a handy supplement to an analyst's toolkit. Often in statistical computing, a user may employ Maple to check a hand calculation or reduce the workload/error rate in lengthy derivations. Moreover, Maple offers add‐on packages for statistics, calculus, analysis, linear algebra, and more. One can even create interactive plots and animations. In sum, Maple is a solid choice for a computer algebra system to aid in statistical computing.

      3.8 MATLAB, GNU Octave

      MATLAB began as FORTRAN subroutines for solving linear (LINPACK) and eigenvalue (EISPACK) problems. Cleve Moler developed most of the subroutines in the 1970s for use in the classroom. MATLAB quickly gained popularity, primarily through word of mouth. Developers rewrote MATLAB in C during the 1980s, adding speed and functionality. The parent company of MATLAB, The MathWorks, Inc., was founded in 1984, and MATLAB has since become a fully featured tool that is often used in engineering and developer fields where integration with sensors and controls is a primary concern.

      MATLAB has a substantial user base in government, academia, and the private sector. The MATLAB base distribution allows reading/writing data in ASCII, binary, and MATLAB proprietary formats. The data are presented to the user as an array, the MATLAB generic term for a matrix. The base distribution comes with a standard set of mathematical functions including trigonometric, inverse trigonometric, hyperbolic, inverse hyperbolic, exponential, and logarithmic. In addition, MATLAB provides the user with access to cell arrays, which allow heterogeneous data across the cells and are created in a manner analogous to a C/C++ structure. MATLAB also provides the user with numerical methods, including optimization and quadrature functions.

      A highly similar yet free and open‐source programming language is GNU Octave. Octave offers many if not all features of the core MATLAB distribution, although MATLAB has many add‐on packages for which Octave has no equivalent, which may prompt a user to choose MATLAB over Octave. We caution analysts against using MATLAB/Octave as their primary statistical computing solution, as MATLAB's popularity is diminishing [4] – likely due to open‐source, more fully featured competitors such as R and Python.

      3.9 Minitab®

      Minitab offers import tools and a comprehensive set of statistical capabilities. Minitab's features include basic statistics, ANOVA, fixed and mixed models, regression analyses, measurement systems analysis, and graphics including contour and rotating 3D plots. A full feature list resides at http://www.minitab.com/en-us/products/minitab/features-list/. For advanced users, a command‐line editor exists. Within the editor, users may customize macros (functions).

      Minitab serves its user base well and will continue to be viable in the future. For teaching academics, Minitab provides near immediate access to many statistical methods and graphics. For industry, Minitab offers tools to produce standardized analyses and reports with little training. However, Minitab's flexibility and big data capabilities are limited.

      3.10 Workload Managers: SLURM/LSF

      Working on shared computing clusters has become commonplace in contemporary data science applications. Some working knowledge of workload managers (also known as schedulers) is essential to running statistical software in these environments. Two popular workload managers for distributed high‐performance computing are SLURM (https://slurm.schedmd.com/documentation.html) and IBM's Platform Load Sharing Facility (LSF). These schedulers can be used to execute batch jobs on networked Unix and Windows systems on many different architectures. A user typically interfaces with a scheduler via a command‐line tool or through a scripting language, specifying the hardware resources and program inputs for each job. The scheduler then distributes the work across resources and runs jobs according to system‐prioritization schemes. In this way, hundreds or even thousands of programs can be run in parallel, increasing the scale of statistical computations possible within a reasonable time frame. For example, simulations for a novel statistical method could require many thousands of runs at various configurations, and with a scheduler these could be completed in days rather than months.
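      As a rough illustration of the simulation example above, the following Python sketch enumerates a grid of simulation configurations and writes the text of a SLURM batch script that runs them as a job array. The grid values, the resource requests, the `logs/` directory, and the script name `run_simulation.py` are hypothetical placeholders, not part of the original text; only standard `sbatch` directives and the `SLURM_ARRAY_TASK_ID` environment variable are assumed.

```python
from itertools import product


def config_grid(sample_sizes, effect_sizes, n_reps):
    """Enumerate every (sample size, effect size, replicate) combination."""
    return [
        {"n": n, "effect": e, "rep": r}
        for n, e, r in product(sample_sizes, effect_sizes, range(n_reps))
    ]


def build_sbatch_script(n_tasks, job_name="sim_study", cpus=1,
                        mem_gb=4, time_limit="02:00:00"):
    """Return the text of a SLURM batch script with one array task per config."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array=0-{n_tasks - 1}",      # one array index per configuration
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=logs/%x_%A_%a.out",    # job name, array job ID, task index
        "",
        # run_simulation.py is a hypothetical user script that looks up its
        # configuration from the array index supplied by the scheduler.
        'python run_simulation.py --task-id "$SLURM_ARRAY_TASK_ID"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical study: 3 sample sizes x 3 effect sizes x 1000 replicates.
grid = config_grid([50, 100, 500], [0.0, 0.2, 0.5], n_reps=1000)
script = build_sbatch_script(len(grid))
```

      The resulting text would be saved to a file and submitted with `sbatch`. Many clusters cap the size of a job array, so a grid this large may need to be split into several submissions or have multiple replicates bundled into each task.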

      3.11 SQL

      Structured Query Language (SQL) is the standard language for relational database management systems. While not strictly a statistical computing environment, the ability to query databases through SQL is an essential skill for data scientists. Nearly all companies seeking a data scientist require SQL knowledge, as much of an analyst's job is extracting, transforming, and loading data from an established relational database.
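      To make the extract‐and‐summarize workflow concrete, here is a minimal sketch using Python's built‐in `sqlite3` module with an in‐memory database. The table name `measurements`, its columns, and the data values are invented for illustration; a production database would be reached through its own driver, but the SQL itself would look much the same.

```python
import sqlite3

# In-memory database standing in for an established relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (subject_id INTEGER, arm TEXT, response REAL)"
)

# Load a few illustrative rows (hypothetical data).
rows = [
    (1, "control", 4.0),
    (2, "control", 6.0),
    (3, "treatment", 7.0),
    (4, "treatment", 9.0),
    (5, "treatment", 8.0),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)

# Extract a per-group summary: group size and mean response by study arm.
query = """
    SELECT arm, COUNT(*) AS n, AVG(response) AS mean_response
    FROM measurements
    GROUP BY arm
    ORDER BY arm
"""
summary = conn.execute(query).fetchall()
conn.close()

for arm, n, mean_response in summary:
    print(f"{arm}: n={n}, mean={mean_response:.1f}")
```

      Pushing the aggregation into the query, rather than pulling every row into memory first, is usually the better choice when the underlying table is large.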

      3.12 Stata®

      Stata provides hundreds of tools across broad applications and methods. Even Bayesian modeling and maximum‐likelihood