Computational Statistics in Data Science. Group of authors. Read online. Mreadz. MREADZ.COM

Title	Computational Statistics in Data Science
Author	Group of authors
Genre	Mathematics
Series
Publisher	Mathematics
Year of publication	0
ISBN	9781119561088


$$T_m(\epsilon) = \inf\left\{ n \ge 0 : V_n^{1/p} + \epsilon \|\hat\theta_h\|_a \, I(n < n^*) + n^{-1} \le \epsilon \|\hat\theta_h\|_a \right\}$$

This termination rule essentially controls the coefficient of variation for $\hat\theta_h$. An advantage here is that problem-free choices of $\epsilon$ can be used, since problems where $\|\theta_h\|_a$ is small will automatically require a smaller cutoff. A clear disadvantage is that this rule is ineffective when $\theta_h = 0$.
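A minimal sketch of how such a relative-magnitude check might look in the univariate case ($p = 1$) for IID draws. It assumes that $V_n$ is the length of a $(1-\alpha)$ normal confidence interval and that the norm reduces to the absolute value; the function name `relative_magnitude_stop` and the defaults for `n_star` and `alpha` are illustrative choices, not taken from the text.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def relative_magnitude_stop(draws, eps, n_star=1000, alpha=0.05):
    """Return True when the univariate (p = 1) relative-magnitude rule is met.

    Assumes IID draws, so the CLT variance is estimated by the sample variance.
    """
    n = len(draws)
    if n < 2:
        return False
    theta_hat = fmean(draws)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # For p = 1 the "volume" of the confidence region is the interval length.
    v_n = 2 * z * stdev(draws) / math.sqrt(n)
    target = eps * abs(theta_hat)
    # s(n): forces at least n_star draws and adds the n^{-1} safeguard.
    s_n = (target if n < n_star else 0.0) + 1.0 / n
    return v_n + s_n <= target


# Illustrative use: estimate the mean of a N(5, 1) target (an assumed toy example).
rng = random.Random(1)
draws = [rng.gauss(5.0, 1.0) for _ in range(1000)]
while not relative_magnitude_stop(draws, eps=0.01) and len(draws) < 1_000_000:
    draws.extend(rng.gauss(5.0, 1.0) for _ in range(500))
print(len(draws), fmean(draws))
```

When the estimate is exactly zero the target `eps * abs(theta_hat)` is zero as well, so the check can never succeed, which mirrors the disadvantage noted above.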

5.2 MCMC

Although both $T_a(\epsilon)$ and $T_m(\epsilon)$ may be used in MCMC, a third alternative arises due to the correlation in the Markov chain. A relative-standard deviation sequential stopping rule terminates the simulation when the Monte Carlo variability (as measured by the volume of the confidence region) is small compared to the underlying variability inherent to the problem $(\Lambda)$. That is,

$$T_s(\epsilon) = \inf\left\{ n \ge 0 : V_n^{1/p} + \epsilon |\hat\Lambda_n|^{1/(2p)} \, I(n < n^*) + n^{-1} \le \epsilon |\hat\Lambda_n|^{1/(2p)} \right\}$$
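The relative-standard deviation rule can be sketched for a univariate chain ($p = 1$) as follows. This is an illustration under stated assumptions: the asymptotic variance $\Sigma$ is estimated by non-overlapping batch means with batch size $\lfloor\sqrt{n}\rfloor$ (one common choice, not prescribed by the text), $\Lambda$ by the sample variance, and the AR(1) chain stands in for real MCMC output. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def batch_means_variance(chain):
    """Batch-means estimate of the asymptotic variance in the Markov chain CLT."""
    n = len(chain)
    b = int(math.sqrt(n))        # batch size (an assumed default)
    a = n // b                   # number of full batches
    used = chain[n - a * b:]     # drop the earliest draws that do not fill a batch
    overall = fmean(used)
    batch_avgs = [fmean(used[k * b:(k + 1) * b]) for k in range(a)]
    return b * sum((m - overall) ** 2 for m in batch_avgs) / (a - 1)


def relative_sd_stop(chain, eps, n_star=1000, alpha=0.05):
    """Univariate relative-standard deviation check."""
    n = len(chain)
    if n < 4:
        return False
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sigma2_hat = batch_means_variance(chain)   # estimate of Sigma
    lambda_hat = variance(chain)               # estimate of Lambda
    v_n = 2 * z * math.sqrt(sigma2_hat / n)    # confidence interval length
    target = eps * math.sqrt(lambda_hat)       # eps * |Lambda_n|^{1/(2p)}, p = 1
    s_n = (target if n < n_star else 0.0) + 1.0 / n
    return v_n + s_n <= target


def ar1_steps(rng, x, rho, k):
    """Advance a toy AR(1) chain k steps from state x."""
    out = []
    for _ in range(k):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(7)
chain = ar1_steps(rng, 0.0, 0.5, 1000)
while not relative_sd_stop(chain, eps=0.1) and len(chain) < 200_000:
    chain += ar1_steps(rng, chain[-1], 0.5, 500)
print(len(chain))
```

Because the target scales with the estimated standard deviation of the chain, the same `eps` can be reused across problems with very different scales.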

If this rule is used for IID Monte Carlo, then $A_n$ in Equation (2) is $\hat\Lambda_n$, and $T_s(\epsilon) \approx T_a(\epsilon')$ for some other (deterministic) $\epsilon'$. For MCMC, this sequential stopping rule connects directly to the concept of effective sample size [26]. That is, stopping at $T_s(\epsilon)$ is equivalent to stopping when

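$$\widehat{\mathrm{ESS}} := n \left( \frac{|\hat\Lambda_n|}{|\hat\Sigma_n|} \right)^{1/p} \ge \frac{2^{2/p}\,\pi}{\left(p\,\Gamma(p/2)\right)^{2/p}} \, \frac{\chi^2_{1-\alpha,p}}{\epsilon^2} \qquad (7)$$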

Thus, simulation is terminated when the number of effective samples is larger than the lower bound in Equation (7). Effective sample size measures the number of equivalent IID samples that would produce equivalent variability in $\hat\theta_h$. Terminating simulation using Equation (7) is intuitive and easy to implement in MCMC sampling once appropriate estimators of $\Lambda$ and $\Sigma$ have been obtained.
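A rough sketch of the effective-sample-size check, assuming the lower bound takes the form used in the multivariate effective-sample-size literature, $\frac{2^{2/p}\pi}{(p\,\Gamma(p/2))^{2/p}} \cdot \chi^2_{1-\alpha,p} / \epsilon^2$, and that the effective sample size is $n(|\hat\Lambda_n|/|\hat\Sigma_n|)^{1/p}$. The helper names are hypothetical, and the chi-square quantile is computed in closed form only for $p = 1$ and $p = 2$ to keep the example within the standard library.

```python
import math
from statistics import NormalDist


def min_ess(p, chi2_quantile, eps):
    """Assumed lower bound on effective sample size for relative tolerance eps."""
    const = 2 ** (2 / p) * math.pi / (p * math.gamma(p / 2)) ** (2 / p)
    return const * chi2_quantile / eps ** 2


def chi2_quantile_small_p(p, alpha):
    """Closed-form (1 - alpha) chi-square quantiles for p = 1 and p = 2 only."""
    if p == 1:
        return NormalDist().inv_cdf(1 - alpha / 2) ** 2
    if p == 2:
        return -2.0 * math.log(alpha)
    raise ValueError("use a statistics library for p > 2")


def ess_univariate(n, lambda_hat, sigma_hat):
    """n * (|Lambda| / |Sigma|)^{1/p} with p = 1."""
    return n * lambda_hat / sigma_hat


def enough_effective_samples(n, lambda_hat, sigma_hat, eps, alpha=0.05):
    """Univariate version of the check: ESS at least the lower bound."""
    bound = min_ess(1, chi2_quantile_small_p(1, alpha), eps)
    return ess_univariate(n, lambda_hat, sigma_hat) >= bound


print(min_ess(1, chi2_quantile_small_p(1, 0.05), 0.05))
print(min_ess(2, chi2_quantile_small_p(2, 0.05), 0.05))
```

The bound depends only on $p$, $\alpha$, and $\epsilon$, so it can be computed before any simulation is run and used as a target for the sampler.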

6 Workflow

We have presented tools for determining when to stop a Monte Carlo simulation. The workflow starts by identifying $F$ and $\theta$ and then running a chosen sampler for some small $n^*$ iterations. Preliminary estimates of $\theta$ and $\Lambda$ or $\Sigma$ are obtained along with visualizations determining quality of the sampler. The simulation continues until a chosen stopping rule indicates termination using a prespecified $\epsilon$. In the following section, we present three examples where we demonstrate this workflow.
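The workflow described here can be sketched as a small driver loop. Everything in this example is an illustrative assumption: `run_workflow`, `interval_short_enough`, and `quarter_circle_hits` are hypothetical names, the stopping check is a simple placeholder based on confidence interval length, and the toy problem takes $F$ to be the uniform distribution on the unit square with $\theta = \pi/4$. The visual diagnostics that the text calls for are not shown.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def run_workflow(draw_batch, should_stop, n_star=1000, batch=500, max_n=1_000_000):
    """Pilot run of n_star draws, then extend until should_stop(draws) is True."""
    draws = list(draw_batch(n_star))
    while not should_stop(draws) and len(draws) < max_n:
        draws.extend(draw_batch(batch))
    return draws


def interval_short_enough(draws, eps=0.02, alpha=0.05):
    """Placeholder rule: stop once the CI length plus 1/n is at most eps."""
    n = len(draws)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * stdev(draws) / math.sqrt(n) + 1.0 / n <= eps


rng = random.Random(42)


def quarter_circle_hits(k):
    """k IID indicators that a uniform point in the unit square lies in the unit circle."""
    return [float(rng.random() ** 2 + rng.random() ** 2 <= 1.0) for _ in range(k)]


draws = run_workflow(quarter_circle_hits, interval_short_enough)
pi_hat = 4 * fmean(draws)
print(len(draws), pi_hat)
```

Passing the stopping check as a function keeps the loop unchanged when a different rule or sampler is substituted.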

In our examples, we assume that a CLT (or asymptotic distribution) for the Monte Carlo estimators exists. However, extra care must be taken when working with a generic Monte Carlo procedure. In particular, importance sampling can often yield estimators with infinite variance, for which a CLT cannot hold; see Refs [3, 4] for more details. A CLT is particularly difficult to establish for MCMC due to serial correlation in the Markov chain. However, many individual Markov chains have been shown to be at least polynomially ergodic; for examples, see Jarner and Hansen [30], Roberts and Tweedie [31], Vats [32], Khare and Hobert [33], Tan et al. [34], Hobert and Geyer [35], and Jones and Hobert [36].

A similar workflow can be adopted for embarrassingly parallel implementations of Monte Carlo samplers. Given the power of the modern personal computer, most Monte Carlo samplers can run on multiple cores simultaneously, producing more samples in the same clock time. For IID Monte Carlo, averaging estimators across all independent runs is reasonable. However, for estimating $\Sigma$ in MCMC, estimation quality can be improved by sharing information across multiple runs at the end of the simulation; see Gupta and Vats [37] for more details.
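For the IID case, averaging across independent runs might look like the sketch below. It is a toy illustration: the function names are hypothetical, the target is an assumed Exponential(1) mean, and the runs are executed one after another with the built-in `map`. On a multicore machine the same call pattern could be dispatched with `concurrent.futures.ProcessPoolExecutor.map`. The information-sharing estimator of $\Sigma$ for MCMC is not attempted here.

```python
import random
from statistics import fmean


def run_replicate(seed, n):
    """One independent IID Monte Carlo run; returns (estimate, sample size)."""
    rng = random.Random(seed)   # separate generator per run
    draws = [rng.expovariate(1.0) for _ in range(n)]
    return fmean(draws), n


def pool_means(results):
    """Sample-size weighted average of per-run means."""
    total = sum(n for _, n in results)
    return sum(m * n for m, n in results) / total


# Four runs of 5000 draws each; swap map for a process pool to use several cores.
results = list(map(lambda seed: run_replicate(seed, 5000), range(4)))
pooled = pool_means(results)
print(pooled)
```

Weighting by sample size makes the pooled value equal to the mean of all draws combined, even when the runs have different lengths.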

Sequential stopping rules, particularly in MCMC, should not be implemented as a black-box procedure. Each implementation of the stopping rule must be accompanied by visualizations that give qualitative insights into the quality of the samplers. A better-quality sampler can significantly improve estimation and lead to shorter run times. We illustrate this point by comparing samplers in our examples.

7 Examples

7.1 Action Figure Collector Problem
