Practical Data Analysis with JMP, Third Edition. Robert Carver

ISBN 9781642956122




into a new data table. This is not the most efficient method, but it emphasizes the idea of random selection, and it introduces two useful commands.

      In this chapter, we will work with several data tables. As such, this is an opportunity to introduce JMP Projects. A project keeps track of data and report files together.

      1. First, we will create a Project to store the work that we are about to do. Select File ► New ► Project. A blank project window opens. (See Figure 2.1.)

      Figure 2.1: A Project Window


      2. Select File ► Open. Choose the data table called World Nations. This table lists the countries in the world as of 2017, as identified by the United Nations and the World Bank. Notice that the data table opens within the project window, and the Window List in the upper left contains World Nations.

      3. Select Rows ► Row Selection ► Select Randomly…. A small dialog box opens (Figure 2.2) asking for either a sampling rate or a sample size. If you enter a value between 0 and 1, JMP understands it as a rate. A number larger than 1 is interpreted as a sample size, n. Enter 20 into the dialog box and click OK.

      Figure 2.2: Specifying a Simple Random Sample Size of 20


      JMP randomly selects 20 of the 215 rows in this table. When you look at the Rows panel in the Data Table window, you will see that 20 rows have been selected. As you scroll down the list of countries, you will see that the 20 selected rows are highlighted. If you repeat Step 3, a different list of 20 rows will be highlighted because the selection process is random. If you compare notes with others in your class, you should discover that their samples comprise different countries. With the list of 20 countries now chosen, let’s create a new data table containing just the SRS of 20 countries.

      4. Select Tables ► Subset. This versatile dialog box (see Figure 2.3) enables us to build a table using the just-selected rows, or to randomly sample directly. Note that the dialog box opens in a fresh tab within the Project and also appears as an item in the Window List.

      5. As shown in the figure, choose Selected Rows and then change the Output table name to World Nations SRS. Then click OK.

      Figure 2.3: Creating a Subset of Selected Rows


      In the upper left corner of the data table window, there is a green arrow with the label Source. JMP inserts the JSL script that created the subset. Readers wishing to learn more about writing JMP scripts should right-click the green arrow, choose Edit, and see what a JSL script looks like.

      Before moving to the next example, let’s save the Project. All documents within a project must be saved individually before the project itself can be saved. We just created a new data table; save it now.

      6. File ► Save. The default name is World Nations SRS, which is fine. Place it in a folder of your choice.

      7. File ► Save Project. Choose a location and name this project Chap_02. Then, click OK. Among your Recent Files list, you will now find Chap_02.jmpprj.
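      For readers who want to see the logic of Steps 3 through 5 outside the JMP menus, the following Python sketch mimics the rule described in Step 3: a value below 1 is read as a sampling rate, and a larger value is read as a sample size n. It is an illustration only, not JMP's implementation. The function name select_randomly, the placeholder country labels, the rounding of rate times N, and the treatment of a value of exactly 1 as a sample size are all assumptions made for this example.

```python
import random


def select_randomly(rows, value, seed=None):
    """Return a simple random sample drawn without replacement from rows.

    A value strictly between 0 and 1 is read as a sampling rate;
    a value of 1 or more is read as a sample size n (treating exactly 1
    as a size is an assumption of this sketch).
    """
    if value <= 0:
        raise ValueError("value must be positive")
    if value < 1:
        n = round(value * len(rows))  # rate -> number of rows (assumed rounding)
    else:
        n = int(value)                # sample size
    if n > len(rows):
        raise ValueError("sample size exceeds the number of rows")
    rng = random.Random(seed)         # a seed makes the draw repeatable
    return rng.sample(rows, n)


# Hypothetical stand-in for the 215 rows of the World Nations table.
nations = [f"Country {i}" for i in range(1, 216)]

srs = select_randomly(nations, 20, seed=1)    # sample size of 20
by_rate = select_randomly(nations, 0.10)      # sampling rate of 10%

print(len(srs), len(by_rate))
```

      Calling the function again without a seed generally returns a different set of 20 rows, which parallels what happens when Step 3 is repeated in JMP.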

      Other Types of Random Sampling

      As noted previously, simple random sampling requires that we can identify and access all N elements within a population. Sometimes this is not practical, and there are several alternative strategies available. It is well beyond the scope of this chapter to discuss these strategies at length, but Chapter 8 provides basic coverage of some of these approaches.

      Non-Random Sampling

      This book is about practical data analysis, and in practice, many data tables contain data that were not generated by anything like a random sampling process. Most data collected within businesses and nonprofit organizations come from the normal operations of the organization rather than from a carefully constructed process of sampling. The data generated by “Internet of Things” devices, for example, are decidedly non-random. We can summarize and describe the data within a non-random sample but should be very cautious about the temptation to generalize from such samples. Whether we are conducting the analysis or reading about it, we always want to ask whether a given sample is likely to misrepresent the population or process from which it came. Voluntary response surveys, for example, are very likely to mislead us if only highly motivated individuals respond. On the other hand, if we watch the variation in stock prices during an uneventful period in the stock markets, we might reasonably expect that the sample could represent the process of stock market transactions.

      Big Data

      You might have heard or read about “Big Data”: high-volume raw data generated by numerous electronic technologies like cell phones, supermarket scanners, radio-frequency identification (RFID) chips, or other automated devices. The world-spanning, continuous stream of data carries huge potential for the future of data analytics and presents many ethical, technical, and economic challenges. In general, data generated in this way are not random in the conventional sense and don’t neatly fit into the traditional classifications of an introductory statistics course. Big data can include photographic images, video, or sound recordings that don’t easily occupy columns and rows in a data table. Furthermore, streaming data are neither cross-sectional nor time series in the usual sense.

      When the research concerns a population, the sampling approach is often cross-sectional, which is to say the researchers select individuals from the population at one period of time. Again, the individuals can be people, animals, firms, cells, plants, manufactured goods, or anything of interest to the researchers.

      When the research concerns a process, the sampling approach is more likely to be time series or longitudinal, whereby a single individual is repeatedly measured at regular time intervals. A great deal of business and economic data is longitudinal. For example, companies and governmental agencies track and report monthly sales, quarterly earnings, or annual employment. The major distinction between time series data and streaming data is whether observations occur according to a pre-determined schedule or whether they are event-driven (for example, when a customer places a cell phone call).

      Panel studies combine cross-sectional and time series approaches. In a panel study, researchers repeatedly gather data about the same group of individuals. Some long-term public health studies follow panels of individuals for many years; some marketing researchers use consumer panels to monitor changes in taste and consumer preferences.
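      The three layouts just described differ in how many individuals and how many time periods appear in the data. The Python sketch below makes that distinction concrete with a rough heuristic. The function name classify_layout, the "mixed" fallback label, and the firm and year labels are invented for illustration.

```python
def classify_layout(records):
    """Classify a list of (individual, period) pairs by sampling layout.

    This is a rough heuristic for illustration, not a formal definition.
    """
    individuals = {ind for ind, _ in records}
    periods_by_individual = {}
    for ind, per in records:
        periods_by_individual.setdefault(ind, set()).add(per)
    all_periods = set().union(*periods_by_individual.values()) if records else set()

    if len(individuals) > 1 and len(all_periods) == 1:
        return "cross-sectional"   # many individuals, one period
    if len(individuals) == 1 and len(all_periods) > 1:
        return "time series"       # one individual, many periods
    if len(individuals) > 1 and all(
        len(p) > 1 for p in periods_by_individual.values()
    ):
        return "panel"             # same individuals followed over time
    return "mixed"                 # anything else (assumed fallback label)


# Made-up labels; no real measurements are implied.
cross_section = [("Firm A", 2020), ("Firm B", 2020), ("Firm C", 2020)]
time_series = [("Firm A", 2018), ("Firm A", 2019), ("Firm A", 2020)]
panel = [("Firm A", 2019), ("Firm A", 2020), ("Firm B", 2019), ("Firm B", 2020)]

for data in (cross_section, time_series, panel):
    print(classify_layout(data))
```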

      If the goal of a study is to demonstrate a cause-and-effect relationship, then the ideal approach is a designed experiment. The hallmark features of an experiment are that the investigator controls and manipulates the values of one or more variables, randomly assigns treatments to observational units, and then observes changes in the response variable. For example, engineers in the concrete industry might want to know how varying the amount of different additives affects the strength of the concrete. A research team would plan an experiment in which they would systematically vary specific additives and conditions, then measure the strength of the resulting batch of concrete.

      Similarly, consider a large retail company that has a “customer loyalty” program, offering discounts to its regular customers who present their bar-coded key tags at the check-out counter. Suppose the firm wants to nudge customers to return to their stores more frequently and generates discount coupons that can be redeemed if the customer visits the store again within so many days. The marketing analysts in the company could design an experiment in which they vary the size of the discount and the expiration date of the offer, issue the different coupons to randomly chosen customers, and then see when customers return.
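      The coupon experiment can be sketched in a few lines of Python. The example crosses every discount size with every expiration period and then assigns customers to the resulting combinations at random. The discount levels, expiration periods, customer IDs, and the function name assign_treatments are hypothetical values chosen for illustration; the text does not specify them.

```python
import itertools
import random
from collections import Counter


def assign_treatments(customers, discounts, expiry_days, seed=None):
    """Randomly assign each customer one (discount, expiry) combination.

    Customers are shuffled and then dealt across the treatment
    combinations in turn, so group sizes differ by at most one.
    """
    treatments = list(itertools.product(discounts, expiry_days))
    shuffled = list(customers)
    random.Random(seed).shuffle(shuffled)   # the random assignment step
    return {
        cust: treatments[i % len(treatments)]
        for i, cust in enumerate(shuffled)
    }


# Hypothetical inputs: 24 customers, 3 discount sizes (%), 2 expiry periods (days).
customers = [f"C{i:03d}" for i in range(1, 25)]
plan = assign_treatments(customers, discounts=[5, 10, 15], expiry_days=[7, 14], seed=42)

group_sizes = Counter(plan.values())
for treatment, size in sorted(group_sizes.items()):
    print(treatment, size)
```

      With 24 customers and 6 combinations, each combination receives 4 customers. Because the shuffle decides who gets which coupon, differences in return behavior across groups are less likely to reflect pre-existing differences among customers.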

      In an experimental design, the causal variables are called factors, and the outcome variable is called