Production-targeted Spark guidance with real-world use cases Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance toward using lightning-fast big-data clustering in production. Written by an expert team well-known in the big data community, this book walks you through the challenges in moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, ML Lib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, db connectors, streaming, security, and much more. Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings. Review Spark hardware requirements and estimate cluster size Gain insight from real-world production use cases Tighten security, schedule resources, and fine-tune performance Overcome common problems encountered using Spark in production Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.
Complete guidance for mastering the tools and techniques of the digital revolution With the digital revolution opening up tremendous opportunities in many fields, there is a growing need for skilled professionals who can develop data-intensive systems and extract information and knowledge from them. This book frames for the first time a new systematic approach for tackling the challenges of data-intensive computing, providing decision makers and technical experts alike with practical tools for dealing with our exploding data collections. Emphasizing data-intensive thinking and interdisciplinary collaboration, The Data Bonanza: Improving Knowledge Discovery in Science, Engineering, and Business examines the essential components of knowledge discovery, surveys many of the current research efforts worldwide, and points to new areas for innovation. Complete with a wealth of examples and DISPEL-based methods demonstrating how to gain more from data in real-world systems, the book: Outlines the concepts and rationale for implementing data-intensive computing in organizations Covers from the ground up problem-solving strategies for data analysis in a data-rich world Introduces techniques for data-intensive engineering using the Data-Intensive Systems Process Engineering Language DISPEL Features in-depth case studies in customer relations, environmental hazards, seismology, and more Showcases successful applications in areas ranging from astronomy and the humanities to transport engineering Includes sample program snippets throughout the text as well as additional materials on a companion website The Data Bonanza is a must-have guide for information strategists, data analysts, and engineers in business, research, and government, and for anyone wishing to be on the cutting edge of data mining, machine learning, databases, distributed systems, or large-scale computing.
The final edition of the incomparable data warehousing and business intelligence reference, updated and expanded The Kimball Group Reader, Remastered Collection is the essential reference for data warehouse and business intelligence design, packed with best practices, design tips, and valuable insight from industry pioneer Ralph Kimball and the Kimball Group. This Remastered Collection represents decades of expert advice and mentoring in data warehousing and business intelligence, and is the final work to be published by the Kimball Group. Organized for quick navigation and easy reference, this book contains nearly 20 years of experience on more than 300 topics, all fully up-to-date and expanded with 65 new articles. The discussion covers the complete data warehouse/business intelligence lifecycle, including project planning, requirements gathering, system architecture, dimensional modeling, ETL, and business intelligence analytics, with each group of articles prefaced by original commentaries explaining their role in the overall Kimball Group methodology. Data warehousing/business intelligence industry's current multi-billion dollar value is due in no small part to the contributions of Ralph Kimball and the Kimball Group. Their publications are the standards on which the industry is built, and nearly all data warehouse hardware and software vendors have adopted their methods in one form or another. This book is a compendium of Kimball Group expertise, and an essential reference for anyone in the field. Learn data warehousing and business intelligence from the field's pioneers Get up to date on best practices and essential design tips Gain valuable knowledge on every stage of the project lifecycle Dig into the Kimball Group methodology with hands-on guidance Ralph Kimball and the Kimball Group have continued to refine their methods and techniques based on thousands of hours of consulting and training. This Remastered Collection of The Kimball Group Reader represents their final body of knowledge, and is nothing less than a vital reference for anyone involved in the field.
A hands on guide to web scraping and text mining for both beginners and experienced users of R Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, SQL. Provides basic techniques to query web documents and data sets (XPath and regular expressions). An extensive set of exercises are presented to guide the reader through each technique. Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management. Case studies are featured throughout along with examples for each technique presented. R code and solutions to exercises featured in the book are provided on a supporting website.
Addresses the impacts of data mining on education and reviews applications in educational research teaching, and learning This book discusses the insights, challenges, issues, expectations, and practical implementation of data mining (DM) within educational mandates. Initial series of chapters offer a general overview of DM, Learning Analytics (LA), and data collection models in the context of educational research, while also defining and discussing data mining’s four guiding principles— prediction, clustering, rule association, and outlier detection. The next series of chapters showcase the pedagogical applications of Educational Data Mining (EDM) and feature case studies drawn from Business, Humanities, Health Sciences, Linguistics, and Physical Sciences education that serve to highlight the successes and some of the limitations of data mining research applications in educational settings. The remaining chapters focus exclusively on EDM’s emerging role in helping to advance educational research—from identifying at-risk students and closing socioeconomic gaps in achievement to aiding in teacher evaluation and facilitating peer conferencing. This book features contributions from international experts in a variety of fields. Includes case studies where data mining techniques have been effectively applied to advance teaching and learning Addresses applications of data mining in educational research, including: social networking and education; policy and legislation in the classroom; and identification of at-risk students Explores Massive Open Online Courses (MOOCs) to study the effectiveness of online networks in promoting learning and understanding the communication patterns among users and students Features supplementary resources including a primer on foundational aspects of educational mining and learning analytics Data Mining and Learning Analytics: Applications in Educational Research is written for both scientists in EDM and educators interested in using and integrating DM and LA to improve education and advance educational research.
How to effectively use BigQuery, avoid common mistakes, and execute sophisticated queries against large datasets Google BigQuery Analytics is the perfect guide for business and data analysts who want the latest tips on running complex queries and writing code to communicate with the BigQuery API. The book uses real-world examples to demonstrate current best practices and techniques, and also explains and demonstrates streaming ingestion, transformation via Hadoop in Google Compute engine, AppEngine datastore integration, and using GViz with Tableau to generate charts of query results. In addition to the mechanics of BigQuery, the book also covers the architecture of the underlying Dremel query engine, providing a thorough understanding that leads to better query results. Features a companion website that includes all code and data sets from the book Uses real-world examples to explain everything analysts need to know to effectively use BigQuery Includes web application examples coded in Python
Updated new edition of Ralph Kimball's groundbreaking book on dimensional modeling for data warehousing and business intelligence! The first edition of Ralph Kimball's The Data Warehouse Toolkit introduced the industry to dimensional modeling, and now his books are considered the most authoritative guides in this space. This new third edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever. It covers new and enhanced star schema dimensional modeling patterns, adds two new chapters on ETL techniques, includes new and expanded business matrices for 12 case studies, and more. Authored by Ralph Kimball and Margy Ross, known worldwide as educators, consultants, and influential thought leaders in data warehousing and business intelligence Begins with fundamental design recommendations and progresses through increasingly complex scenarios Presents unique modeling techniques for business applications such as inventory management, procurement, invoicing, accounting, customer relationship management, big data analytics, and more Draws real-world case studies from a variety of industries, including retail sales, financial services, telecommunications, education, health care, insurance, e-commerce, and more Design dimensional databases that are easy to understand and provide fast query response with The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.
Fundamentals of Big Data Network Analysis for Research and Industry Hyunjoung Lee, Institute of Green Technology, Yonsei University, Republic of Korea Il Sohn, Material Science and Engineering, Yonsei University, Republic of Korea Presents the methodology of big data analysis using examples from research and industry There are large amounts of data everywhere, and the ability to pick out crucial information is increasingly important. Contrary to popular belief, not all information is useful; big data network analysis assumes that data is not only large, but also meaningful, and this book focuses on the fundamental techniques required to extract essential information from vast datasets. Featuring case studies drawn largely from the iron and steel industries, this book offers practical guidance which will enable readers to easily understand big data network analysis. Particular attention is paid to the methodology of network analysis, offering information on the method of data collection, on research design and analysis, and on the interpretation of results. A variety of programs including UCINET, NetMiner, R, NodeXL, and Gephi for network analysis are covered in detail. Fundamentals of Big Data Network Analysis for Research and Industry looks at big data from a fresh perspective, and provides a new approach to data analysis. This book: Explains the basic concepts in understanding big data and filtering meaningful data Presents big data analysis within the networking perspective Features methodology applicable to research and industry Describes in detail the social relationship between big data and its implications Provides insight into identifying patterns and relationships between seemingly unrelated big data Fundamentals of Big Data Network Analysis for Research and Industry will prove a valuable resource for analysts, research engineers, industrial engineers, marketing professionals, and any individuals dealing with accumulated large data whose interest is to analyze and identify potential relationships among data sets.
Wring more out of the data with a scientific approach to analysis Graph Analysis and Visualization brings graph theory out of the lab and into the real world. Using sophisticated methods and tools that span analysis functions, this guide shows you how to exploit graph and network analytic techniques to enable the discovery of new business insights and opportunities. Published in full color, the book describes the process of creating powerful visualizations using a rich and engaging set of examples from sports, finance, marketing, security, social media, and more. You will find practical guidance toward pattern identification and using various data sources, including Big Data, plus clear instruction on the use of software and programming. The companion website offers data sets, full code examples in Python, and links to all the tools covered in the book. Science has already reaped the benefit of network and graph theory, which has powered breakthroughs in physics, economics, genetics, and more. This book brings those proven techniques into the world of business, finance, strategy, and design, helping extract more information from data and better communicate the results to decision-makers. Study graphical examples of networks using clear and insightful visualizations Analyze specifically-curated, easy-to-use data sets from various industries Learn the software tools and programming languages that extract insights from data Code examples using the popular Python programming language There is a tremendous body of scientific work on network and graph theory, but very little of it directly applies to analyst functions outside of the core sciences – until now. Written for those seeking empirically based, systematic analysis methods and powerful tools that apply outside the lab, Graph Analysis and Visualization is a thorough, authoritative resource.
Provides the fundamentals, technologies, and best practices in designing, constructing and managing mission critical, energy efficient data centers Organizations in need of high-speed connectivity and nonstop systems operations depend upon data centers for a range of deployment solutions. A data center is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes multiple power sources, redundant data communications connections, environmental controls (e.g., air conditioning, fire suppression) and security devices. With contributions from an international list of experts, The Data Center Handbook instructs readers to: Prepare strategic plan that includes location plan, site selection, roadmap and capacity planning Design and build «green» data centers, with mission critical and energy-efficient infrastructure Apply best practices to reduce energy consumption and carbon emissions Apply IT technologies such as cloud and virtualization Manage data centers in order to sustain operations with minimum costs Prepare and practice disaster reovery and business continuity plan The book imparts essential knowledge needed to implement data center design and construction, apply IT technologies, and continually improve data center operations.