Computational Statistics in Data Science. Group of authors.

Title	Computational Statistics in Data Science
Author	Group of authors
Genre	Mathematics
Series
Publisher	Mathematics
Year of publication	0
ISBN	9781119561088


issues to adapt to the high volume and complexity of data. While data streams give machine learning algorithms the opportunity to uncover useful and interesting patterns, traditional machine learning algorithms struggle to scale well enough to uncover the hidden value in the data stream [26].

3.2 Integration

Building a distributed framework in which every node has its own view of the data stream flow implies that each node is responsible for analyzing only a few sources. Aggregating these views to build a complete view is nontrivial. This calls for the development of an integration technique that can perform efficient operations across disparate datasets [27].
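As a minimal sketch of what aggregating per‐node views can look like, assume each node summarizes its local sources as per‐item counts. The names `local_view` and `merge_views` are hypothetical and do not come from the cited work. Counts are an easy case because they merge by simple addition; many other summaries do not, which is part of what makes integration hard in practice.

```python
from collections import Counter
from typing import Iterable


def local_view(events: Iterable[str]) -> Counter:
    """Summarize the events one node has seen as per-item counts."""
    return Counter(events)


def merge_views(views: Iterable[Counter]) -> Counter:
    """Combine per-node counts into a global view by summing them."""
    merged: Counter = Counter()
    for view in views:
        merged.update(view)  # Counter.update adds counts, it does not overwrite
    return merged


# Two nodes, each seeing only part of the overall stream.
node_a = local_view(["login", "click", "click"])
node_b = local_view(["click", "purchase"])
global_view = merge_views([node_a, node_b])
```

Neither node alone sees all three `click` events, but the merged view does.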

3.3 Fault‐Tolerance

For life‐critical systems, high fault‐tolerance is required. In streaming computing environments, where unbounded data are generated in real‐time, a highly fault‐tolerant and scalable system is required so that an application keeps working without interruption despite component failure. The most widely used fault‐tolerance technique is checkpointing, where the system state is periodically persisted so that the computational state can be recovered after a system failure. However, the overhead incurred by checkpointing can negatively affect system performance. Improved checkpointing schemes that minimize this overhead were proposed in [28, 29].
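The basic idea of checkpointing can be sketched as follows. This is an illustration under simplifying assumptions (a single operator, state small enough to serialize as JSON, a local file as the durable store), not the improved schemes of [28, 29]. The class name `CheckpointedCounter` and the `interval` parameter are hypothetical.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Running sum over a stream, persisted every `interval` events."""

    def __init__(self, path: str, interval: int = 100) -> None:
        self.path = path
        self.interval = interval
        self.state = {"offset": 0, "count": 0}
        # On (re)start, resume from the last checkpoint if one exists.
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def process(self, value: int) -> None:
        self.state["count"] += value
        self.state["offset"] += 1
        if self.state["offset"] % self.interval == 0:
            self._checkpoint()

    def _checkpoint(self) -> None:
        # Write to a temporary file, then rename atomically, so a crash
        # in the middle of a write never leaves a corrupt checkpoint.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "state.json")
    op = CheckpointedCounter(path, interval=3)
    for v in [1, 2, 3, 4]:  # the 4th event is processed but never checkpointed
        op.process(v)
    # Simulate a crash and restart: a new instance reloads the checkpoint.
    recovered = CheckpointedCounter(path, interval=3)
    recovered_state = dict(recovered.state)
```

After the simulated restart, the recovered state reflects only the first three events, so the source would have to replay the stream from offset 3. A smaller `interval` loses less work on failure but writes to disk more often, which is the overhead trade‐off described above.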

3.4 Timeliness

Time is essential for time‐sensitive processes, which include preventing fraud, mitigating security threats, and responding to natural disasters. The architectures or platforms that support such processes must be scalable to enable consistent handling of data streams [30]. The fundamental challenge lies in implementing a distributed architecture for data aggregation with negligible latency between the communicating nodes.

3.5 Consistency

Achieving high consistency or stability in data stream computing environments is nontrivial, as it is hard to determine which data are required and which nodes ought to be consistent [31, 32]. Thus, a well‐designed framework is required.

3.6 Heterogeneity and Incompleteness

Data streams are heterogeneous in structure, semantics, organization, granularity, and accessibility. Disparate sources and differing formats, combined with the volume of data, make integration, retrieval, and reasoning over the data stream a challenging task [33]. The challenge here is how to deal with ever‐growing data and how to extract, aggregate, and correlate data streams from numerous sources in real‐time. There is a need to design a suitable data representation that reflects the structure, hierarchy, and diversity of data streams.

3.7 Load Balancing

In an ideal situation, a data stream framework should be self‐adaptive and avoid load shedding. However, this is challenging because dedicating resources to cover peak loads 24/7 is rarely possible, and load shedding is not realistic, especially when the variance between the average load and the peak load is high [34]. Consequently, a distributed environment is required that can stream, analyze, and aggregate partial data streams to a global center when local resources become insufficient.

3.8 High Throughput

Deciding which portion of the data stream needs replication, how many replicas are required, and which part of the data stream to assign to each replica is an open issue in data stream computing environments. Proper replication across multiple instances is required if high throughput is to be achieved [35].
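One simple answer to the assignment question, shown here only as an illustrative sketch and not as the approach of [35], is to route each event to a replica by hashing a key. The function names `assign_replica` and `route`, and the `sensor-…` keys, are hypothetical. A stable hash from `hashlib` is used because Python's built‐in `hash()` for strings varies between processes.

```python
import hashlib
from collections import defaultdict


def assign_replica(key: str, num_replicas: int) -> int:
    """Map a stream key to a replica index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


def route(events, num_replicas):
    """Group (key, value) events by the replica that should process them."""
    buckets = defaultdict(list)
    for key, value in events:
        buckets[assign_replica(key, num_replicas)].append((key, value))
    return dict(buckets)


events = [("sensor-1", 20.5), ("sensor-2", 19.0), ("sensor-1", 21.0)]
routed = route(events, num_replicas=4)
```

All events with the same key land on the same replica and keep their order, which keeps per‐key state local. The sketch does not address how many replicas to use or how to handle skewed keys, which are the harder parts of the problem.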

3.9 Privacy

Data stream analytics opens the door to real‐time analysis of massive amounts of data but also poses a serious threat to individual privacy [36]. According to the International Data Corporation (IDC), half of the data that needs protection is not adequately protected. Relevant and efficient privacy‐preserving solutions for interpretation, observation, evaluation, and decision‐making in data stream mining should be designed [37]. The sensitive nature of data necessitates privacy‐preserving techniques. One of the leading privacy‐preserving techniques is perturbation [38].
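A minimal sketch of perturbation, assuming the simplest variant: independent zero‐mean Gaussian noise added to each numeric value before release. The function name `perturb`, the `scale` parameter, and the synthetic readings are illustrative assumptions. This is not necessarily the scheme of [38], and it carries no formal privacy guarantee.

```python
import random
from statistics import mean


def perturb(values, scale, rng):
    """Add independent zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0.0, scale) for v in values]


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
readings = [float(50 + (i % 10)) for i in range(1000)]  # synthetic data
released = perturb(readings, scale=5.0, rng=rng)

# Individual values are masked, but aggregates stay close to the truth
# because the noise averages out over many records.
mean_error = abs(mean(released) - mean(readings))
```

A larger `scale` hides individual values better but degrades the accuracy of any analysis run on the released stream.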

3.10 Accuracy

Developing efficient methods that can precisely predict future observations is one of the leading goals of data stream analysis. Yet the intrinsic features of data streams, which include noisy characteristics, velocity, volume, variety, value, veracity, variability, and volatility, strongly constrain processing algorithms in both space and time. Hence, to guarantee high accuracy, these challenges must be mitigated, as they can negatively influence the accuracy of data stream analysis [39].

4 Streaming Data Tools and Technologies

The demand for stream processing is increasing, and data have to be processed fast to make decisions in real‐time. Because of the growing interest in streaming data analysis, a large number of streaming data solutions have been created by both the open‐source community and enterprise technology vendors [10]. According to Millman [40], there are a few elements to consider when choosing data stream tools and technologies in order to make effective data management decisions. Those elements include the shape of the data, data accessibility, availability and consistency requirements, and workload. Some prominent open‐source tools and technologies for data stream analytics include NoSQL [41], Apache Spark [42–44], Apache Storm [45], Apache Samza [46, 47], Yahoo! S4 [48], Photon [49], Apache Aurora [50], EsperTech [51], SAMOA [52], C‐SPARQL [53], CQELS [54], ETALIS [55], and SpagoWorld [56]. Some proprietary tools and technologies for streaming data are Cloudet [57], Sentiment Brand Monitoring [58], Elastic Streaming Processing Engine [59], IBM InfoSphere Streams [16, 60, 61], Google MillWheel [46], Infochimps Cloud [56], Azure Stream [62], Microsoft Stream Insight [63], TIBCO StreamBase [64], Lambda Architecture [6], IoTSim‐Stream [65], and Apama Stream [62].

5 Streaming Data Pre‐Processing: Concept and Implementation

Data stream pre‐processing, which aims to reduce the inherent complexity of streaming data so that the learning process becomes faster, more interpretable, and more precise, is an essential technique in knowledge discovery. However, despite the recorded growth in online learning, data stream pre‐processing methods still have a long way to go because of the high level of noise [66]. This noise includes short messages, slang, abbreviations, acronyms, mixed languages, grammatical and spelling mistakes, irregular, informal, and shortened words, and improper sentence structure, all of which make it hard for learning algorithms to perform efficiently and effectively [67]. Additionally, errors in sensor readings caused by low battery, damage, or incorrect calibration, among others, can render the data delivered by such sensors unsuitable for analysis [68].
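To make the kind of text noise described above concrete, the following sketch normalizes a short social media post by lowercasing it, stripping punctuation, and expanding slang through a lookup table. The `SLANG` table and the `normalize` function are hypothetical and deliberately tiny. Real slang lexicons are far larger, and this sketch does not handle spelling errors or mixed languages.

```python
import re

# Hypothetical, tiny lookup table used only for illustration.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks", "b4": "before"}


def normalize(post: str) -> str:
    """Lowercase a post, drop punctuation, and expand known slang tokens."""
    text = post.lower()
    # Keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text)
    # Replace tokens found in the table; leave unknown tokens unchanged.
    return " ".join(SLANG.get(tok, tok) for tok in tokens)


cleaned = normalize("Thx, u r GR8!!!")
```

Tokens missing from the table, such as `r` here, pass through unchanged, which shows why dictionary coverage matters for this kind of pre‐processing.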

Data quality is a fundamental determinant in the knowledge discovery pipeline, as low‐quality data yield low‐quality models and decisions [69]. There is a need to strengthen the data stream pre‐processing stage in the face of the multi‐label [70], imbalance [71], and multi‐instance [72] problems associated with data streams [66]. Data stream pre‐processing techniques with low computational requirements [73] also need to be developed, as this is still an open research area. Moreover, social media posts must be represented in a way that preserves the semantics of their content [74, 75]. To improve the results of data stream analysis, there is a need to develop frameworks that can cope with noise, redundancy, heterogeneity, data imbalance, transformation, and feature representation or selection issues in data streams [26]. Some of the new frameworks developed for pre‐processing and enriching data streams for better results are SlangSD [76], N‐gram and Hidden Markov Model [77], SLANGZY [78], and SMFP [67].