Section III - Big Data Stream Techniques and Algorithms
by Hai Jiang, Laurence T. Yang, Alfredo Cuzzocrea, Kuan-Ching Li
Big Data
Foreword by Jack Dongarra
Foreword by Dr. Yi Pan
Foreword by D. Frank Hsu
Preface
Editors
Contributors
Section I - Big Data Management
Chapter 1 - Scalable Indexing for Big Data Processing
Abstract
1.1 Introduction
1.2 Permutation-Based Indexing
1.2.1 Indexing Model
1.2.2 Technical Implementation
1.3 Related Data Structures
1.3.1 Metric Inverted Files
1.3.2 Brief Permutation Index
1.3.3 Prefix Permutation Index
1.3.4 Neighborhood Approximation
1.3.5 Metric Suffix Array
1.3.6 Metric Permutation Table
1.4 Distributed Indexing
1.4.1 Data Based
1.4.1.1 Indexing
1.4.1.2 Searching
1.4.2 Reference Based
1.4.2.1 Indexing
1.4.2.2 Searching
1.4.3 Index Based
1.4.3.1 Indexing
1.4.3.2 Searching
1.5 Evaluation
1.5.1 Recall and Position Error
1.5.2 Indexing and Searching Performance
1.5.3 Big Data Indexing and Searching
1.6 Conclusion
Acknowledgment
References
Chapter 2 - Scalability and Cost Evaluation of Incremental Data Processing Using Amazon’s Hadoop Service
Abstract
2.1 Introduction
2.2 Introduction of MapReduce and Apache Hadoop
2.3 A Motivating Application: Movie Ratings from Netflix Prize
2.4 Implementation in Hadoop
2.5 Deployment Architecture
2.6 Scalability and Cost Evaluation
2.7 Discussions
2.8 Related Work
2.9 Conclusion
Acknowledgment
References
Appendix 2.A: Source Code of Mappers and Reducers
Chapter 3 - Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces
Abstract
3.1 Introduction
3.2 Data Reduction Methods and SVD
3.3 Clustering Methods
3.3.1 Partitioning Methods
3.3.2 Hierarchical Clustering
3.3.3 Density-Based Methods
3.3.4 Grid-Based Methods
3.3.5 Subspace Clustering Methods
3.4 Steps in Building an Index for k-NN Queries
3.5 Nearest Neighbors Queries in High-Dimensional Space
3.6 Alternate Method Combining SVD and Clustering
3.7 Survey of High-Dimensional Indices
3.8 Conclusions
Acknowledgments
References
Appendix 3.A: Computing Approximate Distances with Dimensionality-Reduced Data
Chapter 4 - Multiple Sequence Alignment and Clustering with Dot Matrices, Entropy, and Genetic Algorithms
Abstract
4.1 Introduction
4.2 CDM
4.3 PEA
4.4 Divide and Conquer
4.5 GAs
4.6 DCGA
4.7 K-Means
4.8 Clustering Genetic Algorithm with the SSE Criterion
4.9 MapReduce Section
4.10 Simulation
4.11 Conclusion
References
Section II - Big Data Processing
Chapter 5 - Approaches for High-Performance Big Data Processing: Applications and Challenges
Abstract
5.1 Introduction
5.2 Big Data Definition and Concepts
5.3 Cloud Computing for Big Data Analysis
5.3.1 Data Analytics Tools as SaaS
5.3.2 Computing as IaaS
5.4 Challenges and Current Research Directions
5.5 Conclusions and Perspectives
References
Chapter 6 - The Art of Scheduling for Big Data Science
Abstract
6.1 Introduction
6.2 Requirements for Scheduling in Big Data Platforms
6.3 Scheduling Models and Algorithms
6.4 Data Transfer Scheduling
6.5 Scheduling Policies
6.6 Optimization Techniques for Scheduling
6.7 Case Study on Hadoop and Big Data Applications
6.8 Conclusions
References
Chapter 7 - Time–Space Scheduling in the MapReduce Framework
Abstract
7.1 INTRODUCTION
7.2 OVERVIEW OF BIG DATA PROCESSING ARCHITECTURE
7.3 SELF-ADAPTIVE REDUCE TASK SCHEDULING
7.3.1 Problem Analysis
7.3.2 Runtime Analysis of MapReduce Jobs
7.3.3 A Method of Reduce Task Start-Time Scheduling
7.4 REDUCE PLACEMENT
7.4.1 Optimal Algorithms for Cross-Rack Communication Optimization
7.4.2 Locality-Aware Reduce Task Scheduling
7.4.3 MapReduce Network Traffic Reduction
7.4.4 The Source of MapReduce Skews
7.4.5 Reduce Placement in Hadoop
7.5 NER IN BIOMEDICAL BIG DATA MINING: A CASE STUDY
7.5.1 Biomedical Big Data
7.5.2 Biomedical Text Mining and NER
7.5.3 MapReduce for CRFs
7.6 CONCLUDING REMARKS
References
Chapter 8 - GEMS: Graph Database Engine for Multithreaded Systems
Abstract
8.1 INTRODUCTION
8.2 RELATED INFRASTRUCTURES
8.3 GEMS OVERVIEW
8.4 GMT ARCHITECTURE
8.4.1 GMT: Aggregation
8.4.2 GMT: Fine-Grained Multithreading
8.5 EXPERIMENTAL RESULTS
8.5.1 Synthetic Benchmarks
8.5.2 BSBM
8.5.3 RDESC
8.6 CONCLUSIONS
References
Chapter 9 - KSC-net: Community Detection for Big Data Networks
Abstract
9.1 INTRODUCTION
9.2 KSC FOR BIG DATA NETWORKS
9.2.1 Notations
9.2.2 FURS Selection
9.2.3 KSC Framework
9.2.3.1 Training Model
9.2.3.2 Model Selection
9.2.3.3 Out-of-Sample Extension
9.2.4 Practical Issues
9.3 KSC-net SOFTWARE
9.3.1 KSC Demo on Synthetic Network
9.3.2 KSC Subfunctions
9.3.3 KSC Demo on Real-Life Network
9.4 CONCLUSION
Acknowledgments
References
Chapter 10 - Making Big Data Transparent to the Software Developers’ Community
Abstract
10.1 Introduction
10.2 Software Developers’ Information Needs
10.2.1 Information Needs: Core Work Practice
10.2.2 Information Needs: Constructing and Maintaining Relationships
10.2.3 Information Needs: Professional/Career Development
10.3 Software Developers’ Ecosystem
10.3.1 Social Media Use
10.3.2 The Ecosystem
10.4 Information Overload and Awareness Issue
10.5 The Application of Big Data to Support the Software Developers’ Community
10.5.1 Data Generated from Core Practices
10.5.2 Software Analytics
10.6 Conclusion
References
Section III - Big Data Stream Techniques and Algorithms
Chapter 11 - Key Technologies for Big Data Stream Computing
Abstract
11.1 INTRODUCTION
11.1.1 Stream Computing
11.1.2 Application Background
11.1.3 Chapter Organization
11.2 OVERVIEW OF A BDSC SYSTEM
11.2.1 Directed Acyclic Graph and Stream Computing
11.2.2 System Architecture for Stream Computing
11.2.3 Key Technologies for BDSC Systems
11.2.3.1 System Structure
11.2.3.2 Data Stream Transmission
11.2.3.3 Application Interfaces
11.2.3.4 High Availability
11.3 EXAMPLE BDSC SYSTEMS
11.3.1 Twitter Storm
11.3.1.1 Task Topology
11.3.1.2 Fault Tolerance
11.3.1.3 Reliability
11.3.1.4 Storm Cluster
11.3.2 Yahoo! S4
11.3.2.1 Processing Element
11.3.2.2 Processing Nodes
11.3.2.3 Fail-Over, Checkpointing, and Recovery Mechanism
11.3.2.4 System Architecture
11.3.3 Microsoft TimeStream and Naiad
11.3.3.1 TimeStream
11.3.3.2 Naiad
11.4 FUTURE PERSPECTIVE
11.4.1 Grand Challenges
11.4.1.1 High Scalability
11.4.1.2 High Fault Tolerance
11.4.1.3 High Consistency
11.4.1.4 High Load Balancing
11.4.1.5 High Throughput
11.4.2 On-the-Fly Work
Acknowledgments
References
Chapter 12 - Streaming Algorithms for Big Data Processing on Multicore Architecture
Abstract
12.1 Introduction
12.2 An Unconventional Big Data Processor
12.2.1 Terminology
12.2.2 Overview of Hadoop
12.2.3 Hadoop Alternative: Big Data Replay
12.3 Putting the Pieces Together
12.3.1 More on the Scope of the Problem
12.3.2 Overview of Literature
12.4 The Data Streaming Problem
12.4.1 Data Streaming Terminology
12.4.2 Related Information Theory and Formulations
12.4.3 Practical Applications and Designs
12.5 Practical Hashing and Bloom Filters
12.5.1 Bloom Filters: Store, Lookup, and Efficiency
12.5.2 Unconventional Bloom Filter Designs for Data Streams
12.5.3 Practical Data Streaming Targets
12.6 Big Data Streaming Optimization
12.6.1 A Simple Model of a Data Streaming Process
12.6.2 Streaming on Multicore
12.6.3 Performance Metrics
12.6.4 Example Analysis
12.7 Big Data Streaming on Multicore Technology
12.7.1 Parallel Processing Basics
12.7.2 DLL
12.7.3 Lock-Free Parallelization
12.7.4 Software APIs
12.8 Summary
References
Chapter 13 - Organic Streams: A Unified Framework for Personal Big Data Integration and Organization Towards Social Sharing and Individualized Sustainable Use
Abstract
13.1 Introduction
13.2 Overview of Related Work
13.3 Organic Stream: Definitions and Organizations
13.3.1 Metaphors and Graph Model
13.3.2 Definition of Organic Stream
13.3.3 Organization of Social Streams
13.4 Experimental Result and Analysis
13.4.1 Functional Modules
13.4.2 Experiment Analysis
13.5 Summary
References
Chapter 14 - Managing Big Trajectory Data: Online Processing of Positional Streams
Abstract
14.1 Introduction
14.2 Trajectory Representation and Management
14.3 Online Trajectory Compression with Spatiotemporal Criteria
14.4 Amnesic Multiresolution Trajectory Synopses
14.5 Continuous Range Search over Uncertain Locations
14.6 Multiplexing of Evolving Trajectories
14.7 Toward Next-Generation Management of Big Trajectory Data
References
Section IV - Big Data Privacy
Chapter 15 - Personal Data Protection Aspects of Big Data
Abstract
15.1 Introduction
15.1.1 Topic and Aim
15.1.2 Note to the Reader, Structure, and Arguments
15.2 Data Protection Aspects
15.2.1 Big Data and Analytics in Four Steps
15.2.2 Personal Data
15.2.2.1 Profiling Activities on Personal Data
15.2.2.2 Pseudonymization
15.2.2.3 Anonymous Data
15.2.2.4 Reidentification
15.2.3 Purpose Limitation
15.3 Conclusions and Recommendations
References
Chapter 16 - Privacy-Preserving Big Data Management: The Case of OLAP
Abstract
16.1 Introduction
16.1.1 Problem Definition
16.1.2 Chapter Organization
16.2 Literature Overview and Survey
16.2.1 Privacy-Preserving OLAP in Centralized Environments
16.2.2 Privacy-Preserving OLAP in Distributed Environments
16.3 Fundamental Definitions and Formal Tools
16.4 Dealing with Overlapping Query Workloads
16.5 Metrics for Modeling and Measuring Accuracy
16.6 Metrics for Modeling and Measuring Privacy
16.7 Accuracy and Privacy Thresholds
16.8 Accuracy Grids and Multiresolution Accuracy Grids: Conceptual Tools for Handling Accuracy and Privacy
16.9 An Effective and Efficient Algorithm for Computing Synopsis Data Cubes
16.9.1 Allocation Phase
16.9.2 Sampling Phase
16.9.3 Refinement Phase
16.9.4 The computeSynDataCube Algorithm
16.10 Experimental Assessment and Analysis
16.11 Conclusions and Future Work
References
Section V - Big Data Applications
Chapter 17 - Big Data in Finance
Background
17.1 Introduction
17.2 Financial Domain Dynamics
17.2.1 Historical Landscape versus Emerging Trends
17.3 Financial Capital Market Domain: In-Depth View
17.3.1 Big Data Origins
17.3.2 Information Flow
17.3.3 Data Analytics
17.4 Emerging Big Data Landscape in Finance
17.4.1 Challenges
17.4.2 New Models of Computation and Novel Architectures
17.5 Impact on Financial Research and Emerging Research Landscape
17.5.1 Background
17.5.2 UHFD (Big Data)–Driven Research
17.5.3 UHFD (Big Data) Implications
17.5.4 UHFD (Big Data) Challenges
17.6 Summary
References
Chapter 18 - Semantic-Based Heterogeneous Multimedia Big Data Retrieval
Abstract
18.1 Introduction
18.2 Related Work
18.3 Proposed Framework
18.3.1 Overview
18.3.2 Semantic Annotation
18.3.3 Optimization and User Feedback
18.3.4 Semantic Representation
18.3.5 NoSQL-Based Semantic Storage
18.3.6 Heterogeneous Multimedia Retrieval
18.4 Performance Evaluation
18.4.1 Running Environment and Software Tools
18.4.2 Performance Evaluation Model
18.4.3 Precision Ratio Evaluation
18.4.4 Time and Storage Cost
18.5 Discussions and Conclusions
Acknowledgments
References
Chapter 19 - Topic Modeling for Large-Scale Multimedia Analysis and Retrieval
Abstract
19.1 Introduction
19.2 Large-Scale Computing Frameworks
19.3 Probabilistic Topic Modeling
19.4 Couplings among Topic Models, Cloud Computing, and Multimedia Analysis
19.4.1 Large-Scale Topic Modeling
19.4.2 Topic Modeling for Multimedia
19.4.3 Large-Scale Computing in Multimedia
19.5 Large-Scale Topic Modeling for Multimedia Retrieval and Analysis
19.6 Conclusions and Future Directions
References
Chapter 20 - Big Data Biometrics Processing: A Case Study of an Iris Matching Algorithm on Intel Xeon Phi
Abstract
20.1 Introduction
20.2 Background
20.2.1 Intel Xeon Phi
20.2.2 Iris Matching Algorithm
20.2.3 OpenMP
20.2.4 Intel VTune Amplifier
20.3 Experiments
20.3.1 Experiment Setup
20.3.2 Workload Characteristics
20.3.3 Impact of Different Affinity
20.3.4 Optimal Number of Threads
20.3.5 Vectorization
20.4 Conclusions
Acknowledgments
References
Chapter 21 - Storing, Managing, and Analyzing Big Satellite Data: Experiences and Lessons Learned from a Real-World Application
21.1 Introduction
21.2 The Landsat Program
21.3 New Challenges and Solutions
21.3.1 The Conventional Satellite Imagery Distribution System
21.3.2 The New Satellite Data Distribution Policy
21.3.3 Impact on the Data Process Work Flow
21.3.4 Impact on the System Architecture, Hardware, and Software
21.3.5 Impact on the Characteristics of Users and Their Behaviors
21.3.6 The New System Architecture
21.4 Using Big Data Analytics to Improve Performance and Reduce Operation Cost
21.4.1 Vis-EROS: Big Data Visualization
21.4.2 FastStor: Data Mining-Based Multilayer Prefetching
21.5 Conclusions: Experiences and Lessons Learned
Acknowledgments
References
Chapter 22 - Barriers to the Adoption of Big Data Applications in the Social Sector
22.1 Introduction
22.2 The Potential of Big Data: Benefits to the Social Sector—From Business to Social Enterprise to NGO
22.3 How NGOs can Leverage Big Data to Achieve Their Missions
22.4 Historical Limitations and Considerations
22.5 The Gap in Understanding within the Social Sector
22.6 Next Steps: How to Bridge the Gap
22.7 Conclusion
References
III
Big Data Stream Techniques and Algorithms