Resource Management in Cluster Computing Platforms for Large Scale Data Processing

In the era of big data, cluster computing for large-scale data processing has become one of the most significant research areas. Many cluster computing frameworks and cluster resource management schemes have recently been developed to satisfy the increasing demand for processing large volumes of data. Among them, Apache Hadoop has become the de facto platform, widely adopted in both industry and academia for its scalability, simplicity, and fault tolerance. The original Hadoop platform was designed to closely resemble MapReduce, a programming paradigm for cluster computing proposed by Google. The platform has since evolved into its second generation, Hadoop YARN, which serves as a unified cluster resource management layer supporting the multiplexing of different cluster computing frameworks. A fundamental issue in this field is how to efficiently manage and schedule the execution of a large number of data processing jobs given the limited computing resources of a cluster. This dissertation therefore focuses on improving system efficiency and performance for cluster computing platforms, namely Hadoop MapReduce and Hadoop YARN, through the following new scheduling algorithms and resource management schemes.

First, we developed a Hadoop scheduler (LsPS) that improves average job response times by leveraging the job size patterns of different users, both to tune resource sharing between users and to choose a suitable scheduling policy for each user. We further presented a self-adjusting slot configuration scheme, named TuMM, for Hadoop MapReduce to reduce the makespan of batch jobs. TuMM abandons the static, manual slot configuration of the existing Hadoop MapReduce framework; instead, using a feedback control mechanism, it dynamically tunes the numbers of map and reduce slots on each cluster node based on monitored workload information to align the execution of the map and reduce phases.

The second main contribution of this dissertation is a new scheduler and resource management scheme for the next-generation Hadoop, i.e., Hadoop YARN. We designed a YARN scheduler, named HaSTE, which effectively reduces the makespan of MapReduce jobs on the YARN platform by leveraging information about requested resources, resource capacities, and dependencies between tasks. Moreover, we proposed an opportunistic scheduling scheme that reassigns reserved but idle resources to other waiting tasks; its major goal is to improve system resource utilization without incurring severe resource contention due to resource over-provisioning.

We implemented all of our resource management schemes in Hadoop MapReduce and Hadoop YARN, and evaluated their effectiveness on different cluster systems, including our local clusters and large clusters in the cloud, such as Amazon EC2. Representative benchmarks are used for sensitivity analysis and performance evaluation. Experimental results demonstrate that our new Hadoop/YARN schedulers and resource management schemes improve job response times, job makespan, and system utilization on both the Hadoop MapReduce and Hadoop YARN platforms.
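The TuMM scheme above is described as a feedback control loop that periodically rebalances the map and reduce slots on each node according to monitored workload. As a rough, illustrative sketch only, the Python snippet below shows one way such a proportional controller could be structured; the class name SlotTuner, the workload inputs, and the rebalancing heuristic are assumptions made for this example, not the dissertation's actual controller or any Hadoop API.

    # Minimal sketch of a feedback-driven slot tuner in the spirit of TuMM.
    # All names and the proportional heuristic are illustrative assumptions.

    class SlotTuner:
        def __init__(self, total_slots, min_slots=1):
            self.total_slots = total_slots        # slots available on one node
            self.min_slots = min_slots            # keep at least one slot per phase
            self.map_slots = total_slots // 2     # start with an even split
            self.reduce_slots = total_slots - self.map_slots

        def retune(self, pending_map_work, pending_reduce_work):
            """Called periodically with monitored workload (e.g., estimated
            remaining map and reduce work). Shifts slots toward the phase
            with more outstanding work so the two phases stay aligned."""
            total_work = pending_map_work + pending_reduce_work
            if total_work == 0:
                return self.map_slots, self.reduce_slots
            map_share = pending_map_work / total_work
            self.map_slots = max(self.min_slots,
                                 min(self.total_slots - self.min_slots,
                                     round(map_share * self.total_slots)))
            self.reduce_slots = self.total_slots - self.map_slots
            return self.map_slots, self.reduce_slots

    # Example: a node with 8 slots where map work still dominates.
    tuner = SlotTuner(total_slots=8)
    print(tuner.retune(pending_map_work=600, pending_reduce_work=200))  # -> (6, 2)

In this simplified sketch the slot split tracks the ratio of outstanding map to reduce work; the dissertation's controller additionally uses feedback from the observed phase execution to decide when and how far to shift slots.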
Data Processing and Modeling with Hadoop

Author: Vinicius Aquino do Vale
language: en
Publisher: BPB Publications
Release Date: 2021-10-12
Understand data in a simple way using a data lake.

KEY FEATURES
● In-depth practical demonstration of Hadoop/YARN concepts with numerous examples.
● Graphical illustrations and visual explanations of Hadoop commands and parameters.
● Details of dimensional modeling and Data Vault modeling.
● Details of how to create and define a structure for a data lake.

DESCRIPTION
The book 'Data Processing and Modeling with Hadoop' explains how a distributed system works and its benefits in the big data era in a straightforward and clear manner. After reading the book, you will be able to plan and organize projects involving a massive amount of data. The book describes the standards and technologies that aid in data management and compares them to other technology business standards. The reader receives practical guidance on how to segregate and separate data into zones, as well as how to develop a model that can aid in data evolution. It also discusses security and the measures used to mitigate security risks. Self-service analytics, Data Lake, Data Vault 2.0, and Data Mesh are covered as well. After reading this book, the reader will have a thorough understanding of how to structure a data lake, as well as the ability to plan, organize, and carry out the implementation of a data-driven business with full governance and security.

WHAT YOU WILL LEARN
● Learn the basics of the components of the Hadoop Ecosystem.
● Understand the structure, files, and zones of a Data Lake.
● Learn to implement security across the Hadoop Ecosystem.
● Learn to work with Data Vault 2.0 modeling.
● Learn to develop a strategy for defining good governance.
● Learn new tools for working with data and big data.

WHO THIS BOOK IS FOR
This book caters to big data developers, technical specialists, consultants, and students who want to build good proficiency in big data. Basic knowledge of SQL concepts, modeling, and development is helpful, although not mandatory.

TABLE OF CONTENTS
1. Understanding the Current Moment
2. Defining the Zones
3. The Importance of Modeling
4. Massive Parallel Processing
5. Doing ETL/ELT
6. A Little Governance
7. Talking About Security
8. What Are the Next Steps?
Optimized Cloud Resource Management and Scheduling

Optimized Cloud Resource Management and Scheduling identifies research directions and technologies that will facilitate efficient management and scheduling of computing resources in cloud data centers supporting scientific, industrial, business, and consumer applications. It serves as a valuable reference for systems architects, practitioners, developers, researchers, and graduate-level students.
● Explains how to optimally model and schedule computing resources in cloud computing
● Provides in-depth quality analysis of different load-balancing and energy-efficient scheduling algorithms for cloud data centers and Hadoop clusters
● Introduces real-world applications, including business, scientific, and related case studies
● Discusses different cloud platforms with real test-beds and simulation tools