Big Data: Principles and best practices of scalable realtime data systems
Format: PDF / Kindle (mobi) / ePub
Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the Book
Web-scale applications like social networks, real-time analytics, and e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or velocity. Fortunately, scale and simplicity are not mutually exclusive.
Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.
This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.
What's Inside
- Introduction to big data systems
- Real-time processing of web-scale data
- Tools like Hadoop, Cassandra, and Storm
- Extensions to traditional database skills
About the Authors
Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.
Table of Contents
- A new paradigm for Big Data
PART 1 BATCH LAYER
- Data model for Big Data
- Data model for Big Data: Illustration
- Data storage on the batch layer
- Data storage on the batch layer: Illustration
- Batch layer
- Batch layer: Illustration
- An example batch layer: Architecture and algorithms
- An example batch layer: Implementation
PART 2 SERVING LAYER
- Serving layer
- Serving layer: Illustration
PART 3 SPEED LAYER
- Realtime views
- Realtime views: Illustration
- Queuing and stream processing
- Queuing and stream processing: Illustration
- Micro-batch stream processing
- Micro-batch stream processing: Illustration
- Lambda Architecture in depth
The next step is to load the views somewhere so that they can be queried. This is where the serving layer comes in. The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it (see figure 1.9). When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available. A serving layer database supports batch updates and random reads.

[Figure 1.9 Serving layer: the serving layer is updated by the batch layer]
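The swap-in behavior can be sketched in a few lines. This is a hypothetical in-memory stand-in for a serving-layer database, not any real system from the book: it serves random reads from the current batch view and atomically replaces the whole view when the batch layer finishes a recomputation.

```python
import threading

class ServingLayerView:
    # Hypothetical stand-in for a serving-layer database. Readers do random
    # reads against the current batch view; when the batch layer produces a
    # new view, swap() replaces the whole view at once.
    def __init__(self, view=None):
        self._view = view if view is not None else {}
        self._lock = threading.Lock()

    def read(self, key):
        # Random read against the currently loaded batch view.
        with self._lock:
            return self._view.get(key)

    def swap(self, new_view):
        # Atomic swap: readers never observe a half-loaded view.
        with self._lock:
            self._view = new_view
```

Note that the database never needs random writes, only bulk swaps and random reads, which is what keeps serving-layer databases simple.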
Data not matching these properties would indicate a problem in your system, and you wouldn't want it written to your master dataset. This may not seem like a limitation, because serialization frameworks work somewhat like schemas in relational databases. In fact, you may have found relational database schemas a pain to work with and may worry that making schemas even stricter would be even more painful. But we urge you not to confuse the incidental complexities of working with
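To make the idea concrete, here is a minimal sketch of rejecting invalid data before it reaches the master dataset. The record shape and validation rules are hypothetical illustrations, not the book's schema or a real serialization framework:

```python
def validate_pageview(record):
    # Hypothetical schema check: a pageview must carry a user ID, a URL,
    # and a non-negative integer timestamp. Anything else is rejected
    # before it can be written to the master dataset.
    if not record.get("user_id"):
        raise ValueError("missing user_id")
    if not record.get("url"):
        raise ValueError("missing url")
    ts = record.get("timestamp")
    if not isinstance(ts, int) or ts < 0:
        raise ValueError("timestamp must be a non-negative integer")
    return record
```

A serialization framework such as Thrift enforces this kind of property at the schema level rather than with hand-written checks, which is what makes a stricter schema a safety net rather than a burden.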
the load of serving the HTML. More practically, with algorithms that are parallelizable, you might be able to increase performance by adding more machines, but the improvements will diminish the more machines you add. This is because of the increased overhead and communication costs associated with having more machines. We delved into this discussion about scalability to set the scene for introducing MapReduce, a distributed computing paradigm that can be used to implement a batch layer. As we
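The paradigm can be sketched in plain Python, independent of any Hadoop API. This is an illustrative toy, not the book's code: the framework's job is the shuffle step between the two user-supplied functions, shown here with the classic word count.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user's mapper to every record, collecting (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user's reducer to each group of values.
    return {key: reducer(values) for key, values in groups.items()}

# Classic word count expressed in this model.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

counts = reduce_phase(shuffle(map_phase(["a b a", "b c"], word_mapper)), sum)
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real MapReduce system the map and reduce phases run in parallel across a cluster, with the shuffle moving data between machines; that distribution is exactly what the framework handles for you.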
re-implements the functionality of the Count and Sum aggregators without being able to reuse the code written for those aggregators. This is unfortunate because it’s more code to maintain, and every time an improvement is made to Count and Sum, those changes need to be incorporated into Average as well. It’s much better to define Average as the composition of a count aggregation, a sum aggregation, and the division function. Unfortunately, Pig’s abstractions don’t allow you to define Average in
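The kind of composition being argued for can be sketched as follows. The aggregator interface here is a hypothetical illustration, not Pig's actual API: Average is defined purely as Count, Sum, and a division, so improvements to Count and Sum carry over automatically.

```python
class Count:
    # Counts the number of values seen.
    def init(self): return 0
    def step(self, acc, x): return acc + 1

class Sum:
    # Sums the values seen.
    def init(self): return 0
    def step(self, acc, x): return acc + x

def compose(aggs, finish):
    # Run several aggregators over the data in a single pass, then
    # combine their results with the finishing function.
    def run(values):
        accs = [a.init() for a in aggs]
        for x in values:
            accs = [a.step(acc, x) for a, acc in zip(aggs, accs)]
        return finish(*accs)
    return run

# Average reuses Count and Sum instead of re-implementing them.
average = compose([Count(), Sum()], lambda n, s: s / n)
# average([2, 4, 6]) == 4.0
```

No counting or summing logic is duplicated in Average; it exists only as the glue between the two reused aggregators and the division function.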
situation occurs when only one identifier was ever recorded for the user. In these cases, the outer join ensures these person IDs aren't filtered from the results and will join those pageviews to a null value. The chosen person ID is the joined ID if it exists; otherwise it's the original person ID.

[Figure 8.14: outer join of [userid, url, timestamp] with [userid, normalized-id], then ChooseUserID(userid, normalized-id) -> (output-id), producing [output-id, url, timestamp]]
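The join and choose steps can be sketched together. The function and variable names below are illustrative, not the book's pipeline code; the key point is that users absent from the normalization map survive the outer join and keep their original IDs.

```python
def choose_user_id(userid, normalized_id):
    # Prefer the normalized ID produced by the join; fall back to the
    # original person ID when the join produced a null.
    return normalized_id if normalized_id is not None else userid

def normalize_pageviews(pageviews, id_map):
    # Outer join of [userid, url, timestamp] pageviews against the
    # [userid, normalized-id] map: users with no normalized ID join
    # against None and are kept rather than filtered out.
    mapping = dict(id_map)
    out = []
    for userid, url, ts in pageviews:
        normalized = mapping.get(userid)  # None when no ID was recorded
        out.append((choose_user_id(userid, normalized), url, ts))
    return out

views = [("u1", "/a", 100), ("u2", "/b", 101)]
ids = [("u1", "p9")]
# normalize_pageviews(views, ids) == [("p9", "/a", 100), ("u2", "/b", 101)]
```

An inner join would have silently dropped the pageview for "u2"; the outer join plus the choose function is what makes the normalization total over all users.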