Take a look at our Big Data books. Shulph carries a great selection of Big Data books, and we are always adding more.
Get to grips with processing large volumes of data and presenting it as engaging, interactive insights using Spark and Python. Key Features Get a hands-on, fast-paced introduction to the Python data science stack Explore ways to create useful metrics and statistics from large datasets Create detailed analysis reports with real-world data Book Description Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for posterior analysis, extract statistical measurements, and transform datasets into features for other systems. The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. With multiple hands-on activities in store, you'll be able to analyze data that is distributed on several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the entire data cannot be accommodated in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools. By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs. What you will learn Use Python to read and transform data into different formats Generate basic statistics and metrics using data on disk Work with computing tasks distributed over a cluster Convert data from various sources into storage or querying formats Prepare data for statistical analysis, visualization, and machine learning Present data in the form of effective visuals Who this book is for Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.
No need to spend hours ploughing through endless data – let Spark, one of the fastest big data processing engines available, do the hard work for you. Key Features Get up and running with Apache Spark and Python Integrate Spark with AWS for real-time analytics Apply processed data streams to machine learning APIs of Apache Spark Book Description Processing big data in real time is challenging due to scalability, information consistency, and fault-tolerance. This book teaches you how to use Spark to make your overall analytical workflow faster and more efficient. You'll explore all core concepts and tools within the Spark ecosystem, such as Spark Streaming, the Spark Streaming API, machine learning extension, and structured streaming. You'll begin by learning data processing fundamentals using Resilient Distributed Datasets (RDDs), SQL, Datasets, and Dataframes APIs. After grasping these fundamentals, you'll move on to using Spark Streaming APIs to consume data in real time from TCP sockets, and integrate Amazon Web Services (AWS) for stream consumption. By the end of this book, you'll not only have understood how to use machine learning extensions and structured streams but you'll also be able to apply Spark in your own upcoming big data projects. What you will learn Write your own Python programs that can interact with Spark Implement data stream consumption using Apache Spark Recognize common operations in Spark to process known data streams Integrate Spark streaming with Amazon Web Services (AWS) Create a collaborative filtering model with the movielens dataset Apply processed data streams to Spark machine learning APIs Who this book is for Data Processing with Apache Spark is for you if you are a software engineer, architect, or IT professional who wants to explore distributed systems and big data analytics. Although you don't need any knowledge of Spark, prior experience of working with Python is recommended.
Data has dramatically changed how our world works. From entertainment to politics, from technology to advertising and from science to the business world, understanding and using data is now one of the most transferable and transferable skills out there. Learning how to work with data may seem intimidating or difficult but with Confident Data Skills you will be able to master the fundamentals and supercharge your professional abilities. This essential book covers data mining, preparing data, analysing data, communicating data, financial modelling, visualizing insights and presenting data through film making and dynamic simulations. In-depth international case studies from a wide range of organizations, including Netflix, LinkedIn, Goodreads, Deep Blue, Alpha Go and Mike's Hard Lemonade Co. show successful data techniques in practice and inspire you to turn knowledge into innovation. Confident Data Skills also provides insightful guidance on how you can use data skills to enhance your employability and improve how your industry or company works through your data skills. Expert author and instructor, Kirill Eremenko, is committed to making the complex simple and inspiring you to have the confidence to develop an understanding, adeptness and love of data.
Data has dramatically changed how our world works. Understanding and using data is now one of the most transferable and desirable skills. Whether you're an entrepreneur wanting to boost your business, a jobseeker looking for that employable edge, or simply hoping to make the most of your current career, Confident Data Skills is here to help. This updated second edition takes you through the basics of data: from data mining and preparing and analysing your data, to visualizing and communicating your insights. It now contains exciting new content on neural networks and deep learning. Featuring in-depth international case studies from companies including Amazon, LinkedIn and Mike's Hard Lemonade Co, as well as easy-to understand language and inspiring advice and guidance, Confident Data Skills will help you use your new-found data skills to give your career that cutting-edge boost. About the Confident series... From coding and web design to data, digital content and cyber security, the Confident books are the perfect beginner's resource for enhancing your professional life, whatever your career path.
Use PySpark to easily crush messy data at-scale and discover proven techniques to create testable, immutable, and easily parallelizable Spark jobs Key Features Work with large amounts of agile data using distributed datasets and in-memory caching Source data from all popular data hosting platforms, such as HDFS, Hive, JSON, and S3 Employ the easy-to-use PySpark API to deploy big data Analytics for production Book Description Apache Spark is an open source parallel-processing framework that has been around for quite some time now. One of the many uses of Apache Spark is for data analytics applications across clustered computers. In this book, you will not only learn how to use Spark and the Python API to create high-performance analytics with big data, but also discover techniques for testing, immunizing, and parallelizing Spark jobs. You will learn how to source data from all popular data hosting platforms, including HDFS, Hive, JSON, and S3, and deal with large datasets with PySpark to gain practical big data experience. This book will help you work on prototypes on local machines and subsequently go on to handle messy data in production and at scale. This book covers installing and setting up PySpark, RDD operations, big data cleaning and wrangling, and aggregating and summarizing data into useful reports. You will also learn how to implement some practical and proven techniques to improve certain aspects of programming and administration in Apache Spark. By the end of the book, you will be able to build big data analytical solutions using the various PySpark offerings and also optimize them effectively. What you will learn Get practical big data experience while working on messy datasets Analyze patterns with Spark SQL to improve your business intelligence Use PySpark's interactive shell to speed up development time Create highly concurrent Spark programs by leveraging immutability Discover ways to avoid the most expensive operation in the Spark API: the shuffle operation Re-design your jobs to use reduceByKey instead of groupBy Create robust processing pipelines by testing Apache Spark jobs Who this book is for This book is for developers, data scientists, business analysts, or anyone who needs to reliably analyze large amounts of large-scale, real-world data. Whether you're tasked with creating your company's business intelligence function or creating great data platforms for your machine learning models, or are looking to use code to magnify the impact of your business, this book is for you.
Solve all big data problems by learning how to create efficient data models Key Features Create effective models that get the most out of big data Apply your knowledge to datasets from Twitter and weather data to learn big data Tackle different data modeling challenges with expert techniques presented in this book Book Description Modeling and managing data is a central focus of all big data projects. In fact, a database is considered to be effective only if you have a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements. To start with, you'll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you'll work with structured and semi-structured data with the help of real-life examples. Once you've got to grips with the basics, you'll use the SQL Developer Data Modeler to create your own data models containing different file types such as CSV, XML, and JSON. You'll also learn to create graph data models and explore data modeling with streaming data using real-world datasets. By the end of this book, you'll be able to design and develop efficient data models for varying data sizes easily and efficiently. What you will learn Get insights into big data and discover various data models Explore conceptual, logical, and big data models Understand how to model data containing different file types Run through data modeling with examples of Twitter, Bitcoin, IMDB and weather data modeling Create data models such as Graph Data and Vector Space Model structured and unstructured data using Python and R Who this book is for This book is great for programmers, geologists, biologists, and every professional who deals with spatial data. If you want to learn how to handle GIS, GPS, and remote sensing data, then this book is for you. Basic knowledge of R and QGIS would be helpful.
Combine advanced analytics including Machine Learning, Deep Learning Neural Networks and Natural Language Processing with modern scalable technologies including Apache Spark to derive actionable insights from Big Data in real-time Key Features Make a hands-on start in the fields of Big Data, Distributed Technologies and Machine Learning Learn how to design, develop and interpret the results of common Machine Learning algorithms Uncover hidden patterns in your data in order to derive real actionable insights and business value Book Description Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be used for almost any purpose, from analyzing consumer habits to fighting disease and serious organized crime. Ultimately, we manage data in order to derive value from it, and many organizations around the world have traditionally invested in technology to help process their data faster and more efficiently. But we now live in an interconnected world driven by mass data creation and consumption where data is no longer rows and columns restricted to a spreadsheet, but an organic and evolving asset in its own right. With this realization comes major challenges for organizations: how do we manage the sheer size of data being created every second (think not only spreadsheets and databases, but also social media posts, images, videos, music, blogs and so on)? And once we can manage all of this data, how do we derive real value from it? The focus of Machine Learning with Apache Spark is to help us answer these questions in a hands-on manner. We introduce the latest scalable technologies to help us manage and process big data. We then introduce advanced analytical algorithms applied to real-world use cases in order to uncover patterns, derive actionable insights, and learn from this big data. What you will learn Understand how Spark fits in the context of the big data ecosystem Understand how to deploy and configure a local development environment using Apache Spark Understand how to design supervised and unsupervised learning models Build models to perform NLP, deep learning, and cognitive services using Spark ML libraries Design real-time machine learning pipelines in Apache Spark Become familiar with advanced techniques for processing a large volume of data by applying machine learning algorithms Who this book is for This book is aimed at Business Analysts, Data Analysts and Data Scientists who wish to make a hands-on start in order to take advantage of modern Big Data technologies combined with Advanced Analytics.
Take the strategic and systematic approach to analyze data to solve business problems Key Features Gain detailed information about the theory of data science Augment your coding knowledge with practical data science techniques for efficient data analysis Learn practical ways to strategically and systematically use data Book Description Principles of Strategic Data Science is created to help you join the dots between mathematics, programming, and business analysis. With a unique approach that bridges the gap between mathematics and computer science, this book takes you through the entire data science pipeline. The book begins by explaining what data science is and how organizations can use it to revolutionize the way they use their data. It then discusses the criteria for the soundness of data products and how to best visualize information. As you progress, you'll discover the strategic aspects of data science by learning the five-phase framework that enables you to enhance the value you extract from data. The final chapter of the book discusses the role of a data science manager in helping an organization take the data-driven approach. By the end of this book, you'll have a good understanding of data science and how it can enable you to extract value from your data. What you will learn Get familiar with the five most important steps of data science Use the Conway diagram to visualize the technical skills of the data science team Understand the limitations of data science from a mathematical and ethical perspective Get a quick overview of machine learning Gain insight into the purpose of using data science in your work Understand the role of data science managers and their expectations Who this book is for This book is ideal for data scientists and data analysts who are looking for a practical guide to strategically and systematically use data. This book is also useful for those who want to understand in detail what is data science and how can an organization take the data-driven approach. Prior programming knowledge of Python and R is assumed.