Identifying candidate problems that can be solved using ML
It is also important for data scientists to understand the scale of the data they are working with. A medical research task might span thousands of patients with hundreds of features each, and can be processed on a single machine. By contrast, an advertising task, where companies collect several petabytes of data based on every online advertisement served to users, may require several thousand machines to train ML algorithms. Deep learning algorithms are GPU-intensive and require a different type of machine than other ML algorithms. In this book, for each algorithm, we first describe how it can be implemented simply using Python libraries, and then how it can be scaled on large AWS clusters using technologies such as Spark and AWS SageMaker. We also discuss how TensorFlow is used for deep learning applications.
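As a rough illustration of this split, the following sketch contrasts a single-machine fit with a distributed one; the dataset paths, column names, and S3 bucket are hypothetical placeholders:

```python
# Single-node sketch: a dataset of thousands of rows with hundreds of features
# fits in memory, so scikit-learn on one machine is enough.
import pandas as pd
from sklearn.linear_model import LogisticRegression

patients = pd.read_csv("patients.csv")                 # hypothetical file
local_model = LogisticRegression(max_iter=1000)
local_model.fit(patients.drop(columns=["label"]), patients["label"])

# Distributed sketch: the same model family trained with Spark MLlib when the
# data (for example, petabytes of ad-serving logs) no longer fits on one machine.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression

spark = SparkSession.builder.appName("ad-clicks").getOrCreate()
clicks = spark.read.parquet("s3://my-bucket/ad-clicks/")   # hypothetical S3 path
assembled = VectorAssembler(
    inputCols=[c for c in clicks.columns if c != "label"],
    outputCol="features",
).transform(clicks)
cluster_model = SparkLogisticRegression(labelCol="label").fit(assembled)
```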
It is also crucial for data scientists to understand who the customers of their ML-related tasks are. Although it can be challenging to find which algorithm works best for a specific application area, it is just as important to gather evidence of how that algorithm improves the application and to present this to the product owners. Hence, we also discuss how to evaluate each algorithm and visualize the results where necessary. AWS offers a large array of tools for evaluating ML algorithms and presenting the results.
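As a minimal sketch of this kind of evaluation, the snippet below computes an accuracy score and plots a confusion matrix, which is often more persuasive to product owners than a single number; `y_test` and `predictions` stand in for the held-out labels and model outputs of any classifier such as the ones above:

```python
# Hypothetical evaluation of a trained classifier on held-out data.
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

accuracy = accuracy_score(y_test, predictions)         # y_test/predictions are placeholders
print(f"Held-out accuracy: {accuracy:.3f}")

# Visualize where the model succeeds and fails, class by class.
ConfusionMatrixDisplay.from_predictions(y_test, predictions)
plt.title("Model performance on held-out data")
plt.show()
```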
Finally, a data scientist needs to be able to decide which types of machines on AWS best fit their needs. Once an algorithm is implemented, there are important considerations regarding how to deploy it on large clusters in the most economical way. AWS offers more than 25 hardware alternatives, called instance types, to choose from. We will discuss case studies of how applications are deployed on production clusters, and the various issues a data scientist can face during this process.
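To make the instance-type choice concrete, here is a minimal sketch of launching a SageMaker training job where the hardware is specified explicitly; the container image, IAM role, and S3 paths are placeholders, and the instance values are illustrative only:

```python
# Hypothetical SageMaker training job showing where the instance type is chosen.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
estimator = Estimator(
    image_uri="<training-image-uri>",          # placeholder container image
    role="<execution-role-arn>",               # placeholder IAM role
    instance_count=2,                          # number of machines in the training cluster
    instance_type="ml.m5.xlarge",              # CPU instance; a GPU family such as ml.p3
                                               # would suit deep learning workloads
    output_path="s3://<bucket>/model-output/", # placeholder output location
    sagemaker_session=session,
)
estimator.fit({"train": "s3://<bucket>/training-data/"})
```

Swapping `instance_type` and `instance_count` is typically how cost and training time are traded off against each other on AWS.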