Preface
If you have ever wanted to get into data mining, but didn't know where to start, I've written this book with you in mind.
Many data mining books are highly mathematical, which is great if you come from such a background, but I feel they often miss the forest for the trees: they focus so much on how the algorithms work that they forget why we are using them.
My aim has been to write a book for those who can program and want to learn data mining. By the end of it, you should have a good understanding of the basics, some best practices for solving problems with data mining, and some pointers on the next steps you can take.
Each chapter in this book introduces a new topic, algorithm, and dataset. For this reason, it can be a bit of a whirlwind tour, moving quickly from topic to topic. However, for each of the chapters, think about how you can improve upon the results presented in the chapter. Then, take a shot at implementing it!
One of my favorite quotes is from Shakespeare's Henry IV:
But will they come when you do call for them?
Before this quote, a character is claiming to be able to call spirits. In response, Hotspur points out that anyone can call spirits, but what matters is whether they actually come when they are called.
In much the same way, learning data mining is about performing experiments and getting the result. Anyone can come up with an idea to create a new data mining algorithm or improve upon an experiment's results. However, what matters is: can you build it and does it work?
What this book covers
Chapter 1, Getting Started with Data Mining, introduces the technologies we will be using, along with implementing two basic algorithms to get started.
Chapter 2, Classifying with scikit-learn Estimators, covers classification, which is a key form of data mining. You'll also learn about some structures to make your data mining experimentation easier to perform.
Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses them to predict sports winners by creating useful features.
Chapter 4, Recommending Movies Using Affinity Analysis, looks at the problem of recommending products based on past experience and introduces the Apriori algorithm.
Chapter 5, Extracting Features with Transformers, introduces different types of features you can create and how to work with different datasets.
Chapter 6, Social Media Insight Using Naive Bayes, uses the Naive Bayes algorithm to automatically parse text-based information from the social media website, Twitter.
Chapter 7, Discovering Accounts to Follow Using Graph Mining, applies cluster and network analysis to find good people to follow on social media.
Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images and then training neural networks to find words and letters in those images.
Chapter 9, Authorship Attribution, looks at determining who wrote a given document, by extracting text-based features and using support vector machines.
Chapter 10, Clustering News Articles, uses the k-means clustering algorithm to group together news articles based on their content.
Chapter 11, Classifying Objects in Images Using Deep Learning, determines what type of object is being shown in an image, by applying deep neural networks.
Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.
Appendix, Next Steps…, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.
What you need for this book
It should come as no surprise that you'll need a computer, or access to one, to complete this book. The computer should be reasonably modern, but it doesn't need to be overpowered. Any modern processor (from about 2010 onwards) and 4 GB of RAM will suffice, and you can probably run almost all of the code on a slower system too.
The exception is the final two chapters. In these chapters, I step through using Amazon Web Services (AWS) to run the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don't want to pay for those services, the tools used can all be set up on a local computer, but you will definitely need a modern system to run them: a processor built in 2012 or later and more than 4 GB of RAM.
I recommend the Ubuntu operating system, but the code should work well on Windows, Macs, or any other Linux variant. You may need to consult the documentation for your system to get some things installed, though.
In this book, I install libraries with pip, which is a command-line tool for installing Python libraries. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads.
I have also tested all code using Python 3. Most of the code examples work on Python 2, with no changes. If you run into any problems and can't get around them, send an email and we can offer a solution.
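If you are unsure which Python version your environment runs, a quick check such as the following can help. This is only a minimal sketch: the helper name is_python3 is hypothetical and is not part of the book's code bundle.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3 or newer."""
    # The first element of the version tuple is the major version number
    return version_info[0] >= 3


if is_python3():
    print("Running on Python 3")
else:
    print("Running on Python 2; most examples should still work")
```

The version is passed in as a parameter only so that the check can be tried against other version tuples, such as (2, 7, 18).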
Who this book is for
This book is for programmers who want to get started in data mining in an application-focused manner.
If you haven't programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn't introduce programming, nor does it spend much time explaining the mechanics of how to type out the code. That said, once you have gone through the basics, you should be able to come back to this book fairly quickly. There is no need to be an expert programmer first!
I highly recommend that you have some Python programming experience. If you don't, feel free to jump in, but you might want to take a look at some Python code first, possibly focusing on tutorials using the IPython Notebook. Writing programs in the IPython Notebook works a little differently than other methods such as writing a Java program in a fully fledged IDE.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
The most important is code. Code that you need to enter is displayed separate from the text, in a box like this one:
if True:
    print("Welcome to the book")
Keep a careful eye on indentation. Python cares about how much lines are indented. In this book, I've used four spaces for indentation. You can use a different number (or tabs), but you need to be consistent. If you get a bit lost counting indentation levels, reference the code bundle that comes with the book.
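As a minimal sketch of consistent four-space indentation across nested blocks, consider the following. The function count_evens is a hypothetical example made up for illustration, not code from the book.

```python
def count_evens(numbers):
    # Function body: one level, indented four spaces
    total = 0
    for number in numbers:
        # Loop body: two levels, indented eight spaces
        if number % 2 == 0:
            # Body of the if statement: three levels, twelve spaces
            total += 1
    # Back at one level, so this runs after the loop has finished
    return total


print(count_evens([1, 2, 3, 4]))
```

Moving the return line one level deeper would place it inside the loop and end the function after the first number, which is why the amount of indentation matters.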
Where I refer to code in text, I'll use this format. You don't need to type this in your IPython Notebooks, unless the text specifically states otherwise.
Any command-line input or output is written as follows:
# cp file1.txt file2.txt
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on the Export link."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/6053OS_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.