Spark shell
We will go back into our Spark folder, which is spark-2.3.2-bin-hadoop2.7, and start our PySpark binary by typing .\bin\pyspark.
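On Windows, that amounts to the following two commands (on Linux or macOS, the equivalent last step is ./bin/pyspark):

cd spark-2.3.2-bin-hadoop2.7
.\bin\pyspark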
We can see that we've started a shell session with Spark in the following screenshot:
Spark is now available to us as a spark variable. Let's try something simple in Spark. The first thing to do is load a file. Every Spark installation ships with a README.md markdown file, so let's load it into memory as follows:
text_file = spark.read.text("README.md")
When we use spark.read.text and pass in README.md, we may get a few warnings, but we shouldn't be too concerned about them at the moment; we will see later how to fix them. The main thing here is that we can use Python syntax to access Spark.
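As a quick sanity check, we can confirm what we are working with; the exact output, such as the version string, will depend on your installation:

# spark is the SparkSession that the PySpark shell creates for us
print(type(spark))      # <class 'pyspark.sql.session.SparkSession'>
print(spark.version)    # for example, 2.3.2

# text_file is a DataFrame with a single string column named "value"
text_file.printSchema()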
What we have done here is read README.md into Spark as text data, and we can use text_file.count() to get Spark to count how many lines are in our text file, as follows:
text_file.count()
From this, we get the following output:
103
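Note that count() counts rows, that is, lines, not characters. If we did want a character count, one way is to sum the length of every line; the following is just a sketch using helpers from pyspark.sql.functions:

from pyspark.sql.functions import length, sum as sql_sum

# add up the length of every line (line breaks are not included)
text_file.select(sql_sum(length(text_file.value))).show()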
We can also see what the first line is with the following:
text_file.first()
We will get the following output:
Row(value='# Apache Spark')
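Since first() returns a Row object, we can pull the plain string out of it through its value field:

first_line = text_file.first()
print(first_line.value)   # prints: # Apache Spark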
We can now count the number of lines that contain the word Spark by doing the following:
lines_with_spark = text_file.filter(text_file.value.contains("Spark"))
Here, we have filtered the lines using the filter() function, and within the filter() function we have specified text_file.value.contains("Spark") to keep only the lines that include the word "Spark", and we have put those results into the lines_with_spark variable.
We can modify the preceding command and simply add .count(), as follows:
text_file.filter(text_file.value.contains("Spark")).count()
We will now get the following output:
20
We can see that 20 lines in the text file contain the word Spark. This is just a simple example of how we can use the Spark shell.
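Finally, if we want to see which lines matched rather than just count them, we can display the filtered DataFrame as well; the exact lines shown will depend on the README that ships with your Spark version:

# display a handful of the matching lines without truncating them
lines_with_spark.show(5, truncate=False)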