PySpark


PySpark is the Python API that lets you use Apache Spark from Python. PySpark relies on the Py4J library; with its help, Python can be easily integrated with Apache Spark. PySpark plays an important role whenever you need to work with vast datasets.

Key features of PySpark are:

  1. Real-time computation- PySpark provides real-time computation on a large amount of data because it relies on in-memory processing, which also gives it low latency.
  2. Support for multiple languages- The Spark framework works with several programming languages such as Scala, Java, Python, and R, making it convenient for processing huge datasets.
  3. Caching and disk persistence- PySpark offers powerful caching and good disk persistence.
  4. Swift processing- PySpark achieves data processing speeds that are about 100 times faster in memory and 10 times faster on disk.
  5. Works well with RDDs- Python is a dynamically typed language, which helps when working with RDDs (see the short sketch after this list).
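As a minimal sketch of points 3 and 5, the snippet below (assuming PySpark is installed and run locally) builds an RDD from ordinary Python objects, caches it in memory, and runs two actions on it; the app name 'CachingDemo' and the numbers are illustrative, not from the article.

from pyspark import SparkContext

sc = SparkContext('local[2]', 'CachingDemo')  # two local cores; illustrative app name

# Build an RDD from ordinary Python objects (no static typing needed).
numbers = sc.parallelize(range(1, 1001))

# cache() keeps the RDD in memory after the first action,
# so the second action reuses it instead of recomputing.
numbers.cache()

print(numbers.count())                                # 1000
print(numbers.filter(lambda x: x % 2 == 0).count())   # 500

sc.stop()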

Why PySpark?

A huge amount of data is created offline and online. This data holds hidden patterns, unknown correlations, market trends, and customer preferences, all of which are useful business information. It is necessary to extract this valuable information from the raw data.

We need a more efficient tool that can perform different kinds of processing on big data. There are many tools for performing multiple tasks on huge datasets, but they are not convincing; scalable tools are needed to crack big data and gain benefit from it.

What is the real-life usage of PySpark?

Data is essential in every industry. Many industries work on big data and hire analysts to extract useful information from raw data.

  1. Entertainment industry:

The entertainment industry is one of the largest sectors growing through online streaming. Netflix uses Apache Spark for real-time processing to personalize online movie recommendations for its users.

2. Commercial sector:

The commercial sector makes heavy use of Apache Spark's real-time processing system. Banks and other financial service providers use Spark to retrieve a customer's social media profile and analyze it for useful insights that help them make accurate decisions. The extracted information is used for credit risk assessment and customer segmentation.

3. Healthcare- Spark may be used to analyse patient records along with previous medical data to identify which patients are likely to face health issues after being discharged from the clinic.

4. Tourism industry- The tourism industry uses Apache Spark to provide suggestions to travelers by comparing hundreds of tourism websites.

What is SparkConf?

SparkConf provides the configuration for any Spark application. To start an application on a local cluster, we set the group of configurations and parameters using SparkConf.

The main methods that SparkConf offers are:

set(key, value) - sets a configuration property

setMaster(value) - sets the master URL

setAppName(value) - sets the application name

get(key, defaultValue=None) - gets the configuration value for a key, or the given default if the key is not set

setSparkHome(value) - sets the path where Spark is installed on the worker nodes

Here is an example:

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf().setAppName('PySpark Demo App').setMaster('local[2]')
conf.get('spark.master')
conf.get('spark.app.name')
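The example above covers setAppName, setMaster, and get. As a minimal sketch of the remaining methods, the snippet below uses set(key, value), setSparkHome, and get with a default; the property values and the /opt/spark path are illustrative assumptions, not part of the original article.

from pyspark.conf import SparkConf

conf = SparkConf()
conf.set('spark.executor.memory', '1g')            # generic set(key, value)
conf.setSparkHome('/opt/spark')                    # assumed installation path
print(conf.get('spark.executor.memory', '512m'))   # returns '1g'; '512m' is the fallback default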

SparkContext

The SparkContext is the most important thing invoked when executing any Spark application; the significant first step of any Spark driver application is to create one. It is the entry gate for any Spark-related application, and it is available as sc by default in the PySpark shell.

A SparkContext accepts the following parameters:

master

The URL of the cluster that Spark connects to.

appName

The name of your application.

sparkHome

The Spark installation directory.

pyFiles

The .zip or .py files to send to the cluster and add to the PYTHONPATH.

environment

The environment variables to set on the worker nodes.

batchSize

The number of Python objects represented as a single Java object; set it to 1 to disable batching.

serializer

The RDD serializer.

conf

An object of L{SparkConf} used to set all the Spark properties.
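Putting these parameters together, here is a minimal sketch of creating a SparkContext explicitly; only master, appName, and conf are shown, the other parameters keep their defaults, and the names used are illustrative:

from pyspark import SparkConf, SparkContext

# Properties can be set on a SparkConf object and passed in via conf.
conf = SparkConf().set('spark.ui.showConsoleProgress', 'false')

# master and appName can also be passed directly; they take precedence over conf.
sc = SparkContext(master='local[2]', appName='ParamsDemo', conf=conf)

print(sc.master)    # local[2]
print(sc.appName)   # ParamsDemo

sc.stop()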

Questions

1. What is PySpark?

2. Why is PySpark used?
