kascemagic.blogg.se - Spark lab master


The pipeline uses the following estimators:

3 StringIndexers (one for each categorical feature).

A OneHotEncoderEstimator (to encode the newly indexed strings into categorical variables). In Spark 3.0 and later, this class is named OneHotEncoder.

A VectorAssembler (to combine all features into one SparseVector).

All of these initialized estimators should be stored in a list.

    # importing the necessary modules
    # creating the string indexers
    # features to be included in the model
    # adding the categorical features
    # putting all of the features into a single vector

Alright! Now let's see if that worked.

Now, let's do a little more investigation into our target variable before diving into the machine learning aspect of this project.

Let's first look at the overall distribution of class balance of the default and not default labels. Create a barplot to compare the number of defaults vs. non-defaults. This requires groupBy() as well as an aggregation method.

Let's also visualize the difference in default rate between males and females in this dataset.

    # perform a groupby for default and sex

    [Row(default=1, SEX='Female', count=3762),

    # make barplot for female and male default v no default rate

It looks like males have an ever so slightly higher default rate than females.

Now, it's time to fit the data to the PySpark machine learning model pipeline.


We can go ahead and throw the 0 and 5 categories into the "Other" category since it's already operating as a catchall here. Similarly, the category "0" looks small, so let's throw it in with the "Other" values. You can do this with the when function that the code below imports.

    from pyspark.sql.functions import when
    # changing the values in the education column
    # changing the values in the marriage column
    spark_df_done = None
    spark_df_done.

Now let's take a look at all the values contained in the categorical columns of the DataFrame:

    for column, data_type in spark_df_done.dtypes:
        # your code here

    Feature SEX has:

In the previous lesson, you saw how to manipulate data with Spark DataFrames as well as create machine learning models. In this lab, you're going to practice loading data, manipulating it, and fitting it in the Spark framework. Afterward, you're going to make use of different visualizations to see if you can get any insights from the model. This dataset is from a Taiwanese financial company, and the task is to determine which individuals are going to default on their credit card based on characteristics such as limit balance, past payment history, age, marital status, and sex.

Load and manipulate data using Spark DataFrames.

Create a Spark ML pipeline that transforms data and runs over a grid of hyperparameters.

To begin with, create a SparkSession and import the 'credit_card_default.csv' file into a PySpark DataFrame.

    # import necessary libraries
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    # initialize Spark Session
    # read in csv to a spark dataframe
    spark_df = None

Check the datatypes to ensure that all columns are the datatype you expect.

    [('ID', 'int'),

Check to see how many missing values are in the dataset.

Now, determine how many categories there are in each of the categorical columns.

    dtypes : # your code here

    Feature SEX has:
    Feature EDUCATION has:
    Feature MARRIAGE has:

It looks like we have some extraneous values in each of our categories. Let's look at some visualizations of each of these to determine just how many of them there are. Create bar plots of the variables 'EDUCATION' and 'MARRIAGE' to see how many of the undefined values there are. After doing so, come up with a strategy for accounting for the extra values.

    import seaborn as sns
    import matplotlib.pyplot as plt
    # plotting the categories for education

It looks like there are barely any of the 0 and 5 categories.