

In the previous lesson, you saw how to manipulate data with Spark DataFrames as well as create machine learning models. In this lab, you will create a Spark ML pipeline that transforms data and runs over a grid of hyperparameters.

To begin, create a SparkSession and import the 'credit_card_default.csv' file into a PySpark DataFrame.

```python
# import necessary libraries
from pyspark import SparkContext
from pyspark.sql import SparkSession

# initialize Spark Session

# read in csv to a spark dataframe
spark_df = None
```
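A minimal sketch of how this skeleton could be filled in, assuming the file sits in the working directory and has a header row (the app name is an arbitrary choice):

```python
from pyspark.sql import SparkSession

# initialize Spark Session in local mode
spark = SparkSession.builder.master('local').appName('credit-card-default').getOrCreate()

# read in csv to a spark dataframe, letting Spark infer the column types
spark_df = spark.read.csv('./credit_card_default.csv', header=True, inferSchema=True)
```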

Check the datatypes to ensure that all columns are the datatype you expect:

```
[('ID', 'int'),
 ...
```

Check to see how many missing values are in the dataset.
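A sketch of both checks; the null-counting idiom below is one common approach, not necessarily the one the lab solution uses:

```python
from pyspark.sql.functions import col, count, when

# inspect the schema as a list of (column, dtype) pairs
print(spark_df.dtypes)

# count the nulls in every column in a single pass
spark_df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in spark_df.columns]
).show()
```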
Now, determine how many categories there are in each of the categorical columns. Let's take a look at all the values contained in the categorical columns of the DataFrame:

```python
for column, data_type in spark_df.dtypes:
    # your code here
```

```
Feature SEX has:
Feature EDUCATION has:
Feature MARRIAGE has:
```
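One way to complete that loop, assuming the categorical columns were read in as strings (which is why filtering on the dtype isolates SEX, EDUCATION, and MARRIAGE):

```python
for column, data_type in spark_df.dtypes:
    # only the string columns are categorical in this dataset
    if data_type == 'string':
        distinct_values = spark_df.select(column).distinct().collect()
        print(f'Feature {column} has: {distinct_values}')
```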

It looks like we have some extraneous values in each of our categories. Let's look at some visualizations of each of these to determine just how many of them there are. Create bar plots of the variables 'EDUCATION' and 'MARRIAGE' to see how many of the undefined values there are. After doing so, come up with a strategy for accounting for the extra values.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# plotting the categories for education
# your code here
```
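A sketch of one way to build those plots, counting rows per category in Spark and handing the small summary to seaborn; the grouping approach is an assumption:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# plotting the categories for education and marriage
for column in ['EDUCATION', 'MARRIAGE']:
    # count rows per category, then pull the small result into pandas
    counts = spark_df.groupBy(column).count().toPandas()
    sns.barplot(x=column, y='count', data=counts)
    plt.title(f'Counts of {column} categories')
    plt.show()
```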

It looks like there are barely any of the 0 and 5 categories. We can go ahead and throw them into the "Other" category since it's already operating as a catchall here. Similarly, the category "0" in 'MARRIAGE' looks small, so let's throw it in with the "Other" values. You can do this by using a method called when().

```python
from pyspark.sql.functions import when

# changing the values in the education column

# changing the values in the marriage column
spark_df_done = None
```
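A sketch of the recoding, assuming the categories are stored as strings and that "Other" is an existing category label in both columns:

```python
from pyspark.sql.functions import col, when

# changing the values in the education column: fold 0 and 5 into "Other"
spark_df_done = spark_df.withColumn(
    'EDUCATION',
    when(col('EDUCATION').isin('0', '5'), 'Other').otherwise(col('EDUCATION'))
)

# changing the values in the marriage column: fold 0 into "Other"
spark_df_done = spark_df_done.withColumn(
    'MARRIAGE',
    when(col('MARRIAGE') == '0', 'Other').otherwise(col('MARRIAGE'))
)
```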
Alright! Now let's see if that worked:

```python
for column, data_type in spark_df_done.dtypes:
    # your code here
```

Now it's time to get the features ready for modeling:

```python
# importing the necessary modules

# creating the string indexers

# features to be included in the model

# adding the categorical features

# putting all of the features into a single vector
```
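A sketch of that feature-engineering step. The use of StringIndexer and VectorAssembler follows the comments above, but the column lists, the '_index' suffix, and the label name 'default' are assumptions made for illustration:

```python
from pyspark.ml.feature import StringIndexer, VectorAssembler

# creating the string indexers, one per categorical column
categorical = ['SEX', 'EDUCATION', 'MARRIAGE']
indexers = [StringIndexer(inputCol=c, outputCol=c + '_index') for c in categorical]

# features to be included in the model: numeric columns, excluding the row ID
# and the label column (assumed here to be called 'default'; check the csv)
features = [c for c in spark_df_done.columns
            if c not in categorical + ['ID', 'default']]

# adding the categorical features (their indexed versions)
features += [c + '_index' for c in categorical]

# putting all of the features into a single vector
assembler = VectorAssembler(inputCols=features, outputCol='features')
```

These indexers and the assembler are exactly the kind of stages that can later be chained in a `Pipeline` and tuned over a grid of hyperparameters, which is where this lab is headed.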
