PrepPy package

Submodules

PrepPy.datatype module

PrepPy.datatype.data_type(df)

Identify features of different data types.

Parameters: df (pandas.core.frame.DataFrame) – Original feature dataframe containing one column for each feature.
Returns: A tuple of two dataframes: the numerical columns first, followed by the categorical columns.
Return type: tuple

Examples

>>> import pandas as pd
>>> from PrepPy import datatype
>>> my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
...                         'count': [3, 5, 8],
...                         'price': [1.0, 6.5, 9.23]})
>>> datatype.data_type(my_data)[0]
   count  price
0      3   1.00
1      5   6.50
2      8   9.23
>>> datatype.data_type(my_data)[1]
   fruits
0   apple
1  banana
2    pear
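
The split by data type described above can be approximated with plain pandas. The following is a minimal sketch, not PrepPy's actual implementation: the helper name `split_by_dtype` is hypothetical, and it assumes that "numerical" means any NumPy numeric dtype and that every other column counts as categorical.

```python
import pandas as pd


def split_by_dtype(df):
    """Hypothetical helper: return (numeric_df, categorical_df)."""
    # Columns whose dtype is a NumPy numeric type (int, float, ...).
    numeric = df.select_dtypes(include="number")
    # Assumption of this sketch: everything else is treated as categorical.
    categorical = df.select_dtypes(exclude="number")
    return numeric, categorical


my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})
numeric, categorical = split_by_dtype(my_data)
print(list(numeric.columns))      # ['count', 'price']
print(list(categorical.columns))  # ['fruits']
```

Returning the numeric frame first mirrors the indexing used in the example above.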

PrepPy.onehot module

PrepPy.onehot.onehot(cols, train, valid=None, test=None)

One-hot encode features of categorical type.

Parameters:
  • cols (list) – List of column names.
  • train (pandas.DataFrame) – The train set from which the columns come.
  • valid (pandas.DataFrame, optional (default=None)) – The validation set from which the columns come.
  • test (pandas.DataFrame, optional (default=None)) – The test set from which the columns come.
Returns: train_encoded, valid_encoded, test_encoded – The encoded DataFrames.
Return type: pandas.DataFrames

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from PrepPy import onehot
>>> my_data = pd.DataFrame(np.array([['monkey'], ['dog'], ['cat']]),
...                        columns=['animals'])
>>> onehot.onehot(['animals'], my_data)['train']
animals_monkey    animals_dog   animals_cat
        1               0           0
        0               1           0
        0               0           1
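
To illustrate the idea of encoding several splits consistently, here is a minimal sketch built on `pandas.get_dummies`. It is not PrepPy's implementation: the helper name `onehot_sketch` is hypothetical, and both the dictionary return value with the keys `'train'`, `'valid'` and `'test'` and the choice to learn the categories from the train set only are assumptions made for this example.

```python
import pandas as pd


def onehot_sketch(cols, train, valid=None, test=None):
    """Hypothetical helper: one-hot encode `cols` using categories seen in `train`."""
    # Learn the category levels from the train set only (assumption of this sketch).
    levels = {c: list(pd.unique(train[c])) for c in cols}

    def encode(df):
        if df is None:
            return None
        out = df.copy()
        for c in cols:
            # Fixed categories keep the output columns identical across splits;
            # values not seen in train become missing and encode as all zeros.
            out[c] = pd.Categorical(out[c], categories=levels[c])
        return pd.get_dummies(out, columns=cols, dtype=int)

    return {'train': encode(train), 'valid': encode(valid), 'test': encode(test)}


train = pd.DataFrame({'animals': ['monkey', 'dog', 'cat']})
test = pd.DataFrame({'animals': ['dog', 'bird']})
encoded = onehot_sketch(['animals'], train, test=test)
print(encoded['train'].columns.tolist())
# ['animals_monkey', 'animals_dog', 'animals_cat']
print(encoded['test'].values.tolist())
# [[0, 1, 0], [0, 0, 0]]  ('bird' was not seen in train)
```

Fixing the categories before encoding is one way to guarantee that every split ends up with the same columns in the same order.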

PrepPy.scaler module

PrepPy.scaler.scaler(x_train, x_validation, x_test, colnames)

Apply standard scaling to numerical features.

Parameters:
  • x_train (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of train set containing columns to be scaled.
  • x_validation (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of validation set containing columns to be scaled.
  • x_test (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of test set containing columns to be scaled.
  • colnames (list) – A list of numeric column names.

Returns: The scaled and transformed x_train, x_validation and x_test sets, stored separately as dataframes under the keys 'x_train', 'x_validation' and 'x_test'.
Return type: dict

Examples

>>> import pandas as pd
>>> from PrepPy import scaler
>>> x_train = pd.DataFrame([['Blue', 56, 4], ['Red', 35, 6],
...                         ['Green', 18, 9]],
...                        columns=['color', 'count', 'usage'])
>>> x_test = pd.DataFrame([['Blue', 66, 6], ['Red', 42, 8],
...                        ['Green', 96, 0]],
...                       columns=['color', 'count', 'usage'])
>>> x_validation = pd.DataFrame([['Blue', 30, 18], ['Red', 47, 2],
...                              ['Green', 100, 4]],
...                             columns=['color', 'count', 'usage'])
>>> colnames = ['count', 'usage']
>>> scaled = scaler.scaler(x_train, x_validation, x_test, colnames)
>>> scaled['x_train']
   color  count       usage
0   Blue   1.26538    -1.13555
1    Red  -0.0857887  -0.162221
2  Green  -1.17959     1.29777
>>> scaled['x_validation']
   color  count     usage
0   Blue  -0.40750   5.67775
1    Red   0.68631  -2.10888
2  Green   4.09641  -1.13555
>>> scaled['x_test']
   color  count       usage
0   Blue  1.90879917  -0.16222142
1    Red  0.36460209   0.81110711
2  Green  3.83904552  -3.082207
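
The arithmetic behind standard scaling can be sketched with pandas alone. This is an illustration under stated assumptions rather than PrepPy's implementation: the helper name `scale_sketch` is hypothetical, and it assumes that the mean and the population standard deviation (`ddof=0`) are computed on the train set and then reused for the validation and test sets.

```python
import pandas as pd


def scale_sketch(x_train, x_validation, x_test, colnames):
    """Hypothetical helper: standardize `colnames` with train-set statistics."""
    mean = x_train[colnames].mean()
    # ddof=0 gives the population standard deviation (assumption of this sketch).
    std = x_train[colnames].std(ddof=0)
    scaled = {}
    for name, df in [('x_train', x_train),
                     ('x_validation', x_validation),
                     ('x_test', x_test)]:
        out = df.copy()
        for c in colnames:
            # z = (x - train mean) / train standard deviation
            out[c] = (df[c] - mean[c]) / std[c]
        scaled[name] = out
    return scaled


cols = ['color', 'count', 'usage']
x_train = pd.DataFrame([['Blue', 56, 4], ['Red', 35, 6], ['Green', 18, 9]], columns=cols)
x_validation = pd.DataFrame([['Blue', 30, 18], ['Red', 47, 2], ['Green', 100, 4]], columns=cols)
x_test = pd.DataFrame([['Blue', 66, 6], ['Red', 42, 8], ['Green', 96, 0]], columns=cols)
scaled = scale_sketch(x_train, x_validation, x_test, ['count', 'usage'])
print(scaled['x_train'].round(5))
```

Reusing the train statistics for the other two sets keeps information from the validation and test data out of the scaling step. Columns not listed in `colnames`, such as `color`, pass through unchanged.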

PrepPy.train_valid_test_split module

PrepPy.train_valid_test_split.train_valid_test_split(X, y, valid_size=0.25, test_size=None, stratify=None, random_state=None, shuffle=True)

Split dataframes into random train, validation and test subsets. The proportion of the train set relative to the input data will be valid_size * (1 - test) ….

Parameters:
  • X, y – Allowable inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes
  • test_size (float or None, optional (default=None)) –
    float, a value between 0.0 and 1.0 to represent the proportion of the
    input dataset to comprise the size of the test subset

    If None, the value is set to 0.25

  • valid_size (float or None, (default=0.25)) –
    float, a value between 0.0 and 1.0 to represent the proportion of the
    input dataset to comprise the size of the validation subset

    Default value is set to 0.25

  • stratify (array-like or None (default=None)) –
    If not None, splits categorical data in a stratified fashion, preserving
    the same proportion of classes in the train, valid and test sets, using this input as the class labels
  • random_state (integer, optional (default=None)) – A value for the seed to be used by the random number generator. If None, the value will be set to 1
  • shuffle (bool, optional (default=True)) – Indicate whether data is to be shuffled prior to splitting
Returns:

splits – List containing train, validation and test splits of the input data

Return type:

list, length = 3 * len(arrays)

Examples

>>> import numpy as np
>>> from PrepPy import train_valid_test_split as pp
>>> X, y = np.arange(16).reshape((8, 2)), list(range(8))
>>> X
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])
>>> list(y)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> X_train, X_valid, X_test, y_train, y_valid, y_test = \
...     pp.train_valid_test_split(X,
...                               y,
...                               test_size=0.25,
...                               valid_size=0.25,
...                               random_state=777)
>>> X_train
array([[ 4,  5],
       [ 0,  1],
       [ 6,  7],
       [12, 13]])
>>> y_train
[2, 0, 3, 6]
>>> X_valid
array([[ 2,  3],
       [10, 11]])
>>> y_valid
[1, 5]
>>> X_test
array([[ 8,  9],
       [14, 15]])
>>> y_test
[4, 7]
>>> pp.train_valid_test_split(X, test_size=2, shuffle=False)
>>> X_train
array([[ 2,  3],
       [14, 15],
       [ 6,  7],
       [12, 13],
       [ 4,  5],
       [10, 11]])
>>> X_test
array([[ 8,  9],
       [ 0,  1]])
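
For readers who want to see the mechanics of a three-way split, here is a minimal NumPy sketch. It is not PrepPy's implementation: the helper name `three_way_split` is hypothetical, stratification is left out, and the sketch assumes that both `valid_size` and `test_size` are fractions of the full input, which may differ from the proportions PrepPy uses.

```python
import numpy as np


def three_way_split(X, y, valid_size=0.25, test_size=0.25,
                    random_state=None, shuffle=True):
    """Hypothetical helper: split X and y into train, validation and test parts."""
    n = len(X)
    indices = np.arange(n)
    if shuffle:
        # Seeded generator so the same random_state gives the same split.
        rng = np.random.default_rng(random_state)
        rng.shuffle(indices)
    # Assumption of this sketch: both fractions refer to the full input.
    n_test = int(round(test_size * n))
    n_valid = int(round(valid_size * n))
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]
    X, y = np.asarray(X), np.asarray(y)
    # The same index arrays are applied to X and y so rows stay aligned.
    return (X[train_idx], X[valid_idx], X[test_idx],
            y[train_idx], y[valid_idx], y[test_idx])


X, y = np.arange(16).reshape((8, 2)), list(range(8))
X_train, X_valid, X_test, y_train, y_valid, y_test = three_way_split(
    X, y, valid_size=0.25, test_size=0.25, random_state=777)
print(X_train.shape, X_valid.shape, X_test.shape)  # (4, 2) (2, 2) (2, 2)
```

Because one shuffled index array drives every slice, each row of `X` stays paired with its label in `y`, and every sample lands in exactly one of the three subsets.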

Module contents