PrepPy package
Submodules
PrepPy.datatype module
PrepPy.datatype.data_type(df)
Identify features of different data types.
Parameters: df (pandas.core.frame.DataFrame) – Original feature dataframe containing one column for each feature.
Returns: The numerical and categorical columns, stored separately as two dataframes. In the example below, index 0 holds the numerical columns and index 1 holds the categorical columns.
Return type: tuple
Examples
>>> import pandas as pd
>>> from PrepPy import datatype
>>> my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
...                         'count': [3, 5, 8],
...                         'price': [1.0, 6.5, 9.23]})
>>> datatype.data_type(my_data)[0]
   count  price
0      3    1.0
1      5    6.5
2      8   9.23
>>> datatype.data_type(my_data)[1]
   fruits
0   apple
1  banana
2    pear
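The split described above can be approximated with plain pandas. The following is a minimal sketch, not PrepPy's actual implementation: the helper name `split_by_dtype` is hypothetical, and it assumes that every non-numeric column counts as categorical.

```python
import pandas as pd


def split_by_dtype(df):
    """Hypothetical helper: return (numeric_df, categorical_df).

    Assumption: any column that is not numeric is treated as categorical.
    """
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return numeric, categorical


my_data = pd.DataFrame({'fruits': ['apple', 'banana', 'pear'],
                        'count': [3, 5, 8],
                        'price': [1.0, 6.5, 9.23]})

numeric, categorical = split_by_dtype(my_data)
print(list(numeric.columns))      # ['count', 'price']
print(list(categorical.columns))  # ['fruits']
```

Returning the numeric frame first mirrors the ordering shown in the example above.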
PrepPy.onehot module
PrepPy.onehot.onehot(cols, train, valid=None, test=None)
One-hot encodes features of categorical type.
Parameters:
- cols (list) – List of column names to encode.
- train (pandas.DataFrame) – The train set from which the columns come.
- valid (pandas.DataFrame, optional (default=None)) – The validation set from which the columns come.
- test (pandas.DataFrame, optional (default=None)) – The test set from which the columns come.
Returns: train_encoded, valid_encoded, test_encoded – The encoded DataFrames. The example below retrieves the encoded train set with the key 'train'.
Return type: pandas.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from PrepPy import onehot
>>> my_data = pd.DataFrame(np.array([['monkey'], ['dog'], ['cat']]),
...                        columns=['animals'])
>>> onehot.onehot(['animals'], my_data)['train']
animals_monkey  animals_dog  animals_cat
             1            0            0
             0            1            0
             0            0            1
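One common way to one-hot encode several splits consistently is to take the column layout from the train set and align the other sets to it. The sketch below illustrates that idea with `pandas.get_dummies`. It is not PrepPy's code: the function name `onehot_sketch` and the dict return value are assumptions, and the column order it produces (alphabetical by category) may differ from the output shown above.

```python
import pandas as pd


def onehot_sketch(cols, train, valid=None, test=None):
    """Hypothetical sketch: encode `cols`, using the train set's layout.

    Categories that appear only in valid/test are dropped, and categories
    missing from valid/test are filled with 0.
    """
    train_enc = pd.get_dummies(train, columns=cols, dtype=int)
    encoded = {"train": train_enc}
    for name, frame in (("valid", valid), ("test", test)):
        if frame is not None:
            enc = pd.get_dummies(frame, columns=cols, dtype=int)
            # Align to the train columns so every split has the same shape.
            encoded[name] = enc.reindex(columns=train_enc.columns,
                                        fill_value=0)
    return encoded


train = pd.DataFrame({'animals': ['monkey', 'dog', 'cat']})
test = pd.DataFrame({'animals': ['dog', 'bird']})  # 'bird' is unseen in train

result = onehot_sketch(['animals'], train, test=test)
print(list(result['train'].columns))
# ['animals_cat', 'animals_dog', 'animals_monkey']
print(result['test'].values.tolist())
# [[0, 1, 0], [0, 0, 0]]
```

Aligning to the train columns keeps the feature matrix the same width across splits, which most estimators require.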
PrepPy.scaler module
PrepPy.scaler.scaler(x_train, x_validation, x_test, colnames)
Perform standard scaling on numerical features.
Parameters:
- x_train (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of train set containing columns to be scaled.
- x_validation (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of validation set containing columns to be scaled.
- x_test (pandas.core.frame.DataFrame, numpy array or list) – Dataframe of test set containing columns to be scaled.
- colnames (list) – A list of numeric column names.
Returns: The scaled and transformed x_train, x_validation and x_test sets, stored separately as dataframes under the keys 'x_train', 'x_validation' and 'x_test'.
Return type: dict
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from PrepPy import prepPy as pp
>>> x_train = pd.DataFrame(np.array([['Blue', 56, 4], ['Red', 35, 6], ['Green', 18, 9]]),
...                        columns=['color', 'count', 'usage'])
>>> x_test = pd.DataFrame(np.array([['Blue', 66, 6], ['Red', 42, 8], ['Green', 96, 0]]),
...                       columns=['color', 'count', 'usage'])
>>> x_validation = pd.DataFrame(np.array([['Blue', 30, 18], ['Red', 47, 2], ['Green', 100, 4]]),
...                             columns=['color', 'count', 'usage'])
>>> colnames = ['count', 'usage']
>>> # Scale once and keep the result, so the original inputs are not overwritten
>>> scaled = pp.scaler(x_train, x_validation, x_test, colnames)
>>> scaled['x_train']
   color       count      usage
0   Blue     1.26538   -1.13555
1    Red  -0.0857887  -0.162221
2  Green    -1.17959    1.29777
>>> scaled['x_validation']
   color       count        usage
0   Blue  1.80879917  -0.16222142
1    Red  0.16460209   1.81110711
2  Green  2.43904552    -4.082207
>>> scaled['x_test']
   color       count        usage
0   Blue  1.90879917  -0.16222142
1    Red  0.36460209   0.81110711
2  Green  3.83904552    -3.082207
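Standard scaling usually means subtracting a column's mean and dividing by its standard deviation, with both statistics taken from the train set only and then reused for the validation and test sets. The sketch below shows that arithmetic in plain pandas. It is an illustration under assumptions, not PrepPy's implementation: the name `standard_scale_sketch` is hypothetical, and the population standard deviation (`ddof=0`) is assumed because it reproduces the x_train values shown above.

```python
import pandas as pd


def standard_scale_sketch(x_train, x_validation, x_test, colnames):
    """Hypothetical sketch: z-score `colnames` using train-set statistics."""
    mean = x_train[colnames].mean()
    std = x_train[colnames].std(ddof=0)  # assumption: population std
    scaled = {}
    for name, frame in (("x_train", x_train),
                        ("x_validation", x_validation),
                        ("x_test", x_test)):
        out = frame.copy()  # leave the caller's dataframe untouched
        for col in colnames:
            out[col] = (frame[col] - mean[col]) / std[col]
        scaled[name] = out
    return scaled


x_train = pd.DataFrame({'color': ['Blue', 'Red', 'Green'],
                        'count': [56, 35, 18], 'usage': [4, 6, 9]})
x_validation = pd.DataFrame({'color': ['Blue', 'Red', 'Green'],
                             'count': [30, 47, 100], 'usage': [18, 2, 4]})
x_test = pd.DataFrame({'color': ['Blue', 'Red', 'Green'],
                       'count': [66, 42, 96], 'usage': [6, 8, 0]})

scaled = standard_scale_sketch(x_train, x_validation, x_test,
                               ['count', 'usage'])
print(scaled['x_train'].round(5))
```

Fitting the statistics on the train set alone avoids leaking information from the validation and test sets into the transformation.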
PrepPy.train_valid_test_split module
PrepPy.train_valid_test_split.train_valid_test_split(X, y, valid_size=0.25, test_size=None, stratify=None, random_state=None, shuffle=True)
Splits dataframes into random train, validation and test subsets.
The proportion of the train set relative to the input data will be valid_size * (1 - test) ….
Parameters:
- X, y – Allowable inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- test_size (float or None, optional (default=None)) – A value between 0.0 and 1.0 representing the proportion of the input dataset that makes up the test subset. If None, the value is set to 0.25.
- valid_size (float or None (default=0.25)) – A value between 0.0 and 1.0 representing the proportion of the input dataset that makes up the validation subset. The default value is 0.25.
- stratify (array-like or None (default=None)) – If not None, splits categorical data in a stratified fashion, preserving the same proportion of classes in the train, valid and test sets, using this input as the class labels.
- random_state (integer, optional (default=None)) – Seed to be used by the random number generator. If None, the value will be set to 1.
- shuffle (bool, optional (default=True)) – Whether the data is shuffled prior to splitting.
Returns: splits – List containing train, validation and test splits of the input data.
Return type: list, length = 3 * len(arrays)
Examples
>>> import numpy as np
>>> from PrepPy import PrepPy as pp
>>> X, y = np.arange(16).reshape((8, 2)), list(range(8))
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9],
       [10, 11],
       [12, 13],
       [14, 15]])
>>> list(y)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> X_train, X_valid, X_test, y_train, y_valid, y_test = pp.train_valid_test_split(
...     X, y, test_size=0.25, valid_size=0.25, random_state=777)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7],
       [12, 13]])
>>> y_train
[3, 0, 2, 5]
>>> X_valid
array([[2, 3],
       [10, 11]])
>>> y_valid
[1, 4]
>>> X_test
array([[8, 9],
       [14, 15]])
>>> y_test
[7, 6]
>>> pp.train_valid_test_split(X, test_size=2, shuffle=False)
>>> X_train
array([[2, 3],
       [14, 15],
       [6, 7],
       [12, 13],
       [4, 5],
       [10, 11]])
>>> X_test
array([[8, 9],
       [0, 1]])
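To make the size parameters concrete, here is a minimal NumPy-only sketch of a three-way split. It is not PrepPy's implementation: the function name `three_way_split_sketch` is hypothetical, stratification is omitted, and both `test_size` and `valid_size` are assumed to be fractions of the full input, as the parameter descriptions state.

```python
import numpy as np


def three_way_split_sketch(X, y, valid_size=0.25, test_size=0.25,
                           random_state=None, shuffle=True):
    """Hypothetical sketch: split X and y into train/valid/test parts."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    idx = np.arange(n)
    if shuffle:
        rng = np.random.default_rng(random_state)
        idx = rng.permutation(n)
    # Assumption: both sizes are fractions of the whole input, rounded up.
    n_test = int(np.ceil(test_size * n))
    n_valid = int(np.ceil(valid_size * n))
    test_idx = idx[:n_test]
    valid_idx = idx[n_test:n_test + n_valid]
    train_idx = idx[n_test + n_valid:]
    return (X[train_idx], X[valid_idx], X[test_idx],
            y[train_idx], y[valid_idx], y[test_idx])


X, y = np.arange(16).reshape((8, 2)), list(range(8))
X_train, X_valid, X_test, y_train, y_valid, y_test = three_way_split_sketch(
    X, y, valid_size=0.25, test_size=0.25, random_state=777)

print(X_train.shape, X_valid.shape, X_test.shape)  # (4, 2) (2, 2) (2, 2)
```

With 8 rows and both sizes set to 0.25, this gives 4 train, 2 validation and 2 test rows, the same subset sizes as the first example above. The rows selected will generally differ, because the random number generator is not the same.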