Deep Learning Feature Engineering
In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print("Version: TensorFlow", tf.__version__)
Version: TensorFlow 2.5.0

Summary

tf.feature_column is the bridge that maps columns in a CSV file to features used to train a TensorFlow model. The workflow is as follows:

  • Step 1. read input data from a .csv file
  • Step 2. do feature engineering based on column type, e.g. numeric vs. categorical
  • Step 3. feed a Dense Tensor into the model (a TensorFlow model only takes Dense Tensors as input)
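
The three steps above can be sketched end to end. This is a minimal illustration, not a full pipeline: the column names ('age', 'city'), the data, and the labels are made up, and an in-memory dict stands in for a parsed CSV (in practice Step 1 would use something like tf.data.experimental.make_csv_dataset).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Step 1: input data (a dict of columns, standing in for a parsed CSV)
data = {
    'age': np.array([23, 45, 31, 60], dtype=np.float32),
    'city': np.array(['NY', 'SF', 'NY', 'LA']),
}
labels = np.array([0, 1, 0, 1], dtype=np.float32)
ds = tf.data.Dataset.from_tensor_slices((data, labels)).batch(2)

# Step 2: feature engineering based on column type
age_col = tf.feature_column.numeric_column('age')
city_col = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'city', ['NY', 'SF', 'LA']))

# Step 3: DenseFeatures concatenates the columns into one Dense Tensor
# that a Keras model can consume
model = tf.keras.Sequential([
    layers.DenseFeatures([age_col, city_col]),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(ds, epochs=1, verbose=0)
```
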

Class: feature_column

| feature_column subclass | input data/tensor type | process | output data/tensor type |
| --- | --- | --- | --- |
| numeric_column | float or integer | wraps a regular numeric value into a given dtype | float or integer, Dense Tensor |
| bucketized_column | Dense Tensor from numeric_column | bucketizes values by boundaries | integer (one-hot), Dense Tensor |
| categorical_column_with_identity | integer (unique key) | encodes keys | integer (one-hot), Sparse Tensor |
| categorical_column_with_vocabulary_list | string | encodes each word's index | integer (one-hot), Sparse Tensor |
| categorical_column_with_vocabulary_file | string | encodes each word's index | integer (one-hot), Sparse Tensor |
| categorical_column_with_hash_bucket | string, Sparse Tensor | hash-encodes words | integer (approximate one-hot), Sparse Tensor |
| crossed_column | string, Sparse Tensor | crosses combinations of features | integer, Sparse Tensor |
| indicator_column | Sparse Tensor | wraps a Sparse Tensor into a Dense Tensor | integer, Dense Tensor |
| embedding_column | Sparse Tensor; Dense Tensor | maps a feature from sparse to dense form; dimensionality reduction | float, Dense Tensor |

Practical feature engineering

  • user profile
  • item information
  • behavior statistics (bucketized by time, class, etc.)
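
For behavior statistics, bucketizing raw counts is the same boundary logic that bucketized_column applies. A NumPy-only sketch (the click counts and boundaries are made up for illustration):

```python
import numpy as np

# Hypothetical per-user click counts and bucket boundaries.
clicks = np.array([0, 3, 12, 57, 200])
boundaries = [1, 10, 100]  # same semantics as bucketized_column boundaries

# np.digitize assigns each value the index of its bucket:
# (-inf, 1) -> 0, [1, 10) -> 1, [10, 100) -> 2, [100, inf) -> 3
bucket_ids = np.digitize(clicks, boundaries)
print(bucket_ids)  # -> [0 1 2 2 3]
```

Like bucketized_column, the left boundary of each bucket is inclusive and the right boundary is exclusive.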

Example code

In [2]:
# A utility method to create a feature column, and to transform a batch of data
def demo(feature_column):
    # reference: https://www.tensorflow.org/tutorials/structured_data/feature_columns?hl=zh-cn
    feature_layer = layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch))

# example batch of data
example_batch =  {
    'var_numeric': np.arange(1,7),
    'var_categorical': np.tile(['A','B','C'], 2),
}
example_batch
Out[2]:
{'var_numeric': array([1, 2, 3, 4, 5, 6]),
 'var_categorical': array(['A', 'B', 'C', 'A', 'B', 'C'], dtype='<U1')}

dense feature

The feature columns used as inputs to the model should be instances of classes derived from _DenseColumn such as:

  • numeric_column
  • bucketized_column
  • indicator_column
  • embedding_column
In [3]:
print("numeric_column:")
layer = tf.feature_column.numeric_column("var_numeric", default_value=None)
demo(layer)

print("bucketized_column:")
boundaries = [2,4]
layer = tf.feature_column.numeric_column('var_numeric', default_value=None)
layer = tf.feature_column.bucketized_column(layer, boundaries=boundaries)
demo(layer)
numeric_column:
tf.Tensor(
[[1.]
 [2.]
 [3.]
 [4.]
 [5.]
 [6.]], shape=(6, 1), dtype=float32)
bucketized_column:
tf.Tensor(
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]], shape=(6, 3), dtype=float32)

categorical feature

In [4]:
print("categorical_column_with_identity:")
layer = tf.feature_column.categorical_column_with_identity("var_numeric", num_buckets=4, default_value=None)
layer = tf.feature_column.indicator_column(layer)
demo(layer)

print("categorical_column_with_vocabulary_list / categorical_column_with_vocabulary_file:")
layer = tf.feature_column.categorical_column_with_vocabulary_list('var_categorical', ['A', 'B', 'C'])
layer = tf.feature_column.indicator_column(layer)
demo(layer)

print("categorical_column_with_hash_bucket:")
layer = tf.feature_column.categorical_column_with_hash_bucket('var_categorical', hash_bucket_size=2, dtype=tf.string)
layer = tf.feature_column.indicator_column(layer)
demo(layer)

print("crossed_column:")
col_1 = tf.feature_column.categorical_column_with_vocabulary_list('var_categorical', ['A', 'B', 'C'], dtype=tf.string)
col_2 = tf.feature_column.categorical_column_with_identity('var_numeric', num_buckets=2, default_value=0)
layer = tf.feature_column.crossed_column([col_1,col_2], 16)
layer = tf.feature_column.indicator_column(layer)
demo(layer)
categorical_column_with_identity:
tf.Tensor(
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]], shape=(6, 4), dtype=float32)
categorical_column_with_vocabulary_list / categorical_column_with_vocabulary_file:
tf.Tensor(
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]], shape=(6, 3), dtype=float32)
categorical_column_with_hash_bucket:
tf.Tensor(
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]], shape=(6, 2), dtype=float32)
crossed_column:
tf.Tensor(
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]], shape=(6, 16), dtype=float32)

If you have categorical features, you need to wrap them with one of:

  • indicator_column: the most commonly used.
  • embedding_column: used for dimensionality reduction when the feature has many distinct values.
In [5]:
print("indicator_column:")
layer = tf.feature_column.categorical_column_with_hash_bucket('var_categorical', hash_bucket_size=2, dtype=tf.string)
layer = tf.feature_column.indicator_column(layer)
demo(layer)

print("embedding_column:")
layer = tf.feature_column.categorical_column_with_vocabulary_list('var_categorical', ['A', 'B', 'C'], dtype=tf.string)
layer = tf.feature_column.embedding_column(layer, 2, combiner='sqrtn')
demo(layer)
indicator_column:
tf.Tensor(
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]], shape=(6, 2), dtype=float32)
embedding_column:
tf.Tensor(
[[-0.29650226  0.40554228]
 [-0.59952354  0.82049197]
 [ 0.4452031   0.47114116]
 [-0.29650226  0.40554228]
 [-0.59952354  0.82049197]
 [ 0.4452031   0.47114116]], shape=(6, 2), dtype=float32)
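
The embedding output above is just a lookup into a trainable vocabulary-size × dimension weight matrix, which is equivalent to multiplying the one-hot indicator by that matrix. A NumPy sketch with made-up, fixed weights (in embedding_column the weights are trainable and randomly initialized):

```python
import numpy as np

# Hypothetical 3x2 embedding table (rows correspond to 'A', 'B', 'C').
emb = np.array([[0.1, 0.2],
                [0.3, 0.4],
                [0.5, 0.6]])

ids = np.array([0, 1, 2, 0, 1, 2])  # indices of A, B, C, A, B, C
one_hot = np.eye(3)[ids]            # what indicator_column produces

# Embedding lookup is the same as one-hot @ embedding table.
dense = one_hot @ emb
assert np.allclose(dense, emb[ids])
print(dense)
```

The direct lookup emb[ids] is what is actually computed, since multiplying by a one-hot matrix would waste work on zeros.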