Help me convert this TabTransformer model to Nx

Hey everyone, I have some Python code that creates and trains a TabTransformer model. I then extract its embedding layer so I can use it to generate embedding vectors for tabular data that I have.

Here is the code:

```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define the TabTransformer model
class TabTransformer(nn.Module):
    def __init__(self, num_features, num_classes, dim_embedding=64, num_heads=4, num_layers=4):
        super(TabTransformer, self).__init__()
        self.embedding = nn.Linear(num_features, dim_embedding)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim_embedding, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim_embedding, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = x.unsqueeze(1)  # Adding a sequence length dimension
        x = self.transformer(x)
        x = torch.mean(x, dim=1)  # Pooling
        x = self.classifier(x)
        return x

data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "product_category": [1, 2, 1, 3, 2],
    "amount": [100, 200, 150, 300, 250]
})

X = data.drop("customer_id", axis=1)
y = data["customer_id"] - 1

# Splitting the dataset into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.0, random_state=42)
X_train = X
X_test = X
y_train = y
y_test = y

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model parameters
num_features = X_train_scaled.shape[1]
num_classes = 5  # Adjusted based on unique customer ids

# Initialize the model, loss, and optimizer
model = TabTransformer(num_features, num_classes).to(torch.device('cpu'))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Converting data to tensors
X_train_tensor = torch.FloatTensor(X_train_scaled)
y_train_tensor = torch.LongTensor(y_train.values)

# Training loop
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X_train_tensor)
    loss = criterion(output, y_train_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Evaluation
model.eval()
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.LongTensor(y_test.values)
with torch.no_grad():
    predictions = model(X_test_tensor)
    _, predicted_classes = torch.max(predictions, 1)
    accuracy = (predicted_classes == y_test_tensor).float().mean()
    print(f'Test Accuracy: {accuracy.item()}')

# Get embedding from test data
embedding = model.embedding
embedding(X_test_tensor)
```

This works great, but now I want to convert this code to Elixir so I can serve embeddings from my backend without needing to call out to a separate Python service.

I’m very new to the ML world, so I’m honestly not sure where to start with this, especially the part that actually creates the model itself.

Update

So, I started trying to convert it. So far I was able to generate the inputs and the scaler, but I’m totally stuck on the model creation: I have no idea what the equivalent Axon functions are for the ones used in the TabTransformer class.

```elixir
Mix.install([
  {:kino_explorer, "~> 0.1.20"},
  {:axon, "~> 0.7"},
  {:scholar, github: "elixir-nx/scholar"},
  {:table_rex, "~> 4.0", override: true}
])

alias Explorer.{Series, DataFrame}
alias Scholar.Preprocessing.StandardScaler
require DataFrame
require Series

data =
  DataFrame.new(
    customer_id: [1, 2, 3, 4, 5],
    product_category: [1, 2, 1, 3, 2],
    amount: [100, 200, 150, 300, 250]
  )

x = DataFrame.discard(data, :customer_id)

y =
  data
  |> DataFrame.select(:customer_id)
  |> DataFrame.mutate(customer_id: customer_id - 1)

x_train = Nx.stack(x, axis: 1)
y_train = Nx.stack(y, axis: 1)
x_test = Nx.stack(x, axis: 1)
y_test = Nx.stack(y, axis: 1)

scaler = StandardScaler.fit(x_train, axes: [0])
x_train_scaled = StandardScaler.transform(scaler, x_train)
x_test_scaled = StandardScaler.transform(scaler, x_test)

{_, num_features} = Nx.shape(x_train_scaled)
# Adjusted based on unique customer ids
num_classes = 5
```
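From reading the Axon docs, here is my rough, untested guess at how the non-transformer parts of the TabTransformer class could map over (nn.Linear seems to correspond to Axon.dense, and the mean pooling is a no-op since the sequence length is 1). The nn.TransformerEncoder part in the middle is exactly the piece I can't figure out:

```elixir
# Untested sketch: only the embedding and classifier parts of the PyTorch class.
# The transformer encoder layers are missing, which is where I'm stuck.
dim_embedding = 64

model =
  Axon.input("features", shape: {nil, num_features})
  # nn.Linear(num_features, dim_embedding)
  |> Axon.dense(dim_embedding, name: "embedding")
  # nn.TransformerEncoder would go here
  # torch.mean over the sequence axis is skipped because the sequence length is 1
  # nn.Linear(dim_embedding, num_classes), returning logits like the PyTorch version
  |> Axon.dense(num_classes, name: "classifier")

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, num_features}, :f32), %{})
predict_fn.(params, %{"features" => x_test_scaled})
```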

So, I found this notebook in Google Colab that has a Keras implementation of TabTransformer that is a little more low level and, IMO, easier to understand and convert to Axon.

I tried my hand at it, and I think I was able to convert the first model from the notebook, the baseline one, which doesn’t contain the transformer part.

Here is my implementation of it:

```elixir
Mix.install([
  {:kino_explorer, "~> 0.1.20"},
  {:axon, "~> 0.7"},
  {:scholar, github: "elixir-nx/scholar"},
  {:table_rex, "~> 4.0", override: true}
])

alias Explorer.{Series, DataFrame}
alias Scholar.Preprocessing.StandardScaler
require DataFrame
require Series

categorical_features_vocabulary = %{
  work_class: [
    " ?", " Federal-gov", " Local-gov", " Never-worked", " Private",
    " Self-emp-inc", " Self-emp-not-inc", " State-gov", " Without-pay"
  ],
  education: [
    " 10th", " 11th", " 12th", " 1st-4th", " 5th-6th", " 7th-8th", " 9th",
    " Assoc-acdm", " Assoc-voc", " Bachelors", " Doctorate", " HS-grad",
    " Masters", " Preschool", " Prof-school", " Some-college"
  ],
  marital_status: [
    " Divorced", " Married-AF-spouse", " Married-civ-spouse",
    " Married-spouse-absent", " Never-married", " Separated", " Widowed"
  ],
  occupation: [
    " ?", " Adm-clerical", " Armed-Forces", " Craft-repair", " Exec-managerial",
    " Farming-fishing", " Handlers-cleaners", " Machine-op-inspct", " Other-service",
    " Priv-house-serv", " Prof-specialty", " Protective-serv", " Sales",
    " Tech-support", " Transport-moving"
  ],
  relationship: [
    " Husband", " Not-in-family", " Other-relative", " Own-child", " Unmarried", " Wife"
  ],
  race: [" Amer-Indian-Eskimo", " Asian-Pac-Islander", " Black", " Other", " White"],
  gender: [" Female", " Male"],
  native_country: [
    " ?", " Cambodia", " Canada", " China", " Columbia", " Cuba", " Dominican-Republic",
    " Ecuador", " El-Salvador", " England", " France", " Germany", " Greece", " Guatemala",
    " Haiti", " Holand-Netherlands", " Honduras", " Hong", " Hungary", " India", " Iran",
    " Ireland", " Italy", " Jamaica", " Japan", " Laos", " Mexico", " Nicaragua",
    " Outlying-US(Guam-USVI-etc)", " Peru", " Philippines", " Poland", " Portugal",
    " Puerto-Rico", " Scotland", " South", " Taiwan", " Thailand", " Trinadad&Tobago",
    " United-States", " Vietnam", " Yugoslavia"
  ]
}

categorical_feature_names = Map.keys(categorical_features_vocabulary)

# Embedding dimensions of the categorical features
embedding_dimensions = 16
# Number of MLP blocks in the baseline model
mlp_blocks = 2
mlp_hidden_units_factors = [2, 1]
dropout_rate = 0.2

categorical_inputs_names = [
  :work_class,
  :education,
  :marital_status,
  :occupation,
  :relationship,
  :race,
  :gender,
  :native_country
]

numeric_inputs_names = [
  :age,
  :education_number,
  :capital_gain,
  :capital_loss,
  :hours_per_week
]

encode_categorical_inputs = fn inputs_names, embedding_dimensions ->
  Enum.map(inputs_names, fn input_name ->
    vocabulary_size =
      categorical_features_vocabulary
      |> Map.fetch!(input_name)
      |> Enum.count()

    input_name
    |> Atom.to_string()
    |> Axon.input(shape: {nil})
    |> Axon.embedding(vocabulary_size, embedding_dimensions, name: "embedding_#{input_name}")
  end)
end

encode_numeric_inputs = fn input_names ->
  Enum.map(input_names, fn input_name ->
    input_name
    |> Atom.to_string()
    |> Axon.input(shape: {nil})
    |> Axon.reshape({:auto, 1})
  end)
end

encoded_categorical_features =
  encode_categorical_inputs.(categorical_inputs_names, embedding_dimensions)

encoded_numeric_features = encode_numeric_inputs.(numeric_inputs_names)

features =
  encoded_categorical_features
  |> Kernel.++(encoded_numeric_features)
  |> Axon.concatenate()

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

{_, feed_forward_units} = Axon.get_output_shape(features, Nx.template({1}, :u32)).shape

create_mlp = fn hidden_units, dropout_rate, activation, normalization_layer, name ->
  block = fn x ->
    Enum.reduce(hidden_units, x, fn units, x ->
      x
      |> normalization_layer.()
      |> Axon.dense(units, activation: activation)
      |> Axon.dropout(rate: dropout_rate)
    end)
  end

  Axon.block(block, name: name)
end

features =
  Enum.reduce(1..mlp_blocks, features, fn index, features ->
    mlp =
      create_mlp.(
        [feed_forward_units],
        dropout_rate,
        :gelu,
        &Axon.layer_norm/1,
        "feed_forward_#{index - 1}"
      )

    mlp.(features)
  end)

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

mlp_hidden_units = Enum.map(mlp_hidden_units_factors, &(&1 * feed_forward_units))

mlp = create_mlp.(mlp_hidden_units, dropout_rate, :selu, &Axon.batch_norm/1, "MLP")
features = mlp.(features)

Axon.Display.as_graph(features, Nx.template({1}, :u32), direction: :left_right)

model = Axon.dense(features, 1, activation: :sigmoid, name: "sigmoid")

Axon.Display.as_graph(model, Nx.template({1}, :u32), direction: :left_right)

{init_fn, predict_fn} = Axon.build(model)

params = init_fn.(Nx.template({1}, :u32), %{})

inputs = %{
  "hours_per_week" => Nx.tensor([1]),
  "capital_loss" => Nx.tensor([1]),
  "capital_gain" => Nx.tensor([1]),
  "education_number" => Nx.tensor([1]),
  "age" => Nx.tensor([1]),
  "native_country" => Nx.tensor([1]),
  "gender" => Nx.tensor([1]),
  "race" => Nx.tensor([1]),
  "relationship" => Nx.tensor([1]),
  "occupation" => Nx.tensor([1]),
  "marital_status" => Nx.tensor([1]),
  "education" => Nx.tensor([1]),
  "work_class" => Nx.tensor([1])
}

predict_fn.(params, inputs)
```

For now I’m ignoring the part that loads the data and normalizes it.
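Once I do load the real data, I’m assuming the notebook’s model.fit call would translate to Axon.Loop. Here is an untested sketch that reuses the dummy `inputs` map above with a fake label, just to convince myself the graph can train end to end (the `:binary_cross_entropy` and `:adam` atoms are my guess at the right Axon options):

```elixir
# Untested sketch: a stream of {inputs, labels} batches. The data here is fake;
# real batches would come from the preprocessed Census dataset in the notebook.
train_data =
  Stream.repeatedly(fn -> {inputs, Nx.tensor([[1.0]])} end)
  |> Stream.take(100)

trained_params =
  model
  |> Axon.Loop.trainer(:binary_cross_entropy, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 5)
```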

Now, here are some questions that came up while writing that code:

  1. How can I be sure that the model I implemented is equivalent to the one in the Colab notebook? I tried running a predict on it with the same input, but it generates a different result every time the model is built; it’s probably using some random numbers for part of it (maybe the embeddings?).
    I did generate a graph of both of them and they do seem to be equal, but I’m not sure…

  2. The main difference between the baseline model I implemented and the TabTransformer in the Colab notebook is the addition of a multi-head attention transformer block.
    In the Colab they used keras.layers.MultiHeadAttention to create it:

```python
attention_output = layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embedding_dims,
    dropout=dropout_rate,
    name=f"multihead_attention_{block_idx}",
)(encoded_categorical_features, encoded_categorical_features)
```

The issue is that there is no equivalent function in Axon to generate the same layer.

I did find a possible implementation in the Bumblebee project, in bumblebee/lib/bumblebee/layers/transformer.ex (elixir-nx/bumblebee on GitHub).

But, tbh, I’m not exactly sure how to use it and inject it into my model; its inputs don’t seem to translate exactly to the Keras ones. Any help here would be greatly appreciated!
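In the meantime, here is how I think the rest of one transformer block from the notebook (the Add skip connections and LayerNorms around the attention, plus the point-wise feed-forward) would wire up in Axon. This is only a sketch: `attention_output` is a placeholder for whatever attention layer I end up using, `embedded_features` stands for the stacked categorical embeddings the notebook operates on (not the flat concatenation from my baseline), and I replaced the notebook's MLP helper with a single gelu dense layer:

```elixir
# Sketch only: `attention_output` is a placeholder for the multi-head attention node,
# and `embedded_features` is the input the attention was computed from.
transformer_block = fn embedded_features, attention_output, block_index ->
  # Skip connection 1: Add, then LayerNorm
  x =
    Axon.add(attention_output, embedded_features)
    |> Axon.layer_norm(name: "layer_norm_1_#{block_index}")

  # Point-wise feed-forward (the notebook uses its create_mlp helper here)
  feed_forward =
    x
    |> Axon.dense(embedding_dimensions, activation: :gelu, name: "feed_forward_#{block_index}")
    |> Axon.dropout(rate: dropout_rate)

  # Skip connection 2: Add, then LayerNorm
  Axon.add(feed_forward, x)
  |> Axon.layer_norm(name: "layer_norm_2_#{block_index}")
end
```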

Update:

I think I figured out how to use the Bumblebee transformer to get the same effect as the Keras MultiHeadAttention:

```elixir
Bumblebee.Layers.Transformer.multi_head_attention(
  encoded_categorical_features,
  encoded_categorical_features,
  encoded_categorical_features,
  num_heads: num_heads,
  hidden_size: embedding_dims,
  dropout_rate: dropout_rate,
  name: "multi_head_attention_#{block_index}"
)
```

I didn’t test it yet, but from the documentation this should be equivalent. I’m still not sure whether that function directly gives me the same behavior as the Keras one, or whether I should use Bumblebee.Layers.Transformer.block instead.