I’ve been told that one of my super-powers is saying “I don’t understand” over and over again until either a coherent explanation is given or everyone else realizes that they don’t fully understand either.

Let’s talk about recommender system modeling!
I have a pile of data that contains user ids, item ids, user and item attributes, and logs of which users interacted with which items. I want to build a model to help me figure out which items to recommend to which users. I have no experience and am therefore completely naive.
I take my data, one-hot encode any categorical features (like ids), and train a linear y=wx+b model with 0-1 labels and cross-entropy loss. Then I recommend the highest scoring items to each user.

Does this work? No! Why not? The model recommends the same items to every user.
Realizing that I’ve tried to use a regression model for a classification problem, I say to myself: “Huh, that didn’t work. Maybe the problem is that my model is linear?” So, I wrap the model in a logistic function and train it again.
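(For concreteness, the model so far is roughly the sketch below: toy sizes I made up, written in PyTorch for convenience. The logistic squashing hides inside the loss function.)

```python
import torch
import torch.nn as nn

# Toy sizes, made up for illustration.
num_users, num_items = 1000, 500
num_features = num_users + num_items  # one-hot user id + one-hot item id

# Linear y = wx + b; "wrapping it in a logistic function" just means squashing
# the score with a sigmoid, which BCEWithLogitsLoss applies internally.
model = nn.Linear(num_features, 1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# A fake batch of interaction logs: 1 = interacted, 0 = didn't.
batch = 64
x = torch.zeros(batch, num_features)
x[torch.arange(batch), torch.randint(0, num_users, (batch,))] = 1.0              # user one-hot
x[torch.arange(batch), num_users + torch.randint(0, num_items, (batch,))] = 1.0  # item one-hot
y = torch.randint(0, 2, (batch, 1)).float()

loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```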

Does it work now? Still no! Same result.
Okay, back to the drawing board. Why isn’t this model doing any personalization? Well...the user features may shift the predicted item scores up and down, but they don’t change the item scores relative to one another. Seems like I need interactions between user and item features.
I go back to my data, manually compute the element-wise products between user and item features, add those as input features, and train again.
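Concretely, the crossing step looks something like this (a toy sketch with tiny made-up attribute vectors; when the user and item vectors have different lengths, “every pairwise product” is the flattened outer product of the two):

```python
import torch

# Tiny made-up attribute vectors for one (user, item) pair.
user_feats = torch.tensor([1.0, 0.0, 1.0])       # 3 user attributes
item_feats = torch.tensor([0.0, 1.0, 1.0, 0.0])  # 4 item attributes

# Every product of a user feature with an item feature becomes its own
# crossed input feature (the flattened outer product of the two vectors).
crossed = torch.outer(user_feats, item_feats).flatten()  # 3 * 4 = 12 crossed features

# Model input: original user features + item features + the crossed features.
x = torch.cat([user_feats, item_feats, crossed])         # 3 + 4 + 12 = 19 features
```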

Does this work? Yes! Now I get different item scores for different users, but...seems inefficient.
I go back to my model, add product terms between features (re-inventing higher-order regression), remove crossed features from the inputs, and retrain.
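That model is roughly the following sketch (not any particular library’s implementation): a plain linear term plus one learned weight for every pair of input features.

```python
import torch
import torch.nn as nn

class SecondOrderModel(nn.Module):
    """Linear terms plus an explicit learned weight for every pair of input features."""

    def __init__(self, num_features: int):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)
        # One weight per (i, j) feature pair. Pairs that never co-occur in the
        # training data get no gradient signal, which is the sparsity problem.
        self.pair_weights = nn.Parameter(torch.randn(num_features, num_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features)
        mask = torch.triu(torch.ones_like(self.pair_weights), diagonal=1)  # keep i < j only
        pairwise = torch.einsum("bi,bj,ij->b", x, x, self.pair_weights * mask)
        return self.linear(x).squeeze(-1) + pairwise
```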

Still works? Sure, but...my data doesn’t have support for all pairs of features, so some of the weights don’t get updates.
One way to handle feature interactions involves low-rank approximations of matrices, so I represent each input feature with a k-dimensional vector and replace the weights on the pairwise feature terms with vector products. People tell me this is called a factorization machine. 🤷🏼‍♂️
Now I don’t need to manually compute crossed features, and I’ve handled the sparsity problem by factorizing the feature interactions, which means that I can compute pairwise interactions between features that don’t occur together in the training data (using vector products.) 🥳
That means I’ve moved beyond memorizing exactly the pairwise relationships indicated by the training data and started to generalize to other relationships. I guess it’s sort of like triangulation in a vector space? However it works, this is way better. 🎉
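Written out, the factorization machine looks roughly like this (a sketch, using the standard identity that turns the pairwise sum into a couple of matrix products):

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Linear terms plus factorized pairwise interactions: the weight on the
    (i, j) feature cross is the dot product <v_i, v_j>."""

    def __init__(self, num_features: int, k: int = 16):
        super().__init__()
        self.linear = nn.Linear(num_features, 1)
        # One k-dimensional vector per feature instead of one weight per feature pair.
        self.v = nn.Parameter(torch.randn(num_features, k) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features)
        # sum_{i<j} <v_i, v_j> x_i x_j
        #   = 0.5 * sum_f [ (sum_i x_i v_if)^2 - sum_i (x_i v_if)^2 ]
        xv = x @ self.v                   # (batch, k)
        x2v2 = (x ** 2) @ (self.v ** 2)   # (batch, k)
        pairwise = 0.5 * (xv ** 2 - x2v2).sum(dim=-1)
        return self.linear(x).squeeze(-1) + pairwise
```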
I’m out of ideas but I hear deep learning is cool, so I go search for other models. Google has this Wide and Deep model I can pull off the shelf and try, but...how does it work?
Well, it’s got my explicit pairwise feature interactions from earlier and it’s got vector embeddings, but it...concatenates the embeddings and passes them through ReLU layers? I guess that’s non-linear, but this feels like a step backward and a step forward mixed together.
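The deep half, as far as I can tell, amounts to something like this toy sketch (the idea, not Google’s actual implementation); the wide half is basically the crossed-feature linear model from earlier:

```python
import torch
import torch.nn as nn

class WideAndDeepish(nn.Module):
    """Toy Wide & Deep-style model: a linear 'wide' part over (crossed) features
    plus an MLP 'deep' part over concatenated id embeddings."""

    def __init__(self, num_users: int, num_items: int, num_wide_features: int, dim: int = 16):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.deep = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )
        self.wide = nn.Linear(num_wide_features, 1)  # takes the manually crossed features

    def forward(self, user_ids, item_ids, wide_x):
        # Concatenate the embeddings and run them through ReLU layers.
        # Note that there's no dot product between user and item vectors anywhere.
        deep_in = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return (self.deep(deep_in) + self.wide(wide_x)).squeeze(-1)
```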
So now I’m wondering:

Do the layers in the network do the same thing as a vector product? If so, why do I need them? If not, what are they doing?
Well, the Wide and Deep paper suggests that factorization machines tend to over-generalize. 🤔
Okay, in that case, why didn’t we build a model that contains both explicit and factorized pairwise interactions, like higher-order regression plus a factorization machine?

Look, deep learning is cool, okay? Also, you wouldn’t need TensorFlow for that. 🙃
Fine, I want my resume to look good too, but at least tell me that the ReLU layers are good at learning multiplicative relationships?

I poke around, read some more papers, and stumble across Google’s Latent Cross paper. Looks like they’re going to test this out. Perfect! 🤓
Well, shit.
Those layers are doing something but it’s a bad approximation of a vector product (at best.)

So now I’m asking myself: Self, why did I ever have any reason to believe that non-linear functions combining individual dimensions of vector embeddings would learn entity relationships?
Something something universal approximation theorem? 🤔

Okay, sure, but when we got rid of the vector products, didn’t we just throw out a bunch of knowledge we had about the structure of the problem without capturing it anywhere else in the model?
There must be a better way! I still want to do deep learning, but let’s bring back the vector products. How about we smash Wide and Deep together with a factorization machine?
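Something shaped roughly like this sketch, which I’m writing from scratch here (models along these lines do exist, DeepFM for example, but this isn’t anyone’s reference implementation): shared per-feature embeddings feed both the factorized pairwise term and the ReLU tower.

```python
import torch
import torch.nn as nn

class DeepFMish(nn.Module):
    """Sketch of 'Wide & Deep meets factorization machine': shared per-field embeddings
    feed both an FM-style pairwise term and an MLP over the concatenated embeddings."""

    def __init__(self, field_sizes, dim: int = 16):
        super().__init__()
        # One embedding table per categorical field (e.g. user id, item id, category).
        self.embeddings = nn.ModuleList(nn.Embedding(n, dim) for n in field_sizes)
        self.biases = nn.ModuleList(nn.Embedding(n, 1) for n in field_sizes)
        num_fields = len(field_sizes)
        self.deep = nn.Sequential(
            nn.Linear(num_fields * dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, fields):
        # fields: (batch, num_fields) integer ids, one column per field.
        embs = torch.stack(
            [emb(fields[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )  # (batch, num_fields, dim)

        # FM-style terms: first-order biases plus factorized pairwise interactions.
        first_order = sum(b(fields[:, i]) for i, b in enumerate(self.biases)).squeeze(-1)
        sum_sq = embs.sum(dim=1) ** 2            # (batch, dim)
        sq_sum = (embs ** 2).sum(dim=1)          # (batch, dim)
        pairwise = 0.5 * (sum_sq - sq_sum).sum(dim=-1)

        # Deep part: concatenate the same embeddings and run them through ReLU layers.
        deep = self.deep(embs.flatten(start_dim=1)).squeeze(-1)
        return first_order + pairwise + deep
```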