Custom architecture for the backbone model

Hello, 
I need to match entities described by mixed data (text, numerical values, and images), where each entity can have multiple image inputs.
Is it possible to use a custom architecture for the backbone that creates the entity representation, so that I can plug in a multi-input multimodal model?
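
To make the question concrete, here is a rough sketch of the kind of interface I have in mind. Everything in it is hypothetical: `Entity`, `encode_entity`, `mean_pool`, and the toy encoders are placeholder names, not part of any library. In practice each toy encoder would be a real model (for example a text transformer and an image CNN); the point is only that a variable number of images per entity still has to end up as one fixed-length vector.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Entity:
    """Hypothetical container for one entity with mixed inputs."""
    text: str
    numbers: Sequence[float]
    images: Sequence[Sequence[float]]  # each image as a flat list of pixel values


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equally sized vectors element-wise into a single vector."""
    if not vectors:
        raise ValueError("at least one vector is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def encode_entity(
    entity: Entity,
    text_encoder: Callable[[str], Vector],
    image_encoder: Callable[[Sequence[float]], Vector],
) -> Vector:
    """Build one fixed-length representation from all modalities."""
    text_vec = text_encoder(entity.text)
    # Encode every image, then pool so the image count does not change the size.
    image_vec = mean_pool([image_encoder(img) for img in entity.images])
    # Simple fusion by concatenation: text features, raw numbers, pooled images.
    return text_vec + list(entity.numbers) + image_vec


# Toy stand-ins for real encoders (assumptions for illustration only).
def toy_text_encoder(text: str) -> Vector:
    return [float(len(text)), float(len(text.split()))]


def toy_image_encoder(pixels: Sequence[float]) -> Vector:
    return [min(pixels), max(pixels)]


two_images = Entity("red shoe", [9.5], [[0.0, 1.0], [0.5, 0.5]])
three_images = Entity("blue hat", [4.0], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

vec_a = encode_entity(two_images, toy_text_encoder, toy_image_encoder)
vec_b = encode_entity(three_images, toy_text_encoder, toy_image_encoder)
```

Both `vec_a` and `vec_b` have the same length even though the entities have different numbers of images, which is the property I would need from a custom backbone.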
Thank you