I am not well known yet; I am a final-year B.Tech student at R.M.D. Engineering College, Kavaraipettai, Tamil Nadu. My interests are deep learning, its applications (for example in medicine), and convex optimization. I love deep learning and the math behind it. The Research Interests section below describes my interests in more detail.
I am now looking to contribute to this field by doing useful research and targeting good conferences and journals, which will improve my profile and deepen my knowledge of the field. In my leisure time I make games using the Godot game engine; unfortunately, the Git account I used to commit them is locked out because of 2FA, so you can only see my previous games if you are interested. I have also attached my CV below, where you can find my previous internships and certifications.
Dr. Ajay Shenoy (PhD @ IISc Bangalore, independent AIML consultant) plays a pivotal role in my life; I am extremely grateful for him and his guidance.
Hobbies: listening to music, playing games, walking until my mind is at ease.
Below you can find my C.V.
My favorite papers (in no particular order)
Attention Is All You Need - You know it, I know it: it hands down transformed deep learning across all the fields that used to be dominated by traditional sequence models such as RNNs, LSTMs, and ConvNets. Indeed, attention was all we needed. Great as it is, it still suffers from quadratic time complexity in both training and inference; to tackle this, many methods have been proposed, such as efficient attention variants (for training) and KV caching (for inference). Still, it's the greatest. A minimal sketch of the core operation follows the link.
[arXiv]
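To make the quadratic cost concrete, here is a toy sketch of scaled dot-product attention (my own PyTorch example, not code from the paper): the score matrix is seq_len x seq_len, which is exactly where the training and inference costs blow up without tricks like KV caching or efficient attention.

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: (seq_len, d_k). The score matrix is (seq_len, seq_len),
    # which is where the quadratic time/memory cost comes from.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16)   # toy sequence of length 8, head dim 16
out = attention(q, k, v)         # (8, 16)
```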
An Image is Worth 16x16 Words - Another classic: a direct application of the Transformer encoder stack to an understanding task (much like BERT), and the paper that gave us ViT. By tokenizing an image with a non-overlapping convolution layer and adding learnable position embeddings (similar to GPT-2), it set off a new wave of Vision Transformers (ViTs) such as DETR (an encoder-decoder architecture) and many more. It is primarily pretrained on JFT-300M (an internal Google dataset), which is largely how it shows such capable understanding. A sketch of the patch tokenization is below.
[arXiv]
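Here is a small sketch of the patch tokenization step as I understand it (my own toy PyTorch code, assuming typical ViT-Base hyperparameters): a non-overlapping convolution whose kernel size and stride equal the patch size, plus a learnable position embedding added to the resulting tokens.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Tokenize an image into non-overlapping patches (ViT-style sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        # Kernel size == stride, so patches do not overlap.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # learnable positions

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) -> the "16x16 words"
        return x + self.pos

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 768)
```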
CLIP - This paper shows that even for single-modal understanding tasks such as classification, we can use multimodal data to further enhance the model's understanding of the data, thus improving performance on the single-modal task, much like how we learn about objects. Improvements to CLIP's loss function lead us to SigLIP. A sketch of the contrastive objective is below.
[arXiv]
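A rough sketch of the CLIP-style contrastive objective (my own simplified PyTorch version, with a fixed temperature rather than CLIP's learned one): matched image-text pairs sit on the diagonal of the similarity matrix, and a symmetric cross-entropy pulls them together while pushing mismatched pairs apart. SigLIP replaces this softmax objective with a pairwise sigmoid loss.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, d) embeddings of N matched image-text pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(len(logits))            # i-th image matches i-th text
    # Symmetric cross-entropy: image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(4, 512), torch.randn(4, 512))
```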
Flamingo - This paper aligns a vision encoder and a text LLM using a series of gated cross-attention layers. The LLM is kept frozen, and the visual features from a contrastively pretrained vision encoder are injected through these layers, effectively aligning the vision and text modalities. Although methods like LLaVA exist (where image representations are fed directly into the decoder-only transformer via a projection on top of a frozen ViT: a linear layer in v1, an MLP in v2), Flamingo is a bit of a personal favorite of mine. A sketch of the gating idea is below.
[arXiv]
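Here is a minimal sketch of the tanh-gating idea (my own toy PyTorch code, not the paper's implementation, which also resamples the visual tokens and interleaves these blocks between frozen LM layers): the gate is initialized at zero, so the frozen LM's behavior is unchanged at the start of training and the visual signal is blended in gradually.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of a tanh-gated cross-attention block (Flamingo-style).
    The gate starts at 0, so at initialization the frozen LM's hidden
    states pass through unchanged."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0 -> identity at init

    def forward(self, text_h, vision_h):
        # text_h:   (B, T, dim) frozen-LM hidden states (queries)
        # vision_h: (B, V, dim) visual tokens (keys/values)
        attended, _ = self.attn(text_h, vision_h, vision_h)
        return text_h + torch.tanh(self.gate) * attended

out = GatedCrossAttention()(torch.randn(2, 10, 512), torch.randn(2, 64, 512))
```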
VQ-VAE - This paper converts the continuous latent space of a VAE into a discrete one using an embedding (codebook) layer: each latent vector is mapped to its closest codebook entry, effectively tokenizing it. The paper also uses a distance-based commitment loss on the latents. In my opinion VQ-VAE paved the way for latent token based generative models; for example, VideoPoet uses a multimodal LLM that predicts image or video tokens just as it predicts text, with the tokens encoded and decoded by MAGVIT-v2, a VQ-VAE-style tokenizer. A sketch of the quantization step is below.
[arXiv]
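A small sketch of the quantization step (my own simplified PyTorch version; the codebook and commitment losses are omitted): each continuous latent is snapped to its nearest codebook entry, and a straight-through estimator lets gradients flow back to the encoder despite the non-differentiable argmin.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Sketch of VQ-VAE quantization: snap each latent to its nearest codebook entry."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                          # z: (B, N, dim) continuous latents
        # Squared distances to every codebook vector: (B, N, num_codes)
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(idx)                   # quantized latents
        # Straight-through estimator: copy gradients from z_q to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx

z_q, tokens = VectorQuantizer()(torch.randn(1, 16, 64))
```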
Mamba - This paper shows the long-range nature of SSMs and how incredibly useful it can be. For me, Mamba wasn't built in a day; its foundations were laid by the S4 SSM. S4 highlighted that transformers perform poorly on long-range tasks: on one such task, Path-X, transformers did significantly worse than random guessing (under 50%). S4 addresses this by initializing its parameters from the HiPPO matrix (from the same authors as Mamba) to make it long-range dependent; S4D initializes the state matrix diagonally; S5 improves on S4's discretization by using ZOH (zero-order hold); and S6 (Mamba) combines gating (as in RNNs and LSTMs), H3 (Hungry Hungry Hippos, built on the HiPPO matrix mentioned above), and the ZOH discretization from S5. Further improvements parallelize the computation via 1-semiseparable (1-SS) matrices, under which SSMs become equivalent to a form of attention. A sketch of the ZOH discretization is below.
[arXiv v1, arXiv v2]
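A rough sketch of ZOH discretization and the recurrent view of an SSM (my own toy PyTorch code with made-up matrices; real S4/S6 models use structured HiPPO-based A matrices and, in Mamba, an input-dependent step size delta): the continuous parameters (A, B) are discretized to (A_bar, B_bar) with step size delta, then the state is updated recurrently.

```python
import torch

def zoh_discretize(A, B, delta):
    # Continuous-time SSM: x'(t) = A x(t) + B u(t).
    # Zero-order hold:  A_bar = exp(delta * A)
    #                   B_bar = A^{-1} (exp(delta * A) - I) B   (invertible A)
    A_bar = torch.matrix_exp(delta * A)
    B_bar = torch.linalg.solve(A, (A_bar - torch.eye(A.shape[0])) @ B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    # Recurrent view: x_k = A_bar x_{k-1} + B_bar u_k,  y_k = C x_k.
    x = torch.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + (B_bar * u_k).squeeze(-1)
        ys.append(C @ x)
    return torch.stack(ys)

N = 4                                          # toy state size
A = -torch.eye(N) - 0.1 * torch.randn(N, N)    # made-up (roughly stable) state matrix
B, C = torch.randn(N, 1), torch.randn(1, N)
A_bar, B_bar = zoh_discretize(A, B, delta=torch.tensor(0.1))
y = ssm_scan(A_bar, B_bar, C, torch.randn(20)) # scalar input sequence of length 20
```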
I am not doing these papers justice by merely listing them; I will make a separate page dedicated to them later.
Publications
I have just started my publication journey. One paper is listed for now; I will update this regularly whenever I publish.
A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
[arXiv, GitHub]
Research Interests
I can collaborate readily in these areas: I comprehend the concepts numerically, attempt to explain them theoretically, and aim to contribute something new to each of them.
Deep Learning
Deep Learning for Medicine
Semi-supervised Learning
Distributed training
Statistics and Probability
LLMs & Multimodal LLMs
Machine Translation
Language Modeling
AI Agents
Skills
These represent my current skill set for the research interests listed above.
Deep Learning
Reinforcement Learning
Self-supervised Learning
Semi-supervised Learning
Distributed training
Statistics and Probability
Game programming
Convex Optimization
Lecture Notes
I usually present/conduct paper-reading sessions on the AI4Bharat Discord server (every Friday, 6:00-7:00 pm IST / 12:30-1:30 pm UTC). The list of lectures is given below. You can also join the AI4Bharat Discord server here.