Hugging Face Overview: Hugging Face is a leading platform for natural language processing (NLP), offering a vast repository of pre-trained models, datasets, and tools, empowering developers and researchers to build innovative NLP applications with ease.

Text

- Chatbot GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/chatbot.ipynb
- Text translation and text summarization GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/text%20translation%20and%20text%20summarization.ipynb
- Sentence embeddings GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Sentence%20Embeddings.ipynb
Audio

- Audio classification on a real-world dataset GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/audio%20classification%20real-world%20dataset.ipynb
- Automatic speech recognition and Gradio apps GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Automatic%20Speech%20Recognitions%20and%20Gradio%20apps.ipynb
- Text to speech with VITS (conditional variational autoencoder) GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Text%20to%20Speech%20with%20VITS-Conditional%20Variational%20Autoencoder.ipynb
Image

- Object detection and audio generation based on detections GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Object%20Detection%20with%20detr-resnet-50%20and%20gradio.ipynb
- Image segmentation, image depth and Gradio apps GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/image%20segmentation%2C%20image%20depth%20and%20Garido%20Apps.ipynb
- Image-text retrieval GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Image-Text%20Retrieval.ipynb
- Image captioning GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Image%20Captioning.ipynb
- Visual question answering GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Visual%20Questions%20and%20Answering.ipynb
- Zero-shot image classification GitHub code: https://github.com/Shibli-Nomani/Open-Source-Models-with-Hugging-Face/blob/main/notebooks/Zero-Shot%20Image%20Classification%20with%20CLIP.ipynb
Text

- Chatbot Kaggle code: https://www.kaggle.com/code/shiblinomani/chatbot-with-hugging-face-model
- Text translation and text summarization Kaggle code: https://www.kaggle.com/code/shiblinomani/text-translation-summarization-with-hugging-face/notebook
- Sentence embeddings Kaggle code: https://www.kaggle.com/code/shiblinomani/sentence-embedding-with-hugging-face-model/notebook

Audio

- Audio classification on a real-world dataset Kaggle code: https://www.kaggle.com/code/shiblinomani/audio-classification-real-world-dataset/notebook
- Automatic speech recognition and Gradio apps Kaggle code: https://www.kaggle.com/code/shiblinomani/automatic-speech-recognitions-and-gradio-apps/notebook
- Text to speech with VITS (conditional variational autoencoder) Kaggle code: https://www.kaggle.com/code/shiblinomani/text-to-speech-with-vits-auto-encoder/notebook
Image

- Image-text retrieval Kaggle code: https://www.kaggle.com/code/shiblinomani/image-text-retrieval-with-hugging-face-models/notebook
- Image captioning Kaggle code: https://www.kaggle.com/code/shiblinomani/image-captioning-with-hugging-face-models
- Visual question answering Kaggle code: https://www.kaggle.com/shiblinomani/visual-questions-and-answering-vqa
- Zero-shot image classification Kaggle code: https://www.kaggle.com/shiblinomani/zero-shot-image-classification-with-clip
Jupyter Notebook shortcuts: https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330

- Install VS Code
- Install pyenv (Python version manager)
- Install Python 3.10.8 using pyenv (pyenv cheat sheet added below)
- Video links to install pyenv and Python:
  https://www.youtube.com/watch?v=HTx18uyyHw8
  https://k0nze.dev/posts/install-pyenv-venv-vscode/
- ref link: https://github.com/Shibli-Nomani/MLOps-Project-AirTicketPricePrediction

pyenv shell 3.10.8
- #create the project directory if required
mkdir nameoftheproject
- #install virtualenv
pip install virtualenv
- #create a virtual env under the project directory
python -m venv hugging_env
- #activate the virtual env in PowerShell
.\hugging_env\Scripts\activate
- #install dependencies (Python libraries)
pip install -r requirements.txt
Hugging Face: https://huggingface.co/
ref link: https://huggingface.co/openai/whisper-large-v3/tree/main
Memory requirement: roughly the trained model weights × 1.2. Here the model weight is 3.09 GB, so (3.09 GB × 1.2) = 3.708 GB of memory is required on your local PC.
For Automatic Speech Recognition:
Hugging Face >> Tasks >> scroll down >> choose your task >> Automatic Speech Recognition >> openai/whisper-large-v3
[best recommendation as per Hugging Face]
- Pipeline code snippets perform the complex data preprocessing and model integration in a single call (see the sketch below).
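A minimal sketch of the pipeline API for automatic speech recognition with the openai/whisper-large-v3 checkpoint mentioned above; the file name sample.wav is a placeholder assumption for any local audio file:

```python
from transformers import pipeline

# load an automatic-speech-recognition pipeline with Whisper large-v3
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# "sample.wav" is a placeholder path to a local audio file
result = asr("sample.wav")
print(result["text"])
```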
NLP (Natural Language Processing) is like teaching computers to understand human language. It helps them read, comprehend, extract, translate, and even generate text.
A Transformer is a deep learning model designed for sequential tasks, like language processing. It's used in NLP for its ability to handle long-range dependencies and capture context effectively, making it crucial for tasks such as machine translation, text summarization, and sentiment analysis. It's considered state-of-the-art due to its efficiency in training large-scale models and achieving impressive performance across various language tasks.
PyTorch is an open-source machine learning framework developed by Facebook's AI Research lab (FAIR), featuring dynamic computational graphs and extensive deep learning support, empowering flexible and efficient model development.
github link: kaggle link:
! pip install transformers
- model-1 (blenderbot-400M-distill): https://huggingface.co/facebook/blenderbot-400M-distill/tree/main
The "blenderbot-400M-distill" model, detailed in the paper "Recipes for Building an Open-Domain Chatbot," enhances chatbot performance by emphasizing conversational skills like engagement and empathy. Through large-scale models and appropriate training data, it outperforms existing approaches in multi-turn dialogue, with code and models available for public use. A minimal usage sketch follows the leaderboard links below.
- LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- Chatbot Arena Leaderboard: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
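A minimal chatbot sketch with blenderbot-400M-distill (not the exact notebook code; the user message is an illustrative assumption):

```python
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

model_name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
model = BlenderbotForConditionalGeneration.from_pretrained(model_name)

# single-turn exchange; a multi-turn chat would append the dialogue history
user_message = "Hi, can you recommend a good science fiction book?"
inputs = tokenizer(user_message, return_tensors="pt")
reply_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```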
Text Translation: Converting text from one language to another, facilitating cross-cultural communication and understanding.
! pip install transformers
! pip install torch
- model-2 (nllb-200-distilled-600M): https://huggingface.co/facebook/nllb-200-distilled-600M/tree/main
NLLB-200 (No Language Left Behind), the distilled 600M variant, excels in machine translation research, offering single-sentence translations across 200 languages. Detailed in the accompanying paper, it's evaluated using BLEU, spBLEU, and chrF++ metrics, and trained on diverse multilingual data sources with ethical considerations in mind. While primarily for research, its application extends to improving access and education in low-resource language communities. Users should assess domain compatibility and acknowledge limitations regarding input lengths and certification.
- Language codes for machine translation (FLORES-200): https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
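A minimal translation sketch with nllb-200-distilled-600M; the sentence and the English-to-French language codes are illustrative assumptions (codes come from the FLORES-200 list linked above):

```python
from transformers import pipeline

translator = pipeline("translation", model="facebook/nllb-200-distilled-600M")

text = "Open-source models make machine learning accessible to everyone."
# src_lang / tgt_lang take FLORES-200 codes from the list linked above
result = translator(text, src_lang="eng_Latn", tgt_lang="fra_Latn")
print(result[0]["translation_text"])
```

The translator object created here is the one released in the garbage-collection snippet below.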
### To Clear Memory Allocation
Delete model and clear memory using **Garbage Collector**.
# garbage collector
import gc

# delete the pipeline object (here, the translator) so it can be reclaimed
del translator

# reclaim memory occupied by objects that are no longer in use by the program
gc.collect()
Text Summarization: Condensing a piece of text while retaining its essential meaning, enabling efficient information retrieval and comprehension.
! pip install transformers
! pip install torch
- model-3 (bart-large-cnn): https://huggingface.co/facebook/bart-large-cnn/tree/main
BART (large-sized model), fine-tuned on CNN Daily Mail, excels in text summarization tasks. It employs a transformer architecture with a bidirectional encoder and an autoregressive decoder, initially introduced in the paper "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Lewis et al. This model variant, although lacking a specific model card from the original team, is particularly effective for generating summaries, demonstrated by its fine-tuning on CNN Daily Mail text-summary pairs.
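A minimal summarization sketch with bart-large-cnn (the article text is an illustrative assumption):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Hugging Face hosts thousands of pre-trained models and datasets. "
    "Its pipeline API lets developers run state-of-the-art NLP, audio, "
    "and vision models with only a few lines of code."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```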
- Sentence Embedding Overview: Sentence embedding represents a sentence as a dense vector in a high-dimensional space, capturing its semantic meaning.
- Encoder Function: During embedding, the encoder transforms the sentence into a fixed-length numerical representation by encoding its semantic information into a vector format.
- Cosine Similarity/Distance: Cosine similarity measures the cosine of the angle between two vectors, indicating their similarity in orientation. It's vital for comparing the semantic similarity between sentences, irrespective of their magnitude.
- Importance of Cosine Similarity/Distance: Cosine similarity is crucial for tasks like information retrieval, document clustering, and recommendation systems, facilitating accurate assessment of semantic similarity while ignoring differences in magnitude.
! pip install transformers
! pip install sentence-transformers
- model-4 (all-MiniLM-L6-v2): https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main
All-MiniLM-L6-v2 Overview: The all-MiniLM-L6-v2 sentence-transformers model efficiently maps sentences and paragraphs into a 384-dimensional dense vector space, facilitating tasks such as clustering or semantic search with ease.
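A minimal sentence-embedding sketch with all-MiniLM-L6-v2, computing cosine similarity between two example sentences (the sentences are illustrative assumptions):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The cat sits on the mat.", "A feline is resting on a rug."]
embeddings = model.encode(sentences, convert_to_tensor=True)  # shape: (2, 384)

# cosine similarity between the two sentence vectors
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"cosine similarity: {similarity.item():.4f}")
```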
Zero-Shot:
"Zero-shot" refers to the fact that the model makes predictions without direct training on specific classes. Instead, it utilizes its understanding of general patterns learned during training on diverse data to classify samples it hasn't seen before. Thus, "zero-shot" signifies that the model doesn't require any training data for the specific classes it predicts.
-
- transformers: This library provides access to cutting-edge pre-trained models for natural language processing (NLP), enabling tasks like text classification, language generation, and sentiment analysis with ease. It streamlines model implementation and fine-tuning, fostering rapid development of NLP applications.
- datasets: Developed by Hugging Face, this library offers a comprehensive collection of datasets for NLP tasks, simplifying data acquisition, preprocessing, and evaluation. It enhances reproducibility and facilitates experimentation by providing access to diverse datasets in various languages and domains.
- soundfile: With functionalities for reading and writing audio files in Python, this library enables seamless audio processing for tasks such as speech recognition, sound classification, and acoustic modeling. It empowers users to handle audio data efficiently, facilitating feature extraction and analysis.
- librosa: Specializing in music and sound analysis, this library provides tools for audio feature extraction, spectrogram computation, and pitch estimation. It is widely used in applications like music information retrieval, sound classification, and audio-based machine learning tasks, offering essential functionalities for audio processing projects.
The Ashraq/ESC50 dataset is a collection of 2000 environmental sound recordings, categorized into 50 classes, designed for sound classification tasks. Each audio clip is 5 seconds long and represents various real-world environmental sounds, including animal vocalizations, natural phenomena, and human activities.
! pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
- model-5 (clap-htsat-unfused): https://huggingface.co/laion/clap-htsat-unfused/tree/main
clap-htsat-unfused offers a pipeline for contrastive language-audio pretraining, leveraging large-scale audio-text pairs from the LAION-Audio-630K dataset. The model incorporates feature fusion mechanisms and keyword-to-caption augmentation, enabling processing of variable-length audio inputs. Evaluation across text-to-audio retrieval and audio classification tasks showcases its superior performance and availability for public use.
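A minimal zero-shot audio classification sketch with CLAP; the dataset id ashraq/esc50 and the audio column follow the ESC-50 dataset described above, and the candidate labels are illustrative assumptions:

```python
from datasets import Audio, load_dataset
from transformers import pipeline

classifier = pipeline("zero-shot-audio-classification", model="laion/clap-htsat-unfused")

# load one ESC-50 clip and resample it to the 48 kHz rate CLAP expects
esc50 = load_dataset("ashraq/esc50", split="train[:1]")
esc50 = esc50.cast_column("audio", Audio(sampling_rate=48_000))
audio_sample = esc50[0]["audio"]["array"]

candidate_labels = ["Sound of a dog barking", "Sound of a vacuum cleaner"]
print(classifier(audio_sample, candidate_labels=candidate_labels))
```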
For 5 seconds of audio sampled at 8,000 Hz, the signal contains (5 × 8,000) = 40,000 values.
With transformers, the effective signal length instead depends on the input sequence and the attention mechanism, so 1 second of input can behave like a much longer sequence (for example, the equivalent of 60 seconds of processing).
In the case of transformers, particularly in natural language processing tasks, the signal length is determined by the length of the input sequences and the attention mechanism employed. Unlike raw signal processing, where each sample corresponds to a fixed time interval, the signal may appear elongated because the attention mechanism operates over sequences of tokens. For example, if the attention mechanism processes 60 tokens per second, 1 second of input may appear equivalent to 60 seconds in terms of processing complexity.
In natural language processing, the input sequence refers to a series of tokens representing words or characters in a text. The attention mechanism in transformers helps the model focus on relevant parts of the input sequence during processing by assigning weights to each token, allowing the model to prioritize important information. Think of it like giving more attention to key words in a sentence while understanding its context, aiding in tasks like translation and summarization.
! pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
!pip install gradio
- !pip install transformers: Access state-of-the-art natural language processing models and tools.
- !pip install datasets: Simplify data acquisition and preprocessing for natural language processing tasks.
- !pip install soundfile: Handle audio data reading and writing tasks efficiently.
- !pip install librosa: Perform advanced audio processing and analysis tasks.
- !pip install gradio: Develop interactive web-based user interfaces for machine learning models.
Librosa is a Python library designed for audio and music signal processing. It provides functionalities for tasks such as audio loading, feature extraction, spectrogram computation, pitch estimation, and more. Librosa is commonly used in applications such as music information retrieval, sound classification, speech recognition, and audio-based machine learning tasks.
LibriSpeech ASR: A widely-used dataset for automatic speech recognition (ASR), containing a large collection of English speech recordings derived from audiobooks. With over 1,000 hours of labeled speech data, it facilitates training and evaluation of ASR models for transcription tasks.
dataset: https://huggingface.co/datasets/librispeech_asr
model: https://huggingface.co/distil-whisper
model: https://github.com/huggingface/distil-whisper
Distil-Whisper:
Distil-Whisper, a distilled variant of Whisper, boasts 6 times faster speed, is 49% smaller, and maintains a word error rate (WER) within 1% on out-of-distribution evaluation sets. With options ranging from distil-small.en to distil-large-v2, it caters to diverse latency and resource constraints.
- Virtual Assistants
- Voice-Controlled Devices
- Dictation Software
- Mobile Devices
- Edge Computing Platforms
- Online Transcription Services
Build & Share Delightful Machine Learning Apps
Gradio offers the fastest way to showcase your machine learning model, providing a user-friendly web interface that enables anyone to utilize it from any location!
Gradio website: https://www.gradio.app/
Gradio on Hugging Face: https://huggingface.co/gradio
Gradio GitHub: https://github.com/gradio-app/gradio
Gradio: Develop Machine Learning Web Apps with Ease
Gradio, an open-source Python package, enables swift creation of demos or web apps for your ML models, APIs, or any Python function. Share your creations instantly using built-in sharing features, requiring no JavaScript, CSS, or web hosting expertise.
error: DuplicateBlockError: At least one block in this Blocks has already been rendered.
solution: change the block name that was declared earlier, e.g.:
demonstrations = gr.Blocks()
- note: The app will continue running unless you run demo.close()
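A minimal Gradio ASR demo sketch tying the pieces above together; distil-whisper/distil-small.en is one of the Distil-Whisper checkpoints mentioned earlier, and the interface labels are illustrative assumptions:

```python
import gradio as gr
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")

def transcribe(filepath):
    # filepath comes from the Gradio audio component (upload or microphone)
    if filepath is None:
        return "No audio provided."
    return asr(filepath)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs=gr.Textbox(label="Transcription"),
    title="Distil-Whisper ASR demo",
)
demo.launch()
# demo.close()  # stop the app when finished
```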
!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer
- !pip install transformers: Installs the Transformers library, which provides state-of-the-art natural language processing models for various tasks such as text classification, translation, summarization, and question answering.
- !pip install gradio: Installs Gradio, a Python library that simplifies the creation of interactive web-based user interfaces for machine learning models, allowing users to interact with models via a web browser.
- !pip install timm: Installs Timm, a PyTorch library that offers a collection of pre-trained models and a simple interface to use them, primarily focused on computer vision tasks such as image classification and object detection.
- !pip install inflect: Installs Inflect, a Python library used for converting numbers to words, pluralizing and singularizing nouns, and generating ordinals and cardinals.
- !pip install phonemizer: Installs Phonemizer, a Python library for converting text into phonetic transcriptions, useful for tasks such as text-to-speech synthesis and linguistic analysis.
Note: py-espeak-ng is only available on Linux operating systems.
To run locally in a Linux machine, follow these commands:
sudo apt-get update
sudo apt-get install espeak-ng
pip install py-espeak-ng
APT stands for Advanced Package Tool. It is a package management system used by various Linux distributions, including Debian and Ubuntu. APT allows users to install, update, and remove software packages on their system from repositories. It also resolves dependencies automatically, ensuring that all required dependencies for a package are installed.
- sudo apt-get update: Updates the package index of APT.
- sudo apt-get install espeak-ng: Installs the espeak-ng text-to-speech synthesizer.
- pip install py-espeak-ng: Installs the Python interface for espeak-ng.
kakao-enterprise/vits-ljs:
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
- Overview:
VITS is an end-to-end model for speech synthesis, utilizing a conditional variational autoencoder (VAE) architecture. It predicts speech waveforms based on input text sequences, incorporating a flow-based module and a stochastic duration predictor to handle variations in speech rhythm.
- Features:
  - The model generates spectrogram-based acoustic features using a Transformer-based text encoder and coupling layers, allowing it to capture complex linguistic patterns.
  - It includes a stochastic duration predictor, enabling it to synthesize speech with diverse rhythms from the same input text.
- Training and Inference:
  - VITS is trained with a combination of variational lower bound and adversarial training losses.
  - Normalizing flows are applied to enhance model expressiveness.
  - During inference, text encodings are up-sampled based on duration predictions and mapped into waveforms using a flow module and HiFi-GAN decoder.
- Variants and Datasets:
  - Two variants of VITS are trained on the LJ Speech and VCTK datasets.
  - LJ Speech comprises 13,100 short audio clips (approx. 24 hours), while VCTK includes approximately 44,000 short audio clips from 109 native English speakers (approx. 44 hours).
model: https://huggingface.co/kakao-enterprise/vits-ljs
Text-to-audio waveform array for speech generation is the process of converting textual input into a digital audio waveform representation. This involves synthesizing speech from text, where a machine learning model translates written words into spoken language. The model analyzes the text, generates corresponding speech signals, and outputs an audio waveform array that can be played back as human-like speech. The benefits include enabling natural language processing applications such as virtual assistants, audiobook narration, and automated customer service, enhancing accessibility for visually impaired individuals, and facilitating audio content creation in various industries.
{'audio': array([[ 0.00112925, 0.00134222, 0.00107496, ..., -0.00083117, -0.00077596, -0.00064528]], dtype=float32), 'sampling_rate': 22050}
note: This dictionary contains an audio waveform represented as a NumPy array, along with its corresponding sampling rate. The audio array consists of amplitude values sampled at a rate of 22,050 Hz.
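A minimal text-to-speech sketch that produces a waveform dictionary like the one above, using the vits-ljs checkpoint (the input sentence is an illustrative assumption):

```python
from transformers import pipeline

tts = pipeline("text-to-speech", model="kakao-enterprise/vits-ljs")

speech = tts("Open-source models make text to speech easy to experiment with.")
print(speech["sampling_rate"])   # 22050 for this checkpoint
print(speech["audio"].shape)     # NumPy array of waveform amplitudes, shaped (1, num_samples)

# in a notebook, play it back:
# from IPython.display import Audio
# Audio(speech["audio"][0], rate=speech["sampling_rate"])
```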
Object detection is a computer vision task that involves identifying and locating objects within an image or video. The goal is to not only recognize what objects are present but also to precisely locate them by drawing bounding boxes around them.
It's crucial for automating tasks like surveillance, autonomous driving, and quality control, enhancing safety, efficiency, and user experiences across various industries.
### To Find Out the State-of-the-Art Models for Object Detection on Hugging Face
Hugging Face models: https://huggingface.co/models?sort=trending
Hugging Face SoTA models for object detection: https://huggingface.co/models?pipeline_tag=object-detection&sort=trending
Model: https://huggingface.co/facebook/detr-resnet-50
facebook/detr-resnet-50
DETR (End-to-End Object Detection) model with ResNet-50 backbone:
The DETR (DEtection TRansformer) model, trained on the COCO 2017 dataset, is an end-to-end object detection model with a ResNet-50 backbone. Utilizing an encoder-decoder transformer architecture, it employs object queries for detection and a bipartite matching loss for optimization, achieving accurate object localization and classification.
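A minimal object detection sketch with detr-resnet-50 (the image path street.jpg is a placeholder assumption):

```python
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# "street.jpg" is a placeholder path to any local image
detections = detector("street.jpg")
for det in detections:
    print(f"{det['label']}: {det['score']:.2f} at {det['box']}")
```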
COCO Dataset:
The COCO (Common Objects in Context) 2017 dataset is a widely used benchmark dataset for object detection, segmentation, and captioning tasks in computer vision. It consists of a large collection of images with complex scenes containing multiple objects in various contexts. The dataset is annotated with bounding boxes, segmentation masks, and captions for each object instance, providing rich and diverse training data for developing and evaluating object detection algorithms.
Build & Share Delightful Machine Learning Apps for Image Generation
Gradio offers the fastest way to showcase your machine learning model, providing a user-friendly web interface that enables anyone to utilize it from any location!
Gradio website: https://www.gradio.app/
Gradio on Hugging Face: https://huggingface.co/gradio
Gradio GitHub: https://github.com/gradio-app/gradio
- Import summarize_predictions_natural_language to turn the output of the object detection pipeline into natural-language text.
- Use kakao-enterprise/vits-ljs to generate audio from that text.
kakao-enterprise/vits-ljs (VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, described in the text-to-speech section above)
model: https://huggingface.co/kakao-enterprise/vits-ljs
Image segmentation involves dividing an image into multiple segments, each representing a distinct object or region. This process is crucial for various applications, as it simplifies image analysis and enhances understanding.
- Medical Imaging: Used to identify tumors or anomalies in MRI or CT scans, aiding diagnosis and treatment planning.
- Autonomous Vehicles: Enables object detection and obstacle avoidance, crucial for safe navigation on roads.
- Satellite Imagery: Facilitates land cover classification, assisting in urban planning, agriculture, and environmental monitoring.
In each of these examples, image segmentation plays a vital role in extracting meaningful information from complex visual data, contributing to advancements in healthcare, transportation, and environmental science.
!pip install transformers
!pip install gradio
!pip install timm
!pip install torchvision
!pip install torch
- !pip install transformers: Installs the Transformers library, which provides state-of-the-art natural language processing models for various tasks such as text classification, translation, summarization, and question answering.
- !pip install gradio: Installs Gradio, a Python library that simplifies the creation of interactive web-based user interfaces for machine learning models, allowing users to interact with models via a web browser.
- !pip install timm: Installs Timm, a PyTorch library that offers a collection of pre-trained models and a simple interface to use them, primarily focused on computer vision tasks such as image classification and object detection.
- !pip install torchvision: Installs the torchvision library, facilitating computer vision tasks in Python environments.
model: https://huggingface.co/Zigeng/SlimSAM-uniform-77
The Segment Anything Model (SAM) is a versatile deep learning architecture designed for pixel-wise segmentation tasks, capable of accurately delineating objects within images for various applications such as object detection, medical imaging, and autonomous driving.
SlimSAM:
SlimSAM, a novel SAM compression method, efficiently reuses pre-trained SAMs through a unified pruning-distillation framework and employs masking for selective parameter retention. By integrating an innovative alternate slimming strategy and a label-free pruning criterion, SlimSAM reduces parameter counts to 0.9%, MACs to 0.8%, and requires only 0.1% of the training data compared to the original SAM-H. Extensive experiments demonstrate superior performance with over 10 times less training data usage compared to other SAM compression methods.
Masking in SlimSAM selectively retains crucial parameters, enabling efficient compression of pre-trained SAMs without sacrificing performance, by focusing on essential features and discarding redundancies.
Segmentation mask generation involves creating pixel-wise masks that delineate different objects or regions within an image. For example, in a photo of various fruits, segmentation masks would outline each fruit separately, aiding in their identification and analysis.
Key Notes:
- points_per_batch = 32 in image processing denotes the number of pixel points considered in each batch during model training or inference, aiding in efficient computation of gradients and optimization algorithms, thereby enhancing training speed and resource utilization.
- note: a smaller points_per_batch is less computationally expensive but gives lower accuracy (see the sketch below).
- The model variable initializes a SlimSAM model instance loaded from pre-trained weights located at "./models/Zigeng/SlimSAM-uniform-77", enabling tasks like inference or fine-tuning.
- The processor variable initializes a SamProcessor instance loaded with pre-trained settings located at "./models/Zigeng/SlimSAM-uniform-77", facilitating data preprocessing for compatibility with the SlimSAM model during inference or fine-tuning processes.
- Pretrained settings encompass pre-defined configurations or parameters obtained from training a model, facilitating effective performance in related tasks with minimal fine-tuning or adjustment.
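A minimal mask-generation sketch with SlimSAM via the pipeline API; the Hub id "Zigeng/SlimSAM-uniform-77" is used here (the local path "./models/Zigeng/SlimSAM-uniform-77" mentioned above should work the same way), and the image path is a placeholder assumption:

```python
from PIL import Image
from transformers import pipeline

# mask-generation pipeline wrapping the SlimSAM checkpoint
sam_pipe = pipeline("mask-generation", model="Zigeng/SlimSAM-uniform-77")

image = Image.open("example.jpg")              # placeholder image path (assumption)
output = sam_pipe(image, points_per_batch=32)  # smaller batches: cheaper, less accurate
print(len(output["masks"]), "masks generated")
```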
import torch

note: torch.no_grad() runs the model inference without tracking operations for gradient computation, thereby conserving memory resources and speeding up the inference process.

with torch.no_grad():
    outputs = model(**inputs)
Gradient computation refers to calculating the derivatives of a loss function with respect to the model parameters, crucial for updating weights during training. These gradients indicate the direction and magnitude of parameter updates needed to minimize the loss during training through optimization algorithms like gradient descent.
DPT (Dense Prediction Transformer) enhances dense prediction tasks using a Vision Transformer (ViT) as its backbone. It provides finer-grained predictions compared to fully-convolutional networks, yielding substantial improvements in performance, especially with large training data. DPT achieves state-of-the-art results in tasks like monocular depth estimation and semantic segmentation on datasets like ADE20K, NYUv2, KITTI, and Pascal Context.
model: https://huggingface.co/docs/transformers/model_doc/dpt
model on GitHub: https://github.com/isl-org/DPT
research paper of the model: https://arxiv.org/abs/2103.13413
DPT-Hybrid, also known as MiDaS 3.0, is a monocular depth estimation model based on the Dense Prediction Transformer (DPT) architecture, utilizing a Vision Transformer (ViT) backbone with additional components for enhanced performance. Trained on 1.4 million images, it offers accurate depth predictions for various applications such as autonomous navigation, augmented reality, and robotics, providing crucial depth perception for tasks like obstacle avoidance, scene understanding, and 3D reconstruction.
model: https://huggingface.co/Intel/dpt-hybrid-midas
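A minimal monocular depth estimation sketch with dpt-hybrid-midas (the image path room.jpg is a placeholder assumption):

```python
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

image = Image.open("room.jpg")        # placeholder image path (assumption)
result = depth_estimator(image)

# result["depth"] is a PIL image of the predicted depth map;
# result["predicted_depth"] is the raw tensor
result["depth"].save("room_depth.png")
```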
- Multimodal
Multimodal models are machine learning architectures designed to process and integrate information from multiple modalities, such as text, images, audio, and other data types, into a cohesive representation. These models utilize various techniques like fusion mechanisms, attention mechanisms, and cross-modal learning to capture rich interactions between different modalities, enabling them to perform tasks like image captioning, video understanding, and more, by leveraging the complementary information present across different modalities.
- Fusion mechanisms: Techniques to combine information from different modalities, like averaging features from text and images to make a unified representation.
- Attention mechanisms: Mechanisms that focus on relevant parts of each modality's input, like attending to specific words in a sentence and regions in an image.
- Cross-modal learning: Learning strategies where information from one modality helps improve understanding in another, like using audio features to enhance image recognition accuracy.
Application: ChatGPT --> SEE, HEAR AND SPEAK
BLIP Model: Proposed in BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
Tasks: BLIP excels in various multi-modal tasks such as:
- Visual Question Answering
- Image-Text Retrieval (image-text matching)
- Image Captioning
Abstract: BLIP is a versatile VLP framework adept at both understanding and generation tasks. It effectively utilizes noisy web data through bootstrapping captions, resulting in state-of-the-art performance across vision-language tasks.
model: https://huggingface.co/docs/transformers/model_doc/blip
model: https://huggingface.co/Salesforce/blip-itm-base-coco
Salesforce AI Research is dedicated to pioneering AI advancements to revolutionize our company, customers, and global communities. Their innovative products harness AI to enhance customer relationship management, optimize sales processes, and drive business intelligence, empowering organizations to thrive in the digital era.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Model card for BLIP trained on image-text matching
- base architecture (with ViT base backbone) trained on COCO dataset.
!pip install transformers
!pip install torch
- AutoProcessor is a Hugging Face Transformers utility that automatically loads the correct processor (tokenizer, image processor, and/or feature extractor) for a given checkpoint, so the same preprocessing used during training is applied to your text, image, or audio inputs.
inputs = processor(images=raw_image, text=text, return_tensors="pt")
itm_scores = model(**inputs)[0]
- model(**inputs): calls the model with the provided inputs. The **inputs syntax in Python unpacks the dictionary inputs and passes its contents as keyword arguments to the model function.
- [0]: accesses the first element of the output returned by the model. The output is a tuple or list containing various elements, and [0] retrieves the first element.
- itm_scores: the variable to which that first element is assigned; it contains the predicted image-text matching scores.
note: To open a raw image with PIL (here "images.jpg" is the image file name, including its directory if needed):

from PIL import Image

raw_image = Image.open("images.jpg")
raw_image
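Putting the pieces together, a minimal image-text matching sketch with blip-itm-base-coco (the image path and caption are illustrative assumptions):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForImageTextRetrieval

model_id = "Salesforce/blip-itm-base-coco"
model = BlipForImageTextRetrieval.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open("images.jpg")                 # placeholder image (assumption)
text = "a photograph of a dog playing in a park"     # caption to test (assumption)

inputs = processor(images=raw_image, text=text, return_tensors="pt")
itm_scores = model(**inputs)[0]                      # logits: [no match, match]
probs = torch.nn.functional.softmax(itm_scores, dim=1)
print(f"match probability: {probs[0][1].item():.4f}")
```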
Image Captioning: Generating descriptive textual descriptions for images, enhancing accessibility and understanding of visual content.
Real-life use: Image captioning is employed in social media platforms like Instagram to provide accessibility for visually impaired users, in content management systems for organizing and indexing images, and in educational settings for creating inclusive learning materials.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- BLIP utilizes noisy web data by bootstrapping captions, achieving state-of-the-art results in tasks like image-text retrieval and image captioning. It can be used for both conditional and unconditional image captioning.
model: https://huggingface.co/Salesforce/blip-image-captioning-base
- Salesforce AI Research is dedicated to pioneering AI advancements to revolutionize our company, customers, and global communities. Their innovative products harness AI to enhance customer relationship management, optimize sales processes, and drive business intelligence, empowering organizations to thrive in the digital era.
- AutoProcessor:

from transformers import AutoProcessor

Imports the AutoProcessor class from Transformers, allowing automatic loading of the right data processor for a checkpoint. Processors group the processing objects for text, vision, and audio modalities, providing flexibility and ease of use across tasks.
#decoding text
print(processor.decode(outputs[0], skip_special_tokens = True))
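A minimal image captioning sketch with blip-image-captioning-base that ends with the decode call shown above (the image path and prompt prefix are illustrative assumptions):

```python
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
model = BlipForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("images.jpg")                    # placeholder image (assumption)

# conditional captioning: the optional text prefix steers the caption
inputs = processor(images=image, text="a photograph of", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)

# decoding text
print(processor.decode(outputs[0], skip_special_tokens=True))
```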
Visual question answering (VQA) is a task in artificial intelligence where systems are designed to answer questions about images. It involves combining computer vision and natural language processing techniques.
- Use cases:
  - E-commerce: Users can ask about product details from images.
  - Healthcare: Clinicians can inquire about abnormalities in medical images for diagnosis.
- Self-attention: Useful in tasks requiring capturing global dependencies within a sequence, such as sentiment analysis or document classification.
- Cross-attention: Beneficial in tasks where relationships between elements from two different sequences are crucial, like in machine translation or summarization.
- Bi-directional self-attention: Enhances contextual understanding in tasks like language understanding or sequence generation.
- Causal self-attention: Ensures that the model generates outputs sequentially and is beneficial in tasks like autoregressive generation, where the order of elements matters.
- Self-attention attends within a single sequence.
- Cross-attention attends between two different sequences.
- Bi-directional self-attention attends bidirectionally within a single sequence.
- Causal self-attention attends in a unidirectional, causal manner within a single sequence.
Example (self-attention): In a VQA model, each region of an image attends to other regions and words in the question to understand the visual context and answer the question accurately. For instance, if the question is "What color is the car?" and the image contains multiple cars, self-attention helps the model focus on the relevant car region while considering the question.
Example (cross-attention): In VQA, the question embeddings attend to different regions of the image, enabling the model to focus on relevant visual information corresponding to the words in the question. For example, if the question is "How many people are there?" the model's attention mechanism would focus on regions representing people in the image.
Example (bi-directional self-attention): In VQA, each region of an image attends to all other regions as well as words in the question, allowing the model to capture both intra-modal (visual) and inter-modal (visual-textual) relationships effectively. For instance, when asked "What is the person holding?" the model can attend to both the person's hand region and the corresponding words describing the action in the question.
Example (causal self-attention): In VQA tasks with sequential processing, each word in the question attends only to preceding words and relevant regions of the image, ensuring that the model generates answers sequentially while maintaining the context of the question and the image. For instance, if the question is "What is the color of the sky?" the model would attend to preceding words and relevant sky regions to generate the answer progressively.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BLIP (Bootstrapping Language-Image Pre-training) is a novel framework for Vision-Language Pre-training (VLP), enhancing performance across understanding and generation tasks. By intelligently utilizing noisy web data through bootstrapping captions, where a captioner generates synthetic captions that are then filtered for noise, BLIP achieves state-of-the-art results in various tasks like image-text retrieval, image captioning, and VQA. It also exhibits strong generalization when applied to video-language tasks in a zero-shot manner.
model: https://huggingface.co/Salesforce/blip-vqa-base
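A minimal VQA sketch with blip-vqa-base (the image path and question are illustrative assumptions):

```python
from PIL import Image
from transformers import AutoProcessor, BlipForQuestionAnswering

model_id = "Salesforce/blip-vqa-base"
model = BlipForQuestionAnswering.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("images.jpg")                   # placeholder image (assumption)
question = "How many people are in the picture?"   # example question (assumption)

inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(out[0], skip_special_tokens=True))
```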
Zero-Shot Image Classification
In a zero-shot image classification scenario with dogs and cats, imagine training a model on a dataset containing images of various dog breeds and cat breeds. During training, the model learns to associate visual features of these animals with their respective class labels (e.g., "Labrador Retriever," "Siamese Cat").
Now, suppose the model encounters images of new dog and cat breeds during inference that it has never seen before, such as "Pomeranian" or "Maine Coon." In a zero-shot setting, the model can still classify these images accurately without specific training examples of these breeds.
This is achieved by providing the model with additional information about the classes, such as textual descriptions or class attributes. For example, the model might learn that Pomeranians are small-sized dogs with fluffy coats, while Maine Coon cats are large-sized cats with tufted ears and bushy tails.
note: Zero-shot learning relies on the model's ability to generalize from known classes to unseen classes based on semantic embeddings or attributes associated with those classes.
CLIP, developed by OpenAI, is a model that learns from both images and text to understand the world. It's trained to recognize patterns in pictures and words together, helping it to classify images even when it hasn't seen them before. However, it's not meant for all kinds of tasks and needs to be carefully studied before using it in different situations. It can be helpful in tasks like image search, where you describe what you're looking for in words, and the model finds matching images.
model: https://huggingface.co/openai/clip-vit-large-patch14
labels = ["A Photo of Cat", "A Photo of Dog"]
outputs.logits_per_image
probs = outputs.logits_per_image.softmax(dim=1)[0]
print(f"probs : {probs}")
probs = list(probs)
for i in range(len(labels)):
    print(f"label: {labels[i]} - probability of {probs[i].item():.4f}")
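The fragment above assumes outputs from a CLIP forward pass. A complete, minimal zero-shot image classification sketch with clip-vit-large-patch14 (the image path is a placeholder assumption):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("images.jpg")          # placeholder image (assumption)
labels = ["A Photo of Cat", "A Photo of Dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)[0]
for label, prob in zip(labels, probs):
    print(f"label: {label} - probability of {prob.item():.4f}")
```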