
Introducing GLIGEN, the innovative text-to-image generation model equipped with bounding box capabilities.

In Brief

GLIGEN, short for Grounded-Language-to-Image Generation, is an exciting new method that expands upon and enhances the capabilities of existing pre-trained diffusion models.

By incorporating both caption and bounding box input conditions, the GLIGEN model is capable of generating text-to-image outputs in a rich open-world context.

Drawing on the knowledge of a pre-trained text-to-image model, GLIGEN can generate a wide variety of objects in specified locations and styles.

Another unique feature of GLIGEN is its ability to incorporate human keypoints during the text-to-image generation process.

Though large-scale text-to-image diffusion technology has advanced significantly, the predominant reliance on text inputs alone often restricts the level of control available. GLIGEN, the Grounded-Language-to-Image Generation technique, enhances established pre-trained models by enabling them to respond to grounding inputs.

To ensure the pre-trained model retains its extensive concept knowledge, all original weights are preserved, while new, trainable layers incorporate the grounding data through a controlled procedure. With caption and bounding box as inputs, GLIGEN creates grounded text-to-image outputs that can generalize effectively to new spatial configurations and ideas.
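To make the input format concrete, here is a minimal sketch of grounded generation using the community GLIGEN integration in Hugging Face diffusers. The checkpoint name, prompt, phrases, and box coordinates are illustrative assumptions, not values taken from this article:

    # Minimal sketch: grounded text-to-image generation with the GLIGEN
    # pipeline in diffusers. Checkpoint and inputs are illustrative.
    import torch
    from diffusers import StableDiffusionGLIGENPipeline

    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-generation-text-box",  # assumed community checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        prompt="a birthday cake on a table next to a vase of flowers",
        gligen_phrases=["a birthday cake", "a vase of flowers"],
        # One [xmin, ymin, xmax, ymax] box per phrase, normalized to [0, 1].
        gligen_boxes=[[0.1, 0.4, 0.5, 0.9], [0.6, 0.3, 0.95, 0.9]],
        gligen_scheduled_sampling_beta=0.3,  # fraction of steps with grounding on
        num_inference_steps=50,
    ).images[0]
    image.save("gligen_grounded.png")

Each phrase is paired with its own box, so moving a box relocates the corresponding object without rewriting the caption.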

Check out the demo here.

The foundation of GLIGEN rests on existing pre-trained diffusion models, whose original weights are frozen to retain their vast pre-existing knowledge.
  • Each transformer block gains a new trainable Gated Self-Attention layer designed to take in the additional grounding information. Grounding tokens carry two types of information: semantic details about the entity being grounded (drawn from text or images) and spatial information about its position (supplied as bounding boxes or keypoints); a sketch of the gating mechanism follows this list.
  • The newly integrated gated layers are trained on extensive grounding data (image-text-box pairs). This approach proves far more cost-efficient than alternatives for adapting a pre-trained diffusion model, such as full-model finetuning. Much like Lego bricks, these trained layers can be easily swapped in and out, introducing various new functionalities.
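The gating mechanism in the new layers can be sketched in a few lines of PyTorch. This is an illustrative reconstruction from the description above rather than the authors' code; the class name, dimensions, and use of nn.MultiheadAttention are assumptions, while the zero-initialized tanh gate and the residual blend over visual tokens follow the published design:

    # Illustrative sketch of a gated self-attention layer: visual tokens
    # attend jointly to grounding tokens, and the result is blended in
    # through a learnable gate initialized to zero.
    import torch
    import torch.nn as nn

    class GatedSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            # Zero-initialized gate: the new layer starts as an identity map.
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
            # visual:    (batch, n_visual, dim) tokens from the frozen block
            # grounding: (batch, n_ground, dim) tokens from phrases + boxes
            x = self.norm(torch.cat([visual, grounding], dim=1))
            out, _ = self.attn(x, x, x)
            # Keep only the visual positions, gate by tanh(gamma), add residually.
            return visual + torch.tanh(self.gamma) * out[:, : visual.size(1)]

Because the gate starts at zero, the augmented model initially behaves exactly like the frozen pre-trained model, and the grounding influence grows only as the new layers are trained.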
GLIGEN allows for scheduled sampling within the diffusion process during inference: at each step, the model can dynamically choose to use the grounding tokens (by activating the new layers) or fall back to the original diffusion weights of proven quality (by bypassing them). This balance optimizes both generation quality and grounding accuracy.
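Schematically, scheduled sampling is just a switch inside the denoising loop. In the sketch below, denoise_step is a hypothetical stand-in for one U-Net step and the tensor shapes are made up; only the control flow is the point:

    import torch

    def denoise_step(latents, step, grounding_tokens):
        # Hypothetical stand-in: a real step would run the U-Net, with the
        # gated layers active only when grounding_tokens is not None.
        return latents

    latents = torch.randn(1, 4, 64, 64)        # assumed latent shape
    grounding_tokens = torch.randn(1, 2, 768)  # assumed token shape

    num_steps = 50
    beta = 0.3  # fraction of steps that receive grounding tokens
    for step in range(num_steps):
        use_grounding = step < int(beta * num_steps)
        latents = denoise_step(latents, step, grounding_tokens if use_grounding else None)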
GLIGEN is versatile and can also be trained with reference photographs.
Training GLIGEN with reference images can enhance detail. The first row shows that reference photos, combined with descriptive text, yield finer attributes such as the style and shape of a vehicle. The second row shows that a reference photo can also serve as a style guide; grounding it to a corner or edge of the image is sufficient.
Like other diffusion models, GLIGEN excels at grounded image inpainting, generating objects that closely fit the provided bounding boxes.
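In the diffusers integration, grounded inpainting uses the same pipeline with an extra source image; the inpainting checkpoint name and the gligen_inpaint_image argument below reflect that community integration and should be treated as assumptions, and the image URL is a placeholder:

    import torch
    from diffusers import StableDiffusionGLIGENPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-inpainting-text-box",  # assumed inpainting checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")

    source = load_image("https://example.com/living_room.png")  # placeholder URL
    result = pipe(
        prompt="a living room with a flat-screen TV on the wall",
        gligen_phrases=["a flat-screen TV"],
        gligen_boxes=[[0.25, 0.15, 0.75, 0.5]],  # where the new object goes
        gligen_inpaint_image=source,             # pixels outside the box are kept
        num_inference_steps=50,
    ).images[0]
    result.save("gligen_inpaint.png")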
Additionally, GLIGEN can ground human keypoints during the text-to-image generation process.
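Keypoints can be encoded the same way boxes are: a Fourier embedding of each point's coordinates is paired with a learned per-keypoint embedding, and a small MLP fuses the two into grounding tokens. The sketch below is an illustrative reconstruction of that recipe; all sizes and names are assumptions:

    import math
    import torch
    import torch.nn as nn

    class KeypointGroundingTokens(nn.Module):
        def __init__(self, num_keypoints: int = 17, fourier_freqs: int = 8, dim: int = 768):
            super().__init__()
            # Fixed Fourier frequencies for encoding (x, y) positions.
            self.register_buffer("freqs", (2.0 ** torch.arange(fourier_freqs)) * math.pi)
            # A learned "semantic" embedding per keypoint type (nose, wrist, ...).
            self.kp_embed = nn.Embedding(num_keypoints, dim)
            fourier_dim = 2 * 2 * fourier_freqs  # (x, y) * (sin, cos) * freqs
            self.mlp = nn.Sequential(
                nn.Linear(fourier_dim + dim, dim), nn.SiLU(), nn.Linear(dim, dim)
            )

        def forward(self, xy: torch.Tensor) -> torch.Tensor:
            # xy: (batch, num_keypoints, 2) coordinates normalized to [0, 1]
            angles = xy.unsqueeze(-1) * self.freqs                             # (B, K, 2, F)
            fourier = torch.cat([angles.sin(), angles.cos()], -1).flatten(2)   # (B, K, 4F)
            ids = torch.arange(xy.size(1), device=xy.device)
            sem = self.kp_embed(ids).expand(xy.size(0), -1, -1)                # (B, K, dim)
            return self.mlp(torch.cat([fourier, sem], dim=-1))                 # (B, K, dim)

The resulting (batch, num_keypoints, dim) tokens could then be fed to the gated self-attention layers sketched earlier, exactly as box tokens are.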