Skywork UniPic 2.0:
Building Kontext Model with Online RL for Unified Multimodal Model

Skywork Multimodality Team

Technical Report GitHub 🤗HuggingFace

Abstract

Recent multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, several prominent open-source models pursue better generation and editing results primarily by scaling model parameters rather than by optimizing training strategies. In this technical report, we build upon SD3.5-Medium to explore how a 2B-parameter DiT model can achieve strong image generation and editing capabilities, and how to extend it into a unified multimodal model that delivers leading performance in the open-source community. We first modify the architecture of SD3.5-Medium and pre-train it on large-scale, high-quality image generation and editing data so that it jointly supports text-to-image generation and image editing. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy, which strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. After pre-training and reinforcement, we obtain the UniPic2-SD3.5M-Kontext model, which surpasses BAGEL (with 7B generation parameters) and FLUX.1-Kontext (with 12B generation parameters) in image generation and editing performance. Furthermore, following MetaQueries, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to obtain a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. These results validate the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.

Model Overview

Skywork UniPic 2.0 is a unified multimodal model that integrates understanding, generation, and editing. It is trained on a large-scale, high-quality image generation and editing dataset, and uses a progressive dual-task reinforcement strategy to enhance instruction following and editing consistency.
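
The reinforcement procedure is not detailed in this overview, but the Flow-GRPO framework acknowledged at the end of this page builds on GRPO-style updates, in which several samples are drawn per prompt and each sample's reward is normalized against its own group. The following is a minimal Python sketch of that normalization applied over two stages. The stage order, the reward names, the reward values, and the use of a population standard deviation are illustrative assumptions, not details taken from the report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical staged schedule: one task is reinforced per stage.
# Stage order and reward names are assumptions for illustration only.
STAGES = [
    ("text_to_image", "instruction_following_reward"),
    ("image_editing", "editing_consistency_reward"),
]


def run_schedule(rewards_by_task):
    """Return per-stage advantages for pre-computed (made-up) group rewards."""
    results = {}
    for task, reward_name in STAGES:
        results[task] = {
            "reward": reward_name,
            "advantages": group_relative_advantages(rewards_by_task[task]),
        }
    return results


if __name__ == "__main__":
    demo_rewards = {
        "text_to_image": [0.2, 0.5, 0.8, 0.5],
        "image_editing": [0.9, 0.7, 0.7, 0.9],
    }
    for task, info in run_schedule(demo_rewards).items():
        print(task, info["reward"], [round(a, 3) for a in info["advantages"]])
```

Samples that score above their group mean receive a positive advantage and those below receive a negative one, so the update depends only on relative quality within a group rather than on the absolute reward scale.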

unipic

Performance Comparison

Benchmark Comparison

We compare UniPic2-SD3.5M-Kontext and UniPic2-Metaquery with state-of-the-art models on image generation, image editing, and multimodal understanding benchmarks.

Type          Model                   GenEval ↑  DPG ↑  GEdit-En ↑  ImgEdit ↑  MMBench ↑  MMMU ↑  MM-Vet ↑
Generation    SD3.5-Medium            0.65       83.86  x           x          x          x       x
Gen. & Edit.  FLUX.1-Kontext          -          -      6.26        3.52       x          x       x
Unified       BAGEL                   0.88       85.07  6.52        3.20       85.0       55.3    67.2
Gen. & Edit.  UniPic2-SD3.5M-Kontext  0.89       84.23  6.59        4.00       x          x       x
Unified       UniPic2-Metaquery       0.90       83.79  6.87        4.03       83.5       58.6    67.1
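
To make the comparison easier to read, the scores above can be loaded into a small script that reports the top model per benchmark. This is only a convenience sketch: the `SCORES` dictionary and the `leaders` helper are hypothetical names introduced here, and cells marked `x` or `-` in the table are simply omitted.

```python
# Scores transcribed from the benchmark table; cells marked "x" or "-" are left out.
SCORES = {
    "SD3.5-Medium": {"GenEval": 0.65, "DPG": 83.86},
    "FLUX.1-Kontext": {"GEdit-En": 6.26, "ImgEdit": 3.52},
    "BAGEL": {
        "GenEval": 0.88, "DPG": 85.07, "GEdit-En": 6.52, "ImgEdit": 3.20,
        "MMBench": 85.0, "MMMU": 55.3, "MM-Vet": 67.2,
    },
    "UniPic2-SD3.5M-Kontext": {
        "GenEval": 0.89, "DPG": 84.23, "GEdit-En": 6.59, "ImgEdit": 4.00,
    },
    "UniPic2-Metaquery": {
        "GenEval": 0.90, "DPG": 83.79, "GEdit-En": 6.87, "ImgEdit": 4.03,
        "MMBench": 83.5, "MMMU": 58.6, "MM-Vet": 67.1,
    },
}


def leaders(scores):
    """Return {benchmark: (model, score)} for the highest score on each benchmark."""
    best = {}
    for model, row in scores.items():
        for bench, value in row.items():
            # All benchmarks in the table are higher-is-better.
            if bench not in best or value > best[bench][1]:
                best[bench] = (model, value)
    return best


if __name__ == "__main__":
    for bench, (model, value) in leaders(SCORES).items():
        print(f"{bench:10s} {model:24s} {value}")
```

On these numbers, UniPic2-Metaquery leads GenEval, GEdit-En, ImgEdit, and MMMU, while BAGEL leads DPG, MMBench, and MM-Vet.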

Generation and Editing Across Diverse Tasks

Text-to-Image Generation

Case 1: A tranquil pond with vibrant clear blue water, where three pristine white waterlilies float gracefully on the surface. Each waterlily has a perfectly formed shape with delicate petals radiating outwards from a yellow center. The smooth surface of the water reflects the sky above, enhancing the serenity of the scene, as gentle ripples emanate outward from the blossoms.

Text-to-Image Generation Case 1

Case 2: A visually striking isometric design spells out the word 'skywork' and the second row spells 'unipic' using an array of artist pencils with softly rounded edges, demonstrating the principles of modular constructivism. Each pencil features a pastel color palette, blending harmoniously against a serene blue background. The entire composition benefits from soft, smooth lighting that accentuates the textures and forms, created with a physically based rendering technique that provides a realistic appearance, with the entire artwork centrally positioned within the frame, creating a trendy and aesthetically pleasing image.

Text-to-Image Generation Case 2

Case 3: A vintage illustration of a retro computer where 'skywork' is in the screen, vaporwave aesthetic, in the style of Jim Steranko, light pink and light blue, Behance, fancy background.

Text-to-Image Generation Case 3

Case 4: a cute cat dressed as a Victorian gentleman, standing in the foggy streets of Victorian London, fog, refractions, cinematic still, dynamic lighting, ultra detailed, intricate, realistic fur, oil painting, fine strokes, paint texture, textured canvas

Text-to-Image Generation Case 4

Case 5: Create a humorous and vividly colored illustration of a cute mad scientist in his laboratory. The scientist, with an exaggerated comical expression, is preparing explosive solutions that emit lots of smoke. He is covering his ears and squeezing his eyes shut, no open eyes, anticipating a big explosion. The laboratory is filled with quirky scientific equipment and bubbling potions, enhancing the whimsical atmosphere. The overall mood should be light-hearted and funny, capturing the essence of a playful, cartoonish experiment gone awry. Use bright, lively colors to emphasize the humor and energy of the scene

Text-to-Image Generation Case 5

Case 6: A photorealistic scene of cute miniature chefs in traditional white uniforms and hats, eagerly pouring melted chocolate generously onto one giant croissant. One chef stands on a ladder, holding a giant chocolate pot, carefully pours the chocolate onto the croissant, while the other chef spreads it evenly over the surface. They finish off with a sprinkling of beautiful red berries, and one chef finishes off with a sprinkling of powdered sugar. The background is a simple, monochromatic gray wallpaper.

Text-to-Image Generation Case 6

Image Editing

Case 1: put the word 'skywork' on the metal kitten

Image Editing Case 1 - Before
Image Editing Case 1 - After

Case 2: Make the person lower his right arm.

Image Editing Case 2 - Before
Image Editing Case 2 - After

Case 3: change the word "dreamshot" to "skywork"

Image Editing Case 3 - Before
Image Editing Case 3 - After

Case 4: Change the style of this image to Ice Age.

Image Editing Case 4 - Before
Image Editing Case 4 - After

Case 5: Add a painting to the easel.

Image Editing Case 5 - Before
Image Editing Case 5 - After

Multimodal Understanding

Multimodal Understanding

Acknowledgement

We would like to express our gratitude to:

  • SD3.5-Medium for their strong base model
  • Qwen2.5-VL for their strong vision-language model
  • OpenUni for their simple unified multimodal framework
  • MetaQueries for their excellent unified multimodal model
  • Flow-GRPO for their RL training framework

We are grateful to the broader research community for their open exploration and contributions to the field of unified multimodal models.

Citation

[TBD]