aimode.news
Published on

Qwen 3.7-Plus: Think deep, understand, do

Authors

Today, we officially launched Qwen 3.7-Plus, a multimodular model that unites vision and language into an integrated intelligent base. Based on Qwen3.7 strong text capability, Qwen3.7-Plus has fully upgraded visual-linguistic skills while maintaining full intelligence capabilities in coding, tool use and productivity workflows.

The core feature of Qwen3.7-Plus lies in its ability to act as a multimodular interactive mix of intelligence. It can sense the real world scene, read the screen and operate the GUI, based on visual reference generation codes, end-to-end navigational mobile applications, and web-based answers to visual questions - seamless integration of GUI and CLI in a single intelligence cycle. As an all-power-coded smart body and productivity assistant, it processes all-dimensional tasks from front-end prototypes to complex software engineering and to multi-step workflow automation in a full-modular input. It has a trans-framework capability, which can be deployed through Claude Code, OpenClaw, Qwen Code or other frameworks, to maintain stable performance.

Qwen3.7-Plus — services are now provided through Ali Yunpun:

MultimediaAgent: Harmonized processing of images, videos, screens, web pages and text input and perform tasks in the GUI/ CLI/ Tool Environment

VISUAL Agent: solving visual puzzles, real-world questions and complex reasoning tasks in combination with visual understanding, code interpretor and search enhancement

Visual Coding: Generation of SVG, web page and interactive front-end from image or video generation to end-to-end transformation of visual reference to code

GUI Agent: Understand mobile and desktop-end interfaces for control positioning, task planning and multistep operations

Real-world Protection & Reasoning: Cover real scenes, document charts, OCR, video and driving scene understanding

Blog:

https://qwen.ai/blog?id=qwen3.7-plus

Ali Yunpun:

https://bailian.console.aliyun.com/cn-beijing/?tab=model#/model-market/detail/qwen3.7-plus?serviceSite=asia-pacific-china

https://chat.qwen.ai/?models=qwen3.7-plus

In the list of authoritative visual models for the world, Ali is among the top five and first in China with Qwen 3.7-Plus.

Qwen3.7-Plus has done well in pure text capability and is close to the Max level model. In terms of coding Agent, it is strong on Terminal Bench 2.0, SWE-bench series and SciCode and can effectively handle real software engineering and scientific programming tasks. In terms of universal Agent, it has demonstrated robust tool use and planning on MCP-Mark, Deep-Planning and Kernel Bench L3, particularly in terms of complex multi-step planning and GPU Kernel optimization. Its reasoning capabilities are excellent on GPQA Diamond, HMMT and IMOANswerBench and are at the top of the Plus level model in the high-probability STEM benchmark test. In terms of command compliance and multilingual missions, it maintains stable high-quality performance on IFBench, WMT24++ and PolyMath, covering a wide range of languages and fields.

Qwen3.7-Plus ' multi-modular capability enhancement is not only the optimization of single-point visual understanding, but rather the systemic enhancement of critical capabilities that is required around multi-modular intelligence: understanding complex visual input, visual-based reasoning, calling on tools to solve problems, and ultimately performing tasks in a code or GUI environment.

In Multimedia Reassoning, Qwen 3.7-Plus achieved strong performance on difficult visual reasoning benchmarks such as BabyVision, MathVision, HiPho, ERQA and VisFactor, reflecting a combined understanding of image details, spatial relations, physical commons and multistep logic. In particular, on BabyVision, Qwen 3.6-Plus has been significantly elevated to show that models have a more general capability for tasks closer to early human visualization and spatial reasoning.

Qwen3.7-Plus has been significantly upgraded in the direction of Visual Agent & Coding on ScreenSpot Pro, OSWorld-Verified and AndroidWorld to show that models not only recognize screen content, but also locate key UI elements, understand mission intent and perform multistep interactive operations. On QwenVision2Code, the model also demonstrates a strong visual-to-code generation capability that translates images, videos and design references into enforceable codes. Such capabilities are the basis for the move of multi-modular intelligence from “understand interface” to “operational interface” and “build interface”.

Qwen 3.7-Plus has been significantly enhanced in Multimedia Seaarch & Knowledge QA on SimpleVQA, WorldVQA, MMSearchPlus, BC-VL and MMBC. Models can combine visual input with external knowledge retrieval and answer questions that cannot be accomplished by relying solely on image content. This makes it more appropriate for a real-world mission: users ask more than what is in the picture, but rather want models to provide reliable answers in combination with images, common sense and up-to-date knowledge.

In the context of General Vision Uniting, Qwen3.7-Plus covers basic capabilities such as real world scenes, document resolution, chart reading, OCR, count and spatial positioning, and maintains strong performance in the RealWorldQA, CountQA, OmniDocBench, CharXiv, OCR-Bench-V2 missions. These capabilities determine whether the model can stabilize the processing of real business inputs, including intercepts, papers, tables, reports, posters, commodity maps and complex UI pages.

In addition, Qwen 3.7-Plus further enhanced video understanding and driving scene understanding. In video missions such as VideoMMMU, MLVU, TVBench, LVBench, it is able to handle events, movements, time sequences and semantic relationships in short and long videos; in driving-related assessments such as LingoQA, Ego3D-Bench, SURDS and VLADBench, it also shows a strong understanding of dynamic scenes, traffic participants and space relations. This provides the basis for the real world's multimodular intelligence, autopilot understanding and embodied scenes.

Qwen3.7-Plus has the capability to perform multimodular hybrid intelligence for real missions closed. It not only understands visual interfaces, sense screen content, executes GUI operations and CLI calls, but also allows code generation, application, test validation and iterative optimization in combination with environmental feedback, integrating " look, think, write, do, do" into a single smart workflow, supporting complex software tasks from the end to the end of understanding to delivery.

The Hybrid-Agent smart body system based on Qwen3.7-Plus, which integrates the coding capacity of the large model with the GUI automated execution depth, has resulted in the development of the entire APP chain from demand analysis to version. Agent has been running steady for 11+ hours, and has automatically completed a complete research and development loop for the English word learning app. Cumulative generation codes exceed 10,000+ rows, triggering Agent calls more than 1,000+ times, covering the core of the software development life cycle: demand document generation, automatic code writing, automated installation deployment, test case creation, GUI automated testing, multi-scenes parallel testing, automatic update of product description, automatic version iterative evolution.

For the professional desktop application scene, the GUI sensory and code generation capability of the big model of deep integration of the Hybrid-Agent smart body system achieves an autonomous one-clicking of the desktop-end professional application. Agent has been able to complete the full macOS native Stocks application of the high-level security engraving, covering the complete closed loop from demand understanding to delivery validation: Auto-interactive original application and understanding of UI layouts and functional details, automatic generation of SwiftUI source code based on interactive records, access to LongBridge Real-time API access to real-time market data, auto-compile construction and start-up of a double-off application, and final implementation of 10 functional validation tests - including real-time line loading, stock selection and switching, multi-cycle view switching, search filtering, detailed data panel display, etc. - all passed. The application of the final delivery is a complete replica of the original Stocks dark theme, column layout, real-time situational data and full interactive experience.

Qwen3.7-Plus can act as a powerful visual agent, combining visual understanding with the use of tools to address complex visual tasks. Through code interpreter integration, it can analyse images to find different, patches, emancipation lanes, mazes, puzzles — the whole process is done by autonomous generation and implementation code. Combined with enhanced search, it is able to support monographs, multigraphs and video input based on web-based knowledge for multi-modular reasoning and answers to visual problems in the real world. And here's a few examples of Qwen 3.7-Plus' multimodular intelligence.

In multi-modular reasoning, we have introduced code enforcement to further enhance model reasoning, understanding the structure and constraints in the image, transforming visual problems into calculable questions, and then preparing and implementing code solvers, searches or validations on their own. For example, in the search for different tasks such as different, patches, Quakers, mazes and puzzles, models need not only to identify the image content, but also to perform space modelling, path search, status extrapolation and validation of results. These types of missions reflect the ability of Qwen 3.7-Plus to move from visual perception to procedural solvency.

Demonstration 1: Find the references

Presentation 2: Jigsaw

In the search for enhanced visual questions and answers to real world knowledge questions, Qwen 3.7-Plus can combine image, video or multi-chart input with web search. The model extracts key entities, scenes, text and context leads from visual input, then captures external knowledge through search and synthesizes visual evidence and retrieval results to give answers. This allows the model to deal with a large number of open-world issues, such as location identification, understanding of event background, analysis of commodity or object information, and answering visual questions that depend on the latest knowledge.

Presentation: Realworld VQA

Qwen 3.7-Plus shows a strong visual to code generation. It transforms images, videos, UI screenshots and design references into implementable codes, covering multiple scenarios from the SVG to the full web.

In the image/video transfer SVG task, the model needs to understand geometry, colour, layout, hierarchical relationships and dynamic changes in visual content and to express these visual elements accurately in code form. This requires not only that the model “understands the image”, but also that the model have a structured expression and code generation capability. For scenarios such as icons, illustrations, animation, graphic design and visualization of information, such capabilities can significantly reduce the cost of assets from visual reference to edible code.

Prompt:

Please general svg coding to the image.

Qwen 3.7:

In the visual-driven web design, Qwen 3.7-Plus can generate full interactive web pages based on visual reference, video material or design intent, while models can produce materials for the web design using the generation tool. The model not only reproduces page style, but also organizes layouts, prepares front-end codes, processes interactive logic and integrates multimodular material into the final page. This demonstrates the potential of Qwen 3.7-Plus as a visual programming assistant: from "Give a reference map" to "Generate a functional web prototype."

Presentation: Web Design with Video-Generation

We built the browser smart assistant based on Qwen3.7-Plus and completed the demo and recording by Qwen for Chrome browser plugin. Qwen for Chrome is a smart assistant embedded in the Chrome browser. Users can talk directly to Qwen on the sidebar of the browser and switch to Agent mode after authorization. In this mode, Qwen is able to sense the current web content, understand user tasks, plan operational steps and perform clicks, input, jump, configuration and validation in the real browser environment in the form of Browner Agent.

On this basis, Qwen3.7 Browser Agent integrates the large model 's page understanding, mission planning and GUI automated implementation capabilities. In the face of the need for non-scientists to “purchase one of the cheapest ECS servers”, Agent is able to enter the cloud console directly, complete the full operation of the case specification, low-cost selection, mirror and storage configuration, security group setting, order confirmation, etc., and reflect and adjust strategies proactively when price changes, stock limitations or procurement are blocked. Subsequently, Agent continued to take over the case expansion and transport-wide upgrade tasks, automatically completing the shutdown, configuration adjustment, disk expansion, service restoration and results validation, covering the true usage of the cloud server from procurement to upgrading. The process that would have required the user to understand the logic of the complex console, to change pages and to screen questions manually can now be converted by Agent into a continuous, efficient and deliverable browser automated task.

Qwen3.7-Plus has done well in real world perception and multitemporal reasoning. Real scenes tend to be more complex than standard image questions and answers: they may have a mask, a messy background, small targets, multi-object relationships, cross-graph comparisons and implicit physical commons. Models need to identify visual details in a stable manner before providing reliable answers that combine space relations, common sense and logic.

Qwen3.7-Plus is the strongest model of our current multimodular intelligence, which unites visual understanding and language reasoning into an integrated intelligent base. It operates as a multimodular interactive mix of intelligence bodies - sense real world scenes, operate a graphical interface, write code based on visual reference and perform tasks from end to end in the GUI and CLI environments. It handles all-dimensional tasks from front-end prototypes to complex software engineering, from document formatting to multi-step workflow automation, as an all-power-coded intelligence and productivity assistant. It has a trans-framework capability, which can be deployed through Claude Code, OpenClaw, Qwen Code or other frameworks, to maintain stable performance. We look forward to community feedback and to seeing applications based on Qwen 3.7-Plus。

![Qwen 3.7-Plus: Think deep, understand, do](https://jjhwftqjccwqwubkfvke.supabase.co/storage/v1/object/public/images/articles/qwen-37-plus-think-deep-understand-do.jpg)