Introducing 1-bit and Ternary Bonsai Image 4B: Image Generation for Local Devices

Today we are releasing Bonsai Image 4B, a family of compact image generation models designed to run high-quality diffusion inference on local hardware, from laptops to phones. Bonsai Image 4B is available in two variants:

- 1-bit Bonsai Image 4B uses binary {−1, +1} transformer weights with a group-wise FP16 scaling factor, giving 1.125 effective bits per weight. It targets maximum compression and is the right choice when memory pressure, bandwidth, and deployment requirements are the primary constraints.
- Ternary Bonsai Image 4B uses {−1, 0, +1} transformer weights with a group-wise FP16 scaling factor, giving 1.71 effective bits per weight. The additional zero state gives the model more representational flexibility, improving visual quality and fidelity while remaining extremely compact.

The result is a new deployment regime for image generation: strong outputs, open weights, and practical local inference on devices that were previously out of reach for this model class. To our knowledge, Bonsai Image 4B is the first image model in its parameter class to run directly on an iPhone.

Built for local generation

Local image generation begins with a hard constraint: the model must fit within the device's memory budget. In a 4B-class image model, the diffusion transformer is the largest component and the one that runs repeatedly during generation. Every denoising step invokes the transformer, so its size directly affects memory pressure, bandwidth requirements, and local inference speed.

Bonsai Image 4B is built from FLUX.2 [klein] 4B. It preserves the architecture but changes how the transformer weights are represented. By converting these weights to binary and ternary form, Bonsai shrinks the part of the image pipeline that matters most for local deployment.

Table I: Diffusion transformer footprint across models.

The binary layers offer an approximately 14-fold reduction compared with full-precision transformer weights. A small set of precision-sensitive supporting tensors (~5%), the projection layers, remains in FP16, so the final 1-bit Bonsai Image 4B transformer is 0.93 GB: an 8.3-fold reduction compared with the 7.75 GB full-precision FLUX.2 [klein] 4B transformer.

The ternary variant follows the same structure. Its ternary layers provide an approximately 10-fold reduction, and the final Ternary Bonsai Image 4B transformer is 1.21 GB, a 6.4-fold reduction compared with the full-precision transformer. It is slightly larger than the 1-bit model, but the additional zero state improves visual quality and fidelity.

Including the compressed text encoder and the FP16 VAE, the Apple Silicon deployment payload is 3.42 GB for 1-bit Bonsai Image 4B and 3.88 GB for Ternary Bonsai Image 4B. For comparison, full-precision FLUX.2 [klein] 4B requires a 15.97 GB payload. Because the text encoder is unloaded at runtime after prompt encoding, mean memory usage is smaller than the total payload.
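The bit widths and transformer sizes quoted above can be roughly reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not an official sizing tool. It assumes a group size of 128 weights per FP16 scale, an ideal ternary code length of log2(3) bits, and an FP16-resident fraction of 5.3% (one value consistent with the "~5%" figure). None of these three values is stated in this post, and the function names are hypothetical.

```python
import math

FP16_BITS = 16


def effective_bits(levels: int, group_size: int, scale_bits: int = FP16_BITS) -> float:
    """Bits per weight: ideal code length for `levels` states plus one
    scale of `scale_bits` bits amortized over each group of weights."""
    return math.log2(levels) + scale_bits / group_size


def transformer_size_gb(full_size_gb: float, fp16_fraction: float, low_bits: float) -> float:
    """Size when `fp16_fraction` of the weights stay in FP16 and the
    remainder is stored at `low_bits` bits per weight."""
    low_fraction = 1.0 - fp16_fraction
    return full_size_gb * (fp16_fraction + low_fraction * low_bits / FP16_BITS)


GROUP_SIZE = 128       # assumption: group size is not stated in the post
FP16_FRACTION = 0.053  # assumption: one value consistent with "~5%"
FULL_GB = 7.75         # full-precision transformer size from the post

binary_bits = effective_bits(2, GROUP_SIZE)   # 1 + 16/128
ternary_bits = effective_bits(3, GROUP_SIZE)  # log2(3) + 16/128

for name, bits in [("1-bit", binary_bits), ("ternary", ternary_bits)]:
    size = transformer_size_gb(FULL_GB, FP16_FRACTION, bits)
    print(
        f"{name}: {bits:.3f} bits/weight, "
        f"low-bit layers {FP16_BITS / bits:.1f}x smaller, "
        f"transformer ~{size:.2f} GB ({FULL_GB / size:.1f}x smaller)"
    )
```

Under these assumptions the estimates land within a few hundredths of a gigabyte of the reported 0.93 GB and 1.21 GB. This also shows why the end-to-end reduction (8.3x, 6.4x) is smaller than the per-layer reduction: the FP16-resident tensors make up a large share of the compressed total.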
When generating a 512x512 image, mean active memory is 1.5 GB for the binary model and 1.96 GB for the ternary model, compared with 11.74 GB for the original FLUX.2 [klein] 4B (reductions of 7.8x and 6.0x, respectively). For a 1024x1024 image, mean active memory is 1.95 GB and 2.38 GB, respectively, compared with 14.39 GB for the original (reductions of 7.4x and 6.0x).

This reduction in memory footprint changes where the model can run. Our deployment stack supports Apple Silicon iPhones, iPads, and Macs as well as CUDA GPUs, using an MLX low-bit path on Apple hardware and GemLite low-bit GEMM kernels on CUDA. On the iPhone 17 Pro Max, the full-precision FLUX.2 [klein] 4B pipeline does not fit within the device's memory budget, while both Bonsai Image variants run on-device.

Video I: Image generation with Bonsai Studio
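To make the low-bit weight format concrete, here is a minimal pure-Python sketch of group-wise binary and ternary quantization with one scale per group. It is a simple round-to-nearest illustration, not the procedure used to produce Bonsai Image 4B, which this post does not describe. The function names, the toy group of four weights, and the 0.7 x mean-absolute-value ternary threshold (a common heuristic) are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], ternary: bool) -> Tuple[List[int], float]:
    """Map one group of weights to integer codes plus a single scale.

    Binary mode uses codes in {-1, +1}; ternary mode uses {-1, 0, +1}.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    if not ternary:
        # Sign of each weight, scaled by the group's mean magnitude.
        codes = [1 if w >= 0 else -1 for w in weights]
        return codes, mean_abs
    # Assumed heuristic: weights below 0.7 * mean |w| are set to zero.
    threshold = 0.7 * mean_abs
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from codes and the group scale."""
    return [c * scale for c in codes]


def squared_error(original: List[float], approx: List[float]) -> float:
    """Sum of squared differences between original and reconstructed weights."""
    return sum((a - b) ** 2 for a, b in zip(original, approx))


group = [0.9, -1.1, 0.05, -0.02]  # toy group; real groups are much larger

binary_codes, binary_scale = quantize_group(group, ternary=False)
ternary_codes, ternary_scale = quantize_group(group, ternary=True)

binary_error = squared_error(group, dequantize_group(binary_codes, binary_scale))
ternary_error = squared_error(group, dequantize_group(ternary_codes, ternary_scale))

print("binary :", binary_codes, round(binary_scale, 4), round(binary_error, 4))
print("ternary:", ternary_codes, round(ternary_scale, 4), round(ternary_error, 4))
```

On this toy group the ternary codes reconstruct the weights with lower squared error than the binary codes, because near-zero weights can map to 0 instead of being forced to plus or minus the group scale. This is the extra flexibility the zero state provides, at the cost of more bits per weight.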

In practice, Bonsai Image 4B generates a 512x512 image in 9.4 seconds on an iPhone 17 Pro Max and in about 6 seconds on a Mac with an M4 Pro. On the M4 Pro, Bonsai Image 4B is up to 5.6x faster than the standard full-precision MFLUX pipeline.

Benchmark performance

Compression only matters if the model remains useful. We evaluated Bonsai Image 4B on three complementary benchmarks: GenEval for object composition and attribute binding; HPSv3 for human preference and aesthetic quality; and DPG-Bench for dense prompt following and semantic fidelity.

Table II: Benchmark comparison of image quality between Ternary Bonsai Image 4B and other models.

Ternary Bonsai Image 4B is the quality-oriented variant. At 1.21 GB, it retains 95% of FLUX.2 [klein] 4B's accuracy across GenEval, HPSv3, and DPG-Bench while reducing the diffusion transformer footprint by 6.4x.

1-bit Bonsai Image 4B is the footprint-oriented variant. It reduces the diffusion transformer to under 1 GB, an 8.3x reduction, and still delivers strong results on the same three benchmarks, retaining 88% of FLUX.2 [klein] 4B's accuracy.

Together, the two variants shift the quality-footprint frontier. Bonsai Image remains competitive with modern 4B-class image models while using only a fraction of their diffusion transformer footprint. At the same time, it significantly outperforms smaller models with similar memory requirements. This is the same Pareto shift we saw with our earlier Bonsai language models: Bonsai Image brings modern diffusion transformer behavior into a memory range that was previously limited to much smaller, lower-performing models.

Why this matters

Image generation is not just a model-quality problem. It is also a deployment problem. Cloud APIs will remain the right choice for many products. However, cloud-only generation imposes certain product constraints: every prompt is a remote request, every iteration incurs marginal serving cost, and every interaction adds round-trip latency.

This matters because image generation is inherently iterative. Users rarely stop at one image. They revise prompts, compare outputs, generate variations, discard failures, and try again. If every attempt is a server-side job, the creative loop becomes something users have to meter and wait on.

Local inference changes that. Once the model fits on the device, generation can happen directly within the product experience. It becomes cheaper to use, faster to iterate with, and easier to deploy in environments where prompts and generated assets should stay private.

Bonsai Image 4B is a step toward this deployment regime: strong image generation that runs closer to the user, on hardware they already own.

Availability

Both 1-bit and Ternary Bonsai Image 4B are released with open weights and code under the Apache 2.0 license. With this release, we are also launching Bonsai Studio, our iOS app that lets you try Bonsai Image 4B directly on iPhone.

Join us

PrismML emerged from a team of Caltech researchers and is backed by Khosla Ventures, Cerberus, and Google. We have spent years working on one of the hardest problems in the field: compressing neural networks without compromising their ability to reason. If you want to help build the next generation of state-of-the-art AI, we would love to hear from you. Check out our careers page.