Local AI Models Qwen 3.5 and Gemma 4 E2B Deliver Flawed Survival Advice in iPhone Test


A recent social media post by user "Flowers ☾" has raised safety concerns about relying on locally run large language models (LLMs) in "post-apocalyptic survival scenarios." The user reported that, in tests of Qwen 3.5 4b and Gemma 4 E2B on an iPhone, both models gave dangerously inaccurate information: Qwen 3.5 4b produced eight "fatal advices," and Gemma 4 E2B produced four.

"I tested the two best models that can run locally on an iphone for post apocalyptic survival scenarios, and while Qwen 3.5 4b has a slight edge over Gemma 4 e2b, it gave 8 fatal advices, while Gemma 'only' gave 4," stated Flowers ☾ in the tweet.

The post underscores the ongoing challenge of "hallucinations," in which LLMs generate plausible but factually incorrect information. Such errors can have severe consequences in high-stakes applications such as survival guidance or healthcare, and the reliability of AI in critical situations remains an active area of research and development.

Qwen 3.5 4b is part of Alibaba's Qwen3.5 family, which is designed for enhanced reasoning, coding, and multimodal tasks; smaller variants such as the 4B model are optimized for local deployment. The series supports a 256K-token context window and 201 languages, and is known for running efficiently on devices with limited resources.

Google DeepMind's Gemma 4 E2B is an edge-optimized model, specifically engineered for on-device execution on smartphones and other resource-constrained hardware. Released in April 2026, the E2B variant (Effective 2 Billion parameters) prioritizes multimodal capabilities, low-latency processing, and seamless ecosystem integration, offering features like native audio input and reasoning modes. It has a context window of 128K tokens and is designed for privacy and offline use.

Despite advances in making LLMs accessible for local deployment, the tweet is a stark reminder of their limitations. These models offer benefits such as privacy and offline functionality, but accuracy is paramount in domains that require precise, verified information. AI safety research emphasizes the need for robust evaluation frameworks, human oversight, and clear ethical guidelines to ensure responsible deployment of LLMs in critical settings.