@HelloSurgeAI
Red teaming is a critical part of ensuring LLMs are safe, but it’s not often discussed. At Surge AI, we red team LLMs for many of the major AI labs, including Anthropic and Microsoft. We care deeply about this problem as it aligns with our core mission to build safe and useful AI systems for the world. Here are some of our recent findings: • Unsafe content can be generated by passing safe instructions to the LLM and then asking for a contrasting perspective. (example in the figure) • Sometimes, the models contradict themselves when responding to adversarial prompts: they’ll respond with “[UNSAFE CONTENT] is not appropriate to discuss, etc.” and then immediately follow up with “With that said, here’s how [UNSAFE CONTENT].” • LLMs often mirror the language in the requests, leading to easily injecting unsafe words that lead to harmful outputs. • Hiding attacks in positive and empowering language is an effective approach to coerce the model to spit out desired harmful output. Our brilliant Surgers red team some of the top LLMs, including Anthropic’s Claude which is regarded as one of the most safe and capable models available. Learn more: https://t.co/zkq51kDD7Z Stay tuned for more insights and breakthroughs from our world-class team as we continue to redefine and innovate our red teaming strategies. We are keen to continue making LLMs safer, better, and more creative for everyone. Interested in working together? Reach out: https://t.co/q8XmX6NYqV