🐦 Twitter Post Details

Viewing enriched Twitter post

@HelloSurgeAI

Red teaming is a critical part of ensuring LLMs are safe, but it’s not often discussed. At Surge AI, we red team LLMs for many of the major AI labs, including Anthropic and Microsoft. We care deeply about this problem as it aligns with our core mission to build safe and useful AI systems for the world. Here are some of our recent findings: • Unsafe content can be generated by passing safe instructions to the LLM and then asking for a contrasting perspective. (example in the figure) • Sometimes, the models contradict themselves when responding to adversarial prompts: they’ll respond with “[UNSAFE CONTENT] is not appropriate to discuss, etc.” and then immediately follow up with “With that said, here’s how [UNSAFE CONTENT].” • LLMs often mirror the language in the requests, leading to easily injecting unsafe words that lead to harmful outputs. • Hiding attacks in positive and empowering language is an effective approach to coerce the model to spit out desired harmful output. Our brilliant Surgers red team some of the top LLMs, including Anthropic’s Claude which is regarded as one of the most safe and capable models available. Learn more: https://t.co/zkq51kDD7Z Stay tuned for more insights and breakthroughs from our world-class team as we continue to redefine and innovate our red teaming strategies. We are keen to continue making LLMs safer, better, and more creative for everyone. Interested in working together? Reach out: https://t.co/q8XmX6NYqV

View on Twitter

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1679511913426714625/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1679511913426714625/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "downloaded_to_supabase": true,
  "processed_at": "2025-08-14T07:00:00Z"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1679511913426714625",
  "url": "https://x.com/HelloSurgeAI/status/1679511913426714625",
  "twitterUrl": "https://twitter.com/HelloSurgeAI/status/1679511913426714625",
  "text": "Red teaming is a critical part of ensuring LLMs are safe, but it’s not often discussed.\n\nAt Surge AI, we red team LLMs for many of the major AI labs, including Anthropic and Microsoft.\n\nWe care deeply about this problem as it aligns with our core mission to build safe and useful AI systems for the world.\n\nHere are some of our recent findings:\n\n• Unsafe content can be generated by passing safe instructions to the LLM and then asking for a contrasting perspective. (example in the figure)\n• Sometimes, the models contradict themselves when responding to adversarial prompts: they’ll respond with “[UNSAFE CONTENT] is not appropriate to discuss, etc.” and then immediately follow up with “With that said, here’s how [UNSAFE CONTENT].”\n• LLMs often mirror the language in the requests, leading to easily injecting unsafe words that lead to harmful outputs.\n• Hiding attacks in positive and empowering language is an effective approach to coerce the model to spit out desired harmful output.\n\nOur brilliant Surgers red team some of the top LLMs, including Anthropic’s Claude which is regarded as one of the most safe and capable models available. Learn more: https://t.co/zkq51kDD7Z\n\nStay tuned for more insights and breakthroughs from our world-class team as we continue to redefine and innovate our red teaming strategies.\n\nWe are keen to continue making LLMs safer, better, and more creative for everyone.\n\nInterested in working together? Reach out: https://t.co/q8XmX6NYqV",
  "source": "Twitter for iPhone",
  "retweetCount": 0,
  "replyCount": 2,
  "likeCount": 7,
  "quoteCount": 0,
  "viewCount": 1988,
  "createdAt": "Thu Jul 13 15:23:38 +0000 2023",
  "lang": "en",
  "bookmarkCount": 1,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1679511913426714625",
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "HelloSurgeAI",
    "url": "https://x.com/HelloSurgeAI",
    "twitterUrl": "https://twitter.com/HelloSurgeAI",
    "id": "1267866160894222343",
    "name": "Surge AI",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1942208543618220032/tBPg9A4s_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1267866160894222343/1751563043",
    "description": "",
    "location": "",
    "followers": 5395,
    "following": 142,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Tue Jun 02 17:10:41 +0000 2020",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 249,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 183,
    "statusesCount": 613,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1681343766123143168"
    ],
    "profile_bio": {
      "description": "Human data for AGI. Our mission: to raise AGI with the richness of human intelligence — curious, witty, imaginative, and full of unexpected brilliance.",
      "entities": {
        "description": {},
        "url": {
          "urls": [
            {
              "display_url": "surgehq.ai",
              "expanded_url": "https://www.surgehq.ai",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/6bGF7OxrIX"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/e8rVanIHuz",
        "expanded_url": "https://twitter.com/HelloSurgeAI/status/1679511913426714625/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "1679511263003435009",
        "indices": [
          281,
          304
        ],
        "media_key": "3_1679511263003435009",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARdO0phjlmABCgACF07TL9PWAAEAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABF07SmGOWYAEKAAIXTtMv09YAAQAA",
            "media_key": "3_1679511263003435009"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/F07SmGOWYAESSZW.png",
        "original_info": {
          "focus_rects": [
            {
              "h": 1046,
              "w": 1868,
              "x": 0,
              "y": 0
            },
            {
              "h": 1046,
              "w": 1046,
              "x": 364,
              "y": 0
            },
            {
              "h": 1046,
              "w": 918,
              "x": 428,
              "y": 0
            },
            {
              "h": 1046,
              "w": 523,
              "x": 626,
              "y": 0
            },
            {
              "h": 1046,
              "w": 1868,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1046,
          "width": 1868
        },
        "sizes": {
          "large": {
            "h": 1046,
            "w": 1868
          }
        },
        "type": "photo",
        "url": "https://t.co/e8rVanIHuz"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "urls": [
      {
        "display_url": "surgehq.ai/case-studies/a…",
        "expanded_url": "https://www.surgehq.ai/case-studies/anthropic-claude-surgeai-rlhf-platform",
        "indices": [
          1158,
          1181
        ],
        "url": "https://t.co/zkq51kDD7Z"
      },
      {
        "display_url": "surgehq.ai/rlhf",
        "expanded_url": "https://www.surgehq.ai/rlhf",
        "indices": [
          1452,
          1475
        ],
        "url": "https://t.co/q8XmX6NYqV"
      }
    ]
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}