🐦 Twitter Post Details


@_philschmid

What is better than an LLM as a Judge? Right, an Agent as a Judge! @AIatMeta created an Agent-as-a-Judge to evaluate code agents and enable intermediate feedback, alongside DevAI, a new benchmark of 55 realistic development tasks.

The Agent-as-a-Judge is a graph-based agent with tools to locate, read, retrieve, and evaluate files and information in a code project, so that it can evaluate the results of other agents. Its judgments are compared to human evaluations (alignment rate, judge shift).

Insights
🛠️ Agent cuts cost to ~2.36% of human evaluation and time to ~2.28%.
💰 Agent costs $30.58 vs $1,297.50 for human evaluation
⚡ Reduced time to 118.43 minutes vs 86.5 hours
🧑‍⚖️ LLM-as-a-Judge achieves a 60-70% alignment rate to humans
🥇 Agent-as-a-Judge achieves a 90% alignment rate to humans
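
The relative cost and time can be checked directly from the dollar and time figures quoted in the post. The following is a minimal sketch that assumes those four numbers are the only inputs; the helper name `percent_of` is hypothetical and does not come from the post or the paper.

```python
# Figures quoted in the post.
AGENT_COST_USD = 30.58
HUMAN_COST_USD = 1297.50
AGENT_TIME_MIN = 118.43
HUMAN_TIME_HOURS = 86.5


def percent_of(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`, rounded to two decimals.

    Hypothetical helper, named for this illustration only.
    """
    return round(part / whole * 100, 2)


# Convert human time to minutes so both time values share a unit.
human_time_min = HUMAN_TIME_HOURS * 60

cost_pct = percent_of(AGENT_COST_USD, HUMAN_COST_USD)
time_pct = percent_of(AGENT_TIME_MIN, human_time_min)

print(f"Agent cost as a share of human cost: {cost_pct}%")
print(f"Agent time as a share of human time: {time_pct}%")
print(f"Cost saved: {round(100 - cost_pct, 2)}%")
print(f"Time saved: {round(100 - time_pct, 2)}%")
```

Computed this way, the cost ratio rounds to 2.36% and the time ratio to 2.28%, which corresponds to savings of roughly 97.64% in cost and 97.72% in time.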


📊 Media Metadata

{
  "data": [
    {
      "id": "",
      "type": "photo",
      "url": null,
      "media_url": "https://pbs.twimg.com/media/GebNkjeXQAEyuJ6.jpg",
      "media_url_https": null,
      "display_url": null,
      "expanded_url": null
    }
  ],
  "score": 0.86,
  "scored_at": "2025-08-09T13:46:07.554002",
  "import_source": "network_archive_import",
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1866405130217853316/media_0.jpg?",
      "filename": "media_0.jpg",
      "original_url": "https://pbs.twimg.com/media/GebNkjeXQAEyuJ6.jpg"
    }
  ],
  "storage_migrated": true
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2019-06-18T18:39:49.000Z",
    "default_profile_image": false,
    "description": "Tech Lead and LLMs at @huggingface 👨🏻‍💻 🤗  AWS ML Hero 🦸🏻 | Cloud & ML enthusiast | 📍Nuremberg | 🇩🇪 https://t.co/l1ppq3q3hk",
    "fast_followers_count": 0,
    "favourites_count": 5136,
    "followers_count": 27719,
    "friends_count": 820,
    "has_custom_timelines": false,
    "is_translator": false,
    "listed_count": 656,
    "location": "Nürnberg",
    "media_count": 999,
    "name": "Philipp Schmid",
    "normal_followers_count": 27719,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1141052916570214400/1725456070",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1831321531852496896/1yBZG884_normal.jpg",
    "screen_name": "_philschmid",
    "statuses_count": 3072,
    "translator_type": "none",
    "url": "https://t.co/8BDXIK6omb",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "1141052916570214400"
  },
  "id": "1866405130217853316",
  "conversation_id": "1866405130217853316",
  "full_text": "What is better than an LLM as a Judge? Right, an Agent as a Judge! @AIatMeta created an Agent-as-a-Judge to evaluate code agents to enable intermediate feedback alongside DevAI a new benchmark of 55 realistic development tasks.\n\nThe Agent-as-a-Judge is a graph-based agent with tools to locate, read, retrieve, and evaluate files and information for a code project to evaluate the results of other agents by comparing its judgments to human evaluations (alignment rate, judge shift).\n\nInsights\n🛠️ Agent cuts down costs to ~2.29% of human evaluation and time to ~2.36%.\n💰 Agent costs $30.58 vs $1,297.50 for human evaluation\n⚡ Reduced time to 118.43 minutes vs 86.5 hours\n🧑‍⚖️ LLM-as-a-Judge achieved a 60-70% alignment rate to humans\n🥇 Agent-as-a-Judge achieves a 90% alignment rate to humans",
  "reply_count": 6,
  "retweet_count": 54,
  "favorite_count": 266,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [
    {
      "id_str": "1034844617261248512",
      "name": "AI at Meta",
      "screen_name": "AIatMeta",
      "profile": "https://twitter.com/AIatMeta"
    }
  ],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GebNkjeXQAEyuJ6.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/_philschmid/status/1866405130217853316",
  "created_at": "2024-12-10T08:50:33.000Z",
  "#sort_index": "1866405130217853316",
  "view_count": 15165,
  "quote_count": 6,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://x.com/_philschmid/status/1866405130217853316"
}