🐦 Twitter Post Details

@joelniklaus

Community asked, we delivered. 🚀

We just released almost 7TB of raw rephrased data from #FinePhrase to enable further experimentation and analysis. Code is also public for full transparency and reproducibility.

We're using the new Hugging Face Buckets feature for this release. Unlike git-based repos, buckets provide S3-like object storage with content-addressable deduplication. Perfect for this use case because:
- No version control overhead for massive files (7TB would be painful in git)
- Fast, mutable storage for artifacts that don't need history tracking
- Simple CLI and Python API for syncing, filtering, and browsing
- Server-side file copying without re-uploading

@ratishsp ran a detailed quality analysis on the dataset and found some interesting patterns. Format compliance on tables was rough, hallucination rates were high across splits. Makes sense since the rephrasing was done by a 1.7B model.

Weird thing: models pretrained on this data still hit decent benchmark scores after 20B and 100B tokens, even with the quality issues. We covered some of this in the original blog post, but there's clearly more to understand here.

That's exactly why we're releasing everything. If you want to dig into:
- How synthetic data quality actually impacts pretraining
- Why benchmark performance doesn't always track with perceived quality
- Better filtering or multi-round rephrasing approaches
- The counterintuitive relationship between data quality and model performance

Now you can.

Data: https://t.co/DNNKKZEO3H
Code: https://t.co/WuqFvYobh9

Curious what you find. Hit me up if you do something interesting with it!
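
For readers unfamiliar with the term, here is a rough sketch of what "content-addressable deduplication" means in general. It is a toy in-memory illustration, not Hugging Face's implementation and not the Buckets API: the class `ContentAddressableStore` and its methods are hypothetical names invented for this example, and real systems typically deduplicate at the chunk level rather than per whole file.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store (hypothetical): blobs are keyed by their SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes; each distinct content is stored once
        self.paths = {}  # path -> digest; mutable mapping from names to content

    def put(self, path, data):
        """Store data under path; return True only if new bytes had to be stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.paths[path] = digest  # overwriting a path keeps no history
        return is_new

    def copy(self, src, dst):
        """Copy by duplicating the pointer only; no content is re-uploaded."""
        self.paths[dst] = self.paths[src]

    def get(self, path):
        return self.blobs[self.paths[path]]


store = ContentAddressableStore()
first = store.put("split_a/part0.parquet", b"same rows")
second = store.put("split_b/part0.parquet", b"same rows")  # identical content
store.copy("split_a/part0.parquet", "backup/part0.parquet")
print(first, second, len(store.blobs))
```

In this sketch, the second upload and the copy add only new path entries, which is the intuition behind both the deduplication and the "server-side file copying without re-uploading" points in the post.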

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_0.jpg",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_0.jpg",
      "type": "photo",
      "filename": "media_0.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_1.jpg",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_1.jpg",
      "type": "photo",
      "filename": "media_1.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_2.jpg",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2043705930064564718/media_2.jpg",
      "type": "photo",
      "filename": "media_2.jpg"
    }
  ],
  "processed_at": "2026-04-13T20:23:22.167430",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2043705930064564718",
  "url": "https://x.com/joelniklaus/status/2043705930064564718",
  "twitterUrl": "https://twitter.com/joelniklaus/status/2043705930064564718",
  "text": "Community asked, we delivered. 🚀\n\nWe just released almost 7TB of raw rephrased data from #FinePhrase to enable further experimentation and analysis. Code is also public for full transparency and reproducibility.\n\nWe're using the new Hugging Face Buckets feature for this release. Unlike git-based repos, buckets provide S3-like object storage with content-addressable deduplication. Perfect for this use case because:\n- No version control overhead for massive files (7TB would be painful in git)\n- Fast, mutable storage for artifacts that don't need history tracking\n- Simple CLI and Python API for syncing, filtering, and browsing\n- Server-side file copying without re-uploading\n\n@ratishsp ran a detailed quality analysis on the dataset and found some interesting patterns. Format compliance on tables was rough, hallucination rates were high across splits. Makes sense since the rephrasing was done by a 1.7B model.\n\nWeird thing: models pretrained on this data still hit decent benchmark scores after 20B and 100B tokens, even with the quality issues. We covered some of this in the original blog post, but there's clearly more to understand here.\n\nThat's exactly why we're releasing everything. If you want to dig into:\n- How synthetic data quality actually impacts pretraining\n- Why benchmark performance doesn't always track with perceived quality\n- Better filtering or multi-round rephrasing approaches\n- The counterintuitive relationship between data quality and model performance\n\nNow you can.\n\nData: https://t.co/DNNKKZEO3H\nCode: https://t.co/WuqFvYobh9\n\nCurious what you find. Hit me up if you do something interesting with it!",
  "source": "Twitter for iPhone",
  "retweetCount": 6,
  "replyCount": 0,
  "likeCount": 17,
  "quoteCount": 0,
  "viewCount": 1643,
  "createdAt": "Mon Apr 13 15:00:38 +0000 2026",
  "lang": "en",
  "bookmarkCount": 12,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2043705930064564718",
  "displayTextRange": [
    0,
    279
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "joelniklaus",
    "url": "https://x.com/joelniklaus",
    "twitterUrl": "https://twitter.com/joelniklaus",
    "id": "390741977",
    "name": "Joël Niklaus",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1468993727926620161/I1lBhFJM_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/390741977/1762851950",
    "description": "",
    "location": "Berne, Switzerland",
    "followers": 1442,
    "following": 418,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Oct 14 13:44:28 +0000 2011",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 839,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 222,
    "statusesCount": 733,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2030554880285585544"
    ],
    "profile_bio": {
      "description": "Data @huggingface",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [],
          "user_mentions": [
            {
              "id_str": "0",
              "indices": [
                5,
                17
              ],
              "name": "",
              "screen_name": "huggingface"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "niklaus.ai",
              "expanded_url": "http://www.niklaus.ai",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/dQjHdo03H7"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/ZcEa8wce5Z",
        "expanded_url": "https://twitter.com/joelniklaus/status/2043705930064564718/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "id_str": "2043705927401193472",
        "indices": [
          280,
          303
        ],
        "media_key": "3_2043705927401193472",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARxcs7aEG8AACgACHFyztyLbke4AAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABHFyztoQbwAAKAAIcXLO3ItuR7gAA",
            "media_key": "3_2043705927401193472"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HFyztoQbwAALRnN.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 579,
              "w": 1034,
              "x": 0,
              "y": 0
            },
            {
              "h": 642,
              "w": 642,
              "x": 0,
              "y": 0
            },
            {
              "h": 642,
              "w": 563,
              "x": 0,
              "y": 0
            },
            {
              "h": 642,
              "w": 321,
              "x": 0,
              "y": 0
            },
            {
              "h": 642,
              "w": 1034,
              "x": 0,
              "y": 0
            }
          ],
          "height": 642,
          "width": 1034
        },
        "sizes": {
          "large": {
            "h": 642,
            "w": 1034
          }
        },
        "type": "photo",
        "url": "https://t.co/ZcEa8wce5Z"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [
      {
        "indices": [
          89,
          100
        ],
        "text": "FinePhrase"
      }
    ],
    "symbols": [],
    "urls": [
      {
        "display_url": "huggingface.co/buckets/Huggin…",
        "expanded_url": "https://huggingface.co/buckets/HuggingFaceFW/finephrase-rephrased",
        "indices": [
          1509,
          1532
        ],
        "url": "https://t.co/DNNKKZEO3H"
      },
      {
        "display_url": "github.com/huggingface/fi…",
        "expanded_url": "https://github.com/huggingface/finephrase",
        "indices": [
          1539,
          1562
        ],
        "url": "https://t.co/WuqFvYobh9"
      }
    ],
    "user_mentions": [
      {
        "id_str": "153102421",
        "indices": [
          681,
          690
        ],
        "name": "Ratish Puduppully",
        "screen_name": "ratishsp"
      }
    ]
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "communityInfo": null,
  "article": null
}