@DrJimFan
Let's reverse engineer GPT-4V's uncanny ability to convert screenshots/sketches to code. Believe it or not, it's actually a (relatively) easy training task, because synthetic data can be scaled up massively. No insider info, but this is how I'd do it:

1. Scrape lots of websites and their code. Use a lightweight model (GPT-3.5) to clean up the code, and Selenium to render screenshots. This becomes the initial training dataset of (image, code) pairs. (Sketch 1 below.)

2. Now, given a screenshot, ask the model to generate code and execute it in an actual browser. This step may throw errors, but GPT-4 is good at self-debugging: a few rounds of iterative refinement fix most obvious runtime errors. (Sketch 2 below.)

3. The code is runnable now, but the rendered website may not follow the input image completely. Enter a very powerful idea from agent learning called "Hindsight Relabeling" (https://t.co/f65BsP1tRk, published by OpenAI in 2017). The key insight: the "wrong" end product is actually correct for the code that produced it. So instead of keeping (Image1, code), render the generated code into Image2. The pair (Image2, code) is ground truth by construction and can be added back to the training dataset. (Sketch 3 below.)

4. Do aggressive data augmentation: change fonts, move HTML elements around, swap out backgrounds, add lots of noise. Combined with GPT-4V's extraordinary OCR abilities, it's conceivable that enough augmentation will help it generalize to hand-drawn sketches, such as the napkin demo that @gdb did in February. (Sketch 4 below.)

Video credit: @mckaywrigley
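Sketch 1 — the scrape-clean-render pipeline from step 1. This is just how I'd wire it up with the OpenAI Python client and Selenium; the model choice, cleanup prompt, and function names are my assumptions, not anything GPT-4V actually uses.

```python
from urllib.parse import quote
from openai import OpenAI
from selenium import webdriver

client = OpenAI()
driver = webdriver.Chrome()

def build_pair(url: str) -> tuple[bytes, str]:
    """Scrape one page and return an (image, code) training pair."""
    driver.get(url)
    raw_html = driver.page_source
    # A lightweight model strips trackers, inlined junk, dead CSS, etc.
    cleaned = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Clean up this HTML but keep the layout identical:\n"
                              + raw_html}],
    ).choices[0].message.content
    # Re-render the *cleaned* code so the screenshot matches the label exactly.
    driver.get("data:text/html;charset=utf-8," + quote(cleaned))
    screenshot = driver.get_screenshot_as_png()
    return screenshot, cleaned
```

Run this over a few million scraped URLs and you have the initial (image, code) corpus.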
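Sketch 2 — the self-debugging loop from step 2. I'm assuming Chrome's console log as the error channel (Selenium exposes it via get_log("browser") when logging is enabled); the prompts and round count are placeholders.

```python
from urllib.parse import quote
from openai import OpenAI
from selenium import webdriver

client = OpenAI()
opts = webdriver.ChromeOptions()
opts.set_capability("goog:loggingPrefs", {"browser": "ALL"})  # expose console errors
driver = webdriver.Chrome(options=opts)

def generate_and_debug(messages: list, max_rounds: int = 3) -> str:
    """messages starts as the (screenshot -> code) prompt; returns runnable code."""
    code = ""
    for _ in range(max_rounds):
        code = client.chat.completions.create(
            model="gpt-4", messages=messages,
        ).choices[0].message.content
        driver.get("data:text/html;charset=utf-8," + quote(code))
        errors = [e["message"] for e in driver.get_log("browser")
                  if e["level"] == "SEVERE"]
        if not errors:
            break  # no obvious runtime errors left
        # Hand the model its own code plus the errors and let it fix itself.
        messages = messages + [
            {"role": "assistant", "content": code},
            {"role": "user", "content": "The page threw these errors, fix the code:\n"
                                        + "\n".join(errors)},
        ]
    return code
```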
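Sketch 3 — hindsight relabeling from step 3. Nothing model-specific here: whatever the generated code renders to is, by definition, a correct label for that code.

```python
from urllib.parse import quote
from selenium import webdriver

driver = webdriver.Chrome()

def hindsight_relabel(dataset: list[tuple[bytes, str]], code: str) -> None:
    """The model missed Image1, but its output still yields a perfect pair."""
    driver.get("data:text/html;charset=utf-8," + quote(code))
    image2 = driver.get_screenshot_as_png()  # what the code *actually* looks like
    # (Image1, code) may disagree, but (Image2, code) is ground truth by construction.
    dataset.append((image2, code))
```

Same trick as the cited Hindsight Experience Replay paper: relabel the goal to whatever was actually achieved, and a "failed" trajectory becomes a valid training example.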
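Sketch 4 — augmentation from step 4, aimed at sketch-like inputs. The idea is to perturb the rendering while keeping the clean code as the label; the specific CSS perturbations here are purely illustrative.

```python
import random
from urllib.parse import quote
from selenium import webdriver

driver = webdriver.Chrome()

PERTURB_CSS = """
<style>
  * {{ font-family: '{font}' !important; }}
  body > * {{ transform: translate({dx}px, {dy}px) rotate({rot}deg); }}
  body {{ background: {bg} !important; }}
</style>
"""

def augmented_pair(code: str) -> tuple[bytes, str]:
    """Render a randomly perturbed version of the page; label stays the clean code."""
    css = PERTURB_CSS.format(
        font=random.choice(["Comic Sans MS", "cursive", "Courier New"]),
        dx=random.randint(-5, 5), dy=random.randint(-5, 5),
        rot=round(random.uniform(-2, 2), 2),
        bg=random.choice(["#fffbe6", "#eeeeee", "white"]),
    )
    driver.get("data:text/html;charset=utf-8," + quote(css + code))
    return driver.get_screenshot_as_png(), code
```

Stack enough of these (plus pixel-level noise on the screenshot itself), and hand-drawn-looking inputs start to look like just another point in the training distribution.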