Dimension Hopper Part 1
2D Platformer using Stable Diffusion for live level art creation
The next few posts are going to depart a bit from the regular theme (which has been lessons learned in a career doing general purpose robotics). Instead I’ve decided to use some of my new free time1 to learn by doing and play with some of the cool new ML all the cool kids are talking about.
My project is to make a 2D platformer where the players can design their own levels and then generative AI will create beautiful rendered images to represent the levels. I wanted to do something that wouldn’t be possible without AI: letting the players participate in the creation of art. We’ll skip to the end and you can see what the game looks like now:
And here are some of the different themes, though you can also create your own.
You can play with it here: dimensionhopper.com
I recommend looking at random levels or the gallery and seeing what’s out there.
But lets talk about the process to get there.
Part 1: proof of concept
I’d played a little bit with Stable Diffusion before and it is a really fun toy to make cool pictures, but I always felt like I didn’t have quite enough control. For me the enjoyment of creation comes from the interaction between what I do and what I get, and I wanted to have more input. That’s why I was so excited when I read about control-net, which gives a ton more knobs to control the output. I immediately wanted to use it to make a 2D game.2
Getting up and going
I installed Stable-Diffusion on my laptop, fired up the webui3 got control net working and fed this depth image into Stable Diffusion:
This has the platforms as the closest pixels (white) and the black background means that part is far away. I was using a pixel-art model that had this amazing demo pic in the CIVAI page:
So I copy the prompt and settings from that, tweaked a bit and hit generate…….. And get this:
Not quite what I was hoping for… I try and fail to get the depth mode working for a while.
And switch over to “scribble” mode for control-net. Scribble mode takes outlines of shapes and lets them guide the images (instead of depth).
More interesting but still not good.
Changing the prompt:
“pixelart video game environment, platformer level. Create an image of a mystic stone temple in the jungle. shafts of sunlight, dramatic lighting. Vines and cracks in brown stone”
Much closer! I have a picture. It kind of interacted with the level. Still not good but better.
Ooh, I have the level casting some shadows now.
“pixelart video game environment, Create an image of an abandoned space station, with broken systems, flickering lights, and a sense of danger. Show the wreckage, the abandoned rooms, and the unknown threats that linger.”
Eventually with some more playing around with settings I get levels like this pretty reliably:
This is a big improvement over the start, but (1) it didn’t really look like the level was part of the art, it was, at best, pasted over, and (2) the level textures looked like a repeating tileset. Human videogame developers do this so that they don’t have to draw a different bit of grass for every square of platform, but I didn’t have that constraint. Huh.
No more pixelart.
I decided that part of my problem was using a model trained on pixelart. It was faithfully copying the genre of repeated tileset, which was exactly what I didn’t want. So I changed to another model, this one built around children’s illustration, and my first image out looked like this:
Wow! So much better! The platforms have shadows on them, have objects in front and behind and it's actually a nice picture. I’m on to something!
The new model made nice pictures but I quickly realize I’m walking the line between two failure modes.
Either I have a nice picture with the level kind of mostly drawn on top of it:
Or I get a nice, integrated picture where its really unclear where you are allowed to stand:
The last one is especially problematic because the ‘level’ part you can stand on has been rendered as a window, exactly inverting the semantics. This is a fundamental problem with using ‘scribble-mode’ as I’m giving Stable Diffusion no way to know what is close and what is far, just outlined shapes.
Breakthrough: Lips on the Platforms.
I go back to depth, but have a thought: what if I hint that the top of the platform has a little lip. Rather than code it in my game rendering engine I just draw them in Gimp.
Happy little platform toppers
Holy shit it works!
And more importantly: it works pretty much every time. Most Stable-Diffusion workflows include generating 4-10x more images than you actually want, and choosing the good one. For my idea to work we needed all of the levels to be playable (you can tell where the platforms are) and most of them to be good (beautiful illustration) because there wouldn’t be a human curation step.
Control Image Matters
I’ve learned that the look of the depth image really changes the quality of the output. I was working on jungle ruins and kept getting images like this:
(Side note: its amazing how quickly standards rise. Early on that image blew me away, but now it looks meh).
The problem4 is that there aren’t really any reasonable pictures of jungles with that depth map. Jungles (and really everything) doesn’t look like that. Stuff doesn’t float in the air, it has stuff under it holding it up. This leads to breakthrough 2: add supports.
Each platform block projects a dark gray box to the floor below it and that gives structure to the world. The dark gray has no gameplay purpose, it just acts as a hint to Stable Diffusion about what the picture is of. And we get much better images.
Things were pretty good, but I was still having trouble with caves, and I wondered if it was because it was trying to match the straight, sharp edges of the depth map and having a hard time making it look organic, so I added adjustable roughness to the images (as well as adjustable background depth so that sky can be far away for outdoor scenes and closer for indoor/jungle scenes.
Notice the bottoms of the platforms are really straight and its invented a bunch of lightcolumns / waterfalls that can have perfectly straight verticals. This one is actually quite good, but the square corners are not ideal.
Now everything is believably subterranean, and the underlighting on the bumpy rocks looks right.
Gems and Characters
Stable diffusion doesn’t create any transparency.
But can cheat with the level image because I have the depth information I fed in and mask based on that, so characters can go ‘behind’ the platforms when needed. For the gems, if I ask for ‘<blue gem/ruby jewel> floating videogame object on a black background’ and then subtract the background in python.5
For characters I found this model which was really fine-tuned to heck to make 4 frame walk animations. The creator really wanted left/right/up/down animations so combines this model with a lora of whoever the subject is to create all 4 directions depicting the same character. As a result, I think it is trained so all the ‘walk right’ just get the prompt “PixelartRSS”.
I’d like to be able to prompt something about the character I want, and this sometimes kinda works6 but not nearly as reliably as I’d like. My suspicion is that if I had the training data, limited it to just the sideways walk and labeled it with actual descriptions of the characters it would work better for me. But it does reliably make people that walk. So as long as you aren’t picky, you can get new player character sprites all day long.
Wrapping it all up
From there I just had to make it work in my own app. I used the excellent diffusers library to wrap the generation and make my little server. Everything is moving so fast that I’m sure I’ve done some things in very silly ways in my image_generation.py and there are probably 2-3x speedups to be gained be configuring it right, but for now it all works, and it’s fun to play with. Why are you still reading? Go check it out!
And subscribe for more about the development of this game, and more robotics content after that.
Discuss on Hacker News: https://news.ycombinator.com/item?id=36295227
Who needs a job when you have hobbies?
I’m sure a bunch of other folks have had the same idea but as far as my lazy googling can turn up, no one is doing dynamic level rendering during gameplay. Maybe after this blog-post there will be more ;).
Not pictured: several frustrating hours of conda, pip, configuration, cuda, torch, nvidia-drivers, etc etc. Why is the world still like this?
I think, though who knows what’s going on in the mind of the black box
The backgrounds end up being mostly, but not exactly an even color, so I use the average of the border pixels and subtract everything within some threshold of that.
I assume some of stable-diffusion’s semantic understanding leaks through the super strict fine-tuning.