Physical Intelligence Wins the Olympics: Deep Dive
So last month, Physical Intelligence (PI) dropped a bombshell and claimed medals on basically all the remaining olympic tasks.

This was surprising for two reasons.
1. I did not expect these events to happen this quickly. When I designed my tasks I thought I was picking some (push doors) that could be claimed in a few weeks or months and some (peanut butter) that would take at least a year. To have all of them done in 3 months means the state of the art is moving way faster than I expected.
2. PI didn’t need fancy sensing or actuation. When I selected the tasks I tried to think of things where current approaches would not work. Tasks like peanut butter spreading or wiping glass need force feedback. Tasks like key manipulation or sock inversion need dexterous many-fingered hands. Tasks like cleaning your gripper would require building a robot that wasn’t worried about getting wet. I was wrong on all those counts: PI solved all of them with pure vision and pincer grippers.
I’m still surprised that you can do all of this visually. This is good news for general-purpose robotics. I thought we needed great touch sensing to unlock really useful manipulation. But if we continue to find that vision is ‘good enough’ to keep solving harder and harder tasks, that would make useful manipulation a data-collection problem (apply effort: get data), not an invention problem (try things over and over, hoping that someday one works).
Exciting!
Let’s dive into each event. I’m noting the robot-human speed ratio in each description because, now that they have claimed everything, what is left for others is to do it faster. A 20% increase in speed lets you claim a prize: so who can do better?
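To make the claiming math concrete, here’s a minimal sketch (the helper names and example times are my own, purely illustrative) of the two thresholds in play: the 10x-slower qualifying limit mentioned below, and the 20% speed increase needed to take a prize from the current record holder:

```python
# Hypothetical scoring helpers -- the function names and example numbers
# are mine, not from the olympics rules page. Encoded here: a run
# qualifies if it is at most 10x slower than a human, and a standing
# record can be re-claimed with at least a 20% speed increase.

def speed_ratio(robot_seconds: float, human_seconds: float) -> float:
    """How many times slower than the human the robot is."""
    return robot_seconds / human_seconds

def qualifies(robot_seconds: float, human_seconds: float, limit: float = 10.0) -> bool:
    """True if the run is within the 10x-slower limit."""
    return speed_ratio(robot_seconds, human_seconds) <= limit

def beats_record(new_seconds: float, record_seconds: float) -> bool:
    """A 20% speed increase means finishing in at most record / 1.2 time."""
    return new_seconds <= record_seconds / 1.2

# Illustrative numbers: a human folds the shirt in 30 s; the record run took 300 s.
print(speed_ratio(300, 30))    # 10.0 -- squeaks in at exactly the limit
print(qualifies(300, 30))      # True
print(beats_record(240, 300))  # True -- 300/240 = 1.25, a 25% speed increase
```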
🥇 Entering a Self-Closing Door
This is one of the easiest tasks and I’m not surprised it’s solvable. It’s nice to see their model work on mobility tasks and not just arms-bolted-to-a-table. A bit more than 4x slower than a human.
🥈 Inverting a Sock
They have the same challenge that Sunday Robotics had: their gripper is too chunky to get inside of things. Here we have a very non-human strategy of sock inversion that ends up being 6.7x slower. I got to see this one live when I visited. They told me they only used 176 examples (their blog says 8 hours) of data, which means it was one of the quicker tasks to get to work.¹
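As a quick sanity check that those two figures agree (a sketch assuming the 8 hours is total demonstration time, spread roughly evenly across episodes):

```python
# Back-of-the-envelope: reconcile "176 examples" with "8 hours" of data.
demos = 176
total_hours = 8.0
seconds_per_demo = total_hours * 3600 / demos
print(round(seconds_per_demo))  # ~164 s, i.e. roughly 2.7 minutes per demonstration
```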
🥉 Folding an Inside-Out T-Shirt
When I posted my olympics, shirt folding was new and exciting, and this event was designed to see how far we could push it. Squeaking in at 10x slower than me (the limit), this still qualifies for a win. Nice and clean.
🥇 Unlocking a Door with a Key
I feel conflicted about this one. I intended it as a measure of in-hand manipulation. As humans we can have a toddler and a bag of groceries in one arm, get keys out of a pocket, manipulate the correct one, and use it in a lock. So this doesn’t demonstrate exactly the skill I was intending to highlight. On the other hand, key insertion and turning is a crazy-hard precision skill, especially without any force sensing, so I think it is deserving of a gold medal. I also like the toolbox-lock video, which has the robot pass the key back and forth between the grippers to get it into the right orientation. When I visited and pointed out the lack of ‘in-hand’ manipulation, Sergey Levine responded that humans use five fingers to do the task: their robot did it with only four (albeit on two arms), which is fair. Impressive work, clocking in at a zippy 5x slower than human. I’m ready to give another gold medal to anyone who can do the key-moving one-handed though…
🥈 Making a Peanut Butter Sandwich
Wow. I can’t get over this one. Only 4x slower, multi-step, spreading, cutting, re-screwing the lid. Wow. This is gorgeous. Absolutely the most impressive one here, and the one that I thought was going to take 1.5 years, not 3 months. Wow. Watch it again.
🥇 Wiping Glass
This is the other one I got to see in person. It worked on the first try while I was watching (always impressive). They reported that they were surprised by how well their hardware used the sprayer. I admit I thought you’d need more degrees of freedom in the hand, but it works really well. The paper towel tearing is nice too. 5x slower.
🥈 Opening a Dog Bag
I think the best way to open a dog bag is to pinch near the opening and then grind your fingers around in a circle to separate the layers. Their fingers can’t do that, of course, so instead they mash it all over the table surface until it separates. Inelegant but effective. 6.7x slower.
Peeling an Orange
The orange task does specify that it must be done without external tools, so while this is an incredible accomplishment, it does not win the medal, leaving it open for future competitors. When I asked how many oranges they went through, they laughed and said that the corner grocery probably noticed the increase in orange sales, and that the one guy at the company who really liked mandarins was getting pretty tired of them by now.
🥈 Cleaning Peanut Butter Off Fingers
For these three tasks there are two basic approaches: make your whole robot splash resistant, or just make your fingers resistant and be really confident in your end-to-end policies. PI went the second (brave) route.²
All this leaves me wondering what is next. I feel like we need more challenges, given that these were clearly too easy. Is vision all you need? How far can we push it? What important, useful tasks do we think will be hard with vision only, given that wiping, spreading, and insertion seem to be just fine without any force or touch sensing? I’m curious to hear folks’ thoughts in the comments.
Also you should subscribe, or maybe share this post, so my ego can grow ever larger.
1. When demoing, I was warned that they’d swapped out the camera and grippers since collecting the training data, but the policy inverted the sock like a champ. Interesting, because it means that we are getting toward the point where off-domain stuff like that still works even on small data sets, but not well enough that the engineers aren’t nervous about it.
2. Chelsea Finn, who trained the sock-inversion task, said that these were easy because they ‘just had to have their hardware team make the end effectors waterproof’, which is a classic software-person thing to say. #ThanksHardwareFolks #WeAppreciateYou


Comments

I have three issues with this :P
Firstly, you caved so hard here!
"I feel conflicted about this one. I intended it as a measure of in-hand manipulation. [...] So this doesn’t demonstrate exactly the skill I was intending to highlight. [..] I think it is deserving of a gold medal."
Secondly, all of these submissions broke the clearly stated rule "To be eligible to win it must be a 1x speed video with no cuts".
Thirdly, isn't this supposed to be a humanoid Olympic games? This robot is not very human-like at all.
Impressive. We are entering the phase where "if you can teleop it, the model can learn". The problem is that the data for these tasks was collected in the same environment using the same objects, and no generalization to new environments has been shown. I would definitely not call something "solved" unless it works in unseen environments, with unseen objects, at a high success rate. How much more data do you need for that? The answer is still unclear.