"I feel conflicted about this one. I intended it as a measure of in-hand manipulation. [...] So this doesn’t demonstrate exactly the skill I was intending to highlight. [..] I think it is deserving of a gold medal."
Secondly, all of these submissions broke the clearly stated rule "To be eligible to win it must be a 1x speed video with no cuts".
Thirdly, isn't this supposed to be a humanoid Olympic games? This robot is not very human-like at all.
Impressive. We are entering the phase where "if you can teleop it, the model can learn". Problem is the data for these tasks were collected in the same environment using the same objects, and has not shown generalization in new environments. I would definitely not call something "solved" unless it can work in unseen environments with unseen objects with high success rate. How much more data do you need for that? The answer is still unclear.
From a product perspective, "cleaning" implies an outcome that is different from what Physical Intelligence achieved. I think the cleaning tasks should include more concrete specifications on the initial state (e.g. window is covered with purple paint from Brand X so you can see if the window was actually cleaned in the end; pan is covered with olive oil spiked with food dye, etc.).
The games would benefit from having a much larger number of events involving the standard household tasks, including pretreating a stain, moving laundry from the washer to the dryer, detecting that the clothes are unexpectedly still wet and need to be dried again, removing sheets from a bed and putting them back on, removing and replacing pillowcases, folding sheets and all the major clothing types, removing dishes/glasses from cupboards, opening ready to eat deli food containers and bottles, putting food and drink in the dishes/glasses, setting a table, loading and unloading a dishwasher, replacing clean dishes in the cupboard, operating each of the popular coffee makers and vacuum cleaners, tidying up a pile of kids clothing and hanging it and pairing the shoes, cleaning an entire countertop, cleaning a faucet, selecting which fruit is ready to eat, etc etc.
As a writer of unit tests, I find it hard to take a household to ot seriously unless it can do all of these tasks and more. It may be surprising which are more challenging.
As someone else pointed out, the orange peeling part was especially impressive, but in hindsight, not that surprising.
Humans who have lost upper body force sensing / proprioception (clinically known as deafferentation) are able to compensate fairly well with vision only. The challenge is that the arms now need to be in the field of view all the time, and movements are much slower and cognitively more demanding.
PI's robots are not completely deafferentated after all - they can observe EE pose and EE gripper pose fairly accurately
PI's VLA only *outputs* EE and joint poses right? Does the robot fix the cartesian / joint stiffness and simply run position control? I'm wondering if tasks requiring variable stiffness would make this style of control challenging to work with? Like scrubbing or wiping harder if the pot or the glass was dirtier.
Really fasinating write up on PI's progress across these challenges. The point about vision being sufficient without force feedback genuinly surprised me too, especially for tasks like spreading where you'd think pressure sensitivity would be crucial. I did some work with industrial automation years back and we always assumed tactile sensing was non-negotiable for that kind of fine motor control.
I'd like to see contamination issues tackled. There are too many videos of robots 'cleaning' toilets before immediately folding 'clean' clothes. I imagine this is difficult with vision alone, and would take good context awareness, perhaps specific sensors for the contamination (dirt/germ) and a way to properly clean manipulators (or other contact vectors).
Fascinating that vision alone cracked peanut butter spreading in 3 months when we assumed force feedback was essential. The fact they trained sock inversion on 176 examples and it still works after hardware swaps suggests something fundamentaly interesting about how robust these policies are becoming. I'm curious whether the next bottleneck will be tasks needing proprioception over longer time horizons rather than instantaneous force sensing.
I have three issues with this :P
Firstly, you caved so hard here!
"I feel conflicted about this one. I intended it as a measure of in-hand manipulation. [...] So this doesn’t demonstrate exactly the skill I was intending to highlight. [..] I think it is deserving of a gold medal."
Secondly, all of these submissions broke the clearly stated rule "To be eligible to win it must be a 1x speed video with no cuts".
Thirdly, isn't this supposed to be a humanoid Olympic games? This robot is not very human-like at all.
Impressive. We are entering the phase where "if you can teleop it, the model can learn". The problem is that the data for these tasks was collected in the same environment with the same objects, and no generalization to new environments has been shown. I would definitely not call something "solved" unless it works in unseen environments, with unseen objects, at a high success rate. How much more data do you need for that? The answer is still unclear.
From a product perspective, "cleaning" implies an outcome that is different from what Physical Intelligence achieved. I think the cleaning tasks should include more concrete specifications on the initial state (e.g. window is covered with purple paint from Brand X so you can see if the window was actually cleaned in the end; pan is covered with olive oil spiked with food dye, etc.).
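To make that concrete, here is a rough sketch of what an outcome-based cleaning event spec could look like. The field names, thresholds, and scoring helper are made up for illustration; nothing here comes from the actual rules.

```python
# Hypothetical sketch of an outcome-based cleaning event spec.
from dataclasses import dataclass

@dataclass
class CleaningEvent:
    name: str
    initial_state: str      # how the surface is soiled before the run
    success_metric: str     # what gets measured afterwards
    pass_threshold: float   # fraction of contaminant that must be removed

EVENTS = [
    CleaningEvent(
        name="window cleaning",
        initial_state="window coated with a known purple paint",
        success_metric="fraction of painted area removed (vision-scored)",
        pass_threshold=0.95,
    ),
    CleaningEvent(
        name="pan scrubbing",
        initial_state="pan coated with olive oil spiked with food dye",
        success_metric="fraction of dyed residue removed",
        pass_threshold=0.95,
    ),
]

def passed(event: CleaningEvent, fraction_removed: float) -> bool:
    """Score a run against the event's outcome threshold."""
    return fraction_removed >= event.pass_threshold
```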
The games would benefit from a much larger number of events covering the standard household tasks: pretreating a stain, moving laundry from the washer to the dryer, detecting that the clothes are unexpectedly still wet and need another drying cycle, removing sheets from a bed and putting them back on, removing and replacing pillowcases, folding sheets and all the major clothing types, removing dishes and glasses from cupboards, opening ready-to-eat deli food containers and bottles, putting food and drink in the dishes and glasses, setting a table, loading and unloading a dishwasher, replacing clean dishes in the cupboard, operating each of the popular coffee makers and vacuum cleaners, tidying up a pile of kids' clothing and hanging it, pairing the shoes, cleaning an entire countertop, cleaning a faucet, selecting which fruit is ready to eat, and so on.
As a writer of unit tests, I find it hard to take a household robot seriously unless it can do all of these tasks and more. It may be surprising which ones are more challenging.
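For what it's worth, this is roughly how I picture that checklist as a test suite. The HouseholdRobot stub and its attempt() method are hypothetical stand-ins, not any real robot API.

```python
# Purely illustrative: the household checklist above as a parametrized test suite.
import pytest

HOUSEHOLD_TASKS = [
    "pretreat a stain",
    "move laundry from washer to dryer",
    "detect still-wet clothes and re-dry them",
    "strip and remake a bed",
    "replace pillowcases",
    "fold sheets and the major clothing types",
    "load and unload a dishwasher",
    "set a table",
    "operate a popular coffee maker",
    "clean an entire countertop",
]

class HouseholdRobot:
    """Stub standing in for whatever stack is under test."""
    def attempt(self, task: str) -> bool:
        # A real harness would run the task and score the outcome.
        return False

@pytest.mark.parametrize("task", HOUSEHOLD_TASKS)
def test_household_task(task):
    robot = HouseholdRobot()
    # The robot is only taken seriously if every task passes,
    # not just the handful shown in demo videos.
    assert robot.attempt(task)
```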
As someone else pointed out, the orange peeling part was especially impressive, but in hindsight, not that surprising.
Humans who have lost upper body force sensing / proprioception (clinically known as deafferentation) are able to compensate fairly well with vision only. The challenge is that the arms now need to be in the field of view all the time, and movements are much slower and cognitively more demanding.
PI's robots are not completely deafferented after all - they can observe the EE pose and gripper pose fairly accurately.
PI's VLA only *outputs* EE and joint poses, right? Does the robot fix the Cartesian / joint stiffness and simply run position control? I'm wondering whether tasks requiring variable stiffness would be challenging for this style of control - scrubbing or wiping harder when the pot or the glass is dirtier, for example.
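To make the question concrete, here is a crude sketch of the alternative I have in mind: the policy outputting a stiffness gain alongside the target pose, with a Cartesian impedance loop underneath. This is speculation about the interface, not anything PI has described.

```python
# Hypothetical: variable-stiffness Cartesian impedance control, where the
# policy could raise stiffness to scrub a dirtier pot harder. Not PI's API.
import numpy as np

def impedance_step(x, x_target, x_dot, stiffness, damping):
    """Return a commanded Cartesian force for one control tick.

    x, x_target, x_dot: 3-vectors (current pose, target pose, velocity).
    stiffness, damping: scalars the policy could vary per timestep.
    """
    return stiffness * (x_target - x) - damping * x_dot

# Fixed-stiffness position control is the special case where `stiffness`
# stays at one large constant for every task.
f_cmd = impedance_step(
    x=np.array([0.40, 0.00, 0.20]),
    x_target=np.array([0.40, 0.00, 0.18]),
    x_dot=np.zeros(3),
    stiffness=800.0,   # N/m
    damping=40.0,      # N*s/m
)
```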
Especially impressed by the orange peeling one - it must be hard to judge visually whether you've got enough leverage to get the peel off.
Is the person in some sort of shiny mocap suit, or are they just trying to blend in as a robot? :P
Really fascinating write-up on PI's progress across these challenges. The point about vision being sufficient without force feedback genuinely surprised me too, especially for tasks like spreading, where you'd think pressure sensitivity would be crucial. I did some work with industrial automation years back and we always assumed tactile sensing was non-negotiable for that kind of fine motor control.
I'd like to see contamination issues tackled. There are too many videos of robots 'cleaning' toilets before immediately folding 'clean' clothes. I imagine this is difficult with vision alone, and would take good context awareness, perhaps specific sensors for the contamination (dirt/germ) and a way to properly clean manipulators (or other contact vectors).
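Even a toy bookkeeping layer would be a start - something like the sketch below, which is purely illustrative and would still need real sensing (or at least reliable context tracking) to populate.

```python
# Illustrative only: a toy cross-contamination tracker for a household robot.
class ContaminationTracker:
    CLEAN_ONLY_TASKS = {"fold laundry", "handle food", "make bed"}
    DIRTY_TASKS = {"clean toilet", "take out trash"}

    def __init__(self):
        self.gripper_dirty = False

    def before_task(self, task: str) -> bool:
        """Return True if the task may proceed with the current gripper state."""
        if task in self.CLEAN_ONLY_TASKS and self.gripper_dirty:
            return False  # require a wash/sanitize step first
        return True

    def after_task(self, task: str) -> None:
        if task in self.DIRTY_TASKS:
            self.gripper_dirty = True

    def sanitize(self) -> None:
        self.gripper_dirty = False

tracker = ContaminationTracker()
tracker.after_task("clean toilet")
assert not tracker.before_task("fold laundry")  # blocked until sanitized
tracker.sanitize()
assert tracker.before_task("fold laundry")
```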
Fascinating that vision alone cracked peanut butter spreading in 3 months when we assumed force feedback was essential. The fact that they trained sock inversion on 176 examples and it still works after hardware swaps suggests something fundamentally interesting about how robust these policies are becoming. I'm curious whether the next bottleneck will be tasks needing proprioception over longer time horizons rather than instantaneous force sensing.