# Evaluation Overview
We evaluated TiPToP on 28 manipulation tasks in 3 settings: (i) simulation using IsaacSim, (ii) a real-world DROID hardware setup operated internally by TiPToP's developers at MIT, and (iii) a separate DROID setup operated by an external evaluation team at the University of Pennsylvania not involved in TiPToP's development. Below we present detailed video demonstrations for tasks evaluated on the internal DROID setup, along with complete results over all 28 tasks in the summary table.
Experimental Setup
- Tasks: 28 tasks total (5 simulation, 8 internal DROID, 15 external DROID)
- Trials: 10 trials per simulation task, 5 trials per real-world task
- Total comparisons: 165 trials (50 simulation, 40 internal, 75 external)
- Hardware: Franka Emika Panda FR3 with Robotiq 2F-85 gripper
- Cameras: 1 x ZED Mini wrist camera, 1 x ZED 2i external camera (not used by TiPToP)
# Summary Results
Below is a comprehensive table of all 28 evaluation tasks across simulation, internal DROID, and external DROID settings. Tasks are organized by category (Simple, Distractor, Semantic, Multi-step). Click on task names with links to jump to detailed video demonstrations. For a detailed breakdown of the language prompt and progress metric for each task, please see the Scene Details table.
Key: SR = Success Rate, TP = Task Progress. † indicates tasks evaluated by the system designers on the internal DROID setup. (sim) indicates simulation tasks in IsaacSim. Unmarked tasks were evaluated by an external team at the University of Pennsylvania on a separate DROID setup. Bold values indicate better performance on that metric.
| Scene | TiPToP SR | TiPToP TP | \(\pi_{0.5}\)-DROID SR | \(\pi_{0.5}\)-DROID TP |
|---|---|---|---|---|
| Simple Tasks | | | | |
| Cube → bowl (sim) | 5/10 | 72.5% | 8/10 | 90% |
| Can → mug (sim) | 9/10 | 97.5% | 2/10 | 50% |
| Banana → bin (sim) | 0/10 | 70% | 9/10 | 97.5% |
| Marker → tray | 3/5 | 80% | 5/5 | 100% |
| Crackers → tray† | 5/5 | 100% | 3/5 | 60% |
| Subtotal | 22/40 | 84% | 27/40 | 79.5% |
| Distractor Tasks | | | | |
| Meat can → sugar box (sim) | 5/10 | 72.5% | 0/10 | 5% |
| Coffee capsules → plate | 4/5 | 90% | 2/5 | 58% |
| Turkish figs → plate | 3/5 | 64% | 2/5 | 52% |
| Cashews → plate | 0/5 | 16% | 0/5 | 12% |
| Red cubes → plate | 1/5 | 50% | 5/5 | 92% |
| Fish → box | 4/5 | 80% | 0/5 | 10% |
| Crackers → tray (medium)† | 5/5 | 100% | 3/5 | 80% |
| PB crackers → tray (hard)† | 5/5 | 100% | 0/5 | 20% |
| Subtotal | 27/45 | 71.6% | 12/45 | 41.1% |
| Semantic Tasks | | | | |
| Toy → matching plate | 4/5 | 90% | 1/5 | 62% |
| Creeper → plate | 3/5 | 70% | 0/5 | 0% |
| Largest toy → plate | 3/5 | 70% | 0/5 | 20% |
| Red A → color pile | 5/5 | 100% | 3/5 | 80% |
| Banana → box | 2/5 | 40% | 0/5 | 30% |
| N block → indicated cup | 3/5 | 80% | 2/5 | 60% |
| Sort blocks by color | 5/5 | 100% | 0/5 | 32% |
| Banana → matching plate | 1/5 | 20% | 4/5 | 90% |
| Subtotal | 26/40 | 71.3% | 10/40 | 46.8% |
| Multi-step Tasks | | | | |
| Color cubes → bowl (sim) | 9/10 | 94.6% | 0/10 | 24.2% |
| AirPods → cup | 1/5 | 55% | 3/5 | 75% |
| Pack pods → tray† | 4/5 | 80% | 1/5 | 65.7% |
| Pack pods → tray (obs.)† | 1/5 | 67% | 0/5 | 64% |
| Aleve bottle → tray (obs.)† | 4/5 | 80% | 2/5 | 70% |
| Three marbles → cup† | 2/5 | 80% | 0/5 | 6.7% |
| Marbles + cable† | 2/5 | 70% | 0/5 | 60% |
| Subtotal | 23/40 | 75.2% | 6/40 | 52.2% |
| Overall | 98/165 | 74.6% | 55/165 | 52.4% |
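The aggregate rows can be reproduced from the per-task rows. The sketch below is a reading of the numbers rather than a documented procedure: it assumes that successes are pooled over all trials in a category, while TP is an unweighted mean over tasks (so a 10-trial simulation task and a 5-trial real-world task count equally). The helper name `summarize` and the tuple layout are hypothetical.

```python
from statistics import mean

# (successes, trials, task progress in %) per Simple task, copied from the table.
# Order: Cube → bowl, Can → mug, Banana → bin, Marker → tray, Crackers → tray.
simple_tiptop = [(5, 10, 72.5), (9, 10, 97.5), (0, 10, 70.0), (3, 5, 80.0), (5, 5, 100.0)]
simple_pi05 = [(8, 10, 90.0), (2, 10, 50.0), (9, 10, 97.5), (5, 5, 100.0), (3, 5, 60.0)]


def summarize(rows):
    """Pool successes and trials; average TP per task (assumed, unweighted)."""
    successes = sum(s for s, _, _ in rows)
    trials = sum(n for _, n, _ in rows)
    task_progress = mean(tp for _, _, tp in rows)
    return successes, trials, task_progress


print(summarize(simple_tiptop))  # (22, 40, 84.0)
print(summarize(simple_pi05))    # (27, 40, 79.5)
```

Under this assumption the output matches the Simple aggregate row for both systems. A trial-weighted TP mean would instead give 82.5% for TiPToP, which is why the unweighted reading is used here.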
# Execution Time Comparison
The table below shows average execution times on 6 representative scenes where both methods succeeded. TiPToP's planning time is shown separately to illustrate the breakdown between planning and execution.
Key observation: TiPToP is faster than \(\pi_{0.5}\text{-DROID}\) in 5 of the 6 scenes and matches it on the remaining one (Aleve bottle → tray (obs.), 31.2s each). Even though TiPToP spends significant upfront time on perception and planning, \(\pi_{0.5}\text{-DROID}\) often spends considerable time idling or re-grasping objects.
| Scene | \(\pi_{0.5}\)-DROID Time | TiPToP Total Time | TiPToP Planning Time |
|---|---|---|---|
| Simulation (IsaacSim) | | | |
| Cube → bowl | 27.4s | 17.9s | 9.7s |
| Can → mug | 41.0s | 18.6s | 9.2s |
| Real-World (Internal DROID) | | | |
| Crackers → tray (simple) | 32.2s | 14.9s | 7.0s |
| Crackers → tray (medium) | 45.2s | 14.9s | 7.3s |
| Pack pods → tray | 53.4s | 47.0s | 20.5s |
| Aleve bottle → tray (obs.) | 31.2s | 31.2s | 16.4s |
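To make the planning/execution split concrete, the following sketch derives a few quantities from the table. It assumes that "TiPToP Total Time" includes the planning time, so that execution time is the difference between the two; the page does not state this explicitly. The function name `breakdown` and the dictionary keys are hypothetical.

```python
# (scene, pi0.5-DROID time, TiPToP total time, TiPToP planning time), in seconds.
rows = [
    ("Cube → bowl", 27.4, 17.9, 9.7),
    ("Can → mug", 41.0, 18.6, 9.2),
    ("Crackers → tray (simple)", 32.2, 14.9, 7.0),
    ("Crackers → tray (medium)", 45.2, 14.9, 7.3),
    ("Pack pods → tray", 53.4, 47.0, 20.5),
    ("Aleve bottle → tray (obs.)", 31.2, 31.2, 16.4),
]


def breakdown(row):
    """Split TiPToP's total time and compare it with the baseline time."""
    scene, baseline, total, planning = row
    return {
        "scene": scene,
        # Assumption: total = planning + execution.
        "execution_s": round(total - planning, 1),
        "planning_share": round(planning / total, 2),
        # Values above 1.0 mean TiPToP finished sooner than the baseline.
        "speedup": round(baseline / total, 2),
    }


for row in rows:
    print(breakdown(row))
```

For Cube → bowl, for example, this gives 8.2s of execution, a planning share of 0.54, and a speedup of 1.53.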
# Video Results
Below we show the videos for the 8 tasks evaluated on our internal DROID setup, where we have side-by-side video recordings of both systems. For results over all 28 tasks (including simulation and external evaluation), see the summary results table above.
# Crackers → tray
A simple pick-and-place task where the robot must pick up a cracker box and place it on a designated target surface. This serves as a baseline test for basic manipulation capabilities.
Observations
TiPToP: Achieved 100% success rate (5/5) with consistent execution times averaging 14.9s. All trials completed efficiently with reliable grasp planning and placement.
\(\pi_{0.5}\text{-DROID}\): 60% success rate (3/5). Common failure modes included idling with no progress, and timeout failures at 120s. When successful, execution was slower (avg 32.2s) and less consistent. Some trials showed late progress around the timeout period.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | Successful completion | ✗ 0.0 | Robot just idles |
| 2 | ✓ 1.0 | Successful completion | ✗ 0.0 | Moves close around 120s |
| 3 | ✓ 1.0 | Successful completion | ✓ 1.0 | After initial idling |
| 4 | ✓ 1.0 | Successful completion | ✓ 1.0 | After initial idling |
| 5 | ✓ 1.0 | Successful completion | ✓ 1.0 | Carton overturns slightly |
# Crackers → tray (medium)
Place a cracker box onto a tray in the presence of medium clutter (a medicine box and a strawberry). The robot must identify the correct target object and avoid disturbing distractors.
Observations
TiPToP: Perfect success rate (5/5) with average time of 14.9s. Consistently identified and manipulated the correct object despite distractors.
\(\pi_{0.5}\text{-DROID}\): 60% success (3/5) with average time of 45.2s for successful trials. Frequent confusion about task objectives, often manipulating wrong objects (strawberry, medicine box) or placing items incorrectly. One trial showed the tray falling over, and another resulted in the crackers being thrown off the table.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | Successful completion | ✓ 1.0 | Medicine box and strawberry first |
| 2 | ✓ 1.0 | Successful completion | ✓ 1.0 | Successful completion |
| 3 | ✓ 1.0 | Successful completion | ~ 0.5 | Throws crackers off table |
| 4 | ✓ 1.0 | Successful completion | ✓ 1.0 | Tray falls over afterward |
| 5 | ✓ 1.0 | Successful completion | ~ 0.5 | Placed on medicine box, not in tray |
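The per-trial scores for this task can be turned back into the SR and TP values reported in the summary table. This is a minimal sketch that assumes a trial counts as a success only when its progress score is 1.0, and that TP is the mean progress score expressed as a percentage; both rules are inferred from the numbers on this page, and `summarize_trials` is a hypothetical helper.

```python
def summarize_trials(scores, success_threshold=1.0):
    """Return (successes, trials, task progress in %) for one task."""
    # Assumption: a trial is a success only at full progress (score 1.0).
    successes = sum(1 for s in scores if s >= success_threshold)
    task_progress = 100 * sum(scores) / len(scores)
    return successes, len(scores), task_progress


# Progress scores for Crackers → tray (medium), trials 1 to 5.
tiptop_scores = [1.0, 1.0, 1.0, 1.0, 1.0]
pi05_scores = [1.0, 1.0, 0.5, 1.0, 0.5]

print(summarize_trials(tiptop_scores))  # (5, 5, 100.0)
print(summarize_trials(pi05_scores))    # (3, 5, 80.0)
```

These values agree with the 5/5, 100% and 3/5, 80% entries for this task in the summary table.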
# PB crackers → tray (hard)
A more challenging version with heavy clutter: place the cracker box on the tray while navigating around multiple distractor objects including medicine boxes, popcorn containers, and other items that significantly crowd the workspace.
Observations
TiPToP: Maintained 100% success (5/5) even with heavy clutter, averaging 15.2s. Demonstrates robust planning in crowded scenes.
\(\pi_{0.5}\text{-DROID}\): Complete failure (0/5). The VLA struggled significantly with the increased clutter. Common failures included picking the crackers but not placing them in the tray, dropping objects on other items (e.g., popcorn), or showing no progress at all. The presence of many distractors appeared to overwhelm the policy.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | Successful completion | ~ 0.5 | Picks crackers, doesn't place in tray |
| 2 | ✓ 1.0 | Successful completion | ✗ 0.0 | Doesn't even pick crackers |
| 3 | ✓ 1.0 | Successful completion | ~ 0.5 | Drops on popcorn instead |
| 4 | ✓ 1.0 | Successful completion | ✗ 0.0 | Doesn't do anything |
| 5 | ✓ 1.0 | Successful completion | ✗ 0.0 | Doesn't do anything |
# Pack pods → tray
Pack three coffee pods onto a tray in specific slots. This requires precise placement and understanding of spatial arrangements. Each pod must be placed correctly in its designated position on the tray.
Observations
TiPToP: Strong performance with 80% success (4/5), averaging 46.8s for successful trials. One failure due to planning timeout. Most trials completed all 3 pods successfully, demonstrating good sequential task execution.
\(\pi_{0.5}\text{-DROID}\): 20% success (1/5). Typical failures included placing only 1-2 out of 3 pods correctly, or placing pods in wrong locations (e.g., into a coffee mug instead of the tray). The task's requirement for precise, repeated placements proved challenging for the VLA.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | All 3 pods placed | ~ 0.85 | 2/3 coffee pods placed |
| 2 | ✓ 1.0 | All 3 pods placed | ~ 0.4 | Only 1/3 pod placed, approaches other pods but fails to grasp them |
| 3 | ✓ 1.0 | All 3 pods placed | ~ 0.85 | 2/3 succeeded, 3rd incorrectly placed |
| 4 | ✗ 0.0 | Planning failure | ~ 0.18 | Put pod into steel mug |
| 5 | ✓ 1.0 | Successful completion | ✓ 1.0 | Both methods succeed |
# Pack pods → tray (obs.)
An extended version of coffee packing that requires moving a Coke can obstacle out of the way before packing all three coffee pods onto the tray. This tests multi-step reasoning and obstacle management.
Observations
TiPToP: 20% success (1/5), taking 61.4s when successful. Common failure modes included grasp failures on pods, planning timeouts, and incomplete pod placement (2/3 pods placed). The need to reason about obstacle removal before placement proved difficult.
\(\pi_{0.5}\text{-DROID}\): Complete failure (0/5). Struggled with the multi-step nature: often managed 1-2 pods but failed on the third, sometimes placing pods incorrectly (on the can rather than the tray, or in the coffee cup). The requirement to move an obstacle first added significant complexity.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ~ 0.8 | 2/3 pods, third slipped out | ~ 0.45 | 1/3 pod, tried placing on can |
| 2 | ~ 0.8 | 2/3 pods, failed pick on first pod | ~ 0.65 | 2/3 pods, third failed due to can |
| 3 | ✓ 1.0 | All 3 pods packed | ~ 0.55 | 1/3 pod, others in coffee cup |
| 4 | ✗ 0.0 | Failed to find a plan | ~ 0.9 | 2/3 pods, failed final placement |
| 5 | ~ 0.75 | 2/3 pods, Gemini vision issue | ~ 0.65 | 2/3 pods, third balanced on can |
# Aleve bottle → tray (obs.)
Pack a medicine box onto a wooden tray while navigating around obstacles in the workspace. Requires reasoning about which objects need to be moved to create a clear path to the goal.
Observations
TiPToP: Good performance with 80% success (4/5), averaging 31.2s. Successfully moved obstacles when necessary. One failure involved picking a suboptimal grasp to move an obstacle, leading to accidentally picking up the wrong object.
\(\pi_{0.5}\text{-DROID}\): Moderate performance at 40% success (2/5). Failure modes included pushing objects together to create more clutter, flipping the wooden platform, and knocking objects off the table. When successful, completion times were similar to ours (~31s), but execution was less reliable.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | Successful completion | ✓ 1.0 | Shoves box off at 81.6s |
| 2 | ✓ 1.0 | Successful completion | ~ 0.5 | Wooden platform flips up, grasped bottle |
| 3 | ✗ 0.0 | Bad grasp on obstacle | ~ 0.5 | Pushed objects around, grasped bottle |
| 4 | ✓ 1.0 | Successful completion | ✓ 1.0 | Successful completion. Tips tray at 118s, medicine falls |
| 5 | ✓ 1.0 | Moved obstacle, completed task | ~ 0.5 | Medicine falls out, struggles to pick |
# Three marbles → cup
Place three marbles into a cup. This task is particularly challenging due to the small size of marbles, their tendency to roll, and the precision required for cup placement. Requires careful perception and delicate manipulation.
Observations
TiPToP: Challenging task with 40% success (2/5), averaging 43.7s for successful trials. Main failure modes included misperception of the cup leading to missed placements, marbles rolling away, and unmodeled cables causing collisions with the cup. The precision required for small rolling objects proved difficult.
\(\pi_{0.5}\text{-DROID}\): Complete failure (0/5). Common issues included marbles rolling away before manipulation, picking wrong objects (e.g., larger balls), and general inability to execute precise placement. One trial managed to place one marble but failed on subsequent ones.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ✓ 1.0 | Successful completion | ✗ 0.0 | Balls roll away |
| 2 | ~ 0.83 | Final ball rolled away | ✗ 0.0 | Picked bigger ball, caused chaos |
| 3 | ✓ 1.0 | Successful completion | ~ 0.33 | One marble in, fails others |
| 4 | ~ 0.83 | Third marble, cup misperception | ✗ 0.0 | Complete failure |
| 5 | ~ 0.33 | Cable collision with cup | ✗ 0.0 | Complete failure |
# Marbles + cable
A complex multi-object task: place a small bag of marbles into a mesh bag, then a wire/cable onto the plastic surface. Requires coordinating multiple objects with different properties (flexible cable, plastic bag, deformable mesh bag).
Observations
TiPToP: Difficult task with 40% success (2/5), averaging 35.4s when successful. Failure modes included missed grasps on the cable, missed placements into the mesh bag, and one case where Gemini vision model incorrectly identified a beaker as the target cable. The combination of deformable objects and precise sequential manipulation proved challenging.
\(\pi_{0.5}\text{-DROID}\): Complete failure (0/5). While some trials showed partial progress (e.g., placing the wire on plastic around 30-90s), the VLA consistently failed to complete the full sequence. Common failures included picking up the plastic sheet and dumping everything out, throwing marbles off the table, or complete inability to progress.
| Trial | Ours | Ours: notes | π₀.₅ | π₀.₅: notes |
|---|---|---|---|---|
| 1 | ~ 0.75 | Wire placed, balls missed pouch | ~ 0.75 | Cable on plastic, then ruined |
| 2 | ✓ 1.0 | Successful completion | ~ 0.5 | Grasped marbles and cable but placement failures |
| 3 | ~ 0.25 | Missed grasp and placement | ~ 0.5 | Wire placed, then dumps everything |
| 4 | ~ 0.5 | Picked beaker, Gemini vision issue | ~ 0.75 | Marbles off table, cable at 63.77s |
| 5 | ✓ 1.0 | Successful completion | ~ 0.5 | Threw cable off table, placed marbles on plastic bag |
# Evaluation Scene Details
Each scene shows an image of the task, its identifier (as referenced in the Summary table above), the language prompt given to both systems, and the task progress metric used for evaluation. Scenes are grouped by category: Simple, Distractor, Semantic, and Multi-step. † indicates tasks evaluated by the system designers at MIT. Unmarked scenes were evaluated by external evaluators at the University of Pennsylvania who were not involved in the development of TiPToP. (sim) denotes tasks evaluated in simulation.
Task progress metric numbers are reported in %; a + or − sign indicates that the particular denoted amount is added or subtracted from the overall score, and no sign indicates that the number is the absolute score for achieving that particular condition. Progress metrics may vary by evaluator and task. Some metrics penalize manipulating distractors while others do not.
| Scene | Language Prompt | Progress Metric |
|---|---|---|
| Simple | | |
| Cube → bowl (sim) | "put the cube in the bowl" | 25% approach cube, 50% grasp, 75% approach bowl with cube, 100% place |
| Can → mug (sim) | "put the can in the mug" | 25% approach can, 50% grasp, 75% approach mug with can, 100% place |
| Banana → bin (sim) | "put banana in the bin" | 25% approach banana, 50% grasp, 75% approach bin with banana, 100% place |
| Marker → tray | "put the marker in the tray" | +25% touch marker, +25% grasp, +25% touch tray, +25% place |
| Crackers → tray† | "place the crackers onto the tray" | 50% grasp crackers, 100% place |
| Distractor | | |
| Meat can → sugar box (sim) | "put the meat can on the sugar box" | 25% approach meat can, 50% grasp, 75% approach box with meat can, 100% place |
| Coffee capsules → plate | "put all of the coffee capsules onto the white plate" | +50% per capsule placed, −20% per distractor |
| Turkish figs → plate | "put the turkish figs onto the white plate" | +50% per fig placed, −20% per cashew |
| Cashews → plate | "put the roasted cashews onto the white plate" | +50% per cashew placed, −20% per fig |
| Red cubes → plate | "put the red cubes onto the white plate" | +50% per cube placed, −20% if distractor placed |
| Fish → box | "place the fish into the white box" | +50% pick fish, +50% place into white box |
| Crackers → tray (med.)† | "place the crackers onto the tray" | +50% pick crackers, +50% place on the tray (no penalty for distractor) |
| PB crackers → tray (hard)† | "place the peanut butter crackers onto the tray" | +50% pick crackers, +50% place on the tray (no penalty for distractor) |
| Semantic | | |
| Toy → matching plate | "pick up the toy and place on the plate with similar color" | +50% pick toy, +50% place on teal or +30% place on blue |
| Creeper → plate | "pick up the creeper and place onto the purple plate" | +50% pick creeper toy, +50% place onto purple plate |
| Largest toy → plate | "pick up the largest toy and place onto the purple plate" | +50% pick creeper, +50% place onto purple plate, −20% if attempt to place on distractor |
| Red A → color pile | "pick up the red A and place on same color pile" | +50% pick red A block, +50% place onto red pile, −20% knock pile over |
| Banana → box | "pick up the banana and put it in the box" | +50% place banana into any box, +50% place into box with fruit (aims to test common sense of human selection) |
| N block → indicated cup | "put the N block into the cup pointed to by the arrow" | +50% grasp N block, +50% place into cup pointed at |
| Sort blocks by color | "sort the blocks into opposite color plates" | +10% per block touched, +40% per correct place |
| Banana → matching plate | "place banana into plate has similar color" | +50% pick banana, +50% place into orange plate |
| Multi-step | | |
| Color cubes → bowl (sim) | "put 3 cubes into the bowl" | For up to 3 cubes (normalized to 100%): +5% approach cube, +10% grasp, +10% approach bowl with cube, +15% place |
| AirPods → cup | "place airpods into the yellow cup" | +25% per AirPods picked, +25% per place, −20% distractor |
| Pack pods → tray† | "pack the coffee pods onto the rectangular tray" | For each of the 3 pods: +3.33% approach, +15% grasp, +0% place not in tray, +15% place touching tray |
| Pack pods → tray (obs.)† | "pack the coffee pods onto the rectangular tray" | +12.5% pick can, +12.5% place s.t. it doesn't obstruct tray (or +25% for clearing can obstruction without pick/place); for each of 3 pods: +5% for approaching pod, +10% for correct pick, +10% for correct place into tray |
| Aleve bottle → tray (obs.)† | "put the small white aleve bottle into the cardboard tray" | +10% pick an obstacle object, +10% place obstacle s.t. unobstructs aleve, +30% pick aleve bottle (+50% if picked without clearing obstacles), +50% place bottle in tray |
| Three marbles → cup† | "put only the marbles in the cup" | +16.67% for each pick of a marble, +16.67% for each place of a marble into the cup |
| Marbles + cable† | "put the small plastic bag of marbles into the black mesh bag, and the cable on top of the empty large plastic bag" | wire: +5% approach, +20% stable pick, +25% stable place atop plastic; marbles pouch: +5% approach, +20% pick, +25% place into mesh bag |
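As an illustration of how an additive progress metric accumulates, the sketch below scores the Pack pods → tray metric from the furthest stage each pod reached. It is an interpretation rather than the evaluators' scoring code: the stage names and the function `pack_pods_progress` are hypothetical, and the approach increment is taken as exactly 10/3 % (listed as +3.33%) so that three fully placed pods sum to 100%.

```python
APPROACH = 10 / 3  # listed as +3.33%; assumed to be exactly one third of 10%
GRASP = 15.0
PLACE_ON_TRAY = 15.0

# Cumulative points for the furthest stage a single pod reached.
POD_STAGE_POINTS = {
    "none": 0.0,
    "approached": APPROACH,
    "grasped": APPROACH + GRASP,
    "placed_off_tray": APPROACH + GRASP,  # +0% for a place that is not in the tray
    "placed_on_tray": APPROACH + GRASP + PLACE_ON_TRAY,
}


def pack_pods_progress(pod_stages):
    """Task progress in % given the furthest stage reached by each of the 3 pods."""
    if len(pod_stages) != 3:
        raise ValueError("expected one stage per pod (3 pods)")
    return sum(POD_STAGE_POINTS[stage] for stage in pod_stages)


print(round(pack_pods_progress(["placed_on_tray"] * 3), 1))                           # 100.0
print(round(pack_pods_progress(["placed_on_tray", "placed_on_tray", "grasped"]), 1))  # 85.0
print(round(pack_pods_progress(["placed_on_tray", "approached", "approached"]), 1))   # 40.0
```

Under this reading, two pods placed plus a third pod grasped but not placed in the tray gives 85%, and one pod placed plus two pods only approached gives 40%, which is consistent with the 0.85 and 0.4 scores reported for \(\pi_{0.5}\)-DROID on this task.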
© 2026 TiPToP Authors