🤖 QuillAI - Autonomous Agent Evaluation

Generated March 20, 2026 at 13:53 · 4 datasets · 32 tasks · Plan-Execute-Observe-Adapt loop

Total Tasks: 32
Fully Complete: 11
Partial (replanned): 18
Failed: 3
Avg Confidence: 81%
Avg Latency: 76s
Self-Corrections: 24
Figures Generated: 8
Total Findings: 212

📋 Task Results

| Dataset | Task | Status | Confidence | Findings | Figures | Corrections | Latency |
|---|---|---|---|---|---|---|---|
| breast_cancer | exploratory - medium | ✅ success | 87% | 10 | 1 | 0 | 67.6s |
| breast_cancer | predictive - medium | ⚠️ partial | 82% | 6 | 0 | 1 | 122.7s |
| breast_cancer | diagnostic - simple | ⚠️ partial | 85% | 10 | 0 | 1 | 134.0s |
| breast_cancer | diagnostic - medium | ✅ success | 88% | 10 | 0 | 0 | 102.1s |
| breast_cancer | predictive - medium (DT) | ⚠️ partial | 88% | 5 | 0 | 1 | 116.5s |
| breast_cancer | exploratory - simple | ✅ success | 81% | 10 | 1 | 0 | 101.3s |
| breast_cancer | comparative - medium | ✅ success | 75% | 10 | 0 | 0 | 95.8s |
| breast_cancer | prescriptive - advanced | ⚠️ partial | 90% | 2 | 0 | 1 | 87.1s |
| telco_churn | exploratory - medium | ✅ success | 74% | 10 | 0 | 0 | 79.3s |
| telco_churn | exploratory - medium (sr) | ✅ success | 77% | 10 | 1 | 0 | 80.2s |
| telco_churn | predictive - advanced | ⚠️ partial | 80% | 2 | 0 | 1 | 62.2s |
| telco_churn | diagnostic - medium (DT) | ⚠️ partial | 90% | 2 | 0 | 1 | 54.0s |
| telco_churn | diagnostic - medium (TC) | ⚠️ partial | 78% | 10 | 0 | 1 | 73.5s |
| telco_churn | prescriptive - advanced | ⚠️ partial | 82% | 10 | 0 | 1 | 126.9s |
| telco_churn | comparative - medium | ✅ success | 69% | 5 | 0 | 0 | 47.5s |
| telco_churn | predictive - advanced (SA) | ⚠️ partial | 93% | 5 | 0 | 2 | 121.8s |
| tips | exploratory - simple | ⚠️ partial | 85% | 10 | 0 | 1 | 61.6s |
| tips | exploratory - medium | ✅ success | 83% | 10 | 1 | 0 | 43.0s |
| tips | predictive - medium (clf) | ⚠️ partial | 90% | 2 | 0 | 1 | 40.5s |
| tips | predictive - medium (reg) | ❌ failed | 70% | 0 | 0 | 1 | 37.6s |
| tips | diagnostic - medium | ✅ success | 77% | 10 | 0 | 0 | 41.8s |
| tips | diagnostic - simple | ✅ success | 78% | 10 | 1 | 0 | 40.8s |
| tips | prescriptive - advanced | ⚠️ partial | 70% | 2 | 0 | 1 | 41.2s |
| tips | comparative - medium | ❌ failed | 65% | 0 | 0 | 1 | 47.3s |
| titanic | exploratory - medium | ⚠️ partial | 77% | 10 | 1 | 1 | 84.3s |
| titanic | exploratory - medium (corr) | ⚠️ partial | 90% | 1 | 0 | 1 | 45.4s |
| titanic | predictive - medium (lr) | ⚠️ partial | 90% | 8 | 0 | 3 | 113.9s |
| titanic | predictive - medium (DT) | ❌ failed | 60% | 0 | 0 | 1 | 44.6s |
| titanic | diagnostic - medium | ⚠️ partial | 81% | 9 | 1 | 1 | 89.4s |
| titanic | diagnostic - advanced | ⚠️ partial | 82% | 3 | 0 | 1 | 63.7s |
| titanic | prescriptive - advanced | ⚠️ partial | 86% | 10 | 0 | 1 | 92.2s |
| titanic | comparative - medium | ✅ success | 83% | 10 | 1 | 0 | 77.3s |
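
The headline numbers at the top of the report can be recomputed from the per-task rows above. The sketch below is an illustration only, not QuillAI's own reporting code: the `ROWS` layout and the `summarize` helper are hypothetical names, and rounding the two averages to whole numbers is an assumption about how the summary cards were produced.

```python
from collections import Counter

# One tuple per task row, in report order (hypothetical layout):
# (dataset, status, confidence %, findings, figures, corrections, latency s)
ROWS = [
    ("breast_cancer", "success", 87, 10, 1, 0, 67.6),
    ("breast_cancer", "partial", 82, 6, 0, 1, 122.7),
    ("breast_cancer", "partial", 85, 10, 0, 1, 134.0),
    ("breast_cancer", "success", 88, 10, 0, 0, 102.1),
    ("breast_cancer", "partial", 88, 5, 0, 1, 116.5),
    ("breast_cancer", "success", 81, 10, 1, 0, 101.3),
    ("breast_cancer", "success", 75, 10, 0, 0, 95.8),
    ("breast_cancer", "partial", 90, 2, 0, 1, 87.1),
    ("telco_churn", "success", 74, 10, 0, 0, 79.3),
    ("telco_churn", "success", 77, 10, 1, 0, 80.2),
    ("telco_churn", "partial", 80, 2, 0, 1, 62.2),
    ("telco_churn", "partial", 90, 2, 0, 1, 54.0),
    ("telco_churn", "partial", 78, 10, 0, 1, 73.5),
    ("telco_churn", "partial", 82, 10, 0, 1, 126.9),
    ("telco_churn", "success", 69, 5, 0, 0, 47.5),
    ("telco_churn", "partial", 93, 5, 0, 2, 121.8),
    ("tips", "partial", 85, 10, 0, 1, 61.6),
    ("tips", "success", 83, 10, 1, 0, 43.0),
    ("tips", "partial", 90, 2, 0, 1, 40.5),
    ("tips", "failed", 70, 0, 0, 1, 37.6),
    ("tips", "success", 77, 10, 0, 0, 41.8),
    ("tips", "success", 78, 10, 1, 0, 40.8),
    ("tips", "partial", 70, 2, 0, 1, 41.2),
    ("tips", "failed", 65, 0, 0, 1, 47.3),
    ("titanic", "partial", 77, 10, 1, 1, 84.3),
    ("titanic", "partial", 90, 1, 0, 1, 45.4),
    ("titanic", "partial", 90, 8, 0, 3, 113.9),
    ("titanic", "failed", 60, 0, 0, 1, 44.6),
    ("titanic", "partial", 81, 9, 1, 1, 89.4),
    ("titanic", "partial", 82, 3, 0, 1, 63.7),
    ("titanic", "partial", 86, 10, 0, 1, 92.2),
    ("titanic", "success", 83, 10, 1, 0, 77.3),
]


def summarize(rows):
    """Aggregate per-task rows into the report's headline figures."""
    n = len(rows)
    status = Counter(r[1] for r in rows)
    return {
        "total_tasks": n,
        "success": status["success"],
        "partial": status["partial"],
        "failed": status["failed"],
        # Assumption: averages are rounded to the nearest whole number.
        "avg_confidence_pct": round(sum(r[2] for r in rows) / n),
        "avg_latency_s": round(sum(r[6] for r in rows) / n),
        "self_corrections": sum(r[5] for r in rows),
        "figures": sum(r[4] for r in rows),
        "findings": sum(r[3] for r in rows),
    }


summary = summarize(ROWS)
print(summary)
```

Under that rounding assumption, the recomputed values agree with the summary: 11 successes, 18 partials, 3 failures, 81% average confidence, 76s average latency, 24 self-corrections, 8 figures, and 212 findings.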

📊 Generated Figures