Computer-using agents (CUAs) typically rely on manually authored skill definitions (e.g., SKILL.md files) that enumerate action patterns for web-based tasks. This approach scales poorly and fails to capture the distributional structure of real user interactions. We propose InteraSkill, a three-phase framework that (1) collects interaction trajectories from CUA sessions on live web platforms, (2) segments and clusters action sequences into a discrete skill vocabulary using change-point detection and hierarchical clustering, and (3) trains sequence models over the induced skill space to predict task-relevant skill compositions. On a corpus of 5,226 interaction segments derived from 847 WebArena sessions, our learned skill vocabulary of 12 canonical skills achieves 0.850 information-weighted accuracy with a Qwen3-8B LoRA predictor, compared to 0.140 for the original SKILL.md baseline. We further show that skill-conditioned policies improve WebArena task completion by 14 percentage points over prompting with hand-written skill descriptions.
Most CUAs rely on hand-written configuration files that map actions to fixed UI coordinates. Three fundamental problems make this approach unscalable.
To our knowledge, InteraSkill is the first framework to unify online skill discovery, interaction-derived learning, reusable skill definitions, and continuous self-improvement.
| Approach | Skill Source | Online | Interaction-Derived | Transferable |
|---|---|---|---|---|
| SKILL.md (manual) | Human engineering | ✗ | ✗ | ✗ |
| AWM (Wang et al., 2024) | Trajectory mining | Partial | ✗ | ✓ |
| SkillWeaver (Zheng et al., 2025) | Self-exploration | ✗ | ✗ | ✓ |
| ICAL (Sarch et al., 2024) | Demos + feedback | ✓ | Partial | ✗ |
| InteraSkill (Ours) | Interaction trajectories | ✓ | ✓ | ✓ |
InteraSkill operates in three phases. Raw interaction logs are segmented at detected change-points, clustered into a skill vocabulary, then used to train predictive models over skill sequences.
Simulated log stream from a WebArena agent run with online skill discovery.
| Model | Information-Weighted Accuracy | WebArena Success Rate |
|---|---|---|
| SKILL.md (baseline) | 0.140 | 0.22 |
| Frequency | 0.349 | 0.28 |
| AWM | 0.334 | 0.26 |
| Transformer | 0.349 | 0.30 |
| Qwen3-8B LoRA | 0.850 | 0.36 |
* Results are projected based on reported trends; exact numbers forthcoming.
Latest multi-model benchmark below, followed by the InteraSkill pipeline diagnostics gallery.
Click any figure to zoom. Use arrow keys or navigation buttons to browse.
Select a WebArena scenario below. Watch the agent interact, then see how those interactions get segmented into reusable skills.
Navigate to search bar, type query, submit search. Appears in 87% of e-commerce trajectories.
Apply category/attribute filters and sort results. Reused across product search, user listing, and order history.
Select an item from results, scroll to action button, and execute purchase/add action.
Use sidebar navigation to reach a specific subforum/category. Transfers to any hierarchical menu structure.
Same skill discovered in e-commerce! Sort by new/relevance and scan results. The embedding matched across domains.
Perform a single-action interaction on a content item (upvote, like, bookmark, flag).
Learned on Reddit forums, successfully transferred to GitLab project navigation. CLIP matched sidebar menu structure across domains.
Originally discovered in e-commerce, reused in forums, now transfers to GitLab issues. The most versatile skill in the library.
Discovered on this unseen domain! Dropdown-select pattern for assigning metadata (assignee, label, milestone). Added to the growing skill library.
Real failure scenarios from agent execution. Each failure becomes a learning signal for skill refinement.
Coordinate-only targeting is unreliable when overlay elements exist. The agent now verifies element class/role before clicking.
Added pre-click verification: check element.className matches expected target before executing click. Embedding delta: 0.014.
Filter DOM elements with class containing "ad", "sponsor", "promo" before coordinate targeting.
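The two fixes above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the element schema (a dict carrying a "class" attribute), the blocklist contents, and the token-level matching rule are all assumptions.

```python
import re

# Assumed blocklist; token-level matching ("ad-banner" -> ["ad", "banner"])
# avoids false positives like "header" or "gradient" that merely contain "ad".
BLOCKLIST = {"ad", "ads", "sponsor", "sponsored", "promo"}

def is_promotional(class_attr: str) -> bool:
    tokens = re.split(r"[\s_-]+", class_attr.lower())
    return any(t in BLOCKLIST for t in tokens)

def verify_click_target(class_attr: str, expected_fragment: str) -> bool:
    # Pre-click check: the element must carry the expected class token
    # and must not look like an overlay ad or sponsored tile.
    return expected_fragment in class_attr.split() and not is_promotional(class_attr)

def filter_candidates(elements):
    # elements: list of dicts with a "class" attribute (illustrative schema).
    return [e for e in elements if not is_promotional(e.get("class", ""))]
```

A stricter variant could also check the element's ARIA role, as the card above suggests; class names alone are a heuristic.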
Write actions can trigger auth walls. The recovery pattern is now encoded as a reusable skill.
Detect login redirect (URL contains /login or /signin). Execute credential flow, then return to original URL and retry last action.
Before write actions, verify auth state by checking for user avatar in navigation. Prevents 401 errors proactively.
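The redirect-detection half of this recovery pattern can be sketched as below; the "/login" and "/signin" markers come from the rule above, and the credential flow itself is omitted.

```python
from urllib.parse import urlparse

LOGIN_MARKERS = ("/login", "/signin")  # from the recovery rule; extend as needed

def is_login_redirect(url: str) -> bool:
    """Detect an auth-wall redirect so the agent can run its credential
    flow, return to the original URL, and retry the last action."""
    path = urlparse(url).path.lower()
    return any(marker in path for marker in LOGIN_MARKERS)
```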
Dynamic pages need DOM stabilization before action. Wait for mutation observer to settle before capturing accessibility tree.
After navigation or scroll, wait until no DOM mutations for 300ms before capturing observation. Reduces stale-element errors by ~40%.
Pagination is invisible to single-page reasoning. The agent must detect and iterate through all pages for any data collection task.
Detect pagination indicators ("page X of Y", next buttons, "showing X of Y"). Loop through all pages, collecting items from each.
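A minimal sketch of the detect-and-iterate rule; the indicator regex and the `fetch_page(n) -> (items, has_next)` interface are assumptions for illustration:

```python
import re

def parse_pagination(text: str):
    """Extract (current, total) from indicators like "page 2 of 7" or
    "showing 20 of 143". Returns None when no indicator is present."""
    m = re.search(r"(?:page|showing)\s+(\d+)\s+of\s+(\d+)", text, re.I)
    return (int(m.group(1)), int(m.group(2))) if m else None

def collect_all_pages(fetch_page, max_pages=50):
    """Loop through every page, aggregating items, until no next page."""
    items, n = [], 1
    while n <= max_pages:
        page_items, has_next = fetch_page(n)
        items.extend(page_items)
        if not has_next:
            break
        n += 1
    return items
```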
Forms may have required fields below the fold. The agent now scans the full form before submitting.
Before submit: scroll through entire form, identify all required fields (asterisk, "required" attr), fill all before clicking submit.
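The full-form scan can be sketched over a flattened field list; the dict schema (`label`, `required`, `value`) is an illustrative stand-in for whatever the agent extracts from the accessibility tree:

```python
def find_unfilled_required(fields):
    """Return labels of required fields that are still empty. A field
    counts as required if its "required" attribute is set or its label
    ends with an asterisk."""
    def is_required(f):
        return bool(f.get("required")) or f.get("label", "").rstrip().endswith("*")
    return [f["label"] for f in fields if is_required(f) and not f.get("value")]
```

The agent would submit only once this list comes back empty.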
Clipboard is unreliable across application switches. The agent now uses explicit file-based data transfer for cross-app workflows.
For multi-app data movement: export source data to structured intermediate format (JSON/CSV), then import into target application.
Never rely on implicit OS state (clipboard, focus) across app boundaries. Always use explicit data channels.
When the agent makes mistakes during simulation, the correction trajectories feed into skill discovery. No manual SKILL.md editing needed.
These are actual information worker workflows from the IW Benchmark dataset. Each one becomes a trajectory for skill discovery.
Each formula solves a specific, real problem. Here is what each one means in plain language.
The agent operates in two layers. Low-level: raw mouse moves and keystrokes in continuous [0,1]⁴. High-level: learned skill abstractions like "fill form."
x, y: screen position as fractions (not pixels). The same policy works on 1080p and 4K. SKILL.md stores "click at (847, 312)." We store "click at (0.44, 0.29)."
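The conversion is a sketch of the idea above, assuming the (847, 312) example refers to a 1920×1080 screen:

```python
def to_normalized(px: int, py: int, width: int, height: int):
    """Pixel coordinates -> resolution-independent fractions in [0, 1]."""
    return (round(px / width, 2), round(py / height, 2))

def to_pixels(nx: float, ny: float, width: int, height: int):
    """Fractions back to pixels for whatever screen is current."""
    return (round(nx * width), round(ny * height))
```

The same stored action then resolves to different pixel targets on 1080p and 4K displays.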
For each trajectory τ, the "correct" skill z is pulled close while wrong skills are pushed away. Skills specialize without any human labeling.
Two UI screens are "the same" if their CLIP embeddings are similar. A "Submit" button on Shopify and GitHub look different but mean the same thing.
β_z(s′) is the termination function: the probability of switching to a new skill in state s′. The agent learns from data when to stop one skill and start another.
Forces each skill dimension to affect behavior independently. Without this, skills collapse into the same average behavior.
Compute Δa_t = ‖a_t − a_{t−1}‖ and detect boundaries where Δa_t > θ. Each resulting segment is a candidate skill.
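The boundary rule above can be sketched directly; actions are represented here as plain coordinate tuples, and θ is a free threshold:

```python
import math

def detect_boundaries(actions, theta):
    """Flag index t as a change-point when ||a_t - a_{t-1}|| > theta."""
    return [t for t in range(1, len(actions))
            if math.dist(actions[t], actions[t - 1]) > theta]

def segment(actions, theta):
    """Split the action sequence at detected change-points; each slice
    is a candidate skill."""
    cuts = [0] + detect_boundaries(actions, theta) + [len(actions)]
    return [actions[i:j] for i, j in zip(cuts, cuts[1:])]
```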
Model each segment as a Gaussian mixture, p(a | segment) = Σ_m π_m N(a; μ_m, Σ_m), and compare segments via the Wasserstein distance.
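For intuition, here is the closed-form 2-Wasserstein distance between two single 1-D Gaussians, the building block such comparisons reduce to; comparing full mixtures additionally requires an optimal transport matching over components, which is omitted here:

```python
import math

def w2_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed form for 1-D Gaussians:
    W2^2 = (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)
```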
Agglomerative clustering on the Wasserstein distance matrix. Cutting the dendrogram at a distance threshold determines the number of skills automatically, without fixing k in advance.
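One concrete instantiation, using SciPy's hierarchical clustering; the average-linkage choice and the threshold value are assumptions, not necessarily the paper's settings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_segments(dist_matrix, threshold):
    """Agglomerative clustering on a precomputed (e.g. Wasserstein)
    distance matrix. Cutting the dendrogram at `threshold` determines
    the number of clusters without fixing k in advance."""
    condensed = squareform(np.asarray(dist_matrix), checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=threshold, criterion="distance")
```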
An InfoNCE loss maps clusters into a continuous latent space z ∈ ℝ¹⁶. A multimodal encoder ingests vision, text, and action features.
A macro policy π_macro(z | s) selects skills; each selected skill executes until its termination function β_z(s) fires. This is the options framework.
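The two-level control loop can be sketched as below; all policies, termination functions, and the environment `step` are injected callables here, standing in for the learned components:

```python
import random

def run_episode(macro_policy, skill_policies, terminations, step, s0,
                max_steps=100, rng=None):
    """Options-style execution: the macro policy picks a skill z, the
    skill's low-level policy acts until its termination function fires,
    then control returns to the macro policy."""
    rng = rng or random.Random(0)
    s, trace = s0, []
    for _ in range(max_steps):
        z = macro_policy(s)                  # high-level skill choice
        while True:
            a = skill_policies[z](s)         # low-level action
            s, done = step(s, a)
            trace.append((z, a))
            if done:
                return trace
            if rng.random() < terminations[z](s):
                break                        # beta_z fired: pick a new skill
    return trace
```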
t-SNE projection of 5,226 interaction segments into the 12-skill vocabulary. Hover over points for details; click to highlight a cluster. Double-click to reset.
Build a skill sequence and see what each model predicts as the next skill.
Replay of a WebArena map-navigation trajectory with skill segmentation overlay.